Beware: some web servers have robot traps built in. If they detect a robot
trying to download all of their pages, they will stop serving pages to that
robot (some are even designed to cause the robot harm). This is especially
bad if an organization's proxy server triggers the trap, since it then
prevents the entire organization from accessing that site's pages (this
happened at PARC a couple of months ago, and we had to write to the site to
be taken off their list of robots).
Ed
On Fri, 2 Aug 1996, Ong Beng Hui wrote:
> > interesting and useful places on the net. It would be nice to have
> > a small stand-alone program that could, given a base URL and a number
> > specifying how many levels deep it should go, just go out and get the
> > pages via Squid.
>
> I believe what we need here is a Spider that simply
> crawls and explores a specified list of sites, be it
> extracted from the access.log or otherwise.
>
> Maybe Harvest Gatherer can fit the bill.
>
> *8)
> Ong Beng Hui
> ongbh@singnet.com.sg
> ...Yet Another Day in an ISP Business
> ...and they lived happily ever after
>
>
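For the depth-limited fetcher idea quoted above, here is a rough sketch of
what such a program might look like (untested; it assumes Python 3 and a
Squid proxy on localhost:3128, and the example URL is a placeholder, none of
which comes from the original posts). A real version should also run the
robots.txt check shown earlier before touching a new site.

# Depth-limited page fetcher that routes all requests through Squid (sketch).
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

PROXY = {"http": "http://localhost:3128"}   # assumed Squid address

class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url, opener):
    """Fetch one URL through the proxy; return HTML text or an empty string."""
    with opener.open(url, timeout=30) as resp:
        if "html" not in resp.headers.get("Content-Type", ""):
            return ""
        return resp.read().decode("latin-1", errors="replace")

def crawl(base_url, depth):
    """Fetch base_url and follow links up to `depth` levels deep, via Squid."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
    seen, frontier = set(), [base_url]
    for level in range(depth + 1):
        next_frontier = []
        for url in frontier:
            url = urldefrag(url)[0]          # drop #fragments
            if url in seen:
                continue
            seen.add(url)
            try:
                page = fetch(url, opener)
            except Exception as err:
                print("skipped", url, err)
                continue
            print("fetched", url)
            parser = LinkParser()
            parser.feed(page)
            next_frontier.extend(urljoin(url, link) for link in parser.links)
        frontier = next_frontier
    return seen

if __name__ == "__main__":
    crawl("http://example.com/", depth=2)

Because every request goes through the ProxyHandler, the fetched pages end up
in Squid's cache, which is the point of the exercise.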
Received on Sat Aug 03 1996 - 12:26:38 MDT