Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.
When the Web began, it was a pretty small place. It didn't take much to keep abreast of new sites, and with subject indexes like the fledgling Yahoo! and NCSA's "What's New" page, you could actually give keeping up with newly added pages the old college try.
Now, even the biggest search engines—yes, even Google—admit they don't index the entire Web. It's simply not possible. At the same time, the Web is more compelling than ever. More information is being put online at a faster clip—be it up-to-the-minute data or large collections of old materials finding an online home. The Web is more browsable, more searchable, and more useful than it ever was when it was still small. That said, we, its users, can only go so fast when searching, processing, and taking in information.
Thankfully, spidering allows us to bring a bit of sanity to the wealth of information available. Spidering is the process of automating the grabbing and sifting of information on the Web, saving us the trouble of having to browse it all manually. Spiders range in complexity from the simplest script that grabs the latest weather information from a web page to armies of complex spiders working in concert, searching, cataloging, and indexing the Web's more than three billion resources for a search engine like Google.
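To make the "grab and sift" idea concrete, here is a minimal sketch of the simplest kind of spider described above: one that pulls a temperature reading out of a weather page. It is written in Python rather than the Perl and LWP this book uses, purely for illustration. The page layout, the `SAMPLE_PAGE` markup, the `example-spider/0.1` user agent string, and the function names are all assumptions made for this example, not taken from any real site.

```python
import re
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def fetch(url, timeout=10):
    """Grab step: download a page and return its body as text."""
    # The User-Agent value is a hypothetical name for this sketch.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-spider/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_temperature(html):
    """Sift step: return the first temperature as (value, unit), or None."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumes the page writes temperatures like "72°F" or "-3.5 °C".
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*°\s*([CF])", text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


# A made-up page standing in for a real weather site.
SAMPLE_PAGE = (
    "<html><body><h1>Today</h1>"
    "<p>Currently <b>72</b>&deg;F and sunny.</p></body></html>"
)

if __name__ == "__main__":
    print(extract_temperature(SAMPLE_PAGE))
```

Against a live page, the two steps would be chained as `extract_temperature(fetch(url))`. Keeping the grab and the sift in separate functions means the parsing logic can be tested on saved pages without touching the network, which is also kinder to the site being spidered.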
This book teaches you the methodologies and algorithms behind spiders and the variety of ways that spiders can be used. Hopefully, it will inspire you to come up with some useful spiders of your own.