Today, there are many services out there providing news feeds, and there are plenty of code libraries that can parse this data. So, with the proliferation of web services and public APIs, why is web scraping still so important to the future of the Web? It is important because of the rise of microformats, Semantic Web technologies, the W3C Linking Open Data Community Project, and the Open Data Movement. Just this year at the TED conference, Tim Berners-Lee spoke of linked data, saying, “We want the data. We want unadulterated data. We have to ask for raw data now.”
The future of the Web is in providing and accessing raw data. How do we access this raw data? Through web scraping.
Yes, there are legal issues to consider when determining whether web scraping is a technique you want to employ, but the techniques this book describes are useful for accessing and parsing raw data of any kind found on the Web, whether it is from a web service API, an XML feed, RDF data embedded in a web page, microformats in HTML, or plain old HTML itself.
There is no way around it. To be a successful web programmer, you must master these techniques. Make them part of your toolbox. Let them inform your software design decisions. I encourage you to bring us into the future of the Web. Scrape the Web within the bounds of the law, publish your raw data for public use, and demand raw data now!
Despite all the advancements in web APIs and interoperability, it's inevitable that, at some point in your career, you will have to "scrape" content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entirely legitimate activity (for example, capturing data from an old version of a website for insertion into a modern CMS). This book, written by scraping expert Matthew Turland, covers web scraping techniques and topics that range from the simple to the exotic, using a variety of technologies and frameworks:
· Understanding HTTP requests
· The PHP HTTP streams wrapper
· cURL
· pecl_http
· PEAR::HTTP
· Zend_Http_Client
· Building your own scraping library
· Using Tidy
· Analyzing code with the DOM, SimpleXML and XMLReader extensions
· CSS selector libraries
· PCRE pattern matching
· Tips and Tricks
· Multiprocessing / parallel processing