php|architect's Guide to Web Scraping

php|architect's Guide to Web Scraping, 9780981034515 (0981034519), Marco Tabini, 2010

Today, there are many services out there providing news feeds, and there are plenty of code libraries that can parse this data. So, with the proliferation of web services and public APIs, why is web scraping still so important to the future of the Web? It is important because of the rise of microformats, Semantic Web technologies, the W3C Linking Open Data Community Project, and the Open Data Movement. Just this year at the TED conference, Tim Berners-Lee spoke of linked data saying, “We want the data. We want unadulterated data. We have to ask for raw data now.”

The future of the Web is in providing and accessing raw data. How do we access this raw data? Through web scraping.

Yes, there are legal issues to consider when determining whether web scraping is a technique you want to employ, but the techniques this book describes are useful for accessing and parsing raw data of any kind found on the Web, whether it is from a web service API, an XML feed, RDF data embedded in a web page, microformats in HTML, or plain old HTML itself.

There is no way around it. To be a successful web programmer, you must master these techniques. Make them part of your toolbox. Let them inform your software design decisions. I encourage you to bring us into the future of the Web. Scrape the Web within the bounds of the law, publish your raw data for public use, and demand raw data now!

Despite all the advancements in web APIs and interoperability, it's inevitable that, at some point in your career, you will have to "scrape" content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entirely legitimate activity: for example, capturing data from an old version of a website for insertion into a modern CMS. This book, written by scraping expert Matthew Turland, covers web scraping techniques and topics that range from the simple to the exotic using a variety of technologies and frameworks:

· Understanding HTTP requests

· The PHP HTTP streams wrapper

· cURL

· pecl_http

· PEAR:HTTP

· Zend_Http_Client

· Building your own scraping library

· Using Tidy

· Analyzing code with the DOM, SimpleXML and XMLReader extensions

· CSS selector libraries

· PCRE pattern matching

· Tips and Tricks

· Multiprocessing / parallel processing

Amazing Books

Professional ASP.NET 2.0 Security, Membership, and Role Management (Wrox Professional Guides)

Wrox Press, 2006

ASP.NET security covers a broad range of subjects. Concepts such as Web security features, developing in partial trust, forms authentication, and securing configuration - just to name a few - are all integral components to helping developers ensure reliable security. Addressing the ASP.NET security features from the developer's point of view, this...

It's Not What You Sign, It's How You Sign It: Politeness in American Sign Language

Gallaudet University Press, 2007

People may not always remember the specifics of a conversation, but they do remember their overall impressions of the other person, as well as how well they felt the conversation proceeded. For example, they may recall whether or not they felt the other person was cooperative, and whether or not the other person was friendly, polite,...

College Algebra

McGraw-Hill, 2010

The Barnett/Ziegler/Byleen/Sobecki College Algebra series is designed to give students a solid grounding in pre-calculus topics in a user-friendly manner. The series emphasizes computational skills, ideas, and problem solving rather than theory. Explore/Discuss boxes integrated throughout each text encourage students to think critically about...

Oxide and Nitride Semiconductors: Processing, Properties, and Applications (Advances in Materials Research)

Springer, 2009

This is a unique book devoted to the important class of both oxide and nitride semiconductors. It covers processing, properties and applications of ZnO and GaN. The aim of this book is to provide the fundamental and technological issues for both ZnO and GaN. Materials properties, bulk growth, thin and thick films growth, control of polarity and...

Quality-Driven SystemC Design

Springer, 2009

Faced with steadily increasing complexity and rapidly shortening time-to-market requirements, designing electronic systems is a very challenging task. To manage this situation effectively, the level of abstraction in modeling has been raised during the past years in the computer-aided design community. Meanwhile, for the so-called...

3D Printing for Architects with MakerBot

Packt Publishing, 2013

Learn how to create successful 3D architectural models that you can print out with MakerBot Replicator 2X. It brings an extra dimension to your presentations and distinguishes your practice from the rest.

Overview

Intelligently design a model to be printed on the MakerBot from scratch