Screen Scraping Overview
Screen scraping means taking the HTML source code of a webpage and extracting data out of it. For example, one thing that I wanted to extract from CelebrityBookSigningsAndEvents.com was the titles of all the books. Looking at the source code, I could see a pattern to where the titles were located in the HTML. They all looked something like this:
<div class="modWrap">
  <p>
    <font size="4"> <strong> <em> <u> <font color="#0000ff">Mysterious Galaxy</font> </u> </em> </strong> </font>
    ...
  </p>
  ...
  <p>
    <font size="4"> <strong> <em> <u> <font color="#0000ff">Called To Coach</font> </u> </em> </strong> </font>
    ...
  </p>
  ... more tag hierarchies like these
</div>
To extract these book titles, I used XPath queries. XPath is a query language that lets you pull data from specific parts of an XML document. To get all the book titles from my example, the XPath query would look like the following:
//div[@class="modWrap"]/p/font/strong/em/u/font
This will return a list of all the font tags that are nested within this particular hierarchy of tags. The two slashes at the beginning mean that it doesn't matter what tags come before the div tag. The [@class="modWrap"] part returns only those div tags that have a class attribute with a value of "modWrap". The PHP code to run this query would look like the following:
//load the HTML source code into a DOM
$html = file_get_contents('http://www.celebritybooksigningsandevents.com/events');
$dom = new DOMDocument();
$dom->loadHTML($html);

//run the XPath query
$xpath = new DOMXPath($dom);
$bookTitleNodes = $xpath->query('//div[@class="modWrap"]/p/font/strong/em/u/font');
foreach ($bookTitleNodes as $node){
    echo $node->textContent; //the title is the text within the "font" tag
}
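To see how the leading double slash and the class filter behave without hitting the live site, here is a minimal sketch that runs the same query against an inline copy of the sample markup. The outer wrapper div and the "sidebar" div are made up for illustration; they are not from the real page.

```php
<?php
// Inline sample modeled on the markup shown earlier. The "outer" wrapper
// and the "sidebar" div are hypothetical decoys added for this sketch.
$html = <<<HTML
<html><body>
<div id="outer">
  <div class="modWrap">
    <p><font size="4"><strong><em><u><font color="#0000ff">Mysterious Galaxy</font></u></em></strong></font></p>
    <p><font size="4"><strong><em><u><font color="#0000ff">Called To Coach</font></u></em></strong></font></p>
  </div>
  <div class="sidebar">
    <p><font size="4"><strong><em><u><font color="#0000ff">Not A Title</font></u></em></strong></font></p>
  </div>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "//" finds the div at any depth; the predicate keeps only class="modWrap".
$titles = array();
foreach ($xpath->query('//div[@class="modWrap"]/p/font/strong/em/u/font') as $node) {
    $titles[] = trim($node->textContent);
}

print_r($titles);
```

The query should pick up the two titles inside the "modWrap" div even though it is nested inside another div, and skip the identically structured text in the "sidebar" div.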
Problems
Aside from slight inconsistencies in the structure of the HTML, which I could account for by tweaking the XPath queries, the biggest problem I ran into was that the source code was littered with two strange byte values: 194 and 160. Both are greater than 127, so they fall outside the ASCII character set (together, the bytes 0xC2 0xA0 are the UTF-8 encoding of a non-breaking space), which gave me problems in places where I needed to access individual characters in a string. They appeared as spaces in my web browser, but were not treated as spaces in my PHP code. Simply replacing all of these characters with spaces before creating the DOM fixed this issue.
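The replacement step could look something like the sketch below. It assumes the page is UTF-8 encoded, so bytes 194 and 160 always appear as a pair and can be swapped for a single space in one go. The helper name replaceNbsp is made up for this example.

```php
<?php
// Hypothetical helper (not from the original scraper): swaps the UTF-8
// non-breaking space (bytes 0xC2 0xA0, i.e. 194 and 160) for a plain space.
// Assumes $html is UTF-8 encoded.
function replaceNbsp($html)
{
    // Replace the two bytes as a unit; replacing byte 160 on its own
    // could damage other multi-byte UTF-8 characters.
    return str_replace("\xC2\xA0", ' ', $html);
}

$raw = "Called\xC2\xA0To\xC2\xA0Coach";
$clean = replaceNbsp($raw);

echo $clean, "\n";
// The cleaned string is shorter because each two-byte pair became one byte.
var_dump(strlen($raw), strlen($clean));
```

In the scraper, this would run on the string returned by file_get_contents() before it is handed to loadHTML().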
The DOMDocument::loadHTML() method was throwing warnings that didn't affect the screen scraping results, but that I didn't want appearing on my webpage. You can silence the error messages that a function generates using PHP's @ operator: @$dom->loadHTML($html);
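The @ operator hides everything the call might complain about. An alternative, which is not what I used at the time, is to have libxml collect its parse errors internally so they can be logged or ignored on purpose. A rough sketch, with deliberately sloppy sample markup made up for the example:

```php
<?php
// Made-up markup with an unknown tag and an unclosed paragraph.
$html = '<div class="modWrap"><p>Unclosed paragraph<bogus>tag</bogus></div>';

// Collect libxml's parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Look at whatever the parser reported (possibly nothing), then clear the buffer.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    error_log(trim($error->message) . ' (line ' . $error->line . ')');
}
libxml_clear_errors();

// Restore the previous setting so other code is unaffected.
libxml_use_internal_errors($previous);

echo $dom->getElementsByTagName('p')->length, " paragraph(s) parsed\n";
```

Which messages show up in $errors depends on the libxml version installed, so the sketch only logs them rather than relying on any particular one.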
I also ran into a problem with my web server running an earlier version of PHP than my local computer. It took me forever to track down. FYI: DateTime::getTimestamp() is only supported in PHP versions 5.3 and above...
Implementation
After scraping the page, I save the data to an XML file and use it as a cache. If the file is more than an hour old, my site refreshes the cache by re-scraping the original webpage. This keeps the cache up to date with any changes made to the original webpage, and it means CelebrityBookSigningsAndEvents.com is not constantly harassed by requests from my website.
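The caching logic can be sketched roughly as follows. This is not my exact code: the function names, the cache file location, and the XML layout are all assumptions made for the example, and the scraper is replaced by a stand-in so the sketch runs offline.

```php
<?php
// Hypothetical helper: returns the cached XML if the cache file is fresh,
// otherwise calls $scrape to rebuild the file first.
function loadWithCache($cacheFile, $maxAgeSeconds, $scrape)
{
    $isFresh = file_exists($cacheFile)
        && (time() - filemtime($cacheFile)) < $maxAgeSeconds;

    if (!$isFresh) {
        // $scrape is expected to return the XML as a string.
        file_put_contents($cacheFile, call_user_func($scrape));
    }

    return file_get_contents($cacheFile);
}

// Stand-in for the real scraper; it builds a small XML document from
// hard-coded titles instead of fetching the live page.
function scrapeTitlesAsXml()
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->appendChild($doc->createElement('books'));
    foreach (array('Mysterious Galaxy', 'Called To Coach') as $title) {
        $root->appendChild($doc->createElement('title', $title));
    }
    return $doc->saveXML();
}

// Assumed cache location; 3600 seconds is the one-hour limit.
$cacheFile = sys_get_temp_dir() . '/book_titles_cache.xml';
$xml = loadWithCache($cacheFile, 3600, 'scrapeTitlesAsXml');
echo $xml;
```

Comparing time() against filemtime() keeps the age check on plain integer timestamps, so it does not depend on DateTime::getTimestamp() being available on the server.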