Wednesday, September 22, 2010

More Screen Scraping!

The website CelebrityBookSigningsAndEvents.com contains a list of book signing appearances that different celebrities are making across the country. I like to go to this website occasionally to see if there is anybody I would be interested in seeing. However, it does not contain any search functionality that allows you to see what celebrities are visiting your area. I thought that it would be fun to screen scrape the webpage and make the data more searchable.

Screen Scraping Overview

Screen scraping means taking the HTML source code of a webpage and extracting data out of it. For example, one thing that I wanted to extract from CelebrityBookSigningsAndEvents.com was the titles of all the books. Looking at the source code, I could see a pattern to where the titles were located in the HTML. They all looked something like this:

<div class="modWrap">
  <p>
    <font size="4">
      <strong>
        <em>
          <u>
            <font color="#0000ff">Mysterious Galaxy</font>
          </u>
        </em>
      </strong>
    </font>
  ...
  </p>
  ...
  <p>
    <font size="4">
      <strong>
        <em>
          <u>
            <font color="#0000ff">Called To Coach</font>
          </u>
        </em>
      </strong>
    </font>
  ...
  </p>
  ... more tag hierarchies like these
</div>

To extract these book titles, I used XPath queries. XPath is a query language that allows you to pull data from specific parts of an XML document, and it works just as well on an HTML page once the page has been parsed into a DOM. To get all the book titles from my example, the XPath query would look like the following:

//div[@class="modWrap"]/p/font/strong/em/u/font

This will return a list of all the font tags that are nested within this particular hierarchy of tags. The two slashes at the beginning mean that it doesn't matter what tags come before the div tag. The [@class="modWrap"] part returns only those div tags that have a class attribute with a value of exactly "modWrap". The PHP code to run this query would look like the following:

//load the HTML source code into a DOM
$html = file_get_contents('http://www.celebritybooksigningsandevents.com/events');
if ($html === false) {
  die('Could not download the page.'); //file_get_contents() returns false on failure
}
$dom = new DOMDocument();
$dom->loadHTML($html);

//run the XPath query
$xpath = new DOMXPath($dom);
$bookTitleNodes = $xpath->query('//div[@class="modWrap"]/p/font/strong/em/u/font');
foreach ($bookTitleNodes as $node){
  echo $node->textContent, "\n"; //the title is the text within the "font" tag; print one per line
}
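
To try the query without hitting the live site, here is a minimal sketch that runs the same XPath against an inline string. The sample markup is a trimmed-down copy of the structure shown earlier, and the function name extractTitles is made up for this example.

```php
<?php
// Sample markup modeled on the structure shown earlier (illustrative only).
$sampleHtml = <<<HTML
<html><body>
<div class="modWrap">
  <p><font size="4"><strong><em><u><font color="#0000ff">Mysterious Galaxy</font></u></em></strong></font></p>
  <p><font size="4"><strong><em><u><font color="#0000ff">Called To Coach</font></u></em></strong></font></p>
</div>
</body></html>
HTML;

// Hypothetical helper: return the text of every node matched by the query.
function extractTitles($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $titles = array();
    foreach ($xpath->query('//div[@class="modWrap"]/p/font/strong/em/u/font') as $node) {
        $titles[] = trim($node->textContent);
    }
    return $titles;
}

$titles = extractTitles($sampleHtml);
print_r($titles);
```

Collecting the titles into an array rather than echoing them right away makes it easier to search or save them later.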

Problems

Aside from slight inconsistencies in the structure of the HTML, which I could account for by tweaking the XPath queries, the biggest problem I ran into was that the source code was littered with two strange byte values: 194 and 160. Both are greater than 127, so they are not part of the ASCII character set, and they gave me problems in places where I needed to access individual characters in a string. Together, these two bytes are most likely the UTF-8 encoding of a non-breaking space (U+00A0), which would explain why they appeared as spaces in my web browser but were not treated as spaces in my PHP code. Simply replacing all of these characters with spaces before creating the DOM fixed this issue.
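
The replacement could be sketched as follows, assuming the two values always show up as the pair 0xC2 0xA0 (194 followed by 160), which is how UTF-8 encodes a non-breaking space. The helper name replaceNonBreakingSpaces and the sample string are made up for this example.

```php
<?php
// Hypothetical helper: swap UTF-8 non-breaking spaces (bytes 0xC2 0xA0)
// for plain spaces before the HTML is handed to DOMDocument.
function replaceNonBreakingSpaces($html) {
    return str_replace("\xC2\xA0", ' ', $html);
}

// Sample input with non-breaking spaces between the words (assumed data).
$raw = "Called\xC2\xA0To\xC2\xA0Coach";
$clean = replaceNonBreakingSpaces($raw);
echo $clean, "\n";
```

Replacing the two bytes as a pair, rather than one at a time, avoids damaging other UTF-8 characters that happen to contain byte 160 or 194.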

The DOMDocument::loadHTML() function was throwing warnings that didn't affect the screen scraping results, but that I didn't want appearing on my webpage. You can silence the error messages that an expression generates by prefixing it with PHP's @ operator: @$dom->loadHTML($html);
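
An alternative to the @ operator, not what I used but worth knowing, is libxml_use_internal_errors(), which tells libxml to collect parser messages instead of printing them as warnings. Here is a minimal sketch; the sample markup is made up, and whether it produces any messages depends on the libxml version.

```php
<?php
// Made-up markup containing a tag the HTML parser may complain about.
$html = '<div class="modWrap"><p><bogus>Called To Coach</bogus></p></div>';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Grab whatever was collected, then restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo count($errors), " parser message(s) collected\n";
```

Unlike @, which hides everything the expression raises, this keeps the messages around so you can log them if the scrape ever starts failing.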

I also ran into a problem caused by my web server running an earlier version of PHP than my local computer, which took me forever to track down. FYI: DateTime::getTimestamp() is only supported in PHP versions 5.3 and above...
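
One way to sidestep the version difference, sketched here under the assumption that the older server runs PHP 5.2 (where the DateTime class exists but getTimestamp() does not), is to fall back to format('U'), which returns the Unix timestamp as a string. The helper name toTimestamp is made up for this example.

```php
<?php
// Hypothetical helper: get a Unix timestamp from a DateTime object
// on both newer and older PHP versions.
function toTimestamp(DateTime $date) {
    if (method_exists($date, 'getTimestamp')) {
        return $date->getTimestamp(); // available in PHP 5.3 and above
    }
    return (int) $date->format('U'); // fallback: 'U' formats as seconds since the epoch
}

$date = new DateTime('2010-09-22 00:00:00', new DateTimeZone('UTC'));
echo toTimestamp($date), "\n";
```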

Implementation

After scraping the page, I save the data to an XML file and use it as a cache. If the file is more than an hour old, my script refreshes the cache by re-scraping the original webpage. This keeps the cache up to date with any changes made to the original webpage. And because of the cache, CelebrityBookSigningsAndEvents.com is not constantly harassed by requests from my website.
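
The caching logic can be sketched roughly like this. It is a simplified illustration rather than my actual code: the function names, the cache file location, and the stand-in scraper that returns a fixed XML string are all made up for this example.

```php
<?php
// Hypothetical helper: return the cached XML if the file is younger than
// $maxAge seconds; otherwise call $scrape to rebuild the file first.
function loadCachedXml($cacheFile, $maxAge, $scrape) {
    clearstatcache(); // make sure filemtime() is not served from PHP's stat cache
    $isFresh = is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge;
    if (!$isFresh) {
        $xml = call_user_func($scrape);
        file_put_contents($cacheFile, $xml, LOCK_EX); // lock while writing
    }
    return file_get_contents($cacheFile);
}

// Stand-in for the real scraper (assumed data, no network access).
function scrapeEvents() {
    return '<events><event><title>Called To Coach</title></event></events>';
}

$cacheFile = sys_get_temp_dir() . '/events-cache-example.xml';
echo loadCachedXml($cacheFile, 3600, 'scrapeEvents'), "\n"; // 3600 seconds = one hour
```

Using the file's modification time as the cache timestamp means there is no separate bookkeeping to keep in sync with the file.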
