Saturday, October 30, 2010

Screen-scraping Wikipedia

Screen-scraping a page on Wikipedia takes one extra step: you must include a User-Agent header in your HTTP request. Wikipedia requires this header, and without it the request fails with a 403 Forbidden error. I found this out thanks to a user on the #mediawiki IRC channel. The advice there is to set the User-Agent to something that uniquely identifies your program or application. Using a browser's User-Agent string is strongly discouraged because it signals that you might be doing something malicious.

It is easy to set the User-Agent header in PHP. The cURL extension also supports setting HTTP headers, but it is not enabled in every PHP installation. Without cURL, you can either set the user_agent directive in your PHP installation's php.ini file or add the following ini_set() line to your PHP script.

//set the User-Agent value that PHP sends with HTTP requests made through functions like file_get_contents()
ini_set('user_agent', 'My Cool Screen-Scraper (+http://www.mangst.com)');

//this request, and every later HTTP request the script makes, includes the User-Agent header set above
$page = file_get_contents('http://en.wikipedia.org/wiki/Pumpkin');
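
If changing the setting for the whole script is undesirable, the same header can probably be attached to a single request with a stream context instead. The following is a minimal sketch and not part of the original snippet: the helper names buildScraperContext() and fetchPage() and the 10-second timeout are assumptions made for illustration.

```php
<?php
// Hypothetical helper: builds a stream context that carries a custom
// User-Agent for one request, leaving the global user_agent setting alone.
function buildScraperContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // assumed value; give up after 10 seconds
        ],
    ]);
}

// Hypothetical helper: downloads a page using the context above.
function fetchPage(string $url, string $userAgent): string
{
    $page = file_get_contents($url, false, buildScraperContext($userAgent));
    if ($page === false) {
        // file_get_contents() returns false on failure, including a 403 response
        throw new RuntimeException("Request failed: $url");
    }
    return $page;
}

// Example call (performs a real network request):
// $page = fetchPage('http://en.wikipedia.org/wiki/Pumpkin',
//                   'My Cool Screen-Scraper (+http://www.mangst.com)');
```

Only requests that are passed this context send the custom User-Agent, which can be handy when one script talks to several sites.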

Note that the user_agent setting is different from the header() function. The header() function sets the headers of the HTTP response that the PHP script itself is generating. It has nothing to do with any HTTP requests that the script makes in the process of generating its response.
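
To make the distinction concrete, here is a small sketch that uses both in one script. The Content-Type value is an arbitrary choice for illustration and does not come from the code above.

```php
<?php
// Response side: header() adds a header to the reply this script sends
// back to whoever requested it (the Content-Type here is an assumed example).
header('Content-Type: text/plain; charset=utf-8');

// Request side: the user_agent setting controls the User-Agent header on
// HTTP requests that this script itself makes, e.g. via file_get_contents().
ini_set('user_agent', 'My Cool Screen-Scraper (+http://www.mangst.com)');

// Reading the setting back shows what outgoing requests will send.
$outgoingUserAgent = ini_get('user_agent');
```

The header() call does not affect what Wikipedia receives; only the user_agent setting does.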

7 comments:

birchy said...

Bless you! I searched long and hard to solve this problem. Thank you for sharing this valuable piece of know-how!

Michael Angstadt said...

Thanks birchy! I'm glad that you found it helpful. :)

Extract Data From Website said...

Hi,

Usually a web feed is made available by the same entity that created the content, and the feed typically comes from the same place as the website. Thanks.

digital signatures sharepoint said...

I wasted an hour figuring out that Wikipedia requires that HTTP request header to be included or else it will return a 403 Forbidden error. Your blog really saved me from spending a full day. Good work. Keep it up.

Michael Angstadt said...

Glad to help.

Sem said...

Greetings from Norway

I see that the other person above spent an hour finding this solution and was happy that he didn't spend the whole day.

Guess what. I have spent a full Saturday to get this...

If it were not for you and this info, I would probably have spent tomorrow on it as well.

Michael Angstadt said...

Glad I could help you, Sem!