Michael Angstadt's Blog: Screen-scraping Wikipedia

Saturday, October 30, 2010

Screen-scraping Wikipedia

To screen-scrape a page on Wikipedia, you must take one extra step before the page will download: include a User-Agent header in your HTTP request. Wikipedia requires this header and returns a 403 Forbidden error if it is missing. I found this out thanks to a user on the #mediawiki IRC channel. They suggest setting the User-Agent to something that uniquely identifies your program or application, and they strongly discourage using a browser's User-Agent string because it signals that you might be doing something malicious.

It is easy to set the User-Agent header in PHP. You can either change the user_agent setting in your PHP installation's php.ini file or add the following line of code to your PHP script. The cURL library also supports setting HTTP headers, but this library is not included in the standard PHP installation.

//tell it what value to use for the User-Agent header
ini_set('user_agent', 'My Cool Screen-Scraper (+http://www.mangst.com)');

//includes the above User-Agent header in this request and all subsequent requests
$page = file_get_contents('http://en.wikipedia.org/wiki/Pumpkin');
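
The ini_set() call above applies to every request the script makes afterwards. If you would rather attach the User-Agent to individual requests, a stream context is one option. What follows is a minimal sketch rather than a definitive implementation: make_scraper_context() is a hypothetical helper name and the 10-second timeout is an assumed value, while stream_context_create() and the 'user_agent' HTTP context option are part of standard PHP.

```php
<?php
// Hypothetical helper (not from the original post): builds a stream context
// whose HTTP requests carry the given User-Agent header.
function make_scraper_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // assumed value, in seconds
        ],
    ]);
}

$context = make_scraper_context('My Cool Screen-Scraper (+http://www.mangst.com)');

// Only requests that are passed $context use this User-Agent.
// Uncomment to actually download the page:
// $page = file_get_contents('http://en.wikipedia.org/wiki/Pumpkin', false, $context);
```

Passing the context as the third argument of file_get_contents() leaves the global user_agent setting untouched, which can be handy when one script talks to several sites.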

Note that this is different from the header() function. The header() function is used to set the headers of the HTTP response that the PHP script itself is generating. This has nothing to do with any HTTP requests that the script makes in the process of generating its response.
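
To make the distinction concrete, here is a small sketch that touches both sides in one script. The Content-Type value is only an illustrative assumption, and $outgoingUserAgent is a variable name chosen for this example.

```php
<?php
// Request side: the User-Agent this script sends when IT requests other pages,
// for example through file_get_contents().
ini_set('user_agent', 'My Cool Screen-Scraper (+http://www.mangst.com)');
$outgoingUserAgent = ini_get('user_agent');

// Response side: a header on the response this script sends back to ITS caller.
// It has no effect on any request the script makes to Wikipedia.
// The Content-Type value is an illustrative assumption.
if (!headers_sent()) {
    header('Content-Type: text/plain; charset=utf-8');
}
```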

6 comments:

birchy said...: Bless you! I searched long and hard to solve this problem. Thank you for sharing this valuable piece of know-how!; December 9, 2010 at 8:05 AM
Michael Angstadt said...: Thanks birchy! I'm glad that you found it helpful. :); December 10, 2010 at 8:32 AM
digital signatures sharepoint said...: I wasted an hour figuring out that Wikipedia requires that HTTP request header to be included or else it will return a 403 Forbidden error. Your blog really saved me from spending a full day. Good work. Keep it up.; March 19, 2011 at 5:09 AM
Michael Angstadt said...: Glad to help.; March 19, 2011 at 10:55 AM
Sem said...: Greetings from Norway

I see that the other person above took an hour to find this solution and was happy that he didn't spend the whole day.

Guess what. I have spent a full Saturday to get this...

If it were not for you and this info, I would probably spend tomorrow as well.; April 28, 2012 at 11:40 AM
Michael Angstadt said...: Glad I could help you, Sem!; April 30, 2012 at 8:51 PM