Scraping Wikipedia for whatever I need

Posted June 25 in Technology.

On Sunday I was working on a project for which I needed a complete list of games for the Nintendo Wii. The best way I know to get one is as follows:

  • Find a web page that has a complete list of games: http://en.wikipedia.org/wiki/List_of_wii_games (Wikipedia has a lot of pages like this).
  • Use Perl-compatible regular expressions (PHP's preg functions) to scrape this list from the page and insert it into my database.
  • Done.
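The three steps above can be sketched roughly as follows. This is only an illustration, not the script I actually ran: the function names, the `games` table, and its `name` column are hypothetical, and the fetch and database wiring are left commented out so the sketch runs without network or database access.

```php
<?php
// Sketch only: extractGameNames(), storeGameNames(), the `games` table and
// its `name` column are hypothetical names chosen for illustration.

// Pull game names out of the page markup using the pattern from this article.
function extractGameNames(string $html): array
{
    preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $html, $matches);
    // Group 3 holds the names; drop duplicates and reindex.
    return array_values(array_unique($matches[3]));
}

// Insert each name through a prepared statement; returns the number of rows inserted.
function storeGameNames(PDO $db, array $names): int
{
    $stmt = $db->prepare('INSERT INTO games (name) VALUES (:name)');
    $count = 0;
    foreach ($names as $name) {
        $stmt->execute([':name' => $name]);
        $count++;
    }
    return $count;
}

// Possible wiring (assumed, commented out on purpose):
// $html = file_get_contents('http://en.wikipedia.org/wiki/List_of_wii_games');
// $db   = new PDO('sqlite:games.db');
// storeGameNames($db, extractGameNames($html));

// Made-up sample markup in the shape described in this article.
$sample = '<i><a href="/wiki/Some_Game" title="Some Game">Some Game</a></i>';
$names  = extractGameNames($sample);
print_r($names);
```

Keeping the extraction separate from the storage means the same extraction function could be pointed at any other list page that uses the same markup.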

Let me dive a bit deeper. All the names on this web page are marked with the same HTML pattern:

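It’s <i><a href="[some link]" title="[THE NAME]". I ran the markup through htmlentities() first, in which case the <, > and " characters in the pattern have to be written as &lt;, &gt; and &quot;. Written against the raw markup, and with a bit of knowledge about regular expressions, the expression I need is: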

preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);

This captures every occurrence of the pattern in the string $r and places the results in the array $matches. Each (...) group gets its own list of matches, so $matches is actually an array of arrays: $matches[0] holds the full matches, and $matches[1] through $matches[4] hold the four groups. The names of the Wii games I am looking for are all in $matches[3]. All I have to do is iterate through that array and put each name in my database.
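To make the shape of $matches concrete, here is a small sketch that runs the same expression against two made-up links. The sample markup is an assumption written to match the pattern described above; it is not copied from Wikipedia.

```php
<?php
// Made-up sample markup in the shape the article describes.
$r = '<i><a href="/wiki/Example_One" title="Example One">Example One</a></i>'
   . '<i><a href="/wiki/Example_Two" title="Example Two">Example Two</a></i>';

$count = preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);

// With the default ordering, $matches is indexed by group first:
//   $matches[0]  full matches
//   $matches[1]  the href text up to "title="
//   $matches[2]  the opening quotation mark
//   $matches[3]  the names
//   $matches[4]  whatever ended the name: a quote, or a space and open paren
foreach ($matches[3] as $i => $name) {
    echo $i, ': ', $name, PHP_EOL;
}
```

Run as written, this prints the two names, one per line, each prefixed with its index.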

But wait, there’s more…

Something I should explain further: the most important part of my regular expression is title=(\")(.+?)(\"| \(). I’m grabbing the content of the title attribute rather than the link text because the link text sometimes contains more than just the game’s name. The title, however, sometimes ends with something in parentheses, which is a Wikipedia disambiguation. So there are two things that might come after the name inside the title attribute: a closing quotation mark, or a space and an open paren. With the pipe, I match either one, and because (.+?) is non-greedy, I always get nothing more than the name of each game, without even having to trim().
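Here is a quick sketch of the two cases side by side. Both titles are hypothetical, written only to show a plain title and one with a disambiguation suffix.

```php
<?php
$pattern = "/<i><a href=(.+?)title=(\")(.+?)(\"| \()/";

// Hypothetical links: one plain title, one with a parenthesized suffix.
$plain  = '<i><a href="/wiki/Example" title="Example">Example</a></i>';
$disamb = '<i><a href="/wiki/Example_(video_game)" title="Example (video game)">Example</a></i>';

preg_match_all($pattern, $plain, $a);
preg_match_all($pattern, $disamb, $b);

// Group 3 is the bare name in both cases; group 4 shows which branch
// of the alternation ended the match.
var_dump($a[3][0], $a[4][0]); // name, then the closing quote
var_dump($b[3][0], $b[4][0]); // name, then the space and open paren
```

In both cases group 3 comes out as just "Example", which is why no trim() is needed afterwards.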

Scraping is fun, eh?



2 Comments

Responses to my article
  1. Elliott Back June 25, 2008

    Copy, Paste into Excel, …, Profit??

  2. Christian Montoya June 25, 2008

    2 reasons why not:

    • I also needed to scrape DS games with the same method.
    • I want to scrape periodically with a cron job to capture new games as they are added.
