The Montoya Herald, a weblog about Blueprint, jQuery, design, music and life, publishing on the web since September 2005. Written by Christian Montoya: developer, designer and entrepreneur.

The Montoya Herald — ChristianMontoya.com

Scraping Wikipedia for whatever I need

Posted on June 25, 2008.

On Sunday I was working on a project for which I needed a complete list of games for the Nintendo Wii. The best way I know to get one is to scrape Wikipedia's list of Wii games with PHP and a regular expression.

Let me dive a bit deeper. All the game names on that Wikipedia page are marked up with the same HTML pattern:

[Image: wiki_wii_scrape]

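It's <i><a href="[some link]" title="[THE NAME]". With a bit of knowledge about regular expressions, I can write the expression I need. Note that it matches the raw markup as shown here; if the markup has been run through htmlentities() first, the <, > and " in the pattern have to be written as &lt;, &gt; and &quot;. The expression is: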

preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);

This finds every occurrence of the pattern in the string $r and places the results in the array $matches. Each (...) group gets its own list of matches, so $matches is actually an array of arrays: $matches[0] holds the full matches, and $matches[1] onward hold the groups in the order they appear in the pattern. The names of the Wii games I am looking for are therefore all in $matches[3]. All I have to do is iterate through that array and put each name in my database.
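
To make the array-of-arrays layout concrete, here is a minimal sketch that runs the same expression over two lines of sample markup. The sample rows are hypothetical stand-ins for the real Wikipedia page (which would be fetched separately), and the loop just prints the names instead of writing them to a database.

```php
<?php
// Hypothetical sample markup imitating the pattern described above;
// it is an assumption for illustration, not the real Wikipedia page.
$r = <<<HTML
<tr><td><i><a href="/wiki/Wii_Sports" title="Wii Sports">Wii Sports</a></i></td></tr>
<tr><td><i><a href="/wiki/Super_Mario_Galaxy" title="Super Mario Galaxy">Super Mario Galaxy</a></i></td></tr>
HTML;

// Same expression as in the post, run against the raw (unencoded) markup.
preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);

// $matches[0] = full matches, $matches[1] = the href part,
// $matches[2] = the opening quote, $matches[3] = the names,
// $matches[4] = whatever ended the name (a quote or " (").
foreach ($matches[3] as $name) {
    echo $name, "\n"; // a real script would insert $name into the database here
}
```

With this sample input the loop prints "Wii Sports" and "Super Mario Galaxy", one per line.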

But wait, there's more…

Something I should explain further: the most important part of my regular expression is this: title=(\")(.+?)(\"| \(). I'm grabbing the content of the title attribute rather than the link text because the link text sometimes contains more than just the game's name. The title has its own quirk, though: it sometimes ends with something in parentheses, which is a Wikipedia disambiguation. So there are two things that might come right after the name inside the title attribute: a quotation mark, or a space and an open parenthesis. With the pipe, I stop at whichever comes first, and I always get nothing more than the name of each game, without even having to trim().
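
The effect of the pipe is easiest to see side by side. This is a small sketch using one hypothetical link whose title carries a disambiguation suffix; the variable names $quoteOnly and $withPipe are made up for the comparison.

```php
<?php
// Hypothetical link whose title ends with a Wikipedia disambiguation.
$r = '<i><a href="/wiki/Okami_(video_game)" title="Okami (video game)">Okami</a></i>';

// Stopping only at the closing quote keeps the suffix in the name.
preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\")/", $r, $quoteOnly);

// Stopping at the quote OR at a space plus open parenthesis drops it.
preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $withPipe);

echo $quoteOnly[3][0], "\n"; // Okami (video game)
echo $withPipe[3][0], "\n";  // Okami
```

Because (.+?) is lazy, it stops at the first of the two terminators it meets, which is why no trim() is needed afterwards.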

Scraping is fun, eh?

3 Comments

  1. Elliott Back on June 25, 2008

    Copy, Paste into Excel, …, Profit??

  2. Christian Montoya on June 25, 2008

    2 reasons why not:

    • I also needed to scrape DS games with the same method.
    • I want to scrape periodically with a cron job to capture new games as they are added.
  3. Thund3rstruck on November 11, 2009

    @Elliot

    "Copy, Paste into Excel, …, Profit??"

    Dude, are you retarded? Seriously? Cut & Paste… wow… just wow.
