The Montoya Herald — ChristianMontoya.com
On Sunday I was working on a project that needed a complete list of games for the Nintendo Wii. The best way I know to get one is as follows:
Let me dive a bit deeper. All the names on this web page are marked with the same HTML pattern:
Each one looks like <i><a href="[some link]" title="[THE NAME]". With a bit of knowledge about regular expressions, and after using htmlentities() to encode the markup, the expression I need is:
preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);
This finds every match of the pattern in the string $r and places the results in the array $matches. Each parenthesized group (...) gets its own list of matches, so $matches is actually an array of arrays. The names of the Wii games I'm looking for are all in $matches[3], the third group. All I have to do is iterate through that array and put each name in my database.
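To make the array-of-arrays shape concrete, here is a minimal sketch that runs the same expression over two made-up list items. The sample markup and game names are assumptions for illustration only, not lines from the real page, and the database step is left out; the loop just prints each name.

```php
<?php
// Hypothetical sample markup standing in for the fetched page in $r.
$r = '<li><i><a href="/wiki/Example_Game" title="Example Game">Example Game</a></i></li>' . "\n"
   . '<li><i><a href="/wiki/Another_Title_(video_game)" title="Another Title (video game)">Another Title</a></i></li>';

// Same expression as in the post.
preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);

// $matches[0] holds the full matches; $matches[1] through $matches[4]
// each hold one array of captures, one per parenthesized group.
$names = $matches[3];

foreach ($names as $name) {
    echo $name, "\n"; // this is where each name would go into the database
}
```

With this sample input, the loop prints "Example Game" and then "Another Title".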
But wait, there's more…
Something I should explain further: the most important part of my regular expression is this: title=(\")(.+?)(\"| \(). I'm grabbing the content of the title attribute rather than the link text, because the link text sometimes contains more than just the game's name. The title has its own wrinkle, though: it sometimes also contains something in parentheses, which is a Wikipedia disambiguation. So there are two things that might come right after the name: a quotation mark, or a space and an open paren. With the pipe, I match either one, and I always get nothing more than the name of each game, without even having to trim().
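A quick way to see what the pipe buys is to try the title match with and without the second alternative. This is only a sketch: the tag below is a made-up example of a disambiguated title, and the two patterns are trimmed-down versions of the expression above.

```php
<?php
// Hypothetical tag whose title carries a Wikipedia-style disambiguation.
$tag = '<i><a href="/wiki/Another_Title_(video_game)" title="Another Title (video game)">';

// Stop only at a quotation mark: the disambiguation stays in the capture.
preg_match('/title="(.+?)"/', $tag, $quoteOnly);

// Stop at a quotation mark OR a space plus open paren, as in the post.
preg_match('/title="(.+?)("| \()/', $tag, $either);

echo $quoteOnly[1], "\n"; // Another Title (video game)
echo $either[1], "\n";    // Another Title
```

Because the name group is lazy, it stops at whichever ending shows up first, which is why no trailing space is left over.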
Scraping is fun, eh?
Copy, Paste into Excel, …, Profit??
2 reasons why not:
@Elliot
"Copy, Paste into Excel, …, Profit??"
Dude, are you retarded? Seriously? Cut & Paste… wow… just wow.