The Montoya Herald — ChristianMontoya.com
On Sunday I was working on a project for which I needed a complete list of games for the Nintendo Wii. The best way I know how to do this is as follows:
Let me dive a bit deeper. All the names on this web page are marked with the same HTML pattern:
It's <i><a href="[some link]" title="[THE NAME]". With a bit of knowledge about regular expressions, and after I have used htmlentities() to encode the markup, the expression I need is:
preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);
This captures every match of the regular expression in the string $r and places the results in the array $matches. Each (...) group gets its own set of matches, so $matches is actually an array of arrays: $matches[0] holds the full matches, and $matches[1] through $matches[4] hold the captures for each group. The names of the Wii games I am looking for are therefore all in $matches[3]. All I have to do is iterate through that array and put each name in my database.
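To make the shape of $matches concrete, here is a minimal sketch of the call run against a couple of made-up list items. The sample markup and game names are hypothetical stand-ins for the real page, and the echo is only a placeholder for whatever database insert you would actually use.

```php
<?php
// Hypothetical sample markup standing in for the fetched page in $r.
$r = '<li><i><a href="/wiki/Example_Game" title="Example Game">Example Game</a></i></li>'
   . '<li><i><a href="/wiki/Another_Game" title="Another Game (video game)">Another Game</a></i></li>';

preg_match_all("/<i><a href=(.+?)title=(\")(.+?)(\"| \()/", $r, $matches);

// $matches[0] holds the full matches; $matches[1]..[4] hold each group's captures.
$names = $matches[3];

foreach ($names as $name) {
    // A database insert would go here; printing keeps the sketch self-contained.
    echo $name, "\n";
}
```

With this sample input, $names ends up holding just the two game names, with the "(video game)" suffix already gone.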
But wait, there's more…
Something I should explain further: the most important part of my regular expression is this: title=(\")(.+?)(\"| \(). I'm grabbing the content of the title attribute rather than the link text because the link text sometimes contains more than just the game's name. The title, however, sometimes also contains something in parentheses, which is a Wikipedia disambiguation. So, two things might come after the name in the title attribute: a closing quotation mark, or a space and an open paren. With the pipe, I match either one, and I always get nothing more than the name of each game, without even having to trim().
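A quick way to see what the pipe buys you is to compare a pattern that stops only at the closing quote with one that stops at either ending. This is just a sketch: the tag below is a hypothetical example, and the two simplified patterns are mine, not the full expression from above.

```php
<?php
// Hypothetical title attribute carrying a Wikipedia-style disambiguation.
$tag = '<i><a href="/wiki/Sample_Game" title="Sample Game (video game)">Sample Game</a></i>';

// Stopping only at the closing quote keeps the parenthetical.
preg_match('/title="(.+?)"/', $tag, $quoteOnly);

// Stopping at either a quote or a space plus open paren cuts it off.
preg_match('/title="(.+?)("| \()/', $tag, $either);

echo $quoteOnly[1], "\n"; // Sample Game (video game)
echo $either[1], "\n";    // Sample Game
```

One caveat with this approach: a game whose real name contains a space followed by an open paren would get cut short in the same way.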
Scraping is fun, eh?
Copy, Paste into Excel, …, Profit??
2 reasons why not:
@Elliot
"Copy, Paste into Excel, …, Profit??"
Dude, are you retarded? Seriously? Cut & Paste… wow… just wow.