There are lots of examples of people using text searching and regular expressions to find data in webpages. These examples are doing it wrong.
NSXMLDocument and an XPath query are your friends. They really make finding elements within a webpage, RSS feed or XML documents very easy.
I haven’t used XPath before, but after seeing Matt’s example code, I am convinced he’s right, because I’ve seen the other side of things. (I’ll let you in on a dirty little secret — right now the worst bit of the code-base I’m working on parses XML.)
NSError *error = nil;
NSXMLDocument *document =
    [[NSXMLDocument alloc] initWithData:responseData options:NSXMLDocumentTidyHTML error:&error];
[document autorelease];
// Deliberately ignore the error: with most HTML it will be filled with
// numerous "tidy" warnings.
NSXMLElement *rootNode = [document rootElement];
NSString *xpathQueryString =
    @"//div[@id='newtothestore']/div[@class='modulecontent']/div[@class='list_content']/ul/li/a";
// Clear any leftover tidy warnings so they are not mistaken for a query failure.
error = nil;
NSArray *newItemsNodes = [rootNode nodesForXPath:xpathQueryString error:&error];
// Test the return value, not the error pointer: the error is only
// meaningful when the call itself reports failure by returning nil.
if (newItemsNodes == nil)
{
    if (error)
    {
        [[NSAlert alertWithError:error] runModal];
    }
    return;
}
(I added [document autorelease]; to the above code, because you should always immediately balance an alloc/init with autorelease, outside of your own init methods.)
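As a rough sketch of what might come next, here is one way to read the link text and href out of the nodes the query returns. The helper name NewItemDescriptions is made up for illustration and is not from Matt's post; the XPath string and the responseData input are taken from the code above. The autorelease line is guarded so the snippet also builds under ARC, which is an assumption about your toolchain rather than anything in the original.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns one
// "title -> href" string per <a> element matched by the XPath query,
// or nil if the document cannot be parsed or the query fails.
static NSArray *NewItemDescriptions(NSData *responseData)
{
    NSError *error = nil;
    NSXMLDocument *document =
        [[NSXMLDocument alloc] initWithData:responseData
                                    options:NSXMLDocumentTidyHTML
                                      error:&error];
#if !__has_feature(objc_arc)
    // Manual reference counting, as in the post: balance the alloc/init.
    [document autorelease];
#endif
    if (document == nil)
    {
        return nil;
    }

    NSString *xpathQueryString =
        @"//div[@id='newtothestore']/div[@class='modulecontent']/div[@class='list_content']/ul/li/a";
    NSArray *newItemsNodes =
        [[document rootElement] nodesForXPath:xpathQueryString error:&error];
    if (newItemsNodes == nil)
    {
        return nil;
    }

    NSMutableArray *descriptions = [NSMutableArray array];
    for (NSXMLNode *node in newItemsNodes)
    {
        // The query should only match elements, but check before casting.
        if ([node kind] != NSXMLElementKind)
        {
            continue;
        }
        NSXMLElement *anchor = (NSXMLElement *)node;
        NSString *title = [anchor stringValue];
        // attributeForName: returns nil when the attribute is missing,
        // so fall back to an empty string in that case.
        NSString *href = [[anchor attributeForName:@"href"] stringValue];
        [descriptions addObject:
            [NSString stringWithFormat:@"%@ -> %@", title, href ? href : @""]];
    }
    return descriptions;
}
```

The point is that once XPath has narrowed things down to the anchors, getting at the text and attributes is a couple of method calls per node, with no string scanning at all.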
XPath is AWESOME. Likewise, jQuery is AWESOME for delving into webpage DOM structures (it even supports an XPath-like syntax). XPath is very similar to the stuff showing up in CSS3 selectors (which are already kinda in jQuery). The code in these frameworks gets a sort of declarative/functional feel. I really like it.
Btw, your code blocks have the default styling ‘monospace’. WebKit says it’s from my UA stylesheet, so it might not look the same on your machine if you’re using Firefox, etc. I’d suggest changing it to something sane (like Monaco, then Consolas, then whatever XP has, etc.) so it will fall back to the best available monospace typeface on whoever’s machine is viewing it. As it stands, my machine is defaulting to Courier, which looks terrible at that size, for whatever reason. Really hurts my eyes (and I stare at a Linux box all day at work, with a TERRIBLE typeface!)
But yes, manually parsing XML (or even using anything short of XPath) is inadvisable.
Comment by Jason Petersen — September 24, 2008 @ 11:16 pm