Vincent Gable’s Blog

May 1, 2009

NSXMLParser and HTML/XHTML

Filed under: Bug Bite,iPhone,Objective-C,Programming | , , , ,
― Vincent Gable on May 1, 2009

NSXMLParser converts HTML/XML-entities in the string it gives the delegate callback -(void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string. So if an XML file contains the string, "&lt; or &gt;", the converted string "< or >" would be reported to the delegate, not the string that you would see if you opened the file with TextEdit.

This is correct behavior for XML files, but it can cause problems if you are trying to use an NSXMLParser to monkey with XHTML/HTML.

I was using an NSXMLParser to modify an XHTML webpage from Simple Wikipedia, and it was turning: “#include &lt;stdio&gt;” into “#include <stdio>“, which then displayed as “#include “, because WebKit thought <stdio> was a tag.

Solution: Better Tools

For scraping/reading a webpage, XPath is the best choice. It is faster and less memory intensive then NSXMLParser, and very concise. My experience with it has been positive.

For modifying a webpage, JavaScript might be a better fit then Objective-C. You can use
- (NSString *)stringByEvaluatingJavaScriptFromString:(NSString *)script to execute JavaScript inside a UIWebView in any Cocoa program. Neat stuff!

My Unsatisfying Solution

Do not use this, see why below:

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string;
{
	string = [string stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
	string = [string stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];

	/* ... rest of the method */
}

Frankly that code scares me. I worry I’m not escaping something I should be. Experience has taught me I don’t have the experience of the teams who wrote HTML libraries, so it’s dangerous to try and recreate their work.

(UPDATED 2009-05-26: And indeed, I screwed up. I was replacing & with &amp;, and that was causing trouble. While my “fix” of not converting & seems to work on one website, it will not in general.)

I would like to experiment with using JavaScript instead of an NSXMLParser, but at the moment I have a working (and surprisingly compact) NSXMLParser implementation, and much less familiarity with JavaScript then Objective-C. And compiled Obj-C code should be more performant then JavaScript. So I’m sticking with what I have, at least until I’ve gotten Prometheus 1.0 out the door.

September 24, 2008

XML Parsing: You’re Doing it Wrong

Filed under: Cocoa,MacOSX,Objective-C,Programming,Quotes,Sample Code,Tips | ,
― Vincent Gable on September 24, 2008

There are lots of examples of people using text searching and regular expressions to find data in webpages. These examples are doing it wrong.

NSXMLDocument and an XPath query are your friends. They really make finding elements within a webpage, RSS feed or XML documents very easy.

Matt Gallagher

I haven’t used XPath before, but after seeing Matt’s example code, I am convinced he’s right, because I’ve seen the other side of things. (I’ll let you in on a dirty little secret — right now the worst bit of the code-base I’m working on parses XML.)

    NSError *error;
    NSXMLDocument *document =
        [[NSXMLDocument alloc] initWithData:responseData options:NSXMLDocumentTidyHTML error:&error];
    [document autorelease];
    
    // Deliberately ignore the error: with most HTML it will be filled with
    // numerous "tidy" warnings.
    
    NSXMLElement *rootNode = [document rootElement];
    
    NSString *xpathQueryString =
        @"//div[@id='newtothestore']/div[@class='modulecontent']/div[@class='list_content']/ul/li/a";
    NSArray *newItemsNodes = [rootNode nodesForXPath:xpathQueryString error:&error];
    if (error)
    {
        [[NSAlert alertWithError:error] runModal];
        return;
    }

(I added [document autorelease]; to the above code, because you should always immediately balance an alloc/init with autorelease, outside of your own init methods.)

Powered by WordPress