NSXMLParser converts HTML/XML entities in the string it gives the delegate callback -(void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string. So if an XML file contains the string "&lt; or &gt;", the converted string "< or >" is reported to the delegate, not the string you would see if you opened the file with TextEdit.
This is correct behavior for XML files, but it can cause problems if you are trying to use an NSXMLParser to monkey with XHTML/HTML.
I was using an NSXMLParser to modify an XHTML webpage from Simple Wikipedia, and it was turning “#include &lt;stdio&gt;” into “#include <stdio>”, which then displayed as “#include ”, because WebKit thought <stdio> was a tag.
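To make the behavior concrete, here is a minimal sketch of a delegate that only collects what the parser reports. The class name TextCollector and the function TextReportedByParser are made up for illustration, and the code assumes ARC and a reasonably recent Foundation SDK.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: accumulates whatever text NSXMLParser reports.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    // The parser may deliver one text node in several pieces, so append.
    [self.text appendString:string];
}
@end

// Parses `xml` and returns the character data exactly as the delegate saw it.
static NSString *TextReportedByParser(NSString *xml)
{
    TextCollector *collector = [[TextCollector alloc] init];
    collector.text = [NSMutableString string];

    NSData *data = [xml dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    parser.delegate = collector;   // the delegate is not retained by the parser
    [parser parse];

    return collector.text;
}
```

Calling TextReportedByParser(@"<p>#include &lt;stdio&gt;</p>") should hand back the text with the entities already decoded, which is exactly the string that WebKit later mistakes for a tag if it is written out unchanged.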
Solution: Better Tools
For scraping/reading a webpage, XPath is the best choice. It is faster and less memory intensive than NSXMLParser, and very concise. My experience with it has been positive.
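As a rough sketch of what an XPath query can look like from Objective-C, the helper below uses NSXMLDocument and -nodesForXPath:error:. This is an assumption about tooling, not necessarily the setup described above: NSXMLDocument is only available on the Mac, so on the iPhone something like libxml2 would be needed instead. The function name StringsMatchingXPath is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Returns the string value of every node matched by `xpath`, or nil if the
// document cannot be parsed or the query fails.
// Note: elements in a default namespace (as in real XHTML pages) are not
// matched by a bare name such as //pre; the sample input here has no namespace.
static NSArray<NSString *> *StringsMatchingXPath(NSString *xhtml, NSString *xpath)
{
    NSError *error = nil;
    NSXMLDocument *document = [[NSXMLDocument alloc] initWithXMLString:xhtml
                                                               options:0
                                                                 error:&error];
    if (document == nil) {
        return nil;
    }

    NSArray<NSXMLNode *> *nodes = [document nodesForXPath:xpath error:&error];
    if (nodes == nil) {
        return nil;
    }

    NSMutableArray<NSString *> *strings = [NSMutableArray array];
    for (NSXMLNode *node in nodes) {
        [strings addObject:node.stringValue ?: @""];
    }
    return strings;
}
```

For example, StringsMatchingXPath(@"<html><body><pre>#include &lt;stdio&gt;</pre></body></html>", @"//pre") pulls out the text of every pre element in one line, with no delegate callbacks to write.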
For modifying a webpage, JavaScript might be a better fit than Objective-C. You can use - (NSString *)stringByEvaluatingJavaScriptFromString:(NSString *)script to execute JavaScript inside a UIWebView in any Cocoa program. Neat stuff!
My Unsatisfying Solution
Do not use this; see why below:
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    // Re-escape the characters that NSXMLParser has already decoded.
    string = [string stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
    string = [string stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];
    /* ... rest of the method */
}
Frankly that code scares me. I worry I’m not escaping something I should be. Experience has taught me I don’t have the experience of the teams who wrote HTML libraries, so it’s dangerous to try and recreate their work.
(UPDATED 2009-05-26: And indeed, I screwed up. I was replacing & with &amp;, and that was causing trouble. While my “fix” of not converting & seems to work on one website, it will not in general.)
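For illustration only, here is a minimal sketch of escaping all three characters in one helper. The name EscapeForXHTMLText is hypothetical, and the sketch assumes the text ends up as element content (inside an attribute value, quotes would need escaping too). The one firm rule it shows is that & has to be handled first; whether ordering was the actual problem above is not established.

```objc
#import <Foundation/Foundation.h>

// Escapes the characters that are significant in XML/XHTML text content.
static NSString *EscapeForXHTMLText(NSString *string)
{
    // '&' must be escaped first. Done last, it would also hit the '&' in the
    // "&lt;" and "&gt;" just inserted, producing "&amp;lt;" and "&amp;gt;".
    NSString *escaped = [string stringByReplacingOccurrencesOfString:@"&" withString:@"&amp;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];
    return escaped;
}
```

This only round-trips text that the parser has fully decoded. It does not make the approach safe in general, which is the reason to prefer a real HTML library.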
I would like to experiment with using JavaScript instead of an NSXMLParser, but at the moment I have a working (and surprisingly compact) NSXMLParser implementation, and much less familiarity with JavaScript than Objective-C. And compiled Obj-C code should be more performant than JavaScript. So I’m sticking with what I have, at least until I’ve gotten Prometheus 1.0 out the door.