{"id":305,"date":"2009-05-01T19:01:33","date_gmt":"2009-05-02T00:01:33","guid":{"rendered":"http:\/\/vgable.com\/blog\/2009\/05\/01\/nsxmlparser-and-htmlxhtml\/"},"modified":"2009-05-26T17:30:06","modified_gmt":"2009-05-26T22:30:06","slug":"nsxmlparser-and-htmlxhtml","status":"publish","type":"post","link":"https:\/\/vgable.com\/blog\/2009\/05\/01\/nsxmlparser-and-htmlxhtml\/","title":{"rendered":"NSXMLParser and HTML\/XHTML"},"content":{"rendered":"<p><code><a href=\"http:\/\/developer.apple.com\/documentation\/Cocoa\/Conceptual\/XMLParsing\/index.html\">NSXMLParser<\/a><\/code> converts HTML\/XML-entities in the <code>string<\/code> it gives the delegate callback <a href=\"http:\/\/developer.apple.com\/DOCUMENTATION\/Cocoa\/Reference\/Foundation\/Classes\/NSXMLParser_Class\/Reference\/Reference.html#\/\/apple_ref\/occ\/instm\/NSObject\/parser:foundCharacters:\"><code>-(void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string<\/code><\/a>. So if an XML file contains the string, <code>\"&amp;lt; or &amp;gt;\"<\/code>, the converted string <code>\"&lt; or &gt;\"<\/code> would be reported to the delegate, <em>not<\/em> the string that you would see if you opened the file with TextEdit.<\/p>\n<p>This is correct behavior for XML files, but it can cause problems if you are trying to use an <code>NSXMLParser<\/code> to monkey with XHTML\/HTML.<\/p>\n<p>I was using an <code>NSXMLParser<\/code> to modify an XHTML webpage from <a href=\"http:\/\/simple.wikipedia.org\/\">Simple Wikipedia<\/a>, and it was turning: &#8220;<code>#include &amp;lt;stdio&amp;gt;<\/code>&#8221; into &#8220;<code>#include &lt;stdio&gt;<\/code>&#8220;, which then displayed as &#8220;<code>#include <\/code>&#8220;, because WebKit thought <code>&lt;stdio&gt;<\/code> was a tag.<\/p>\n<h3>Solution: Better Tools<\/h3>\n<p><strong>For scraping\/reading a webpage, <a href=\"http:\/\/cocoawithlove.com\/2008\/10\/using-libxml2-for-parsing-and-xpath.html\">XPath is the best choice<\/a><\/strong>. 
It is faster and less memory intensive than <code>NSXMLParser<\/code>, and very concise. My experience with it has been positive.<\/p>\n<p><strong>For modifying a webpage, JavaScript might be a better fit<\/strong> than Objective-C. You can use<br \/>\n<a href=\"http:\/\/developer.apple.com\/iphone\/library\/documentation\/UIKit\/Reference\/UIWebView_Class\/Reference\/Reference.html#\/\/apple_ref\/occ\/instm\/UIWebView\/stringByEvaluatingJavaScriptFromString:\"><code> - (NSString *)stringByEvaluatingJavaScriptFromString:(NSString *)script<\/code><\/a> to execute JavaScript inside a <code>UIWebView<\/code> in any Cocoa program. Neat stuff!<\/p>\n<h3>My Unsatisfying Solution<\/h3>\n<p><strong>Do not use this; see why below:<\/strong><\/p>\n<pre>\n- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string;\n{\n\tstring = [string stringByReplacingOccurrencesOfString:@\"<\" withString:@\"&amp;lt;\"];\n\tstring = [string stringByReplacingOccurrencesOfString:@\">\" withString:@\"&amp;gt;\"];\n\n\t\/* ... rest of the method *\/\n}\n<\/pre>\n<p>Frankly that code scares me.  I worry I&#8217;m not escaping something I should be. Experience has taught me I don&#8217;t have the experience of the teams who wrote HTML libraries, so it&#8217;s dangerous to try and recreate their work.<\/p>\n<p>(UPDATED 2009-05-26: And indeed, I screwed up. I was replacing <code>&amp;<\/code> with <code>&amp;amp;<\/code>, and that was causing trouble. While my &#8220;fix&#8221; of not converting <code>&amp;<\/code> seems to work on <em>one website<\/em>, it will not in general.)<\/p>\n<p>I would like to experiment with using JavaScript instead of an <code>NSXMLParser<\/code>, but at the moment I have a working (and surprisingly compact) <code>NSXMLParser<\/code> implementation, and much less familiarity with JavaScript than Objective-C. And compiled Obj-C code should be more performant than JavaScript.  
So I&#8217;m sticking with what I have, at least until I&#8217;ve gotten Prometheus 1.0 out the door.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>NSXMLParser converts HTML\/XML-entities in the string it gives the delegate callback -(void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string. So if an XML file contains the string, &#8220;&amp;lt; or &amp;gt;&#8221;, the converted string &#8220;&lt; or &gt;&#8221; would be reported to the delegate, not the string that you would see if you opened the file with TextEdit. This is correct [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[18,203,5,4],"tags":[328,394,408,422,184],"class_list":["post-305","post","type-post","status-publish","format-standard","hentry","category-bug-bite","category-iphone","category-objective-c","category-programming","tag-html","tag-javascript","tag-nsxmlparser","tag-prometheus-development","tag-xml"],"_links":{"self":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts\/305","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/comments?post=305"}],"version-history":[{"count":0,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts\/305\/revisions"}],"wp:attachment":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/media?parent=305"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/categories?post=305"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/tags?post=305"}],"curies":[{"name":"wp","href":"ht
tps:\/\/api.w.org\/{rel}","templated":true}]}}