<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Vincent Gable's Blog &#187; Unicode</title>
	<atom:link href="http://vgable.com/blog/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://vgable.com/blog</link>
	<description>my weblog.</description>
	<lastBuildDate>Tue, 29 Nov 2011 22:20:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>When In Doubt, UTF8</title>
		<link>http://vgable.com/blog/2009/07/03/when-in-doubt-utf8/</link>
		<comments>http://vgable.com/blog/2009/07/03/when-in-doubt-utf8/#comments</comments>
		<pubDate>Fri, 03 Jul 2009 17:16:59 +0000</pubDate>
		<dc:creator>Vincent Gable</dc:creator>
				<category><![CDATA[Accessibility]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[ASCII]]></category>
		<category><![CDATA[i18n]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[UTF8]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://vgable.com/blog/?p=339</guid>
		<description><![CDATA[/* If you are uncertain of the correct encoding, you should use UTF-8, */ /* which is the encoding designated by RFC 2396 as the correct encoding */ /* for use in URLs.… */ &#8211; CFURL.h This echoes my experience: when in doubt, choose UTF8 for the web. UTF8 is backwards compatible with 7-bit ASCII [...]]]></description>
			<content:encoded><![CDATA[<blockquote>
<pre>
/* If you are uncertain of the correct encoding, you should use UTF-8, */
/* which is the encoding designated by <a href="http://www.faqs.org/rfcs/rfc2396.html">RFC 2396</a> as the correct encoding */
/* for use in URLs.… */
</pre>
</blockquote>
<p>&#8211; <a href="http://www.opensource.apple.com/source/CF/CF-476.15/CFURL.h"><code>CFURL.h</code></a></p>
<p>This echoes my experience: <strong>when in doubt, choose <a href="http://en.wikipedia.org/wiki/UTF-8">UTF8</a> for the web</strong>. UTF8 is backwards compatible with 7-bit ASCII (e.g. &#8216;A&#8217; is 0x41 in both ASCII and UTF8).</p>
<p>But know that UTF8 is a variable-length encoding: non-ASCII <strong>characters are represented by more than one byte</strong>. As a general rule with Unicode, I <strong>do <em>not</em> expect a <code>char</code> or <code>wchar_t</code> to always map to a character in a string</strong>. Encoding details can be messy, e.g. &#8220;É&#8221; might be represented as one precomposed character (U+00C9), or as two code points: &#8220;E&#8221; followed by a combining acute accent (U+0301). It never hurts to <a href="http://www.codinghorror.com/blog/archives/001084.html">brush up on Unicode</a>.</p>
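<p>Here is a minimal sketch of both points, assuming a current clang toolchain with Foundation (the variable names are only illustrative). It compares the two spellings of &#8220;É&#8221; by UTF-16 length and by UTF-8 byte count:</p>
<pre>
#import &lt;Foundation/Foundation.h&gt;

int main(void)
{
	@autoreleasepool {
		// Two spellings of the same visible character.
		NSString *precomposed = @"\u00C9";  // É as the single code point U+00C9
		NSString *decomposed = @"E\u0301";  // E followed by U+0301 COMBINING ACUTE ACCENT

		// -length counts UTF-16 code units, not visible characters: 1 vs. 2 here.
		NSLog(@"UTF-16 units: %lu vs. %lu",
			(unsigned long)[precomposed length],
			(unsigned long)[decomposed length]);

		// UTF-8 needs more than one byte outside ASCII: 2 vs. 3 bytes here.
		NSLog(@"UTF-8 bytes: %lu vs. %lu",
			(unsigned long)[precomposed lengthOfBytesUsingEncoding:NSUTF8StringEncoding],
			(unsigned long)[decomposed lengthOfBytesUsingEncoding:NSUTF8StringEncoding]);

		// -isEqualToString: compares code units literally, so normalize first.
		NSString *normalized = [decomposed precomposedStringWithCanonicalMapping];
		NSLog(@"equal after normalizing: %d", [precomposed isEqualToString:normalized]);
	}
	return 0;
}
</pre>
<p>Neither count is a &#8220;number of characters&#8221; in the sense a reader would mean, which is why I avoid indexing strings by <code>char</code>.</p>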
]]></content:encoded>
			<wfw:commentRss>http://vgable.com/blog/2009/07/03/when-in-doubt-utf8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Percent Escapes Gotcha</title>
		<link>http://vgable.com/blog/2009/04/10/percent-escapes-gotcha/</link>
		<comments>http://vgable.com/blog/2009/04/10/percent-escapes-gotcha/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 15:25:00 +0000</pubDate>
		<dc:creator>Vincent Gable</dc:creator>
				<category><![CDATA[Bug Bite]]></category>
		<category><![CDATA[Cocoa]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Sample Code]]></category>
		<category><![CDATA[NSString]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://vgable.com/blog/2009/04/10/percent-escapes-gotcha/</guid>
		<description><![CDATA[If you use stringByAddingPercentEscapesUsingEncoding: more than once on a string, the resulting string will not decode correctly from just one call to stringByReplacingPercentEscapesUsingEncoding:. (stringByAddingPercentEscapesUsingEncoding: is not idempotent). NSString *string = @"100%"; NSString *escapedOnce = [string stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding]; NSString *escapedTwice = [escapedOnce stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding]; NSLog(@"%@ escaped once: %@, escaped twice: %@", string, escapedOnce, escapedTwice); 100% escaped once: 100%25, [...]]]></description>
			<content:encoded><![CDATA[<p>If you use <code><a href="http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/stringByAddingPercentEscapesUsingEncoding:">stringByAddingPercentEscapesUsingEncoding:</a></code> more than once on a string, the resulting string will <em>not</em> decode correctly from just one call to <code>stringByReplacingPercentEscapesUsingEncoding:</code>. (<code>stringByAddingPercentEscapesUsingEncoding:</code> is not <a href="http://en.wikipedia.org/wiki/Indempotent">idempotent</a>.)</p>
<pre>
NSString *string = @"100%";
NSString *escapedOnce = [string stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
NSString *escapedTwice = [escapedOnce stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%@ escaped once: %@, escaped twice: %@", string, escapedOnce, escapedTwice);
</pre>
<blockquote><p>100% escaped once: <strong>100%25</strong>, escaped twice: <strong>100%2525</strong></p></blockquote>
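<p>The decoding side can be sketched the same way. This assumes Foundation, and note that both percent-escape methods are deprecated in current SDKs, so expect compiler warnings:</p>
<pre>
#import &lt;Foundation/Foundation.h&gt;

int main(void)
{
	@autoreleasepool {
		NSString *escapedTwice = @"100%2525";
		// Each call removes only one layer of escaping: %25 becomes %.
		NSString *decodedOnce = [escapedTwice stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
		NSString *decodedTwice = [decodedOnce stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
		NSLog(@"decoded once: %@, decoded twice: %@", decodedOnce, decodedTwice);
	}
	return 0;
}
</pre>
<p>The first decode should leave <code>100%25</code>, and only the second gets back to <code>100%</code>, so the decoder has to know how many times the string was escaped.</p>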
<p>I thought I was programming defensively by eagerly adding percent-escapes to any string that would become part of a URL. But this caused some annoying bugs resulting from a string being percent-escaped more than once. My solution was to create an idempotent replacement for <code>stringByAddingPercentEscapesUsingEncoding:</code> (I also simplified things a little by removing the encoding parameter, because I <em>never</em> used any encoding other than <code>NSUTF8StringEncoding</code>):</p>
<pre>
@implementation NSString (IndempotentPercentEscapes)
- (NSString*) stringByReplacingPercentEscapesOnce;
{
	NSString *unescaped = [self stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
	//self may be a string that looks like an invalidly escaped string,
	//eg @"100%", in that case it clearly wasn't escaped,
	//so we return it as our unescaped string.
	return unescaped ? unescaped : self;
}

- (NSString*) stringByAddingPercentEscapesOnce;
{
	return [[self stringByReplacingPercentEscapesOnce] stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
}
@end
</pre>
<p>Usage example:</p>
<pre>NSString *string = @"100%";
NSString *escapedOnce = [string stringByAddingPercentEscapesOnce];
NSString *escapedTwice = [escapedOnce stringByAddingPercentEscapesOnce];
NSLog(@"%@ escaped once: %@, escaped twice: %@", string, escapedOnce, escapedTwice);</pre>
<blockquote><p>100% escaped once: 100%25, escaped twice: 100%25</p></blockquote>
<p>The paranoid have probably noticed that <code>[aBadlyEncodedString stringByReplacingPercentEscapesOnce]</code> will return <code>aBadlyEncodedString</code>, not <code>nil</code>. <strong>This could make it harder to detect an error.</strong></p>
<p>But it&#8217;s not something that I&#8217;m worried about for my application. Since I only ever use a UTF8 encoding, and it can represent <em>any</em> Unicode character, it&#8217;s not possible to have an invalid string. But it&#8217;s certainly something to be aware of in situations where you might have strings with different encodings.</p>
]]></content:encoded>
			<wfw:commentRss>http://vgable.com/blog/2009/04/10/percent-escapes-gotcha/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ASCII is Dangerous</title>
		<link>http://vgable.com/blog/2008/09/05/ascii-is-dangerous/</link>
		<comments>http://vgable.com/blog/2008/09/05/ascii-is-dangerous/#comments</comments>
		<pubDate>Sat, 06 Sep 2008 02:02:06 +0000</pubDate>
		<dc:creator>Vincent Gable</dc:creator>
				<category><![CDATA[Accessibility]]></category>
		<category><![CDATA[Bug Bite]]></category>
		<category><![CDATA[MacOSX]]></category>
		<category><![CDATA[Objective-C]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[ASCII]]></category>
		<category><![CDATA[File Systems]]></category>
		<category><![CDATA[NSASCIIStringEncoding]]></category>
		<category><![CDATA[NSString]]></category>
		<category><![CDATA[Paths]]></category>
		<category><![CDATA[Strings]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[UTF8]]></category>

		<guid isPermaLink="false">http://vgable.com/blog/2008/09/05/ascii-is-dangerous/</guid>
		<description><![CDATA[Never use NSASCIIStringEncoding &#8220;Foreign&#8221; characters, like the &#239; in &#8220;na&#239;ve&#8221;, will break your code, if you use NSASCIIStringEncoding. Such characters are more common than you might expect, even if you do not have an internationalized application. &#8220;Smart quotes&#8221;, and most well-rendered punctuation marks, are not 7-bit ASCII. For example, that last sentence can&#8217;t be encoded [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Never use <code>NSASCIIStringEncoding</code></strong><br />
<br />&#8220;Foreign&#8221; characters, like the &iuml; in &#8220;na&iuml;ve&#8221;, <em>will</em> break your code if you use <code>NSASCIIStringEncoding</code>.  Such characters are more common than you might expect, even if you do not have an internationalized application.  &#8220;Smart quotes&#8221;, and most well-rendered punctuation marks, are not 7-bit ASCII.  For example, that last sentence can&#8217;t be encoded into ASCII, because my blog uses smart-quotes. (Seriously, <code>[thatSentence cStringUsingEncoding:NSASCIIStringEncoding]</code> will return <code>NULL</code>!)</p>
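<p>You can check this for yourself. This is a sketch assuming Foundation, with <code>word</code> as an illustrative variable:</p>
<pre>
#import &lt;Foundation/Foundation.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
	@autoreleasepool {
		NSString *word = @"na\u00EFve";  // "naive" with the i written as U+00EF

		// Ask before converting: ASCII cannot represent U+00EF.
		if (![word canBeConvertedToEncoding:NSASCIIStringEncoding]) {
			printf("not representable in ASCII\n");
		}

		// A lossless ASCII conversion is impossible, so this returns NULL.
		const char *ascii = [word cStringUsingEncoding:NSASCIIStringEncoding];
		printf("ASCII C-string is %s\n", ascii ? "non-NULL" : "NULL");

		// UTF-8 can represent the whole string.
		printf("UTF-8: %s\n", [word UTF8String]);
	}
	return 0;
}
</pre>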
<p>Here are some simple alternatives:</p>
<p><strong>C-String Paths</strong><br />
Use <code>- (const char *)fileSystemRepresentation;</code> to get a C-string that you can pass to POSIX functions.  The C-string will be freed when the <code>NSString</code> it came from is freed.</p>
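<p>For example, here is a sketch that hands a path to <code>stat()</code>. It assumes Foundation and a POSIX system, and <code>NSTemporaryDirectory()</code> is used only because it returns a path that should exist:</p>
<pre>
#import &lt;Foundation/Foundation.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
	@autoreleasepool {
		NSString *path = NSTemporaryDirectory();
		// Use the pointer promptly, or copy it if it must outlive the string.
		const char *cPath = [path fileSystemRepresentation];

		struct stat info;
		if (stat(cPath, &amp;info) == 0) {
			printf("%s is %s\n", cPath, S_ISDIR(info.st_mode) ? "a directory" : "not a directory");
		} else {
			perror("stat");
		}
	}
	return 0;
}
</pre>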
<p><strong>An Alternate Encoding</strong><br />
<code>NSUTF8StringEncoding</code> is the closest safe alternative to <code>NSASCIIStringEncoding</code>.  ASCII characters have the same representation in UTF-8 as in ASCII.  UTF-8 strings will <code>printf</code> correctly, but will look wrong ('fancy' characters will be garbage) if you use <code>NSLog(%s)</code>.</p>
<p><strong>Native Foundation (<code>NSLog</code>) Encoding</strong><br />
Generally, Foundation uses UTF-16.  It is my understanding that this is what NSStrings are by default under the hood.  UTF-16 strings will look right if you print them with <code>NSLog(%s)</code>, but will not print correctly using <code>printf</code>.  In my experience <code>printf</code> truncates UTF-16 strings in an unpredictable way. <strong>Do not mix UTF-16 and <code>printf</code></strong>.</p>
<p><strong>Convenience C-Strings</strong><br />
<code>[someNSString UTF8String]</code> will give you a <code>const char *</code> to a <code>NULL</code>-terminated UTF8-string.  ASCII characters have the same representation in UTF-8 as in ASCII.</p>
<p><strong>Take a minute to search all your projects for <code>NSASCIIStringEncoding</code>, and replace it with a more robust option.</strong></p>
<p>It never hurts to <a href="http://www.codinghorror.com/blog/archives/001084.html">brush up on Unicode</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://vgable.com/blog/2008/09/05/ascii-is-dangerous/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

