Vincent Gable’s Blog

July 3, 2009

When In Doubt, UTF8

Filed under: Accessibility, Programming
― Vincent Gable on July 3, 2009
/* If you are uncertain of the correct encoding, you should use UTF-8, */
/* which is the encoding designated by RFC 2396 as the correct encoding */
/* for use in URLs.… */

CFURL.h

This echoes my experience: when in doubt, choose UTF-8 for the web. UTF-8 is backwards compatible with 7-bit ASCII (e.g. ‘A’ is 0x41 in both ASCII and UTF-8).

But know that UTF-8 is a variable-length encoding: every non-ASCII character is represented by more than one byte. As a general rule with Unicode, I do not expect a char or wchar_t to always map to a character in a string. Encoding details can be messy, e.g. “É” might be represented as one precomposed character, or as two characters: “E” followed by a combining acute accent. It never hurts to brush up on Unicode.
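To make the byte-versus-character gap concrete, here is a minimal sketch in plain C. The function name utf8_code_points is made up for this illustration, and it assumes its input is already valid UTF-8. It counts code points by skipping continuation bytes (those of the form 10xxxxxx).

```c
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
 * Assumption: the input is valid UTF-8, so every code point contributes
 * exactly one byte that is NOT a continuation byte (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* The top two bits are 10 only for continuation bytes. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the UTF-8 bytes of “naïve”, written as "na\xC3\xAF" "ve", strlen reports 6 bytes while utf8_code_points reports 5. The precomposed “É” ("\xC3\x89") is 2 bytes and 1 code point; the decomposed form ("E\xCC\x81") is 3 bytes and 2 code points, even though both render as a single character.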

September 5, 2008

ASCII is Dangerous

Never use NSASCIIStringEncoding

“Foreign” characters, like the ï in “naïve”, will break your code if you use NSASCIIStringEncoding. Such characters are more common than you might expect, even if you do not have an internationalized application. “Smart quotes”, and most well-rendered punctuation marks, are not 7-bit ASCII. For example, that last sentence can’t be encoded into ASCII, because my blog uses smart quotes. (Seriously, [thatSentence cStringUsingEncoding:NSASCIIStringEncoding] will return nil!)
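As a rough illustration of why such text cannot survive an ASCII conversion, here is a small sketch in plain C. It is not how Foundation implements the check; is_7bit_ascii is a hypothetical helper that simply looks for any byte above 0x7F in a UTF-8 string.

```c
#include <stdbool.h>

/* Return true only if every byte of the NUL-terminated string
 * fits in 7 bits, i.e. the text is representable in ASCII.
 * Hypothetical helper for illustration, not a Foundation API. */
bool is_7bit_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s > 0x7F) {
            return false;
        }
    }
    return true;
}
```

A straight apostrophe passes ("can't"), but the same word with a typographic apostrophe, U+2019 encoded as the bytes E2 80 99, does not. A check like this is a cheap way to find out whether a string is at risk before handing it to anything that assumes ASCII.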

Here are some simple alternatives:

C-String Paths
Use - (const char *)fileSystemRepresentation; to get a C-string that you can pass to POSIX functions. The C-string is freed automatically, much as a returned object would be released, so copy it if you need it to outlive the code that asked for it.

An Alternate Encoding
NSUTF8StringEncoding is the closest safe alternative to NSASCIIStringEncoding. ASCII characters have the same representation in UTF-8 as in ASCII. UTF-8 strings will printf correctly, but will look wrong (‘fancy’ characters will be garbage) if you use NSLog(%s).

Native Foundation (NSLog) Encoding
Generally, Foundation uses UTF-16. It is my understanding that this is what NSStrings are by default under the hood. UTF-16 strings will look right if you print them with NSLog(%S), but will not print correctly using printf. In my experience printf truncates UTF-16 strings in an unpredictable way. Do not mix UTF-16 and printf.
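One likely cause of that truncation is that UTF-16 stores each ASCII character as two bytes, one of which is zero, and byte-oriented C functions stop at the first zero byte. The sketch below assumes a little-endian UTF-16 buffer; the names hi_utf16le and c_string_length_seen are made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* "Hi!" encoded as little-endian UTF-16, followed by a 16-bit terminator. */
static const unsigned char hi_utf16le[] = {
    0x48, 0x00,  /* 'H' */
    0x69, 0x00,  /* 'i' */
    0x21, 0x00,  /* '!' */
    0x00, 0x00   /* terminator */
};

/* The length that byte-oriented functions such as strlen (and printf's
 * "%s") would see if handed these bytes as an ordinary C string. */
size_t c_string_length_seen(const unsigned char *bytes)
{
    return strlen((const char *)bytes);
}
```

Here c_string_length_seen(hi_utf16le) is 1: only the 'H' is seen before the first zero byte. With big-endian UTF-16 the zero byte would come first and nothing would be seen at all, and non-ASCII text stops wherever a zero byte happens to fall, which would explain why the results look unpredictable.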

Convenience C-Strings
[someNSString UTF8String] will give you a const char * to a NUL-terminated UTF-8 string. ASCII characters have the same representation in UTF-8 as in ASCII.

Take a minute to search all your projects for NSASCIIStringEncoding, and replace it with a more robust option.

It never hurts to brush up on Unicode.
