Unicode String Literals
Last week we started looking into C string and NSString literals. Today we'll continue this topic by looking at embedding Unicode characters in literals using Unicode escape sequences.
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5. The C99 standard actually refers to these escape sequences as universal character names, since C doesn't require that the compiler use a particular character set or encoding scheme; but iOS and most modern systems use the Unicode character set, so we'll continue to call them "Unicode escapes".
There are two flavors of Unicode escapes. The first begins with a backslash (\) followed by a lower case 'u' and four hexadecimal digits, allowing the encoding of Unicode characters from 0 to 65535. This Unicode range encodes the basic multilingual plane, which includes most characters in common use today. The second Unicode escape type begins with a backslash (\) followed by an upper case 'U' and eight hexadecimal digits, which can encode every possible Unicode character, including historical languages and special character sets such as musical notation.
// Examples of Unicode escapes
char const *gamma1 = "\u0393"; // capital Greek letter gamma (Γ)
NSString *gamma2 = @"\U00000393"; // also Γ
Unlike hexadecimal escape sequences, Unicode escapes are required to have exactly four or eight digits after the 'u' or 'U' respectively. If you have too few digits, the compiler generates an "incomplete universal character name" error.
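For instance, in this sketch (our own example) the first two lines compile, while uncommenting the last one produces the error above:

// Hex escapes take a variable number of digits; Unicode escapes don't
char const *letterA = "\x41";  // hexadecimal escape for 'A'
char const *gamma = "\u0393";  // OK: exactly four hex digits after 'u'
// char const *oops = "\u393"; // error: incomplete universal character name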
If you're familiar with character encoding issues, you're probably wondering how Unicode characters get encoded in plain old C strings. Since the char data type can only hold a character value from zero to 255, what does the compiler do when it encounters a capital gamma (Γ) with a Unicode character value of 915 (or 393 in hex)? The C99 standard leaves this up to the compiler. In the version of GCC that ships with Xcode and the iOS SDK, the answer is UTF-8 encoding.
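You can see those UTF-8 bytes for yourself with a quick sketch like this (assuming stdio.h is available for printf):

// Print the bytes the compiler actually stored for a Unicode escape
char const *gamma = "\u0393"; // capital Greek letter gamma (Γ)
for (char const *p = gamma; *p != '\0'; p++) {
    printf("0x%02X ", (unsigned char)*p);
}
// Prints: 0xCE 0x93 -- the two byte UTF-8 encoding of Γ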
This is one potential gotcha when using Unicode escape sequences. Even though the string literal in our example specifies a single logical character, capital gamma (Γ),
char const *gamma1 = "\u0393";
the compiler has no way to encode that logical character in a single char. We would expect that
NSLog(@"%zu", strlen(gamma1));
would print 1 for the length of the string, but it actually prints 2. (Note that since strlen() returns a size_t, the %zu format specifier is the correct one to use.)
If you read the first post in the strings series, you might remember this table showing the memory layout of the word "Geek" in Greek letters (Γεεκ) in the UTF-8 encoding:
Address   |  64  |  65  |  66  |  67  |  68  |  69  |  70  |  71  |  72  |
Value     | 206  | 147  | 206  | 181  | 206  | 181  | 206  | 186  |  0   |
Character |     'Γ'     |     'ε'     |     'ε'     |     'κ'     | '\0' |
In UTF-8, letters in the Greek alphabet take up two bytes (or chars) each. (And other characters may use three or four bytes.) The standard C strlen() function actually counts chars (or bytes) in the string rather than logical characters, which made perfect sense in the days when computers used ASCII or another single byte character set like Latin-1.
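If you want a count of logical characters rather than bytes, you can walk the UTF-8 encoding yourself. Here's a minimal sketch (utf8_strlen is our own helper, not a standard library function) that counts only the bytes that start a character, skipping UTF-8 continuation bytes, which always match the bit pattern 10xxxxxx:

// Count code points in a UTF-8 string by skipping continuation bytes
size_t utf8_strlen(char const *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++; // this byte starts a new logical character
        }
    }
    return count;
}

NSLog(@"%zu", utf8_strlen("\u0393\u03b5\u03b5\u03ba")); // prints 4 for "Γεεκ"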
NSString literals suffer from a similar problem. Internally, NSString uses 16 bit words to encode each character. This made sense when NSString was created, since early versions of the Unicode standard only encoded up to 65,535 characters, so a 16 bit word value could hold any Unicode character (at the time).
Unfortunately the Unicode consortium discovered there was a strong desire to encode historical scripts and special character sets like music and math notation along with modern languages, and 16 bits wasn't large enough to accommodate all the symbols. The Unicode character set was expanded beyond 16 bits (the code space now runs up to hexadecimal 10FFFF) and the UTF-16 encoding was created. In the UTF-16 encoding, characters in the hexadecimal ranges DC00-DFFF (the low surrogates) and D800-DBFF (the high surrogates) are used in pairs to encode Unicode characters with values greater than 65,535. This is analogous to how UTF-8 uses multiple bytes to encode a single logical character.
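The surrogate pair arithmetic is simple enough to sketch directly; this is the standard UTF-16 algorithm, shown here for the G clef character discussed next:

// Compute the UTF-16 surrogate pair for a code point above 0xFFFF
unsigned int codePoint = 0x1D11E;          // musical G clef (𝄞)
unsigned int offset = codePoint - 0x10000; // leaves a 20 bit offset
unsigned int high = 0xD800 + (offset >> 10);   // top 10 bits
unsigned int low  = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
NSLog(@"0x%04X 0x%04X", high, low); // prints 0xD834 0xDD1E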
So the musical G clef symbol (𝄞), which has Unicode value 1D11E in hex (119,070 in decimal), is encoded as two "characters" in an NSString.
// NSString sometimes has a misleading "length"
NSString *gClef = @"\U0001d11e"; // musical G clef symbol (𝄞)
NSLog(@"%lu", (unsigned long)[gClef length]);
The log statement prints out 2 instead of 1. (Since -length returns an NSUInteger, we cast to unsigned long and use %lu to keep the format specifier correct.)
In memory, the NSString data looks like this:
Address   | 64-65  | 66-67  | 68-69 |
Value     | 0xD834 | 0xDD1E |   0   |
Character |       '𝄞'       | '\0'  |
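You can observe those raw words with characterAtIndex:, which returns individual 16 bit unichar values rather than logical characters:

// characterAtIndex: exposes the surrogate pair directly
NSString *gClef = @"\U0001d11e";
NSLog(@"0x%04X 0x%04X",
      [gClef characterAtIndex:0],  // 0xD834 (high surrogate)
      [gClef characterAtIndex:1]); // 0xDD1E (low surrogate)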
Like the strlen() function for C strings, the -length method actually returns the number of words in the NSString, which is usually but not always the number of logical characters in the NSString object.
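If you need a count of logical characters, one option (a sketch using Foundation's rangeOfComposedCharacterSequenceAtIndex: method) is to step through the string one composed character sequence at a time:

// Count logical characters by stepping over composed character sequences
NSString *gClef = @"\U0001d11e";
NSUInteger count = 0;
NSUInteger i = 0;
while (i < [gClef length]) {
    NSRange r = [gClef rangeOfComposedCharacterSequenceAtIndex:i];
    i = NSMaxRange(r); // skip the whole sequence (both surrogate words)
    count++;
}
NSLog(@"%lu", (unsigned long)count); // prints 1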
Next time, we'll continue our dive into Unicode string madness by looking at wide character strings.