Wide Character Strings

Last week we looked at Unicode escape sequences in C string and NSString literals. Today we'll take a quick look at wide character strings and talk about where they fit into iOS development.

When the C language was developed in the early 1970s, little thought was given to representing non-English languages. By default, most C compilers assumed that both source files and application output used 7-bit ASCII encoding and that each logical character in a string fit into a single 8-bit byte or char value. By the time C was first standardized by ANSI in 1989 (and by ISO in 1990), the need to handle many more characters than ASCII was obvious, but the Unicode standard was still nascent. So the ANSI C committee included a wide character type and wide character string functions in the C89 standard, but didn't tie wide character support to any specific character encoding scheme.

wchar_t

C89 introduced a new integer type, wchar_t. This is similar to a char, but typically "wider". On many systems, including Windows, a wchar_t is 16 bits. This is typical of systems that implemented their Unicode support using earlier versions of the Unicode standard, which originally fit every character into 16 bits (at most 65,536 code points). Unicode was later expanded to support historical and special purpose character sets, so on some systems, including Mac OS X and iOS, the wchar_t type is 32 bits in size. This is often poorly documented, but you can use a simple test like this to find out:

// how big is wchar_t?
NSLog(@"wchar_t is %zu bits wide", 8 * sizeof(wchar_t));

On a Mac or iPhone, this will print "wchar_t is 32 bits wide". Note that in C, wchar_t is not a keyword but a typedef for another integer type, declared in <stddef.h> and <wchar.h>. In C++, wchar_t is a distinct built-in type. In practice, this means you need to #include <wchar.h> (or <stddef.h>) in C before using wide characters.

signed or unsigned?

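On many platforms, including Mac and iPhone, the plain char type is a signed integer with a range from -128 to 127. The C standard leaves the choice to the compiler, though, and some platforms make char unsigned. You can use the CHAR_MIN and CHAR_MAX constants defined in <limits.h> to find out the range for a particular compiler: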

NSLog(@"CHAR_MIN = %0.f", (double)CHAR_MIN);
NSLog(@"CHAR_MAX = %0.f", (double)CHAR_MAX);

The wchar_t type can be signed or unsigned. The WCHAR_MIN and WCHAR_MAX constants hold the range of a wchar_t and are defined in both <wchar.h> and <stdint.h>.

NSLog(@"WCHAR_MIN = %0.f", (double)WCHAR_MIN);
NSLog(@"WCHAR_MAX = %0.f", (double)WCHAR_MAX);

On Windows, wchar_t is an unsigned 16-bit integer. On Mac and iPhone, wchar_t is a signed 32-bit integer, so the code above will print out "WCHAR_MAX = 2147483647" and "WCHAR_MIN = -2147483648". For the most part you don't need to worry about whether wchar_t is signed or unsigned; it only becomes important if you need to do comparisons and operations that mix wchar_t with other integer types (a rarity).

wide character literals

We looked at C string literals in previous entries. Wide character string literals are very similar, but are prefixed with 'L':

// example of a wide character string literal
wchar_t const *s = L"foobarf!";

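Like C string literals, adjacent wide string literals separated only by whitespace are concatenated into one logical string. C99 and later also let you mix a wide literal with an ordinary one, in which case the result is a wide string; C89 left that combination undefined.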

// wide strings written in segments
wchar_t const *s1 = L"foo" "bar";
wchar_t const *s2 = L"Hello, " L"world!";

wide character functions

Most string functions in the standard C library are declared in the <string.h> header. A very similar set of functions for wide character strings is declared in <wchar.h>. The functions follow a similar naming convention. Where string functions are prefixed with str, the wide character equivalents are prefixed with wcs (for wide character string). So the strlen() function calculates the length of a string and the corresponding wcslen() function calculates the length of a wide character string, counted in wchar_t elements rather than bytes.

not used much

In practice, you won't use wide character strings very often in Objective-C since the NSString class does just about everything wide character strings are meant to do, but you may occasionally run across them in other C libraries.

Next time, we'll begin looking at common string operations using C strings and NSStrings, starting with string concatenation.