C Strings
Welcome to the start of a new topic: strings. We'll cover both plain
old C strings and the much nicer NSStrings of Objective-C (and their
CFString siblings). Today we start at the beginning with C strings.
C strings are also called null (or nul) terminated strings, zero
terminated strings or sometimes z-strings. A C string is simply a
block of memory where bytes represent characters. The last byte in the
block contains a zero value (the nul character) to mark the end of the
string.
Here is the memory layout of the string "iPhone" in ASCII encoding.
The string starts at memory address 48:
Address | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
Value | 105 | 80 | 104 | 111 | 110 | 101 | 0 |
Character | 'i' | 'P' | 'h' | 'o' | 'n' | 'e' | '\0' |
Notice that "iPhone" is six characters long but uses seven bytes of
memory, since it has a zero value after the last character to mark the
end of the string. Functions that work with C strings depend on the
zero terminator to know where the string ends. Forgetting to write a
zero at the end, or overwriting it by accident, is a common programming
error when working with C strings. It causes functions to read or write
past the end of the block, a type of buffer overrun error that can lead
to security breaches and program crashes.
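To make the role of the terminator concrete, here is a minimal sketch of how a length function can work. The helper name count_until_nul() is made up for illustration; the standard library's strlen() does the same job.

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): counts the bytes that
   come before the nul terminator, the same job strlen() performs. */
size_t count_until_nul(char const *s)
{
    size_t n = 0;
    while (s[n] != '\0') {   /* stop at the zero byte */
        n++;
    }
    return n;
}
```

For "iPhone" this returns 6, while sizeof("iPhone") is 7 because the literal's storage includes the terminator. If the zero byte were missing, the loop would keep reading whatever memory follows the string.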
Since C strings are just memory blocks, you declare C string variables
as pointers to type char for mutable C strings, or to type const char
or char const for constant (immutable) C strings. (The const has the
same meaning before or directly after the char.)
// example C string variable declarations
char *mutable_c_string;
const char *immutable_c_string1;
char const *immutable_c_string2;
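Since C string literals are immutable, variables that point to literals
should be declared const. Otherwise many compilers will complain (C++
and Objective-C++ compilers by default, C compilers when given an
option such as -Wwrite-strings):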
char const *string1 = "foobar"; // okay
char *string2 = "barfoo"; // WARNING! should be const
When you need some temporary storage to receive a C string, it's common
to declare a char array.
char buffer[81];
sprintf(buffer, "The answer is %d\n", 42);
Here we use the sprintf() function to write formatted data to a string
that's placed in buffer.
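sprintf() trusts that the buffer is big enough. When the formatted text might not fit, the standard snprintf() function takes the buffer size and never writes past it. The sketch below wraps it in a helper whose name, format_answer(), is an assumption for illustration only.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: formats the message into buf without ever
   writing more than size bytes, terminator included.
   Returns 1 if the whole message fit, 0 if it was truncated or failed. */
int format_answer(char *buf, size_t size, int value)
{
    /* snprintf returns the length the full message would have had */
    int needed = snprintf(buf, size, "The answer is %d\n", value);
    return needed >= 0 && (size_t)needed < size;
}
```

Called as format_answer(buffer, sizeof buffer, 42) with the 81 byte buffer above, the whole message fits. With a buffer that is too small, the output is cut short but still nul terminated, and the helper reports the truncation.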
You may also sometimes see a C string declared like this:
char const name[] = "foo";
This is almost the same as:
char const *name = "foo";
There's a subtle and mostly unimportant difference between these two
declarations. The first one declares an array that holds its own copy
of the characters; the second one declares a pointer to the first
character of a string literal. We'll look at the difference between
these two in the future when we cover arrays.
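One place where the difference shows up is the sizeof operator. This is only a small sketch; the variable and helper names are made up for illustration.

```c
#include <stddef.h>

char const name_array[] = "foo";   /* array: owns four bytes, 'f' 'o' 'o' '\0' */
char const *name_pointer = "foo";  /* pointer: holds the address of a literal */

/* Hypothetical helpers that report what sizeof sees for each declaration. */
size_t array_storage(void)   { return sizeof name_array; }   /* bytes in the array */
size_t pointer_storage(void) { return sizeof name_pointer; } /* bytes in a pointer */
```

array_storage() returns 4, the three characters plus the terminator. pointer_storage() returns the size of a pointer on your platform, no matter how long the string is.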
Character Encodings
C doesn't mandate any specific character encoding for strings. C
strings frequently contain single byte encodings like ASCII or
ISO-8859-1 (Latin-1). In a single byte encoding, each byte in the C
string corresponds to a character, and the encoding defines 256
characters (or fewer -- some byte values may not be valid characters).
C strings can also contain multibyte encodings such as UTF-8 or Shift
JIS where some characters are represented by two or more bytes. It's
up to the application programmer to keep track of character encoding
issues when using C strings.
Most encodings used today are ASCII compatible, meaning that character
values from zero to 127 represent the same characters defined by the
ASCII encoding. If your program only ever processes ASCII text, you're
in luck: you can ignore most encoding issues (at least until some pesky
user decides to enter "San José" or "Björk"). In the real world,
people use many more characters than the measly 128 in the ASCII set,
so it's necessary to pay a little attention to character encodings when
working with C strings. When you have mismatched encodings, you get
data corruption and unhappy users.
For example, here is the word "Γεεκ" ("Geek" in Greek letters) at
memory address 64 using the ISO-8859-7 single byte encoding:
Address | 64 | 65 | 66 | 67 | 68 |
Value | 195 | 229 | 229 | 234 | 0 |
Character | 'Γ' | 'ε' | 'ε' | 'κ' | '\0' |
And here is the word "Γεεκ" at memory address 64 using the multibyte
UTF-8 encoding:
Address | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
Value | 206 | 147 | 206 | 181 | 206 | 181 | 206 | 186 | 0 |
Character | 'Γ' | | 'ε' | | 'ε' | | 'κ' | | '\0' |
Even though they represent the same text, the two strings have very
different representations in memory: ISO-8859-7 stores each Greek
letter in one byte, while UTF-8 needs two bytes per letter. If you fed
one string into a function expecting the other encoding, you would get
an error at best. Data corruption would be the usual result.
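The sketch below spells out both byte sequences from the tables above and shows why byte counts and character counts diverge in a multibyte encoding. The helper utf8_code_points() is hypothetical and assumes its input is well-formed UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* The word from the tables above, spelled out as raw bytes. */
char const geek_iso8859_7[] = "\xC3\xE5\xE5\xEA";
char const geek_utf8[]      = "\xCE\x93\xCE\xB5\xCE\xB5\xCE\xBA";

/* Hypothetical helper: counts UTF-8 code points by skipping
   continuation bytes, which always have the bit pattern 10xxxxxx.
   Assumes s holds well-formed UTF-8. */
size_t utf8_code_points(char const *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

strlen() counts bytes, so it reports 4 for geek_iso8859_7 and 8 for geek_utf8. utf8_code_points(geek_utf8) reports 4, matching the number of letters.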
Converting Between Encodings
The standard C library doesn't provide support for converting between
encodings. On modern Unix and Unix-derived systems, the iconv()
function is commonly used to convert between encodings. Objective-C
programs usually use the facilities provided by NSString or CFString.
Since NSString and CFString objects are represented internally as
Unicode, they can store text from any encoding. To translate a C string
to an NSString:
char const *c_string = "foobar";
NSString *ns_string = [NSString stringWithCString:c_string
encoding:NSASCIIStringEncoding];
And to translate an NSString to a C string:
NSString *ns_string = @"foobar";
// returns NULL if the string can't be represented in the encoding
char const *c_string = [ns_string cStringUsingEncoding:NSASCIIStringEncoding];
When converting to a C string, the -cStringUsingEncoding: method
returns a pointer to a nul terminated buffer that Foundation manages
for you, or NULL if the string can't be converted to the requested
encoding without losing information. (Don't take the bytes of the
NSData returned by -dataUsingEncoding: as a C string: that data holds
only the encoded characters, with no zero terminator at the end.)
If you're using the UTF-8 encoding, you can do this in one step using
the convenience method -UTF8String.
char const *c_string = ns_string.UTF8String;
Note that the returned C string is owned by Foundation, not by you: it
is only guaranteed to stay valid for a limited time, and never longer
than the NSString it came from. (This is great if you just need to pass
a C string along to a C function, but you'll need to copy the returned
C string if you want to keep it around.)
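Copying a C string means allocating room for the characters plus the terminator and then copying the bytes. Here is a minimal sketch in plain C; the helper name copy_c_string() is an assumption for illustration, and on POSIX systems strdup() does much the same thing.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of s that the caller owns
   and must release with free(). Returns NULL if s is NULL or the
   allocation fails. */
char *copy_c_string(char const *s)
{
    if (s == NULL) {
        return NULL;
    }
    size_t size = strlen(s) + 1;      /* +1 for the nul terminator */
    char *copy = malloc(size);
    if (copy != NULL) {
        memcpy(copy, s, size);        /* copies the terminator too */
    }
    return copy;
}
```

You would pass it the pointer you got back from the NSString, keep the copy for as long as you need it, and call free() on it when you're done.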
Core Foundation provides similar C functions. You use
CFStringCreateWithCString() to create a CFString from a C string:
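char const *c_string = "foobar";
CFStringRef cf_string = CFStringCreateWithCString(kCFAllocatorDefault, c_string, kCFStringEncodingASCII);
// ... use cf_string (it is NULL if the conversion failed) ...
if (cf_string != NULL) CFRelease(cf_string); // Create rule: the caller owns the result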
Converting from a CFString to a C string requires that you provide a
buffer to receive the converted C string.
CFStringRef cf_string = CFSTR("foobar");
char buffer[7]; // six characters plus the nul terminator
Boolean result = CFStringGetCString(cf_string, buffer, sizeof(buffer), kCFStringEncodingASCII);
if (result) {
    // ... conversion succeeded, okay to use string in buffer
    printf("%s\n", buffer);
}
There's a lot more to cover. Computers may be all about numbers, but
it seems to me that programming is 90% text processing. Next time,
we'll look at C string literals and NSString literals.