String Literals
Last time we started our new topic, strings, by looking at memory
organization and character encodings of C strings. Today we'll look at
C string and NSString literals.
Most programs do string processing, or at least print out a status
message or two. It's convenient to define some strings directly in the
program's code. Since strings are really just lists of numbers, you
could certainly define your strings "by the numbers" using the raw
character codes:
// what does this print out?
char message[7] = { 105, 80, 104, 111, 110, 101, 0 };
// pass the string as an argument, never as the format string itself
printf("%s", message);
Geek points if you recognized that message is a null-terminated
string. Super ultra mega geek points if you can read what it says
(hint: it's ASCII).
So obviously writing the raw character codes isn't that convenient for
the programmer. Since the compiler has to translate your code into
machine instructions anyway, it's a no-brainer to make it translate
strings into the correctly encoded bytes. A string literal is a
representation of a string in your program that the compiler translates
into the corresponding character codes and stores in the program's data
section. There are two kinds of string literals in Objective-C: plain
old C string literals and NSString literals. They look like this:
// C string literal
char const *s1 = "Hello, world!";
// NSString literal
NSString *s2 = @"Hello, world!";
The double quote characters (") mark the beginning and the end of the
string literal. NSString literals are prefixed with @ to distinguish
them from C string literals. It's important not to mix the two up;
they're not directly compatible.
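To make the "translated into character codes" idea concrete, here is a small plain C sketch that compares a string literal against the same bytes written out by hand. The helper name `literal_matches_codes` is made up for this illustration, and the code assumes an ASCII-compatible character set.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: true if the literal "Hi!" holds exactly the bytes
// we would write by hand (assuming an ASCII-compatible character set).
bool literal_matches_codes(void)
{
    // 'H' = 72, 'i' = 105, '!' = 33, then the null terminator
    const char by_hand[] = { 72, 105, 33, 0 };
    const char literal[] = "Hi!";

    // sizeof counts the terminator the compiler appends to the literal
    return sizeof literal == sizeof by_hand
        && memcmp(literal, by_hand, sizeof by_hand) == 0;
}
```

Note that the literal takes up four bytes, not three: the compiler adds the terminating zero for you.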
Line Breaks
String literals are not allowed to span multiple lines. Actually,
that's not exactly true, so I'll illustrate what I mean; this
string is not a legal string literal:
// not a legal string literal
char const *s1 = "Hello, world!
How are you?";
Line breaks are not allowed inside the double quotes in a string
literal. To include a line break, you use the new line (\n) escape
sequence. We'll talk more about escape sequences below, but using
the new line escape sequence, the string literal becomes:
// string literal containing a new line
char const *s1 = "Hello, world!\nHow are you?";
Notice that the new line escape sequence takes the place of an actual
line break in the code. When the compiler sees "\n" in a string
literal, it replaces it with ASCII character code 10, the line feed (or
new line) character.
Sometimes you don't want a line break in the string itself; you just
want to split a long string literal across several source lines to make
your code more readable. One way is to put a backslash (\) before the
line break, which tells the compiler to ignore the line break; this is
often used to format long preprocessor macros. These two string literals
are identical:
char const *error1 = "Unable to complete request: please wait a few minutes and try again.";
char const *error2 = "Unable to complete request: \
please wait a few minutes and try again.";
This works for NSString literals also, but note that any leading space
in the continuation line will be interpreted as part of the string.
Also note that the backslash (\) must be directly before the line break
in the code; if you have any space or tab characters between the
backslash and the line break, the compiler will complain.
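The leading-space pitfall is easy to miss, so here is a rough plain C sketch of it. The variable and function names are invented for the example; the only point is that indentation on the continuation line ends up inside the string.

```c
#include <string.h>

static const char *single =
    "Unable to complete request: please wait a few minutes and try again.";

// Backslash continuation with an indented second line: the indentation
// before "please" becomes part of the string.
static const char *continued =
    "Unable to complete request: \
    please wait a few minutes and try again.";

// Hypothetical helper: how many extra characters the continuation picked up.
size_t continuation_extra_chars(void)
{
    return strlen(continued) - strlen(single);
}
```

With the indentation shown here, `continuation_extra_chars()` should report the extra spaces; the two strings only match if the continuation line starts in column one.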
A better way to do this is by simply breaking the string literal into
two or more string literals that are separated only by whitespace. The
following two string literals are identical:
char const *error1 = "Unable to complete request: please wait a few minutes and try again.";
char const *error2 = "Unable to complete request: "
"please wait a few minutes and try again.";
This also works for NSString literals; only the first part of an
NSString literal is prefixed with @:
NSString *error1 = @"Unable to complete request: please wait a few minutes and try again.";
NSString *error2 = @"Unable to complete request: "
"please wait a few minutes and try again.";
Only spaces, tabs and line breaks are allowed between sections of a
string literal. If the string is supposed to have a line break at the
end of each section, you need to add new line escapes:
char const *error_page =
"<html>\n"
" <head><title>404 Not Found</title></head>\n"
" <body>\n"
" <h1>404 Not Found</h1>\n"
" </body>\n"
"</html>\n";
Escape Sequences
There are other escape sequences like the new line (\n) escape
sequence. The most commonly used ones are:
escape sequence | name | ASCII value |
\n | new line or line feed | 10 |
\r | carriage return | 13 |
\t | tab | 9 |
\" | double quote | 34 |
\\ | backslash | 92 |
Each of these escape sequences requires two characters in the string
literal, but becomes only one character in the string when the program
is compiled.
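As a quick check of the table, this plain C sketch compares a literal made of the five escapes against their character codes. The helper name `escapes_match_table` is hypothetical, and the codes assume an ASCII-compatible character set.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: checks the escape table against what the compiler
// actually stores (assuming an ASCII-compatible character set).
bool escapes_match_table(void)
{
    const char escaped[] = "\n\r\t\"\\";             // ten source characters
    const char codes[]   = { 10, 13, 9, 34, 92, 0 }; // five bytes + terminator

    return strlen(escaped) == 5
        && memcmp(escaped, codes, sizeof codes) == 0;
}
```

Ten characters between the quotes in the source become a five-character string in the compiled program.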
Octal Escape Sequences
If you wish to specify an arbitrary byte value in a string literal, you
can use an octal escape sequence. Octal escape sequences
begin with a backslash (\) like normal escapes, but the backslash is
followed by an octal (base 8) number instead of a letter or punctuation
mark.
// octal escape sequence examples
char const *bell = "\7"; // ASCII code 7 (bell)
char const *bs = "\10"; // ASCII code 8 (backspace)
char const *del = "\177"; // ASCII code 127 (delete)
The octal numbers in escape sequences are limited to three digits; you
can pad short octal numbers with leading zeros:
char const *bell = "\007";
which is handy for lining up a long sequence of octal escapes. Also note
that the value of an octal escape must be between 0 and 255 (377
octal); a three-digit escape above that, such as "\400", is out of
range and the compiler will flag it. A different surprise comes from
the digits 8 and 9, which are not octal digits and so end the escape
early:
// max octal value is 377 (255 decimal)
char const *two55 = "\377";
NSLog(@"length = %zu", strlen(two55));
NSLog(@"first char = %u", (unsigned char)two55[0]);
// prints:
// length = 1
// first char = 255
// '8' is not an octal digit, so this is '\37' followed by '8'
char const *two56 = "\378";
NSLog(@"length = %zu", strlen(two56));
NSLog(@"first char = %u", (unsigned char)two56[0]);
NSLog(@"second char = %u", (unsigned char)two56[1]);
// prints:
// length = 2
// first char = 31 (octal 37)
// second char = 56 (ASCII code for '8')
Because the compiler will try to read up to three octal digits, an
octal escape with fewer than three digits can sometimes have an
unexpected interpretation. For example, embedding a form feed
character (ASCII code 12, octal 14) at the start of this string
produces the expected string:
char const *heading = "\14Preface";
NSLog(@"first char = %u", (unsigned char)heading[0]);
NSLog(@"second char = %u", (unsigned char)heading[1]);
// prints:
// first char = 12 (form feed, octal value 14)
// second char = 80 (ASCII code for 'P')
But if the character directly after '\14' is a valid octal digit, the
compiler produces something unintended:
char const *heading = "\141. Introduction";
NSLog(@"first char = %u", (unsigned char)heading[0]);
NSLog(@"second char = %u", (unsigned char)heading[1]);
// prints:
// first char = 97 (octal value 141)
// second char = 46 (ASCII code for '.')
The heading number '1' is a valid octal digit, so the compiler assumes
it's part of the octal escape. There are several ways to prevent this.
You can use an escape sequence to specify the ambiguous character,
break the string into parts, or simply pad the octal number with
leading zeros.
// dealing with ambiguous octal escapes
// replace possible octal characters with escapes
char const *heading1 = "\14\61. Introduction"; // '\61' is octal escape for '1'
// pad octal escape to three digits
char const *heading2 = "\0141. Introduction"; // unambiguous
// break string into parts
char const *heading3 = "\14" "1. Introduction"; // easier to read
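If you want to convince yourself that the three fixes are equivalent, a plain C sketch along these lines should do it. The function names are invented for the example, and the character codes assume ASCII.

```c
#include <stdbool.h>
#include <string.h>

// All three spellings should give: form feed (12), then "1. Introduction".
static const char heading1[] = "\14\61. Introduction";
static const char heading2[] = "\0141. Introduction";
static const char heading3[] = "\14" "1. Introduction";

// The ambiguous spelling: "\141" is read as one escape, the letter 'a' (97).
static const char ambiguous[] = "\141. Introduction";

// Hypothetical helper: true if every fix yields the same bytes.
bool octal_fixes_agree(void)
{
    return heading1[0] == 12 && heading1[1] == '1'
        && strcmp(heading1, heading2) == 0
        && strcmp(heading1, heading3) == 0;
}

// Hypothetical helper: true if the ambiguous form swallowed the '1'.
bool ambiguous_lost_a_character(void)
{
    return ambiguous[0] == 'a'
        && strlen(ambiguous) == strlen(heading1) - 1;
}
```

The ambiguous version is one character shorter because the form feed and the '1' were merged into a single 'a'.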
Hexadecimal Escape Sequences
Hexadecimal numbers can also be used in escape sequences to specify an
arbitrary byte value. Hexadecimal escape sequences begin with a
backslash (\) followed by 'x' and one or more hexadecimal (base 16)
digits. Note that the 'x' must be lower case. As with octal escapes,
you can pad hexadecimal escapes with leading zeros.
// hexadecimal escape sequence examples
char const *tab = "\x09"; // ASCII code 9 (horizontal tab)
char const *newline = "\xA"; // ASCII code 10 (new line/line feed)
char const *del = "\x7f"; // ASCII code 127 (delete)
The upper hexadecimal digits (represented by A through F) can be upper
or lower case.
Like octal escapes, hexadecimal escapes have a gotcha: the compiler
will interpret every valid hex digit after the "\x" as part of the
hexadecimal escape. For example, embedding a form feed character
(ASCII code 12) at the start of this string works as expected:
char const *title = "\xcThe C Language";
NSLog(@"first char = %u", (unsigned char)title[0]);
NSLog(@"second char = %u", (unsigned char)title[1]);
// prints:
// first char = 12 (form feed, hex value c)
// second char = 84 (ASCII code for 'T')
Since 'T' isn't a valid hex digit, the compiler figures out that the
first character is '\xc'. The following string doesn't work as
expected:
char const *title = "\xcC Language Primer";
NSLog(@"first char = %u", (unsigned char)title[0]);
NSLog(@"second char = %u", (unsigned char)title[1]);
// prints:
// first char = 204 (hex value cc)
// second char = 32 (ASCII code for space)
Since 'C' is a valid hex digit, the compiler sees the first character
as '\xcC' (cc in hex, 204 in decimal) and the second character as the
space after the 'C'. To prevent this, you can replace any ambiguous
character with an escape sequence, or better yet simply break the
string into parts.
// dealing with ambiguous hexadecimal escapes
// replace possible hex characters with escapes
char const *title1 = "\xc\103 Language Primer"; // '\103' is octal escape for 'C'
// break string into parts
char const *title2 = "\xc" "C Language Primer"; // much easier to read
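The same kind of check works for the hexadecimal case. This is only a sketch in plain C with made-up function names, again assuming ASCII character codes.

```c
#include <stdbool.h>
#include <string.h>

// Both fixes should give: form feed (12), then "C Language Primer".
static const char title1[] = "\xc\103 Language Primer";
static const char title2[] = "\xc" "C Language Primer";

// The ambiguous spelling: "\xcC" is read as the single byte 0xCC.
static const char ambiguous[] = "\xcC Language Primer";

// Hypothetical helper: true if both fixes yield the same bytes.
bool hex_fixes_agree(void)
{
    return title1[0] == 12 && title1[1] == 'C'
        && strcmp(title1, title2) == 0;
}

// Hypothetical helper: the first byte of the ambiguous string, as unsigned.
int ambiguous_first_byte(void)
{
    return (unsigned char)ambiguous[0];
}
```

Splitting the literal works because the compiler converts escape sequences inside each piece before it joins adjacent pieces together.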
As in octal escapes, the value of a hexadecimal escape is limited to
the range of 0 through 255. If you specify a hexadecimal escape
sequence larger than 255, the compiler will report "hex escape sequence
out of range" (as a warning or an error, depending on the compiler).
Next time we will continue our look at string literals by examining
Unicode escape sequences.