Slicing And Dicing Strings
Last time, we looked at C string and NSString comparison and equality.
Today we'll examine functions and methods for creating substrings of C
strings and NSStrings.
Substrings of C strings
Creating a C string requires you to explicitly manage the memory the
string lives in. Depending on how long you need to keep the C string
around, you can use either a fixed buffer or a dynamically allocated
one. As always with C strings, you need to be careful not to write
past the end of the buffer.
Creating a substring that starts at the beginning of the source string
is straightforward: use the strncpy() function. There's a big gotcha
when using strncpy() to copy a substring: if it doesn't reach the
source's null terminator within the characters it copies, it doesn't
add one to the destination. Here's an example of copying the first
three characters of a C string into a fixed buffer:
// copy substring from start of source
// using a fixed buffer
char const *source = "foobar";
char buffer[4]; // make sure buffer includes
// space for null terminator
strncpy(buffer, source, 3); // copy first 3 chars from source
buffer[3] = '\0'; // remember to add null terminator
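When the count isn't a compile-time constant, it can help to wrap those two steps in a function that also guards against writing past the end of the buffer. The following is a minimal sketch, not a standard function: the name copy_prefix_fixed() and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copies at most `count` chars from the start of
// `source` into `buffer`, never writes more than `bufferSize` bytes,
// and always null terminates. Returns the number of chars copied.
size_t copy_prefix_fixed(char *buffer, size_t bufferSize,
                         char const *source, size_t count)
{
    assert(buffer != NULL && source != NULL && bufferSize > 0);

    // leave room for the null terminator
    if (count > bufferSize - 1) {
        count = bufferSize - 1;
    }

    strncpy(buffer, source, count); // may not terminate the string...
    buffer[count] = '\0';           // ...so always do it ourselves

    return strlen(buffer);
}
```

With a char buffer[4], calling copy_prefix_fixed(buffer, sizeof buffer, "foobar", 3) leaves "foo" in the buffer. Asking for more characters than the buffer can hold simply truncates the copy instead of overrunning the buffer.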
Using a dynamic buffer is similar, but requires explicit memory
management.
// copy substring from start of source
// using a dynamic buffer
char const *source = "foobar";
char *buffer = malloc(4 * sizeof(char)); // make sure buffer includes
// space for null terminator
if ( ! buffer) {
// must handle allocation failure
}
strncpy(buffer, source, 3); // copy first 3 chars from source
buffer[3] = '\0'; // remember to add null terminator
// use buffer ...
// don't forget to free() buffer when done
free(buffer);
You can make this a little more compact by using calloc() instead of
malloc(). The calloc() function allocates memory like malloc(), then
clears all the bytes to zero. As long as you make sure to include an
extra byte at the end, your new substring will be null terminated:
// copy substring from start of source
// using a dynamic buffer
// allocated with calloc()
char const *source = "foobar";
char *buffer = calloc(4, sizeof(char)); // make sure buffer includes
// space for null terminator
if ( ! buffer) {
// handle allocation failure
}
strncpy(buffer, source, 3); // copy first 3 chars from source
// last char in buffer is already '\0'
// use buffer ...
// don't forget to free() buffer when done
free(buffer);
There's not a huge difference between malloc() and calloc(), so choose
whichever one you're more used to using, or use calloc() if you don't
have a strong preference. The cost of clearing a range of memory to
zeros is so tiny as to not be worth considering in most circumstances,
and knowing that your buffer is initialized to zeros can be handy.
There's no standard C function for getting a substring that starts
somewhere in the middle of the source string, because one isn't needed
-- you simply move the pointer from the start of the string. Here's an
illustration:
// C strings are pointers
char const *string = "foobar";
NSLog(@"'%s'", string);
// prints out 'foobar'
char const *substring = string + 3;
NSLog(@"'%s'", substring);
// prints out 'bar'
You can add an integer value to the C string pointer to get a pointer
to the middle of the source string -- just be careful not to go off the
end of the string! If you only need the substring for a short period
of time, or if you know that the source string will live longer than
the substring and never change, it's safe to simply create a substring
this way. However, you can introduce weird bugs if you get this wrong.
When in doubt, copy the substring to a new buffer:
// create a substring from the middle of a string
char const *source = "foobar";
char const *substringSource = source + 3;
size_t charCount = strlen(substringSource) + 1;
char *buffer = calloc(charCount, sizeof(char));
if ( ! buffer) {
// handle allocation failure
}
strcpy(buffer, substringSource);
// use buffer ...
free(buffer);
Here we calculate the starting point by simply adding 3 to the string
pointer source. Then we figure out the number of chars we need to
allocate using the strlen() function, remembering to add 1 for the
null terminator character. After allocating memory, the strcpy()
function copies all the characters from substringSource into buffer.
Unlike strncpy(), strcpy() always copies the null terminator, so this
code will be the same whether we use calloc() or malloc() to allocate
the buffer.
If you need to grab a substring that falls between the beginning and
end of a longer string, you combine these two techniques: use pointer
arithmetic to get a pointer to the start of the substring, then use
strncpy() to copy just the characters you need.
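Here's a sketch of how the two techniques might be combined. The copy_substring() function is a hypothetical helper, not part of the standard library, and it counts in bytes, so it assumes a single byte encoding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns a newly allocated copy of up to `count`
// chars of `source`, starting at byte offset `start`. Returns NULL if
// `start` is past the end of the string or if allocation fails.
// The caller must free() the result.
char *copy_substring(char const *source, size_t start, size_t count)
{
    assert(source != NULL);

    size_t sourceLength = strlen(source);
    if (start > sourceLength) {
        return NULL; // would go off the end of the string
    }

    // pointer arithmetic: move to the start of the substring
    char const *substringSource = source + start;

    // clamp count to the chars that remain after start
    size_t remaining = sourceLength - start;
    if (count > remaining) {
        count = remaining;
    }

    char *buffer = malloc(count + 1); // + 1 for the null terminator
    if ( ! buffer) {
        return NULL;
    }

    strncpy(buffer, substringSource, count); // copy just the chars we need
    buffer[count] = '\0';                    // strncpy() won't add this here
    return buffer;
}
```

For example, copy_substring("foobar", 2, 2) returns a buffer containing "ob". Checking start against the string's length before doing the pointer arithmetic is what keeps the function from walking off the end of the source.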
Warning: beware encoding issues!
Slicing and dicing C strings is easy when you're using a single byte
encoding like ASCII. If you're using a multibyte encoding like UTF-8,
you need to be aware that one logical character may require two or more
bytes. If you want to omit the first three logical characters in a
string, you need to examine each byte from the start of a string to
determine if it's part of a multibyte sequence, and adjust your string
pointer accordingly. If you need to work with multibyte encodings, I
recommend finding an appropriate library for the encoding, such as the
International Components for Unicode
for working with Unicode encodings. Or better yet, transform your C
strings into NSStrings.
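To give a feel for the byte-by-byte examination described above, here's a rough sketch of skipping the first few characters of a UTF-8 string. The utf8_skip() function is a hypothetical helper. It assumes the input is already valid UTF-8 and does no validation, and it counts Unicode code points, which aren't always the same thing as what a user perceives as one character (combining accents, for instance). Treat it as an illustration rather than a replacement for a proper library.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: returns a pointer just past the first `count`
// code points of a UTF-8 string, stopping early at the terminator.
// Assumes `string` is valid UTF-8; it does no validation.
char const *utf8_skip(char const *string, size_t count)
{
    assert(string != NULL);

    unsigned char const *p = (unsigned char const *)string;
    while (count > 0 && *p != '\0') {
        p++; // step over the lead byte
        // step over continuation bytes, which look like 10xxxxxx
        while ((*p & 0xC0) == 0x80) {
            p++;
        }
        count--;
    }
    return (char const *)p;
}
```

Given the bytes for "héllo", where the é takes two bytes, skipping two code points moves the pointer forward three bytes rather than two.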
Substrings of NSStrings
There are three ways to get a substring of an NSString. First we'll
look at taking a substring from the start of an NSString:
// create a substring from the start of source
NSString *source = @"foobar";
NSString *substring = [source substringToIndex:3];
// substring is "foo"
The substring returned by -substringToIndex: is autoreleased. If
you're managing memory manually, you should -retain or -copy it if you
need to hold on to it; under ARC, keeping a strong reference is enough.
Similarly, to get a substring that starts in the middle of an NSString
and goes to the end:
// create a substring to the end of source
NSString *source = @"foobar";
NSString *substring = [source substringFromIndex:3];
// substring is "bar"
Finally, the general purpose way to create a substring of an NSString
is the -substringWithRange: method. It takes an NSRange structure,
which is defined something like this:
// NSRange structure
struct NSRange {
    NSUInteger location;
    NSUInteger length;
};
When used with the -substringWithRange: method, the NSRange's location
field is the zero-based index of the first character to be included in
the substring, and the length field is the number of characters to
include in the substring. Here are some examples:
// -substringWithRange: examples
NSString *source = @"foobar";
NSRange range;
range.location = 0;
range.length = 3;
NSString *frontHalf = [source substringWithRange:range];
// frontHalf is "foo"
range.location = 3;
range.length = 3;
NSString *backHalf = [source substringWithRange:range];
// backHalf is "bar"
range.location = 2;
range.length = 2;
NSString *middle = [source substringWithRange:range];
// middle is "ob"
One word of caution: if the range you give falls outside the receiver
(the source string), this method will raise an NSRangeException.
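If the location and length come from somewhere you don't control, it's worth checking them before asking for the substring. The sketch below shows the arithmetic in plain C. The Range struct is a hypothetical stand-in for NSRange and range_fits() is a made-up name, so read it as an illustration of the bounds check rather than as Foundation API.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for NSRange, so the check can be shown in plain C
struct Range {
    size_t location;
    size_t length;
};

// Returns true if `range` lies entirely within a string of
// `stringLength` characters. Written to avoid computing
// location + length, which could overflow.
bool range_fits(struct Range range, size_t stringLength)
{
    return range.location <= stringLength
        && range.length <= stringLength - range.location;
}
```

Comparing the length against the space left after the location, instead of adding the two fields together, keeps the check correct even when a bogus length is close to the maximum unsigned value.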
Setting the fields of NSRange is fairly verbose; it's generally more
convenient to use the NSMakeRange() function to create the NSRange
structure instead.
// NSMakeRange() example
NSString *source = @"foobar";
NSString *frontHalf = [source substringWithRange:NSMakeRange(0, 3)];
// frontHalf is "foo"
NSString encoding mostly not a worry
Internally, NSString uses UTF-16 encoding. Although UTF-16 is a
variable length encoding like UTF-8, characters from the basic
multilingual plane are all two bytes (one word) in length. If you're
certain that your NSString contains only basic multilingual plane
characters, then methods like -length and -substringWithRange: will
work exactly as you expect them to. However, if your NSString includes
characters outside the basic multilingual plane, it will contain
surrogate pairs, which are two-word sequences that represent a single
character. You'll find that -length tells you the number of words
rather than logical characters, and if you're not careful, methods
like -substringWithRange: can split a surrogate pair in half, leaving
you with an invalidly encoded string.
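To make the surrogate pair problem concrete, here's a sketch in plain C that works on a raw array of UTF-16 words. Both helper functions are hypothetical, written only for illustration. The surrogate ranges themselves (0xD800-0xDBFF for the first word of a pair, 0xDC00-0xDFFF for the second) come from the UTF-16 encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// true if `unit` is the first half of a surrogate pair (0xD800-0xDBFF)
static bool is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// true if `unit` is the second half of a surrogate pair (0xDC00-0xDFFF)
static bool is_low_surrogate(uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Hypothetical helper: counts code points in `length` UTF-16 words,
// treating each well-formed surrogate pair as one code point.
size_t utf16_code_point_count(uint16_t const *units, size_t length)
{
    assert(units != NULL || length == 0);

    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if (is_high_surrogate(units[i])
            && i + 1 < length
            && is_low_surrogate(units[i + 1])) {
            i++; // skip the low half of the pair
        }
        count++;
    }
    return count;
}

// Hypothetical helper: true if cutting the buffer just before `index`
// would separate the two halves of a surrogate pair.
bool utf16_index_splits_pair(uint16_t const *units, size_t length,
                             size_t index)
{
    assert(units != NULL || length == 0);

    return index > 0 && index < length
        && is_high_surrogate(units[index - 1])
        && is_low_surrogate(units[index]);
}
```

For the four words 'a', 0xD83D, 0xDE00, 'b' (the middle two encode U+1F600), the count is three code points even though the length is four, and index 2 is reported as an unsafe place to cut. In Objective-C itself, NSString's -rangeOfComposedCharacterSequenceAtIndex: is the usual way to find a safe boundary before calling -substringWithRange:.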
Unless your application needs to work with characters outside the basic
multilingual plane, the easiest solution is to filter out such
characters when you accept data from a source outside your app. Since
the basic multilingual plane contains all the characters in common use
in most modern languages, this is sufficient for many applications.
Don't count on the input keyboard to limit the user to characters in
the basic multilingual plane (emoji, for example, lie outside it), and
if your app reads data from the network, such as an RSS feed you don't
control, you need to watch out for this as well.
Next time, we'll look at searching in C strings and NSStrings.