Regular Expressions
Welcome back to Objective-C Tuesdays after a long hiatus. In the last
couple of entries, we looked at
searching
and replacing
in C strings and
NSString
s.
Today we'll look at a more powerful way to search and replace in
strings: regular expressions.
A mini language
Regular expressions is a small, specialized programming language for
matching text. In addition to being specialized in scope, regular
expressions are also very compact, which can make them very hard to
read. I won't cover regular expression grammar in dept here—there are
plenty of regular expression tutorials, references and cheat sheets out
there. If you're looking for a good reference, I recommend O'Reilly's
Mastering
Regular Expressions and the
Regular
Expressions Cookbook.
Some programming languages have first class support for regular
expressions, including Perl, Ruby and JavaScript. Modern languages like
Python and Java support regular expressions as part of the language's
standard library. Unfortunately C doesn't have regular expression
support as part of its standard library. For Objective-C, regular
expression support first appeared with iOS 3.2 on iPad and 4.0 on
iPhone. Before that, developers needed to use a third party library to
use regular expressions in their apps.
Some simple examples
Most characters in regular expressions match themselves, but some
characters have special meaning. Here's a simple regular expression
that matches the word "foobar":
foobar
To match "foo" followed by one 'b' character, you could use:
foob?
Here, the '?' character is a special character that modifies the
expression before it, (in this case the character 'b') and matches it
zero or one times. Thus foob?
will match "foo" or "foob"
but not "foe".
To match "foo" followed by zero or more 'b' characters, you could use:
foob*
The '*' special character matches the preceding expression zero or more
times. foob*
will match "foo", "foob" and "foobbbbbb" but
not "fod".
To match "foo" followed by one or more 'b' characters, you could use:
foob+
The '+' special character matches the preceding expression one or more
times. foob+
will match "foob" and "foobbbbbb" but not
"foo" or "food".
Regular expression languages have many more capabilities. Though
different regular expression implementations have different
capabilities, most share a large set of common operations,.
C libraries for regular expressions
The PCRE or Perl
Compatible Regular Expressions library is a widely used C library
that implements the same regular expression language that's used in
Perl 5. The PCRE library is open source and distributed under a BSD
license. It's a substantial library, and due to the lack of support on
iOS for OS X style frameworks, it can be a bit challenging to include
in an iOS project. In addition, because it's a C library, working with
PCRE requires more low-level code than many Cocoa Touch programmers are
comfortable with.
Here's a PCRE code snippet that compiles the regular expression
foo(bar|fy)
and tests it against the string
"foofy"
.
#include <pcre.h>
#include <stdlib.h>
#include <string.h>
#define OUTPUT_VECTOR_COUNT 30 /* should be a multiple of 3 */
/* ... */
char const *pattern = "foo(bar|fy)";
int compileOptions = 0;
char const *error;
int errorOffset;
unsigned char const *characterTable = NULL;
pcre *regularExpression = pcre_compile(pattern, compileOptions,
&error, &errorOffset, characterTable);
if ( ! regularExpression) {
NSLog(@"ERROR: regular expression <%s> failed to compile\n", pattern);
NSLog(@" Error at offset %i: %s\n", errorOffset, error);
exit(EXIT_FAILURE);
}
pcre_extra *extraData = NULL;
char const *subject = "foofy";
int subjectLength = strlen(subject);
int subjectOffset = 0;
int execOptions = 0;
int outputVector[OUTPUT_VECTOR_COUNT];
int execResultCount = pcre_exec(regularExpression, extraData,
subject, subjectLength, subjectOffset,
execOptions, outputVector, OUTPUT_VECTOR_COUNT);
if (execResultCount == PCRE_ERROR_NOMATCH) {
NSLog(@"The subject <%s> did not match the pattern <%s>\n", subject, pattern);
} else if (execResultCount < 0) {
NSLog(@"Unexpected pcre_exec() result %i\n", execResultCount);
} else if (execResultCount == 0) {
NSLog(@"Output vector only has room for %i captured substrings\n",
execResultCount - 1);
} else {
int resultIndex;
NSLog(@"Found match for pattern <%s> in subject <%s> at offset %i\n",
pattern, subject, outputVector[0]);
for (resultIndex = 0; resultIndex < execResultCount; ++resultIndex) {
int substringIndex = 2 * resultIndex;
int substringStartingOffset = outputVector[substringIndex];
int substringEndingOffset = outputVector[substringIndex + 1];
char const* substringStart = subject + substringStartingOffset;
int substringLength = substringEndingOffset - substringStartingOffset;
NSLog(@"match %i: <%.*s>\n", resultIndex, substringLength, substringStart);
}
}
pcre_free(regularExpression);
/* ... */
As you can see, it takes a lot of boilerplate code to do even a simple
regular expression search using PCRE, but it's a very powerful library
used by many well knows open source and commercial projects, including
Apple's Safari browser. The PCRE project also includes a set of C++
wrappers that make the library easier to use from that language. If you
only plan to target OS X, there is also the very excellent
RegexKit open source
library that provides a nice Objective-C wrapper around PCRE. The PCRE
library can be built with Unicode support, but is limited to scanning
UTF-8 encoded Unicode strings, since it's a byte oriented library.
The
POSIX
Regular Expressions specification describes a regular
expression dialect and
C
interface that is supported on many variants of Unix and Linux,
including OS X and iOS. Like many specifications, POSIX regular
expressions are a "lowest common denominator" solution and lack some of
the advanced features found in regular expression variants like Perl
and PCRE. The POSIX regular expression interface is much smaller than
what PCRE provides: four functions, two structs and a bunch of
constants. Here's a PCRE code snippet that compiles the regular
expression foo(bar|fy)
and tests it against the string
"foofy"
using the POSIX interface.
#import <regex.h>
#define MATCH_COUNT 10
/* ... */
regex_t regularExpression;
char const *pattern = "foo(fy|bar)";
int compileFlags = REG_EXTENDED;
int compileResult = regcomp(®ularExpression, pattern, compileFlags);
if (compileResult) {
size_t bufferSize = regerror(compileResult, ®ularExpression, NULL, 0);
char *buffer = malloc(bufferSize);
if ( ! buffer) {
NSLog(@"Memory allocation failed");
return;
}
regerror(compileResult, ®ularExpression, buffer, bufferSize);
NSLog(@"%s", buffer);
free(buffer);
return;
}
char const *string = "foofy";
regmatch_t matches[MATCH_COUNT];
int executeFlags = 0;
int executeResult = regexec(®ularExpression, string,
MATCH_COUNT, matches, executeFlags);
if (executeResult == REG_NOMATCH) {
NSLog(@"Pattern <%s> doesn't match string <%s>\n", pattern, string);
} else {
NSLog(@"Pattern <%s> matches string <%s>\n", pattern, string);
NSLog(@"Found %lu submatches\n", (unsigned long)regularExpression.re_nsub);
for (size_t i = 0; i < regularExpression.re_nsub + 1; ++i) {
int substringLength = matches[i].rm_eo - matches[i].rm_so;
char const *substringStart = string + matches[i].rm_so;
NSLog(@"submatch %lu: <%.*s>\n", (unsigned long)i, substringLength, substringStart);
}
}
regfree(®ularExpression);
/* ... */
This is a little more concise than the PCRE version, mainly because the
POSIX functions have fewer options, but is still very low level. POSIX
regular expression implementations are generally regarded as slower
than PCRE, and don't provide any explicit Unicode support (though you
may be able to do regex matching against UTF-8 encoded strings if
you're careful to keep your UTF-8 strings
normalized).
The ICU or
International Components for Unicode library is included as part of
iOS. (ICU is open source and carries a
non-restrictive
license.) The ICU is a general purpose library for working with
Unicode text which includes Unicode-aware regular expression support.
It has two versions: one for Java and another for C and C++. The C/C++
version features a low level C function interface and a set of higher
level C++ classes. I'll only look at the C interface to ICU here.
Because the ICU library is a general purpose Unicode library, most
operations require that C strings be converted into ICU's Unicode text
type, UChar *
, which makes the ICU example the longest yet:
// compile regular expression
char const *pattern = "foo(bar|fy)";
uint32_t compileFlags = 0;
UParseError parseError;
UErrorCode errorCode = U_ZERO_ERROR;
URegularExpression* regularExpression = uregex_openC(pattern, compileFlags,
&parseError, &errorCode);
if (errorCode) {
NSLog(@"uregex_openC() failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
NSLog(@" parse error at line %li, offset %li" ,
(long)parseError.line, (long)parseError.offset);
return;
}
// determine size of search text as ICU Unicode string
UChar *unicodeText = NULL;
int32_t unicodeTextCapacity = 0;
int32_t unicodeTextLength;
char const *utf8Text = "foofy";
int32_t utf8TextLength = -1; /* null terminated */
errorCode = U_ZERO_ERROR;
u_strFromUTF8(unicodeText, unicodeTextCapacity, &unicodeTextLength,
utf8Text, utf8TextLength, &errorCode);
if (errorCode != U_BUFFER_OVERFLOW_ERROR) {
NSLog(@"Conversion to Unicode string failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
uregex_close(regularExpression);
return;
}
// allocate buffer for search text ICU Unicode string
unicodeTextCapacity = unicodeTextLength + 1;
unicodeText = calloc(sizeof(UChar), unicodeTextLength);
if ( ! unicodeText) {
NSLog(@"Memory allocation failed");
uregex_close(regularExpression);
return;
}
// convert search text to ICU Unicode string
errorCode = U_ZERO_ERROR;
u_strFromUTF8(unicodeText, unicodeTextCapacity, &unicodeTextLength,
utf8Text, utf8TextLength, &errorCode);
uregex_setText(regularExpression, unicodeText, unicodeTextLength, &errorCode);
if (errorCode) {
NSLog(@"uregex_setText() failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
free(unicodeText);
uregex_close(regularExpression);
return;
}
// search for regular expression
int32_t startIndex = 0;
errorCode = U_ZERO_ERROR;
BOOL matchFound = uregex_find(regularExpression, startIndex, &errorCode);
if (errorCode) {
NSLog(@"uregex_find() failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
free(unicodeText);
uregex_close(regularExpression);
return;
}
if (matchFound) {
// get number of subgroup matches
NSLog(@"Pattern <%s> matched string <%s>", pattern, utf8Text);
errorCode = U_ZERO_ERROR;
int32_t subgroupCount = uregex_groupCount(regularExpression, &errorCode);
if (errorCode) {
NSLog(@"uregex_groupCount() failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
free(unicodeText);
uregex_close(regularExpression);
return;
}
// enumerate subgroup matches
NSLog(@"Matched %li subgroups", (long)subgroupCount);
for (int32_t i = 0; i <= subgroupCount; ++i) {
// get size of the subgroup
UChar *subgroup = NULL;
int32_t subgroupCapacity = 0;
errorCode = U_ZERO_ERROR;
int32_t subgroupLength = uregex_group(regularExpression, i,
subgroup, subgroupCapacity,
&errorCode);
if (errorCode != U_BUFFER_OVERFLOW_ERROR) {
NSLog(@"uregex_group() failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
break;
}
// allocate buffer to hold the subgroup
subgroupCapacity = subgroupLength + 1;
subgroup = calloc(sizeof(UChar), subgroupCapacity);
if ( ! subgroup) {
NSLog(@"Memory allocation failed");
return;
}
// copy subgroup into buffer
errorCode = U_ZERO_ERROR;
uregex_group(regularExpression, i, subgroup, subgroupCapacity, &errorCode);
// determine size of buffer to hold subgroup as UTF8 string
char *utf8Subgroup = NULL;
int32_t utf8SubgroupCapacity = 0;
int32_t utf8SubgroupLength;
errorCode = U_ZERO_ERROR;
u_strToUTF8(utf8Subgroup, utf8SubgroupCapacity, &utf8SubgroupLength,
subgroup, subgroupLength, &errorCode);
if (errorCode != U_BUFFER_OVERFLOW_ERROR) {
NSLog(@"u_strToUTF8() failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
free(subgroup);
break;
}
// allocate buffer to hold subgroup as UTF8 string
utf8SubgroupCapacity = utf8SubgroupLength + 1;
utf8Subgroup = calloc(sizeof(char), utf8SubgroupCapacity);
if ( ! utf8Subgroup) {
NSLog(@"Memory allocation failed");
free(subgroup);
break;
}
// convert subgroup to UTF8 string
errorCode = U_ZERO_ERROR;
u_strToUTF8(utf8Subgroup, utf8SubgroupCapacity, &utf8SubgroupLength,
subgroup, subgroupLength, &errorCode);
if (errorCode) {
NSLog(@"u_strToUTF8() failed: %li: %s",
(long)errorCode, u_errorName(errorCode));
free(subgroup);
free(utf8Subgroup);
break;
}
// print subgroup UTF8 string
NSLog(@"submatch %lu: <%.*s>\n",
(unsigned long)i, utf8SubgroupLength, utf8Subgroup);
free(subgroup);
free(utf8Subgroup);
}
} else {
NSLog(@"Pattern <%s> did not match string <%s>", pattern, utf8Text);
}
free(unicodeText);
uregex_close(regularExpression);
I wouldn't be surprised if there's a memory leak in there somewhere.
I'm also no expert on this library, and I may be doing some things the
hard way. If you're using C++, I'm sure a lot of this boilerplate code
goes away. If you don't need or want to mix C++ into your iOS app,
there are some Objective-C alternatives that are much more satisfying.
Regular expressions in Objective-C
In iOS 3.2, Apple added a new NSStringCompareOptions
value
for use in the various -rangeOfString:options:
methods of
NSString
: NSRegularExpressionSearch
.
This option uses the regular expression support of the ICU library to
do simple regular expression matching on an NSString
object. Unfortunately, Apple only implemented a very minimal regular
expression interface on NSString
. You can search, but not
replace, within an NSString
and powerful regular
expression features like subgroup matching are not exposed.
NSString *string = @"foofy";
NSString *pattern = @"foo(bar|fy)";
NSRange match = [string rangeOfString:pattern
options:NSRegularExpressionSearch];
if (match.location == NSNotFound) {
NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string);
} else {
NSLog(@"Pattern <%@> matches string <%@> starting at location %lu",
pattern, string, (unsigned long)match.location);
}
In iOS 4.0, Cocoa Touch gained the
NSRegularExpression
class, a
full-featured regular expression processor built on top of the ICU
regular expression library.
NSError *error;
NSString *pattern = @"foo(bar|fy)";
NSRegularExpression *regularExpression = [NSRegularExpression regularExpressionWithPattern:pattern
options:0
error:&error];
if ( ! regularExpression) {
NSLog(@"Error in pattern <%@>: %@", pattern, error);
return;
}
NSString *string = @"foofy";
NSRange range = NSMakeRange(0, [string length]);
NSArray *matches = [regularExpression matchesInString:string
options:0
range:range];
if ([matches count]) {
NSTextCheckingResult *firstMatch = [matches objectAtIndex:0];
NSLog(@"Found %lu submatches", (unsigned long)[firstMatch numberOfRanges]);
for (NSUInteger i = 0; i < [firstMatch numberOfRanges]; ++i) {
NSRange range = [firstMatch rangeAtIndex:i];
NSString *submatch = [string substringWithRange:range];
NSLog(@"submatch %lu: <%@>", (unsigned long)i, submatch);
}
} else {
NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string);
}
This is much easier to use than the underlying ICU C library and a good
choice for iOS apps that target 4.0 and later.
The
RegexKitLite
library provides an alternate Objective-C wrapper around the low level
ICU regular expression library that supports both Mac OS X and iOS,
including iOS 3.0. RegexKitLite provides a full featured set of regular
expression methods in a category that extends NSString
and
can do matching, searching, replacing and supports subgroups.
RegexKitLite is open source and available under a BSD license.
NSString *string = @"foofy";
NSString *pattern = @"foo(bar|fy)";
NSArray *submatches = [string captureComponentsMatchedByRegex:pattern];
if ([submatches count]) {
NSLog(@"Found %lu submatches", (unsigned long)[submatches count]);
for (NSUInteger i = 0; i < [submatches count]; ++i) {
NSLog(@"submatch %lu: <%@>", (unsigned long)i, [submatches objectAtIndex:i]);
}
} else {
NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string);
}
RegexKitLite code is a little more concise than the
NSRegularExpression
class, but comparable in power, since
they're both built on top of ICU.
Which one to use?
Unless you're writing cross-platform code in C or need to target very
old versions of iOS, I recommend you avoid using the C regular
expression libraries. If you must, the POSIX regex functions are the
easiest to get started with, but if you're targeting non-Apple
platforms, watch out for implementation differences across platforms.
Both PCRE and ICU are good cross-platform choices; choose ICU if you
also need robust Unicode support, PCRE if you're mainly working with
eight bit encodings.
On the Objective-C side, choosing NSRegularExpression
or
RegexKitLite is largely up to personal preference.
NSRegularExpression
is a no-brainer if you're targeting
iOS 4.0 and later (but currently not supported on OS X). Adding
RegexKitLite to your project is as easy as adding two files to Xcode
and currently works on both Apple operating systems. And the
-rangeOfString:options:
method on NSString
is
handy for simple searches.
<eot /> (end of topic)
That concludes our look at
strings in
Objective-C. Data
structures is our next topic.