Welcome back to Objective-C Tuesdays after a long hiatus. In the last
couple of entries, we looked at
searching
and replacing
in C strings and
NSString
s.
Today we'll look at a more powerful way to search and replace in
strings: regular expressions.
Regular expressions is a small, specialized programming language for matching text. In addition to being specialized in scope, regular expressions are also very compact, which can make them very hard to read. I won't cover regular expression grammar in dept here—there are plenty of regular expression tutorials, references and cheat sheets out there. If you're looking for a good reference, I recommend O'Reilly's Mastering Regular Expressions and the Regular Expressions Cookbook.
Some programming languages have first class support for regular expressions, including Perl, Ruby and JavaScript. Modern languages like Python and Java support regular expressions as part of the language's standard library. Unfortunately C doesn't have regular expression support as part of its standard library. For Objective-C, regular expression support first appeared with iOS 3.2 on iPad and 4.0 on iPhone. Before that, developers needed to use a third party library to use regular expressions in their apps.
Most characters in regular expressions match themselves, but some characters have special meaning. Here's a simple regular expression that matches the word "foobar":
foobar
To match "foo" followed by one 'b' character, you could use:
foob?
Here, the '?' character is a special character that modifies the
expression before it, (in this case the character 'b') and matches it
zero or one times. Thus foob?
will match "foo" or "foob"
but not "foe".
To match "foo" followed by zero or more 'b' characters, you could use:
foob*
The '*' special character matches the preceding expression zero or more
times. foob*
will match "foo", "foob" and "foobbbbbb" but
not "fod".
To match "foo" followed by one or more 'b' characters, you could use:
foob+
The '+' special character matches the preceding expression one or more
times. foob+
will match "foob" and "foobbbbbb" but not
"foo" or "food".
Regular expression languages have many more capabilities. Though different regular expression implementations have different capabilities, most share a large set of common operations,.
The PCRE or Perl Compatible Regular Expressions library is a widely used C library that implements the same regular expression language that's used in Perl 5. The PCRE library is open source and distributed under a BSD license. It's a substantial library, and due to the lack of support on iOS for OS X style frameworks, it can be a bit challenging to include in an iOS project. In addition, because it's a C library, working with PCRE requires more low-level code than many Cocoa Touch programmers are comfortable with.
Here's a PCRE code snippet that compiles the regular expression
foo(bar|fy)
and tests it against the string
"foofy"
.
#include <pcre.h> #include <stdlib.h> #include <string.h> #define OUTPUT_VECTOR_COUNT 30 /* should be a multiple of 3 */ /* ... */ char const *pattern = "foo(bar|fy)"; int compileOptions = 0; char const *error; int errorOffset; unsigned char const *characterTable = NULL; pcre *regularExpression = pcre_compile(pattern, compileOptions, &error, &errorOffset, characterTable); if ( ! regularExpression) { NSLog(@"ERROR: regular expression <%s> failed to compile\n", pattern); NSLog(@" Error at offset %i: %s\n", errorOffset, error); exit(EXIT_FAILURE); } pcre_extra *extraData = NULL; char const *subject = "foofy"; int subjectLength = strlen(subject); int subjectOffset = 0; int execOptions = 0; int outputVector[OUTPUT_VECTOR_COUNT]; int execResultCount = pcre_exec(regularExpression, extraData, subject, subjectLength, subjectOffset, execOptions, outputVector, OUTPUT_VECTOR_COUNT); if (execResultCount == PCRE_ERROR_NOMATCH) { NSLog(@"The subject <%s> did not match the pattern <%s>\n", subject, pattern); } else if (execResultCount < 0) { NSLog(@"Unexpected pcre_exec() result %i\n", execResultCount); } else if (execResultCount == 0) { NSLog(@"Output vector only has room for %i captured substrings\n", execResultCount - 1); } else { int resultIndex; NSLog(@"Found match for pattern <%s> in subject <%s> at offset %i\n", pattern, subject, outputVector[0]); for (resultIndex = 0; resultIndex < execResultCount; ++resultIndex) { int substringIndex = 2 * resultIndex; int substringStartingOffset = outputVector[substringIndex]; int substringEndingOffset = outputVector[substringIndex + 1]; char const* substringStart = subject + substringStartingOffset; int substringLength = substringEndingOffset - substringStartingOffset; NSLog(@"match %i: <%.*s>\n", resultIndex, substringLength, substringStart); } } pcre_free(regularExpression); /* ... */
As you can see, it takes a lot of boilerplate code to do even a simple regular expression search using PCRE, but it's a very powerful library used by many well knows open source and commercial projects, including Apple's Safari browser. The PCRE project also includes a set of C++ wrappers that make the library easier to use from that language. If you only plan to target OS X, there is also the very excellent RegexKit open source library that provides a nice Objective-C wrapper around PCRE. The PCRE library can be built with Unicode support, but is limited to scanning UTF-8 encoded Unicode strings, since it's a byte oriented library.
The
POSIX
Regular Expressions specification describes a regular
expression dialect and
C
interface that is supported on many variants of Unix and Linux,
including OS X and iOS. Like many specifications, POSIX regular
expressions are a "lowest common denominator" solution and lack some of
the advanced features found in regular expression variants like Perl
and PCRE. The POSIX regular expression interface is much smaller than
what PCRE provides: four functions, two structs and a bunch of
constants. Here's a PCRE code snippet that compiles the regular
expression foo(bar|fy)
and tests it against the string
"foofy"
using the POSIX interface.
#import <regex.h> #define MATCH_COUNT 10 /* ... */ regex_t regularExpression; char const *pattern = "foo(fy|bar)"; int compileFlags = REG_EXTENDED; int compileResult = regcomp(®ularExpression, pattern, compileFlags); if (compileResult) { size_t bufferSize = regerror(compileResult, ®ularExpression, NULL, 0); char *buffer = malloc(bufferSize); if ( ! buffer) { NSLog(@"Memory allocation failed"); return; } regerror(compileResult, ®ularExpression, buffer, bufferSize); NSLog(@"%s", buffer); free(buffer); return; } char const *string = "foofy"; regmatch_t matches[MATCH_COUNT]; int executeFlags = 0; int executeResult = regexec(®ularExpression, string, MATCH_COUNT, matches, executeFlags); if (executeResult == REG_NOMATCH) { NSLog(@"Pattern <%s> doesn't match string <%s>\n", pattern, string); } else { NSLog(@"Pattern <%s> matches string <%s>\n", pattern, string); NSLog(@"Found %lu submatches\n", (unsigned long)regularExpression.re_nsub); for (size_t i = 0; i < regularExpression.re_nsub + 1; ++i) { int substringLength = matches[i].rm_eo - matches[i].rm_so; char const *substringStart = string + matches[i].rm_so; NSLog(@"submatch %lu: <%.*s>\n", (unsigned long)i, substringLength, substringStart); } } regfree(®ularExpression); /* ... */
This is a little more concise than the PCRE version, mainly because the POSIX functions have fewer options, but is still very low level. POSIX regular expression implementations are generally regarded as slower than PCRE, and don't provide any explicit Unicode support (though you may be able to do regex matching against UTF-8 encoded strings if you're careful to keep your UTF-8 strings normalized).
The ICU or International Components for Unicode library is included as part of iOS. (ICU is open source and carries a non-restrictive license.) The ICU is a general purpose library for working with Unicode text which includes Unicode-aware regular expression support. It has two versions: one for Java and another for C and C++. The C/C++ version features a low level C function interface and a set of higher level C++ classes. I'll only look at the C interface to ICU here.
Because the ICU library is a general purpose Unicode library, most
operations require that C strings be converted into ICU's Unicode text
type, UChar *
, which makes the ICU example the longest yet:
// compile regular expression char const *pattern = "foo(bar|fy)"; uint32_t compileFlags = 0; UParseError parseError; UErrorCode errorCode = U_ZERO_ERROR; URegularExpression* regularExpression = uregex_openC(pattern, compileFlags, &parseError, &errorCode); if (errorCode) { NSLog(@"uregex_openC() failed: %li: %s", (long)errorCode, u_errorName(errorCode)); NSLog(@" parse error at line %li, offset %li" , (long)parseError.line, (long)parseError.offset); return; } // determine size of search text as ICU Unicode string UChar *unicodeText = NULL; int32_t unicodeTextCapacity = 0; int32_t unicodeTextLength; char const *utf8Text = "foofy"; int32_t utf8TextLength = -1; /* null terminated */ errorCode = U_ZERO_ERROR; u_strFromUTF8(unicodeText, unicodeTextCapacity, &unicodeTextLength, utf8Text, utf8TextLength, &errorCode); if (errorCode != U_BUFFER_OVERFLOW_ERROR) { NSLog(@"Conversion to Unicode string failed: %li: %s", (long)errorCode, u_errorName(errorCode)); uregex_close(regularExpression); return; } // allocate buffer for search text ICU Unicode string unicodeTextCapacity = unicodeTextLength + 1; unicodeText = calloc(sizeof(UChar), unicodeTextLength); if ( ! unicodeText) { NSLog(@"Memory allocation failed"); uregex_close(regularExpression); return; } // convert search text to ICU Unicode string errorCode = U_ZERO_ERROR; u_strFromUTF8(unicodeText, unicodeTextCapacity, &unicodeTextLength, utf8Text, utf8TextLength, &errorCode); uregex_setText(regularExpression, unicodeText, unicodeTextLength, &errorCode); if (errorCode) { NSLog(@"uregex_setText() failed: %li: %s", (long)errorCode, u_errorName(errorCode)); free(unicodeText); uregex_close(regularExpression); return; } // search for regular expression int32_t startIndex = 0; errorCode = U_ZERO_ERROR; BOOL matchFound = uregex_find(regularExpression, startIndex, &errorCode); if (errorCode) { NSLog(@"uregex_find() failed: %li: %s", (long)errorCode, u_errorName(errorCode)); free(unicodeText); uregex_close(regularExpression); return; } if (matchFound) { // get number of subgroup matches NSLog(@"Pattern <%s> matched string <%s>", pattern, utf8Text); errorCode = U_ZERO_ERROR; int32_t subgroupCount = uregex_groupCount(regularExpression, &errorCode); if (errorCode) { NSLog(@"uregex_groupCount() failed: %li: %s", (long)errorCode, u_errorName(errorCode)); free(unicodeText); uregex_close(regularExpression); return; } // enumerate subgroup matches NSLog(@"Matched %li subgroups", (long)subgroupCount); for (int32_t i = 0; i <= subgroupCount; ++i) { // get size of the subgroup UChar *subgroup = NULL; int32_t subgroupCapacity = 0; errorCode = U_ZERO_ERROR; int32_t subgroupLength = uregex_group(regularExpression, i, subgroup, subgroupCapacity, &errorCode); if (errorCode != U_BUFFER_OVERFLOW_ERROR) { NSLog(@"uregex_group() failed: %li: %s", (long)errorCode, u_errorName(errorCode)); break; } // allocate buffer to hold the subgroup subgroupCapacity = subgroupLength + 1; subgroup = calloc(sizeof(UChar), subgroupCapacity); if ( ! subgroup) { NSLog(@"Memory allocation failed"); return; } // copy subgroup into buffer errorCode = U_ZERO_ERROR; uregex_group(regularExpression, i, subgroup, subgroupCapacity, &errorCode); // determine size of buffer to hold subgroup as UTF8 string char *utf8Subgroup = NULL; int32_t utf8SubgroupCapacity = 0; int32_t utf8SubgroupLength; errorCode = U_ZERO_ERROR; u_strToUTF8(utf8Subgroup, utf8SubgroupCapacity, &utf8SubgroupLength, subgroup, subgroupLength, &errorCode); if (errorCode != U_BUFFER_OVERFLOW_ERROR) { NSLog(@"u_strToUTF8() failed: %li: %s", (long)errorCode, u_errorName(errorCode)); free(subgroup); break; } // allocate buffer to hold subgroup as UTF8 string utf8SubgroupCapacity = utf8SubgroupLength + 1; utf8Subgroup = calloc(sizeof(char), utf8SubgroupCapacity); if ( ! utf8Subgroup) { NSLog(@"Memory allocation failed"); free(subgroup); break; } // convert subgroup to UTF8 string errorCode = U_ZERO_ERROR; u_strToUTF8(utf8Subgroup, utf8SubgroupCapacity, &utf8SubgroupLength, subgroup, subgroupLength, &errorCode); if (errorCode) { NSLog(@"u_strToUTF8() failed: %li: %s", (long)errorCode, u_errorName(errorCode)); free(subgroup); free(utf8Subgroup); break; } // print subgroup UTF8 string NSLog(@"submatch %lu: <%.*s>\n", (unsigned long)i, utf8SubgroupLength, utf8Subgroup); free(subgroup); free(utf8Subgroup); } } else { NSLog(@"Pattern <%s> did not match string <%s>", pattern, utf8Text); } free(unicodeText); uregex_close(regularExpression);
I wouldn't be surprised if there's a memory leak in there somewhere. I'm also no expert on this library, and I may be doing some things the hard way. If you're using C++, I'm sure a lot of this boilerplate code goes away. If you don't need or want to mix C++ into your iOS app, there are some Objective-C alternatives that are much more satisfying.
In iOS 3.2, Apple added a new NSStringCompareOptions
value
for use in the various -rangeOfString:options:
methods of
NSString
: NSRegularExpressionSearch
.
This option uses the regular expression support of the ICU library to
do simple regular expression matching on an NSString
object. Unfortunately, Apple only implemented a very minimal regular
expression interface on NSString
. You can search, but not
replace, within an NSString
and powerful regular
expression features like subgroup matching are not exposed.
NSString *string = @"foofy"; NSString *pattern = @"foo(bar|fy)"; NSRange match = [string rangeOfString:pattern options:NSRegularExpressionSearch]; if (match.location == NSNotFound) { NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string); } else { NSLog(@"Pattern <%@> matches string <%@> starting at location %lu", pattern, string, (unsigned long)match.location); }
In iOS 4.0, Cocoa Touch gained the
NSRegularExpression
class, a
full-featured regular expression processor built on top of the ICU
regular expression library.
NSError *error; NSString *pattern = @"foo(bar|fy)"; NSRegularExpression *regularExpression = [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error]; if ( ! regularExpression) { NSLog(@"Error in pattern <%@>: %@", pattern, error); return; } NSString *string = @"foofy"; NSRange range = NSMakeRange(0, [string length]); NSArray *matches = [regularExpression matchesInString:string options:0 range:range]; if ([matches count]) { NSTextCheckingResult *firstMatch = [matches objectAtIndex:0]; NSLog(@"Found %lu submatches", (unsigned long)[firstMatch numberOfRanges]); for (NSUInteger i = 0; i < [firstMatch numberOfRanges]; ++i) { NSRange range = [firstMatch rangeAtIndex:i]; NSString *submatch = [string substringWithRange:range]; NSLog(@"submatch %lu: <%@>", (unsigned long)i, submatch); } } else { NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string); }
This is much easier to use than the underlying ICU C library and a good choice for iOS apps that target 4.0 and later.
The
RegexKitLite
library provides an alternate Objective-C wrapper around the low level
ICU regular expression library that supports both Mac OS X and iOS,
including iOS 3.0. RegexKitLite provides a full featured set of regular
expression methods in a category that extends NSString
and
can do matching, searching, replacing and supports subgroups.
RegexKitLite is open source and available under a BSD license.
NSString *string = @"foofy"; NSString *pattern = @"foo(bar|fy)"; NSArray *submatches = [string captureComponentsMatchedByRegex:pattern]; if ([submatches count]) { NSLog(@"Found %lu submatches", (unsigned long)[submatches count]); for (NSUInteger i = 0; i < [submatches count]; ++i) { NSLog(@"submatch %lu: <%@>", (unsigned long)i, [submatches objectAtIndex:i]); } } else { NSLog(@"Pattern <%@> doesn't match string <%@>", pattern, string); }
RegexKitLite code is a little more concise than the
NSRegularExpression
class, but comparable in power, since
they're both built on top of ICU.
Unless you're writing cross-platform code in C or need to target very old versions of iOS, I recommend you avoid using the C regular expression libraries. If you must, the POSIX regex functions are the easiest to get started with, but if you're targeting non-Apple platforms, watch out for implementation differences across platforms. Both PCRE and ICU are good cross-platform choices; choose ICU if you also need robust Unicode support, PCRE if you're mainly working with eight bit encodings.
On the Objective-C side, choosing NSRegularExpression
or
RegexKitLite is largely up to personal preference.
NSRegularExpression
is a no-brainer if you're targeting
iOS 4.0 and later (but currently not supported on OS X). Adding
RegexKitLite to your project is as easy as adding two files to Xcode
and currently works on both Apple operating systems. And the
-rangeOfString:options:
method on NSString
is
handy for simple searches.
That concludes our look at strings in Objective-C. Data structures is our next topic.