Regular Expression Specification

Definition

A regular expression is a string like any other string - a sequence of characters. However, special characters within the string have certain functions which make regular expressions useful when trying to match portions of other strings. In the following discussion and examples, a string containing a regular expression will be called the "pattern", and the string against which it is to be matched will be called the "reference string".

Regular expressions allow one to search for "all strings ending with the letters ize" or "all strings beginning with a number between 1 and 3 and ending in a comma".

To accomplish this, regular expressions co-opt some characters to have special meaning. They also provide a way for these characters to lose their special meaning if the user so desires. The rules for regular expressions are:

c
Any character c matches itself unless it has been assigned a special meaning as listed below. Most special characters can be escaped (made to lose their special meaning) by placing the character '\' in front of them. Thus, although '*' normally has a special meaning, the pattern '\*' matches a literal '*'.

Example: The pattern

hello
matches
ohhello or hellothere or ohhellothere
but not
Hello or ohhell or ohhelothere
That is, it will match any reference string that contains "hello" anywhere within it.

Example: Normally the characters '*' and '$' are special, but the pattern

a\*bse\$
is matched literally, just as above. That is, any reference string containing "a*bse$" as a substring will be flagged as a match.
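
As a rough illustration of literal matching and escaping, the sketch below uses Python's re module. It assumes that module's syntax agrees with the rules above for these particular patterns; the helper name contains is hypothetical and not part of this specification.

```python
import re

def contains(pattern, reference):
    """Return True if the pattern matches anywhere in the reference string."""
    return re.search(pattern, reference) is not None

# A plain pattern matches any reference string that contains it.
print(contains("hello", "ohhellothere"))     # True
print(contains("hello", "Hello"))            # False: matching is case sensitive
print(contains("hello", "ohhelothere"))      # False

# Escaped special characters match themselves literally.
print(contains(r"a\*bse\$", "xxa*bse$yy"))   # True
print(contains(r"a\*bse\$", "aaabse"))       # False
```

The raw-string prefix (r"...") keeps Python itself from interpreting the backslashes before the pattern reaches the regular expression engine.
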
.
A period matches any character except the newline character. This is known as the wildcard character.

Example: The pattern

....
will match any 4 consecutive characters in the reference string, provided none of them is a newline character.
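
The wildcard can be tried out with a short sketch, again assuming Python's re module as a stand-in for an engine that follows this specification (by default its period also excludes the newline).

```python
import re

# Four periods match the first run of four consecutive non-newline characters.
match = re.search(r"....", "ab\ncdefg")
print(match.group())                     # cdef

# Here the newline interrupts every candidate run of four characters.
print(re.search(r"....", "abc\ndef"))    # None
```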

[string]
One or more characters within square brackets. This pattern matches any single character within the brackets. The caret, '^', has a special meaning if it is the first character in the series: the pattern will match any character other than those in the list.

Example: The pattern

[Hh]ello
will match "Hello" or "hello".

Example: The pattern

[^abc]
will match any character except 'a', 'b' or 'c'.

To match a right bracket, ']', in the list it must be put first:

[]ab01]
To match a caret, '^', in the list it can appear anywhere but first. In
[ab^01]
the caret loses its special meaning.

The '-' character is special within square brackets. It denotes a range of characters (in the ASCII character set) and will match any single character within that range. For example, '[a-z]' matches any lower case letter. The '-' can be made non-special by placing it first or last within the square brackets.

The characters '$', '*' and '.' are not special within square brackets.

Example: The pattern

[a0-9b.$]
matches one of 'a', 'b', '.', '$' or a digit between 0 and 9 inclusive.

Example: The pattern

[^a0-9b.$]
matches any single character that is not 'a', 'b', '.', '$' or a digit between 0 and 9 inclusive.
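
The bracket rules above can be checked with a small sketch. It assumes Python's re module, which is believed to treat these particular bracket expressions the same way; the helper matches_all is a hypothetical name introduced only for the illustration.

```python
import re

def matches_all(pattern, text):
    """Return every non-overlapping match of the pattern in the text."""
    return re.findall(pattern, text)

print(matches_all(r"[Hh]ello", "Hello, hello, jello"))   # ['Hello', 'hello']
print(matches_all(r"[^abc]", "abcd"))                    # ['d']

# A right bracket placed first, and a caret placed anywhere but first,
# are ordinary members of the list.
print(matches_all(r"[]ab01]", "x]a9"))                   # [']', 'a']
print(matches_all(r"[ab^01]", "^c1"))                    # ['^', '1']

# '.' and '$' are not special inside brackets; '0-9' is a range.
print(matches_all(r"[a0-9b.$]", "a7.$z"))                # ['a', '7', '.', '$']
print(matches_all(r"[^a0-9b.$]", "a7.$z"))               # ['z']
```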

(pattern1)|(pattern2)
Alternation and grouping are achieved using the '|' character to signify alternatives and parentheses to signify grouping.

Example: The pattern

a|b|c
is equivalent to the pattern
[abc]
that is, it matches 'a' or 'b' or 'c'.

Example: The pattern

(hello)|(hi)
matches any string containing "hello" or "hi".

Example: The pattern

([Hh]ello)|([Hh]i)
matches any string containing "Hello" or "hello" or "Hi" or "hi".
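
A brief sketch of alternation and grouping follows, assuming Python's re module behaves like the engine described here for these patterns. The variable names are illustrative only.

```python
import re

greeting = re.compile(r"([Hh]ello)|([Hh]i)")

print(greeting.search("Hello world").group())   # Hello
print(greeting.search("oh hi").group())         # hi
print(greeting.search("HELLO"))                 # None
print(greeting.search("goodbye"))               # None

# Alternation between single characters finds the same matches as a list.
sample = "a1b2c3"
print(re.findall(r"a|b|c", sample) == re.findall(r"[abc]", sample))   # True
```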

*
An asterisk following a regular expression in the pattern has the effect of matching zero or more occurrences of that expression.

Example: The pattern

a*
means zero or more occurrences of the character 'a'.

Example: The pattern

[A-Z]*
means zero or more occurrences of any upper case letter.

+
A '+' following a regular expression in the pattern has the effect of matching one or more occurrences of that expression.

Example: The pattern

a+
means one or more occurrences of the character 'a'.

Example: The pattern

[A-Z]+
means one or more occurrences of any upper case letter.
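
The difference between '*' and '+' shows up most clearly on the empty string. The sketch below assumes Python's re module and uses its fullmatch function, which requires the whole reference string to match; that whole-string requirement is a choice made for the illustration, not part of the rules above.

```python
import re

# '*' accepts zero occurrences, so even the empty string satisfies it.
print(re.fullmatch(r"a*", "") is not None)           # True
print(re.fullmatch(r"a*", "aaa") is not None)        # True

# '+' requires at least one occurrence.
print(re.fullmatch(r"a+", "") is not None)           # False
print(re.fullmatch(r"[A-Z]+", "ABC") is not None)    # True
print(re.fullmatch(r"[A-Z]+", "ABc") is not None)    # False
```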

?
A '?' following a regular expression in the pattern has the effect of matching zero or one occurrence of that expression.

Example: The pattern

a?
means zero or one occurrences of the character 'a'.

Example: The pattern

[+-]?[0-9]+
means zero or one occurrences of a sign followed by one or more digits. In other words this pattern would match an integer with an optional sign.
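
As an illustration, the signed-integer pattern can be used to pull numbers out of a line of text. This is a sketch assuming Python's re module; the name signed_integer is hypothetical.

```python
import re

signed_integer = re.compile(r"[+-]?[0-9]+")

# The sign is optional, so unsigned numbers are found as well.
print(signed_integer.findall("x=-12, y=+7, z=300"))   # ['-12', '+7', '300']
```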

{m,n}
A regular expression followed by "{m,n}" must be repeated at least m times and at most n times for a valid match. Both m and n must be integers between 0 and 255 (inclusive) with m <= n. The notation "{m,}" means that the expression must be repeated at least m times, and the notation "{m}" means that it must be repeated exactly m times.

Example: The pattern

ab{3}
would match any substring consisting of an 'a' followed by exactly 3 'b's.

Example: The pattern

ab{3,}
would match any substring in the reference string of an 'a' followed by at least 3 'b's.

Example: The pattern

ab{3,5}
would match any substring in the reference string of an 'a' followed by at least 3 but at most 5 'b's.

Example: The pattern

[+-]{0,1}
is equivalent to the earlier pattern
[+-]?

Example: The pattern

[0-9]{1,}
is equivalent to the earlier pattern
[0-9]+
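
The three forms of the repetition count can be compared side by side. The sketch assumes Python's re module, whose counted repetition is believed to follow the same notation for these patterns.

```python
import re

reference = "abb abbb abbbbbb"   # runs of 2, 3 and 6 'b's

# Exactly 3: the match inside the longest run stops after three 'b's.
print(re.findall(r"ab{3}", reference))     # ['abbb', 'abbb']

# At least 3: the whole run of six 'b's is consumed.
print(re.findall(r"ab{3,}", reference))    # ['abbb', 'abbbbbb']

# Between 3 and 5: at most five 'b's are consumed.
print(re.findall(r"ab{3,5}", reference))   # ['abbb', 'abbbbb']

# '{1,}' behaves like '+'.
digits = "7 42 1999"
print(re.findall(r"[0-9]{1,}", digits) == re.findall(r"[0-9]+", digits))   # True
```

Note that "exactly 3" describes the matched substring: a longer run of 'b's still contains a match.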

^
A caret, '^', at the beginning of the pattern is said to "anchor" the match to the beginning of the line. That is, the reference string must start with the pattern following the '^'. If the caret appears anywhere other than at the beginning of the pattern (and not first inside square brackets), then it is no longer considered special, and matches itself as any non-special character would. Similarly, if it starts the pattern but is escaped, it matches itself.

Example: The pattern

^efghi
will match
efghi or efghijlk
but not
abcefghi
That is, the pattern will match only those reference strings starting with "efghi". Just containing the substring is not sufficient.
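
A minimal sketch of the leading anchor, assuming Python's re module. One known difference: Python treats an unescaped '^' as an anchor wherever it appears outside brackets, so only the leading and the escaped cases are shown. The helper name starts_with_efghi is hypothetical.

```python
import re

def starts_with_efghi(reference):
    """Return True if the reference string begins with "efghi"."""
    return re.search(r"^efghi", reference) is not None

print(starts_with_efghi("efghi"))       # True
print(starts_with_efghi("efghijlk"))    # True
print(starts_with_efghi("abcefghi"))    # False

# An escaped caret is an ordinary character and can match anywhere.
print(re.search(r"\^efghi", "abc^efghi") is not None)   # True
```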

$
The dollar sign, '$', at the end of the pattern "anchors" the pattern to the end of the line (reference string). A '$' occurring anywhere else in the pattern is non-special and matches itself. Similarly, if it is at the end of the pattern but is escaped, it matches itself.

Example: The pattern

efghi$
will match
efghi or abcefghi
but not
efghijlk
That is, the pattern will match only those reference strings ending with "efghi". Just containing the substring is not sufficient.
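
The trailing anchor can be sketched the same way, assuming Python's re module. Two differences are worth knowing: Python's '$' also matches just before a newline that ends the string, and an unescaped '$' is an anchor wherever it appears outside brackets. The helper name ends_with_efghi is hypothetical.

```python
import re

def ends_with_efghi(reference):
    """Return True if the reference string ends with "efghi"."""
    return re.search(r"efghi$", reference) is not None

print(ends_with_efghi("efghi"))       # True
print(ends_with_efghi("abcefghi"))    # True
print(ends_with_efghi("efghijlk"))    # False

# An escaped dollar sign is an ordinary character and can match anywhere.
print(re.search(r"efghi\$", "efghi$xyz") is not None)   # True
```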

Summary

The key operations in defining a regular expression are concatenation, alternation, and repetition. It is possible to specify any regular expression using the ASCII (or Unicode) character set and the four special characters '(', ')', '|', and '*'. All the remaining notation is added only to make writing regular expressions more flexible and compact. The character class notation is a very compact way to describe alternation at the level of individual characters. For example, the pattern

[0-9]
is equivalent to
(0|1|2|3|4|5|6|7|8|9)
Similarly, the standard pattern for an identifier is
[a-zA-Z][a-zA-Z0-9]*
While this could certainly be expressed using only the four special characters, it would require a string of 231 characters - a little unwieldy.
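
The expansion is easy to reproduce mechanically. The sketch below, written in Python with a hypothetical helper named as_alternation, rebuilds the identifier pattern from the four special characters alone and counts its length.

```python
import re
import string

def as_alternation(chars):
    """Rewrite a list of characters as an explicit parenthesised alternation."""
    return "(" + "|".join(chars) + ")"

letters = string.ascii_lowercase + string.ascii_uppercase
identifier = as_alternation(letters) + as_alternation(letters + string.digits) + "*"

print(as_alternation(string.digits))   # (0|1|2|3|4|5|6|7|8|9)
print(len(identifier))                 # 231

# The expanded pattern accepts and rejects the same strings as the compact one.
print(re.fullmatch(identifier, "x25") is not None)      # True
print(re.fullmatch(identifier, "9lives") is not None)   # False
```

The count breaks down as 105 characters for the first group (52 letters, 51 bars, 2 parentheses) and 126 for the second (62 characters, 61 bars, 2 parentheses, and the asterisk).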

In addition to the special characters '.', '^', and '$', there is a raft of escape sequences denoting characters that are not easily typed or character classes that occur frequently. In these sequences an otherwise ordinary character takes on a special meaning when it is escaped, that is, when it is preceded by the escape character, '\'. A few of the more common of these sequences are given in the following table.

Character   Meaning
\t          Tab Character
\n          Newline
\r          Carriage Return
\f          Form Feed
\d          Digit [0-9]
\s          Whitespace
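
A few of these escapes in use, as a sketch assuming Python's re module. The re.ASCII flag is passed because, without it, Python's \d and \s also accept non-ASCII digits and whitespace, which goes beyond the table above.

```python
import re

# \d picks out the individual digits.
print(re.findall(r"\d", "a1b22", flags=re.ASCII))   # ['1', '2', '2']

# \s+ splits on any run of spaces, tabs or newlines.
print(re.split(r"\s+", "one\ttwo\nthree  four", flags=re.ASCII))
# ['one', 'two', 'three', 'four']

# \t in the pattern matches a tab character in the reference string.
print(re.search(r"a\tb", "a\tb") is not None)       # True
```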