When I first learnt about regular expressions I thought “cool, I can match characters or patterns in text, I’ll just google the regex pattern when I need it”. When that time came I discovered regex is almost a programming language in itself, with many different concepts: anchors, alternation, groups, backreferences, lookahead, lookbehind, and the list goes on…
As there are so many regex concepts and engines, I decided to write this post and briefly cover most of them. I am writing this with Ruby 2.0+ in mind, which uses the Onigmo engine.
Regular expression engines
There are many different regular expression implementations. This means some features might not work, or might behave slightly differently, depending on which engine you’re using. For example, Ruby 2.0 and onwards uses the Onigmo engine, which is a fork of the Oniguruma engine. The Atom and Sublime Text editors also use Oniguruma, so the experience across Ruby and these editors will be very similar, although the Onigmo fork focuses on supporting new features. JavaScript’s regex engine, on the other hand, supports fewer features: it has no atomic groups and no possessive quantifiers, and lookbehind was only added in ECMAScript 2018. JavaScript’s regular expression engine is part of the ECMA-262 standard, though, so at least the different implementations of JavaScript (V8, Chakra, SpiderMonkey) should all handle regular expressions the same way.
Defining a regular expression
A regular expression is commonly written between forward slashes, such as /regex/. In Ruby you can also use %r{regex} or the Regexp::new constructor.
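As a small sketch of the three forms side by side (the helper name fraction_patterns and the sample text are made up for illustration), each of these builds a pattern that matches the same thing:

```ruby
# Returns the same "digits/digits" pattern written three different ways.
def fraction_patterns
  [
    /\d+\/\d+/,            # literal form: the slash inside must be escaped
    %r{\d+/\d+},           # %r{...} form: no need to escape the slash
    Regexp.new('\d+/\d+')  # constructor form: the pattern is passed as a String
  ]
end

fraction_patterns.each do |pattern|
  puts "ratio 3/4"[pattern] # prints 3/4 for each form
end
```

The %r{} form tends to be handy for patterns that contain a lot of forward slashes, such as paths or URLs.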
Character types
Literal characters simply match themselves: a will match a, and 9 will match 9.
Special characters, also known as metacharacters, have special meanings. The special characters are \ ^ $ . | ? * + ( ) [ {
To use a special character as a literal character it needs to be escaped by preceding it with \
for example \?
will match ?
Non-printable characters such as tab \t, carriage return \r, line feed \n and the Windows line ending \r\n can also be matched.
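Here is a minimal sketch of escaping in Ruby. The helper names are hypothetical; Regexp.escape is the standard library method that escapes every metacharacter in a string for you.

```ruby
# An unescaped ? is a quantifier; escaped, it matches a literal question mark.
def ends_with_question?(text)
  !(text =~ /\?\z/).nil?
end

# Builds a pattern that matches the given text literally.
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

puts ends_with_question?("ready?")           # true
puts "total: 1+1=2"[literal_pattern("1+1")]  # 1+1
p "one\r\ntwo".split(/\r\n/)                 # ["one", "two"]
```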
How a regex engine works
Understanding how the regex engine works internally is important and will help with reading and constructing regular expressions.
The regex engine will start with the first regex token and start walking through the string looking for a match. If there is a match the regex engine moves to the next token and continues looking for a match. If a match fails the engine will backtrack and then try to find another way to make a match. As soon as the regex engine matches all tokens it will return that match even if there is a better match later in the string.
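To see the "first match wins" behaviour in Ruby, here is a small sketch with made-up helper names. The engine returns the leftmost match, even though a longer number appears later in the string.

```ruby
# Returns the first run of digits the engine finds, scanning left to right.
def first_number(text)
  text[/\d+/]
end

# Returns every run of digits, by restarting the search after each match.
def all_numbers(text)
  text.scan(/\d+/)
end

puts first_number("room 7, floor 1024") # 7
p all_numbers("room 7, floor 1024")     # ["7", "1024"]
```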
Character classes
Character classes are defined using []
and match one from the selection of characters within.
[a]
matches a
[ab]
does not match ab
it matches a
or b
Negated character class
A negated character class is defined using ^
immediately after the opening [
and this matches any character that is not in the character class.
[^a]
matches any character that is not a
Special Characters in character classes
When using special characters within a character class most are taken as literal characters except for ]
, \
, ^
and -
.
[+]
matches +
[*]
matches *
[\^]
backslash escapes the caret and will match ^
[\]]
backslash escapes the closing square bracket and will match ]
[a-f]
the hyphen is used for ranges, matches one lowercase letter from abcdef
[1-4]
number range matches one digit from 1234
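The character class examples so far can be tried out with String#scan, which returns every match. This is only a sketch; the helper name and sample strings are invented.

```ruby
# Returns every single character in text that the character class matches.
def chars_matching(text, pattern)
  text.scan(pattern)
end

p chars_matching("cab 2049", /[a-f]/)         # ["c", "a", "b"]
p chars_matching("cab 2049", /[1-4]/)         # ["2", "4"]
p chars_matching("cab 2049", /[^a-f ]/)       # ["2", "0", "4", "9"]
p chars_matching("a+b*c^d]e-f", /[+*\^\]\-]/) # ["+", "*", "^", "]", "-"]
```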
Shorthand character classes
There are a number of shorthand characters and these can be used inside and outside of character classes.
[\d]
is shorthand for any digit and matches [0-9]
[\w]
is a word character and matches the ASCII characters [A-Za-z0-9_]
note that \w includes digits and the underscore as well as letters
[\s]
is a whitespace character and matches [ \t\r\n\f]
Negated shorthand character classes
There are shorthand ways to do negation as well instead of using the ^
.
[\D]
is not a digit and is shorthand for [^\d]
[\W]
is not a word character and is shorthand for [^\w]
[\S]
is not a whitespace character and is shorthand for [^\s]
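A quick sketch of the shorthand classes and their negated versions in Ruby (the helper names and sample text are made up):

```ruby
# Runs of digits.
def numbers_in(text)
  text.scan(/\d+/)
end

# Runs of word characters (letters, digits, underscore).
def words_in(text)
  text.scan(/\w+/)
end

# Runs of anything that is not whitespace.
def chunks_in(text)
  text.scan(/\S+/)
end

sample = "Order #42: 3 x tea_cup"
p numbers_in(sample) # ["42", "3"]
p words_in(sample)   # ["Order", "42", "3", "x", "tea_cup"]
p chunks_in(sample)  # ["Order", "#42:", "3", "x", "tea_cup"]
```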
Extra shorthand character classes
There are more character classes but they can vary depending on the regex engine being used. For example the Onigmo engine used by Ruby has:
\h
is a hexadecimal digit character, which matches [0-9a-fA-F]
\H
is a non-hexadecimal digit character
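For instance, \h makes it easy to pick out a hex colour code. This is just a sketch; the helper name and the sample strings are invented.

```ruby
# Finds the first six-digit hex colour code, such as #1a2B3c.
def hex_colour(text)
  text[/#\h{6}/]
end

puts hex_colour("color: #1a2B3c;") # #1a2B3c
p "ff00zz".scan(/\H/)              # ["z", "z"]
```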
Character class subtraction
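Character class subtraction is used to subtract one character class from another. It is defined as [...-[...]] in the engines that support it, such as .NET. Onigmo does not support this syntax, so in Ruby the same effect is achieved with an intersection and a negated class, for example [a-z&&[^aeiou]]. The examples in this section use the subtraction syntax.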
[abcd-[d]]
the d
is subtracted from abcd
making it match a
, b
or c
[a-z-[aeiou]]
matches any lowercase consonant
Character class subtraction can also be nested which will subtract from the subtraction.
[abcde-[cde-[c]]]
will subtract c
from cde
which leaves de
to be subtracted matching a
, b
or c
Negation can also be used with character class subtraction, and it takes precedence over the subtraction.
[^abcd-[de]]
the negation is applied first, giving any character that is not a, b, c or d
subtracting de from that
leaves any character that is not a, b, c, d or e
The character class subtraction always needs to be the last token within the character class.
Character class intersection
Character class intersection matches a character that is present across multiple character sets. It is defined as [...&&[...]]
or in Ruby the inside square brackets can be left out if not using negation like this [...&&...]
[abc&&cde]
this only has c
present in both sides so it will match c
[abcd&&abc&&ab]
this only has ab
present across all intersects so it will match a
or b
When using negation with character class intersection in the Onigmo engine, the intersection takes precedence over the negation.
[^abc&&cde]
the intersection of abc
and cde
contains only c
so this is the same as [^c]
matching any character that is not c
But if you put the negation on the right hand side you need to use []
and this will set the precedence to the negation then the intersect.
[abc&&[^cde]]
intersects abc with any character that is not c, d or e
leaving only ab
in both sets, so this will match a
or b
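Here is a short sketch of intersection in Ruby, including a "lowercase consonant" class written as a-z intersected with "not a vowel". The helper names and sample words are made up.

```ruby
# Characters that are in both abc and cde.
def in_both(text)
  text.scan(/[abc&&cde]/)
end

# Characters that are in abc but not in cde.
def in_left_only(text)
  text.scan(/[abc&&[^cde]]/)
end

# Lowercase letters that are not vowels.
def consonants(text)
  text.scan(/[a-z&&[^aeiou]]/)
end

p in_both("abcde")      # ["c"]
p in_left_only("abcde") # ["a", "b"]
p consonants("regular") # ["r", "g", "l", "r"]
```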
Dot
The .
character matches anything except for a line break \n
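An alternative to using dot is to use a negated character class, and this is probably the better option most of the time because it states exactly which characters are allowed, so the match can’t run on further than intended.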
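As a sketch of the difference, take the job of pulling out a quoted string. The helper names and sample text are made up.

```ruby
# Uses dot: runs from the first quote to the last quote on the line.
def quoted_with_dot(text)
  text[/".*"/]
end

# Uses a negated class: stops at the first closing quote.
def quoted_with_negated_class(text)
  text[/"[^"]*"/]
end

line = 'say "hi" and "bye"'
puts quoted_with_dot(line)           # "hi" and "bye"
puts quoted_with_negated_class(line) # "hi"
puts "ab\ncd"[/.+/]                  # ab  (dot stops at the line break)
```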
Anchors
Anchors are used to match a position. ^
matches the position at the beginning of the line, including after a line break where a new line starts. $
matches the position at the end of the line and before a line break. \A
matches the beginning of the string, not after line breaks. \Z
matches the end of the string, or before newline at the end and \z
matches the end of the string.
When validating input it’s probably a good idea to remove leading and trailing whitespace first.
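The difference between the line anchors and the string anchors matters most when validating input. A minimal sketch, with hypothetical helper names:

```ruby
# ^ and $ match per line, so one valid line anywhere is enough to pass.
def has_digit_line?(input)
  !(input =~ /^\d+$/).nil?
end

# \A and \z anchor to the whole string, so every character must be a digit.
def all_digits?(input)
  !(input =~ /\A\d+\z/).nil?
end

tricky = "123\nnot a number"
puts has_digit_line?(tricky)       # true
puts all_digits?(tricky)           # false
puts all_digits?("  42\n".strip)   # true
p "123\n" =~ /\A\d+\Z/             # 0   (\Z allows a trailing newline)
p "123\n" =~ /\A\d+\z/             # nil (\z does not)
```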
Word boundaries
A word boundary allows you to match whole words. It is defined using \b
. It is an anchor so it matches a position and there are 3 positions it can match.
Before the first character, if the first character matches \w
After the last character, if the last character matches \w
Between 2 characters, if 1 matches \w
and the other doesn’t
The negated version for the word boundary, not a word boundary, is \B
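A quick sketch of \b and \B in Ruby, counting how often "cat" appears in different positions (the helper name and the sample text are made up):

```ruby
# Counts the matches of pattern in text.
def count_matches(text, pattern)
  text.scan(pattern).length
end

text = "cat concat cats category"
puts count_matches(text, /cat/)      # 4  anywhere
puts count_matches(text, /\bcat\b/)  # 1  whole word only
puts count_matches(text, /\bcat/)    # 3  at the start of a word
puts count_matches(text, /cat\b/)    # 2  at the end of a word
puts count_matches(text, /\Bcat/)    # 1  not at the start of a word
```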
Alternation
Alternation matches one regular expression out of multiple regular expressions. It is defined using |
hello|hi|hey
matches hello
or hi
or hey
To limit the scope of the alternation, use parentheses.
\b(hello|hi|hey)\b
is similar to the above but it matches only whole words
Alternation stops as soon as it finds a match, so order can matter: the alternatives are tried from left to right.
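Here is a small sketch of how the order of the alternatives changes the result (the helper name and sample text are made up):

```ruby
# Returns the first match of pattern in text, or nil.
def first_match(text, pattern)
  text[pattern]
end

puts first_match("hey there", /he|hey/)         # he   (first alternative wins)
puts first_match("hey there", /hey|he/)         # hey
puts first_match("hey there", /\b(he|hey)\b/)   # hey  (he fails at \b, so hey is tried)
```

In the last pattern the word boundary forces the engine to backtrack and try the second alternative, which is why the order no longer matters there.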
Quantifiers
Quantifiers set the quantity for the previous regex token.
*
matches zero or more times
+
matches one or more times
?
matches one or none
{3}
matches exactly three times
{3,}
matches three or more times
{3,5}
matches three to five times
Quantifiers are greedy
meaning they try to match the token as many times as possible and they give up matches gradually as the engine backtracks.
Quantifiers can be made lazy
, also known as reluctant
, meaning they try to match the token as few times as possible and they gradually expand matches as the engine backtracks. To make a quantifier lazy add a ?
to it.
A*
matches AAA
in AAA
A*?
returns an empty
match in AAA
\d+
matches 12345
in 12345
\d+?
matches 1
in 12345
\w{2,4}
matches abcd
in abcd
\w{2,4}?
matches ab
in abcd
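The greedy and lazy examples above can be checked in Ruby like this. It is only a sketch; the helper name and the HTML-like sample are made up.

```ruby
# Returns the first match of pattern in text, or nil.
def first_match(text, pattern)
  text[pattern]
end

p first_match("AAA", /A*/)            # "AAA"
p first_match("AAA", /A*?/)           # ""
p first_match("12345", /\d+?/)        # "1"
p first_match("abcd", /\w{2,4}?/)     # "ab"
p first_match("<b>bold</b>", /<.+>/)  # "<b>bold</b>"
p first_match("<b>bold</b>", /<.+?>/) # "<b>"
```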
Quantifiers can also be made possessive. They are greedy, as they try to match the token as many times as possible, but they don’t give up matches when it’s time for the engine to backtrack. To make a quantifier possessive add a + to it; in Ruby this works for *+, ++ and ?+. Possessive quantifiers are useful for speeding up regex performance but they can change what is matched.
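A minimal sketch of how a possessive quantifier can change what is matched. The syntax assumed here is Onigmo's, where a + after *, + or ? makes the quantifier possessive; the helper name is made up.

```ruby
# Returns the first match of pattern in text, or nil.
def first_match(text, pattern)
  text[pattern]
end

p first_match("aaa", /a*a/)           # "aaa"  greedy a* gives one a back
p first_match("aaa", /a*+a/)          # nil    possessive a*+ keeps them all
p first_match("123abc", /\d++[a-z]/)  # "123a" no backtracking was needed
```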
Grouping
Using ()
creates a group and allows you to apply a quantifier to the group or restrict alternation.
()
are also numbered capturing groups which store the match allowing it to be referenced later, called a backreference, and match something that was previously matched.
([abc])\1
can match aa
, bb
or cc
([abc])\1\1
can match aaa
, bbb
or ccc
([ab])([yz])\1\2
can match ayay
, azaz
, byby
or bzbz
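Backreferences can also be relative using \k<-1>
(a)(b)(c)\k<-1>
matches abcc, as \k<-1> refers back to the closest preceding group (c)
(a)(b)(c)\k<-2>
matches abcb, as \k<-2> refers back to the group before that (b)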
You can disable the numbered capturing by putting a ?:
immediately after the opening (
(?:[abc])
matches a
, b
or c
(there is no backreference available)
When the engine backtracks into a capturing group and finds a new match, it overwrites the previously stored match.
Instead of numbered groups you can use named groups using ?<name>
to define the name and \k<name>
for the backreference.
(?<this>[abc])\k<this>
can match ‘aa’, ‘bb’ or ‘cc’
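Here is a sketch of capturing groups and backreferences in Ruby. The helper names and sample strings are made up; MatchData#[] accepts a group number or a group name.

```ruby
# Finds the first lowercase letter that appears twice in a row.
def first_doubled_letter(text)
  match = /([a-z])\1/.match(text)
  match && match[1]
end

# Pulls the year out of a date using a named group.
def year_of(text)
  match = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/.match(text)
  match && match[:year]
end

p first_doubled_letter("hello wood")      # "l"
p year_of("released 2013-02-24")          # "2013"
p "xbbx"[/(?<ch>[abc])\k<ch>/]            # "bb"
p "abcc"[/(a)(b)(c)\k<-1>/]               # "abcc"
p "b7".match(/(?:[abc])(\d)/)[1]          # "7" (the ?: group is not numbered)
```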
There is also another sort of group called an atomic group. An atomic group won’t backtrack into the group once it has exited the group. An atomic group is defined using ?>
immediately after the opening (
a(bc|b)c
normal group that matches abcc
and will also match abc
using backtracking
a(?>bc|b)c
atomic group that only matches abcc, because once it exits the group there is no backtracking, so the second alternative in the group is never attempted
Groups and backreferences cannot be used inside character classes; doing so will cause them to be used as literal characters.
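The two patterns can be compared directly in Ruby. A small sketch, with a made-up helper name:

```ruby
# Returns the first match of pattern in text, or nil.
def first_match(text, pattern)
  text[pattern]
end

p first_match("abcc", /a(bc|b)c/)    # "abcc"
p first_match("abc",  /a(bc|b)c/)    # "abc"  found by backtracking to the b alternative
p first_match("abcc", /a(?>bc|b)c/)  # "abcc"
p first_match("abc",  /a(?>bc|b)c/)  # nil    the group keeps bc, so the final c fails
```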
Mode modifiers
Mode modifiers are like extra options to control how a pattern can match. These can be passed in after the regex /regex/modifier
.
/abc/i
the i
tells the regex to ignore case
so this can match regardless of case, e.g. aBc
A mode modifier can also be turned on/off within the regular expression.
/a(?i:b)c/
can match aBc
Ruby supports the following mode modifiers:
/regex/i
ignore case
/regex/m
enable dot .
to match newline characters
/regex/x
ignore whitespace and comments in pattern
/regex/o
perform #{} interpolation only once
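A short sketch of the i and m modifiers, plus the Regexp.new equivalent using the Regexp::IGNORECASE constant (the helper name is made up):

```ruby
# Returns the first match of pattern in text, or nil.
def first_match(text, pattern)
  text[pattern]
end

p first_match("ABC", /abc/i)        # "ABC"
p first_match("aBc", /a(?i:b)c/)    # "aBc"
p first_match("ABC", /a(?i:b)c/)    # nil   only the b ignores case
p first_match("a\nb", /a.b/)        # nil   dot does not match the line break
p first_match("a\nb", /a.b/m)       # "a\nb"
p first_match("ABC", Regexp.new("abc", Regexp::IGNORECASE)) # "ABC"
```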
Free-spacing mode and comments
Using the x
mode modifier mentioned above means literal whitespace in a regex is ignored, and comments can be added using #
/[1-9][0-9]{3}[abc]/
this could be written like so:
/[1-9] # match a digit 1 to 9
[0-9]{3} # match exactly 3 digits 0 to 9
[abc] # match a, b or c
/x
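Here is the same idea as a runnable sketch. The method name code_pattern is made up; the last two lines show that a space you actually want to match has to be escaped in free-spacing mode.

```ruby
# The pattern from above, written in free-spacing mode with comments.
def code_pattern
  /
    [1-9]     # a digit 1 to 9
    [0-9]{3}  # exactly 3 digits 0 to 9
    [abc]     # a, b or c
  /x
end

p "1234a"[code_pattern]  # "1234a"
p "0234a"[code_pattern]  # nil
p "a b"[/a b/x]          # nil   the space in the pattern is ignored
p "a b"[/a\ b/x]         # "a b" an escaped space is matched literally
```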
Lookarounds
Lookarounds are zero-length assertions, meaning they look around the current position, either ahead (lookahead) or behind (lookbehind), and assert whether there is a match or not without consuming any characters. There are 4 lookarounds:
Lookahead (?=a)
means assert what immediately follows the current position is a
Lookbehind (?<=a)
means assert what immediately precedes the current position is a
Negative lookahead (?!a)
means assert what immediately follows the current position is not a
Negative lookbehind (?<!a)
means assert what immediately precedes the current position is not a
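A sketch of all four lookarounds in Ruby, with made-up helper names and sample strings. Note that the lookaround text itself never ends up in the match.

```ruby
# Numbers followed by px (the px is not part of the match).
def pixel_values(text)
  text.scan(/\d+(?=px)/)
end

# Numbers not followed by px. The extra |\d stops the engine from
# backtracking to a shorter number such as the 1 in 10px.
def non_pixel_values(text)
  text.scan(/\d+(?!px|\d)/)
end

# Numbers preceded by a dollar sign.
def prices(text)
  text.scan(/(?<=\$)\d+/)
end

# Whole numbers not preceded by a dollar sign.
def non_prices(text)
  text.scan(/(?<!\$)\b\d+/)
end

p pixel_values("10px 20em 30px")     # ["10", "30"]
p non_pixel_values("10px 20em 30px") # ["20"]
p prices("price: $42, qty 7")        # ["42"]
p non_prices("price: $42, qty 7")    # ["7"]
```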
Conditionals
Conditionals work similarly to an if-then-else
statement. They are defined like (?(if)then|else)
(?(A)B|c)
reads as: if A
then B
else c
In Onigmo the condition has to be a reference to a capture group, so a conditional checks whether that group has taken part in the match.
(?(1)hello|hi)
this will check if numbered group 1 has matched
(?(<mygroup>)hello|hi)
this will check if the named group mygroup has matched
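As a sketch of a conditional on a numbered group, here is a made-up phone number format: if the number starts with an opening parenthesis it needs a closing one after the area code, otherwise it needs a dash. The helper name and the format are assumptions for illustration only.

```ruby
# Accepts (555)1234 or 555-1234, but not a mix of the two styles.
def valid_phone?(text)
  pattern = /\A(\()?\d{3}(?(1)\)|-)\d{4}\z/
  !(text =~ pattern).nil?
end

puts valid_phone?("(555)1234") # true   group 1 matched, so ) is required
puts valid_phone?("555-1234")  # true   group 1 did not match, so - is required
puts valid_phone?("(555-1234") # false
puts valid_phone?("5551234")   # false
```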
More…
There’s still more to learn about regexes, like recursion
and subroutines
but I’m going to leave this post here for now. Maybe I’ll update it later. I’ll finish off with some useful links.
Useful links
Great resource with a lot of information, good for getting started with regular expressions.
Regular-expressions.info
Onigmo is the engine used by Ruby since version 2.0.
Onigmo
Onigmo documentation
Ruby’s documentation on regular expressions.
Ruby-docs.org Regular Expressions
Ruby-docs.org Regexp Class
A comparison of regular expression engines.
regular expression engines
If you want to practice your regex skills you can play
regex golf
regex crosswords