Power Regexps, Part II
In the previous article, we looked at some of the more intermediate features of regular expressions, including multiline matching, quoting, and interpolation. This time, we’re going to look at more-advanced features. We’ll also look at some modules that can help us handle regular expressions.
Look Forward, Look Back
Perhaps the most misunderstood facility of regular expressions are the lookahead and lookbehind operators; let’s begin with the simplest, the positive lookahead operator.
This operator, spelled (?= )
, attempts to match a pattern, and if successful, promptly forgets all about it. As its name implies, it peeks forward into the string to see whether the next part of the string matches the pattern. For instance:
$a="13.15 Train to London";
$a=~ /(?=.*London)([\d\.]+)/
This is perhaps an inefficient way of writing:
$a =~ /([\d\.]+).*London/;
and it can be read as “See if this string has ‘London’ in it somewhere, and if so, capture a series of digits or periods.”
Here’s an example of it in real-life code; I want to turn some file names into names of Perl modules. I’ll have a name like /Library/Perl/Mail/Miner/Recogniser/Phone.pm - this is part of my Mail::Miner
module, so I can guarantee that the name of the module will start with Mail/Miner
- and I want to get Mail::Miner::Recogniser::Phone
. Here’s the code that does it:
our @modules = map {
s/.pm$//;
s{.*(?=Mail/Miner)}{};
join "::", splitdir($_)
} @files;
We look at each of our files, and first take off the .pm
from the end. Now what we need to do is remove everything before the Mail/Miner
portion, stripping off /Library/Perl or whatever our path happens to be. Now we could write this as:
s{.*Mail/Miner}{Mail/Miner};
removing everything which appears before Mail/Miner
and then the text Mail/Miner
itself, and then replacing all that with Mail/Miner
again. This is obviously horribly long-winded, and it’s much more natural to think of this in turns of “get rid of everything but stop when you see Mail/Miner
”. In most cases, you can think of (?= )
as meaning “up to”.
Similar but subtly different is the negative counterpart (?! )
. This again peeks forward into the string, but ensures that it doesn’t match the pattern. A good way to think of this is “so long as you don’t see”. Damian Conway’s Text::Autoformat
contains some code for detecting quoted lines of text, such as may be found in an e-mail message:
% Will all this regular expression scariness go away in
% Perl 6?
Yes, definitely; we're replacing it with a completely different set
of scariness.
Here the first two lines are quoted, and the expressions that check for this look like so:
my $quotechar = qq{[!#%=|:]};
my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};
$quotechar
contains the characters that we consider signify a quotation, and $quotechunk
has two options for what a quotation looks like. The second is most natural: a greater-than sign, possibly preceded by some initials, such as produced by the popular Supercite emacs
package:
SC> You're talking nonsense, you odious little gnome!
The left-hand side of the alternation in $quotechunk
is a little more interesting. We look for one of our quotation characters, such as %
as in the example above, but then we make sure that the next character we see is not alphabetic; this may be a quotation:
% I think that all right-thinking people...
but this almost certainly isn’t
%options = ( verbose => 1, debug => 0 );
The (?!)
acts as a “make sure you don’t see” directive.
The mistake everyone makes at least once with this is to assume you can say:
/(?!foo)bar/;
and wonder why it matches against foobar
. After all, we’ve made sure we didn’t see a foo
before the bar
, right? Well, not exactly. These are lookahead operators, and so can’t be used to find things “before” anything at all; they’re only used to determine what we can or can’t see after the current position. To understand why this is wrong, imagine what it would mean if it were a positive assertion:
/(?=foo)bar/;
This means “are the next three characters we see foo
? If so, the next three characters we see are bar
”. This is obviously never going to happen, since a string can’t contain both foo
and bar
at the same position and the same time. (Although I believe Damian has a paper on that.) So the negative version means “are the next three characters we see not foo
? Then match bar
”. foo
is not bar
, so this matches any bar
. What was probably meant was a lookbehind assertion, which we will look at imminently.
Now we’ve seen the two forward-facing assertions, we can turn (ha, ha) to the backward-facing assertions, positive and negative lookbehind. There’s one important difference between these and their forward-facing counterparts; while lookahead operators can contain more or less any kind of regular expression pattern, for reasons of implementation the lookbehind operators must have a fixed width computable at compile time. That is, you’re not allowed to use any indefinite quantifiers in your subpatterns.
The positive lookbehind assertion is (?<=)
, and the only thing you need to know about it is that it’s so rare I can’t remember the last time I saw it in real code. I don’t think I’ve ever used it, except possibly in error. If you think you want to use one of these, then you almost certainly need to rethink your strategy. Here’s a quick example, though, from IPC::Open3
:
$@ =~ s/(?<=value attempted) at .*//s;
The context for this is that we’ve just done the equivalent of
eval { $_[0] = ... };
and if someone maliciously passes a constant value to the subroutine, we want to through the Modification of a read-only value attempted
error back in their face. We check we’re seeing the error we expect, then strip off the at .../IPC/Open3.pm, line 154
part of the message so that it can be fed to croak
. The less Tom-Christianseny way to do this would be something like:
croak "You fed me bogus parameters" if $@ =~ /attempted/;
The negative lookbehind assertion, on the other hand, is considerably more common; this is the answer to our “bar
not preceded by foo
” problem of the previous section.
/(?!<foo)bar/;
This will match bar
, peeking backward into the string to make sure it doesn’t see foo
first. To take another example, suppose we’re preparing some text for sending over the network, and we want to make sure that all the line feeds (\n
) have carriage returns (\r
) before them. Here’s the truly lazy way to do it:
# Make sure there's an \r in there somewhere
s{\n} {\r\n}g;
# And then strip out duplicates
s{\r\r}{\r} g;
This is fine (if somewhat inefficient) unless it's OK for two carriage
returns to appear without a line feed in the way. Here's the finesse:
s/(?<!\r)\n/\r\n/g;
If you see a line feed that is not preceded by a carriage return, then stick a carriage return in there – much cleaner, and much more efficient.
split
, //g
and other shenanigans
In the previous article, we had a nice piece of multiline, formatted data, such as one might expect to parse with Perl:
Name: Mark-Jason Dominus
Occupation: Perl trainer
Favourite thing: Octopodes
Name: Simon Cozens
Occupation: Hacker
Favourite thing: Sleep
Now, there’s a boring way to parse this. If you’re coming from a C or Java background, then you might try:
my $record = {}
my @records;
for (split /\n/, $text {
chomp;
if (/([^:]+): (.*)/) {
$record->{$1} = $2;
} elsif ($_ =~ /^\s*$/) {
# Blank line => end of current record
push @records, $record;
$record = {};
} else {
die "Wasn't expecting to see '$_' here";
}
}
And, of course, this will work. But there’s several more Perl-ish solutions that this. When you know the fields provided by your data, it’s rather nice to have a regular expression that reflects the data structure:
while ($data =~ /Name:\s(.*)\n
Occupation:\s(.*)\n
Favourite.*:\s(.*)/gx) {
push @records, { name => $1, occupation => $2, favourite => $3 }
}
Here we use the /g
modifier, which allows us to resume the match from where it last left off.
If we don’t know the fields while we’re writing our program, then we’ll have to break the process up into two stages. First, we extract individual records: records are delimited by a blank line:
my @texts = split /\n\s*\n/, $text;
And then for each record, we can either use the /g
trick again, or simply split each record into lines. I prefer the latter, for reasons you’ll see in a second:
for (@texts) {
my $record = {};
for (split /\n/, $_) {
/([^:]+): (.*)/;
$record->{$1} = $2;
}
push @records, $record;
}
This is not dissimilar from the initial solution, but it allows us to make some interesting improvements. For starters, when you see code that transforms data with a for
loop, you should wonder whether it could be better written with a map
statement. This goes double if you’re using push
inside the for
loop as we are here. So this version is a natural evolution:
@records = map {
my $record = {};
for (split /\n/, $_) {
/([^:]+): (.*)/;
$record->{$1} = $2;
}
$record;
} split /\n\s*\n/, $text;
And we can actually do away with the inner for
loop too:
@records = map {
{
map { /([^:]+): (.*)/ and ($1 => $2) } split /\n/, $_
}
} split /\n\s*\n/, $text;
But if we’re prepared to be a little lax about trailing whitespace, there’s actually an even nicer way to do it, using the one thing that everyone forgets about split
: if your split
pattern contains parentheses, then the captured text is inserted into the list returned by split
. That is, the following code:
split( /(\W+)/, "perl-5.8.0.tar.gz")
will produce the list
("perl", "-", "5", ".", "8", ".", "0", ".", "tar", ".", "gz")
So we can actually use the field name, colon and space at the start of each line as the split
expression itself:
split /^([^:]+):\s*/m
There is a slight problem with this idea - because the first thing in each record is delimeter we’re looking for, the first thing returned by split
will be an empty string. But we can easily get around this by adding another undef
to provide a fake undef => ''
hash element. This allows us to reduce the parser code to:
@records = map {
{ undef, split /^([^:]+):\s*/m, $_ }
} split /\n\s*\n/, $text;
It may not be pretty, but it’s quick and it works.
Of course, you may also use lookahead and lookbehind assertions with split
; I sometimes use the following code to break a string into tokens:
split /(?<=\W)|(?=\W)/, $string;
This is almost the same as
split /(\W)/, $string
but with a subtle difference. Again, as Perl wants to see a nonword character as a delimiter, it will return an empty string between two adjacent nonwords:
split /(\W)/, '$foo := $bar';
# '', '$', 'foo', ' ', '', ':', '', '=', '', ' ', '', '$', 'bar'
Splitting on a word boundary goes too much the other way:
split /\b/, '$foo := $bar';
# '$', 'foo', ' := $', 'bar'
And so it turns out that we want to cleave the string where we’ve just seen a nonword character, or if we’re about to see one:
split /(?<=\W)|(?=\W)/, $string;
# '$', 'foo', ' ', ':', '=', ' ', '$', 'bar'
And this gives us the sort of tokenisation we want.
Regexp Modules
Now, though, we are getting into the sort of regular expressions that are not written lightly, and we may need some help constructing and debugging these expressions. Thankfully, there are plenty of modules which make regexp handling much easier for us.
re
The re
module is as invaluable as it is obscure. It’s one of those hidden treasures of the Perl core that Casey was talking about last month. As well as turning on two features of the regular expression engine, tainting subexpressions and evaluated assertions, it provides a debugging facility that allows you to watch your expression being compiled and executed.
Here’s a relative simple expression:
$a =~ /([^:]+):\s*(.*)/;
When this code is run under -Mre=debug
, then the following will be printed when the regexp is compiled:
Compiling REx `([^:]+):\s*(.*)'
size 25 first at 4
1: OPEN1(3)
3: PLUS(13)
4: ANYOF[\0-9;-\377](0)
13: CLOSE1(15)
15: EXACT <:>(17)
17: STAR(19)
18: SPACE(0)
19: OPEN2(21)
21: STAR(23)
22: REG_ANY(0)
23: CLOSE2(25)
25: END(0)
This tells us the instructions for the little machine that the regular expression compiler creates: it should first open a bracket, then go into a loop (PLUS
) finding characters that are ANYOF
character zero through to 9
and ;
through to character 255 - that is, everything apart from a :
. Then we close the bracket, look for a specific character, and so on. The numbers in brackets after each instruction are the line number to jump to on completion; then the PLUS
loop exits, it should go on to line 13, CLOSE1
and so on.
Next when we try to run this match against some text:
$a = "Name: Mark-Jason Dominus";
It will first tell us something about the optimizations it performs:
Guessing start of match, REx `([^:]+):\s*(.*)' against `Name: ...'
Found floating substr `:' at offset 4...
Does not contradict STCLASS...
Guessed: match at offset 0
What this means is that it has found the constant element :
in the regular expression, and tries to locate that in the string, and then work backward to find out where it should start the match. Since the :
is at position four in our string, it will go on to deduce that the match should start at the beginning and…
Matching REx `([^:]+):\s*(.*)' against `Name: Mark-Jason Dominus'
Setting an EVAL scope, savestack=3
0 <> <Name: Mark-J> | 1: OPEN1
0 <> <Name: Mark-J> | 3: PLUS
ANYOF[\0-9;-\377] can match 4 times out of 32767...
The [^:]
can match four times, since it knows there are four things that are not colons there.
The re
module is absolutely essential for heavy-duty study of how the regular expression engine works, and why it doesn’t do what you think it should.
YAPE::Regex::Explain
The description given by re
is a little low-level for some people; well, most people. YAPE::Regex::Explain
aims to put the explanation at a much higher level; for instance,
% perl -MYAPE::Regex::Explain -e 'print
YAPE::Regex::Explain->new(qr/(?<=\W)|(?=\W)/)->explain'
will produce quite a verbose explanation of the regular expression like so:
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
(?<= look behind to see if there is:
----------------------------------------------------------------------
\W non-word characters (all but a-z, A-Z,
0-9, _)
----------------------------------------------------------------------
...
GraphViz::Regex
I find that one of the best ways to debug and understand a complex procedure is to draw a picture. GraphViz::Regex
uses the graphviz
visualization library to draw a state machine diagram for a given regular expression:
use GraphViz::Regex;
my $regex = '(([abcd0-9])|(foo))';
my $graph = GraphViz::Regex->new($regex);
print $graph->as_png;
Regexp::Common
So much for explaining complicated regular expressions; what about generating them? The Regexp::Common
module aims to be a repository for all kinds of commonly needed regular expressions, such as URIs, balanced texts, domain names and IP addresses. The interface is a little freaky, but it can hugely help to clarify complex regexps:
my $ts = qr/\d+:\d+:\d+\.\d+/;
$tcpdump =~ /$ts ($RE{net}{IPv4}) > ($RE{net}{IPv4}) : (tcp|udp) (\d+)/;
Text::Balanced
Finally, one particularly common family of things to match for are quoted, parenthesised or tagged text. Damian’s Text::Balanced
module helps produce both regular expressions and subroutines to match and extract balanced text sequences. For instance, we can create a regular expression for matching double-quoted strings like so:
use Text::Balanced qw(gen_delimited_pat);
$pat = gen_delimited_pat(q{"})
# (?:\"(?:[^\\\"]*(?:\\.[^\\\"]*)*)\")
This pattern will match quoted text, but will also be aware of escape sequences like \"
and \\
, and hence not break off in the middle of
"\"So\", he said, \"How about lunch?\""
Text::Balanced
also contains routines for extracting tagged text, finding balanced pairs of parentheses, and much more.
Summary
We’ve looked at some slightly more-complex features of regular expressions, and shown how we can use these to slice and dice text with Perl. As these regexes get more complicated, the need for tools to help us debug them increases; and so we’ve looked also at re
, YAPE
and GraphViz::Regex
.
Finally, the Regexp::Common
and Text::Balanced
modules help us create complex regular expressions of our own.
Tags
Feedback
Something wrong with this article? Help us out by opening an issue or pull request on GitHub