Repeated Capturing and Parsing in Perl

When I checked my email after arriving at the office today, I found a query that had been sent to our internal Perl mail list. The questioner was trying to match a pattern repeatedly, capturing all of the results in an array. But, it wasn’t doing quite what he expected. The message, with minor edits, went a little something like the following.

I’m trying to extract key/value pairs from a file with the following contents:

- name = gcc_xo_src_clk, type = rcg
+ name = cxo_clk, type = xo, fgroup = xo, wt = 10, bloo = blah
? type = hm_mnd_rcg, name = bo : type = rcg_mn
+ name = pxo_clk

I was hoping to do something like this:

@list = $_ =~ m{ ^[-+?] \s* (\S+) \s* = \s* (\S+) \s* (?:, \s* (\S+) \s* = \s* (\S+) \s*)* }xms;

Thinking @list would be assigned the alternating key/value pairs. But the above doesn’t extract anything sane. Adding the /gc modifiers doesn’t make any difference.

If I do the following, it extracts the first two key/value pairs correctly (if the line has more than one pair).

@list = $_ =~ m{
    ^[-+?] \s* (\S+) \s* = \s* (\S+) \s*
    , \s* (\S+) \s* = \s* (\S+) \s*
}xms;

If I keep repeating the pattern in the second line, it keeps matching more key/value pairs.

I would expect using (?: )* should mean zero or more instances of match inside the parentheses, but obviously it’s not working. What am I doing wrong?

When I’m presented with a problem like this, that is some kind of structured data, I immediately think of writing a parser. I’ll get back to that in a bit, but I wanted to address the confusion about capturing in the pattern. And, in fact, that’s how the discussion on the mail list proceeded.

Repeated Capturing

First, let’s simplify the example to demonstrate why our seeker of wisdom isn’t getting back the list of items he expected.

my @matches = 'a b c d e' =~ /^(a) \s* (?: ([bcde]) \s* )*/xms;

say "(@matches)";   # prints "(a e)"

Capturing parentheses in Perl are treated somewhat like registers. Most Perl programmers are familiar with the $n variables, which hold the values of a successful pattern match. For example $1 holds the value matched by the first set of parentheses, $2 holds the value of the second set, and so on.

When a pattern is matched in list context, as above, it’s effectively the same as writing,

'a b c d e' =~ /^(a) \s* (?: ([bcde]) \s* )*/xms;

my @matches = ( $1, $2 );

These pattern match variables are scalars and, as such, will only hold a single value. That value is whatever the capturing parentheses matched last. So, in our simplified example, $1 matches a, which is obvious enough. As the pattern repeats, $2 would be set to b, then c, and so on until the final match of e.

That explains why the pattern match wasn’t returning the expected list. What can be done about it?

Capturing Along the Way

If we break down the sample data, we see that it generalizes to,

prefix key = value[, ...] [: key = value[, ...]]

The first approach that came to mind is to split the data into multiple lines. Each line can then have its initial prefix removed and saved, then parsed for its key/value pairs. That’s starting to look a lot like parsing, which I promised to get to later. For the purposes of this discussion, I wanted to be able to accomplish the task with a single regular expression.

To capture all of the values we want, we need to remove the repeating set of non-capturing parentheses. However, we still need to repeat the match, ideally returning all of the captured values in one statement. We can do that with the /g and /c regular expression modifiers.

my @list = $string =~ m{ ([-+?,:]) \s* (\w+) \s* = \s* (\w+) \s* }xmsgc;

I’ve done two things here. First, I replaced the \S character classes, used to match the key and value, with \w. The + pattern in a Perl regular expression is greedy, so the former character class was also matching the comma used to separate key/value pairs in the data. This left the literal comma with nothing to match, so was one source of confusion.

Second, I noted that the initial prefix, while syntactically important, could be viewed in the same way as the comma and colon separators. I combined all of these separators and added a capture around them so we can later make sense of the parsed data.

When matched against the data, the pattern results in a list like,

("-", "name", "gcc_xo_src_clk", ",", "type", "rcg", "+", "name", "cxo_clk", ...)

Now we can process the data using a simple state machine.

my $state = undef;

while ( my $token = shift @list ) {
    if ( $token eq '-' ) { $state = 'dash'; next; }
    # ...
    if ( $token eq ',' ) { next; }

    my $key   = shift @list;
    my $value = shift @list;

    if ( $state eq 'dash' ) {
        # ...
    }
}

Even though we did all of the data extraction using a single pattern match, it looks remarkably like … a parser! The pattern is simply the tokenizer used to feed tokens into our state machine, the parser.

Parsing

I stated at the outset that I looked at this as a parsing problem, so the solution I would use is most likely a parser. For simple, one-off scripts, I’d use a technique similar to the one I described in the previous section. However, for more complex data or a more complex script, I’d turn to a real parser.

In fact, one of my contributions to the thread that led me to compose this post included an example of using the $^R and $^N variables in embedded code blocks to demonstrate a rudimentary parser that allowed a simulated form of capturing within a repeated non-capturing group. I won’t go into any detail beyond showing what I wrote. As this was from an early point in the thread, the prefix is ignored in this example.

my @list = ();

my $kv = qr{
    (\w+) (?{ $^N; })           # capture the key
    \s* = \s* (\w+)
    (?{ $^R = [ $^R, $^N ]; })  # capture the value, saving the key
    (?{ push @list, @{ $^R } }) # push the key/value onto @list
}xms;

$data =~ m{ (?: ^[-+?] \s* $kv \s* (?:[,:] \s* $kv \s* )* )* }xms;

Fortunately for us, there are parsing modules on the CPAN.

Prior to Perl 5.10, Damian Conway had written Parse::RecDescent, but with the introduction of grammar-like facilities like named captures and named backreferences, Damian improved upon his original work and presented the Perl community with Regexp::Grammars.

What does a parser for this data built with Regexp::Grammars look like?

my $parser = qr{
    <[Line]>+

    <token: Prefix>   <MATCH= ([-+?]) >
    <token: Key>      <MATCH= (\w+) >
    <token: Value>    <MATCH= (\w+) >

    <rule: Line>      <Prefix> <Pairs> <Options>?
    <rule: Pairs>     <[Pair]>* % ,
    <rule: Pair>      <Key> = <Value>
    <rule: Options>   : <[Option]>* % ,
    <rule: Option>    <Key> = <Value>
}x;

if ( $data =~ $parser ) {
    # Do something with %/
}

This is a trivial example and all the work is left to be done by inspecting the parse tree in %/. However, the module supports embedded code that will be called when a token or rule matches, which can be used to process the data as its parsed.

References

For Want of a Newline

Today I had the pleasure of spending three hours debugging an obscure bug. An obscure bug I caused by introducing a newline. That little punk, 0x0A.

I released a new version of a command line program. It’s an elegant piece of work, combining a marvelously complex-but-intuitive configuration for system administrators with a absolutely simple interface for users. To use the command, the user runs it with a couple of arguments and it prints out a single line of useful text derived from the marvelously complex configuration.

But, it doesn’t print a newline.

Anyone who has encountered a command like this knows well my irritation. You end up with something like this:

my awesome prompt> some_lame_command
my awesome prompt>e answer

Argh!

The workaround most of us use is to see the above, face-palm, then run something like this:

my awesome prompt> echo `some_lame_command`
42 is obviously the answer
my awesome prompt>

Being the arrogant bastard programmer that I am, I decided to fix this. Since all commands print newlines, everyone should already be assuming that this one does too and should already be handling it in the proper manner. When writing a shell script, the distinction between newline-printing and non-newline-printing commands is irrelevant. In either Bourne shell,

FROBBED=`frobnosticate`

or C shell,

setenv FROBBED `frobnosticate`

the shell is benevolent enough to remove the newline, if it exists. After all, this is the most commonly desired behavior when assigning command output to a variable. However, things are a bit different when switching to a programming language, like Perl:

$ENV{FROBBED} = `frobnosticate`; # Caution, newline ahead!

Sure, it looks more or less the same, but veteran Perl programmers will immediately grimace when reading the above. Unlike the shell, Perl, like other programming languages, will preserve the output of the command. In this case, preserving data and letting the programmer decide how to use it is the most commonly desired behavior. Since everything coming from an external command ends with a newline, the environment variable being set in this case will have a newline. This will almost always cause a problem. One that, as I’ve learned, is not always easy to find. Since stripping input of newlines is just as common as the desire to preserve data, Perl makes this easy and most Perl programmers will habitually write this:

chomp( $ENV{FROBBED} = `frobnosticate` );

Now it doesn’t matter if the command prints a newline or not, the chomp function has your back. It’s just like being in the warm embrace of the shell, only with a little extra syntax.

So it turns out that one of the engineering groups I support was using a Perl script that set an environment variable as in the first example. The value of this environment variable was then being passed off to the batch system and used by an engineering program as a network address to connect to. Of course, the program made the fatal mistake of trusting user input and, in a spectacular fashion, failed to connect to the server whose name just happened to contain a newline.

After chasing down a couple of red herrings which left me flummoxed, one of the affected users shared with me an error log and the script that generated it. There, in all its syntax highlighted, monospaced glory was the environment variable being set without attempting to trim off the newline. I quickly released an update that reverted the newline behavior and the problem went away. My engineers—at least, the subset using this particular script—could once again get their work done.

By far, this isn’t the worst thing I’ve done to our batch system. One time I caused all jobs that launched on Solaris hosts to immediately fail. Whoops.

Anyway, what’s the lesson to be learned from today’s experience?

Never—and I’ll repeat that, never—assume everyone will be doing the right thing. Inevitably, someone won’t be.

There’s a corollary to today’s lesson. When coming across something that could be improved with a small change, don’t. Seriously, just don’t. Inevitably, someone will be depending on the current behavior, no matter how right or wrong it may seem.

“Ninja” Code

The Amazon booth at OSCON 2008 is advertising heavily that they are hiring. They are also holding a raffle. To enter, simple look over some Perl code they have written out on some poster board and tell them what it does. It looks a little something like this (transcribing from memory):

my $code = qq{
    print 1+1 . "\n";
    $code =~ m/(\d+)\+(\d+)/;
    $new = $1 + $2;
    $code =~ s/\d+\+(\d+)/$2+$new/;
};

for ( 1 .. 10 ) {
    eval($code);
}

What’s the first bug? Yes, it should use q{}, or the variables will interpolate on the initial assignment to $code. To their credit, they initially used single quotes, but people said it was too hard to read.

I wasn’t content with just figuring out what the code did and fixing a small bug. I think it can be written better.

eval($code = q{
    print 1+1 . "\n";
    $code =~ s/(\d+)(\+)(\d+)/"$3$2" . ($1 + $3)/e;
    eval $code;
});

Much better. Not only is it more concise, I was able to remove that pesky loop, so I wouldn’t be bothered by any silly upper bounds.

So what does it do? Should be obvious. Head over to the Amazon booth and let them know.

When a Problem Comes Along, You Must Parse It

(With apologies to Devo.)

I attended the first of three courses taught by Damian Conway today, Advanced Parsing with Parse::RecDescent. This was, effectively, the Parse::RecDescent tutorial as presented live by Damian. I’ve read the tutorial, and it’s quite good, but it’s hard to beat instruction from the man himself. Not having used the module on too many occasions, the course served to clarify a few concepts of the grammar used by Parse::RecDescent.

Now I see problems in a new light—for better or worse. This evening, I joined my boss to debug a problem we’ve been seeing in our compute cluster. I joked that I could just whip up a grammar to track down the problem for us. I must resist my usual urge to apply my new, shiny tool to all problems. I mean, it’s bad enough I use the Perl hammer for everything, right?

Today’s course left me with nostalgia for the compilers course I took in college. As my partner for that long ago class reminded me, it was filled with “late nights and Mountain Dew.” Ah, good times.

OSCON 2007: State of the Onion

After the auction to benefit the Perl Foundation, it was finally time for the State of the Onion. I don’t know which number this is, but there have been a lot.

When Larry hooked up his computer to the projector, he had an IRC window open to #parrot on irc.perl.org. Yes, of course I did it. I jumped right into the channel and wrote, “hi mom.” I got a good laugh from those in the room, but I’ll probably never be welcome in that channel again.

Larry thinks it’s a bad idea to get rid of the term scripting. Perl already owns the brand when it comes to scripting. We have about the same chance of changing the branding of hacker.

“Programming is hard, let’s go scripting!”

Scripting isn’t so bad. It’s actually kind of easy; just look at all the script kiddies out there. But we can use Perl to turn all those script kiddies into real programmers. After all, Larry claims to have come to Perl in the same way.

So what’s the difference between scripting and programmers? Scripting is like profanity, you know it when you see it.

This year’s State of the Onion is about scripting, past, present, and future.

The past, essentially, is a brief history of Larry and his experience with scripting at different times of his life. More importantly, it’s about what all of these languages ar, how they and his experiences with them influenced what Perl was, is, and will be.

The present is an overview of the different ways languages can be designed. Binding, dispatch, typology, structure, and others are all different forks in the road of language design. Each fork developed for different reasons, whether it be efficiency of code or abstraction of language concepts. Follow all of these forks like some kind of Choose Your Own Adventure book, and different languages emerge. The lessons of each of these languages can be used as new ones are developed.

So I guess what Larry is trying to say is that Perl 6 looked at what every other language (including Perl 5) did right and what they did wrong, then went ahead and did everything right.

In fact, Perl 6 has taken Yogi Berra’s advice and took all of the forks. Sure, it seems confusing, but think of the power.

Okay, so what’s the future?

Perl 6.

Duh.

OSCON 2007: Domain Specific Languages in Perl

As a programmer who spends most of his time writing code that helps other people to write code, I’m pretty interested in domain specific languages. In fact, I have a couple of modules we use at work that use them. That’s how I ended up in Domain Specific Languages in Perl, presented by Jesse Vincent of Best Practical.

Domain specific languages (DSLs) are languages designed for specific programming tasks. A couple of well-known examples of DSLs are SQL and regular expressions. However, today, Jesse is talking to us about “Englishy” DSLs, which he’s been very interested in lately. Additionally, these are DSLs that are internal to Perl. Instead of parsing a DSL and executing it, they are instead implemented by playing with Perl’s syntax and taking advantage of Perl’s parser (and having all of Perl’s features available in the DSL).

The two main goals when implementing a DSL in Perl are,

  1. Does it feel good?
  2. Can we actually do it?

The first big DSL is Jifty::DBI, an object-relational mapper for Jifty. For the folks at Best Practical, it was a fun learning process in tweaking Perl to look the way they wanted it to while still declaring the database schema in an intuitive, pretty way.

The Template::Declare module is flat-out awesome. I’ve seen it before, and I’ve never taken the time to play with it. How stupid have I been? Coding HTML templates in Perl is cool, and something I would happily do (I don’t really like writing HTML).

As if that weren’t enough, in comes Jifty::Dispatcher. Modelled after Mason’s (auto|d)handler files, it will manipulate web application requests and take care of dispatching processes. I really like the DSL used here. It’s very declarative, removing the infrastructure entirely, which is something I’m very interested in doing in my code right now. I must be getting old; I’m more interested in telling the computer what to do, rather than how to do it, these days.

Testing web sites is ugly, and it sucks. That’s what Test::WWW::Declare is for. It’s a beautiful module. Sessions, flows, declarative statements to define the web flow. So very awesome.

I can tell that the Best Practical folks have seen Damian Conway’s Sufficiently Advanced Technologies talk. The moral of that story is, write what you want to see, and only then figure out how to make it work in Perl.

I’m sold. As soon as I get home, I’m going to work on porting my start-up company’s web site with Jifty. Jifty is, well, absolutely nifty.

At the end, Jesse showed us his own domain specific language to define his slides. It’s awesome. I want it for my own talks.