Category Archives: Programming

Automating On Call Jury Instructions

A version of this article is also published on my Perl blog.

Early in February, I received a jury summons for the United States District Court, Southern District of California. Prospective jurors for federal jury service (at least in this court) are placed on call for a period of about 30 days. I was to call for instructions on April 1 and potentially proceed to do so periodically until May 4 (assuming I wasn’t instructed to report).

Since my initial instruction date was nearly two months away, I created an entry for it in Google Calendar, and promptly forgot about it. On Monday, April 2, I was riding the train to work when I realized that I hadn’t remembered to check my instructions. Fortunately, after arriving at my office and checking my instructions, I had been deferred to the next day.

So I added a new entry in Google Calendar, this time with an SMS reminder. I proceeded to do this for most of April, checking my instructions and duplicating the calender entry with another SMS reminder.

I’m embarrassed to admit that it wasn’t until the last week of April that it occurred to me that I could automate the whole process. After all, isn’t automating drudgery the whole reason I ended up programming Perl in an engineering support group at my day job?

In addition to a telephone recording, jury instructions can be obtained online. In fact, this is the method I used all month. The form uses the HTTP POST method, so it wasn’t a simple matter of constructing an URL to fetch my instructions. While I could construct a POST request with curl(1) or the LWP module, it’s so much easier to do with with the WWW::Mechanize module.

my $mech = WWW::Mechanize->new();
$mech->get('http://jury.casd.uscourts.gov/AppearWeb/Default.aspx');
$mech->submit_form(
    form_name => 'Form1',
    fields    => {
        'ctl02$txtPart' => 'PARTICIPANT_ID',
        'ctl02$txtZip'  => 'ZIP_CODE',
    },
    button => 'ctl02$btnInstructions',
);

When I’m not supposed to report, the following message appears in the returned content:

<span id="ctl02_lblMsg">Please check again Sunday, April 29, after 6:00pm for further reporting instructions. Do NOT report at this time.</span>

Given how simple this is, I could parse it with a regular expression. But, I figured it was worth trying to do it right, so I searched CPAN and found the HTML::DOM module. I’ve worked a bit with DOM in JavaScript, so the module appealed to me. Annoyingly, the parse method only supports file names or file handles. Fortunately, this isn’t terribly difficult to work around and the whole thing isn’t much more verbose than using a regular expression.

my $dom = HTML::DOM->new;
$dom->parse_file( IO::Scalar->new( do { my $c = $mech->content; \$c } ) );
my $message = $dom->getElementById('ctl02_lblMsg')->innerHTML;

Now that I have the message what does it say? Thus far my instructions have always been to check again on another day, so I’ll need to work with what I know and defensively code for the exceptions.

if ( $message !~ /Do NOT report at this time/ ) {
    # We didn't see the message we wanted to see, so we'd better alert...
}

If I don’t see the known message, I send myself an alert (I happened to use the Email::Sender module in the script) and exit. If this happens, I’ll need to address it as it probably means I need to report (or I’m no longer on call).

However, if I do see the above message, I need to figure out when I’m supposed to check again. If this fails for some reason (e.g., I don’t know what the format looks like if the day is a single digit), I go through the alert process again. It’s rather important that this script be noisy, given the nature of what I’m doing and the limited knowledge I’m working with.

if ( $message !~ /Please check again (?<weekday>\w+), (?<month>\w+) (?<day>\d+)/ ) {
    # We couldn't parse the next date to check, so we'd better alert...
}
 
my $dt = DateTime::Format::DateParse->parse_datetime("$+{'weekday'}, $+{'day'} $+{'month'} 18:15");

I’ve hard-coded the time to check as 6:15 PM, because the instructions are always updated at 6:00 PM.

Finally, the script schedules itself to run again at the time indicated. Here I’ve broken out of Perl to use the at(1) command. Since I’m running the script on my Linode VPS, this seemed an easy way to accomplish the task of rescheduling.

open my $at, '|-', 'at', $dt->strftime('%R'), $dt->strftime('%F');
say {$at} "$0 2>/dev/null"; # $0 must be fully qualified or in PATH
close $at;

Running this script once will set the rescheduling process in motion, alleviating me of the need to run it again. If I’d thought of this at the beginning of April, I could have forgotten about the whole bother of checking for instructions several times per week. Oh well, live and learn.

I’ve posted the full script as a Gist on GitHub.

As a way of outsourcing the work and perhaps offer this type of service to a wider audience, I looked at ifttt and Yahoo! Pipes. Unfortunately, the former doesn’t appear to have a way to trigger on scraping an arbitrary web page, and the latter doesn’t appear to support the HTTP POST method. If anyone knows of an approach using existing services, I’m open to suggestions.

Updated on 30 April 2012.

Mojo::UserAgent

Posting this on blogs.perl.org resulted in a few comments (including a rant I completely agree with, but that’s beside the point). There was one suggestion that I try using Mojo::UserAgent instead of WWW::Mechanize.

My first attempt at doing so wasn’t particularly successful. After a while, I realized that I needed to manually do some of the work that WWW::Mechanize was doing for me. Namely, fetch the page, extract the hidden fields, and submit the form with these fields included (there’s a cookie involved, but it’s taken care of behind the scenes by both modules).

Because of this, the Mojo::UserAgent version is a bit more annoying to write, but I think this is more than made up for by the built-in access to the DOM.

my $ua  = Mojo::UserAgent->new;
my $url = 'http://jury.casd.uscourts.gov/AppearWeb/Default.aspx';
my $res = $ua->get($url)->res;  # initial fetch to get cookie and form fields
my $tx  = $ua->max_redirects(3)->post_form(
    $url => {
        '__VIEWSTATE'           => $res->dom('form#Form1 > input#__VIEWSTATE')->[0]->attrs('value'),
        '__EVENTVALIDATION'     => $res->dom('form#Form1 > input#__EVENTVALIDATION')->[0]->attrs('value'),
        'ctl02$txtPart'         => 'PARTICIPANT_ID',
        'ctl02$txtZip'          => 'ZIP_CODE',
        'ctl02$btnInstructions' => 'Reporting Instructions',
    }
);
 
$res = $tx->success or die $tx->error;
my $message = $res->dom('span#ctl02_lblMsg')->[0]->text;

WWW::Scripter

As I was working on the Mojo::UserAgent version of my script, I kept thinking how perfect it would be if WWW::Mechanize gave me access to the DOM in the same way. Well, as I was pushing the new jury-mojo.pl script to my Gist, cpansprout left a comment to not only tell me how I could remove my IO::Scalar hack, but that WWW::Scripter does exactly what I had just been wishing for. It’s like he read my mind.

my $mech = WWW::Scripter->new();
$mech->get('http://jury.casd.uscourts.gov/AppearWeb/Default.aspx');
$mech->submit_form(
    form_name => 'Form1',
    fields    => {
        'ctl02$txtPart' => 'PARTICIPANT_ID',
        'ctl02$txtZip'  => 'ZIP_CODE',
    },
    button => 'ctl02$btnInstructions',
);
 
my $message = $mech->document->getElementById('ctl02_lblMsg')->innerHTML;

I like this last version the most and have updated my Gist accordingly. Also, my automation worked and emailed me tonight to inform me that my jury service has concluded.

Repeated Capturing and Parsing in Perl

A version of this article is posted on my Perl blog.

When I checked my email after arriving at the office today, I found a query that had been sent to our internal Perl mail list. The questioner was trying to match a pattern repeatedly, capturing all of the results in an array. But, it wasn’t doing quite what he expected. The message, with minor edits, went a little something like the following.

I’m trying to extract key/value pairs from a file with the following contents:

- name = gcc_xo_src_clk, type = rcg
+ name = cxo_clk, type = xo, fgroup = xo, wt = 10, bloo = blah
? type = hm_mnd_rcg, name = bo : type = rcg_mn
+ name = pxo_clk

I was hoping to do something like this:

@list = $_ =~ m{ ^[-+?] \s* (\S+) \s* = \s* (\S+) \s* (?:, \s* (\S+) \s* = \s* (\S+) \s*)* }xms;

Thinking @list would be assigned the alternating key/value pairs. But the above doesn’t extract anything sane. Adding the /gc modifiers doesn’t make any difference.

If I do the following, it extracts the first two key/value pairs correctly (if the line has more than one pair).

@list = $_ =~ m{
    ^[-+?] \s* (\S+) \s* = \s* (\S+) \s*
    , \s* (\S+) \s* = \s* (\S+) \s*
}xms;

If I keep repeating the pattern in the second line, it keeps matching more key/value pairs.

I would expect using (?: )* should mean zero or more instances of match inside the parentheses, but obviously it’s not working. What am I doing wrong?

When I’m presented with a problem like this, that is some kind of structured data, I immediately think of writing a parser. I’ll get back to that in a bit, but I wanted to address the confusion about capturing in the pattern. And, in fact, that’s how the discussion on the mail list proceeded.

Repeated Capturing

First, let’s simplify the example to demonstrate why our seeker of wisdom isn’t getting back the list of items he expected.

my @matches = 'a b c d e' =~ /^(a) \s* (?: ([bcde]) \s* )*/xms;

say "(@matches)";   # prints "(a e)"

Capturing parentheses in Perl are treated somewhat like registers. Most Perl programmers are familiar with the $n variables, which hold the values of a successful pattern match. For example $1 holds the value matched by the first set of parentheses, $2 holds the value of the second set, and so on.

When a pattern is matched in list context, as above, it’s effectively the same as writing,

'a b c d e' =~ /^(a) \s* (?: ([bcde]) \s* )*/xms;

my @matches = ( $1, $2 );

These pattern match variables are scalars and, as such, will only hold a single value. That value is whatever the capturing parentheses matched last. So, in our simplified example, $1 matches a, which is obvious enough. As the pattern repeats, $2 would be set to b, then c, and so on until the final match of e.

That explains why the pattern match wasn’t returning the expected list. What can be done about it?

Capturing Along the Way

If we break down the sample data, we see that it generalizes to,

prefix key = value[, ...] [: key = value[, ...]]

The first approach that came to mind is to split the data into multiple lines. Each line can then have its initial prefix removed and saved, then parsed for its key/value pairs. That’s starting to look a lot like parsing, which I promised to get to later. For the purposes of this discussion, I wanted to be able to accomplish the task with a single regular expression.

To capture all of the values we want, we need to remove the repeating set of non-capturing parentheses. However, we still need to repeat the match, ideally returning all of the captured values in one statement. We can do that with the /g and /c regular expression modifiers.

my @list = $string =~ m{ ([-+?,:]) \s* (\w+) \s* = \s* (\w+) \s* }xmsgc;

I’ve done two things here. First, I replaced the \S character classes, used to match the key and value, with \w. The + pattern in a Perl regular expression is greedy, so the former character class was also matching the comma used to separate key/value pairs in the data. This left the literal comma with nothing to match, so was one source of confusion.

Second, I noted that the initial prefix, while syntactically important, could be viewed in the same way as the comma and colon separators. I combined all of these separators and added a capture around them so we can later make sense of the parsed data.

When matched against the data, the pattern results in a list like,

("-", "name", "gcc_xo_src_clk", ",", "type", "rcg", "+", "name", "cxo_clk", ...)

Now we can process the data using a simple state machine.

my $state = undef;

while ( my $token = shift @list ) {
    if ( $token eq '-' ) { $state = 'dash'; next; }
    # ...
    if ( $token eq ',' ) { next; }

    my $key   = shift @list;
    my $value = shift @list;

    if ( $state eq 'dash' ) {
        # ...
    }
}

Even though we did all of the data extraction using a single pattern match, it looks remarkably like … a parser! The pattern is simply the tokenizer used to feed tokens into our state machine, the parser.

Parsing

I stated at the outset that I looked at this as a parsing problem, so the solution I would use is most likely a parser. For simple, one-off scripts, I’d use a technique similar to the one I described in the previous section. However, for more complex data or a more complex script, I’d turn to a real parser.

In fact, one of my contributions to the thread that led me to compose this post included an example of using the $^R and $^N variables in embedded code blocks to demonstrate a rudimentary parser that allowed a simulated form of capturing within a repeated non-capturing group. I won’t go into any detail beyond showing what I wrote. As this was from an early point in the thread, the prefix is ignored in this example.

my @list = ();

my $kv = qr{
    (\w+) (?{ $^N; })           # capture the key
    \s* = \s* (\w+)
    (?{ $^R = [ $^R, $^N ]; })  # capture the value, saving the key
    (?{ push @list, @{ $^R } }) # push the key/value onto @list
}xms;

$data =~ m{ (?: ^[-+?] \s* $kv \s* (?:[,:] \s* $kv \s* )* )* }xms;

Fortunately for us, there are parsing modules on the CPAN.

Prior to Perl 5.10, Damian Conway had written Parse::RecDescent, but with the introduction of grammar-like facilities like named captures and named backreferences, Damian improved upon his original work and presented the Perl community with Regexp::Grammars.

What does a parser for this data built with Regexp::Grammars look like?

my $parser = qr{
    <[Line]>+

    <token: Prefix>   <MATCH= ([-+?]) >
    <token: Key>      <MATCH= (\w+) >
    <token: Value>    <MATCH= (\w+) >

    <rule: Line>      <Prefix> <Pairs> <Options>?
    <rule: Pairs>     <[Pair]>* % ,
    <rule: Pair>      <Key> = <Value>
    <rule: Options>   : <[Option]>* % ,
    <rule: Option>    <Key> = <Value>
}x;

if ( $data =~ $parser ) {
    # Do something with %/
}

This is a trivial example and all the work is left to be done by inspecting the parse tree in %/. However, the module supports embedded code that will be called when a token or rule matches, which can be used to process the data as its parsed.

References

For Want of a Newline

This article is also published on my Perl blog.

Today I had the pleasure of spending three hours debugging an obscure bug. An obscure bug I caused by introducing a newline. That little punk, 0x0A.

I released a new version of a command line program. It’s an elegant piece of work, combining a marvelously complex-but-intuitive configuration for system administrators with an absolutely simple interface for users. To use the command, the user runs it with a couple of arguments and it prints out a single line of useful text derived from that marvelously complex configuration.

But, it doesn’t print a newline.

It’s never printed a newline. The original author didn’t include one for some reason. Anyone who has ever encountered a command like this knows well my irritation.

my awesome prompt> some_lame_command
my awesome prompt>e answer

Argh!

The workaround I’ve seen used, after seeing the above is to face-palm, then run the command again, only differently.

my awesome prompt> echo `some_lame_command`
42 is obviously the answer
my awesome prompt>

I’m embarrassed when users see behavior like this from a program I wrote. Being the arrogant bastard programmer that I am, I decided to fix this. Since all commands print newlines, everyone should already be assuming that this one does too and should already be handling it in the proper manner. Right? When writing a shell script, the distinction between newline-printing and non-newline-printing commands is irrelevant. In either Bourne shell:

FROBBED=`frobnosticate`

Or C shell:

setenv FROBBED `frobnosticate`

The shell is benevolent enough to remove the newline, if it exists. After all, this is the most commonly desired behavior when assigning command output to a variable. However, things are a bit different when switching to a so-called real programming language, like Perl:

$ENV{'FROBBED'} = `frobnosticate`;  # Caution, newline ahead!

Sure, it looks more or less the same, but veteran Perl programmers will immediately grimace when reading the above. Unlike the shell, Perl, like other programming languages, will preserve the output of the command verbatim. In this case, preserving data and letting the programmer decide how to use it is the most commonly desired behavior. Since everything coming from an external command ends with a newline, the environment variable being set in this case will have a newline. This will almost always cause a problem. One that, as I’ve learned, is not always easy to find. Since stripping input of newlines is just as common as the desire to preserve data, Perl makes this easy and most Perl programmers will habitually write it.

chomp( $ENV{'FROBBED'} = `frobnosticate` );

Now it doesn’t matter if the command prints a newline or not, the chomp built-in has your back. It’s just like being in the warm embrace of the shell, only with a little extra syntax.

So it turns out that one of the engineering groups I support was using a Perl script that set an environment variable as in the first example. The value of this environment variable was then being passed off to the batch system and used by an engineering program as a network address to connect to. Of course, the program made the fatal mistake of trusting user input and, in a spectacular fashion, failed to connect to the server whose name just happened to contain a newline.

After chasing down a couple of red herrings which left me flummoxed, one of the affected users shared with me an error log and the script that generated it. There, in all its syntax highlighted, mono-spaced glory was the environment variable being set without any attempt made to trim off the newline. I immediately swallowed my pride and released an updated version of the command reverting the newline behavior and the problem went away. My engineers—at least, the subset using this particular in this particular way—could once again get their work done.

By far, this isn’t the worst thing I’ve done to our batch system. One time I caused all jobs executing on Solaris hosts to immediately fail. Whoops.

Anyway, what’s the lesson to be learned from today’s experience?

Never—and I’ll repeat that, never—assume everyone will be doing the right thing (remember what they say when you assume something). Inevitably, someone won’t be.

There’s a corollary to today’s lesson. When coming across something that could be improved with a small change, don’t. Seriously, just don’t. Inevitably, someone will be depending on the current behavior, no matter how right or wrong it may seem. This is why the phrase “bug compatible” exists in the lexicon.

“Ninja” Code

A version of this article is published on my Perl blog.

The Amazon booth at OSCON 2008 is advertising heavily that they are hiring. They are also holding a raffle. To enter, simply look over some Perl code they have written out on some poster board and tell them what it does. It looks a little something like this (transcribing from memory):

my $code = qq{
    print 1+1 . "\n";
    $code =~ m/(\d+)\+(\d+)/;
    $new = $1 + $2;
    $code =~ s/\d+\+(\d+)/$2+$new/;
};

for ( 1 .. 10 ) {
    eval($code);
}

What’s the first bug? Yes, it should use q{}, or the variables will interpolate on the initial assignment to $code. To their credit, they initially used single quotes, but people said it was too hard to read.

I wasn’t content with just figuring out what the code did and fixing a small bug. I think it can be written better.

eval($code = q{
    print 1+1 . "\n";
    $code =~ s/(\d+)(\+)(\d+)/"$3$2" . ($1 + $3)/e;
    eval $code;
});

Much better. Not only is it more concise, I was able to remove that pesky loop, so I wouldn’t be bothered by any silly upper bounds.

So what does it do? Should be obvious. Head over to the Amazon booth and let them know.

When a Problem Comes Along, You Must Parse It

(With apologies to Devo.)

I attended the first of three courses taught by Damian Conway today, Advanced Parsing with Parse::RecDescent. This was, effectively, the Parse::RecDescent tutorial as presented live by Damian. I’ve read the tutorial, and it’s quite good, but it’s hard to beat instruction from the man himself. Not having used the module on too many occasions, the course served to clarify a few concepts of the grammar used by Parse::RecDescent.

Now I see problems in a new light—for better or worse. This evening, I joined my boss to debug a problem we’ve been seeing in our compute cluster. I joked that I could just whip up a grammar to track down the problem for us. I must resist my usual urge to apply my new, shiny tool to all problems. I mean, it’s bad enough I use the Perl hammer for everything, right?

Today’s course left me with nostalgia for the compilers course I took in college. As my partner for that long ago class reminded me, it was filled with “late nights and Mountain Dew.” Ah, good times.

OSCON 2007: State of the Onion

After the auction to benefit the Perl Foundation, it was finally time for the State of the Onion. I don’t know which number this is, but there have been a lot.

When Larry hooked up his computer to the projector, he had an IRC window open to #parrot on irc.perl.org. Yes, of course I did it. I jumped right into the channel and wrote, “hi mom.” I got a good laugh from those in the room, but I’ll probably never be welcome in that channel again.

Larry thinks it’s a bad idea to get rid of the term scripting. Perl already owns the brand when it comes to scripting. We have about the same chance of changing the branding of hacker.

“Programming is hard, let’s go scripting!”

Scripting isn’t so bad. It’s actually kind of easy; just look at all the script kiddies out there. But we can use Perl to turn all those script kiddies into real programmers. After all, Larry claims to have come to Perl in the same way.

So what’s the difference between scripting and programmers? Scripting is like profanity, you know it when you see it.

This year’s State of the Onion is about scripting, past, present, and future.

The past, essentially, is a brief history of Larry and his experience with scripting at different times of his life. More importantly, it’s about what all of these languages ar, how they and his experiences with them influenced what Perl was, is, and will be.

The present is an overview of the different ways languages can be designed. Binding, dispatch, typology, structure, and others are all different forks in the road of language design. Each fork developed for different reasons, whether it be efficiency of code or abstraction of language concepts. Follow all of these forks like some kind of Choose Your Own Adventure book, and different languages emerge. The lessons of each of these languages can be used as new ones are developed.

So I guess what Larry is trying to say is that Perl 6 looked at what every other language (including Perl 5) did right and what they did wrong, then went ahead and did everything right.

In fact, Perl 6 has taken Yogi Berra’s advice and took all of the forks. Sure, it seems confusing, but think of the power.

Okay, so what’s the future?

Perl 6.

Duh.

OSCON 2007: Domain Specific Languages in Perl

As a programmer who spends most of his time writing code that helps other people to write code, I’m pretty interested in domain specific languages. In fact, I have a couple of modules we use at work that use them. That’s how I ended up in Domain Specific Languages in Perl, presented by Jesse Vincent of Best Practical.

Domain specific languages (DSLs) are languages designed for specific programming tasks. A couple of well-known examples of DSLs are SQL and regular expressions. However, today, Jesse is talking to us about “Englishy” DSLs, which he’s been very interested in lately. Additionally, these are DSLs that are internal to Perl. Instead of parsing a DSL and executing it, they are instead implemented by playing with Perl’s syntax and taking advantage of Perl’s parser (and having all of Perl’s features available in the DSL).

The two main goals when implementing a DSL in Perl are,

  1. Does it feel good?
  2. Can we actually do it?

The first big DSL is Jifty::DBI, an object-relational mapper for Jifty. For the folks at Best Practical, it was a fun learning process in tweaking Perl to look the way they wanted it to while still declaring the database schema in an intuitive, pretty way.

The Template::Declare module is flat-out awesome. I’ve seen it before, and I’ve never taken the time to play with it. How stupid have I been? Coding HTML templates in Perl is cool, and something I would happily do (I don’t really like writing HTML).

As if that weren’t enough, in comes Jifty::Dispatcher. Modelled after Mason’s (auto|d)handler files, it will manipulate web application requests and take care of dispatching processes. I really like the DSL used here. It’s very declarative, removing the infrastructure entirely, which is something I’m very interested in doing in my code right now. I must be getting old; I’m more interested in telling the computer what to do, rather than how to do it, these days.

Testing web sites is ugly, and it sucks. That’s what Test::WWW::Declare is for. It’s a beautiful module. Sessions, flows, declarative statements to define the web flow. So very awesome.

I can tell that the Best Practical folks have seen Damian Conway’s Sufficiently Advanced Technologies talk. The moral of that story is, write what you want to see, and only then figure out how to make it work in Perl.

I’m sold. As soon as I get home, I’m going to work on porting my start-up company’s web site with Jifty. Jifty is, well, absolutely nifty.

At the end, Jesse showed us his own domain specific language to define his slides. It’s awesome. I want it for my own talks.