Wednesday, February 18, 2009

Scraping data from a program's output

Consider a (truncated) output from vmstat on AIX:
  2031616 memory pages
  1953185 lruable pages
   935166 free pages
        1 memory pools
   170943 pinned pages
     80.0 maxpin percentage
  ...
Say you want to grab the values for memory pages and free pages. Below, I explain a couple of ways to do it.

#! /usr/bin/perl

use warnings;
use strict;

no warnings "exec";

open my $fh, "vmstat -v |"
  or die "$0: can't execute vmstat: $!\n";

my %vmstat;

while (<$fh>) {
  chomp;
  my($n,$desc) = split " ", $_, 2;

  $vmstat{$desc} = $n;
}

print "Memory pages: $vmstat{'memory pages'}\n",
      "Free pages:   $vmstat{'free pages'}\n";

When the filename argument to open ends with a pipe character (|), Perl runs the named command and makes its standard output available for reading on the filehandle. I turn off Perl's own warning for a failed exec (no warnings "exec") because I prefer the format of my own error message.
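
The same piped read can also be written with the three-argument form of open and the "-|" mode, which takes the command as a list. The following is only a sketch: because vmstat -v is specific to AIX, it runs perl itself (through $^X) as a stand-in command that prints one sample line. That stand-in and the names @cmd and @lines are assumptions for illustration, not part of the program above.

```perl
#! /usr/bin/perl

use warnings;
use strict;

# Stand-in command (an assumption): run perl itself to print one sample
# line, because vmstat -v is only meaningful on AIX.
my @cmd = ($^X, "-e", 'print "   935166 free pages\n"');

# "-|" means: run the command and read its standard output.
open my $fh, "-|", @cmd
  or die "$0: can't execute $cmd[0]: $!\n";

my @lines = <$fh>;
close $fh;

print "read ", scalar(@lines), " line(s): $lines[0]";
```

On the real system, the list would simply be ("vmstat", "-v").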

Looking at vmstat's output, each line has a value and a description, so the plan is to read each line and stash the parameters where we can find them later. A hash is a perfect data structure for this task.

Most of the time, the pattern given to the split operator is a regular expression, but a pattern of a lone space (which is also the default when split is called with no arguments) makes it act like awk: it splits on runs of whitespace and throws away leading whitespace. Because the descriptions themselves contain spaces, we don't want to split them apart, so we pass a limit of 2 to tell Perl to give us back at most two fields. With the limit in place, the second field keeps everything through the end of the line, including the trailing newline, which is why we remove it first with chomp.
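
To see what the limit changes, here is a small sketch that runs split on one sample line from the output above, with and without the limit. The sample string and the names @all, $n, and $desc are only for illustration, and chomp is deliberately left out so that the newline stays visible.

```perl
#! /usr/bin/perl

use warnings;
use strict;

# One sample line from the vmstat output above, newline included.
my $line = "   935166 free pages\n";

# No limit: awk-style splitting breaks the description apart and
# discards the trailing newline along with the other whitespace.
my @all = split " ", $line;

# Limit of 2: the second field keeps everything after the first run
# of whitespace, including the trailing newline.
my($n,$desc) = split " ", $line, 2;

print scalar(@all), " fields without a limit\n";
print "value: [$n]\n";
print "description still ends with a newline\n" if $desc =~ /\n\z/;
```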

The output is straightforward: print the desired values.

You can of course be more clever:

#! /usr/bin/perl

%vmstat = reverse `vmstat -v` =~ /(\S+) (.+)/g;

print "Memory pages: $vmstat{'memory pages'}\n",
      "Free pages:   $vmstat{'free pages'}\n";

Instead of a piped open, this time we use backticks (``) to capture vmstat's output, and then extract the values and descriptions from it with a regular expression.

The regular expression \S+ means a sequence of one or more non-whitespace characters, and this matches the numbers in the output. You might be tempted to use \d+ (one or more digits), but this will give you surprising results on the floating-point numbers.
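
A quick sketch of that surprise, using the maxpin line from the sample output. The variable names are hypothetical.

```perl
#! /usr/bin/perl

use warnings;
use strict;

# Sample line from the vmstat output above.
my $line = "     80.0 maxpin percentage\n";

# \d+ cannot cross the decimal point, and the literal space in the
# pattern cannot match ".", so the match slides along to the final "0".
my($digits,$rest) = $line =~ /(\d+) (.+)/;

# \S+ takes the whole token, decimal point and all.
my($value,$desc) = $line =~ /(\S+) (.+)/;

print "with \\d+: [$digits] [$rest]\n";   # with \d+: [0] [maxpin percentage]
print "with \\S+: [$value] [$desc]\n";    # with \S+: [80.0] [maxpin percentage]
```

With \d+, the hash would quietly record a maxpin percentage of 0 instead of 80.0.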

By default, dot does not match newline, so the (.+) subpattern matches through the rest of the current line, which is the description in this case.

The /g regular-expression switch means we get all possible non-overlapping matches.

The list returned from the match will look like (2031616, "memory pages", 1953185, "lruable pages", ...), that is, value then description. Hash initialization expects the opposite order: key then value. The reverse operator fixes this by flipping the entire list, which puts each description ahead of its value.
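
The steps can be pulled apart to watch the intermediate list. This is a minimal sketch that assumes a here-document copied from the sample output in place of the backticks. The names $output and @pairs are introduced only for illustration.

```perl
#! /usr/bin/perl

use warnings;
use strict;

# Sample text standing in for `vmstat -v`, copied from the output above.
my $output = <<'EOT';
  2031616 memory pages
  1953185 lruable pages
   935166 free pages
     80.0 maxpin percentage
EOT

# In list context, /g returns every captured pair: value, description, ...
my @pairs = $output =~ /(\S+) (.+)/g;

# reverse flips the whole list, so each description now precedes its value.
my %vmstat = reverse @pairs;

print "first pair: $pairs[0] / $pairs[1]\n";
print "Free pages: $vmstat{'free pages'}\n";
```

Note that the reversed list also lists the entries last to first, which makes no difference to the resulting hash as long as the descriptions are unique.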
