Monday, August 31, 2009

Finding duplicates with Perl and Haskell

A coworker wanted to check a family of log files to be sure that a given task never appeared on multiple nodes at the same time. Log entries are on single, whitespace-separated lines, and the last field records a task's start time, e.g.,
Of the three underscore-separated fields, the first is a timestamp, the second we don't care about, and the third is a task identifier.

This task is straightforward with Perl. The diamond operator (or null filehandle, as described in the "I/O Operators" section of the perlop manpage) takes care of the boilerplate for iterating over the paths on the command line, opening them, and reading each line. The scalar $ARGV contains the name of the current file.

By default, split separates fields by whitespace, so (split)[-1] gives us the last field, from which we then grab the time and task with a regular expression and record its presence by pushing the entry's path and line number onto an array associated with that time/task pair. After we've processed the logs, these arrays should all be singletons.

The continue clause is a little weird but necessary because the special variable $., the current line number, does not reset on <>'s implicit opens. ARGV is a handle on the file being read.

With this data structure, detecting duplicates is a search for time/task pairs with multiple hits. We count duplicates and let the user know what we found.

#! /usr/bin/perl

use warnings;
use strict;

# e.g., $hits = @{ $seen{$time}{$task} };
my %seen;

sub num { $a <=> $b }

while (<>) {
  if ((split)[-1] =~ /^(\d+)_\d+_(\d+)$/) {
    my($time,$task) = ($1,$2);
    push @{ $seen{$time}{$task} } => "$ARGV:$.";
  else {
    die "$0: $ARGV:$.: bad timestamp/task field\n";
continue {
  close ARGV if eof;

my $duplicates = 0;
foreach my $time (sort num keys %seen) {
  foreach my $task (sort num keys %{ $seen{$time} }) {
    my @hits = @{ $seen{$time}{$task} };
    next if @hits == 1;

    $duplicates += @hits - 1;
    warn "$0: duplicates for time=$time, task=$task:\n",
         map "    - $_\n", @hits;

my $s = $duplicates == 1 ? "" : "s";
print "$0: $duplicates duplicate$s detected.\n";

exit $duplicates == 0 ? 0 : 1;

For comparison, I implemented the same log checker in Haskell. The function allInputs emulates Perl's diamond operator, and instead of a multi-level hash, the association is more direct: time/task pair to a list of hits.

module Main where

import Control.Monad (liftM)
import Data.List (sort)
import Data.Map (empty,filter,fromListWith,toList,unionWith)
import Prelude hiding (filter)
import System.Environment (getArgs,getProgName)
import System.Exit (ExitCode(..),exitWith)
import Text.Printf (printf)

type Time = String
type Task = String
data Duplicates =
  Duplicates { timestamp :: Time
             , taskId    :: Task
             , locations :: [(FilePath, Int)]

main :: IO ()
main = do
  logs <- allInputs
  let multi = dups logs
      n = sum $ map (subtract 1 . length . locations) multi
  mapM_ (msg . lines . dupmsg) multi
  msg $ ndups n
  exitWith $ if n == 0
               then ExitSuccess
               else ExitFailure 1
    msg info = do me <- getProgName
                  putStrLn $ me ++ ": " ++ head info
                  mapM_ putStrLn (tail info)

    ndups 1 = ["1 duplicate detected"]
    ndups n = [show n ++ " duplicates detected"]

    dupmsg (Duplicates tm task ls) = unlines $
      printf "duplicates for time=%s, task=%s:" tm task :
      map (\(path,n) -> printf "    - %s:%d" path n) ls

allInputs :: IO [(FilePath, String)]
allInputs = getArgs >>= go
  where go [] = ((:[]) . (,) "-"`liftM` getContents
        go fs = mapM readFile fs >>= return . zip fs

dups :: [(FilePath, String)] -> [Duplicates]
dups = map (\((tm,task),ds) -> Duplicates tm task ds) .
       sort .
       toList .
       filter ((> 1. length) .
       foldl (unionWith (++)) empty .
       map (\(path, contents) ->
              fromListWith (++$
              map (wrap path . getTimeTask) $
              zip [1..$ lines contents)
    wrap path (tm,task,n) = ((tm,task), [(path,n)])

getTimeTask :: (Int,String) -> (Time,Task,Int)
getTimeTask (n,line) = (tm,tsk,n)
    [tm,_,tsk] = splitBy '_' (last $ words line)

    splitBy :: Eq a => a -> [a] -> [[a]]
    splitBy _ [] = []
    splitBy x xs = h : splitBy x t
      where (h,rest) = break (== x) xs
            t = drop 1 rest

Thursday, August 27, 2009

Parker Griffith radio interview

U.S. Representative Parker Griffith, whom I've written about before in this space, was recently on WBHP.

I like his positions on the stimulus, bailouts, cap & tax, and healthcare: all no!

I was happy to hear him speak out against the AMA, which many doctors can't do out of fear of losing their licenses. The AMA cartel is a big source of what's wrong with healthcare in America.

He was also correct that the so-called reform proposals are really "overreach" and a "power grab," nothing to do with improving health care. In a 1961 speech, Ronald Reagan warned, "One of the traditional methods of imposing statism or socialism on a people has been by way of medicine."

So far so good. But then he got to his solution.

I understand the poor guy was sleep deprived, but he fell back into the usual feel-good blah-blah-blah about making health insurance companies "play by the same rules." In real life, this will mean more federal oversight, higher prices, less competition, and therefore less consumer satisfaction.

As a physician, he ought to understand the folly of treating the symptom. The real problem is adults still believe in Santa Claus but use the code name "health insurance." Imagine if State Farm were expected to pony up every time you took your car in for an oil change or other routine maintenance. Health insurance is no such thing: it's really a sort of pre-paid entitlement program, but entirely perverse. For example, legislation requires everyone to pay the same premium in employer-sponsored group insurance. This means healthy people are overcharged, and the couch potato gets a free ride. The system subsidizes and therefore encourages poor health!

In the market, sellers compete against sellers, but buyers also compete against buyers. Throwing deep-pocketed "insurers" into the mix inevitably drives prices beyond affordable levels, especially given the way "insurance" destroys patient price-sensitivity.

I like Peter Schiff's proposal much better. Health "insurers" exist only because of aberration in the tax code. This special tax treatment is why open-market consumers can't purchase their products for reasonable prices. Schiff proposes ending this subsidy but simultaneously raising the personal exemption so as to make the change a wash for taxpayers. The true costs of health "insurance" would then be laid bare. The much cheaper option of major-medical policies would become more attractive. Insurers would have to compete with each other on price, terms, service, and so on. We'd no longer be chained to our employer-sponsored plans.

This would create enormous downward pressure on healthcare prices because people would pay for oil changes.. err, routine doctor visits out of pocket from money that they gasp saved in anticipation of such expenses. Of course when the out-of-pocket limit hits, their bona fide insurance would kick in and cover the rest. Economic arbitrage doing its work means that such a system would also benefit those who opt not to purchase insurance policies.

The jingoistic breast-beating about money versus commitment was poor cover for the fact that Uncle Sam is drowning in debt and flat-out broke. The feds can barely afford to pay their bills now. How will they afford a new program that will dwarf Social Security and Medicare?

Monday, August 24, 2009

Press hit

A good friend of mine rated a mention in a Wall Street Journal article about blatant bias in Heisman voting:
"It seems like it's always something with us," says Patrick Bobo, a Tennessee fan who contributes to a site called Third Saturday in Blogtober. "A lot of Tennessee fans say they don't want the stupid award because it's a joke."
(Back in college, he wore FSU gear but then started pulling for the Vols when they did so well under Peyton Manning, thus earning plenty of good-natured ribbing about bandwagon-hopping.)

As Darren Everson notes in "The South's Heisman Trophy Grudge," no Crimson Tide player has ever won the Heisman even though the program is one of the all-time elites in college football. Bobo pulls for the wrong team but is right on about what a joke Heisman voting is.

The name of the blog is a nod to the third Saturday in October, the annual renewal of the big Alabama-Tennessee rivalry—which happens to take place on the fourth Saturday in October of this year, but whatever. Bloggers from both sides fuel the fire with some good ol' trash-talk.

Even people outside the South know of the Alabama-Auburn rivalry, but the Vols run a very close second. A few years ago, for example, I said I'd rather beat UT than the Tigers if I had to choose. Maybe it has something to do with living in north Alabama and having to endure a higher rate of Tennessee puke-orange sightings. The inset is Daniel Moore's "Rocky Stop," which depicts Roman Harper's (#41) fumble-causing, game-saving hit in an ugly 6-3 matchup.

Congratulations, but I'm looking forward to enjoying this year's victory cigar, Bobo!

Tuesday, August 18, 2009

We have universal legal care

Wow, is it awesome too!
After his appeal had failed and before he was exonerated, Lloyd filed a complaint with the state. He told them Slameka [Lloyd's public defender] never gave him the time of day. Harlin still has a copy of Slameka's rebuttal, which she read out loud:
“This is a sick individual who raped, kidnapped and strangled a young woman on her way to school. His claim of my wrongdoing is frivolous, just as is his existence. Both should be terminated.”
When asked if he really thought his own client should be executed, Slameka said yes, that's what he wrote at the time and that's how he felt.
Read the full story at NPR.

After having 17 years of his life stolen over a horrible crime that DNA evidence showed he didn't commit, Lloyd got out but died two short years later.

His sister, Ruth Harlin, says if someone close to her needed a defense lawyer today, she would do whatever it took to pay for a lawyer.

“I'd mortgage the house,” Harlin says. “I would not let a public defender defend anyone in my family ever again. Ever, ever again.”

Universal legal care's problems—chronic underfunding, unmanageable caseloads, top talent unwilling to participate in the insanity, poor standard of care, and on and on and on—are built in despite its proponents' good intentions.

Economics 101: Incentives Matter!

Sunday, August 16, 2009

Simple analogy for lazy evaluation

In my talk on Friday, I was explaining Haskell's laziness using the following example:
evens (_:x:xs) = x : evens xs
I used promises and thunks to show that the recursive application of evens xs is not evaluated immediately and that we could define the unbounded list of all positive even numbers with evens [1..] without running off into an infinite loop. Some in the mostly imperative-minded audience were still confused.

Someone asked, “Is it kind of like a recipe?” Again my wife's oatmeal cream pies found a fun place in Haskell and this time provided a helpful analogy!

In our pantry we have ingredients such as oatmeal, sugar, and so on. She has her trusty recipe. As long as we provide ingredients and are willing to wait, we can get as many oatmeal cream pies as we want.

The analogy is imperfect. Diminishing marginal utility doesn't weigh on a CPU's execution of programs, but smiling, appreciative faces and delighted exclamations of "Mmm!"—although powerful—may ultimately need to be augmented with shoe-shopping gift certificates to keep the party going.


For a child will be born to us, a son will be given to us;
  And the government will rest on His shoulders;
  And His name will be called Wonderful Counselor, Mighty God,
  Eternal Father, Prince of Peace.

Isaiah 9:6

Monday, August 10, 2009

10Log10 Tech Seminar: Haskell

Come sharpen your skills with a beginner-friendly introduction to Haskell, the advanced functional language that will completely rewire your brain—for the better! I'll give an overview of the language, using the type system to prevent bugs, automatic property-based testing with QuickCheck, and how to improve your procedural or object-oriented code by borrowing from the functional paradigm.

Join us on Friday, August 14, from 11:30am to 1pm in dB's large conference room:

deciBel Research, Inc.
1525 Perimeter Parkway, Suite 500
Huntsville, Alabama 35806
Phone: 256-716-0787
via Google Maps
RSVP to margie at dbresearch dot net. This is a lunchtime event, so feel free to brownbag it!

Saturday, August 08, 2009


His speech was smooth as butter,
    yet war was in his heart;
his words were softer than oil,
    yet they were drawn swords.

Psalm 55:21

Friday, August 07, 2009

git: shrinking Subversion import

At $WORK, we've been attempting for years—but fairly infrequently—to do distributed development with centralized Subversion. We finally had enough and decided to move to git.

Part of that move involved importing a couple of projects with 6+ years of history. Early revisions carried lots of binary test data, so git svn clone produced repositories weighing in at 3.5 and 4.5 gigabytes.

Another less than satisfactory result was the umpteen bazillion git branches corresponding to git tags. Some of the git branches formed families with names of the form name-x.y.z@1234, where name-x.y.z is the name of a Subversion release tag and 1234 was a Subversion revision that modified the tag. A happy design choice made the branch name-x.y.z (with no @nnn) the head revision of that Subversion tag, so we easily picked off some targets:

$ git branch -r -D `git branch -r | grep @`
Cribbing from svn2git, converting the git branches to git tags was a series of commands of the form
$ git checkout 1.2.3
$ git tag -a -m "Tagging release 1.2.3" v1.2.3
$ git branch -r -D 1.2.3
Then to make the Subversion trunk the git master branch:
$ git branch -D master
$ git checkout trunk
$ git checkout -f -b master
$ git branch -r -D trunk
Here's a good point to checkpoint your work in case you hose something later.

Using Antony Stubbs's script to find the biggest objects in a repo's packs, we determined that much of the bulk came from huge HDF5-format test baselines along with a few others. So we cut them out:

$ git filter-branch -d /dev/shm/scratch --index-filter \
  "git rm --cached -f --ignore-unmatch '*.h5'; \
   git rm --cached -f --ignore-unmatch '*.sig'; \
   git rm --cached -f --ignore-unmatch '*.2dsc'" \
  --tag-name-filter cat -- --all
The use of --index-filter makes the long process (remember, it has to hit all possible revisions) quicker because it operates directly on the index rather than checking out every snapshot, munging the filesystem, and shoving the new snapshot back in. Also, /dev/shm is a tmpfs mount for better throughput, and the directory named with -d shouldn't exist.

The git filter-branch manpage has a checklist for shrinking a repository that recommends running filter-branch and then cloning to leave behind the cruft.

Cloning with a filesystem path makes hardlinks, so use a URL:

$ git clone file:///home/gbacon/src/dBTools.git
Even after doing this, some big unnamed blobs had survived the clone. Thanks to #git on freenode for the suggestion to excise the reflog:
$ git reflog expire --verbose --expire=0 --all
$ git gc --prune=0
Note that these options will require a fairly recent git.

After all these steps, the git repositories were went from gigabytes to 75 and 100 megabytes, much nicer!