Monday, August 31, 2009

Finding duplicates with Perl and Haskell

A coworker wanted to check a family of log files to be sure that a given task never appeared on multiple nodes at the same time. Log entries are on single, whitespace-separated lines, and the last field records a task's start time, e.g.,
Of the three underscore-separated fields, the first is a timestamp, the second we don't care about, and the third is a task identifier.

This task is straightforward with Perl. The diamond operator (or null filehandle, as described in the "I/O Operators" section of the perlop manpage) takes care of the boilerplate for iterating over the paths on the command line, opening them, and reading each line. The scalar $ARGV contains the name of the current file.

By default, split separates fields by whitespace, so (split)[-1] gives us the last field, from which we then grab the time and task with a regular expression and record its presence by pushing the entry's path and line number onto an array associated with that time/task pair. After we've processed the logs, these arrays should all be singletons.

The continue clause is a little weird but necessary because the special variable $., the current line number, does not reset on <>'s implicit opens. ARGV is a handle on the file being read.

With this data structure, detecting duplicates is a search for time/task pairs with multiple hits. We count duplicates and let the user know what we found.

#! /usr/bin/perl

use warnings;
use strict;

# e.g., $hits = @{ $seen{$time}{$task} };
my %seen;

sub num { $a <=> $b }

while (<>) {
  if ((split)[-1] =~ /^(\d+)_\d+_(\d+)$/) {
    my($time,$task) = ($1,$2);
    push @{ $seen{$time}{$task} } => "$ARGV:$.";
  else {
    die "$0: $ARGV:$.: bad timestamp/task field\n";
continue {
  close ARGV if eof;

my $duplicates = 0;
foreach my $time (sort num keys %seen) {
  foreach my $task (sort num keys %{ $seen{$time} }) {
    my @hits = @{ $seen{$time}{$task} };
    next if @hits == 1;

    $duplicates += @hits - 1;
    warn "$0: duplicates for time=$time, task=$task:\n",
         map "    - $_\n", @hits;

my $s = $duplicates == 1 ? "" : "s";
print "$0: $duplicates duplicate$s detected.\n";

exit $duplicates == 0 ? 0 : 1;

For comparison, I implemented the same log checker in Haskell. The function allInputs emulates Perl's diamond operator, and instead of a multi-level hash, the association is more direct: time/task pair to a list of hits.

module Main where

import Control.Monad (liftM)
import Data.List (sort)
import Data.Map (empty,filter,fromListWith,toList,unionWith)
import Prelude hiding (filter)
import System.Environment (getArgs,getProgName)
import System.Exit (ExitCode(..),exitWith)
import Text.Printf (printf)

type Time = String
type Task = String
data Duplicates =
  Duplicates { timestamp :: Time
             , taskId    :: Task
             , locations :: [(FilePath, Int)]

main :: IO ()
main = do
  logs <- allInputs
  let multi = dups logs
      n = sum $ map (subtract 1 . length . locations) multi
  mapM_ (msg . lines . dupmsg) multi
  msg $ ndups n
  exitWith $ if n == 0
               then ExitSuccess
               else ExitFailure 1
    msg info = do me <- getProgName
                  putStrLn $ me ++ ": " ++ head info
                  mapM_ putStrLn (tail info)

    ndups 1 = ["1 duplicate detected"]
    ndups n = [show n ++ " duplicates detected"]

    dupmsg (Duplicates tm task ls) = unlines $
      printf "duplicates for time=%s, task=%s:" tm task :
      map (\(path,n) -> printf "    - %s:%d" path n) ls

allInputs :: IO [(FilePath, String)]
allInputs = getArgs >>= go
  where go [] = ((:[]) . (,) "-"`liftM` getContents
        go fs = mapM readFile fs >>= return . zip fs

dups :: [(FilePath, String)] -> [Duplicates]
dups = map (\((tm,task),ds) -> Duplicates tm task ds) .
       sort .
       toList .
       filter ((> 1. length) .
       foldl (unionWith (++)) empty .
       map (\(path, contents) ->
              fromListWith (++$
              map (wrap path . getTimeTask) $
              zip [1..$ lines contents)
    wrap path (tm,task,n) = ((tm,task), [(path,n)])

getTimeTask :: (Int,String) -> (Time,Task,Int)
getTimeTask (n,line) = (tm,tsk,n)
    [tm,_,tsk] = splitBy '_' (last $ words line)

    splitBy :: Eq a => a -> [a] -> [[a]]
    splitBy _ [] = []
    splitBy x xs = h : splitBy x t
      where (h,rest) = break (== x) xs
            t = drop 1 rest


Anonymous said...

Seems fine.

In perl I would've have made lists, I would've just made a sub hash.

For haskell I would've just used tuples, sorted the list using sortBy and then use partitionBy (all defined in the List module). Then used filter to find partitions of size > 2.

Anonymous said...

For 1 log file:

cat log | \
sed -e 's/_/\t/g' | \
awk '{print $1 " " $3}' | \ sort | uniq -c | \
awk '{if ($1 > 1) { print $0 } }'

Anonymous said...

I think the Haskell code is very un-fluent and it takes a lot of practice to write that kind of code concisely. I'm still working on it myself so I thought I would give it a try. I'm sure it's possible to do much better with what I came up with, which follows:

import Data.List
import Data.Function (on)
import Control.Monad
import System.Environment
import Text.Printf

type Timestamp = String
type Task = String
type LineNo = Int

main = do
ai <- allInputs
mapM_ report $ groupBy ((==)`on`fst) (sort ai)

report :: [Quad] -> IO ()
report qs = do
when (length qs > 1) $ do
let ((time,task),_) = head qs
printf "duplicates for time=%s, task=%s:\n" time task
forM_ qs $ \q->do
let (_,(path,lineno)) = q
printf " - %s:%d" path lineno

type Quad = ((Timestamp, Task), (FilePath, LineNo))

allInputs :: IO [Quad]
allInputs = do
pathlist <- getArgs
zq <- (mapM rp pathlist)
return $ foldr (++) [] zq

rp :: FilePath -> IO [Quad]
rp path = do
s <- readFile path
let sl = map parse_line (lines s)
return [((timestamp,task),(path,i))
| (i,(timestamp,task)) <- zip [1..] sl]

parse_line :: String -> (Timestamp, Task)
parse_line line = (x1, tail x4)
sp = span (/= '_')
(x1,x2) = sp line
(x3,x4) = sp (tail x2)

Anonymous said...

That came out messed up, so I pastebinned it:

rgs26 said...

Quick hack in Haskell -- 12 lines excluding imports. I think it probably takes some experience for this to be considered "readable", though...

Jake McArthur said...

Another Haskell version, this time with Parsec. Wrapped to 80 columns, it weighs in at nine lines of actual code, not counting imports and optional type signatures and white space. I think it's also quite readable provided that you have an understanding of Parsec, Map, and Monad.

Jake McArthur said...

Ah, my version doesn't do everything the other versions do! Disregard gloating about the line count. :)