Monday, August 31, 2009

Finding duplicates with Perl and Haskell

A coworker wanted to check a family of log files to be sure that a given task never appeared on multiple nodes at the same time. Each log entry is a single line of whitespace-separated fields, and the last field records a task's start time, e.g.,
1251475056672590000_1732248586_4
Of the three underscore-separated fields, the first is a timestamp, the second we don't care about, and the third is a task identifier.

This task is straightforward with Perl. The diamond operator (or null filehandle, as described in the "I/O Operators" section of the perlop manpage) takes care of the boilerplate for iterating over the paths on the command line, opening them, and reading each line. The scalar $ARGV contains the name of the current file.

By default, split separates fields by whitespace, so (split)[-1] gives us the last field, from which we then grab the time and task with a regular expression and record its presence by pushing the entry's path and line number onto an array associated with that time/task pair. After we've processed the logs, these arrays should all be singletons.

The continue clause is a little weird but necessary because the special variable $., the current line number, does not reset on <>'s implicit opens. ARGV is a handle on the file being read.

With this data structure, detecting duplicates is a search for time/task pairs with multiple hits. We count duplicates and let the user know what we found.

#! /usr/bin/perl

use warnings;
use strict;

# e.g., $hits = @{ $seen{$time}{$task} };
my %seen;

sub num { $a <=> $b }

while (<>) {
  if ((split)[-1] =~ /^(\d+)_\d+_(\d+)$/) {
    my($time,$task) = ($1,$2);
    push @{ $seen{$time}{$task} } => "$ARGV:$.";
  }
  else {
    die "$0: $ARGV:$.: bad timestamp/task field\n";
  }
}
continue {
  close ARGV if eof;
}

my $duplicates = 0;
foreach my $time (sort num keys %seen) {
  foreach my $task (sort num keys %{ $seen{$time} }) {
    my @hits = @{ $seen{$time}{$task} };
    next if @hits == 1;

    $duplicates += @hits - 1;
    warn "$0: duplicates for time=$time, task=$task:\n",
         map "    - $_\n", @hits;
  }
}

my $s = $duplicates == 1 ? "" : "s";
print "$0: $duplicates duplicate$s detected.\n";

exit $duplicates == 0 ? 0 : 1;

For comparison, I implemented the same log checker in Haskell. The function allInputs emulates Perl's diamond operator, and instead of a multi-level hash, the association is more direct: time/task pair to a list of hits.

module Main where

import Control.Monad (liftM)
import Data.List (sort)
import Data.Map (empty,filter,fromListWith,toList,unionWith)
import Prelude hiding (filter)
import System.Environment (getArgs,getProgName)
import System.Exit (ExitCode(..),exitWith)
import Text.Printf (printf)

type Time = String
type Task = String
data Duplicates =
  Duplicates { timestamp :: Time
             , taskId    :: Task
             , locations :: [(FilePath, Int)]
             }

main :: IO ()
main = do
  logs <- allInputs
  let multi = dups logs
      n = sum $ map (subtract 1 . length . locations) multi
  mapM_ (msg . lines . dupmsg) multi
  msg $ ndups n
  exitWith $ if n == 0
               then ExitSuccess
               else ExitFailure 1
  where
    msg info = do me <- getProgName
                  putStrLn $ me ++ ": " ++ head info
                  mapM_ putStrLn (tail info)

    ndups 1 = ["1 duplicate detected"]
    ndups n = [show n ++ " duplicates detected"]

    dupmsg (Duplicates tm task ls) = unlines $
      printf "duplicates for time=%s, task=%s:" tm task :
      map (\(path,n) -> printf "    - %s:%d" path n) ls

allInputs :: IO [(FilePath, String)]
allInputs = getArgs >>= go
  where go [] = ((:[]) . (,) "-") `liftM` getContents
        go fs = mapM readFile fs >>= return . zip fs

dups :: [(FilePath, String)] -> [Duplicates]
dups = map (\((tm,task),ds) -> Duplicates tm task ds) .
       sort .
       toList .
       filter ((> 1) . length) .
       foldl (unionWith (++)) empty .
       map (\(path, contents) ->
              fromListWith (++) $
              map (wrap path . getTimeTask) $
              zip [1..] $ lines contents)
  where
    wrap path (tm,task,n) = ((tm,task), [(path,n)])

getTimeTask :: (Int,String) -> (Time,Task,Int)
getTimeTask (n,line) = (tm,tsk,n)
  where
    [tm,_,tsk] = splitBy '_' (last $ words line)

    splitBy :: Eq a => a -> [a] -> [[a]]
    splitBy _ [] = []
    splitBy x xs = h : splitBy x t
      where (h,rest) = break (== x) xs
            t = drop 1 rest

7 comments:

  1. Anonymous, 4:51 PM

    Seems fine.

    In Perl I wouldn't have made lists; I would've just made a sub hash.

    For Haskell I would've just used tuples, sorted the list using sortBy, and then grouped with groupBy (all defined in Data.List). Then used filter to find groups of size > 1.

  2. Anonymous, 5:05 PM

    For 1 log file:

    cat log | \
    sed -e 's/_/\t/g' | \
    awk '{print $1 " " $3}' | \
    sort | uniq -c | \
    awk '{if ($1 > 1) { print $0 } }'

  3. Anonymous, 2:34 AM

    I think the Haskell code is very un-fluent and it takes a lot of practice to write that kind of code concisely. I'm still working on it myself so I thought I would give it a try. I'm sure it's possible to do much better with what I came up with, which follows:

    import Data.List
    import Data.Function (on)
    import Control.Monad
    import System.Environment
    import Text.Printf

    type Timestamp = String
    type Task = String
    type LineNo = Int

    main = do
      ai <- allInputs
      mapM_ report $ groupBy ((==) `on` fst) (sort ai)

    report :: [Quad] -> IO ()
    report qs = do
      when (length qs > 1) $ do
        let ((time,task),_) = head qs
        printf "duplicates for time=%s, task=%s:\n" time task
        forM_ qs $ \q -> do
          let (_,(path,lineno)) = q
          printf "    - %s:%d\n" path lineno

    type Quad = ((Timestamp, Task), (FilePath, LineNo))

    allInputs :: IO [Quad]
    allInputs = do
      pathlist <- getArgs
      zq <- mapM rp pathlist
      return $ foldr (++) [] zq

    rp :: FilePath -> IO [Quad]
    rp path = do
      s <- readFile path
      let sl = map parse_line (lines s)
      return [((timestamp,task),(path,i))
             | (i,(timestamp,task)) <- zip [1..] sl]

    parse_line :: String -> (Timestamp, Task)
    parse_line line = (x1, tail x4)
      where
        sp = span (/= '_')
        (x1,x2) = sp line
        (x3,x4) = sp (tail x2)

  4. Anonymous, 2:35 AM

    That came out messed up, so I pastebinned it:

    http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8912

  5. Anonymous, 6:26 AM

    Quick hack in Haskell -- 12 lines excluding imports. I think it probably takes some experience for this to be considered "readable", though...

  6. Another Haskell version, this time with Parsec. Wrapped to 80 columns, it weighs in at nine lines of actual code, not counting imports and optional type signatures and white space. I think it's also quite readable provided that you have an understanding of Parsec, Map, and Monad.

    http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8912#a8938

  7. Ah, my version doesn't do everything the other versions do! Disregard gloating about the line count. :)
