Monday, August 31, 2009

Finding duplicates with Perl and Haskell

A coworker wanted to check a family of log files to be sure that a given task never appeared on multiple nodes at the same time. Each log entry is a single line of whitespace-separated fields, and the last field records a task's start time, e.g.,
1251475056672590000_1732248586_4
Of the three underscore-separated fields, the first is a timestamp, the second we don't care about, and the third is a task identifier.

This task is straightforward with Perl. The diamond operator (or null filehandle, as described in the "I/O Operators" section of the perlop manpage) takes care of the boilerplate for iterating over the paths on the command line, opening them, and reading each line. The scalar $ARGV contains the name of the current file.

By default, split separates fields by whitespace, so (split)[-1] gives us the last field, from which we then grab the time and task with a regular expression and record its presence by pushing the entry's path and line number onto an array associated with that time/task pair. After we've processed the logs, these arrays should all be singletons.

The continue clause is a little weird but necessary because the special variable $., the current line number, does not reset on <>'s implicit opens. ARGV is a handle on the file being read.

With this data structure, detecting duplicates is a search for time/task pairs with multiple hits. We count duplicates and let the user know what we found.

#! /usr/bin/perl

use warnings;
use strict;

# e.g., $hits = @{ $seen{$time}{$task} };
my %seen;

sub num { $a <=> $b }

while (<>) {
  if ((split)[-1] =~ /^(\d+)_\d+_(\d+)$/) {
    my($time,$task) = ($1,$2);
    push @{ $seen{$time}{$task} } => "$ARGV:$.";
  }
  else {
    die "$0: $ARGV:$.: bad timestamp/task field\n";
  }
}
continue {
  close ARGV if eof;
}

my $duplicates = 0;
foreach my $time (sort num keys %seen) {
  foreach my $task (sort num keys %{ $seen{$time} }) {
    my @hits = @{ $seen{$time}{$task} };
    next if @hits == 1;

    $duplicates += @hits - 1;
    warn "$0: duplicates for time=$time, task=$task:\n",
         map "    - $_\n", @hits;
  }
}

my $s = $duplicates == 1 ? "" : "s";
print "$0: $duplicates duplicate$s detected.\n";

exit $duplicates == 0 ? 0 : 1;

For comparison, I implemented the same log checker in Haskell. The function allInputs emulates Perl's diamond operator, and instead of a multi-level hash, the association is more direct: time/task pair to a list of hits.

module Main where

import Control.Monad (liftM)
import Data.List (sort)
import Data.Map (empty,filter,fromListWith,toList,unionWith)
import Prelude hiding (filter)
import System.Environment (getArgs,getProgName)
import System.Exit (ExitCode(..),exitWith)
import Text.Printf (printf)

type Time = String
type Task = String
data Duplicates =
  Duplicates { timestamp :: Time
             , taskId    :: Task
             , locations :: [(FilePath, Int)]
             }

main :: IO ()
main = do
  logs <- allInputs
  let multi = dups logs
      n = sum $ map (subtract 1 . length . locations) multi
  mapM_ (msg . lines . dupmsg) multi
  msg $ ndups n
  exitWith $ if n == 0
               then ExitSuccess
               else ExitFailure 1
  where
    msg info = do me <- getProgName
                  putStrLn $ me ++ ": " ++ head info
                  mapM_ putStrLn (tail info)

    ndups 1 = ["1 duplicate detected"]
    ndups n = [show n ++ " duplicates detected"]

    dupmsg (Duplicates tm task ls) = unlines $
      printf "duplicates for time=%s, task=%s:" tm task :
      map (\(path,n) -> printf "    - %s:%d" path n) ls

allInputs :: IO [(FilePath, String)]
allInputs = getArgs >>= go
  where go [] = ((:[]) . (,) "-") `liftM` getContents
        go fs = mapM readFile fs >>= return . zip fs

dups :: [(FilePath, String)] -> [Duplicates]
dups = map (\((tm,task),ds) -> Duplicates tm task ds) .
       sort .
       toList .
       filter ((> 1) . length) .
       foldl (unionWith (++)) empty .
       map (\(path, contents) ->
              fromListWith (++) $
              map (wrap path . getTimeTask) $
              zip [1..] $ lines contents)
  where
    wrap path (tm,task,n) = ((tm,task), [(path,n)])

getTimeTask :: (Int,String) -> (Time,Task,Int)
getTimeTask (n,line) = (tm,tsk,n)
  where
    [tm,_,tsk] = splitBy '_' (last $ words line)

    splitBy :: Eq a => a -> [a] -> [[a]]
    splitBy _ [] = []
    splitBy x xs = h : splitBy x t
      where (h,rest) = break (== x) xs
            t = drop 1 rest

7 comments:

  1. Anonymous, 4:51 PM

    Seems fine.

    In Perl I wouldn't have made lists; I would've just made a sub hash.

    For Haskell I would've just used tuples, sorted the list using sortBy, and then grouped with groupBy (all defined in Data.List). Then used filter to find groups of size > 1.

  2. Anonymous, 5:05 PM

    For 1 log file:

    cat log | \
    sed -e 's/_/\t/g' | \
    awk '{print $1 " " $3}' | \
    sort | uniq -c | \
    awk '{if ($1 > 1) { print $0 } }'

  3. Anonymous, 2:34 AM

    I think the Haskell code is very un-fluent and it takes a lot of practice to write that kind of code concisely. I'm still working on it myself so I thought I would give it a try. I'm sure it's possible to do much better with what I came up with, which follows:

    import Data.List
    import Data.Function (on)
    import Control.Monad
    import System.Environment
    import Text.Printf

    type Timestamp = String
    type Task = String
    type LineNo = Int

    main = do
      ai <- allInputs
      mapM_ report $ groupBy ((==) `on` fst) (sort ai)

    report :: [Quad] -> IO ()
    report qs = do
      when (length qs > 1) $ do
        let ((time,task),_) = head qs
        printf "duplicates for time=%s, task=%s:\n" time task
        forM_ qs $ \q -> do
          let (_,(path,lineno)) = q
          printf "    - %s:%d\n" path lineno

    type Quad = ((Timestamp, Task), (FilePath, LineNo))

    allInputs :: IO [Quad]
    allInputs = do
      pathlist <- getArgs
      zq <- mapM rp pathlist
      return $ foldr (++) [] zq

    rp :: FilePath -> IO [Quad]
    rp path = do
      s <- readFile path
      let sl = map parse_line (lines s)
      return [((timestamp,task),(path,i))
             | (i,(timestamp,task)) <- zip [1..] sl]

    parse_line :: String -> (Timestamp, Task)
    parse_line line = (x1, tail x4)
      where
        sp = span (/= '_')
        (x1,x2) = sp line
        (x3,x4) = sp (tail x2)

  4. Anonymous, 2:35 AM

    That came out messed up, so I pastebinned it:

    http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8912

  5. Anonymous, 6:26 AM

    Quick hack in Haskell -- 12 lines excluding imports. I think it probably takes some experience for this to be considered "readable", though...

  6. Another Haskell version, this time with Parsec. Wrapped to 80 columns, it weighs in at nine lines of actual code, not counting imports and optional type signatures and white space. I think it's also quite readable provided that you have an understanding of Parsec, Map, and Monad.

    http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8912#a8938

  7. Ah, my version doesn't do everything the other versions do! Disregard gloating about the line count. :)
