1251475056672590000_1732248586_4
Of the three underscore-separated fields, the first is a timestamp, the second we don't care about, and the third is a task identifier.
This task is straightforward with Perl. The diamond operator (or null filehandle, as described in the "I/O Operators" section of the perlop manpage) takes care of the boilerplate for iterating over the paths on the command line, opening them, and reading each line. The scalar $ARGV contains the name of the current file.
By default, split separates fields by whitespace, so (split)[-1] gives us the last field. From that field we grab the time and task with a regular expression and record the entry by pushing its path and line number onto an array associated with that time/task pair. After we've processed the logs, these arrays should all be singletons.
The continue clause is a little weird but necessary because the special variable $., the current line number, does not reset on <>'s implicit opens. ARGV is a handle on the file currently being read; explicitly closing it at end-of-file resets $., so the line numbers we record stay correct for each file.
With this data structure, detecting duplicates is a search for time/task pairs with multiple hits. We count duplicates and let the user know what we found.
#! /usr/bin/perl

use warnings;
use strict;

# e.g., $hits = @{ $seen{$time}{$task} };
my %seen;

sub num { $a <=> $b }

while (<>) {
  if ((split)[-1] =~ /^(\d+)_\d+_(\d+)$/) {
    my($time,$task) = ($1,$2);
    push @{ $seen{$time}{$task} } => "$ARGV:$.";
  }
  else {
    die "$0: $ARGV:$.: bad timestamp/task field\n";
  }
}
continue {
  close ARGV if eof;
}

my $duplicates = 0;
foreach my $time (sort num keys %seen) {
  foreach my $task (sort num keys %{ $seen{$time} }) {
    my @hits = @{ $seen{$time}{$task} };
    next if @hits == 1;

    $duplicates += @hits - 1;
    warn "$0: duplicates for time=$time, task=$task:\n",
         map " - $_\n", @hits;
  }
}

my $s = $duplicates == 1 ? "" : "s";
print "$0: $duplicates duplicate$s detected.\n";

exit $duplicates == 0 ? 0 : 1;
For comparison, I implemented the same log checker in Haskell. The function allInputs emulates Perl's diamond operator, and instead of a multi-level hash, the association is more direct: a Map from time/task pair to a list of hits.
module Main where

import Control.Monad (liftM)
import Data.List (sort)
import Data.Map (empty,filter,fromListWith,toList,unionWith)
import Prelude hiding (filter)
import System.Environment (getArgs,getProgName)
import System.Exit (ExitCode(..),exitWith)
import Text.Printf (printf)

type Time = String
type Task = String

data Duplicates =
  Duplicates { timestamp :: Time
             , taskId    :: Task
             , locations :: [(FilePath, Int)]
             }

main :: IO ()
main = do
  logs <- allInputs
  let multi = dups logs
      n = sum $ map (subtract 1 . length . locations) multi
  mapM_ (msg . lines . dupmsg) multi
  msg $ ndups n
  exitWith $ if n == 0
               then ExitSuccess
               else ExitFailure 1
  where
    msg info = do me <- getProgName
                  putStrLn $ me ++ ": " ++ head info
                  mapM_ putStrLn (tail info)

    ndups 1 = ["1 duplicate detected"]
    ndups n = [show n ++ " duplicates detected"]

    dupmsg (Duplicates tm task ls) = unlines $
      printf "duplicates for time=%s, task=%s:" tm task :
      map (\(path,n) -> printf " - %s:%d" path n) ls

allInputs :: IO [(FilePath, String)]
allInputs = getArgs >>= go
  where go [] = ((:[]) . (,) "-") `liftM` getContents
        go fs = mapM readFile fs >>= return . zip fs

dups :: [(FilePath, String)] -> [Duplicates]
dups = map (\((tm,task),ds) -> Duplicates tm task ds) .
       sort .
       toList .
       filter ((> 1) . length) .
       foldl (unionWith (++)) empty .
       map (\(path, contents) ->
              fromListWith (++) $
              map (wrap path . getTimeTask) $
              zip [1..] $ lines contents)
  where
    wrap path (tm,task,n) = ((tm,task), [(path,n)])

    getTimeTask :: (Int,String) -> (Time,Task,Int)
    getTimeTask (n,line) = (tm,tsk,n)
      where
        [tm,_,tsk] = splitBy '_' (last $ words line)

    splitBy :: Eq a => a -> [a] -> [[a]]
    splitBy _ [] = []
    splitBy x xs = h : splitBy x t
      where (h,rest) = break (== x) xs
            t = drop 1 rest
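As a quick sanity check, splitBy carves up the sample field the same way the Perl regular expression does. (This is a GHCi transcript; in the program above splitBy lives in dups' where clause, so it would need to be lifted to the top level or retyped at the prompt to try it.)

ghci> splitBy '_' "1251475056672590000_1732248586_4"
["1251475056672590000","1732248586","4"]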
7 comments:
Seems fine.
In Perl I wouldn't have made lists; I would've just made a sub hash.
For Haskell I would've just used tuples, sorted the list using sortBy, and then used partitionBy (all defined in the List module). Then I would've used filter to find partitions of size > 2.
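A minimal sketch of that grouping approach, assuming groupBy from Data.List stands in for the partitionBy mentioned, and assuming entries have already been parsed into ((time, task), (path, line)) tuples:

import Data.Function (on)
import Data.List (groupBy, sort)

-- Group parsed entries by their (time, task) key and keep only the groups
-- with more than one hit; each surviving group is a set of duplicates.
duplicateGroups :: [((String, String), (FilePath, Int))]
                -> [[((String, String), (FilePath, Int))]]
duplicateGroups = filter ((> 1) . length)   -- only keys seen more than once
                . groupBy ((==) `on` fst)   -- group hits sharing a (time, task) key
                . sort                      -- sorting brings equal keys together

This mirrors the filter ((> 1) . length) step in the post's Map-based version, just expressed over a sorted list instead of a Map.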
For 1 log file:
cat log | \
sed -e 's/_/\t/g' | \
awk '{print $1 " " $3}' | \
sort | uniq -c | \
awk '{if ($1 > 1) { print $0 } }'
I think the Haskell code is very un-fluent, and it takes a lot of practice to write that kind of code concisely. I'm still working on it myself, so I thought I would give it a try. I'm sure it's possible to do much better than what I came up with, which follows:
import Data.List
import Data.Function (on)
import Control.Monad
import System.Environment
import Text.Printf

type Timestamp = String
type Task = String
type LineNo = Int

main = do
  ai <- allInputs
  mapM_ report $ groupBy ((==) `on` fst) (sort ai)

report :: [Quad] -> IO ()
report qs = do
  when (length qs > 1) $ do
    let ((time,task),_) = head qs
    printf "duplicates for time=%s, task=%s:\n" time task
    forM_ qs $ \q -> do
      let (_,(path,lineno)) = q
      printf " - %s:%d\n" path lineno

type Quad = ((Timestamp, Task), (FilePath, LineNo))

allInputs :: IO [Quad]
allInputs = do
  pathlist <- getArgs
  zq <- mapM rp pathlist
  return $ foldr (++) [] zq

rp :: FilePath -> IO [Quad]
rp path = do
  s <- readFile path
  let sl = map parse_line (lines s)
  return [((timestamp,task),(path,i))
         | (i,(timestamp,task)) <- zip [1..] sl]

parse_line :: String -> (Timestamp, Task)
parse_line line = (x1, tail x4)
  where
    sp = span (/= '_')
    (x1,x2) = sp line
    (x3,x4) = sp (tail x2)
That came out messed up, so I pastebinned it:
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8912
Quick hack in Haskell -- 12 lines excluding imports. I think it probably takes some experience for this to be considered "readable", though...
Another Haskell version, this time with Parsec. Wrapped to 80 columns, it weighs in at nine lines of actual code, not counting imports and optional type signatures and white space. I think it's also quite readable provided that you have an understanding of Parsec, Map, and Monad.
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8912#a8938
Ah, my version doesn't do everything the other versions do! Disregard gloating about the line count. :)