Optimize Perl script to filter rows based on date in the file
I am a beginner at programming, not just at Perl. Please let me know what needs to change or how else this can be done.
I need to optimize my Perl code to run faster. In a test run on a file of around 500 MB with 3 million rows, the runtime was 28 minutes.
I know of a tool that processes 39 million rows in 15 minutes, and I want to achieve that from the command prompt without resorting to a tool.
Earlier I used Date::Manip and Date::Parse, then moved on to DateTime, thinking it should be faster.
My approach: if the dates are in ISO 8601 format (i.e. YYYY-MM-DD), I do not need to validate them and can compare them lexicographically (i.e. with the lt and gt operators).
- Input file date format: 07/18/2013 13:45:49
- Input file size: 42 GB
- Number of rows: 39 million
- Column delimiter: |~|
- Platform: GNU/Linux
I have tried both ">" and "gt" and did not find any difference in runtime.
Code snippet:

    use DateTime::Format::Strptime;

    $idate = "07/17/2013 00:00:00";
    $strp = DateTime::Format::Strptime->new(
        pattern => '%m/%d/%Y %H:%M:%S',
    );
    $inputdt = $strp->parse_datetime($idate);

    open (FILE,"myinputfile.dat") or die "could not input file\n";
    while (defined(my $line = <FILE>)) {
        @chunks = split '[|]~[|]', $line;
        $fdate = $strp->parse_datetime($chunks[6]);
        if ( $fdate > $inputdt) {
            open(FILEOUT, ">>myoutputfile.dat") or die "could not write\n";
            print FILEOUT "$line";
        }
    }
    close(FILE);
    close (FILEOUT);
There are two and a half big performance problems here:
- You open the output file in every iteration. Open it once, before the loop.
- parse_datetime returns a DateTime object. Object orientation in Perl implies significant overhead. Because your pattern is well defined, we can do the parsing ourselves and remove the object orientation.
- Reading a file in the GB range takes time. To speed that up, upgrade your hardware (e.g. to an SSD).
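To illustrate the first point, here is a minimal sketch of a loop that opens the output file exactly once. The sub name filter_file, its arguments and the $keep callback are my own placeholders, not something from your script, so treat it as one possible shape rather than the definitive fix:

```perl
use strict;
use warnings;

# Hypothetical helper (names are illustrative): append every line of $in_path
# whose "|~|"-delimited field number $field_idx (0-based) passes the $keep
# callback to $out_path. Returns the number of lines written.
sub filter_file {
    my ($in_path, $out_path, $field_idx, $keep) = @_;

    open my $in, '<', $in_path or die "Could not read $in_path: $!\n";
    # The output handle is opened exactly once, before the loop starts.
    open my $out, '>>', $out_path or die "Could not write $out_path: $!\n";

    my $kept = 0;
    while (my $line = <$in>) {
        my @chunks = split /\|~\|/, $line;
        next unless defined($chunks[$field_idx]) && $keep->($chunks[$field_idx]);
        print {$out} $line;
        $kept++;
    }

    close $in;
    close $out or die "Could not close $out_path: $!\n";
    return $kept;
}
```

It could be called as filter_file('myinputfile.dat', 'myoutputfile.dat', 6, sub { ... }), with the callback holding whatever date comparison you settle on. Lexical filehandles and the three-argument open are my choice here; the speedup comes purely from moving the open out of the loop.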
To parse the date string into a sortable representation, we reorder the various parts of the string:
    # %m/%d/%Y %H:%M:%S → %Y/%m/%d %H:%M:%S
    $fdate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;
    if ($fdate gt $inputdate) { ... }
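As a small worked example of why the reordering matters: compared as raw strings, 12/31/2012 sorts after 07/17/2013 because "1" is greater than "0", although it is the earlier date. The sketch below wraps the substitution in a sub I have called sortable_date (a name of my own choosing) and compares two sample dates against the cutoff:

```perl
use strict;
use warnings;

# Hypothetical helper: turn "MM/DD/YYYY hh:mm:ss" into "YYYY/MM/DD hh:mm:ss".
# A string that does not match the pattern is returned unchanged.
sub sortable_date {
    my ($date) = @_;
    $date =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;
    return $date;
}

my $cutoff = sortable_date('07/17/2013 00:00:00');

for my $raw ('07/18/2013 13:45:49', '12/31/2012 23:59:59') {
    my $key = sortable_date($raw);
    # With the year first, string order matches chronological order.
    printf "%s -> %s (%s)\n", $raw, $key, $key gt $cutoff ? 'keep' : 'skip';
}
```

The first sample becomes 2013/07/18 13:45:49 and is kept, the second becomes 2012/12/31 23:59:59 and is skipped. Note that no validation happens here: this relies on every date in the file really having the stated format.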
This leads to the following code:
    use strict; use warnings;

    # The first two arguments are consumed here; the remaining ones are
    # the input files read by <>.
    use constant DATE_FIELD => shift @ARGV;
    my $inputdate = shift @ARGV;
    $inputdate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;

    <>; # remove header line

    while (<>) {
        # The split limit stops splitting right after the date field.
        my $filedate = (split /\|~\|/, $_, DATE_FIELD + 2)[DATE_FIELD];
        $filedate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;
        print if $filedate gt $inputdate;
    }
The input and output files, the index of the date field and the start date are all specified on the command line, e.g.
./script 6 '07/17/2013 00:00:00' myinputfile.dat >>myoutputfile.dat