Optimize Perl script to filter rows based on a date in the file


I am a beginner in programming, not just in Perl! Please let me know what needs to change or how else this can be done.

I need to optimize my Perl code to run faster. In a test run on a file of around 500 MB with 3 million rows, the runtime was 28 minutes.

I know of a tool that processes 39 million rows in 15 minutes, and I want to achieve the same from the command prompt without resorting to that tool.

Earlier I used Date::Manip and Date::Parse, then moved on to DateTime, thinking it should be faster.

My approach: if the dates are in ISO-8601 form (i.e., YYYY-MM-DD), there is no need to validate them, and they can be compared lexicographically (i.e., with the lt and gt operators).
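A small sketch of why the field order matters for string comparison. The sample timestamps are made up for illustration; only the month-first layout comes from the input file described here.

```perl
use strict;
use warnings;

# Year-first timestamps sort the same way as strings and as dates.
my $earlier_iso = '2012-12-31 23:59:59';
my $later_iso   = '2013-01-01 00:00:00';
my $iso_ok = $earlier_iso lt $later_iso ? 1 : 0;   # 1: string order matches date order

# Month-first timestamps (the layout used in the input file) do not.
my $earlier_us = '12/31/2012 23:59:59';
my $later_us   = '01/01/2013 00:00:00';
my $us_ok = $earlier_us lt $later_us ? 1 : 0;      # 0: "12" sorts after "01"

print "year-first order correct:  $iso_ok\n";
print "month-first order correct: $us_ok\n";
```

So lt and gt are only safe once the year comes first in the string.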

  • Input file date format: 07/18/2013 13:45:49
  • Input file size: 42 GB
  • Number of rows: 39 million
  • Column delimiter: |~|
  • Platform: GNU/Linux

I have tried both ">" and "gt" and did not find any difference in runtime.

Code snippet:

    use DateTime::Format::Strptime;

    $idate = "07/17/2013 00:00:00";

    $strp = DateTime::Format::Strptime->new(
        pattern => '%m/%d/%Y %H:%M:%S',
    );

    $inputdt = $strp->parse_datetime($idate);

    open (FILE, "myinputfile.dat") or die "Could not input file\n";
    while (defined(my $line = <FILE>)) {
        @chunks = split '[|]~[|]', $line;
        $fdate = $strp->parse_datetime($chunks[6]);
        if ( $fdate > $inputdt) {
            open(FILEOUT, ">>myoutputfile.dat") or die "Could not write\n";
            print FILEOUT "$line";
        }
    }
    close(FILE);
    close (FILEOUT);

There are two and a half big performance problems here:

  1. You open the output file in every iteration. Open it once, before the loop.
  2. parse_datetime returns a DateTime object. Object orientation in Perl implies significant overhead. Because the pattern is fixed, we can do the parsing ourselves and remove the object orientation.
  3. Reading a file in the GB range takes time. To speed that up, upgrade your hardware (e.g. to an SSD).
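The first point can be sketched as follows. This is a minimal illustration, not the final script: the in-memory strings stand in for myinputfile.dat and myoutputfile.dat so that the example runs on its own, and the filter condition is only a placeholder.

```perl
use strict;
use warnings;

# Stand-ins for the real files (an assumption made for this sketch);
# with real files, pass the file names instead of \$input and \$output.
my $input  = "a|~|07/16/2013 10:00:00\nb|~|07/18/2013 13:45:49\n";
my $output = '';

open(my $in,  '<',  \$input)  or die "Could not read input: $!\n";
open(my $out, '>>', \$output) or die "Could not write output: $!\n";  # opened once, before the loop

while (my $line = <$in>) {
    # Placeholder condition; the real date comparison goes here.
    print {$out} $line if $line =~ m{07/18/2013};
}

close($in);
close($out) or die "Could not close output: $!\n";

print $output;   # b|~|07/18/2013 13:45:49
```

Lexical filehandles and three-argument open are used here as a matter of style; the performance gain comes from moving the open out of the loop.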

To turn the date string into a sortable representation, we reorder the various parts of the string:

    # %m/%d/%Y %H:%M:%S → %Y/%m/%d %H:%M:%S
    $fdate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;

    if ($fdate gt $inputdate) { ... }
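As a worked example, the substitution can be wrapped in a small helper and applied to a few sample dates. The name sortable_date and the sample values are hypothetical; the regex is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: "MM/DD/YYYY HH:MM:SS" becomes "YYYY/MM/DD HH:MM:SS".
sub sortable_date {
    my ($date) = @_;
    # With /x the spaces in the pattern are ignored; $1 is "MM/DD", $2 is "YYYY".
    $date =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;
    return $date;
}

my $cutoff = sortable_date('07/17/2013 00:00:00');   # 2013/07/17 00:00:00
my $row    = sortable_date('07/18/2013 13:45:49');   # 2013/07/18 13:45:49
my $old    = sortable_date('12/31/2012 23:59:59');   # 2012/12/31 23:59:59

print "$row is after the cutoff\n"     if     $row gt $cutoff;
print "$old is not after the cutoff\n" unless $old gt $cutoff;
```

Because the time part is already ordered from hours down to seconds, only the date part needs rearranging.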

This would lead to the following code:

    use strict; use warnings;

    # Zero-based index of the date column, taken from the command line
    use constant DATE_FIELD => shift @ARGV;

    my $inputdate = shift @ARGV;
    $inputdate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;

    <>; # remove header line

    while (<>) {
        # Split only as far as the date column needs
        my $filedate = (split /\|~\|/, $_, DATE_FIELD + 2)[DATE_FIELD];
        $filedate =~ s{^ ([0-9]{2} / [0-9]{2}) / ([0-9]{4}) }{$2/$1}x;
        print if $filedate gt $inputdate;
    }
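The third argument to split in the loop is worth a closer look: it caps the number of fields, so Perl stops splitting once the date column has been isolated instead of cutting up the rest of a long row. A small sketch with a made-up ten-column row (the f0..f9 names are placeholders):

```perl
use strict;
use warnings;

# A made-up row with the date in column 6 (zero-based).
my $line = join '|~|', qw(f0 f1 f2 f3 f4 f5), '07/18/2013 13:45:49', qw(f7 f8 f9);
my $date_field = 6;

# LIMIT = index + 2: columns 0..6 are split off, everything after stays in one piece.
my @fields = split /\|~\|/, $line, $date_field + 2;

print scalar(@fields), " fields\n";      # 8 fields
print "date: $fields[$date_field]\n";    # date: 07/18/2013 13:45:49
print "rest: $fields[-1]\n";             # rest: f7|~|f8|~|f9
```

How much this saves depends on how many columns follow the date.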

The date column index, the start date, and the input and output files are specified on the command line, e.g.

    ./script 6 '07/17/2013 00:00:00' myinputfile.dat >>myoutputfile.dat
