perl - Processing a 1.5 Million Line File in under 5 Minutes
After long searches on the net, I decided to ask here about my problem. I have a set of CSV files (36 files in total) arriving every 5 minutes, each containing around 1.5 million lines, and I need to process them within 5 minutes. I have to parse the files and create the required directories for them inside a storage zone. Each unique line is translated into a file and put inside the related directory, and related lines are written into the related files. As you can see, there are lots of I/O operations.

I can finish 12 files in around 10 minutes; the target is to finish all 36 in 5 minutes. I am using Perl for the whole operation, and the problem I see is the system calls for the I/O operations.

I want to control the file handles and the I/O buffering in Perl so that I do not have to go and write to a file every single time. This is where I got lost, actually. Creating the directories also seems to consume time.

I searched CPAN and the web for anything that could shed light on this, but with no luck. Do you have any suggestions? What should I read, or how should I proceed? I believe Perl is more than capable of fixing this issue; I am probably just not using the correct tools.
use URI::URL;
use Digest::MD5 qw(md5_hex);

open(my $data, "<", $file);
my @lines = <$data>;
foreach (@lines) {
    chomp $_;
    my $line = $_;
    my @each = split(' ', $line);
    if (@each == 10) {
        my @logt = split('/', $each[3]);
        my $llg = 1;
        if ($logt[1] == 200) {
            $llg = 9;
        }
        my $urln = URI::URL->new($each[6]);
        my $netl = $urln->netloc;
        my $flnm = md5_hex($netl);
        my $urlm = md5_hex($each[6]);
        if (! -d "$outp/$flnm") {
            mkdir "$outp/$flnm", 0644;
        }
        open(my $csvf, ">>", "$outp/$flnm/${time}_${urlm}") or die $!;
        print $csvf int($each[0]) . ";" . $each[2] . ";" . $llg . "\n";
        close $csvf;   # --->> this is what I want to get rid of; can I use a buffer instead?
    }
    else {
        print $badf $line;
    }
}
Assume the code above is used inside a subroutine and runs in 12 threads; the parameter passed to it is the filename. I want to get rid of the close, because opening and closing a file for every single line makes system I/O calls and causes the slowness. That is my assumption, of course, and I am open to other suggestions.
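What I have in mind is something like caching the open output handles, so that each output file is opened at most once per run. A rough, untested sketch of the idea (the %open_fh hash and the fh_for helper are just my own naming):

use strict;
use warnings;

my %open_fh;    # path => filehandle, so each output file is opened only once

# return a cached append handle, creating the directory and opening the file
# on first use only
sub fh_for {
    my ($dir, $file) = @_;
    my $path = "$dir/$file";
    return $open_fh{$path} //= do {
        mkdir $dir, 0755 unless -d $dir;
        open my $fh, '>>', $path or die "can't open $path: $!";
        $fh;
    };
}

# inside the per-line loop, instead of open/print/close:
#   my $fh = fh_for("$outp/$flnm", "${time}_${urlm}");
#   print {$fh} int($each[0]) . ";" . $each[2] . ";" . $llg . "\n";

# close everything once at the end of the run
close $_ for values %open_fh;

What I am not sure about is the per-process limit on open file descriptors, since every distinct output file would stay open until the end of the run.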
Thanks in advance.
It seems possible that you will open the same file multiple times. If so, it might be beneficial to collect the information in a data structure and write the files only after the loop has completed. That avoids testing the existence of the same directory repeatedly, and opens each output file only once.
We should also get rid of URI::URL: creating a new object during each loop iteration is expensive given the performance requirements. If the URLs all look like http://user:password@example.com/path/ or https://example.com/, we can use a simple regex instead.
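For illustration, here is what that pattern captures for the two example URL shapes (the captured part corresponds to what netloc would give back here, including any user:password@ prefix):

# quick check of the host-extracting pattern against the two example URL shapes
for my $url ('http://user:password@example.com/path/', 'https://example.com/') {
    my ($server) = $url =~ m{\A\w+://([^/]+)};
    print "$url -> $server\n";
}
# prints:
#   http://user:password@example.com/path/ -> user:password@example.com
#   https://example.com/                   -> example.com

With that in place, the whole thing could be restructured roughly like this: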
use Digest::MD5 qw(md5_hex);

open my $data, "<", $file or die "can't open $file: $!";

my %entries;    # collect the entries here during the loop

# read one line at a time, don't keep unnecessary ballast around
while (my $line = <$data>) {
    chomp $line;
    my @each = split ' ', $line;
    if (@each != 10) {
        print {$badf} $line;
        next;
    }
    my (undef, $logt) = split '/', $each[3];
    my $llg = ($logt == 200) ? 9 : 1;
    my $url = $each[6];
    my ($server) = $url =~ m{\A\w+://([^/]+)};
    push @{ $entries{$server}{$url} }, sprintf "%d;%s;%d\n", $each[0], $each[2], $llg;
}

while (my ($dir, $files) = each %entries) {
    my $dir_hash = md5_hex($dir);
    my $dirname  = "$outp/$dir_hash";
    mkdir $dirname, 0644 or die "can't create $dirname: $!" unless -d $dirname;

    while (my ($file, $lines) = each %$files) {
        my $file_hash = md5_hex($file);
        my $filename  = "$dirname/${time}_${file_hash}";
        open my $csv_fh, ">>", $filename or die "can't open $filename: $!";
        print {$csv_fh} @$lines;
    }
}
I also cleaned up other aspects of the code (e.g. variable naming, error handling). I moved the md5_hex calls out of the main loop, but depending on the kind of data it may be better not to delay the hashing.
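If it does turn out to be better to hash up front, e.g. because the raw server names and URLs are long and keeping them all as hash keys costs too much memory, a variant that hashes inside the loop and keys %entries by the digests could look roughly like this (a sketch only, reusing $data, $badf and the other variables from the code above):

my %entries;

# same reading loop, but hash the server and the URL immediately and use the
# 32-character hex digests as keys instead of the raw strings
while (my $line = <$data>) {
    chomp $line;
    my @each = split ' ', $line;
    if (@each != 10) {
        print {$badf} $line;
        next;
    }
    my (undef, $logt) = split '/', $each[3];
    my $llg = ($logt == 200) ? 9 : 1;
    my ($server) = $each[6] =~ m{\A\w+://([^/]+)};
    push @{ $entries{ md5_hex($server) }{ md5_hex($each[6]) } },
        sprintf "%d;%s;%d\n", $each[0], $each[2], $llg;
}

# the output loop then uses the keys directly as $dir_hash and $file_hash,
# without calling md5_hex again

The trade-off is that md5_hex then runs for every input line instead of once per distinct server and URL.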