perl - Processing a 1.5 Million Line File in under 5 Minutes
After long searches on the net, I decided to ask here about my problem. I have a set of CSV files (36 files in total) arriving every 5 minutes, each containing around 1.5 million lines, and I need to process them within 5 minutes. I have to parse the files and create the required directories for them inside a storage zone. Each unique line is translated into a file and put inside the related directory, and related lines are written into the related files. As you can see, there are lots of I/O operations.

I can finish 12 files in around 10 minutes; the target is to finish all 36 in 5 minutes. I am using Perl for the whole operation, and the problem I see is the system calls for the I/O operations.

I want to control the file handles and the I/O buffering in Perl so that I do not have to go and write to a file every single time. This is where I got lost, actually. Creating the directories also seems to consume time.

I searched CPAN and the web for anything that could shed light on this, but with no luck. Do you have any suggestions? What should I read, or how should I proceed? I believe Perl is more than capable of fixing this issue; I am probably just not using the correct tools.
use URI::URL;
use Digest::MD5 qw(md5_hex);

open(my $data, "<", $file);
my @lines = <$data>;
foreach (@lines) {
    chomp $_;
    my $line = $_;
    my @each = split(' ', $line);
    if (@each == 10) {
        my @logt = split('/', $each[3]);
        my $llg = 1;
        if ($logt[1] == 200) {
            $llg = 9;
        }
        my $urln = URI::URL->new($each[6]);
        my $netl = $urln->netloc;
        my $flnm = md5_hex($netl);
        my $urlm = md5_hex($each[6]);
        if (! -d "$outp/$flnm") {
            mkdir "$outp/$flnm", 0644;
        }
        open(my $csvf, ">>", "$outp/$flnm/${time}_${urlm}") or die $!;
        print $csvf int($each[0]) . ";" . $each[2] . ";" . $llg . "\n";
        close $csvf;   # --->> this is what I want to get rid of; can I use a buffer instead?
    }
    else {
        print $badf $line;
    }
}
Assume the code above is used inside a subroutine and runs in 12 threads; the parameter passed to it is the filename. I want to get rid of the close, because opening and closing a file for every single line makes system I/O calls and causes the slowness. That is my assumption, of course, and I am open to other suggestions.
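What I have in mind is something like caching the open output handles, so that each output file is opened at most once per run. A rough, untested sketch of the idea (the %open_fh hash and the fh_for helper are just my own naming):

use strict;
use warnings;

my %open_fh;    # path => filehandle, so each output file is opened only once

# return a cached append handle, creating the directory and opening the file
# on first use only
sub fh_for {
    my ($dir, $file) = @_;
    my $path = "$dir/$file";
    return $open_fh{$path} //= do {
        mkdir $dir, 0755 unless -d $dir;
        open my $fh, '>>', $path or die "can't open $path: $!";
        $fh;
    };
}

# inside the per-line loop, instead of open/print/close:
#   my $fh = fh_for("$outp/$flnm", "${time}_${urlm}");
#   print {$fh} int($each[0]) . ";" . $each[2] . ";" . $llg . "\n";

# close everything once at the end of the run
close $_ for values %open_fh;

What I am not sure about is the per-process limit on open file descriptors, since every distinct output file would stay open until the end of the run.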
Thanks in advance.
It seems possible that you will open the same file multiple times. If so, it might be beneficial to collect the information in a data structure and write the files only after the loop has completed. That avoids testing the existence of the same directory repeatedly, and opens each output file only once.
We should also get rid of URI::URL: creating a new object during each loop iteration is expensive given the performance requirements. If the URLs all look like http://user:password@example.com/path/ or https://example.com/, we can use a simple regex instead.
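For illustration, here is what that pattern captures for the two example URL shapes (the captured part corresponds to what netloc would give back here, including any user:password@ prefix):

# quick check of the host-extracting pattern against the two example URL shapes
for my $url ('http://user:password@example.com/path/', 'https://example.com/') {
    my ($server) = $url =~ m{\A\w+://([^/]+)};
    print "$url -> $server\n";
}
# prints:
#   http://user:password@example.com/path/ -> user:password@example.com
#   https://example.com/                   -> example.com

With that in place, the whole thing could be restructured roughly like this: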
use Digest::MD5 qw(md5_hex);

open my $data, "<", $file or die "can't open $file: $!";

my %entries;    # collect the entries here during the loop

# read one line at a time, don't keep unnecessary ballast around
while (my $line = <$data>) {
    chomp $line;
    my @each = split ' ', $line;
    if (@each != 10) {
        print {$badf} $line;
        next;
    }
    my (undef, $logt) = split '/', $each[3];
    my $llg = ($logt == 200) ? 9 : 1;
    my $url = $each[6];
    my ($server) = $url =~ m{\A\w+://([^/]+)};
    push @{ $entries{$server}{$url} }, sprintf "%d;%s;%d\n", $each[0], $each[2], $llg;
}

while (my ($dir, $files) = each %entries) {
    my $dir_hash = md5_hex($dir);
    my $dirname  = "$outp/$dir_hash";
    mkdir $dirname, 0644 or die "can't create $dirname: $!" unless -d $dirname;

    while (my ($file, $lines) = each %$files) {
        my $file_hash = md5_hex($file);
        my $filename  = "$dirname/${time}_${file_hash}";
        open my $csv_fh, ">>", $filename or die "can't open $filename: $!";
        print {$csv_fh} @$lines;
    }
}
I also cleaned up other aspects of the code (e.g. variable naming, error handling). I moved the md5_hex calls out of the main loop, but depending on the kind of data it may be better not to delay the hashing.
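If it does turn out to be better to hash up front, e.g. because the raw server names and URLs are long and keeping them all as hash keys costs too much memory, a variant that hashes inside the loop and keys %entries by the digests could look roughly like this (a sketch only, reusing $data, $badf and the other variables from the code above):

my %entries;

# same reading loop, but hash the server and the URL immediately and use the
# 32-character hex digests as keys instead of the raw strings
while (my $line = <$data>) {
    chomp $line;
    my @each = split ' ', $line;
    if (@each != 10) {
        print {$badf} $line;
        next;
    }
    my (undef, $logt) = split '/', $each[3];
    my $llg = ($logt == 200) ? 9 : 1;
    my ($server) = $each[6] =~ m{\A\w+://([^/]+)};
    push @{ $entries{ md5_hex($server) }{ md5_hex($each[6]) } },
        sprintf "%d;%s;%d\n", $each[0], $each[2], $llg;
}

# the output loop then uses the keys directly as $dir_hash and $file_hash,
# without calling md5_hex again

The trade-off is that md5_hex then runs for every input line instead of once per distinct server and URL.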