Optimizing in Perl
By Peter Wad Sackett
Optimizing the code – minor gains 1
• Use ++$i and $st .= $data instead of $i = $i+1 and $st = $st . $data
• Use index instead of a regular expression when possible
• $st =~ tr/X/X/; to count occurrences of X
• Substitute en passant: ($st2 = $st1) =~ s/eat/drink/g;
• Start for loops from the end and count down to 0
• Use foreach instead of for
• Use each %hash instead of keys %hash
• Use single quotes instead of double quotes on literal strings
• Don’t use $` $& $' in regexes
• Don’t use $st =~ m/$var/, use $st =~ m/$var/o when $var does not change
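A few of these minor gains can be sketched on made-up data. The strings and variable values below are illustrative and not taken from the slides.

```perl
use strict;
use warnings;

my $st = 'XAXBXXC';

# Count occurrences of X without changing the string:
# tr/// returns the number of characters it matched.
my $count = ($st =~ tr/X/X/);

# index() finds a fixed substring without involving the regex engine.
my $pos = index($st, 'BXX');    # -1 when the substring is absent

# Substitute "en passant": copy $st1 into $st2, then edit the copy.
my $st1 = 'eat, eat and eat';
(my $st2 = $st1) =~ s/eat/drink/g;

# Increment and append in place instead of rebuilding the value.
my $i = 0;
++$i;
$st .= '!';

print "$count $pos $st2 $st\n";
```

The original $st1 is left untouched by the en passant substitution; only the copy in $st2 is modified.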
Optimizing the code – minor gains 2
• Subroutines have an overhead – avoid them in inner loops
• Use references in subroutine calls to avoid copying large data
• Blocks {} have an overhead
• Some loops can be moved into map and grep functions
• Large arrays/strings should be initialized from the far end to allocate all memory at once
• Avoid creating local hashes in inner loops
• Reject common cases early in loops
• Remove invariant code from loops
• Avoid system calls if possible
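Three of these points (references in subroutine calls, initializing from the far end, and grep in place of a loop) might look as follows. This is a minimal sketch; sum_ref and the array sizes are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes an array reference, so the sub does not
# copy the elements into an array of its own.
sub sum_ref {
    my ($numbers) = @_;
    my $sum = 0;
    $sum += $_ foreach @$numbers;
    return $sum;
}

my @big   = (1 .. 100_000);
my $total = sum_ref(\@big);

# Initialize a large array from the far end: assigning the last
# element first lets perl size the array in one step.
my @table;
$table[99_999] = 0;
$table[$_] = $_ * 2 for 0 .. 99_999;

# A filtering loop expressed with grep.
my @even = grep { $_ % 2 == 0 } @big;

print "$total ", scalar(@even), " $table[10]\n";
```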
Optimizing code – bigger gains
• Make (array) searches into hash lookups – the drawback is memory use
• Consider sorting data
• Cache results if appropriate – Memoize
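A small sketch of the first and last point, using the barcode list from the later slides. The function gc_content is a hypothetical example, not part of the original program; Memoize is the core module named on the slide.

```perl
use strict;
use warnings;
use Memoize;

my @barcodes = qw(GCGT CTGT CAGT GTCT ACCT);

# Array search: scans the whole list for every query.
my $found_slow = grep { $_ eq 'GTCT' } @barcodes;

# Hash lookup: build the index once, then each query is one lookup.
my %is_barcode = map { $_ => 1 } @barcodes;
my $found_fast = exists $is_barcode{'GTCT'};

# Caching with Memoize: repeated calls with the same argument
# return the stored result instead of recomputing it.
my $calls = 0;
sub gc_content {    # hypothetical example function
    my ($seq) = @_;
    ++$calls;
    return ($seq =~ tr/GC//) / length $seq;
}
memoize('gc_content');

gc_content('GCGT') for 1 .. 1000;
print "found: $found_slow/", ($found_fast ? 1 : 0), ", real calls: $calls\n";
```

The hash costs extra memory for the index, which is the drawback the slide mentions; the cache kept by Memoize has the same trade-off.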
Changing the algorithm – huge gains
• A program usually spends the bulk of the time in few places, the bottlenecks.
• Use a profiler to identify the places if they are not obvious – use Devel::NYTProf;
• Make the bottleneck part run faster by
  • Simple optimization
  • Avoiding it
  • Reusing results from previous invocations (caching)
  • Changing algorithm either by incremental insight or fundamental change
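Devel::NYTProf is normally run from the command line (perl -d:NYTProf script.pl, then nytprofhtml to build the report). Once a bottleneck is known, candidate replacements can be timed against each other. The sketch below uses the core Benchmark module, which is an addition here and not mentioned on the slides; the test string is made up.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Made-up data: a 4 KB string with the target at the very end.
my $st = ('ACGT' x 1000) . 'XXXX';

# Time two ways of finding a fixed substring.
my $results = timethese(100_000, {
    regex => sub { my $hit = $st =~ m/XXXX/ },
    index => sub { my $hit = index($st, 'XXXX') >= 0 },
});
```

timethese prints the timing of each variant and returns the results as a hash reference, so the numbers can also be inspected in code.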
Changing the environment
• If the program should run as part of a web server, consider
  • FastCGI
  • mod_perl
• If the program needs to load a lot of data, consider making a client/server solution.
• If using 'outside' resources like databases, then learn methods to access them fast.
  • DBI module has both fast and slow methods.
Reading a 60 GB NGS data file
The input: 1.6 billion records
    @HWI-ST301L:255:C0E64ACXX:1:1101:1234:2299 1:N:0:
    CTGTAACGTACCATAGGTTGACCATACTTCAAAAGCTGTACTCTCATGGCC
    +
    =1:BDDFFHDHHHJDHIF@EHGHIIGIIJIGDGGIGJJHCDIIIGGGHIG@
    @HWI-ST301L:255:C0E64ACXX:1:1101:1207:2311 1:N:0:
    CTGTACAGCTGGAGTCANGGGGCCTAGAGCTGTGGGGAGGGAGGTGCAGGG
    +
    8:1ADD?DDDDDDEED@#3AFFEFEFE@EBBDDEDIIBD?D@A>(=@@ADD
    @HWI-ST301L:255:C0E64ACXX:1:1101:1191:2350 1:N:0:
    CAGTATGCCCATCGCAGNTCGCTACACGCAGGACGCTTTTTCACGTTCTGG
    +
    1:144AA4+ABDDID?E#33A?EEEE>E@00?0?D:ADDII78@CDDACC9
    @HWI-ST301L:255:C0E64ACXX:1:1101:1113:2363 1:N:0:
    CCGTTAGCCACTGTAAGNACTGCTGGGGACACACTGCAGTCAAGCGAAGCG
    +
    111==DDDDDDDDIIII#3CFFIIIIIIEDIIIIIIIIIEIIIIIDDDIII
    @HWI-ST301L:255:C0E64ACXX:1:1101:1205:2373 1:N:0:
    GTCTAGCTGGAGAAGATNTTGAGGAACCTCCAGGAGGAAGAAGCCTCTGGG
    +
    =?@DDB;DFDDH?CE<?#33CAFGH@@@ECGHIIFGGD9DBFG;(9BCAA;
Stupid attempt
    my $file = $ARGV[0];
    my @bar = ('GACT','TCCT','CTGT','GTCT','ACCT','ACAT','CAGT','GCGT');
    my $prefix = $ARGV[1];
    # The whole input file is read once per barcode
    foreach my $bar (@bar) {
        open(IN, '<', $file) or die;
        my $id_line;
        while (my $line = <IN>) {
            if ($line =~ m/^\@F/) { $id_line = $line; }
            my $sequence = <IN>;
            if ($bar eq substr($sequence, 0, 4)) {
                print $id_line, substr($sequence, 4);
                my $line3 = <IN>;
                print $line3;
                my $line4 = <IN>;
                print substr($line4, 4);
            }
        }
        close IN;
    }
How fast can this go?
• Reading the file line by line with perl: 431 sec
• Reading in 64 KB blocks: 122 sec
• Simple copy with unix cp: 597 sec
• Perl copy in standard read line – write line fashion: 1197 sec
• Perl copying in read/write of 64 KB blocks: 632 sec
• Perl copy, read in 64 KB blocks, cache output in 100 MB: 404 sec
• Realistic perl copy, read in small blocks, cache output: 562 sec
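The difference between the first two measurements comes down to how the file is read. A minimal sketch of both styles follows, run on a small temporary file as a stand-in for the 60 GB input; the file contents and variable names are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small stand-in file; the real input was 60 GB.
my ($tmp, $name) = tempfile(UNLINK => 1);
print $tmp "line $_\n" for 1 .. 10_000;
close $tmp;

# Line by line: one readline call per line.
open(my $in, '<', $name) or die "Can't open $name: $!\n";
my $lines = 0;
++$lines while <$in>;
close $in;

# 64 KB blocks: far fewer read calls; newlines counted with tr.
open($in, '<', $name) or die "Can't open $name: $!\n";
my ($buffer, $newlines) = ('', 0);
while (read($in, $buffer, 65536)) {
    $newlines += ($buffer =~ tr/\n//);
}
close $in;

print "$lines $newlines\n";
```

Block reading only sees raw bytes, so any per-record work has to find the line boundaries itself, which is what versions 4 and 5 attempt.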
Version 1, 2193 sec
    my @bararray = qw(GCGT CTGT CAGT GTCT ACCT);
    my $file = $ARGV[0];
    open(IN, '<', $file) or die "Can't open $file, reason: $!\n";
    my ($id_line, $seq, $quality, %filehash);
    # One output file per barcode, plus one for everything else
    foreach my $tag (@bararray) {
        my $fh;
        open($fh, '>', "$tag.$file") or die "$!";
        $filehash{$tag} = $fh;
    }
    open(my $unassigned, '>', "untagged.$file") or die "$!";
    while (1) {
        $id_line = <IN>;
        last unless defined $id_line;
        $seq = <IN>;
        $quality = <IN>;    # the '+' line, overwritten by the next read
        $quality = <IN>;
        my $tag = substr($seq, 0, 4);
        # Strip the barcode from sequence and quality if it is a known one
        substr($seq, 0, 4, ''), substr($quality, 0, 4, '') if exists $filehash{$tag};
        my $fh = exists $filehash{$tag} ? $filehash{$tag} : $unassigned;
        print $fh $id_line . $seq . "+\n" . $quality;
    }
    close IN;
    close $unassigned;
    foreach my $fh (values %filehash) { close $fh; }
Version 2, cached output, 1305 sec
    my @barcodes = qw(GCGT CTGT CAGT GTCT ACCT);
    my $file = $ARGV[0];
    open(IN, '<', $file) or die "Can't open $file, reason: $!\n";
    push(@barcodes, 'untagged');
    my ($id_line, $seq, $quality, $count, %filehandle, %cache);
    foreach my $code (@barcodes) {
        open(my $fh, '>', "$code.$file") or die "$!";
        $filehandle{$code} = $fh;
        $cache{$code} = '0' x 10000000;    # allocate memory
    }
    foreach my $code (@barcodes) { $cache{$code} = ''; }
    for (;;) {
        $id_line = <IN>;
        last unless defined $id_line;
        $seq = <IN>;
        $quality = <IN>;    # the '+' line, overwritten by the next read
        $quality = <IN>;
        my $tag = substr($seq, 0, 4);
        if (exists $filehandle{$tag}) {
            $cache{"$tag"} .= $id_line . substr($seq, 4) . "+\n" . substr($quality, 4);
        }
        else {
            $cache{untagged} .= $id_line . $seq . "+\n" . $quality;
        }
        next unless ++$count >= 250000;    # A million lines read
        # Flush all caches to their files
        foreach my $code (@barcodes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }
    close IN;
    foreach my $code (@barcodes) {
        my $fh = $filehandle{$code};
        print $fh $cache{$code};
        close $fh;
    }
Version 3, pipe, 1482 sec
Same as version 2, just adding a fast reading pipe in front.
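The slides do not show which command fed the pipe. As a rough sketch of the idea, assuming plain cat as the reading process and a tiny made-up record as input, the only change from version 2 would be how the input handle is opened:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Tiny made-up input file with a single record.
my ($tmp, $file) = tempfile(UNLINK => 1);
print $tmp "\@id\nGCGTAAAA\n+\nIIIIIIII\n";
close $tmp;

# Assumption: a separate process (here plain cat) reads the file and
# feeds it to perl through a pipe, so reading and processing overlap.
open(my $in, '-|', 'cat', $file) or die "Can't start cat: $!\n";
my $records = 0;
while (defined(my $id = <$in>)) {
    my $seq     = <$in>;
    my $plus    = <$in>;
    my $quality = <$in>;
    ++$records;
}
close $in or die "cat failed: $! $?\n";

print "$records\n";
```

According to the timings on the slides, this was slower than version 2 (1482 sec against 1305 sec), so the extra process did not pay off here.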
Version 4, block reading, 2020 sec
    my ($offset, $buffer, $count) = (0, '', 0);
    while (read(IN, $buffer, 8192, $offset)) {
        my $newline = chomp $buffer;
        my @lines = split(m/\n/, $buffer);
        # Only complete records (groups of 4 lines) are processed now
        my $lastgoodentry = scalar @lines - (scalar @lines) % 4;
        $lastgoodentry -= 4 if $lastgoodentry == scalar @lines and not $newline;
        if ($lastgoodentry == scalar @lines) {
            $buffer = '';
            $offset = 0;
        }
        else {
            # Keep the incomplete record for the next read
            $buffer = join("\n", splice(@lines, $lastgoodentry));
            $buffer .= "\n" if $newline;
            $offset = length($buffer);
        }
        for (my $i = 0; $i < $lastgoodentry; $i += 4) {
            my ($id, $seq, $plus, $quality) = @lines[$i..$i+3];
            my $tag = substr($seq, 0, 4);
            if (exists $filehandle{$tag}) {
                $cache{"$tag"} .= "$id\n" . substr($seq, 4) . "\n+\n" . substr($quality, 4) . "\n";
            }
            else {
                $cache{untagged} .= "$id\n$seq\n+\n$quality\n";
            }
        }
        next unless ++$count >= 1000;    # blocks
        foreach my $code (@barcodes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }
Version 5, block reading 2, 2000 sec
    my ($offset, $buffer, $count) = (0, '', 0);
    while (read(IN, $buffer, 65536, $offset)) {
        my $recpos = 0;
        for (;;) {
            my $seqpos = 1 + index($buffer, "\n", $recpos);
            last unless $seqpos;
            my $qualpos = 3 + index($buffer, "\n", $seqpos);
            last unless $qualpos > 2;
            my $nextrec = 1 + index($buffer, "\n", $qualpos);
            last unless $nextrec;
            my $tag = substr($buffer, $seqpos, 4);
            my $record = substr($buffer, $recpos, $nextrec - $recpos);
            if (exists $filehandle{$tag}) {
                substr($record, $qualpos-$recpos, 4, '');
                substr($record, $seqpos-$recpos, 4, '');
                $cache{"$tag"} .= $record;
            }
            else {
                $cache{untagged} .= $record;
            }
            $recpos = $nextrec;
        }
        $buffer = substr($buffer, $recpos);
        $offset = length($buffer);
        next unless ++$count >= 1000;    # blocks
        foreach my $code (@barcodes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }
Version 6, record reading, 1069 sec
    # 0x9c = 156 bytes, the size of one complete record in this file
    my ($count, $record, $size, $extra, $tag, $missing) = (0, '', 0x9c, '', '');
    while ($extra or read(IN, $record, $size)) {
        $record = $extra, $extra = '' if $extra;
        # The read did not end on a record boundary: adjust
        if (substr($record, -1, 1) ne "\n") {
            my $where = rindex($record, "\n") + 1;
            if (length($record) - $where > 5) {
                do {
                    read(IN, $missing, 1);
                    $size++;
                    $record .= $missing;
                } until $missing eq "\n";
            }
            else {
                $extra = substr($record, $where, length($record) - $where, '');
                $size -= length($extra);
                read(IN, $extra, $size - length($extra), length($extra));
            }
        }
        # Sequence starts 106 bytes from the end, quality 52 bytes from the end
        $tag = substr($record, -106, 4);
        if (exists $filehandle{$tag}) {
            substr($record, -106, 4, '');
            substr($record, -52, 4, '');
            $cache{"$tag"} .= $record;
        }
        else {
            $cache{untagged} .= $record;
        }
        next unless ++$count >= 1000000;    # records
        foreach my $code (@barcodes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }
Version 7, mixed reading, 1060 sec
    my ($count, $record, $id, $tag) = (0);
    # Read the ID line as a line, the rest of the record as a fixed-size block
    while (defined ($id = <IN>)) {
        read(IN, $record, 0x6a);    # sequence, '+' and quality lines: 106 bytes
        $tag = substr($record, 0, 4);
        if (exists $filehandle{$tag}) {
            substr($record, 0x36, 4, '');    # barcode part of the quality line
            substr($record, 0, 4, '');       # barcode itself
            $cache{$tag} .= "$id$record";
        }
        else {
            $cache{untagged} .= "$id$record";
        }
        next unless ++$count >= 1000000;    # records
        foreach my $code (@barcodes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }
    close IN;
    foreach my $code (@barcodes) {
        my $fh = $filehandle{$code};
        print $fh $cache{$code};
        close $fh;
    }
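The hexadecimal constants in versions 6 and 7 (0x9c, 0x6a, 0x36, and the offsets -106 and -52) follow from the fixed line lengths in this particular file. As a check, they can be recomputed from the first record on the input slide; the variable names below are only for illustration.

```perl
use strict;
use warnings;

# First record from the input slide, one element per line.
my ($id, $seq, $plus, $quality) = map { "$_\n" } (
    '@HWI-ST301L:255:C0E64ACXX:1:1101:1234:2299 1:N:0:',
    'CTGTAACGTACCATAGGTTGACCATACTTCAAAAGCTGTACTCTCATGGCC',
    '+',
    '=1:BDDFFHDHHHJDHIF@EHGHIIGIIJIGDGGIGJJHCDIIIGGGHIG@',
);

# Everything after the ID line: sequence, '+' and quality lines.
my $tail        = length($seq) + length($plus) + length($quality);
my $whole       = length($id) + $tail;
my $qual_offset = length($seq) + length($plus);

printf "whole record: 0x%x bytes\n", $whole;              # $size in version 6
printf "after the ID line: 0x%x bytes\n", $tail;          # read length in version 7
printf "quality offset after the ID: 0x%x\n", $qual_offset;
```

These sizes hold only while every sequence is 51 bases long. Reading the ID line with the line operator, as version 7 does, makes the code independent of the ID length, which version 6 has to correct for by hand.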
The next step
It is hard to know when you have reached the best possible speed. When you have no more ideas to try out, sleep on it – or stop.
Consider parallelizing your algorithm – computers these days have multiple cores to work with. In this NGS example, one core could read the file into memory, one core could separate the records into the right output caches, and one core could write the output files.
Conclusions
• Declaration of my $variables takes time.
• Data that are basically line oriented are hard to work with in block reading.
• Even the simplest perl statement takes too long if executed many times.
• Usually, the less code you use to express your algorithm, the faster it will go.
• Optimizing takes a lot of time; you have to choose when to do it.
• Optimizing is also a lot about trial and error.