220 likes | 323 Views
More loops. Commands inside a loop are executed repeatedly (iteratively): my $num=0; print "Guess a number.<br>"; while ($num != 31) { $num = <STDIN>; } print "correct!<br>";. Loops. my @names = <STDIN>; chomp(@names); my $name; foreach $name (@names) { print "Hello $name!<br>"; }.
E N D
Commands inside a loop are executed repeatedly (iteratively): my $num=0;print "Guess a number.\n";while ($num != 31) { $num = <STDIN>;}print "correct!\n"; Loops my @names = <STDIN>;chomp(@names);my $name;foreach $name (@names) { print "Hello $name!\n";}
The for loop is controlled by three statements: • 1st is executed before the first iteration • 2nd is the stop condition • 3rd is executed before every re-iteration Loops: for my $i=0;while ($i<10){ print "$i\n";$i++;} for (my $i=0; $i<10; $i++) { print "$i\n";} These are equivalent
Breaking out of loops next– skip to the next iteration last– skip out of the loop my @lines = <STDIN>;foreach $line (@lines) { if (substr($line,0,1) eq ">") { next; } if (substr($line,0,8) eq "**stop**") { last; } print $line;}
Breaking out of loops die– end the program and print an error message to the standard error <STDERR> if ($score < 0) { die "score must be positive"; }score must be positive at test.pl line 8. Note: if you end the string with a "\n" then only your message will be printed * warn does the same thing as die without ending the program
The programming process It pays to plan ahead before writing a computer program: • Define the purpose of the program • Identify the required inputs • Decide how to present the outputs • Make an overall design of the program • Refine the design, specify more details • Write the code– one stage at a time and testeach stage • Debug…
(1) (2&3) (4&5) An example: SAGE libraries SAGE (Serial Analysis of Gene Expression) is used to identify all transcripts that are expressed in a tissue: • Double-stranded cDNA is generated from cell extracts • The cDNA is cleaved with a restriction enzyme (NlaIII) • The most 3'-end of the cDNA is then collected by their poly-A • The fragments are ligated to linkers containing a recognition site for a type IIS restriction enzyme and a PCR primer site • This restriction enzyme cuts 15bp away from its recognition site • Ligation, PCR, cleavage, concatenation, cloning, sequencing… A 10bp tag sequence from each mRNA • 10bp sequences are searched in an mRNA database and the corresponding genes are identified
An example: SAGE libraries SAGE (Serial Analysis of Gene Expression) is used to identify all transcripts that are expressed in a tissue:
Predicting the SAGE tag of an mRNA It would be useful to know what tag to expect for each mRNA in the database. So lets write a script: 1.Purpose: To predict the 10bp sequence of the SAGE tag of a given mRNA 2. Inputs: A list of mRNA sequences in FASTA format >gi|24646380|ref|NM_079608.2| Mus musculus EH-domain containing 4 (EHD4), mRNA GTGGTATTTCTTCGTTGTCTCTGGCGTGGTCACGTTGATTGGTCCGCTATCTGGACCGAAAAAAGTCGTA......GTCGACGGCGATGGGTTCCTGGACTCTGACGAGTTCGCGCTGGCCTTGCACTTAATCAACGTCAAGCTGGAAGGCTGCGAGCTGCCCACCGTGCTGCCGGAGCACTTAGTACCGCCGTCGAAGCGCTATGACTAGTGTCCTGTAGCATACGCATACGCACACTAGATCACACAGCCTCACAATTCCCAAAAAAAAAAAAAAAA >gi|71895640|ref|NM_001031040.1| Mus musculus EH-domain containing 3 (EHD3), mRNAGGTAGGGCGCTACCGCCTCCGCCCGCCTCTCGCGCTGTTCCTCCGCGGTATGCCCGCGCCGGCAGCCGGC......TATTATATAGAGAAATATATTGTGTATGTAGGATGTGCTTATTGCATTACATTTATCACTTGTCTTAACTAGAATGCATTAACCTTTTTTGTACCCTGGTCCTAAAACATTATTAAAAAGAAAGGCTAAAAAAAAAAAAAAAAA >gi|55742710|ref|NM_153068.2| Mus musculus EH-domain containing 2 (Ehd2), mRNATGAGGGGGCCTGGGGCCCGCCCTGCTCGCCGCTCCTAGCGCACGCGGCCCCACCCGTCTCACTCCACTGC......
3. Decide how to present the results Simply print the header line of each mRNA and then it’s predicted 10bp tag, like so: > gi|24646380|ref|NM_079608.2| Mus musculus EH-domain containing 4 (EHD4), mRNAATCACACAGC >gi|71895640|ref|NM_001031040.1| Mus musculus EH-domain containing 3 (EHD3), mRNAAATGCATTAA ......
4. Overall design: • For each mRNA in the input: • Read the sequence • Find the most downstream recognition site of NlaII (CTAG) • Get the 10bp tag after that site • Print it
Start Read sequence Find most downstream CTAG Get the 10bp tag Print the tag End of input? No End Flow diagram:
Refine the design, specify more details: • For each mRNA in the input (use a loop): • Read the sequence • Store its header line in one string variable • Concatenate all lines of the sequence and store it in another string variable • Find the most downstream recognition site of NlaII (CTAG) • Go over the sequence with a loop, starting from the 3’ tail, and going back until the first CTAG is found • Get the 10bp tag after that site • Take a substr of length 10 • Print it • Write the code
Start Read sequence Find most downstream CTAG Get the 10bp tag Print the tag End of input? No End Read line Save header Read line Concatenate to sequence Read line Header? No Yes
Start Read sequence Find most downstream CTAG Get the 10bp tag Print the tag End of input? No End Start pos. at end of sequence Check pos. for “CTAG” “CTAG” at pos? pos-- Yes Pos < 0?
Find most downstream CTAG Start pos. at end of sequence Check pos. for “CTAG” “CTAG” at pos? Pos < 0? pos-- Yes Yes Print “no tag”
Find most downstream CTAG Start pos. at end of sequence Check pos. for “CTAG” “CTAG” at pos? Pos < 0? pos-- Yes Yes Pos < 0? Yes No Print “no tag” Print tag
Start Read line Save header Read line Concatenate to sequence Read line Header? No Yes Do something End of input? No End • Overall design: • Read the sequence • Do something • Let’s see how it’s done… FASTA: Analyzing complex input
Start Read line Save header Read line Concatenate to sequence Read line Header? No Yes Do something End of input? No End $line = <STDIN>; my $endOfInput = 0; while ($endOfInput==0) { # 1.1. Read sequence name from FASTA header if (substr($line,0,1) eq ">") { $name = substr($line,1); } else... # 1.2. Read sequence until next FASTA header $seq = ""; $line = <STDIN>; while (substr($line,0,1) ne ">") { $seq = $seq . $line; $line = <STDIN>; if (!defined($line)) { $endOfInput = 1; last; } } # 2. Do something... }
Start Read line Save header Read line Concatenate to sequence Read line Header? No Yes Do something End of input? No End ################################### # 1. Foreach sequence in the input my (@lines, $line, $name, $seq); $line = <STDIN>; chomp $line; my $endOfInput = 0; while ($endOfInput==0) { ################################ # 1.1. Read sequence name from FASTA header if (substr($line,0,1) eq ">") { $name = substr($line,1); } else { die "bad FASTA format"; } # 1.2. Read sequence until next FASTA header $seq = ""; $line = <STDIN>; chomp $line; # Read until next header or end of input while (substr($line,0,1) ne ">") { $seq = $seq . $line; $line = <STDIN>; if (!defined($line)) { $endOfInput = 1; last; } chomp $line; } ################################ # 2. Do something... }
my @fasta = <STDIN>; my $oneline = join("", @fasta); # Concatenate all lines for ($i=0; $i<length($oneline); $i++) { my $c = substr($oneline,$i,1); my $sub10 = substr($oneline,$i,10); if ($c eq ">") { # Save header start position $start = ($i+1); } if ($c eq "]") { # Save header end position $end = $i; } if(???) { # If we found what we were looking for... # Print last header $name = substr($oneline,$start,$end-$start+1); } } FASTA: An alternative approach(which is more confusing and generally not recommended!)