Learn how subroutines can save time and prevent errors in bioinformatics. Discover built-in subroutines and how to find pre-defined ones. Explore file manipulation and basic file operations. Complete examples and assignments provided.
Subroutines and Files Bioinformatics Ellen Walker Hiram College
Why Subroutines? • Save typing • Avoid copy/paste errors • Collect a common algorithm in one place for reuse
Built-In Subroutines • Provide common useful functions, e.g. • index • length • substr • Call with arguments: • index($string, $pat) # $string and $pat are arguments • Different arguments produce different results • Names are case-sensitive: write index, not Index
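A minimal sketch of the three built-ins named above. The sequence and pattern are made-up values for illustration.

```perl
use strict;
use warnings;

my $dna = "ATGCGCTA";   # hypothetical example sequence
my $pat = "GC";

my $len   = length($dna);         # number of characters in the string
my $pos   = index($dna, $pat);    # 0-based position of the first "GC"
my $none  = index($dna, "TTT");   # index returns -1 when the pattern is absent
my $piece = substr($dna, 2, 4);   # 4 characters starting at position 2

print "length: $len\n";
print "first $pat at: $pos\n";
print "TTT at: $none\n";
print "piece: $piece\n";
```

Positions count from 0, so the first character of the string is at position 0.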
Finding Predefined Subroutines • Textbooks (Safari Online has several) • Google (include “Perl” in your string) • Online documentation • http://www.gotapi.com/perl is nicely searchable
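How a Subroutine Works
• The calling code:
my $code = "ACA";
print length($code);
print "goodbye\n";
• What happens inside length:
sub length {
    my $string = shift(@_);
    my $length = 0;
    …code to count …
    return $length;
}
• The value of $code ("ACA") is passed in as the argument; the subroutine returns 3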
Key Components • sub name • Declares this as a subroutine and names it • shift @_ • Pulls the arguments out of the argument list (the values in parentheses in the call), one at a time, left to right • Example: somesub("ACT", 1) • $a = shift @_ ($a is "ACT") • $b = shift @_ ($b is 1) • return value • Ends the subroutine and gives it a value
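The three components can be seen together in a short sketch. The subroutine name first_bases and its behavior are invented for illustration; they are not from the text.

```perl
use strict;
use warnings;

# Hypothetical subroutine: return the first $count letters of $seq.
sub first_bases {
    my $seq   = shift @_;            # first argument, e.g. "ACT"
    my $count = shift @_;            # second argument, e.g. 1
    return substr($seq, 0, $count);  # ends the subroutine and gives it a value
}

my $result = first_bases("ACT", 1);
print "$result\n";
```

Each shift removes one argument from the front of @_, so the order of the shift lines must match the order of the arguments in the call.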
Example (p. 122)
# find all GC-rich 4-7mers and determine their complements
my $GCmatch;
while ($someDNA =~ m/([GC]{4,7})/g) {
    $GCmatch = $1;
    print "5' $GCmatch 3'\n\n";
    my $compl = complement($GCmatch);
    print "3' $compl 5'\n";
}
Subroutine (p. 123)
# book version has good documentation
sub complement {
    my $dna = shift(@_);    # get first arg
    my $anti = $dna;
    $anti =~ tr/ACGTacgt/TGCAtgca/;
    return $anti;
}
Download These (Ch. 7) • Counting nucleotides • countNucleotides($str, "C"); • countNucleotides($str, "[CG]"); • Printing sequences with fixed line width • printSequence($str, 80);
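The textbook's versions are not reproduced here. As a rough sketch of what subroutines with these names and arguments might do, assuming the second argument of countNucleotides is a pattern and the second argument of printSequence is a line width:

```perl
use strict;
use warnings;

# Assumed behavior: count how many times $pattern matches in $str.
# The match is case-sensitive in this sketch.
sub countNucleotides {
    my $str     = shift @_;
    my $pattern = shift @_;
    my $count   = 0;
    $count++ while $str =~ /$pattern/g;
    return $count;
}

# Assumed behavior: print $str in lines of at most $width characters.
sub printSequence {
    my $str   = shift @_;
    my $width = shift @_;
    for (my $pos = 0; $pos < length($str); $pos += $width) {
        my $chunk = substr($str, $pos, $width);
        print "$chunk\n";
    }
}

my $str     = "ACGTCCGGAA";   # made-up sequence
my $c_count = countNucleotides($str, "C");
my $gc      = countNucleotides($str, "[CG]");
print "C: $c_count\n";
print "C or G: $gc\n";
printSequence($str, 4);
```

The downloaded versions may differ in details such as case handling, so check their documentation before relying on this sketch.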
Variable Scope • Variables exist from when they are declared (“my”) until the end of the block (closing brace). • Variables in subroutines exist only during the subroutine • Each call to a subroutine re-initializes the variables
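A small sketch of these scope rules. The names count_calls and $calls are made up for illustration.

```perl
use strict;
use warnings;

sub count_calls {
    my $calls = 0;   # created fresh on every call
    $calls++;
    return $calls;
}

my $first  = count_calls();
my $second = count_calls();
print "first: $first, second: $second\n";   # both are 1: nothing is remembered

my $seq = "ACGT";
{
    my $seq = "TTTT";        # a separate variable that exists only in this block
    print "inside: $seq\n";
}
print "outside: $seq\n";     # the outer $seq is unchanged
```

Because $calls is declared inside the subroutine, it disappears when the subroutine returns, so the second call cannot see the value from the first.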
Files and Programs • Files are stored on the computer’s hard drive and maintained by the operating system. • Programs are connected to files via special subroutines • “open” creates a file handle • “close” releases the file (important!)
Basic File Manipulation • Open a file and read • my $HANDLE; • open($HANDLE, '<', $filename); • $line = <$HANDLE>; • Open a file and write • my $HANDLE; • open($HANDLE, '>', $filename); • print $HANDLE "Hello world!"; • Close a file • close($HANDLE);
Allowing for Errors • If you try to read a file that doesn't exist, or write a file you are not permitted to create or change, the open() command will return false • Note: opening an existing file with '>' does not fail; it overwrites the file • If open() fails, the rest of your program won't work • To fix this, add: or die("some message $file: $!") to the end of the command ($! contains the system error message)
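A sketch of what a failed open looks like. The file name is hypothetical and assumed not to exist. The eval block is used here only so the example can show the message and keep running; in an ordinary script, die simply stops the program.

```perl
use strict;
use warnings;

my $filename = "no_such_file_for_this_example.txt";   # hypothetical, assumed absent
my $HANDLE;

my $ok = eval {
    open($HANDLE, '<', $filename) or die("Cannot open file: $filename: $!\n");
    1;   # reached only if open succeeded
};
my $message = $ok ? "" : $@;   # $@ holds the text passed to die
print "Caught: $message" unless $ok;
```

The exact wording that $! adds depends on the operating system.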
Complete Open Examples
open($HANDLE, '<', $filename) or die("Cannot open file: $filename: $!");
open($HANDLE, '>', $filename) or die("Cannot write file: $filename: $!");
Reading lines • Subroutine chomp removes the '\n' character at the end of a line • $line = <$HANDLE> puts the next line in $line • When there are no more lines, the result is false • Example: put the whole file in one sequence
while ($line = <$HANDLE>) {
    chomp $line;
    $seq = $seq . $line;
}
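A self-contained sketch of the read loop. It first writes a small temporary file (using the core File::Temp module, which the slides do not mention) so that there is something to read; the file contents are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small temporary file so the example can run anywhere.
my ($OUT, $filename) = tempfile(UNLINK => 1);
print $OUT "ACGT\nTTGA\nCC\n";
close($OUT) or die("Cannot close file: $filename: $!");

# Read the file back one line at a time and join the lines.
my $HANDLE;
open($HANDLE, '<', $filename) or die("Cannot open file: $filename: $!");
my $seq = "";
while (my $line = <$HANDLE>) {
    chomp $line;            # drop the trailing "\n"
    $seq = $seq . $line;
}
close($HANDLE);

print "$seq\n";
```

Without chomp, the newline characters would become part of the sequence and inflate its length.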
Printing to a file • The print commands (print and printf) can optionally be followed with a file handle before the string to print • Examples: • print $HANDLE "Hello\n"; • printf $HANDLE "GC percent is %.1f\n", $GCcount * 100.0 / $total;
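A sketch of the printf example with values filled in. The sequence is made up, the counting uses tr (one possible way, not necessarily the textbook's), and a temporary file from the core File::Temp module stands in for a real output file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $seq     = "ACGTCCGGAA";        # made-up sequence
my $total   = length($seq);
my $GCcount = ($seq =~ tr/GC//);   # tr with an empty replacement counts without changing $seq

# Write the report to a temporary file through a file handle.
my ($HANDLE, $filename) = tempfile(UNLINK => 1);
printf $HANDLE "GC percent is %.1f\n", $GCcount * 100.0 / $total;
close($HANDLE) or die("Cannot close file: $filename: $!");

# Read the line back to confirm what was written.
open(my $IN, '<', $filename) or die("Cannot open file: $filename: $!");
my $report = <$IN>;
close($IN);
print $report;
```

Note that there is no comma between the file handle and the format string.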
ReadInDNA • Subroutine to read FASTA formatted file (p. 141) • Returns sequence as one long string • Removes whitespace, lines that begin with # (comments), and all digits
FASTA File Format • One header line, begins with > • Many lines of text, sometimes capitalized, sometimes with spaces after every n characters • (ReadInDNA handles these variations)
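The textbook's ReadInDNA is not reproduced here. The following is a rough stand-in, under the assumption that header lines (starting with >) and comment lines (starting with #) are skipped and that whitespace and digits are removed. The name read_fasta_sketch and the sample file contents are invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for ReadInDNA; not the book's code.
sub read_fasta_sketch {
    my $filename = shift @_;
    open(my $HANDLE, '<', $filename) or die("Cannot open file: $filename: $!");
    my $seq = "";
    while (my $line = <$HANDLE>) {
        next if $line =~ /^[>#]/;   # skip the header line and comment lines
        $line =~ s/[\s\d]//g;       # remove whitespace (including "\n") and digits
        $seq .= $line;
    }
    close($HANDLE);
    return $seq;
}

# Build a tiny made-up FASTA file to read.
my ($OUT, $filename) = tempfile(UNLINK => 1);
print $OUT ">example header (made up)\n";
print $OUT "  1 acgt acgt\n";
print $OUT "# a comment line\n";
print $OUT "  9 TTGA\n";
close($OUT) or die("Cannot close file: $filename: $!");

my $dna = read_fasta_sketch($filename);
print "$dna\n";
```

This sketch keeps the letters in whatever case the file uses; it also treats every line starting with > as a header, so a file with several records would be joined into one string.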
Getting a FASTA File • Go to NCBI http://www.ncbi.nlm.nih.gov/ • Search for what you want and download the file to your current machine • Send the file to your directory on cs.hiram.edu (Demo to be provided)
Assignment • Using subroutines from your text, determine the GC content of the given genomes. (Examples to be provided)