Programming and Perl for Bioinformatics Part II
Basic Data Types
• Perl has three basic data types:
  • scalar
  • array (list)
  • associative array (hash)
Arrays
• An array (list) is an ordered list of scalar values.
• ‘@’ is used to refer to the entire array; ‘$’ is used to refer to a single element.
• Example:
    (1,2,3)                  # Array of three values 1, 2, and 3
    ("one","two","three")    # Array of 3 values "one", "two", "three"
    @names = ("mary", "tom", "mark", "john", "jane");
• Extract the 2nd item from @names (indexes start at 0):
    $names[1];               # "tom"
• Extract a sublist from @names:
    @names[1..4];            # ("tom", "mark", "john", "jane")
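The element and slice syntax above can be tried out with a short script. This is a minimal sketch; the variable names $second, @sublist, $last and $last_index are introduced here only for illustration.

```perl
use strict;
use warnings;

my @names = ("mary", "tom", "mark", "john", "jane");

# A single element is a scalar, so it takes the $ sigil; indexes start at 0.
my $second = $names[1];

# A slice is a list of elements, so it takes the @ sigil.
my @sublist = @names[1..4];

# $#names is the index of the last element; index -1 counts from the end.
my $last_index = $#names;
my $last       = $names[-1];

print "second item: $second\n";
print "sublist: @sublist\n";
print "last item: $last (index $last_index)\n";
```

The sigil describes what is being fetched (one scalar or a list), not the type of the variable it comes from.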
More on Arrays
    @a = ( );                                    # empty list
    @b = (1,2,3);                                # three numbers
    @c = ("Jan","Joe","Marie");                  # three strings
    @d = ("Dirk",1.92,46,"20-03-1977");          # a mixed list
• Variables and sublists are interpolated in a list
    @b = ($a, $a+1, $a+2);                       # variable interpolation
    @c = ("Jan", ("Joe","Marie") );              # list interpolation
    @d = ("Dirk", 1.92, 46, ( ), "20-03-1977");  # empty list interpolation
    @e = ( @b, @c );                             # same as (1,2,3,"Jan","Joe","Marie") when $a is 1
• Practical construction operators ($x..$y)
    @x = (1..6);                                 # same as (1, 2, 3, 4, 5, 6)
    @y = (2..5, 8, 11..13);                      # same as (2,3,4,5,8,11,12,13)
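List interpolation means that nested lists are flattened into one list. The sketch below, reusing the slide's values, shows this by counting elements; it is an illustration rather than part of the original slides.

```perl
use strict;
use warnings;

my @b = (1, 2, 3);
my @c = ("Jan", ("Joe", "Marie"));   # the inner list is flattened
my @e = (@b, @c);                    # one flat list of six elements
my @y = (2..5, 8, 11..13);           # ranges and single values can be mixed

print "c has ", scalar(@c), " elements\n";
print "e: @e\n";
print "y: @y\n";
```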
Array Example
    # Here's one way to declare an array, initialized with a list of four
    # scalar values.
    @bases = ('A', 'C', 'G', 'T');

    # Now we'll print each element of the array
    print "Here are the array elements:";
    print "\nFirst element: ";
    print $bases[0];
    print "\nSecond element: ";
    print $bases[1];
    print "\nThird element: ";
    print $bases[2];
    print "\nFourth element: ";
    print $bases[3];
    print "\n";
• This code snippet prints out:
    Here are the array elements:
    First element: A
    Second element: C
    Third element: G
    Fourth element: T
Print Array
• You can print the elements one after another like this:
    @bases = ('A', 'C', 'G', 'T');
    print "\n\nHere are the array elements: ";
    print @bases;
• It produces the output:
    Here are the array elements: ACGT
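Printing an array directly and printing it inside double quotes give different results. The sketch below assumes Perl's default settings, where quoted arrays are joined with a single space; $quoted is a name introduced for illustration.

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

# Unquoted: the elements are printed one after another with nothing between them.
print @bases, "\n";

# Quoted: the elements are interpolated with a space between them.
my $quoted = "@bases";
print "$quoted\n";
```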
Converting a string to an array
• split splits a string into parts and puts them in an array.
    $dnastring = "ACGTGCTA";
    @dnaarray = split ( //, $dnastring );    # @dnaarray is now (A, C, G, T, G, C, T, A)
    @dnaarray = split ( /T/, $dnastring );   # @dnaarray is now (ACG, GC, A)
Converting an array to a string
• join combines the elements of an array into a single scalar variable (a string)
    $dnastring = join('', @dnaarray);
• The first argument is the spacer placed between the elements (empty here); the second is the array to join.
Array Manipulations
• reverse: reverses the order of array elements
    @a = (1, 2, 3);
    @b = reverse @a;                              # @b = (3, 2, 1);
• split: splits a string into a list/array
    $line = "John Smith 28";
    ($first, $last, $age) = split (/\s/, $line);  # \s: white space [ \t\n\f\r]
    $DNA = "ACGTTTGA";
    @DNA = split ("", $DNA);
• join: joins a list/array into a string
    $gene = join ( "", ($exon1, $exon3) );
    $name = join ( "-", ("Zhong", "Hui") );
• scalar: returns the number of elements in @array
    scalar @array;
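These functions combine naturally. As a minimal sketch, the script below splits a DNA string into bases, counts them, and joins them back together in reverse order. It uses /\s+/ rather than /\s/ for the name example so that runs of several spaces count as one separator; that choice is an assumption about the input, not something the slides require.

```perl
use strict;
use warnings;

my $dna      = "ACGTTTGA";
my @bases    = split //, $dna;             # one element per character
my $length   = scalar @bases;              # number of elements
my $reversed = join '', reverse @bases;    # back to a string, reversed

my $line = "John Smith 28";
my ($first, $last, $age) = split /\s+/, $line;

print "length: $length\n";
print "reversed: $reversed\n";
print "$last, $first (age $age)\n";
```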
Array Manipulations - pop
• You can take an element off the end of an array with pop:
    @bases = ('A', 'C', 'G', 'T');
    $base1 = pop @bases;
    print "Here's the element removed from the end: ";
    print $base1, "\n\n";
    print "Here's the remaining array of bases: ";
    print "@bases";
• which produces the output:
    Here's the element removed from the end: T

    Here's the remaining array of bases: A C G
Array Manipulations - shift
• You can take a base off the beginning of the array with shift:
    @bases = ('A', 'C', 'G', 'T');
    $base2 = shift @bases;    # shift left
    print "Here's an element removed from the beginning: ";
    print $base2, "\n\n";
    print "Here's the remaining array of bases: ";
    print "@bases";
• which produces the output:
    Here's an element removed from the beginning: A

    Here's the remaining array of bases: C G T
Array Manipulations - push
• You can put an element on the end of the array with push:
    @bases = ('A', 'C', 'G', 'T');
    $base2 = shift @bases;
    push (@bases, $base2);    # push returns the number of elements in the array after the push
    print "Here's the element from the beginning put on the end: ";
    print "@bases\n\n";
• It produces the output:
    Here's the element from the beginning put on the end: C G T A
Array Manipulations - unshift
• You can put an element at the beginning of the array with unshift:
    @bases = ('A', 'C', 'G', 'T');
    $base1 = pop @bases;
    unshift (@bases, $base1);
    print "Here's the element from the end put on the beginning: ";
    print "@bases\n\n";
• It produces the output:
    Here's the element from the end put on the beginning: T A C G
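Taken together, the four functions work on the two ends of an array: shift and unshift at the beginning, pop and push at the end. A short sketch of rotating an array left and then right again (the names $after_left and $after_right are only for illustration):

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

# Rotate left: remove the first element and append it to the end.
push @bases, shift @bases;
my $after_left = "@bases";

# Rotate right: remove the last element and put it back at the beginning.
unshift @bases, pop @bases;
my $after_right = "@bases";

print "after rotating left:  $after_left\n";
print "after rotating right: $after_right\n";
```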
Exercise
    # Determine the frequency of nucleotides
    $dna = "gaTtACataCACTgttca";
    ?
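One possible approach to the exercise, offered as a sketch rather than the intended solution: convert the string to upper case with uc, split it into single bases, and tally them in a hash (the third basic data type listed earlier). The hash name %count is an assumption made for this example.

```perl
use strict;
use warnings;

my $dna = "gaTtACataCACTgttca";

# Tally each base, ignoring case.
my %count;
foreach my $base (split //, uc $dna) {
    $count{$base}++;
}

# Report the count and the fraction of the sequence for each base.
foreach my $base (sort keys %count) {
    printf "%s: %d (%.2f)\n", $base, $count{$base}, $count{$base} / length($dna);
}
```

A hash avoids writing one counter variable per nucleotide, and it also records any unexpected characters instead of silently ignoring them.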
Filehandles
File I/O (input/output): reading from/writing to files
• Files are represented in Perl by a filehandle variable (for clarity, written as a bare word in UPPERCASE)
• Open a file on a filehandle using the open function
  • for reading (input):
      open INFILE, "<datafile.txt";
    or
      open (INFILE, "<datafile.txt");
  • for writing (output), overwriting the file:
      open OUTFILE, ">output";
  • for appending to the end of the file:
      open OUTFILE, ">>output";
• Close a file on a filehandle
      close (OUTFILE);
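open returns a false value when it fails, for example when the file to read does not exist, so it is worth checking the result. The sketch below creates an empty scratch directory with the core File::Temp module so that the file is guaranteed to be missing; the directory and the $opened flag exist only for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);     # empty scratch directory
my $path = "$dir/datafile.txt";       # this file does not exist

my $opened = 0;
if (open(INFILE, "<$path")) {
    $opened = 1;
    close(INFILE);
} else {
    # $! holds the system's error message for the failed call.
    print "Could not open $path: $!\n";
}

# In a script, a common shorthand is:
#   open(INFILE, "<$path") or die "Could not open $path: $!";
```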
Special Filehandles
Special “files” that are always “open”
• STDIN (standard input)
  • input from the command window
  • read only
• STDOUT (standard output)
  • output to the command window
  • write only
      print STDOUT "Have fun with Perl!\n";
    or just
      print "Have fun with Perl!\n";
Input from Filehandles
“Angle bracket” input operator
• reads one line of input (up to and including the newline)
• from STDIN:
    print "Enter name of protein: ";
    $line = <STDIN>;
    chomp $line;    # removes \n from end of $line
    print "\nYou entered $line.\n";
• from a file:
    open (INPUTFILE, "prot1.seq");
    $line1 = <INPUTFILE>;    # first line
    chomp $line1;
    $line2 = <INPUTFILE>;    # second line
    # Perl reads files one line at a time
    # … etc
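The behaviour of the angle-bracket operator can be explored without typing input or creating a file by opening a filehandle on a string. This sketch uses a lexical filehandle ($in) and the three-argument form of open, neither of which appears in the slides, and the two-line input text is made up.

```perl
use strict;
use warnings;

my $text = "MKVLAAGIVG\nLLLAQPSFA\n";    # made-up two-line input

# Opening a reference to a string gives a filehandle that reads from it.
open(my $in, '<', \$text) or die "Cannot open in-memory handle: $!";

my $line1 = <$in>;    # first line, newline included
chomp $line1;         # remove the trailing newline
my $line2 = <$in>;    # second line
chomp $line2;
my $line3 = <$in>;    # nothing left: the operator returns undef
close($in);

print "line 1: $line1\n";
print "line 2: $line2\n";
print "end of input reached\n" unless defined $line3;
```

The undefined value at end of input is what allows a while loop to stop when a file runs out of lines.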
sequences.fasta
>gi|145536|gb|L04574.1|Escherichia coli DNA polymerase III chi subunit gene, complete cds
TAACGGCGAAGAGTAATTGCGTCAGGCAAGGCTGTTATTGCCGGATGCGGCGTGAACGCCTTATCCGACC
TACACAGCACTGAACTCGTAGGCCTGATAAGACACAACAGCGTCGCATCAGGCGCTGCGGTGTATACCTG
ATGCGTATTTAAATCCACCACAAGAAGCCCCATTTATGAAAAACGCGACGTTCTACCTTCTGGACAATGA
CACCACCGTCGATGGCTTAAGCGCCGTTGAGCAACTGGTGTGTGAAATTGCCGCAGAACGTTGGCGCAGC
GGTAAGCGCGTGCTCATCGCCTGTGAAGATGAAAAGCAGGCTTACCGGCTGGATGAAGCCCTGTGGGCGC
GTCCGGCAGAAAGCTTTGTTCCGCATAATTTAGCGGGAGAAGGACCGCGCGGCGGTGCACCGGTGGAGAT
CGCCTGGCCGCAAAAGCGTAGCAGCAGCCGGCGCGATATATTGATTAGTCTGCGAACAAGCTTTGCAGAT
TTTGCCACCGCTTTCACAGAAGTGGTAGACTTCGTTCCTTATGAAGATTCTCTGAAACAACTGGCGCGCG
AACGCTATAAAGCCTACCGCGTGGCTGGTTTCAACCTGAATACGGCAACCTGGAAATAATGGAAAAGACA
TATAACCCACAAGATATCGAACAGCCGCTTTACGAGCACTGGGAAAAGCAGGGCTACTTTAAGCCTAATG
GCGATGAAAGCCAGGAAAGTTTCTGCATCATGATCCCGCCGCCGAA
Determine frequency of nucleotides
• Input file: sequences.fasta
    open (INPUTFILE, "sequences.fasta");    # open file for sequence
    $line1 = <INPUTFILE>;                   # header line (starts with >), not counted
    $line2 = <INPUTFILE>;                   # first line of sequence
    $line3 = <INPUTFILE>;                   # second line of sequence
    chomp ($line2, $line3);
    $dna = $line2.$line3;                   # only these two lines are counted
    $count_A = 0;
    $count_C = 0;
    $count_G = 0;
    $count_T = 0;
    @dna = split '', $dna;
    foreach $base (@dna) {
        if    ($base eq 'A') {$count_A++;}
        elsif ($base eq 'C') {$count_C++;}
        elsif ($base eq 'G') {$count_G++;}
        elsif ($base eq 'T') {$count_T++;}
        else  {print "error!\n";}
    }
    print "count of A = $count_A \n";
    print "count of C = $count_C \n";
    print "count of G = $count_G \n";
    print "count of T = $count_T \n";
Read a File: line by line
    my $my_sequence;
    open FILE1, "/u/doej01/prot1.seq";
    while ($line = <FILE1>) {
        chomp($line);
        $my_sequence = $my_sequence.$line;
    }
    close ( FILE1 );
• Reads the whole file, with the newlines removed, into the variable $my_sequence
Using loops to read in a file
• The while loop keeps evaluating its body for as long as its condition is true, so it keeps reading lines from the file until it runs out.
• Inside the loop, the special variable $_ holds the line that was just read.
    my $longsequence;
    open FILE, 'exampleprotein.txt';
    while (<FILE>) {
        $longsequence = $longsequence . $_ ;
        chomp $longsequence;
    }
    close FILE;
• This reads the whole file, appending each line to the variable $longsequence one at a time.
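The same loop can be adapted to a FASTA file by skipping the header line, which starts with ">". The sketch below first writes a tiny made-up record to a temporary file with the core File::Temp module so that it runs anywhere; the record, the temporary file and the regular-expression test are assumptions for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small, made-up FASTA file to read.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} ">example header\nACGT\nTTGA\n";
close($tmp);

my $sequence = '';
open(FILE, "<$path") or die "Cannot open $path: $!";
while (my $line = <FILE>) {
    chomp $line;
    next if $line =~ /^>/;    # skip the header line
    $sequence .= $line;       # append the sequence line
}
close(FILE);

print "$sequence\n";
```

Unlike reading a fixed number of lines, this handles a sequence of any length.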
Read a File into an Array
• Rather than read a file one line at a time into a scalar variable, it is often helpful to read the entire file into an array
    open FILE1, "prot1.seq";
    @DNA = <FILE1>;    # array of strings, one per line
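When a file is read into an array, each element still ends with its newline. A minimal sketch, reading from an in-memory string instead of a real file (the text and the lexical filehandle $in are assumptions for this example), shows how chomp and join turn the lines into one sequence:

```perl
use strict;
use warnings;

my $text = "MKV\nLAA\nGIV\n";    # made-up three-line input
open(my $in, '<', \$text) or die "Cannot open in-memory handle: $!";
my @lines = <$in>;               # list context: read all remaining lines
close($in);

my $count = scalar @lines;       # number of lines read
chomp @lines;                    # chomp removes the newline from every element
my $sequence = join '', @lines;  # concatenate the lines

print "$count lines, sequence: $sequence\n";
```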
Writing to a File
• Writing to a file is similar to reading from it
• Use the > operator to open a file for writing:
    open OUTPUT, '>/home/achou/output.txt';
• This creates a new file with that name, or overwrites an existing file
• Use >> to append text to an existing file
• print to the file using the filehandle:
    print OUTPUT $myoutputdata;
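The difference between > and >> can be checked by writing to a file, appending to it, and reading it back. This sketch writes to a temporary directory created with the core File::Temp module; the file name and the text written are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/output.txt";    # hypothetical output file

# > creates the file (or overwrites an existing one).
open(OUTPUT, ">$path") or die "Cannot write to $path: $!";
print OUTPUT "count of A = 6\n";
close(OUTPUT);

# >> keeps the existing contents and adds to the end.
open(OUTPUT, ">>$path") or die "Cannot append to $path: $!";
print OUTPUT "count of C = 4\n";
close(OUTPUT);

# Read the file back to see both lines.
open(INPUT, "<$path") or die "Cannot read $path: $!";
my @lines = <INPUT>;
close(INPUT);

print "The file has ", scalar(@lines), " lines:\n", @lines;
```

Note that there is no comma between the filehandle and the data in a print statement.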