String ($var) arrays (@array) conversion and substring extraction

String ($var) arrays (@array) conversion and substring extraction Lecture 6

Split strings
• The split function divides a string at a delimiter. It can split:
  • a string into an array
  • a string into a list of scalars ($variables)
• It can also split a string into its individual characters by using "" as the delimiter.
• Example input line: >192a8, the lactose gene, e. coli, cambridge university, january 1981
• chomp($line = <>); # read the line into $line
• @fields = split ',', $line; # splits a string into an array
• ($clone,$laboratory,$left_oligo,$right_oligo) = split ',', $line; # splits a string into a list of scalars
• See SplitExample.pl
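As a minimal sketch of the three uses of split listed above, assuming the sample line is held in a variable rather than read from standard input. The helper name split_fields, the variable names, and the /\s*,\s*/ pattern (which also removes the spaces around each comma) are illustrative choices and are not taken from SplitExample.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: split a comma-separated line into trimmed fields.
sub split_fields {
    my ($line) = @_;
    chomp $line;                       # remove the trailing newline, if any
    # /\s*,\s*/ splits on commas and swallows the spaces around them
    return split /\s*,\s*/, $line;
}

my $line = ">192a8, the lactose gene, e. coli, cambridge university, january 1981\n";

my @fields = split_fields($line);                  # string -> array
my ($id, $gene, $organism) = split_fields($line);  # string -> list of scalars (extra fields are dropped)
my @bases = split //, "ATGC";                      # string -> individual characters

print scalar(@fields), " fields\n";
print "$id | $gene | $organism\n";
print "@bases\n";
```

When the list on the left has fewer variables than there are fields, the remaining fields are silently discarded, so it is worth checking that the number of variables matches the data.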

Join: elements of an array
• The join function is the reverse of split: it converts an array (list) into a string.
• # initialize an array
• @seq = ("aaaaaa","tttttt","cccccc","ggggggg");
• $CombinedSeq = join '', @seq;
• Result of the join is:
• aaaaaattttttccccccggggggg
• See JoinExample.pl
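A short sketch of join with and without a separator, using the same @seq array as the slide. The variable names $combined, $csv and @round_trip are illustrative, and the round trip through split is added only to show that the two functions undo each other when the same delimiter is used.

```perl
use strict;
use warnings;

# Same array as in the slide
my @seq = ("aaaaaa", "tttttt", "cccccc", "ggggggg");

my $combined = join '', @seq;    # empty separator: elements are glued together
my $csv      = join ',', @seq;   # comma separator between elements

# split with the same delimiter recovers the original elements
my @round_trip = split /,/, $csv;

print "$combined\n";
print "$csv\n";
print scalar(@round_trip), " elements after the round trip\n";
```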

Concatenation
• To concatenate two strings, use the . operator; the .= operator appends a string to the variable on its left.
• Start with an empty (null) string: $seq = "";
• We can add (concatenate) a sequence to it with:
• $seq .= $input_seq2;
• This can be used to read in sequences and join them together so that they form one string.
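The idea of building one string from several input lines can be sketched as follows. This assumes the lines are already available in a list; the helper name concat_lines and the sample bases are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build one string from a list of sequence lines.
sub concat_lines {
    my @lines = @_;
    my $seq = "";               # start with an empty string
    for my $line (@lines) {
        chomp $line;            # drop the trailing newline, if any
        $seq .= $line;          # append this line to the growing string
    }
    return $seq;
}

my $seq = concat_lines("ATGGCC\n", "TTAGGA\n", "CCT\n");
print "$seq\n";                 # prints ATGGCCTTAGGACCT
```

The chomp step matters: without it, each newline would be appended too and the result would not be a single unbroken sequence.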

Extracting substrings
• substr: a function for extracting a substring from a string.
• Assume the string is: AAAAGGGGCCCCTTTT
• To extract the sequence AGG (a codon) from the string we need to:
  • Move to the 4th character of the string (offset 3, because the offset is zero-based).
  • Extract 3 characters, i.e. a 3-character substring.
• The syntax of the Perl substr (substring) function:
• $sub = substr($string, OFFSET, LENGTH); # OFFSET: position at which extraction begins; LENGTH: size of the substring
• The offset is zero-based.
• More details on substr can be found at: http://perlmeme.org/howtos/perlfunc/substr.html
• Extract words from a sentence: Substring.pl
• Extract a codon from a DNA sequence: substring.pl
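The AGG example above can be sketched as follows. The variable names are illustrative; the second and third calls show two standard forms of substr (omitted length and negative offset) that the slide does not cover.

```perl
use strict;
use warnings;

my $dna = "AAAAGGGGCCCCTTTT";

# Offset 3 is the 4th character (offsets are zero-based); length 3 gives one codon.
my $codon = substr($dna, 3, 3);

# Omitting the length returns everything from the offset to the end of the string.
my $tail = substr($dna, 12);

# A negative offset counts back from the end of the string.
my $last3 = substr($dna, -3);

print "$codon $tail $last3\n";   # prints AGG TTTT TTT
```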

Perl functions for determining the ORFs of DNA sequences
• The unpack function: a built-in Perl function that extracts sets of characters from a sequence of characters and assigns them to an array.
• It can therefore be used to extract groups of 3 bases from a DNA sequence (e.g. for open reading frames) and assign each group to an element of an array.
• @triplets = unpack("a3" x (length($line)/3), $line);
• To determine all possible open reading frames (ORFs) of a DNA sequence (reading frame 1, reading frame 2 and reading frame 3), shift by one base when going from reading frame 1 to reading frame 2, and by one more base when going from reading frame 2 to reading frame 3.
• Frame shift (1 position to the right):
• @triplets = unpack('a1' . "a3" x (length($line)/3), $line); # the first element of @triplets is the single skipped base
• Remember: if only 1 or 2 characters remain at the end (or the beginning) of a sequence, unpack will still assign them to an element of the array. If the codons are looked up in a hash table, an exists check may be required.
• See Unpack_codons.pl (run it to show the output)
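A minimal sketch of extracting the codons for all three reading frames with unpack. It differs from the slide's template in two ways, both of which are assumptions made for this illustration: it uses the "x" template character to skip the leading bases instead of returning them with 'a1', and it keeps complete codons only. The helper name codons_in_frame and the sample sequence are hypothetical and are not taken from Unpack_codons.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the complete codons of $seq in reading frame 1, 2 or 3.
sub codons_in_frame {
    my ($seq, $frame) = @_;
    my $skip = $frame - 1;                          # bases to skip: 0, 1 or 2
    return () if length($seq) < $skip + 3;          # no complete codon available
    my $count = int((length($seq) - $skip) / 3);    # number of complete codons
    # "x<n>" skips n bytes without returning them; each "a3" returns 3 characters
    my $template = ($skip ? "x$skip" : "") . ("a3" x $count);
    return unpack($template, $seq);
}

my $dna = "ATGGCCTTAG";
for my $frame (1 .. 3) {
    my @codons = codons_in_frame($dna, $frame);
    print "Frame $frame: @codons\n";
}
```

Because incomplete trailing codons are dropped here, every returned element has exactly 3 characters, which avoids failed lookups if the codons are later used as keys in a codon table.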

Sample Exercise
• Write a script that reads in the contents of a FASTA file (without the descriptor line) and prints it out as a single string containing all the DNA bases or amino acids.
• Modify the unpack example so that it uses substr instead of unpack.
