180 likes | 284 Views
Lecture 7: . P erl pattern handling features. Pattern Matching. Recall =~ is the pattern matching operator A first simple match example print “An methionine amino acid is found ” if $AA =~ / m /; It means if $AA (string) contains the m then print methionine amino acid found.
E N D
Lecture 7: Perl pattern handling features
Pattern Matching • Recall =~ is the pattern matching operator • A first simple match example • print “An methionine amino acid is found ” if $AA =~ /m/; • It means if $AA (string) contains the m then print methionine amino acid found. • What is inside the / / is the pattern and =~ is the pattern matching symbol • It could also be written as • if ($dna =~ /m/) • { • print “An methionine amino acid is found ”; • } • Met.pl
Pattern Matching • If we want to check for the start codon we could use: • if ($seq=~ /ATG/ ) • { • Print “a start codon was found on line number\n” } • Or could write if /ATG / i (where I stands for case) • if we want to see if there is an A or T or G or C in the sequence use: $seq =~ /[ATGC]/ • The main way to use the Boolean OR is • If ( $dna =~ /GAATTC|AAGCTT/) | (Boolean Or symbol) • { • Print “EcoR1 site found!!!”; • } • (note EcoR1 is an important DNA sequence)
Sequence size example • File_size_2 example • #!/usr/bin/perl • # file size2.pl • $length = 0; $lines = 0; • while (<>) { • chomp; • $length = $length + length $_ if $_ =~ /[GATCNgatcn]/; # n refers to any nucelotide • #{refer to http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml} • $lines = $lines + 1; • } • print "LENGTH = $length\n"; print "LINES = $lines\n"; • The above is a modification of the length of the file example to include only files that have G or A or T or C in the input line. • However this will lead to problems for FASTA files as the descriptor line will be included: Why?
Pattern Matching • A NOT Boolean operator such as to see if the pattern contains letters that are not vowels can be represented via pattern handling by using the ^ symbol and a set of characters: e.g. • If ($seq =~ /[^aeiou]/ {print “no vowel”}; • More flexible pattern syntax: • Quite common to check for words or numbers so perl has represented as: • /[0-9]/ or/ \d/ is adigit • A word character is /[a-zA-Z0-9_]/ and is represented by /\w/ (word) • / \s/ represents a white space • By invert the case of the letter it has the reverse meaning; e.g. /\S/ (non white space) • A more complete list of what are referred to as “metacharacters” is shown in the next slide (you must of course use =~ in expression)
Pattern matching: metacharacters • Metacharacter Description • . Any character except newline • \. Full stop character • ^ The beginning of a line • $ The end of a line • \w Any word character (non-punctuation, non-white space) • \W Any non-word character • \s White space (spaces, tabs, carriage returns) • \S Non-white space • \d Any digit • \D Any non-digit • You can also specify the number of times [ single, multiple or specific multiple] • More information on metacharacters here: metacharacters and other regular expresions note (abc) \1 \2 are important for comparing sets of characters).
Pattern matching: Quantifiers • Quantifier Description • ? 0 or 1 occurrence • + 1 or more occurrences • * 0 or more occurrences • {N} n occurrences • {N,M} Between N and M occurrences • {N, } At least N occurrences • { ,M} No more than M occurrences
Pattern matching: Quantifiers • Consider the following pattern • DT249 4 (your class code) consists of [one or more word characters; then a space and then a digit so the match is: • { =~/\w+\s\d/ } • If the sequence has the following format: • Pu-C-X(40-80)-Pu-C • Pu [AG] and X[ATGC] • $sequence =~ /[AG]C[GATC]{40,80}[AG]C/; Quantify.pl
Pattern Matching • To determine where to look for a “pattern” in a sequence: • Anchors • The start of line anchor ^ {note it is like the Boolean not operator but it is within [^aeiou]} • /^>/ only those beginning with > • The end of line character $ • />$/ only where the last character is > • /^$/ : what does this mean? • The boundary anchor \b • E.g. Matching a word exactly: • /\bword\b/ where \b boundary: just looks for “word” and not a sequence of the letters such as w o r and d • The non boundary anchor is \B • /\Bword\B/ look for words like unworthy, trustworthy….. But not worthy or word
Sequence Size example: modified • File_size_2 example • #!/usr/bin/perl • # file size2.pl • $length = 0; $lines = 0; • while (<>) { • chomp; • $length = $length + length $_ if $_ =~ /[GATCNgatcn]+$/; • #Alternative: $length += length if /^[GATCN]+$ / i; • $lines = $lines + 1; • } • print "LENGTH = $length\n"; print "LINES = $lines\n"; • Refer to DNA sequence codes to see meaning of A…N
Extracting Patterns • The second aspect of Perl pattern handling is: • Pattern extraction: • Consider a sequence like • > M185580, clone 333a, complete sequence • M18… is the sequence ID • Clone 33a, com…. : optional comments • Need to stored some of elements of the descriptor line: • $seq =~/ ( \S+)/ part of the match is extracted and put into variable $1;
Extracting patterns • #! /usr/bin/perl –w • # demonstrates the effect of parentheses. • while ( my $line = <> ) • { • $line =~ /\w+ (\w+) \w+ (\w+)/; • print "Second word: '$1' on line $..\n" if defined $1; • print "Fourth word: '$2' on line $..\n" if defined $2; • } • Change it to catch the first and the 3 word of a sentence • More examples in ExtractExample1.pl
Search/replace and trans-literial • s/t/u/ replace (t)thymine with (u) Uracil; once only • s/t/u/g (g = global) so scan the whole string • s/t/u/gi (global and case insensitive) • What about the following : • s/^\s+// • s/\s+$// • s/\s+/ /g (where g stands for global) • The transliteration search and replace function • $seq =~ tr/ATGC/TACG/; gets the compliment of a string of characters. (the normal search and replace works in a different way to the tr function) • Refer to SearchReplace.pl
Search /replace/extract • Write a program that • removes the > from the FASTA line descriptor and assigns each element to appropriate variables. • Example Fastafile_replace.txt • >gi|171361, Saccharomycescerevisiae, cystathionine gamma-lyase • GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC • GCTACAGAGCCAACCCGGTGGACAAACTCGAAGTCATTGTGGACCGAATGAGGCTCAATAACGAGATTAGCG • ACCTCGAAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGCTTAGAGACTCCG • AAATCAACGACGACTTCCACCAGTGGGCCCAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTT • ATTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCATTCACGTGATCTCA • GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAATCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTT • ATTTAAAGCATAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTACAATTGAGG • CCTATACACATAGACATTTGCACCTTATACATATAC
Exercises • Write a script that: • Confirms if the user has input the code in the following format: • Classcode_yearcode(papercode) • E.g dt249 4(w203c) • Many important DNA sequences have specific patters; e.g. TATA write a script to find the position of this sequence in a FASTA file sequence.
Exercises • Write a script that can find the reverse complement of an DNA sequence without using the trfunction. (Hint: a global search and replace will give an incorrect answer) • Coding regions begin win the AUG (ATG) codon and end with a stop codons. Write a perl script that extract a coding sequence from a FASTA file.
Exercise • Modify the Sequence size example from earlier to: • Allow the user to input a file name and determine its length.
Exam Questions • Perl is a important bioinformatics language. Explain the main features of perl that make it suitable for bioinformatics (10 marks) • Write a perl script that illustrates its pattern matching extraction and substitution ability. (6 marks) • (refer to assignment/previous papers perl scripts)