Bioinformatics

Bioinformatics Lecture 8: Perl pattern matching features

Questions to think about
• Create a hash table that performs the codon-to-amino-acid conversion, and use it to convert codons (entered from the keyboard) into their corresponding amino acids
• Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file

Questions to think about
• Write a script that reads in the DNA sequences from two FASTA files (assume both sequences have the same length) and determines the number of aligned positions that match versus those that do not

Introduction
• Pattern matching
• Pattern extraction
• Pattern substitution
• Split and join functions
• Unpack function

Pattern Matching
• Recall that =~ is the pattern matching (binding) operator
• A first simple match example:
    print "EcoRI site found!" if $dna =~ /GAATTC/;
• It means: if the string $dna contains the pattern GAATTC (the EcoRI recognition site), print "EcoRI site found!". What is between the two slashes is the pattern, and =~ is the pattern matching symbol
• More patterns:
    if ($dna =~ /[GATCgatc]/)        # contains at least one base letter, either case
    if (/^[GATC]/i)                  # $_ starts with a base letter; i ignores case
    if ($dna =~ /GAATTC|AAGCTT/) {   # | is the Boolean "or" symbol (alternation)
        print "EcoRI or HindIII site found!";
    }
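The matching operator can be tried out with a short script. This is a minimal sketch, not code from the lecture: the helper names has_ecori_site and sites_found and the test sequences are made up for illustration. It assumes the EcoRI recognition sequence GAATTC and the HindIII recognition sequence AAGCTT.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the lecture): 1 if $dna contains GAATTC.
sub has_ecori_site {
    my ($dna) = @_;
    return $dna =~ /GAATTC/i ? 1 : 0;
}

# Hypothetical helper: names of the sites (EcoRI, HindIII) present in $dna.
sub sites_found {
    my ($dna) = @_;
    my @found;
    push @found, 'EcoRI'   if $dna =~ /GAATTC/i;
    push @found, 'HindIII' if $dna =~ /AAGCTT/i;
    return @found;
}

my $dna = 'ccgGAATTCtta';    # made-up test sequence
print "EcoRI site found!\n" if has_ecori_site($dna);

my @sites = sites_found('ttAAGCTTgg');
print "Sites: @sites\n";
```

The i modifier makes both helpers accept upper-case and lower-case sequence data.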

Pattern Matching
• A more flexible pattern:
    print "EcoRI site found!" if $dna =~ /GAA[GATC]TTC/;
• Here the 4th letter can be any one of the letters within the square brackets
• [GATC] means any single character that is G, A, T or C; [^GATC] means any character other than G, A, T or C
• [0-9] or \d (digit), [a-z], [A-Z], /[AT][GC][TG]/
• /[a-zA-Z0-9_]/ or /\w/ (word character)
• /\s/ (white space); to invert a class, uppercase the letter: \S (non-white space)
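Character classes and their negation can be illustrated as follows. This is a sketch under stated assumptions: the helper names is_plain_dna and has_relaxed_site and the sample strings are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $seq is non-empty and contains no character
# outside the set G, A, T, C (either case). Uses the negated class [^GATC].
sub is_plain_dna {
    my ($seq) = @_;
    return 0 if $seq eq '';
    return $seq =~ /[^GATC]/i ? 0 : 1;
}

# Hypothetical helper: 1 if $seq contains GAA, then any one base, then TTC.
sub has_relaxed_site {
    my ($seq) = @_;
    return $seq =~ /GAA[GATC]TTC/ ? 1 : 0;
}

for my $seq ('GAATTTC', 'GAANTTC', 'gattaca', 'GAT TACA') {
    printf "%-10s plain=%d relaxed_site=%d\n",
        $seq, is_plain_dna($seq), has_relaxed_site($seq);
}
```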

Pattern matching: metacharacters
Metacharacter   Description
.               Any character except newline
\.              A literal full stop character
^               The beginning of a line
$               The end of a line
\w              Any word character (letter, digit or underscore)
\W              Any non-word character
\s              White space (spaces, tabs, carriage returns)
\S              Non-white space
\d              Any digit
\D              Any non-digit
• You can also specify how many times a pattern element may occur (once, many times, or a specific number of times)
• More information on variations of metacharacters: see the Perl regular expression documentation (perldoc perlre)

Pattern matching: Quantifiers
Quantifier   Description
?            0 or 1 occurrence
+            1 or more occurrences
*            0 or more occurrences
{N}          Exactly N occurrences
{N,M}        Between N and M occurrences
{N,}         At least N occurrences
{0,M}        No more than M occurrences (the short form {,M} is only accepted by Perl 5.34 and later)

Pattern matching: Quantifiers
• Pattern to match the following format: M58200.2 ( =~ /\w+\.\d+/ )
• If the sequence motif is: Pu-C-X(40-80)-Pu-C
• where Pu (purine) is [AG] and X (any base) is [ATGC]:
    $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
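Both quantifier patterns can be checked against small test strings. The following is a sketch, not the lecture's own code: the helper names and the synthetic sequences (built with the x repetition operator) are assumptions made for illustration. The accession pattern is anchored with ^ and $ here so that the whole string must have the expected format.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $id looks like word characters, a dot, then digits.
sub looks_like_accession {
    my ($id) = @_;
    return $id =~ /^\w+\.\d+$/ ? 1 : 0;
}

# Hypothetical helper: 1 if $seq contains the motif Pu-C-X(40-80)-Pu-C.
sub has_motif {
    my ($seq) = @_;
    return $seq =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;
}

# Synthetic sequences: AC, a run of T of the given length, then GC.
for my $gap (10, 50, 100) {
    my $seq = 'AC' . ('T' x $gap) . 'GC';
    printf "gap of %3d bases: motif %s\n", $gap,
        has_motif($seq) ? 'found' : 'not found';
}
print "M58200.2 is an accession\n" if looks_like_accession('M58200.2');
```

Only the 50-base gap falls inside the {40,80} range, so only that sequence contains the motif.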

Extracting pattern to variables
• Anchors
• E.g. matching a word exactly:
• /\bword\b/ (\b is a word boundary: this matches "word" only as a whole word, not the letters w, o, r, d inside a longer word)
• The start-of-line anchor ^
• /^>/ matches only lines beginning with >
• The end-of-line anchor $
• />$/ matches only lines whose last character is >
• /^$/ : what does this mean?
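The anchors can be explored with a small line classifier. This is an illustrative sketch: classify_line and mentions_word are hypothetical helper names, and the category labels are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: label a line using the ^ and $ anchors.
sub classify_line {
    my ($line) = @_;
    $line =~ s/\r?\n$//;                  # drop a trailing line ending
    return 'empty'  if $line =~ /^$/;     # nothing between start and end
    return 'header' if $line =~ /^>/;     # first character is >
    return 'other';
}

# Hypothetical helper: 1 if $word occurs in $text as a whole word.
sub mentions_word {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

print classify_line(">M58200.2 some description\n"), "\n";
print classify_line("\n"), "\n";
print classify_line("GATTACA\n"), "\n";
print mentions_word('the word is here', 'word'), "\n";
print mentions_word('swordfish', 'word'), "\n";
```

The \Q...\E pair quotes any regex metacharacters in the interpolated word, so it is matched literally.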

Further examples
• File_size_base_only.pl example
    #!/usr/bin/perl
    # file size2.pl
    $length = 0; $lines = 0;
    while (<>) {
        chomp;
        # add the line length only if the whole line consists of base letters,
        # so that descriptor lines are not counted as sequence
        $length = $length + length $_ if $_ =~ /^[GATCNgatcn]+$/;
        # Alternative: $length += length if /^[GATCN]+$/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n"; print "LINES = $lines\n";

FASTA files
Write and test (file_size_bases_only.pl) using a FASTA file as input.

FASTADNA1.txt: example of FASTA file
>2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

Sample of file in EMBL format:
gccacagattacaggaagtcatatttttagacctaaatcactatcctctatctttcagca 60
agaaaagaacatctacttggtttcgttccctatccaagattcagatggtgaaacgagtga 120
tcatgcacctgatgaacgtgcaaaaccacagtcaagccatgacaaccccgatctacagtt 180
tgatgttgaaactgccgattggtacgcctacagtgaaaactatggcacaagtgaagaaaa 240

Sample of an NCBI record format:
1 atgaaccccaacctgtgggtcgacgcgcagagcacttgcaagagggaatgcgacgctgac
61 ctggagtgcgagacctttgagaagtgctgccccaatgtctgtggaaccaagagctgtgtg
121 gctgctcggtacatggacatcaaggggaagaaggggcctgtggggatgcccaaagaggca
181 acctgtgaccgcttcatgtgcatccagcaaggctcagagtgcgacatctgggacgggcag
241 cctgtctgcaagtgcaaggacaggtgtgagaaggagccgagctttacctgcgcctcggac
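The three layouts differ only in descriptor lines and position numbers, so one way to count bases in any of them is to strip digits and white space before measuring each line. The following is a sketch of that idea, not the lecture's solution; count_bases is a hypothetical helper and the sample lines are shortened, made-up data.

```perl
use strict;
use warnings;

# Hypothetical helper: total number of base letters in a list of lines.
# Descriptor lines (starting with >) are skipped; position numbers and
# white space are removed before counting.
sub count_bases {
    my (@lines) = @_;
    my $total = 0;
    for my $line (@lines) {
        next if $line =~ /^>/;                   # FASTA descriptor line
        (my $bases = $line) =~ s/[\s\d]+//g;     # remove spaces and numbers
        next unless $bases =~ /^[GATCN]+$/i;     # keep pure sequence lines
        $total += length $bases;
    }
    return $total;
}

my @sample = (
    ">id1 made-up descriptor\n",
    "GATTACA\n",               # FASTA style
    "gattaca 60\n",            # EMBL style: trailing position number
    "61 gattaca gattaca\n",    # NCBI style: leading position number
);
print "Bases: ", count_bases(@sample), "\n";
```

Real EMBL and NCBI records contain further annotation lines; this sketch assumes the input has already been reduced to the sequence lines shown in the samples.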

Extracting Patterns
• Consider a sequence descriptor line like
    >M185580 clone 333a, complete sequence
• > marks the start of the descriptor line
• M18… is the sequence ID
• Clone 33a, com…. : optional comments
• We need to store some elements of the descriptor line:
• =~ /^>(\S+)/ : the part of the match inside the parentheses (here the sequence ID) is extracted and put into the variable $1
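Two capture groups are enough to separate the ID from the optional comments. This is a minimal sketch, assuming the ID is the first run of non-white-space characters after the > sign; parse_header is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (id, comments) for a FASTA descriptor line,
# or an empty list if the line does not start with >.
sub parse_header {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^>(\S+)\s*(.*)$/) {
        return ($1, $2);     # $1 = sequence ID, $2 = rest of the line
    }
    return;
}

my $header = ">2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1\n";
my ($id, $comments) = parse_header($header);
print "ID:       $id\n";
print "Comments: $comments\n";
```

Testing the match inside an if statement avoids reading $1 and $2 when the pattern did not match.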

Extracting patterns
    #!/usr/bin/perl -w
    # demonstrates the effect of parentheses.
    while ( my $line = <> )
    {
        # report the captures only when the match succeeds; after a failed
        # match $1 and $2 still hold the values from an earlier line
        if ( $line =~ /\w+ (\w+) \w+ (\w+)/ )
        {
            print "Second word: '$1' on line $..\n";
            print "Fourth word: '$2' on line $..\n";
        }
    }
• Change it to catch the first and the third words of a sentence
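Captures can also be collected directly by evaluating the match in list context, which returns the contents of the parentheses. The sketch below shows one possible way to pick out the first and third words; first_and_third is a hypothetical helper, and the pattern assumes words separated by white space with no punctuation in between.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first and third words of $line, or an
# empty list when the line has fewer than three plain words at its start.
sub first_and_third {
    my ($line) = @_;
    # In list context a successful match returns ($1, $2).
    my ($first, $third) = $line =~ /^\s*(\w+)\s+\w+\s+(\w+)/;
    return defined $first ? ($first, $third) : ();
}

my @words = first_and_third('Perl is an important language');
print "First: $words[0], third: $words[1]\n" if @words;
```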

Search and replace
• s/t/u/ replaces thymine (t) with uracil (u), once only
• s/t/u/g (g = global) scans the whole string and replaces every t
• s/t/u/gi (global and case-insensitive; note that an uppercase T is also replaced by a lowercase u)
• What about the following?
• s/^\s+//
• s/\s+$//
• s/\s+$/ /g (where g stands for global)
• Write a Perl script that reads in the DNA sequences from FastaDNA1file.txt and replaces all the thymine bases with the corresponding uracil bases
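The substitution operator can be wrapped in small helpers to see its effect. This is a sketch with hypothetical names (to_rna, trim); it uses two case-specific substitutions instead of the i modifier so that the case of each base is preserved.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every T with U and every t with u.
sub to_rna {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;    # copy first, then edit the copy
    $rna =~ s/t/u/g;
    return $rna;
}

# Hypothetical helper: remove leading and trailing white space.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return $text;
}

print to_rna('GATtaca'), "\n";
print '[', trim("   acgt \n"), "]\n";
```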

Splits and joins
• To transform strings into arrays: split
• Line 1 looks like:
    192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC
• Consider the following code:
    chomp($line = <>);    # read the line into $line
    @fields = split ',', $line;
    ($clone,$laboratory,$left_oligo,$right_oligo) = split ',', $line;
• This reads in line 1 and puts each part between the delimiters (e.g. 192a8) into an element of the array
• To transform arrays (lists) into strings: join
    $tab = join "\t", @fields;
    192a8    The Sanger Centre    GGGTTCCGATTTCCAA    CCTTAGGCCAAATTAAGGCC
    # initialize an array
    my @perlFunc = ("substr","grep","defined","undef");
    my $perlFunc = join " ", @perlFunc;
    print "Perl Functions: $perlFunc\n";
• See example split_file.pl
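Split and join combine naturally into a delimiter converter. The sketch below is illustrative only; comma_to_tab is a hypothetical helper, and it assumes that no field contains a comma of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one comma-separated line into a
# tab-separated string.
sub comma_to_tab {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line;    # string -> list
    return join "\t", @fields;        # list -> string
}

my $line = "192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC\n";
my ($clone, $laboratory, $left_oligo, $right_oligo) = split /,/, $line;
print "Clone $clone comes from $laboratory\n";
print comma_to_tab($line), "\n";
```

Note that split discards trailing empty fields by default, so a line ending in commas yields fewer elements than it has delimiters.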

Other useful functions
• Unpack syntax:
    @triplets = unpack("a3" x (length($line)/3), $line);
• Frame shift (1 position to the right):
    @triplets = unpack('a' . "a3" x (length($line)/3), $line);
  (the leading 'a' consumes one base, so the first element of @triplets is that single base and the shifted triplets follow it)
• Unpack_codons.pl
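Unpack pairs well with a hash for codon-to-amino-acid lookup. The following is a sketch under stated assumptions: the helper names codons and translate are hypothetical, the hash holds only a handful of codons from the standard genetic code (a full table has 64 entries), and codons missing from it are reported as X. The x template code is used here to skip bases for a frame shift instead of capturing them.

```perl
use strict;
use warnings;

# Deliberately partial codon table (assumption: standard genetic code).
my %aa_for = (
    ATG => 'M', TGG => 'W', GCA => 'A', GCG => 'A',
    CAC => 'H', GAC => 'D', AGC => 'S', CAG => 'Q',
    CGC => 'R', TAA => '*', TAG => '*', TGA => '*',
);

# Hypothetical helper: list of complete codons, starting $frame bases in.
sub codons {
    my ($dna, $frame) = @_;
    $frame = 0 unless defined $frame;
    my $count = int((length($dna) - $frame) / 3);
    return () if $count < 1;
    my $template = ($frame ? "x$frame" : '') . ('a3' x $count);
    return unpack($template, $dna);
}

# Hypothetical helper: one-letter translation; unknown codons become X.
sub translate {
    my ($dna, $frame) = @_;
    return join '',
        map { exists $aa_for{$_} ? $aa_for{$_} : 'X' } codons(uc $dna, $frame);
}

my $dna = 'GCAGCGCAC';
print "Codons:  @{[ codons($dna) ]}\n";
print "Protein: ", translate($dna), "\n";
print "Frame 1: ", translate($dna, 1), "\n";
```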

Questions
• Modify file_bases_size_only.pl to count the number of bases for a file in EMBL format and for one in NCBI format
• Using FASTADNA1.txt: extract the sections of the descriptor line into appropriate scalar variables
• Assuming the DNA sequence in FastaDNA1file.txt is the complementary (anti-sense) strand, print the mRNA produced when the primary strand (sequence) is transcribed

Exam Questions
• Perl is an important bioinformatics language. Explain the main features of Perl that make it appealing to the field of bioinformatics.
• Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file
• Write a Perl script that reads and prints only the DNA sequences from a FASTA file.
• Write a script that reads in the DNA sequences from two FASTA files (assume both sequences have the same length) and determines the number of aligned positions that match versus those that do not

FastaDNA1file.txt
• Write a script that reads in the DNA sequences from two FASTA files (assume both sequences have the same length) and illustrates the number of aligned positions that match versus those that do not.
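The core of such a comparison, once the two sequences have been read into strings, is a position-by-position check. The sketch below is one possible approach, not a model answer from the lecture; match_counts is a hypothetical helper, the comparison ignores case, and file reading is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (matches, mismatches) for two sequences
# of equal length, comparing them one position at a time.
sub match_counts {
    my ($seq1, $seq2) = @_;
    die "Sequences differ in length\n" unless length($seq1) == length($seq2);
    my ($matches, $mismatches) = (0, 0);
    for my $i (0 .. length($seq1) - 1) {
        if (uc substr($seq1, $i, 1) eq uc substr($seq2, $i, 1)) {
            $matches++;
        }
        else {
            $mismatches++;
        }
    }
    return ($matches, $mismatches);
}

my ($same, $different) = match_counts('GATTACA', 'GACTATA');
print "Matches: $same, non-matches: $different\n";
```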
