1 / 28

Sequence Alignment

Sequence Alignment. Topics: Introduction Exact Algorithm Alignment Models BioPerl functions. Sequence Alignment. Motivation :. Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for s imilarities.

Download Presentation

Sequence Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequence Alignment • Topics: • Introduction • Exact Algorithm • Alignment Models • BioPerl functions

  2. Sequence Alignment Motivation: • Storing,retrieving and comparing DNA sequences in Databases. • Comparing two or more sequences for similarities. • Searching databases for related sequences and subsequences. • Exploring frequently occurring patterns of nucleotides. • Finding informative elements in protein and DNA sequences. • Various experimental applications (reconstruction of DNA, etc.)

  3. ? Similar 3D structure ? Similar sequences produce similar proteins Seq.Align. Protein Function Gene1 Gene2 More than 25% sequence identity ? Similar function

  4. Alignment - inexact matching • Substitution - replacing a sequence base by another. • Insertion - an insertion of a base (letter) or several bases to the sequence. • Deletion - deleting a base (or more) from the sequence. • (Insertion and deletion are the reverse of one another)

  5. Seq. Align. Score Commonly used matrices: PAM250, BLOSUM64

  6. Local Alignment Local Alignment INPUT:Two sequences S and T . QUESTION:What is the maximum similarity between a subsequence of S and a subsequence of T ? Find most similar subsequences.

  7. The IDEA s[1…n] t[1…m] To aligns[1...i]witht[1…j]we have three choices: * align s[1…i-1] with t[1…j-1] and match s[i] with t[j] * align s[1…i] with t[1…j-1] and match a space with t[j] * align s[1…i-1] with t[1…j] and match s[i] with a space s[1…i-1] i t[1…j-1] j s[1… i ] - t[1…j-1] j s[1…i-1] i t[1… j ] -

  8. Local Alignment Smith-Waterman 1981 *Penalties should be negative*

  9. s: xxxcde t: abcxdex xxxcde- - - -- - -abcxdex cxde c-de match 2 mismatch -1 Local alignment

  10. Sequence Alignment Complexity: Time O(n*m) Space O(n*m) (exist algorithm with O(min(n,m)))

  11. Global Alignment Global Alignment INPUT:Two sequences S and T of roughly the same length. QUESTION:What is the maximum similarity between them? Find one of the best alignments.

  12. Global Alignment Needleman-Wunsch 1970 Alignment Score: V(n,m)

  13. Ends free alignment Ends free alignmentINPUT:Two equences S and T (possibly of different length).QUESTION:Find one of the best alignments betweensubsequences ofS and Twhen at least one of these subsequences is a prefix of theoriginal sequence and one (not necessarily theother) is a suffix. or

  14. Ends free alignment m n

  15. Gap Alignment Definition:A gap is the maximal contiguous run of spaces in a single sequence within a given alignment.The length of a gap is the number of indel operations on it. A gap penalty function is a function that measure the cost of a gap as a (nonlinear)function of its length. Gap penaltyINPUT:Two sequences S and T (possibly of differentlength).QUESTION:Find one of the best alignments between the two sequencesusing the gap penalty function. Affine Gap: Wtotal = Wg + qWs Wg – weight to open the gap Ws – weight to extend the gap

  16. BioPerl “Bioperl is a collection of perl modules that facilitate the development of perl scripts for bio-informatics applications.” Bioperl is open source software that is still under active development. www.bioperl.org Tutorial Documentation

  17. BioPerl • Accessing sequence data from local and remote databases • Transforming formats of database/ file records • Manipulating individual sequences • Searching for "similar" sequences • Creating and manipulating sequence alignments • Searching for genes and other structures on genomic DNA • Developing machine readable sequence annotations

  18. BioPerl library at TAU BioPerl is NOT yet installed globally on CS network. In each script you should add the following two lines: use lib "/a/netapp/vol/vol0/home/silly6/mol/lib/BioPerl/lib"; use lib "/a/netapp/vol/vol0/home/silly6/mol/lib/BioPerl/lib/i686-linux";

  19. Sequence Object Seq – stores sequence, identification labels (id, accession number, molecule type = DNA, RNA, Protein, …), multiple annotations and associated “sequence features”.

  20. Sequence Object • $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact', • '-desc'=>'Sample Bio::Seq object', • '-display_id' => 'something', • '-accession_number' => 'accnum', • '-moltype' => 'dna' ); Usually Seq isnot created this way.

  21. Sequence Object $seqobj->display_id(); # the human read-able id of the sequence $seqobj->seq(); # string of sequence $seqobj->subseq(5,10); # part of the sequence as a string $seqobj->accession_number(); # when there, the accession number $seqobj->moltype(); # one of 'dna','rna','protein' $seqobj->primary_id(); # a unique id for this sequenceirregardless # of its display_id or accession number

  22. Accessing Data Base • Databases: genbank, genpept, swissprotandgdb. • $gb = new Bio::DB::GenBank(); • $seq1 = $gb->get_Seq_by_id('MUSIGHBA1'); • $seq2 = $gb->get_Seq_by_acc('AF303112')); • $seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112 2981014)]));

  23. Seq module use Bio::DB::GenBank; $gb = new Bio::DB::GenBank(); $seq1 = $gb->get_Seq_by_acc('AF303112'); $seq2=$seq1->trunc(1,90); print $seq2->seq(), "\n"; $seq3=$seq2->translate; print $seq3->seq(), “\n“; ATGGAGCCCAAGCAAGGATACCTTCTTGTAAAATTGATAGAAGCTCGCAAGCTAGCATCTAAGGATGTGGGCGGAGGGTCAGATCCATAC MEPKQGYLLVKLIEARKLASKDVGGGSDPY

  24. SeqIO object $seq = $gb->get_Seq_by_acc('AF303112')); $out = Bio::SeqIO->new('-file' => ">f.fasta", '-format' => 'Fasta'); $out->write_seq($seq); SeqIO can read/write/transform data in the following formats : Fasta, EMBL. GenBank, Swissprot, PIR, GCG, SCF, ACE, BSML

  25. Transforming Sequence Files $in = Bio::SeqIO->new('-file' => “f.fasta", '-format' => 'Fasta'); $out = Bio::SeqIO->new('-file' => ">f.embl", '-format' => ‘EMBL'); $out->write_seq($in->next_seq()); #for several sequences while ( my $seq = $in->next_seq() ) { $out->write_seq($seq); } #better$in = Bio::SeqIO->new('-file' => “f.fasta", '-format' => 'Fasta'); $out = Bio::SeqIO->new('-file' => ">f.embl", '-format' => ‘EMBL'); # Fasta<->EMBL format converter:print $out $_ while <$in>;

  26. BioPerl: Pairwise Sequence Alignment Smith-Waterman Algorithm use Bio::Tools::pSW; $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla', '-gap' => 12, '-ext' => 2, ); $factory->align_and_show($seq1, $seq2, STDOUT); Currently works only on protein sequences.

  27. Alignment Objects SimpleAlignhandles multiple alignments of sequences. #pSW module $aln = $factory->pairwise_alignment($seq1, $seq2); foreach $seq ( $aln->eachSeq() ) { print $seq->seq(), "\n"; } $alnout = Bio::AlignIO->new(-format => 'fasta', -fh => \*STDOUT); $alnout->write_aln($aln);

  28. Homework • Write a cgi script (using Perl) that performs pairwise Local/Global Alignment for DNA sequences. All I/O is via HTML only. • Input: • Choice for Local/Global alignment. • Two sequences – text boxes or two accession numbers. • Values for match, mismatch, ins/dels. • Number of iterations for computing random scores. • Output: • Alignment score. • z-score value (z= (score-average)/standard deviation.) Remarks: 1) you are allowed to use only linear space. 2)To compute z-score perform random shuffling: srand(time| $$); #init, $$-proc.id int(rand($i)); #returns rand. number between [0,$i]. 3)Shuffling is done in windows (non-overlapping) of 10 bases length. Number of shuffling for each window is random [0,10].

More Related