Sequence Alignment

Sequence Alignment • Topics: • Introduction • Exact Algorithm • Alignment Models • BioPerl functions

Sequence Alignment Motivation: • Storing,retrieving and comparing DNA sequences in Databases. • Comparing two or more sequences for similarities. • Searching databases for related sequences and subsequences. • Exploring frequently occurring patterns of nucleotides. • Finding informative elements in protein and DNA sequences. • Various experimental applications (reconstruction of DNA, etc.)

? Similar 3D structure ? Similar sequences produce similar proteins Seq.Align. Protein Function Gene1 Gene2 More than 25% sequence identity ? Similar function

Alignment - inexact matching • Substitution - replacing a sequence base by another. • Insertion - an insertion of a base (letter) or several bases to the sequence. • Deletion - deleting a base (or more) from the sequence. • (Insertion and deletion are the reverse of one another)

Seq. Align. Score Commonly used matrices: PAM250, BLOSUM64

Local Alignment Local Alignment INPUT:Two sequences S and T . QUESTION:What is the maximum similarity between a subsequence of S and a subsequence of T ? Find most similar subsequences.

The IDEA s[1…n] t[1…m] To aligns[1...i]witht[1…j]we have three choices: * align s[1…i-1] with t[1…j-1] and match s[i] with t[j] * align s[1…i] with t[1…j-1] and match a space with t[j] * align s[1…i-1] with t[1…j] and match s[i] with a space s[1…i-1] i t[1…j-1] j s[1… i ] - t[1…j-1] j s[1…i-1] i t[1… j ] -

Local Alignment Smith-Waterman 1981 *Penalties should be negative*

s: xxxcde t: abcxdex xxxcde- - - -- - -abcxdex cxde c-de match 2 mismatch -1 Local alignment

Sequence Alignment Complexity: Time O(n*m) Space O(n*m) (exist algorithm with O(min(n,m)))

Global Alignment Global Alignment INPUT:Two sequences S and T of roughly the same length. QUESTION:What is the maximum similarity between them? Find one of the best alignments.

Global Alignment Needleman-Wunsch 1970 Alignment Score: V(n,m)

Ends free alignment Ends free alignmentINPUT:Two equences S and T (possibly of different length).QUESTION:Find one of the best alignments betweensubsequences ofS and Twhen at least one of these subsequences is a prefix of theoriginal sequence and one (not necessarily theother) is a suffix. or

Ends free alignment m n

Gap Alignment Definition:A gap is the maximal contiguous run of spaces in a single sequence within a given alignment.The length of a gap is the number of indel operations on it. A gap penalty function is a function that measure the cost of a gap as a (nonlinear)function of its length. Gap penaltyINPUT:Two sequences S and T (possibly of differentlength).QUESTION:Find one of the best alignments between the two sequencesusing the gap penalty function. Affine Gap: Wtotal = Wg + qWs Wg – weight to open the gap Ws – weight to extend the gap

BioPerl “Bioperl is a collection of perl modules that facilitate the development of perl scripts for bio-informatics applications.” Bioperl is open source software that is still under active development. www.bioperl.org Tutorial Documentation

BioPerl • Accessing sequence data from local and remote databases • Transforming formats of database/ file records • Manipulating individual sequences • Searching for "similar" sequences • Creating and manipulating sequence alignments • Searching for genes and other structures on genomic DNA • Developing machine readable sequence annotations

BioPerl library at TAU BioPerl is NOT yet installed globally on CS network. In each script you should add the following two lines: use lib "/a/netapp/vol/vol0/home/silly6/mol/lib/BioPerl/lib"; use lib "/a/netapp/vol/vol0/home/silly6/mol/lib/BioPerl/lib/i686-linux";

Sequence Object Seq – stores sequence, identification labels (id, accession number, molecule type = DNA, RNA, Protein, …), multiple annotations and associated “sequence features”.

Sequence Object • $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact', • '-desc'=>'Sample Bio::Seq object', • '-display_id' => 'something', • '-accession_number' => 'accnum', • '-moltype' => 'dna' ); Usually Seq isnot created this way.

Sequence Object $seqobj->display_id(); # the human read-able id of the sequence $seqobj->seq(); # string of sequence $seqobj->subseq(5,10); # part of the sequence as a string $seqobj->accession_number(); # when there, the accession number $seqobj->moltype(); # one of 'dna','rna','protein' $seqobj->primary_id(); # a unique id for this sequenceirregardless # of its display_id or accession number

Accessing Data Base • Databases: genbank, genpept, swissprotandgdb. • $gb = new Bio::DB::GenBank(); • $seq1 = $gb->get_Seq_by_id('MUSIGHBA1'); • $seq2 = $gb->get_Seq_by_acc('AF303112')); • $seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112 2981014)]));

Seq module use Bio::DB::GenBank; $gb = new Bio::DB::GenBank(); $seq1 = $gb->get_Seq_by_acc('AF303112'); $seq2=$seq1->trunc(1,90); print $seq2->seq(), "\n"; $seq3=$seq2->translate; print $seq3->seq(), “\n“; ATGGAGCCCAAGCAAGGATACCTTCTTGTAAAATTGATAGAAGCTCGCAAGCTAGCATCTAAGGATGTGGGCGGAGGGTCAGATCCATAC MEPKQGYLLVKLIEARKLASKDVGGGSDPY

SeqIO object $seq = $gb->get_Seq_by_acc('AF303112')); $out = Bio::SeqIO->new('-file' => ">f.fasta", '-format' => 'Fasta'); $out->write_seq($seq); SeqIO can read/write/transform data in the following formats : Fasta, EMBL. GenBank, Swissprot, PIR, GCG, SCF, ACE, BSML

Transforming Sequence Files $in = Bio::SeqIO->new('-file' => “f.fasta", '-format' => 'Fasta'); $out = Bio::SeqIO->new('-file' => ">f.embl", '-format' => ‘EMBL'); $out->write_seq($in->next_seq()); #for several sequences while ( my $seq = $in->next_seq() ) { $out->write_seq($seq); } #better$in = Bio::SeqIO->new('-file' => “f.fasta", '-format' => 'Fasta'); $out = Bio::SeqIO->new('-file' => ">f.embl", '-format' => ‘EMBL'); # Fasta<->EMBL format converter:print $out $_ while <$in>;

BioPerl: Pairwise Sequence Alignment Smith-Waterman Algorithm use Bio::Tools::pSW; $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla', '-gap' => 12, '-ext' => 2, ); $factory->align_and_show($seq1, $seq2, STDOUT); Currently works only on protein sequences.

Alignment Objects SimpleAlignhandles multiple alignments of sequences. #pSW module $aln = $factory->pairwise_alignment($seq1, $seq2); foreach $seq ( $aln->eachSeq() ) { print $seq->seq(), "\n"; } $alnout = Bio::AlignIO->new(-format => 'fasta', -fh => \*STDOUT); $alnout->write_aln($aln);

Homework • Write a cgi script (using Perl) that performs pairwise Local/Global Alignment for DNA sequences. All I/O is via HTML only. • Input: • Choice for Local/Global alignment. • Two sequences – text boxes or two accession numbers. • Values for match, mismatch, ins/dels. • Number of iterations for computing random scores. • Output: • Alignment score. • z-score value (z= (score-average)/standard deviation.) Remarks: 1) you are allowed to use only linear space. 2)To compute z-score perform random shuffling: srand(time| $$); #init, $$-proc.id int(rand($i)); #returns rand. number between [0,$i]. 3)Shuffling is done in windows (non-overlapping) of 10 bases length. Number of shuffling for each window is random [0,10].

Sequence Alignment