From Pairwise Alignment to Database Similarity Search
Global vs Local Alignment Global Alignment ATTGCAGTG-TCGAGCGTCAGGCT ATTGCGTCGATCGCAC-GCACGCT Local Alignment CATATTGCAGTGGTCCCGCGTCAGGCT TAAATTGCGT-GGTCGCACTGCACGCT
Global vs. Local alignment Alignment of two Genomic sequences >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Mouse DNA CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA
Global vs. Local alignment Alignment of two Genomic sequences
Global Alignment
Human: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse: CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
       ****** *****         *     ***      *  ****** ***
Local Alignment
Human: CATGCGACTGAC
Mouse: CATGCGTCTGAC

H: ATCGATCATA
Mouse: ATCGAT-ATA
Global vs. Local alignment Alignment of Genomic DNA and mRNA >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Human mRNA CATGCGACTGACATCGATCATA
Global vs. Local alignment Alignment of Genomic DNA and mRNA
Global Alignment
DNA:  CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA: CATGCGACTGAC---------------------------ATCGATCATA
      ************                           **********
Local Alignment
DNA:  CATGCGACTGAC
mRNA: CATGCGACTGAC

DNA:  ATCGATCATA
mRNA: ATCGATCATA
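The difference between the two alignment modes can be made concrete with a small dynamic-programming sketch. The function below is an illustration only, not the method used on the slides: the name alignment_score and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for simplicity. In global mode every residue of both sequences must be aligned; in local mode a cell may reset to 0, so only the best-matching region is scored.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Score of the best global (default) or local alignment of a and b.

    Scoring values are illustrative assumptions, with a linear gap penalty.
    """
    cols = len(b) + 1
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may start anywhere
                best = max(best, cell)   # and end anywhere
            cur[j] = cell
        prev = cur
    return best if local else prev[-1]


dna = "CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA".upper()
mrna = "CATGCGACTGACATCGATCATA"
print("global:", alignment_score(dna, mrna))
print("local: ", alignment_score(dna, mrna, local=True))
```

With these scores the global alignment of the DNA and mRNA sequences is dominated by the 27 gap positions of the intron (22 matches minus 27 gaps), while the local score reflects only a well-matching region such as the first exon.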
Why do we care to align sequences? • Sequences that are similar probably have the same function
new sequence ? Similar function ≈ Discover Function of a new sequence Sequence Database
Searching Databases for similar sequences Naïve solution: Use exact algorithm to compare each sequence in the database to query. Is this reasonable ?? How much time will it take to calculate?
Complexity for genomes • The human genome contains 3 × 10^9 base pairs • Searching an mRNA against the human genome requires ~10^12 dynamic-programming cells • Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
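The order of magnitude can be checked with a back-of-the-envelope calculation. This is a rough sketch: the mRNA length (1,000 nt), the throughput (10^9 cells per second) and the number of queries are hypothetical values picked only to illustrate the arithmetic; the genome length is the figure from the slide.

```python
GENOME_LENGTH = 3_000_000_000  # human genome, base pairs (from the slide)


def dp_cells(query_length, target_length=GENOME_LENGTH):
    """Cells in the dynamic-programming matrix of an exact alignment."""
    return query_length * target_length


def hours_needed(cells, cells_per_second):
    """Wall-clock hours to fill that many cells at a given throughput."""
    return cells / cells_per_second / 3600


# Hypothetical inputs: a 1,000-nt mRNA and 1e9 cells filled per second.
cells = dp_cells(1_000)
print(f"cells per query: {cells:.1e}")
print(f"hours per query: {hours_needed(cells, 1e9):.2f}")
print(f"hours for 1e6 queries: {hours_needed(cells * 1_000_000, 1e9):.1e}")
```

Under these assumed numbers a single query is tolerable, but repeating it millions of times is not, which is the motivation for a heuristic.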
Searching databases Solution: Use a heuristic (approximate) algorithm
Heuristic strategy • Remove regions that are not useful for meaningful alignments • Preprocess database into new data structure to enable fast access
What sequences to remove? • AAAAAAAAAAA • ATATATATATATA • Transposable elements 53% of the genome is repetitive DNA Low complexity sequences (JUNK???)
Low Complexity Sequences What's wrong with them? * Not informative * Produce artificial high scoring alignments. So what do we do? We apply Low Complexity masking to the database and the query sequence Mask TCGATCGTATATATACGGGGGGTA TCGATCGNNNNNNNNCNNNNNNTA
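The masking step can be sketched with a simple regular expression. This is a toy illustration, not the filter BLAST actually applies (BLAST uses dedicated low-complexity filters such as DUST for nucleotides and SEG for proteins). The function name mask_low_complexity and the thresholds (a single base repeated at least 5 times, or a 2-base unit repeated at least 3 times) are assumptions chosen so that the slide's example is reproduced.

```python
import re

# Assumed toy rule: homopolymer runs of >= 5 bases, or a dinucleotide
# unit repeated >= 3 times in a row, count as low complexity.
LOW_COMPLEXITY = re.compile(r"([ACGT])\1{4,}|([ACGT]{2})\2{2,}")


def mask_low_complexity(seq):
    """Replace every low-complexity run with N's of the same length."""
    return LOW_COMPLEXITY.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
# prints TCGATCGNNNNNNNNCNNNNNNTA
```

Replacing the run with N's rather than deleting it keeps all sequence coordinates unchanged, so hits can still be reported at their original positions.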
Heuristic strategy • Remove low-complexity regions that are not useful for meaningful alignments • Preprocess database into new data structure to enable fast access
BLAST Basic Local Alignment Search Tool • General idea - a good alignment contains subsequences of high identity: • First, identify very short almost exact matches. • Next, the best short hits from the 1st step are extended to longer regions of similarity. • Finally, the best hits are optimized using the Smith-Waterman algorithm. Altschul et al 1990
BLAST(Protein Sequence Example) • Search the database for matching words • Example: • Protein sequence …FSGTWYA… • Words of length 3: FSG, SGT, GTW, TWY, WYA • All words in database (bag of words): • FSG SGT GTW TWY WYA • YSG TGT ATW SWY WFA • FTG SVT GSW TWF WYS….
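The word-matching step and the preprocessed database structure can be sketched as a lookup table from words to positions. This is a minimal illustration under stated assumptions: the helper names (words, build_index, find_seeds) are hypothetical, and only exact word matches are found, whereas protein BLAST also accepts similar "neighborhood" words scored with a substitution matrix.

```python
from collections import defaultdict


def words(seq, k=3):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database, k=3):
    """Preprocess the database once: word -> list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, k)):
            index[word].append((name, pos))
    return index


def find_seeds(query, index, k=3):
    """Exact word hits as (query position, sequence name, database position)."""
    return [(q_pos, name, s_pos)
            for q_pos, word in enumerate(words(query, k))
            for name, s_pos in index.get(word, [])]


print(words("FSGTWYA"))  # ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']

database = {"D": "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"}
index = build_index(database)
print(find_seeds("FIRSTLINIHFSGTWYAAMESIRPATRICKREAD", index))
```

Building the index costs one pass over the database, after which each query word is looked up directly instead of being compared against every database position.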
BLAST (Protein Sequence Example) 1. Search the database for matching word pairs (L = 3) 2. Extend word pairs as much as possible, i.e., as long as the total score increases • High-scoring Segment Pairs (HSPs) Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN Q = query sequence, D = sequence in database
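The extension step can be sketched as an ungapped walk outward from a word hit, keeping the best total score seen. This is an illustration under assumptions, not BLAST's exact procedure: the function name extend_seed is hypothetical, scoring is simple identity (+1 match, -1 mismatch) where a real protein search would use a substitution matrix such as BLOSUM62, and extension stops once the running score falls more than x_drop below the best score.

```python
def extend_seed(query, subject, q_pos, s_pos, k=3,
                match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps in both directions.

    Returns (score, q_start, q_end) of the resulting segment pair,
    with q_end exclusive. All scoring values are illustrative assumptions.
    """
    def pair(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(pair(query[q_pos + i], subject[s_pos + i]) for i in range(k))

    # Extend to the right of the seed.
    running, best_right, i = best, 0, 0
    while q_pos + k + i < len(query) and s_pos + k + i < len(subject):
        running += pair(query[q_pos + k + i], subject[s_pos + k + i])
        if running > best:
            best, best_right = running, i + 1
        elif best - running > x_drop:
            break
        i += 1

    # Extend to the left, starting from the best score reached so far.
    running, best_left, i = best, 0, 1
    while q_pos - i >= 0 and s_pos - i >= 0:
        running += pair(query[q_pos - i], subject[s_pos - i])
        if running > best:
            best, best_left = running, i
        elif best - running > x_drop:
            break
        i += 1

    return best, q_pos - best_left, q_pos + k + best_right


# Toy sequences (hypothetical): the seed "GTW" sits at position 2 in both.
score, start, end = extend_seed("AAGTWCC", "TTGTWCC", 2, 2)
print(score, "AAGTWCC"[start:end])  # 5 GTWCC
```

The drop-off threshold lets the extension pass over a few mismatches to reach further matches, while still terminating quickly in unrelated sequence.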
BLAST 3. Try to connect HSPs by aligning the sequences in between them: THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN The Gapped Blast algorithm allows several segments that are separated by short gaps to be connected together to one alignment
Running BLAST to predict the function of a new protein >Arrestin protein (C. elegans) MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR
How to interpret a BLAST score: • The score is a measure of the similarity of the query to the sequence shown. • How do we know if the score is significant? • -Statistical significance • -Biological significance
How to interpret a BLAST search: For each BLAST score we can calculate an expectation value (E-value). The E-value is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E-value is related to a probability value (p-value).
BLAST E-value: E = K·m·n·e^(−λs) • Increases linearly with length of query sequence • Decreases exponentially with score of alignment • Increases linearly with length of database • m = length of query; n = length of database; s = score • K, λ: statistical parameters dependent upon scoring system and background residue frequencies
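The three dependencies listed above follow from the standard Karlin-Altschul form E = K·m·n·e^(−λs), which can be evaluated directly. The sketch below is illustrative: the function names are hypothetical, and the default K and λ are placeholder values chosen only for the example, since the real parameters depend on the scoring system and residue frequencies. The conversion p = 1 − e^(−E) is the usual relation between an E-value and a p-value.

```python
import math


def e_value(score, m, n, K=0.041, lam=0.267):
    """Expected number of chance alignments scoring >= score.

    m: query length, n: database length.
    K and lam are placeholder values, not parameters for any real search.
    """
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given the E-value."""
    return -math.expm1(-e)  # 1 - exp(-e), accurate for small e


for s in (40, 60, 80):
    e = e_value(s, m=350, n=1_000_000)
    print(f"score={s}  E={e:.3g}  p={p_value(e):.3g}")
```

For small E the p-value is nearly equal to E, which is why the two are often used interchangeably for strong hits.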
What is a Good E-value (rule of thumb) • E-values of less than 0.00001 show that sequences are almost always homologues. • Greater E-values can represent homologues as well. • Generally, the decision whether an E-value is biologically significant depends on the size of the database that is searched • Sometimes a real match has an E-value > 1 • Sometimes a similar E-value occurs for a short exact match and a long, less exact match
How to interpret a BLAST search: • The score is a measure of the similarity of the query to the sequence shown. • How do we know if the score is significant? • -Statistical significance • -Biological significance
Treating Gaps in BLAST >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Human mRNA CATGCGACTGACATCGATCATA Sometimes corrections to the model are needed to infer biological significance
Gap Scores • Standard solution: affine gap model w_x = g + r(x-1) • w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length • Once-off cost for opening a gap • Lower cost for extending the gap • Changes required to algorithm
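The effect of the affine model is easiest to see on a long gap such as the 27-base intron in the DNA/mRNA example. The sketch below simply evaluates w_x = g + r(x-1); the function name and the penalty values (g = 11, r = 1) are assumptions picked for illustration.

```python
def affine_gap_penalty(x, g, r):
    """Total penalty w_x = g + r * (x - 1) for a gap of length x."""
    if x <= 0:
        return 0  # no gap, no penalty
    return g + r * (x - 1)


# Illustrative penalties: open = 11, extend = 1.
g, r = 11, 1
for length in (1, 2, 27):
    affine = affine_gap_penalty(length, g, r)
    linear = g * length  # every gap position charged the opening cost
    print(f"gap length {length:2d}: affine={affine:3d}  linear={linear:3d}")
```

Under the affine model one long gap is much cheaper than the same number of separate gap positions, so a single intron-sized gap is preferred over many scattered short gaps.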
Significance of Gapped Alignments • Gapped alignments use the same statistics • λ and K cannot be easily estimated • Empirical estimations and gap scores are determined by looking at random alignments
BLAST • BLAST is a family of programs • Query: DNA or protein • Database: DNA or protein