Indexed Alignment Tricks of the Trade Ross David Bayer 18th October, 2005 Note: many diagrams taken from Serafim’s CS 262 class
Roadmap • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds
Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds
Motivation • We have a newly discovered gene: • Does it occur in other species? • How fast does it evolve? • We want to “find” this gene in other species • But there will be mutations
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Global Alignment: Needleman-Wunsch (Dynamic Programming)
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Running Time: O(MN)
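The dynamic program can be sketched in a few lines of Python. This is only an illustration: the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions, not taken from the slides.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score of x and y.

    Visits every cell of the (M+1) x (N+1) table once, hence O(MN) time.
    Only two rows are kept in memory because each row depends on the previous one.
    Scoring values are illustrative assumptions.
    """
    n = len(y)
    # Row 0: aligning the empty prefix of x against y[:j] costs j gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * n  # column 0: x[:i] against the empty prefix
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]


print(needleman_wunsch_score("AGTGCCCTGG", "AGTGACCTGG"))
```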
Local Alignment: Smith-Waterman
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Modifications:
• Store 0 instead of negative values
• Search entire table for maximum
Running Time: O(MN)
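The two modifications translate directly into code. The sketch below is illustrative only; the scoring values (match +1, mismatch -1, gap -1) are assumptions.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of x and y in O(MN) time.

    Differs from the global recurrence in two ways:
    cell values are clamped at 0, and the answer is the table maximum.
    Scoring values are illustrative assumptions.
    """
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # Modification 1: store 0 instead of a negative value.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            # Modification 2: track the maximum over the entire table.
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```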
Alignment Applications • We have our newly discovered gene: • Does it occur in other species? • How fast does it evolve?
Complete genomes today About 300 complete genomes have been sequenced
GenBank Growth • Exponential growth in total sequence data • Recently exceeded 100 Gbp (10^11 base pairs)
Alignment Applications • We have our newly discovered gene: • Does it occur in other species? • How fast does it evolve? • Assume we try Smith-Waterman: the entire genomic database (10^11 bp) × our new gene (10^4 bp) = 10^15 cells
Indexed Alignment (BLAST: Basic Local Alignment Search Tool)
Main idea:
• Construct a dictionary of all words in the query
• Initiate a local alignment for each word match between query and DB
Running Time: O(MN) in the worst case; in practice, however, orders of magnitude faster than Smith-Waterman
BLAST Step 1 (Basic): Construct dictionary of query words
• Query indexed by all words of size k
• Word size: k = 3 in our examples, k ≈ 11 otherwise
Query: AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG…
INDEX: AGG GGC GCT CTA TAT ATC TCA CAC GGC ACC TGA CGC GAC ACC CCT CTC TCC CCA CAG CCT GCG CTG AGG GCT CTA ATG TGC GCC CCC CCT CTA TAG AGC GCC CCG TAT ATC TCA CAC ACG CGA GAC ACC CCG GAT CGA
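A word index of this kind is just a map from each k-letter word to the positions where it starts in the query. A minimal Python sketch (the function name is hypothetical; the query is the prefix shown on the slides):

```python
from collections import defaultdict


def build_word_index(query, k):
    """Map every word of size k to the list of its start positions in query."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return dict(index)


index = build_word_index("AGGCTATCACCTGACCTCCAGGCCG", k=3)
print(index["AGG"])  # start positions of the word AGG in the query
```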
BLAST Step 1 (Advanced): Relative Generation
• For each query word, generate all relatives
• A relative is a word with alignment score ≥ T
• All relatives are updated to point to new location
Query: AGGCTATCACCTGACCTCCAGGCCG…
Query word: GGC
Threshold: T = 28
Candidate scores: GGC 30, AGC 28, GAC 28, AAC 26, GGT 25, GGA 24, ... (only candidates scoring ≥ 28 are relatives and enter the INDEX)
BLAST Step 2: Searching
• Search through database linearly, one word at a time
• Initiate alignment with all occurrences of that word in query
Genomic database: AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC…
Database words looked up in the query INDEX: AGC, GCT
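The search step can be sketched as a single linear pass over the database that looks each word up in the query index. The function name and the toy strings below are hypothetical; only exact word matches are considered.

```python
def find_word_hits(query, database, k):
    """Return (query_pos, db_pos) for every exact k-word shared by both strings."""
    # Index the query: word -> list of start positions.
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)

    # Scan the database linearly, one word at a time.
    hits = []
    for j in range(len(database) - k + 1):
        for i in index.get(database[j:j + k], []):
            hits.append((i, j))  # each hit would initiate an alignment
    return hits


print(find_word_hits("AGCT", "TTAGCTT", k=3))
```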
BLAST Alignment Extension
Example: The matching word GGT initiates an alignment
Extension to the left and right with no gaps until the alignment score falls more than a threshold S below the best score so far
Output:
GTAAGGTCC
GTTAGGTCC
[Figure: dot matrix with axes A C G A A G T A A G G T C C A G T and C C C T T C C T G G A T T G C G A]
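A hedged sketch of ungapped extension with a drop-off threshold follows. The scoring (match +1, mismatch -1), the parameter name `drop` (standing in for S), and the database string are assumptions chosen so that the result reproduces the slide's output pair.

```python
def extend_ungapped(q, d, qi, di, k, drop, match=1, mismatch=-1):
    """Extend a k-word hit at q[qi:], d[di:] in both directions without gaps.

    Each direction stops once the running score falls more than `drop`
    below the best score seen, and is trimmed back to that best point.
    Returns (q_start, d_start, length, score). Scoring is an assumption.
    """
    def walk(step, qpos, dpos):
        running = best = best_len = steps = 0
        while 0 <= qpos < len(q) and 0 <= dpos < len(d):
            running += match if q[qpos] == d[dpos] else mismatch
            steps += 1
            if running > best:
                best, best_len = running, steps
            elif best - running > drop:
                break
            qpos += step
            dpos += step
        return best, best_len

    right_score, right_len = walk(+1, qi + k, di + k)
    left_score, left_len = walk(-1, qi - 1, di - 1)
    score = k * match + left_score + right_score
    return qi - left_len, di - left_len, left_len + k + right_len, score


q = "ACGAAGTAAGGTCCAGT"   # sequence from the slide's figure
d = "CCGTTAGGTCCTT"       # hypothetical database fragment
qs, ds, n, score = extend_ungapped(q, d, q.find("GGT"), d.find("GGT"), k=3, drop=2)
print(q[qs:qs + n])
print(d[ds:ds + n])
```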
BLAST Algorithm Variations
BLAT: BLAST-Like Alignment Tool
• Builds index (dictionary) for database, scans linearly through query
• Alignment extensions allow for gaps as well
BLAT Gapped Extensions
Extensions with gaps in a band around anchor
Extension to the left and right, now allowing gaps, until the alignment score falls more than a threshold S below the best score so far
Output:
GTAAGGTCC-AG
GTTAGGTCCTAG
[Figure: dot matrix with axes A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
Perfect Match Results • Perfect Match: no relatives generated
Interpreting Results Word size k
Interpreting Results Conservation rate Conservation rate: 81% Mutation rate: 19%
Interpreting Results: Sensitivity • Probability of a particular homologous area being identified • Larger k decreases probability (exact match less likely) • Straightforward mathematics
Sensitivity Calculation • Suppose k = 7, and the query has a homologous area in the database (genome) • Conservation rate: 81% • Mutation rate: 19% • Probability whole word (7 bases) is conserved: 0.81^7 ≈ 23%
Sensitivity Calculation • Suppose k = 7, and the homologous area holds 10 words • Probability a particular word is conserved: 23% • Probability at least one word is conserved: 1 − 0.77^10 ≈ 93%
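The arithmetic on these two slides can be checked with a short Python sketch. It follows the slides in treating the 10 words as independent; the function name is hypothetical.

```python
def single_word_sensitivity(conservation, k, words):
    """Probability that at least one of `words` independent k-words is fully conserved."""
    p_word = conservation ** k          # every base of the word must be conserved
    return 1 - (1 - p_word) ** words    # complement of "no word is conserved"


print(f"{0.81 ** 7:.0%}")                               # one word of size 7
print(f"{single_word_sensitivity(0.81, 7, 10):.0%}")    # at least one of 10 words
```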
Interpreting Results Specificity • Expected number of alignments initiated by chance • Based on 500 bp query and 3 Gbp database • This is essentially an indication of SPEED
Interpreting Results SPEED • Expected number of alignments initiated by chance • Based on 500 bp query and 3 Gbp database • This is essentially an indication of SPEED
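A rough way to see why this number tracks speed: under the simplifying assumption of uniformly random, independent bases (an assumption made here, not stated on the slides), a given query word matches a given database position with probability 4^-k. The sketch below uses that model and does not reproduce the slides' table.

```python
def expected_chance_hits(query_len, db_len, k, alphabet=4):
    """Expected number of exact k-word matches between two random sequences.

    Assumes uniform, independent bases; real genomes deviate from this.
    """
    query_words = query_len - k + 1
    db_words = db_len - k + 1
    return query_words * db_words / alphabet ** k


# 500 bp query against a 3 Gbp database, for a few word sizes.
for k in (7, 11, 15):
    print(k, round(expected_chance_hits(500, 3_000_000_000, k)))
```

Each extra base of word size divides the expected number of chance initiations by about four, which is the speed side of the tradeoff.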
The Classic BLAST Tradeoff As we increase k … • Sensitivity gets worse • Speed gets better
Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds
Wildcards
• Exact matches unlikely for larger values of k
• Include variants with one “wildcard” placed in each position
Relative Generation: • Any match: 1 • Any mismatch: 0 • Threshold: T = k − 1
Example: GTA gives *TA, G*A, GT*
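With match = 1, mismatch = 0 and T = k − 1, a relative is any word that differs from the query word in at most one position, which is exactly what the wildcard variants capture. A small Python sketch (function names are hypothetical):

```python
def wildcard_variants(word):
    """All variants of word with a single '*' wildcard placed in each position."""
    return [word[:i] + "*" + word[i + 1:] for i in range(len(word))]


def is_relative(word, candidate):
    """True if candidate scores >= k - 1 against word (match 1, mismatch 0)."""
    if len(word) != len(candidate):
        return False
    score = sum(a == b for a, b in zip(word, candidate))
    return score >= len(word) - 1


print(wildcard_variants("GTA"))
print(is_relative("GTA", "GCA"), is_relative("GTA", "CCA"))
```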
Wildcard Results Better?
Wildcard Results Perfect match: For the same sensitivity, wildcard variant is about 440 times faster Wildcards:
Wildcard Results Perfect match: For the same sensitivity, wildcard variant is about 40 times faster Wildcards:
Wildcard Results • Better • Sensitivity/speed tradeoff consistently improved
Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds
Multiple Words
• N perfect matches
• Same separation in query and database
• All separations less than distance W
Database: TGCTAGCTACGATCTGCAGTGCGTAATCT…
Query: TCATTACATCGTGACTTGCAGTCGTCCAG…
[Figure: words TAC and TGC separated by 7 bp in one sequence and 12 bp in the other: NO INITIATION; separated by 12 bp in both: INITIATE ALIGNMENT]
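For N = 2, the rule can be sketched as follows: two hits have the same separation in query and database exactly when they lie on the same diagonal (database position minus query position). The function name is hypothetical, and requiring the two words not to overlap (separation of at least k) is an added assumption; only neighbouring hits on a diagonal are compared.

```python
from collections import defaultdict


def two_hit_seeds(hits, k, w):
    """Pairs of word hits that would initiate an alignment under the two-word rule.

    hits: iterable of (query_pos, db_pos) exact k-word matches.
    Returns (first_query_pos, second_query_pos, diagonal) triples.
    """
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)   # same diagonal <=> same separation in both

    seeds = []
    for diag, starts in by_diagonal.items():
        starts.sort()
        for a, b in zip(starts, starts[1:]):
            if k <= b - a < w:         # non-overlapping (assumption) and closer than W
                seeds.append((a, b, diag))
    return seeds


# 12 bp apart in both sequences: initiate. 7 bp vs 12 bp: no initiation.
print(two_hit_seeds([(0, 10), (12, 22)], k=3, w=20))
print(two_hit_seeds([(0, 10), (7, 22)], k=3, w=20))
```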
Intuition Behind Multiple Words • If we use a single word of size k = 16, and the query has a homologous area in the database (genome) • Conservation rate: 81% • Mutation rate: 19% • Probability whole word (16 bases) is conserved: 0.81^16 ≈ 3%
Intuition Behind Multiple Words • If we use a single word of size k = 16 • Words: 10 • Probability a particular word is conserved: 3% • Probability at least one word is conserved: 1 − 0.97^10 ≈ 29% (evaluated with the unrounded word probability 0.81^16 ≈ 0.034)
Intuition Behind Multiple Words
• If we use a single word of size k = 16: probability of a match = 29%
• If we use N = 2 words of size k = 8:
Words: 20
Probability a particular word is conserved: 0.81^8 ≈ 19%
Probability at least two words are conserved: 1 − 0.81^20 − 20 × 0.19 × 0.81^19 ≈ 91% (here 0.81 = 1 − 0.19 is the probability that a word is not conserved; evaluated with the unrounded word probability)
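Both figures can be reproduced with a short binomial calculation. The sketch treats words as independent, as the slides do, and ignores the separation constraints; function names are hypothetical.

```python
def at_least_one_word(conservation, k, words):
    """Probability that at least one of `words` independent k-words is conserved."""
    p = conservation ** k
    return 1 - (1 - p) ** words


def at_least_two_words(conservation, k, words):
    """Probability that at least two of `words` independent k-words are conserved."""
    p = conservation ** k
    none = (1 - p) ** words
    exactly_one = words * p * (1 - p) ** (words - 1)
    return 1 - none - exactly_one


print(f"{at_least_one_word(0.81, 16, 10):.0%}")   # single word, k = 16
print(f"{at_least_two_words(0.81, 8, 20):.0%}")   # two words, k = 8
```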
Intuition Behind Multiple Words • If we use a single word of size k = 16: probability of a match = 29% • If we use N = 2 words of size k = 8: probability of a match = 91% [Figure: the homologous area in the database (genome) tiled with 16-base words at 3% each and with 8-base words at 19% each]
Multiple Words Results Single perfect match: For the same sensitivity, multiple words variant about 1,200 times faster Multiple perfect matches:
Multiple Words Results Single perfect match: For the same sensitivity, multiple words variant about 75,000 times faster Multiple perfect matches:
Multiple Words Results • Much better than single matches • Bigger improvement even than wildcards
Multiple Words Results • Why not combine them: Multiple Wildcard Matches?
Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds
Seed Patterns
• Contiguous word (k = 10)
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××××××××××
Word: GTCAGTACGT
• Seed pattern
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
Word: GTATTAGGCG
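Applying a seed pattern means reading only the marked positions of a window. In the sketch below the pattern is written as a string of "1" (sampled, × on the slide) and "0" (ignored, ∙ on the slide); this encoding and the function name are choices made for illustration.

```python
def apply_seed(pattern, window):
    """Return the letters of window at the positions where pattern has '1'."""
    return "".join(base for mark, base in zip(pattern, window) if mark == "1")


sequence = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
contiguous = "1" * 10
spaced = "1101010001010001010001000001"   # the slide's pattern, 10 sampled positions

print(apply_seed(contiguous, sequence))
print(apply_seed(spaced, sequence))
```

Indexing then proceeds as with contiguous words, except that each window of the pattern's full span contributes the sampled letters as its key.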
Intuition Behind Seed Patterns
Patterns increase the likelihood of at least one match within a long conserved region
[Figure: shifted seeds at Consecutive Positions and at Non-Consecutive Positions, labelled 6 common, 5 common, 7 common, 3 common]
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47
Advantage of Patterns 11 positions 11 positions 10 positions