BLAT – The BLAST-Like Alignment Tool. Kent, W.J. Genome Res. 2002 12: 656-664 Presenter: 巨彥霖 田知本. BLAT overview. Use an index to find regions in genome homologous to query. Do a detailed alignment between query and homologous regions.
BLAT – The BLAST-Like Alignment Tool Kent, W.J. Genome Res. 2002 12: 656-664 Presenter: 巨彥霖 田知本
BLAT overview • Use an index to find regions in the genome homologous to the query. • Do a detailed alignment between the query and the homologous regions. • Use dynamic programming to stitch the detailed alignments of the regions into a detailed alignment of the whole query.
Index • Database: split into non-overlapping K-mers • Query: split into overlapping K-mers
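Example • Database: cacaattatcacgaccgc • Non-overlapping 3-mers: cac aat tat cac gac cgc • Index (3-mer → start positions): aat → 3; cac → 0, 9; cgc → 15; gac → 12; tat → 6 • Query: aattctcac • Overlapping 3-mers (start position): aat (0), att (1), ttc (2), tct (3), ctc (4), tca (5), cac (6)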
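The example above can be sketched in a few lines of Python. This is a minimal illustration, not BLAT's actual implementation; the function names `build_index` and `find_hits` are hypothetical.

```python
from collections import defaultdict

def build_index(database, k):
    """Index the non-overlapping k-mers of the database by start position."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):  # step by k: no overlap
        index[database[pos:pos + k]].append(pos)
    return dict(index)

def find_hits(query, index, k):
    """Look up every overlapping query k-mer; return (query_pos, db_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):  # step by 1: overlapping
        for dbpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, dbpos))
    return hits

index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
print(index["cac"])  # [0, 9]
print(hits)          # [(0, 3), (6, 0), (6, 9)]
```

Only the query 3-mers aat (position 0) and cac (position 6) occur in the index, so the query produces three hits.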
Search Criteria • Single Perfect Matches • Single Near Perfect Matches • Multiple Perfect Matches
Notation • K : K-mer size • M : The match ratio between homologous areas • H : Homologous region size • G : Database (genome) size • Q : Query sequence size • A : The alphabet size
Single Perfect Matches (1) [Figure labels: K-mer, homologous region, perfect match]
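Single Perfect Matches (2) • A homologous region of size H contains $T = \lfloor H/K \rfloor$ non-overlapping K-mers. • The probability that one specific K-mer matches perfectly is $p_1 = M^K$. • The probability of at least one perfect K-mer match (sensitivity): $P = 1 - (1 - M^K)^T$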
Single Perfect Matches (3) • The number of K-mers in the database = G / K • The number of K-mers in the query = Q – K + 1 • The number of K-mers expected to match by chance (specificity): $F = (Q - K + 1) \cdot (G/K) \cdot (1/A)^K$
Case (perfect match) • Comparing mouse and human coding sequences at the nucleotide level: H = 100, M = 86%. Requiring sensitivity ≥ 0.99 gives max K = 7, with 13078962 chance matches (query = 500, database = 3 billion).
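The figures in this case can be checked numerically. The sketch below assumes the standard single-perfect-match model: sensitivity 1 - (1 - M^K)^T with T = H // K, and chance matches equal to (number of query K-mers) × (G / K) × (1/A)^K. The function names are hypothetical. The slide's chance-match figure is reproduced when the number of query K-mers is taken as 500 rather than Q - K + 1 = 494; that reading is inferred from the arithmetic, not stated on the slide.

```python
def sensitivity_perfect(K, M, H):
    """Probability of at least one perfect K-mer match in a region of size H."""
    T = H // K  # non-overlapping K-mers in the homologous region
    return 1 - (1 - M ** K) ** T

def chance_matches_perfect(n_query_kmers, G, K, A=4):
    """Expected number of K-mer matches that occur by chance."""
    return n_query_kmers * (G / K) * (1 / A) ** K

def largest_k(M, H, target=0.99, k_limit=32):
    """Largest K (up to k_limit) whose sensitivity still reaches the target."""
    return max(K for K in range(1, k_limit + 1)
               if sensitivity_perfect(K, M, H) >= target)

K = largest_k(M=0.86, H=100)
print(K)                                            # 7
print(round(chance_matches_perfect(500, 3e9, K)))   # 13078962
```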
Single Near Perfect Matches (1) • Almost perfect: one letter may mismatch [Figure labels: K-mer, homologous region, near perfect match]
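Single Near Perfect Matches (2) • Probability that one specific K-mer matches with at most one mismatch: $p_1 = K \cdot M^{K-1} (1 - M) + M^K$ • Sensitivity: $P = 1 - (1 - p_1)^T$, with $T = \lfloor H/K \rfloor$ • Specificity (expected chance matches): $F = (Q - K + 1) \cdot (G/K) \cdot \left( K \cdot (1/A)^{K-1} (1 - 1/A) + (1/A)^K \right)$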
Case (near perfect match) • Comparing mouse and human coding sequences at the nucleotide level: H = 100, M = 86%. Requiring sensitivity ≥ 0.99 gives max K = 12, with 275671 chance matches (query = 500, database = 3 billion).
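The near-perfect case can be checked the same way. The sketch below assumes a K-mer counts as a near-perfect match when it has at most one mismatch, so the per-K-mer probability is K·M^(K-1)·(1 - M) + M^K. The function names are hypothetical, and, as an inference from the arithmetic, the slide's chance-match figure corresponds to counting 500 query K-mers.

```python
def sensitivity_near_perfect(K, M, H):
    """Probability of at least one K-mer match with at most one mismatch."""
    T = H // K
    p1 = K * M ** (K - 1) * (1 - M) + M ** K
    return 1 - (1 - p1) ** T

def chance_matches_near_perfect(n_query_kmers, G, K, A=4):
    """Expected number of near-perfect K-mer matches that occur by chance."""
    p_chance = K * (1 / A) ** (K - 1) * (1 - 1 / A) + (1 / A) ** K
    return n_query_kmers * (G / K) * p_chance

def largest_k_near_perfect(M, H, target=0.99, k_limit=32):
    """Largest K (up to k_limit) whose sensitivity still reaches the target."""
    return max(K for K in range(1, k_limit + 1)
               if sensitivity_near_perfect(K, M, H) >= target)

K_near = largest_k_near_perfect(M=0.86, H=100)
print(K_near)                                                 # 12
print(round(chance_matches_near_perfect(500, 3e9, K_near)))   # 275671
```

Allowing one mismatch raises the usable K from 7 to 12 at the same sensitivity, which cuts the expected chance matches by roughly a factor of 47.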
Single Near Perfect Nucleotide K-mer Matches as Search Criterion
Multiple Perfect Matches • A hit is triggered when: • there are N perfect K-mer matches • each no further than W letters from the others in the database coordinate • all with the same diagonal coordinate
Example [Figure: hits a, b, c, and d plotted by query coordinate against target coordinate, with window W] The hits a, b, c, and d are all K letters long. Hits b and d have the same diagonal coordinate and lie within W letters of each other. Therefore, they satisfy the two-perfect-K-mer search criterion.
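The multiple-hit criterion can be sketched as follows. Several details are assumptions made for illustration: the diagonal is computed as database position minus query position, "within W" is checked between consecutive hits on the same diagonal, and the coordinates of hits a to d are invented, since the figure's values are not available. The function name `multi_hit_regions` is hypothetical.

```python
from collections import defaultdict

def multi_hit_regions(hits, n, w):
    """Return (diagonal, hits) groups with at least n hits on one diagonal,
    where consecutive hits are at most w apart in the database coordinate."""
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append((q, d))  # assumed diagonal convention
    triggered = []
    for diag, group in sorted(by_diag.items()):
        group.sort(key=lambda h: h[1])  # order by database position
        run = [group[0]]
        for hit in group[1:]:
            if hit[1] - run[-1][1] <= w:
                run.append(hit)
            else:
                if len(run) >= n:
                    triggered.append((diag, run))
                run = [hit]
        if len(run) >= n:
            triggered.append((diag, run))
    return triggered

# Invented (query_pos, db_pos) coordinates for hits a, b, c, d.
a, b, c, d = (50, 10), (20, 30), (5, 60), (40, 50)
result = multi_hit_regions([a, b, c, d], n=2, w=30)
print(result)  # [(10, [(20, 30), (40, 50)])]
```

With these coordinates only b and d share a diagonal and sit within W of each other, so they are the only pair that triggers a hit.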
Multiple Perfect Nucleotide K-mer Matches as Search Criterion
Default • Nucleotide • two perfect 11-mers • Protein • a single perfect 5-mer for the standalone version • three perfect 4-mers for the client/server version
BLAST • Build the hash table for Sequence A. • Scan Sequence B for hits. • Extend hits.
BLAST Step 1: Build the hash table for Sequence A. (3-tuple example)
For protein sequences: Seq. A = ELVIS. Add xyz to the hash table if Score(xyz, ELV) ≥ T; add xyz to the hash table if Score(xyz, LVI) ≥ T; add xyz to the hash table if Score(xyz, VIS) ≥ T.
For DNA sequences: Seq. A = AGATCGAT (positions 1 to 8). The hash table runs from AAA to TTT; its non-empty entries are: AGA → 1; ATC → 3; CGA → 5; GAT → 2, 6; TCG → 4.
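The DNA table on this slide can be reproduced with a short sketch. It covers only the exact-word DNA case, not the protein neighborhood words scored against a threshold T, and the function name `blast_word_table` is hypothetical.

```python
def blast_word_table(seq, w=3):
    """Map every overlapping w-letter word of seq to its 1-based positions."""
    table = {}
    for i in range(len(seq) - w + 1):
        table.setdefault(seq[i:i + w], []).append(i + 1)  # 1-based, as on the slide
    return table

table = blast_word_table("AGATCGAT")
for word in sorted(table):
    print(word, table[word])
# AGA [1]
# ATC [3]
# CGA [5]
# GAT [2, 6]
# TCG [4]
```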
BLAST Step 2: Scan sequence B for hits.
BLAST Step 2: Scan sequence B for hits. Step 3: Extend hits. Terminate the extension when its score fades away, that is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
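The termination rule for hit extension can be sketched as a one-directional ungapped extension. The scoring values (+1 match, -1 mismatch) and the drop-off threshold `x_drop` are illustrative assumptions, not BLAST's defaults, and `extend_right` is a hypothetical name.

```python
def extend_right(a, b, i, j, match=1, mismatch=-1, x_drop=3):
    """Extend an ungapped alignment rightwards from a[i], b[j].

    Stops once the running score falls x_drop or more below the best
    score seen so far. Returns (best_score, length_at_best_score).
    """
    score = best = 0
    length = best_len = 0
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # the score has faded away; stop extending
    return best, best_len

print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0))  # (7, 7)
```

The first seven letters match, then three consecutive mismatches pull the score 3 below the best, so the extension stops and the 7-letter segment is reported.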
Algorithm • Search Stage • Use an index to find regions in the genome homologous to the query • Alignment Stage • Do a detailed alignment between the query and the homologous regions • Stitching and Filling In • Use dynamic programming to stitch the detailed alignments of the regions into a detailed alignment of the whole query
Search Stage • Build an index which contains positions of each K-mer in database. • Step through each overlapping K-mer in query and look it up in index • Get list of ‘hits’ - positions in query and in database that match for K bases • Cluster hits to find homologous regions
Search Stage • Clump hits
Search Stage • Eliminate small clumps • Merge the remaining nearby clumps into a homologous region
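The last two steps of the search stage can be sketched as interval filtering and merging. The clump representation (database start, database end, hit count), the thresholds `min_hits` and `max_gap`, and the example values are all assumptions for illustration; the slides do not give BLAT's actual parameters.

```python
def merge_clumps(clumps, min_hits=2, max_gap=300):
    """Drop small clumps, then merge clumps that lie close together in the
    database into homologous regions.

    Each clump is (db_start, db_end, n_hits).
    """
    kept = sorted(c for c in clumps if c[2] >= min_hits)  # eliminate small clumps
    regions = []
    for start, end, n in kept:
        if regions and start - regions[-1][1] <= max_gap:
            prev = regions[-1]
            regions[-1] = (prev[0], max(prev[1], end), prev[2] + n)
        else:
            regions.append((start, end, n))
    return regions

clumps = [(1000, 1100, 3), (1250, 1400, 2), (5000, 5011, 1), (9000, 9200, 4)]
regions = merge_clumps(clumps)
print(regions)  # [(1000, 1400, 5), (9000, 9200, 4)]
```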
Alignment Stage (nucleotide) • Start from scratch with the regions defined with K-mers • Index on smaller K-mers, but extend each K-mer until it becomes specific • Extend in both directions without mismatches or gaps and merge overlapping or contiguous alignments • Recurse on gaps with smaller K until the gaps are closed or no more hits are found
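The extension step, growing a seed in both directions without mismatches or gaps, can be sketched as below. It covers only that single step, not the merging or the recursion on gaps; `extend_exact` and the example sequences are hypothetical.

```python
def extend_exact(query, target, q, t, k):
    """Grow an exact seed query[q:q+k] == target[t:t+k] in both directions
    for as long as the letters keep matching.

    Returns half-open coordinates (q_start, q_end, t_start, t_end).
    """
    qs, ts = q, t
    while qs > 0 and ts > 0 and query[qs - 1] == target[ts - 1]:
        qs -= 1
        ts -= 1
    qe, te = q + k, t + k
    while qe < len(query) and te < len(target) and query[qe] == target[te]:
        qe += 1
        te += 1
    return qs, qe, ts, te

query, target = "TTACGTACGGA", "CCACGTACGTT"
block = extend_exact(query, target, 4, 4, 3)  # seed "GTA" at position 4 in both
print(block)                      # (2, 9, 2, 9)
print(query[block[0]:block[1]])   # ACGTACG
```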
Alignment Stage (nucleotide) recursive
Alignment Stage (protein) • Extend hits into maximal-scoring ungapped alignments (HSPs) with a +2/-1 scoring scheme • Create a graph of all possible HSP merges • Use dynamic programming to traverse the graph
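The graph traversal can be sketched as a dynamic program over HSPs: an edge joins two HSPs when the second starts after the first ends in both the query and the target, and the best path maximizes the summed HSP scores. This is a simplified sketch, not BLAT's actual procedure; it assumes no gap penalty and no handling of overlapping HSPs, and the HSP tuples and the name `best_chain` are hypothetical.

```python
def best_chain(hsps):
    """Find the highest-scoring chain of collinear HSPs.

    Each HSP is (q_start, q_end, t_start, t_end, score), half-open coordinates.
    Returns (total_score, chain).
    """
    if not hsps:
        return 0, []
    hsps = sorted(hsps, key=lambda h: (h[0], h[2]))
    best = [h[4] for h in hsps]   # best chain score ending at each HSP
    prev = [-1] * len(hsps)       # back-pointers for recovering the chain
    for j in range(len(hsps)):
        for i in range(j):
            collinear = hsps[i][1] <= hsps[j][0] and hsps[i][3] <= hsps[j][2]
            if collinear and best[i] + hsps[j][4] > best[j]:
                best[j] = best[i] + hsps[j][4]
                prev[j] = i
    end = max(range(len(hsps)), key=lambda idx: best[idx])
    total = best[end]
    chain = []
    while end != -1:
        chain.append(hsps[end])
        end = prev[end]
    return total, chain[::-1]

hsps = [(0, 10, 100, 110, 20), (12, 20, 112, 120, 16),
        (5, 15, 300, 310, 18), (25, 30, 125, 130, 10)]
total, chain = best_chain(hsps)
print(total)  # 46
print(chain)  # [(0, 10, 100, 110, 20), (12, 20, 112, 120, 16), (25, 30, 125, 130, 10)]
```

The HSP at target positions 300 to 310 is left out because it is not collinear with the others.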
Alignment Stage (protein) query HSP homologous region
Stitching and Filling In • The alignment of a gene is often scattered across multiple homologous regions found in the search stage. [Figure labels: query, database]
Stitching and Filling In query homologous region database
Evaluation • Comparison with Other Tools: • mRNA/Genome Alignments • Remapped 713 mRNAs corresponding to annotated chromosome 22 • BLAT took 26 sec while Sim4 took 17,468 sec (almost 5h)
Evaluation • Comparison with Other Tools: • Translated Mouse/Human Alignments • 13 million mouse genomic reads vs. human chromosome 22
BLAT vs. BLAST • Index • Query vs. Database • Hits • Perfect vs. Near Perfect • Alignment • Separate vs. Together