BLAT – The BLAST-Like Alignment Tool

BLAT – The BLAST-Like Alignment Tool Kent, W.J. Genome Res. 2002 12: 656-664 Presenter: 巨彥霖田知本

BLAT overview • Use an index to find regions in the genome homologous to the query. • Do a detailed alignment between the query and the homologous regions. • Use dynamic programming to stitch the detailed alignments of the individual regions into a detailed alignment of the whole.

Index • Database: non-overlapping K-mers • Query: overlapping K-mers [Figure: the database split into consecutive non-overlapping K-mers and the query covered by overlapping K-mers]

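Example • Database: cacaattatcacgaccgc • Non-overlapping 3-mers: cac aat tat cac gac cgc • Index (3-mer → database positions): aat → 3; gac → 12; cac → 0, 9; tat → 6; cgc → 15 • Query: aattctcac • Overlapping 3-mers (query position): aat (0), att (1), ttc (2), tct (3), ctc (4), tca (5), cac (6)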
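The example above can be reproduced with a short sketch. It assumes 0-based positions, a database tiled by non-overlapping K-mers, and a query scanned with overlapping K-mers; the function names `build_index` and `find_hits` are hypothetical and are not taken from the BLAT source.

```python
from collections import defaultdict


def build_index(database, k):
    """Map each non-overlapping K-mer of the database to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):  # step by k: no overlap
        index[database[pos:pos + k]].append(pos)
    return dict(index)


def find_hits(query, index, k):
    """Look up every overlapping K-mer of the query; return (query_pos, db_pos) pairs."""
    hits = []
    for q_pos in range(len(query) - k + 1):  # step by 1: overlapping
        for t_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, t_pos))
    return hits


index = build_index("cacaattatcacgaccgc", 3)
# {'cac': [0, 9], 'aat': [3], 'tat': [6], 'gac': [12], 'cgc': [15]}
hits = find_hits("aattctcac", index, 3)
# [(0, 3), (6, 0), (6, 9)]
```

Only the query 3-mers aat and cac occur in the index, so the query produces three hits against the database.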

Search Criteria • Single Perfect Matches • Single Near Perfect Matches • Multiple Perfect Matches

Notation • K: K-mer size • M: match ratio between homologous areas • H: homologous region size • G: database (genome) size • Q: query sequence size • A: alphabet size

Single Perfect Matches (1) [Figure: a K-mer matching the homologous region perfectly]

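Single Perfect Matches (2) [Figure: a homologous region of size H divided into non-overlapping K-mers] • The probability that at least one K-mer in the homologous region matches perfectly (sensitivity): P = 1 − (1 − M^K)^T, where T = floor(H / K) is the number of non-overlapping K-mers in the region.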

Single Perfect Matches (3) • The number of K-mers in the database = G / K • The number of K-mers in the query = Q – K + 1 ⇒ The number of K-mer matches expected by chance (specificity): F = (Q – K + 1) × (G / K) × (1 / A)^K

Single Perfect Nucleotide K-mer Matches as Search Criterion

Case (perfect match) • Comparing mouse and human coding sequences at the nucleotide level: H = 100, M = 86%, sensitivity = 0.99 ⇒ max K = 7, chance matches = 13,078,962 (query = 500, database = 3 billion)
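The numbers in this case can be checked with a small sketch. It assumes that letters match independently with probability M, that the homologous region contains floor(H / K) non-overlapping K-mers, and that a random K-mer pair matches with probability (1 / A)^K. The function names are hypothetical. The number of query K-mers is passed in explicitly because the figure on the slide is reproduced with 500 query K-mers, not with Q – K + 1 = 494.

```python
def sensitivity_perfect(K, M, H):
    """Probability that at least one non-overlapping K-mer in the region matches perfectly."""
    T = H // K                      # non-overlapping K-mers in the homologous region
    return 1.0 - (1.0 - M ** K) ** T


def chance_matches_perfect(K, G, A, n_query_kmers):
    """Expected number of perfect K-mer matches that occur by chance."""
    return n_query_kmers * (G / K) * (1.0 / A) ** K


# Mouse vs. human coding sequence: H = 100, M = 0.86
ok_at_7 = sensitivity_perfect(7, 0.86, 100) >= 0.99   # True
ok_at_8 = sensitivity_perfect(8, 0.86, 100) >= 0.99   # False, so max K = 7

# Query of 500 bases against a 3-billion-base database, A = 4
expected = round(chance_matches_perfect(7, 3e9, 4, 500))   # 13078962
```

With `n_query_kmers = 500 - 7 + 1` the same function gives a slightly smaller value, about 12.9 million.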

Single Near Perfect Matches (1) • Near perfect: one letter of the K-mer may mismatch. [Figure: a K-mer matching the homologous region with one mismatch]

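Single Near Perfect Matches (2) • Sensitivity: P = 1 − (1 − p1)^T, where p1 = K × M^(K−1) × (1 − M) + M^K and T = floor(H / K) • Specificity: F = (Q – K + 1) × (G / K) × (K × (1 / A)^(K−1) × (1 − 1 / A) + (1 / A)^K)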

Case (near perfect match) • Comparing mouse and human coding sequences at the nucleotide level: H = 100, M = 86%, sensitivity = 0.99 ⇒ max K = 12, chance matches = 275,671 (query = 500, database = 3 billion)
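The near-perfect case can be checked the same way. This sketch assumes independent letters, a region holding floor(H / K) non-overlapping K-mers, and a K-mer that counts as a match when at most one of its K letters differs. The function names are hypothetical, and the number of query K-mers is a parameter because the slide's figure corresponds to 500 query K-mers.

```python
def sensitivity_near_perfect(K, M, H):
    """Probability that at least one K-mer in the region matches with at most one mismatch."""
    T = H // K
    p1 = K * M ** (K - 1) * (1.0 - M) + M ** K   # exactly one mismatch, or none
    return 1.0 - (1.0 - p1) ** T


def chance_matches_near_perfect(K, G, A, n_query_kmers):
    """Expected number of near-perfect K-mer matches that occur by chance."""
    p_random = K * (1.0 / A) ** (K - 1) * (1.0 - 1.0 / A) + (1.0 / A) ** K
    return n_query_kmers * (G / K) * p_random


# Mouse vs. human coding sequence: H = 100, M = 0.86
ok_at_12 = sensitivity_near_perfect(12, 0.86, 100) >= 0.99   # True
ok_at_13 = sensitivity_near_perfect(13, 0.86, 100) >= 0.99   # False, so max K = 12

# Query of 500 bases against a 3-billion-base database, A = 4
expected = round(chance_matches_near_perfect(12, 3e9, 4, 500))   # 275671
```

Allowing one mismatch raises the usable K from 7 to 12 at the same sensitivity, which cuts the expected number of chance matches by roughly a factor of 47.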

Single Near Perfect Nucleotide K-mer Matches as Search Criterion

Multiple Perfect Matches • A hit is triggered when: • there are N perfect K-mer matches • they are no further than W letters from each other in the database coordinate • they all have the same diagonal coordinate (database coordinate minus query coordinate)

Example [Figure: hits a, b, c, and d plotted by query coordinate and target coordinate, with the window W marked] • The hits a, b, c, and d are all K letters long. • Hits b and d have the same diagonal coordinate and lie within W letters of each other, so they satisfy the search criterion of 2 perfect K-mer matches.
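A minimal sketch of this trigger rule follows. It assumes that the diagonal is the target coordinate minus the query coordinate and that "within W" is checked between neighbouring hits on the same diagonal. The function name and the coordinates of a, b, c, and d are made up for illustration; the slide does not give numeric positions.

```python
from collections import defaultdict


def multi_hit_triggers(hits, n, w):
    """Return (diagonal, target_positions) for every run of n hits on one
    diagonal in which neighbouring hits are at most w letters apart."""
    by_diagonal = defaultdict(list)
    for q_pos, t_pos in hits:
        by_diagonal[t_pos - q_pos].append(t_pos)   # assumed diagonal definition

    triggers = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for i in range(len(positions) - n + 1):
            window = positions[i:i + n]
            if all(b - a <= w for a, b in zip(window, window[1:])):
                triggers.append((diagonal, window))
    return triggers


# Hypothetical (query_pos, target_pos) coordinates for hits a, b, c, d
a, b, c, d = (2, 40), (10, 20), (30, 25), (25, 35)
found = multi_hit_triggers([a, b, c, d], n=2, w=20)
# [(10, [20, 35])]: only b and d share a diagonal and are close enough
```

Shrinking W to 10 in this example removes the trigger, because b and d are 15 letters apart in the target.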

Multiple Perfect Nucleotide K-mer Matches as Search Criterion

Default • Nucleotide • two perfect 11-mer matches • Protein • a single perfect 5-mer match for the standalone version • three perfect 4-mer matches for the client/server version

BLAST • Build the hash table for Sequence A. • Scan Sequence B for hits. • Extend hits.

BLAST Step 1: Build the hash table for Sequence A (3-tuple example). • For protein sequences: Seq. A = ELVIS. Add xyz to the hash table if Score(xyz, ELV) ≥ T; add xyz to the hash table if Score(xyz, LVI) ≥ T; add xyz to the hash table if Score(xyz, VIS) ≥ T. • For DNA sequences: Seq. A = AGATCGAT (positions 1 to 8). Hash table (3-tuple → positions): AAA; AAC; …; AGA → 1; …; ATC → 3; …; CGA → 5; …; GAT → 2, 6; …; TCG → 4; …; TTT
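The DNA half of this slide can be reproduced with a short sketch. It assumes exact 3-tuples, overlapping windows, and 1-based positions as on the slide; the protein case is left out because it needs a substitution matrix and the threshold T. The function names and the sequence B used for scanning are hypothetical.

```python
from collections import defaultdict


def build_tuple_table(seq_a, w=3):
    """Map every overlapping w-tuple of Sequence A to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq_a) - w + 1):
        table[seq_a[i:i + w]].append(i + 1)   # 1-based, as on the slide
    return dict(table)


def scan_for_hits(seq_b, table, w=3):
    """Scan Sequence B and report (position_in_A, position_in_B) for every shared w-tuple."""
    hits = []
    for j in range(len(seq_b) - w + 1):
        for pos_a in table.get(seq_b[j:j + w], []):
            hits.append((pos_a, j + 1))
    return hits


table = build_tuple_table("AGATCGAT")
# {'AGA': [1], 'GAT': [2, 6], 'ATC': [3], 'TCG': [4], 'CGA': [5]}
hits = scan_for_hits("TTGATC", table)   # hypothetical Sequence B
# [(2, 3), (6, 3), (3, 4)]
```

Tuples that never occur in Sequence A (AAA, AAC, …, TTT) are simply absent from the dictionary instead of being stored with empty lists.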

BLAST Step 2: Scan sequence B for hits.

BLAST Step 2: Scan sequence B for hits. Step 3: Extend hits. • BLAST 2.0 saves the time spent in extension and considers gapped alignments. • Terminate if the score of the extension fades away (that is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions).
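The termination rule can be illustrated with a one-directional, ungapped extension. This is a minimal sketch only: the scores (+1 match, -1 mismatch), the drop-off threshold `x_drop`, and the function name are assumptions chosen for the example and are not BLAST's actual parameters.

```python
def extend_right(a, b, i, j, match=1, mismatch=-1, x_drop=3):
    """Extend an ungapped alignment to the right from a[i], b[j].

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    length = best_len = 0
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break   # the score has faded away
    return best, best_len


# Seven matching letters followed by mismatches: extension stops and keeps the first seven.
r1 = extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0)   # (7, 7)
# A single mismatch is tolerated because the score recovers afterwards.
r2 = extend_right("AAAXAAAA", "AAAYAAAA", 0, 0)           # (6, 8)
```

A full extension would run the same loop to the left of the hit as well and add the two scores.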

Algorithm • Search Stage • Use an index to find regions in the genome homologous to the query • Alignment Stage • Do a detailed alignment between the query and the homologous regions • Stitching and Filling In • Use dynamic programming to stitch the detailed alignments of the individual regions into a detailed alignment of the whole

Search Stage • Build an index that contains the positions of each K-mer in the database. • Step through each overlapping K-mer in the query and look it up in the index. • Get a list of 'hits': positions in the query and in the database that match for K bases. • Cluster the hits to find homologous regions.

Search Stage • Clump hits

Search Stage • Eliminate small clumps • Clump the 'clumps' into homologous regions
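The slides do not spell out how hits are clumped, so the following is only a simplified greedy sketch of the idea, not BLAT's actual procedure. It assumes that hits belong together when they are close in the database and lie on nearly the same diagonal; the function name and the parameters `max_diag_diff`, `max_gap`, and `min_hits` are hypothetical. The second-level step of merging clumps into homologous regions is not shown.

```python
def clump_hits(hits, max_diag_diff, max_gap, min_hits):
    """Greedily group (query_pos, target_pos) hits and drop small clumps."""
    clumps = []
    for q_pos, t_pos in sorted(hits, key=lambda h: h[1]):   # walk along the database
        diagonal = t_pos - q_pos
        for clump in clumps:
            last_q, last_t = clump[-1]
            close_in_target = t_pos - last_t <= max_gap
            same_diagonal = abs(diagonal - (last_t - last_q)) <= max_diag_diff
            if close_in_target and same_diagonal:
                clump.append((q_pos, t_pos))
                break
        else:
            clumps.append([(q_pos, t_pos)])   # no suitable clump: start a new one
    # Eliminate small clumps
    return [c for c in clumps if len(c) >= min_hits]


hits = [(0, 100), (11, 111), (22, 122), (5, 5000), (40, 141)]
clumps = clump_hits(hits, max_diag_diff=2, max_gap=50, min_hits=2)
# [[(0, 100), (11, 111), (22, 122), (40, 141)]]
```

The isolated hit at database position 5000 forms a clump of size one and is discarded by the `min_hits` filter.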

Alignment Stage (nucleotide) • Start from scratch with the regions defined by the K-mer hits • Index on smaller K-mers, but extend each K-mer until it becomes specific • Extend in both directions without mismatches or gaps, and merge overlapping or contiguous alignments • Recurse on gaps with smaller K until the gaps or the hits are eliminated
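The "extend in both directions without mismatches or gaps" step can be sketched as follows. It assumes a seed of length k that already matches exactly at 0-based positions `q_pos` and `t_pos`; the function name and the example sequences are hypothetical, and the merging and recursion steps are not shown.

```python
def extend_exact(query, target, q_pos, t_pos, k):
    """Grow an exact K-mer match left and right until the first mismatch.

    Returns (query_start, target_start, length) of the maximal exact match.
    """
    q_start, t_start = q_pos, t_pos
    while q_start > 0 and t_start > 0 and query[q_start - 1] == target[t_start - 1]:
        q_start -= 1
        t_start -= 1

    q_end, t_end = q_pos + k, t_pos + k
    while q_end < len(query) and t_end < len(target) and query[q_end] == target[t_end]:
        q_end += 1
        t_end += 1

    return q_start, t_start, q_end - q_start


# Seed "CGT" at position 3 in both sequences grows to the exact match "ACGTAC".
block = extend_exact("GGACGTACC", "TTACGTACAA", 3, 3, 3)
# (2, 2, 6)
```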

Alignment Stage (nucleotide) recursive

Alignment Stage (protein) • Extend hits into maximal-scoring ungapped alignments (HSPs) with a +2/-1 scoring scheme • Create a graph of all possible HSP merges • Use dynamic programming to traverse the graph
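The graph traversal can be illustrated with a simple chaining recurrence. This sketch assumes that one HSP may follow another only if it starts after the other ends in both the query and the target, and that a chain's score is the plain sum of HSP scores with no gap penalty. Both are simplifications for illustration; the function name, the tuple layout, and the example HSPs are hypothetical.

```python
def best_hsp_chain(hsps):
    """hsps: list of (q_start, q_end, t_start, t_end, score), half-open intervals.

    Returns (total_score, chain) for the highest-scoring chain of compatible HSPs.
    """
    if not hsps:
        return 0, []
    hsps = sorted(hsps)                  # by q_start, so predecessors come first
    best = [h[4] for h in hsps]          # best chain score ending at each HSP
    prev = [None] * len(hsps)
    for i, (qs, _, ts, _, score) in enumerate(hsps):
        for j in range(i):
            # Edge j -> i exists only if j ends before i starts in both sequences
            if hsps[j][1] <= qs and hsps[j][3] <= ts and best[j] + score > best[i]:
                best[i] = best[j] + score
                prev[i] = j
    end = max(range(len(hsps)), key=best.__getitem__)
    total, chain = best[end], []
    while end is not None:               # walk the back-pointers
        chain.append(hsps[end])
        end = prev[end]
    return total, chain[::-1]


hsps = [(0, 10, 100, 110, 20), (12, 20, 115, 123, 16),
        (5, 15, 300, 310, 18), (22, 30, 90, 98, 14)]
total, chain = best_hsp_chain(hsps)
# total == 36, chain == [(0, 10, 100, 110, 20), (12, 20, 115, 123, 16)]
```

The recurrence is quadratic in the number of HSPs, which is acceptable for the handful of HSPs inside one homologous region.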

Alignment Stage (protein)

Alignment Stage (protein) [Figure: HSPs between the query and the homologous region]

Stitching and Filling In • The alignment of a gene is often scattered across multiple homologous regions found in the search stage. [Figure: query and database]

Stitching and Filling In [Figure: query, homologous regions, and database]

Evaluation • Comparison with Other Tools: • mRNA/Genome Alignments • Remapped 713 mRNAs corresponding to annotated chromosome 22 • BLAT took 26 sec while Sim4 took 17,468 sec (almost 5h)

Evaluation • Comparison with Other Tools: • Translated Mouse/Human Alignments • 13 million mouse genomic reads vs. human chromosome 22

BLAT vs. BLAST • Index • Query (BLAST) vs. Database (BLAT) • Hits • Perfect vs. Near Perfect • Alignment • Separate (BLAST) vs. Together (BLAT)


