CAP5510 – Bioinformatics Database Searches for Biological Sequences

CAP5510 – Bioinformatics: Database Searches for Biological Sequences • Tamer Kahveci, CISE Department, University of Florida

Goals • Understand how major heuristic methods for sequence comparison work • FASTA • BLAST • Understand how search results are evaluated

What is Database Search? [Figure: a query compared against many long sequences, or against one giant sequence]

What is Database Search? [Figure: comparison of two giant sequences]

What is Database Search? • Find a particular (usually short) sequence in a database of sequences (or one huge sequence). • The problem is identical to local sequence alignment, but on a much larger scale. • We must also have some idea of the significance of a database hit. • Databases always return some kind of hit; how much attention should be paid to the result? • A similar problem is the global alignment of two large sequences. • General idea: good alignments contain high-scoring regions.

Database Search Issues • How can we search a massive space quickly? • How can we evaluate the significance of the result?

Database Search Methods • Hash table based methods • FASTA family • FASTP, FASTA, TFASTA, FASTX, FASTY • BLAST family • BLASTP, BLASTN, TBLASTN, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST • Others • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS • Suffix tree based methods • Mummer, AVID, Reputer, MGA, QUASAR

Hash Table

Hash Table • K-gram = subsequence of length K • A^K entries • A is the alphabet size • Linear time construction • Constant lookup time
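The k-gram table described above can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular tool: the function name `build_kgram_index` is hypothetical, and a Python dictionary stands in for the direct-addressed table with A^K entries, so only k-grams that actually occur are stored.

```python
from collections import defaultdict


def build_kgram_index(sequence, k):
    """Map each k-gram of `sequence` to the list of its 0-based start positions."""
    index = defaultdict(list)
    # One pass over the sequence: linear construction time for fixed k.
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


# Example query taken from the BLAST slides later in this lecture.
index = build_kgram_index("MCGPFILGTYC", 3)
print(index.get("CGP"))  # positions where CGP starts
print(index.get("WWW"))  # None: this 3-gram does not occur
```

Lookup is an expected constant-time dictionary access, which matches the "constant lookup time" property listed on the slide.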

FASTP Lipman & Pearson, 1985

FASTP • Three phase algorithm • Find short good matches using k-grams • K = 1 or 2 • Find start and end positions for good matches • Use DP to align good matches

FASTP: Phase 1 (1)

position    1  2  3  4  5  6  7  8  9  10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

amino acid   pos in protein 1   pos in protein 2   offset (pos 1 - pos 2)
-----------------------------------------------------------------------
a            6                  6                   0
c            2                  7                  -5
k            -                  11
n            1                  -
p            4                  9                  -5
r            -                  10
s            3                  8                  -5
t            5                  -
-----------------------------------------------------------------------

Note the common offset (-5) for the 3 amino acids c, s and p. A possible alignment can be found quickly:

protein 1   n c s p t a
              | | |
protein 2   a c s p r k

FASTP: Phase 1 (2) • Similar to dot plot • Offsets range from 1-m to n-1 • Each offset is scored as • # matches - # mismatches • Diagonals (offsets) with large score show local similarities • How does it depend on k?
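The offset bookkeeping of Phase 1 can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `diagonal_hits` is hypothetical, positions are 1-based as on the slide, "." marks unknown residues, and each diagonal is scored by its number of matches only (the slide's score also subtracts mismatches).

```python
from collections import Counter, defaultdict


def diagonal_hits(seq_a, seq_b, k=1, skip="."):
    """Count k-gram matches per offset (position in seq_a minus position in seq_b)."""
    positions = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        word = seq_a[i:i + k]
        if skip not in word:
            positions[word].append(i + 1)  # 1-based, as on the slide
    hits = Counter()
    for j in range(len(seq_b) - k + 1):
        word = seq_b[j:j + k]
        if skip in word:
            continue
        for i in positions.get(word, []):
            hits[i - (j + 1)] += 1
    return hits


# The two proteins from the Phase 1 example.
hits = diagonal_hits("ncspta.....", ".....acsprk", k=1)
best_offset, best_count = hits.most_common(1)[0]
print(best_offset, best_count)
```

On the slide example the three matches c, s and p all fall on offset -5, so that diagonal is reported as the best candidate.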

FASTP: Phase 2 • 5 best diagonal runs are found • Rescore these 5 regions using PAM250. • Initial score • Indels are not considered yet

FASTP: Phase 3 • Sort the aligned regions in descending score • Optimize these alignments using Needleman-Wunsch • Report the results

FASTP - Discussion • Results are not optimal. Why? • How does performance compare to Smith-Waterman? • What is the impact of k? • How does this idea work for DNA? • K = 4 or 6 for DNA

FASTA – Improvement Over FASTP Pearson 1995

FASTA (1) • Phase 2: Choose 10 best diagonal runs instead of 5

FASTA (2) • Phase 2.5 • Eliminate diagonals that score less than some given threshold. • Combine matches to find longer matches. Each join incurs a penalty similar to a gap penalty.

BLAST Altschul, Gish, Miller, Myers, Lipman, 1990

BLAST (or BLASTP) • BLAST – Basic Local Alignment Search Tool • An approximation of Smith-Waterman • Designed for database searches • Short query sequence against long database sequence or a database of many sequences • Sacrifices search sensitivity for speed

BLAST Algorithm (1) • Eliminate low complexity regions from the query sequence. • Replace them with X (protein) or N (DNA) • Hash table on query sequence. • K = 3 for proteins [Figure: query MCGPFILGTYC with its first 3-grams MCG and CGP]

BLAST Algorithm (2) • For each k-gram, find all k-grams that align with score at least cutoff T using BLOSUM62 • 20^k candidates • ~50 on average per k-gram • ~50n for the entire query • Build hash table • Example: query PQGMCGPFILGTYC, 3-grams PQG, QGM, ... With T = 13, candidate scores against PQG are PQG 18, PEG 15, PRG 14, PSG 13, PQA 12, so PQA falls below the cutoff.
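The neighborhood step can be sketched by brute force over all 20^k candidates. This is an illustration only: the function name `neighborhood` is hypothetical, and just the three BLOSUM62 rows needed for the example word PQG are typed in by hand (column order ARNDCQEGHILKMFPSTWYV), so the function works only for words made of P, Q and G. Real implementations avoid enumerating every candidate.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G only (enough for the example word PQG).
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighborhood(word, threshold):
    """Return {candidate: score} for all candidates scoring >= threshold against `word`."""
    rows = [dict(zip(AMINO_ACIDS, BLOSUM62_ROWS[residue])) for residue in word]
    neighbors = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(row[c] for row, c in zip(rows, candidate))
        if score >= threshold:
            neighbors["".join(candidate)] = score
    return neighbors


neighbors = neighborhood("PQG", 13)
for candidate, score in sorted(neighbors.items(), key=lambda item: -item[1]):
    print(candidate, score)
```

Every word returned here would be inserted into the hash table, pointing back to the query position of PQG.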

BLAST Algorithm (3) • Sequentially scan the database and locate each k-gram in the hash table • Each match is a seed for an ungapped alignment.
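The scan step can be sketched as below. This is a minimal version under stated assumptions: the function name `find_seeds` and the database string are made up for illustration, and only exact k-grams of the query are indexed (the neighborhood words from the previous step are left out to keep the sketch short).

```python
from collections import defaultdict


def find_seeds(query, database, k):
    """Return (query_pos, db_pos, diagonal) for every k-gram shared by query and database."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    seeds = []
    # Sequential scan of the database: one table lookup per database position.
    for j in range(len(database) - k + 1):
        for i in table.get(database[j:j + k], ()):
            seeds.append((i, j, j - i))
    return seeds


# Hypothetical database sequence sharing two regions with the example query.
seeds = find_seeds("MCGPFILGTYC", "AAMCGPWWLGTYA", 3)
print(seeds)
```

All four seeds in this example share the same diagonal (database position minus query position), which is the situation the next step looks for.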

BLAST Algorithm (4) • HSP (High Scoring Pair) = A match between a query word and the database • Find a “hit”: Two non-overlapping HSPs on the same diagonal within distance A • Extend the hit in both directions until the score drops more than a threshold value, X, below the best score seen so far
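One common way to implement the extension step is the X-drop rule: keep extending along the diagonal without gaps, remember the best score seen, and stop once the running score has fallen more than X below it. The sketch below makes several assumptions for illustration: the function names are hypothetical, the scoring is a simple +2 for a match and -1 for a mismatch instead of a substitution matrix, and a single seed is extended rather than a two-seed hit.

```python
def match_score(a, b):
    # Assumed toy scoring; BLAST itself uses a substitution matrix such as BLOSUM62.
    return 2 if a == b else -1


def extend_one_way(query, subject, q_pos, s_pos, step, x_drop):
    """Extend in one direction (step = +1 or -1); return (best gain, length at best)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += match_score(query[q_pos], subject[s_pos])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # fell more than X below the best score seen
        q_pos += step
        s_pos += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, k, x_drop):
    """Ungapped extension of a k-long seed; returns (score, query_begin, query_end)."""
    seed = sum(match_score(query[q_start + i], subject[s_start + i]) for i in range(k))
    right, right_len = extend_one_way(query, subject, q_start + k, s_start + k, 1, x_drop)
    left, left_len = extend_one_way(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return seed + left + right, q_start - left_len, q_start + k + right_len


query, subject = "TTACGTACCAAA", "GGACGTACCTTT"
result = extend_seed(query, subject, 4, 4, 3, x_drop=2)
print(result, query[result[1]:result[2]])
```

A larger X lets the extension survive a longer run of mismatches before giving up, at the cost of more work per seed.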

BLAST Algorithm (5) • Keep only the extended matches that have a score at least S. • Determine the statistical significance of the result

What is Statistical Significance? • Two one-on-one games, two scores. • Which result is more significant? • Expected: may be a random result. • Unexpected: significant, may be meaningful. [Figure: two games, each ending 13 : 15]

Statistical Significance • E-value: The expected number of matches with score at least S • $E = K m n e^{-\lambda S}$ • m, n: sequence lengths • S: alignment score • K, $\lambda$: normalization parameters • P-value: The probability of having at least one match with score at least S • $P = 1 - e^{-E}$ • The smaller these values are, the more significant the result • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
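The two formulas above translate directly into code. In this sketch the function names are hypothetical, and the values used for K and lambda are placeholders chosen only to show the arithmetic; real values depend on the scoring matrix and gap penalties.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of matches with score >= `score`: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such match: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


# Placeholder parameters (assumptions, not values from the lecture).
K, lam = 0.1, 0.3
m, n = 300, 1_000_000  # query length and database length

for score in (30, 60, 90):
    e = e_value(score, m, n, K, lam)
    print(score, e, p_value(e))
```

Because S appears in the exponent, E falls off exponentially as the score grows, and for small E the P-value is approximately equal to E.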

BLAST - Analysis • K (k-gram): Lower: more sensitive, slower. • T (neighbor cutoff): Lower: finds distant neighbors, introduces noise. • X (extension cutoff): Higher: lower chance of getting stuck in a local minimum, slower.

Sample Query • http://www.ncbi.nlm.nih.gov/BLAST/ Dhal_ecoli I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K T I W I

BLASTN • BLAST for nucleic acids • K = 11 • Exact match instead of neighborhood search.

BLAST Variations

Even More Variations • PsiBLAST (iterative) • BLAT, BLASTZ, MegaBLAST • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS • Main differences are • Seed choice (k, gapped seeds) • Additional data structures

Suffix Trees

Suffix Tree • Tree structure that contains all suffixes of the input sequence • TGAGTGCGA • GAGTGCGA • AGTGCGA • GTGCGA • TGCGA • GCGA • CGA • GA • A

Suffix Tree Example

Suffix Tree Analysis • O(n) space and construction time • 10n to 70n space usage reported • O(m) search time for m-letter sequence • Good for • Small data • Exact matches

Suffix Array • 5 bytes per letter • O(m log n) search time • Better space usage • Slower search
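The O(m log n) search on a suffix array is a binary search over the sorted suffixes. The sketch below uses the sequence TGAGTGCGA from the suffix tree slide. It is an illustration under stated assumptions: the function names are hypothetical, and the array is built by plain sorting, which is simple but not the linear-time construction used in practice.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of `pattern`, using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-letter prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "TGAGTGCGA"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "TG"))
```

Each of the O(log n) comparisons inspects at most m letters, which gives the O(m log n) bound on the slide, while the array itself stores only one integer per letter.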

Mummer

Other Sequence Comparison Tools • Reputer, MGA, AVID • QUASAR (suffix array)
