Class 4: Fast Sequence Alignment


Alignment in Real Life • One of the major uses of alignments is to find sequences in a “database” • Such collections contain massive numbers of sequences (on the order of 10^6) • Finding homologies in these databases with standard dynamic programming can take too long • Example: • query protein: 232 AAs • NR protein DB: 2.7 million sequences; 748 million AAs • m*n ≈ 1.7 × 10^11 cells!
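The cell count in the example can be checked with a few lines of Python. The throughput figure below is an assumption chosen only to give a feel for the running time; it does not come from the slides.

```python
# Sizes taken from the example on the slide.
query_len = 232              # residues in the query protein
db_residues = 748_000_000    # total residues in the NR protein database

# Full dynamic programming fills one cell per (query residue, database residue) pair.
cells = query_len * db_residues
print(f"{cells:.2e} DP cells")

# Assumed throughput, for illustration only: 1e8 cell updates per second.
cells_per_second = 1e8
print(f"roughly {cells / cells_per_second / 60:.0f} minutes for a single query")
```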

Heuristic Search • Instead, most searches rely on heuristic procedures • These are not guaranteed to find the best match • Sometimes, they will completely miss a high-scoring match • We now describe the main ideas used by some of these procedures • Actual implementations often contain additional tricks and hacks

Basic Intuition • The main resource-consuming factor in the standard DP is deciding where the gaps are. If there were no gaps, life would be easy! • Almost all heuristic search procedures are based on the observation that real-life well-matching pairs of sequences often contain long gap-less matching stretches. • These heuristics try to find significant local gap-less matches and then extend them.

Banded DP • Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m • If the optimal global alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal

Banded DP • To find such a path, it suffices to search in a diagonal region of the matrix • If the diagonal band has presumed width a, then the dynamic programming step takes O(an) • Much faster than the O(n^2) of standard DP in this case
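A minimal sketch of banded global alignment, assuming a simple match/mismatch/linear-gap scoring scheme (the scoring values and the function name are illustrative, not from the slides). Only cells with |i - j| <= a are filled, so the work is O(an) rather than O(nm).

```python
def banded_global_score(s, t, a, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells (i, j) with |i - j| <= a."""
    n, m = len(s), len(t)
    if abs(n - m) > a:
        raise ValueError("band too narrow: the end cell (n, m) lies outside it")
    NEG = float("-inf")  # score of any cell outside the band

    # Row 0: aligning the empty prefix of s against t[:j] costs j gaps.
    prev = {j: j * gap for j in range(0, min(m, a) + 1)}

    for i in range(1, n + 1):
        cur = {}
        # Visit only the columns of row i that fall inside the band.
        for j in range(max(0, i - a), min(m, i + a) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(
                prev.get(j - 1, NEG) + sub,  # diagonal: align s[i-1] with t[j-1]
                prev.get(j, NEG) + gap,      # up: gap in t
                cur.get(j - 1, NEG) + gap,   # left: gap in s
            )
        prev = cur
    return prev[m]


print(banded_global_score("ACGT", "AGT", a=1))
```

Each row is stored as a dictionary keyed by column so that no memory or time is spent on cells outside the band. If the true optimal path leaves the band, the returned score is only a lower bound on the optimum.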

Banded DP Problem (for local alignment): • If we know that t[i..j] matches the query s[p..q], then we can use banded DP to evaluate the quality of the match • However, we do not know i, j, p, q! • How do we select which sub-sequences to align using banded DP?

FASTA Overview • Main idea: Find (fast!) “good” diagonals and extend them to complete matches • Suppose that we have a relatively long gap-less local match (diagonal): …AGCGCCATGGATTGAGCGA… …TGCGACATTGATCGACCTA… • Can we find “clues” that will let us find it quickly?

Signature of a Match • Assumption: good matches contain several “patches” of perfect matches: AGCGCCATGGATTGAGCGA TGCGACATTGATCGACCTA

FASTA • Given s and t, and a parameter k • Find all pairs (i,j) such that the k-letter words s[i..i+k-1] and t[j..j+k-1] match perfectly • Group the pairs that lie on the same diagonal by sorting according to i-j • This locates diagonals that contain many close pairs • This is faster than O(nm)!
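The diagonal-finding step above can be sketched as follows. This is an illustration under simplifying assumptions, not FASTA's actual implementation: the function names are made up here, and hits are bucketed by i-j in a dictionary instead of being sorted, which has the same effect.

```python
from collections import defaultdict


def kmer_index(t, k):
    """Map every k-letter word of t to the list of positions where it starts."""
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    return index


def diagonal_hits(s, t, k):
    """Group exact k-word matches (i, j) by diagonal i - j.

    Returns a dict: diagonal -> list of start positions i in s.
    """
    index = kmer_index(t, k)
    diagonals = defaultdict(list)
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], ()):
            diagonals[i - j].append(i)
    return diagonals


def best_diagonals(s, t, k, top=3):
    """Diagonals with the most k-word hits, best first."""
    diagonals = diagonal_hits(s, t, k)
    return sorted(diagonals.items(), key=lambda kv: len(kv[1]), reverse=True)[:top]


# The two strings from the "Signature of a Match" slide.
s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
print(best_diagonals(s, t, k=3))
```

Building the index takes one pass over t and the lookup one pass over s, so the cost is roughly linear in the sequence lengths plus the number of hits.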

FASTA • Extend the “best” diagonal matches to imperfect (yet ungapped) matches and compute alignment scores per diagonal. Pick the best-scoring matches. • Try to combine close diagonals into potential gapped matches, picking the best-scoring matches. • Finally, run banded DP on the regions containing these matches, resulting in several good candidate alignments. • Most applications of FASTA use a very small k (2 for proteins, and 4-6 for DNA)

BLAST Overview • FASTA's drawback is its reliance on perfect matches • BLAST (Basic Local Alignment Search Tool) uses similar intuition, but relies on high-scoring matches rather than exact matches • Given parameters: length k, and threshold T • Two strings s and t of length k are a high-scoring pair (HSP) if d(s,t) > T

High-Scoring Pair • Given a query string s, BLAST constructs all words w (“neighborhood words”) such that w is an HSP with a k-substring of s. • Note: not all k-mers have an HSP in s

BLAST: phase 1 • Phase 1: compile a list of word pairs (k=3) scoring above threshold T • Example: for the following query: …FSGTWYA… (query word is in green) • A list of words (k=3) is: • FSG SGT GTW TWY WYA • YSG TGT ATW SWY WFA • FTG SVT GSW TWF WYS

BLAST: phase 1 scores

Word   Per-letter scores   Total
Neighborhood word hits > threshold:
GTW    6, 5, 11            22
ASW    6, 1, 11            18
ATW    0, 5, 11            16
NTW    0, 5, 11            16
GTY    6, 5, 2             13
Neighborhood word hits below threshold (T=11):
GNW                        10
GAW                         9
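Phase 1 can be sketched as a brute-force enumeration of all k-letter words. The scoring function below is a toy stand-in (5 for identical letters, -1 otherwise) and is an assumption made only to keep the example short; a real search would score letters with a substitution matrix such as BLOSUM62. The function names are likewise illustrative.

```python
from itertools import product


def toy_score(a, b):
    """Toy letter score (an assumption, not a real substitution matrix)."""
    return 5 if a == b else -1


def word_score(u, v, score=toy_score):
    """Score of two equal-length words: the sum of per-letter scores."""
    return sum(score(a, b) for a, b in zip(u, v))


def neighborhood(query, k, T, alphabet, score=toy_score):
    """All words w with word_score(q, w) > T for some k-word q of the query.

    Returns a dict: neighborhood word -> query offsets it is associated with.
    """
    words = {}
    for i in range(len(query) - k + 1):
        q = query[i:i + k]
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            if word_score(q, w, score) > T:
                words.setdefault(w, []).append(i)
    return words


amino_acids = "ACDEFGHIKLMNPQRSTVWY"
hits = neighborhood("FSGTWYA", k=3, T=8, alphabet=amino_acids)
print(len(hits), "neighborhood words")
```

With 20 amino acids and k=3 there are only 8000 candidate words per query position, so exhaustive enumeration is affordable. Raising T shrinks the list and speeds up the search at the cost of sensitivity.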

BLAST: phase 2 • Search the database for perfect matches with neighborhood words. Those are “hits” for further alignment. • We can locate seed words in a large database in a single pass, given that the database is properly preprocessed (using hashing techniques).
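One way to picture the single-pass search is to keep the neighborhood words in a hash table and slide a window of length k over each database sequence. This sketch hashes the word list, not the database, which is one of several possible designs; the function name, the sequence identifiers, and the tiny database are all hypothetical.

```python
def find_hits(neighborhood_words, database, k):
    """Scan every database sequence once and report seed hits.

    neighborhood_words: dict mapping a k-letter word to query offsets.
    database: dict mapping a sequence id to its sequence string.
    Returns a list of (sequence id, query offset i, database offset j).
    """
    hits = []
    for seq_id, seq in database.items():
        for j in range(len(seq) - k + 1):
            # Dictionary lookup is expected O(1), so the scan is one pass.
            for i in neighborhood_words.get(seq[j:j + k], ()):
                hits.append((seq_id, i, j))
    return hits


# Hypothetical inputs: two neighborhood words for query offset 2.
words = {"GTW": [2], "GSW": [2]}
database = {"protein_1": "AAGSWAA", "protein_2": "CCCC"}
print(find_hits(words, database, k=3))
```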

Extending Potential Matches • Once a hit is found, BLAST attempts to find a local alignment that extends it. • Seeds on the same diagonal tend to be combined (as in FASTA)
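A hedged sketch of ungapped extension around a seed. The stopping rule used here, giving up once the running score falls more than x_drop below the best score seen, is a common choice but is an assumption as far as these slides are concerned; the scoring function and all names are illustrative.

```python
def extend_ungapped(s, t, i, j, k, score, x_drop=10):
    """Extend the seed s[i:i+k] / t[j:j+k] along its diagonal without gaps.

    Returns (best score, start in s, end in s (exclusive), start in t).
    """
    best = cur = sum(score(s[i + d], t[j + d]) for d in range(k))

    # Extend to the right of the seed.
    right = 0
    d = k
    while i + d < len(s) and j + d < len(t):
        cur += score(s[i + d], t[j + d])
        if cur > best:
            best, right = cur, d - k + 1
        elif best - cur > x_drop:
            break
        d += 1

    # Extend to the left, starting from the best right-extended score.
    cur = best
    left = 0
    d = 1
    while i - d >= 0 and j - d >= 0:
        cur += score(s[i - d], t[j - d])
        if cur > best:
            best, left = cur, d
        elif best - cur > x_drop:
            break
        d += 1

    return best, i - left, i + k + right, j - left


def simple(a, b):
    """Illustrative +1/-1 letter score."""
    return 1 if a == b else -1


print(extend_ungapped("TTACGTACGTAA", "GGACGTACGTCC", 4, 4, 3, simple, x_drop=2))
```

Because the extension never leaves the diagonal, its cost is linear in the length of the extended segment.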

Two HSP diagonal • An improvement: look for 2 HSPs on close diagonals • Extend the alignment between them • Fewer extensions considered • There is a version of BLAST involving gapped extensions. • Generally faster than FASTA, arguably better.
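The two-hit idea can be sketched as a filter over seed hits: an extension is triggered only when two hits lie on close diagonals and near each other. The parameter names window and max_shift, and their default values, are assumptions for illustration rather than BLAST's actual settings.

```python
def two_hit_pairs(hits, window=40, max_shift=0):
    """Pairs of seed hits that are close together on close diagonals.

    hits: iterable of (i, j) seed positions in the query and the target.
    window: maximum distance i2 - i1 between the two seeds.
    max_shift: maximum difference between the diagonals i - j of the seeds.
    """
    ordered = sorted(hits)
    pairs = []
    for a, (i1, j1) in enumerate(ordered):
        for i2, j2 in ordered[a + 1:]:
            if i2 - i1 > window:
                break  # hits are sorted by i, so later ones are even farther
            close = abs((i1 - j1) - (i2 - j2)) <= max_shift
            if close and i2 > i1 and j2 > j1:
                pairs.append(((i1, j1), (i2, j2)))
    return pairs


seeds = [(0, 0), (10, 10), (100, 100), (12, 50)]
print(two_hit_pairs(seeds))
```

Requiring a second nearby hit discards most isolated chance matches, which is why fewer extensions need to be considered.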

BLAST Variants • blastn (nucleotide BLAST) • blastp (protein BLAST) • tblastn (protein query, translated DB BLAST) • blastx (translated query, protein DB BLAST) • tblastx (translated query, translated DB BLAST) • bl2seq (pairwise alignment)
