370 likes | 467 Views
Alignment to a database of sequences. Biology 162 Computational Genetics Todd Vision 7 Sep 2004. Preview. How to speed up pairwise alignment How to statistically evaluate alignments and database matches How to use BLAST What database should you search?. Need for fast pairwise alignment.
E N D
Alignment to a database of sequences Biology 162 Computational Genetics Todd Vision 7 Sep 2004
Preview • How to speed up pairwise alignment • How to statistically evaluate alignments and database matches • How to use BLAST • What database should you search?
Need for fast pairwise alignment • Even O(mn) is too slow or memory intensive for some applications • Shotgun assembly (Phrap) • Database search • Alignment of long sequences
History • FASTA (1983) • First practical tool for fast pairwise alignment • Restricts alignment to path-graph “hot-spots” • BLAST (1990) (and WU-BLAST) • Basic local alignment search tool • Refinement of ideas for locating “hot-spots” • Searching dB for “hot-spots” in sublinear time • Probability of alignment scores • Gapped BLAST and FASTA3 (1998, 2000) • The two programs have largely converged • Can now both produce gapped alignments with accurate statistics
Why is BLAST so fast? • Exclusion of database sequences unlikely to contain good alignments • Preprocessing of database to enable fast matching algorithms • Filling out only a small part of the path graph
How BLAST works • Segment pair - a pair of equal length substrings in query and dB • Locally maximal segment pair (LMSP) - ungapped score would decrease by extending or shortening either end • High scoring pair (HSP) - the LMSP of highest score between the query and a single DB sequence • BLAST heuristically finds a gapped alignment for one or more HSPs that has a score above some threshold
How BLAST works • Query broken into short overlapping words (default 11nt or 3aa) • Up to 50 high scoring word neighbors (with minimum score T) determined for each word in query LRE LRO LRQ LRT LRG LRQNMLAC LRT RQN
How BLAST works • Preprocessed database searched for exact matches to all high scoring words • Generation of words and search for matches are both linear in length of query • A memory intensive data structure allows a fast algorithm • We will see how this can be achieved with suffix trees later
How BLAST works • Segment pairs on the same diagonal and within A cells of each other are extended in both directions • Extension proceeds until the dropoff from the highest score is too great • If the score is sufficiently high, and no other overlapping LMSP is higher, the highest scoring extension is considered to be an HSP • Extension is the time consuming step of BLAST • Requiring two close segment pairs on the same diagonal means that most dB sequences can be discarded before the extension step • HSPs are joined by a gapped alignment
Alignment probabilities ELVIS ||| | ELVYS • Each HSP has an associated score (Ssij) • We know the scores for every possible amino acid pair i, j • We can compute the frequency of each i, j in a random alignment • Why not simply compute the probability of obtaining a score at least as good as S?
BLAST statistics • Scores are inflated over a naive probability model for two reasons • We have optimized the alignment • We are taking the MSP from that alignment and not a random pair of aligned segments • We are only reporting the highest scoring alignments after searching a large database • The expected number of matches that meet or surpass our observed score is given by the extreme value (or Gumbel) distribution
opt E() < 20 173 0:== 22 0 0: one = represents 122 library sequences 24 2 0:= 26 1 1:* 28 8 16:* 30 37 95:* 32 170 367:== * 34 714 996:====== * 36 1809 2045:=============== * 38 3729 3379:===========================*=== 40 5555 4713:======================================*======= 42 6991 5761:===============================================*========== 44 7264 6355:====================================================*======= 46 6786 6473:=====================================================*== 48 6059 6197:==================================================* 50 4920 5655:========================================= * 52 4204 4972:=================================== * 54 3410 4247:============================ * 56 2957 3547:========================= * 58 2469 2912:===================== * 60 2280 2359:===================* 62 1872 1891:===============* 64 1450 1504:============* 66 1176 1189:=========* 68 1010 935:=======*= 70 817 733:======* 72 668 573:====*= 74 501 446:===*= 76 431 347:==*= 78 342 270:==* 80 315 210:=*= 82 214 160:=* 84 161 127:=* 86 116 98:* 88 116 76:* inset = represents 2 library sequences 90 85 59:* 92 47 46:* :======================*= 94 55 35:* :=================*========== 96 32 27:* :=============*== 98 28 21:* :==========*=== 100 22 16:* :=======*=== 102 10 13:* :===== * 104 8 10:* :====* 106 10 8:* :===*= 108 6 6:* :==* 110 6 5:* :==* 112 4 4:* :=* 114 3 3:* :=* 116 2 2:* :* 118 1 2:* :* >120 67 1:* :*=================================
Bit scores, E-values • Raw score cannot be compared among searches • Bit score is a function of raw score, scoring matrix and amino acid frequencies in database. • Can be compared among searches • Requires parameters l and K • Expect value (E) depends on length of query (m) and subject (n) • Cannot be compared among searches of different dBs
Caution • Preceding theory does not hold for global alignments
Where do l and K come from? • qi is the frequency of residue i • pij is the frequency of aligned residues i and j with score sij • lis the unique solution to • K is related to the random walk of locally maximal scores for a given scoring matrix and amino acid frequency
How to interpret E • E is the expected # of matches S’ ≥ S’obs • Given the lengths of the two sequences and assuming both are random • Will be approximately Poisson under null hypothesis • E can be much smaller or much greater than 1 • When E << 1, it is approximately equal to the probability that any match would have S’ ≥ S’obs • There may be many matches with E << 1 if there are multiple homologs in the dB
Additional considerations • Edge effects • Length of expected random alignment must be substracted from length of sequence and query • Multiple comparisons • When searching D such sequences for a given E threshold, the expected frequency of matches is E’ ~ DE • BLAST actually reports E’
Gapped versus ungapped alignments • For gapped alignments • l and K can be theoretically computed • For ungapped alignments • l and K must be simulated for each matrix and dB • E values are approximate for gapped alignment • Assumes gaps are expensive/rare • BLAST shows E value for the sum of HSP scores, rather than that of the gapped alignment itself
Calculating critical scores • Two sequences of length 100 • What is S’crit for E ≤ 0.05? • Answer: 13.3 bits
Masking low-complexity regions • Alignment statistics require that symbols occur randomly in strings • Long substrings of one or a few symbols violate this assumption • Sequences are preprocessed to identify such low-complexity regions (LCR) • LCR are masked • Do not contribute to alignment or score • Appear as X’s in BLAST output • Tools include DUST (for DNA) and SEG (protein) • Other types of repeats may also be masked by BLAST
Complexity where L=window length fi is frequency of symbol i, i=1..n Example: GGGG L=4, fG=4,fA=fC=fT=0, 4!=4*3*2*1=24, 0!=1 Complexity=(1/4) * log4(24/4*1*1*1) = 0
Which matches should you care about? • Beware of redundancy • Search the right set of sequences and no more • Choosing a significance threshold requires judgement
Are you searching the right dB? • Protein • nr (Non-redundant, sort of) • Swissprot (Annotated) • Pat (Patented) • PDB (3D structures in Protein Data Bank) • Yeast, E. coli, Drosophila, Human, etc. • Month • Nucleotide • nt (nucleotide version of nr) • EST (Expressed sequence tags) • GSS (Genome survey sequence - low pass) • HTGS (High Throughput Genomic Sequence) • Chromosome (completed genomes, chromosomes) • etc.
BLAST parameters • -W wordsize [Integer] default = 11 nucleotides, 3 proteins • -G Cost to open gap [Integer] default = 5 nucleotides, 11 proteins • -E Cost to extend gap [Integer] default = 2 nucleotides, 1 proteins • Dropoffs for blast extensions • Scoring matrix for proteins • -q Penalty for nucleotide mismatch [Integer] default = -3 • -r reward for nucleotide match [Integer] default = 1 • Masking of low-complexity and repeat sequences • -e expect value [Real] default = 10
Making your own BLAST dB • Start with a MultiFASTA file • Formatdb program from NCBI converts MultiFASTA file into BLAST dB • Words are preprocessed • K and l are calculated for allowed scoring matrices • MultiFASTA file can also be used a query to do many searches against one database • All-by-all search - same file as query and dB
Don’t use BLAST blindly • When BLAST is not appropriate • Perfect matches (faster methods available) • Primer sequences (also too short) • Assembly (more specialized tools exist) • Really distant homology (pairwise alignment is not sufficiently sensitive) • Pay attention to program choice/parameters when • Aligning cDNA against a genome • Aligning two very long sequences • Aligning sequences with many repeats
Cautionary tale: BRCA1 • Initial BLAST search showed match to granins with E = 0.00175 • Granins are typically found in endocrine cell secretory vesicles • Is a cancer protein excreted outside of the cell?! • Now known to be a spurious match
Summary • To attain high speed when searching a dB, BLAST • Sacrifices some sensitivity by only extending HSPs • Preprocesses the database and holds it in memory • Meaningful statistics are key to BLAST’s widespread use • Conversion of • Raw score to bit score: depends on scoring matrix and amino acid frequencies • Bit score to E value: depends on database size • BLAST is easy to use and versatile, which makes it awfully tempting to misuse
Reading assignment • Gibson & Muse, Chapter 2: Genome Sequencing and Annotation, pgs. 63-101