1 / 37

Alignment to a database of sequences

Alignment to a database of sequences. Biology 162 Computational Genetics Todd Vision 7 Sep 2004. Preview. How to speed up pairwise alignment How to statistically evaluate alignments and database matches How to use BLAST What database should you search?. Need for fast pairwise alignment.

thad
Download Presentation

Alignment to a database of sequences

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Alignment to a database of sequences Biology 162 Computational Genetics Todd Vision 7 Sep 2004

  2. Preview • How to speed up pairwise alignment • How to statistically evaluate alignments and database matches • How to use BLAST • What database should you search?

  3. Need for fast pairwise alignment • Even O(mn) is too slow or memory intensive for some applications • Shotgun assembly (Phrap) • Database search • Alignment of long sequences

  4. History • FASTA (1983) • First practical tool for fast pairwise alignment • Restricts alignment to path-graph “hot-spots” • BLAST (1990) (and WU-BLAST) • Basic local alignment search tool • Refinement of ideas for locating “hot-spots” • Searching dB for “hot-spots” in sublinear time • Probability of alignment scores • Gapped BLAST and FASTA3 (1998, 2000) • The two programs have largely converged • Can now both produce gapped alignments with accurate statistics

  5. Why is BLAST so fast? • Exclusion of database sequences unlikely to contain good alignments • Preprocessing of database to enable fast matching algorithms • Filling out only a small part of the path graph

  6. How BLAST works • Segment pair - a pair of equal length substrings in query and dB • Locally maximal segment pair (LMSP) - ungapped score would decrease by extending or shortening either end • High scoring pair (HSP) - the LMSP of highest score between the query and a single DB sequence • BLAST heuristically finds a gapped alignment for one or more HSPs that has a score above some threshold

  7. How BLAST works • Query broken into short overlapping words (default 11nt or 3aa) • Up to 50 high scoring word neighbors (with minimum score T) determined for each word in query LRE LRO LRQ LRT LRG LRQNMLAC LRT RQN

  8. How BLAST works • Preprocessed database searched for exact matches to all high scoring words • Generation of words and search for matches are both linear in length of query • A memory intensive data structure allows a fast algorithm • We will see how this can be achieved with suffix trees later

  9. How BLAST works • Segment pairs on the same diagonal and within A cells of each other are extended in both directions • Extension proceeds until the dropoff from the highest score is too great • If the score is sufficiently high, and no other overlapping LMSP is higher, the highest scoring extension is considered to be an HSP • Extension is the time consuming step of BLAST • Requiring two close segment pairs on the same diagonal means that most dB sequences can be discarded before the extension step • HSPs are joined by a gapped alignment

  10. Alignment probabilities ELVIS ||| | ELVYS • Each HSP has an associated score (Ssij) • We know the scores for every possible amino acid pair i, j • We can compute the frequency of each i, j in a random alignment • Why not simply compute the probability of obtaining a score at least as good as S?

  11. BLAST statistics • Scores are inflated over a naive probability model for two reasons • We have optimized the alignment • We are taking the MSP from that alignment and not a random pair of aligned segments • We are only reporting the highest scoring alignments after searching a large database • The expected number of matches that meet or surpass our observed score is given by the extreme value (or Gumbel) distribution

  12. opt E() < 20 173 0:== 22 0 0: one = represents 122 library sequences 24 2 0:= 26 1 1:* 28 8 16:* 30 37 95:* 32 170 367:== * 34 714 996:====== * 36 1809 2045:=============== * 38 3729 3379:===========================*=== 40 5555 4713:======================================*======= 42 6991 5761:===============================================*========== 44 7264 6355:====================================================*======= 46 6786 6473:=====================================================*== 48 6059 6197:==================================================* 50 4920 5655:========================================= * 52 4204 4972:=================================== * 54 3410 4247:============================ * 56 2957 3547:========================= * 58 2469 2912:===================== * 60 2280 2359:===================* 62 1872 1891:===============* 64 1450 1504:============* 66 1176 1189:=========* 68 1010 935:=======*= 70 817 733:======* 72 668 573:====*= 74 501 446:===*= 76 431 347:==*= 78 342 270:==* 80 315 210:=*= 82 214 160:=* 84 161 127:=* 86 116 98:* 88 116 76:* inset = represents 2 library sequences 90 85 59:* 92 47 46:* :======================*= 94 55 35:* :=================*========== 96 32 27:* :=============*== 98 28 21:* :==========*=== 100 22 16:* :=======*=== 102 10 13:* :===== * 104 8 10:* :====* 106 10 8:* :===*= 108 6 6:* :==* 110 6 5:* :==* 112 4 4:* :=* 114 3 3:* :=* 116 2 2:* :* 118 1 2:* :* >120 67 1:* :*=================================

  13. Bit scores, E-values • Raw score cannot be compared among searches • Bit score is a function of raw score, scoring matrix and amino acid frequencies in database. • Can be compared among searches • Requires parameters l and K • Expect value (E) depends on length of query (m) and subject (n) • Cannot be compared among searches of different dBs

  14. Caution • Preceding theory does not hold for global alignments

  15. Where do l and K come from? • qi is the frequency of residue i • pij is the frequency of aligned residues i and j with score sij • lis the unique solution to • K is related to the random walk of locally maximal scores for a given scoring matrix and amino acid frequency

  16. How to interpret E • E is the expected # of matches S’ ≥ S’obs • Given the lengths of the two sequences and assuming both are random • Will be approximately Poisson under null hypothesis • E can be much smaller or much greater than 1 • When E << 1, it is approximately equal to the probability that any match would have S’ ≥ S’obs • There may be many matches with E << 1 if there are multiple homologs in the dB

  17. Additional considerations • Edge effects • Length of expected random alignment must be substracted from length of sequence and query • Multiple comparisons • When searching D such sequences for a given E threshold, the expected frequency of matches is E’ ~ DE • BLAST actually reports E’

  18. Gapped versus ungapped alignments • For gapped alignments • l and K can be theoretically computed • For ungapped alignments • l and K must be simulated for each matrix and dB • E values are approximate for gapped alignment • Assumes gaps are expensive/rare • BLAST shows E value for the sum of HSP scores, rather than that of the gapped alignment itself

  19. Calculating critical scores • Two sequences of length 100 • What is S’crit for E ≤ 0.05? • Answer: 13.3 bits

  20. Masking low-complexity regions • Alignment statistics require that symbols occur randomly in strings • Long substrings of one or a few symbols violate this assumption • Sequences are preprocessed to identify such low-complexity regions (LCR) • LCR are masked • Do not contribute to alignment or score • Appear as X’s in BLAST output • Tools include DUST (for DNA) and SEG (protein) • Other types of repeats may also be masked by BLAST

  21. Complexity where L=window length fi is frequency of symbol i, i=1..n Example: GGGG L=4, fG=4,fA=fC=fT=0, 4!=4*3*2*1=24, 0!=1 Complexity=(1/4) * log4(24/4*1*1*1) = 0

  22. Which matches should you care about? • Beware of redundancy • Search the right set of sequences and no more • Choosing a significance threshold requires judgement

  23. Are you searching the right dB? • Protein • nr (Non-redundant, sort of) • Swissprot (Annotated) • Pat (Patented) • PDB (3D structures in Protein Data Bank) • Yeast, E. coli, Drosophila, Human, etc. • Month • Nucleotide • nt (nucleotide version of nr) • EST (Expressed sequence tags) • GSS (Genome survey sequence - low pass) • HTGS (High Throughput Genomic Sequence) • Chromosome (completed genomes, chromosomes) • etc.

  24. BLAST parameters • -W wordsize [Integer] default = 11 nucleotides, 3 proteins • -G Cost to open gap [Integer] default = 5 nucleotides, 11 proteins • -E Cost to extend gap [Integer] default = 2 nucleotides, 1 proteins • Dropoffs for blast extensions • Scoring matrix for proteins • -q Penalty for nucleotide mismatch [Integer] default = -3 • -r reward for nucleotide match [Integer] default = 1 • Masking of low-complexity and repeat sequences • -e expect value [Real] default = 10

  25. Making your own BLAST dB • Start with a MultiFASTA file • Formatdb program from NCBI converts MultiFASTA file into BLAST dB • Words are preprocessed • K and l are calculated for allowed scoring matrices • MultiFASTA file can also be used a query to do many searches against one database • All-by-all search - same file as query and dB

  26. Don’t use BLAST blindly • When BLAST is not appropriate • Perfect matches (faster methods available) • Primer sequences (also too short) • Assembly (more specialized tools exist) • Really distant homology (pairwise alignment is not sufficiently sensitive) • Pay attention to program choice/parameters when • Aligning cDNA against a genome • Aligning two very long sequences • Aligning sequences with many repeats

  27. Cautionary tale: BRCA1 • Initial BLAST search showed match to granins with E = 0.00175 • Granins are typically found in endocrine cell secretory vesicles • Is a cancer protein excreted outside of the cell?! • Now known to be a spurious match

  28. Summary • To attain high speed when searching a dB, BLAST • Sacrifices some sensitivity by only extending HSPs • Preprocesses the database and holds it in memory • Meaningful statistics are key to BLAST’s widespread use • Conversion of • Raw score to bit score: depends on scoring matrix and amino acid frequencies • Bit score to E value: depends on database size • BLAST is easy to use and versatile, which makes it awfully tempting to misuse

  29. Reading assignment • Gibson & Muse, Chapter 2: Genome Sequencing and Annotation, pgs. 63-101

More Related