Alignment to a database of sequences

Alignment to a database of sequences Biology 162 Computational Genetics Todd Vision 7 Sep 2004

Preview • How to speed up pairwise alignment • How to statistically evaluate alignments and database matches • How to use BLAST • What database should you search?

Need for fast pairwise alignment • Even O(mn) is too slow or memory intensive for some applications • Shotgun assembly (Phrap) • Database search • Alignment of long sequences

History • FASTA (1983) • First practical tool for fast pairwise alignment • Restricts alignment to path-graph “hot-spots” • BLAST (1990) (and WU-BLAST) • Basic local alignment search tool • Refinement of ideas for locating “hot-spots” • Searching dB for “hot-spots” in sublinear time • Probability of alignment scores • Gapped BLAST and FASTA3 (1998, 2000) • The two programs have largely converged • Can now both produce gapped alignments with accurate statistics

Why is BLAST so fast? • Exclusion of database sequences unlikely to contain good alignments • Preprocessing of database to enable fast matching algorithms • Filling out only a small part of the path graph

How BLAST works • Segment pair - a pair of equal length substrings in query and dB • Locally maximal segment pair (LMSP) - ungapped score would decrease by extending or shortening either end • High scoring pair (HSP) - the LMSP of highest score between the query and a single DB sequence • BLAST heuristically finds a gapped alignment for one or more HSPs that has a score above some threshold

How BLAST works • Query broken into short overlapping words (default 11nt or 3aa) • Up to 50 high scoring word neighbors (with minimum score T) determined for each word in query LRE LRO LRQ LRT LRG LRQNMLAC LRT RQN

How BLAST works • Preprocessed database searched for exact matches to all high scoring words • Generation of words and search for matches are both linear in length of query • A memory intensive data structure allows a fast algorithm • We will see how this can be achieved with suffix trees later

How BLAST works • Segment pairs on the same diagonal and within A cells of each other are extended in both directions • Extension proceeds until the dropoff from the highest score is too great • If the score is sufficiently high, and no other overlapping LMSP is higher, the highest scoring extension is considered to be an HSP • Extension is the time consuming step of BLAST • Requiring two close segment pairs on the same diagonal means that most dB sequences can be discarded before the extension step • HSPs are joined by a gapped alignment

Alignment probabilities ELVIS ||| | ELVYS • Each HSP has an associated score (Ssij) • We know the scores for every possible amino acid pair i, j • We can compute the frequency of each i, j in a random alignment • Why not simply compute the probability of obtaining a score at least as good as S?

BLAST statistics • Scores are inflated over a naive probability model for two reasons • We have optimized the alignment • We are taking the MSP from that alignment and not a random pair of aligned segments • We are only reporting the highest scoring alignments after searching a large database • The expected number of matches that meet or surpass our observed score is given by the extreme value (or Gumbel) distribution

opt E() < 20 173 0:== 22 0 0: one = represents 122 library sequences 24 2 0:= 26 1 1:* 28 8 16:* 30 37 95:* 32 170 367:== * 34 714 996:====== * 36 1809 2045:=============== * 38 3729 3379:===========================*=== 40 5555 4713:======================================*======= 42 6991 5761:===============================================*========== 44 7264 6355:====================================================*======= 46 6786 6473:=====================================================*== 48 6059 6197:==================================================* 50 4920 5655:========================================= * 52 4204 4972:=================================== * 54 3410 4247:============================ * 56 2957 3547:========================= * 58 2469 2912:===================== * 60 2280 2359:===================* 62 1872 1891:===============* 64 1450 1504:============* 66 1176 1189:=========* 68 1010 935:=======*= 70 817 733:======* 72 668 573:====*= 74 501 446:===*= 76 431 347:==*= 78 342 270:==* 80 315 210:=*= 82 214 160:=* 84 161 127:=* 86 116 98:* 88 116 76:* inset = represents 2 library sequences 90 85 59:* 92 47 46:* :======================*= 94 55 35:* :=================*========== 96 32 27:* :=============*== 98 28 21:* :==========*=== 100 22 16:* :=======*=== 102 10 13:* :===== * 104 8 10:* :====* 106 10 8:* :===*= 108 6 6:* :==* 110 6 5:* :==* 112 4 4:* :=* 114 3 3:* :=* 116 2 2:* :* 118 1 2:* :* >120 67 1:* :*=================================

Bit scores, E-values • Raw score cannot be compared among searches • Bit score is a function of raw score, scoring matrix and amino acid frequencies in database. • Can be compared among searches • Requires parameters l and K • Expect value (E) depends on length of query (m) and subject (n) • Cannot be compared among searches of different dBs

Caution • Preceding theory does not hold for global alignments

Where do l and K come from? • qi is the frequency of residue i • pij is the frequency of aligned residues i and j with score sij • lis the unique solution to • K is related to the random walk of locally maximal scores for a given scoring matrix and amino acid frequency

How to interpret E • E is the expected # of matches S’ ≥ S’obs • Given the lengths of the two sequences and assuming both are random • Will be approximately Poisson under null hypothesis • E can be much smaller or much greater than 1 • When E << 1, it is approximately equal to the probability that any match would have S’ ≥ S’obs • There may be many matches with E << 1 if there are multiple homologs in the dB

Additional considerations • Edge effects • Length of expected random alignment must be substracted from length of sequence and query • Multiple comparisons • When searching D such sequences for a given E threshold, the expected frequency of matches is E’ ~ DE • BLAST actually reports E’

Gapped versus ungapped alignments • For gapped alignments • l and K can be theoretically computed • For ungapped alignments • l and K must be simulated for each matrix and dB • E values are approximate for gapped alignment • Assumes gaps are expensive/rare • BLAST shows E value for the sum of HSP scores, rather than that of the gapped alignment itself

Calculating critical scores • Two sequences of length 100 • What is S’crit for E ≤ 0.05? • Answer: 13.3 bits

Masking low-complexity regions • Alignment statistics require that symbols occur randomly in strings • Long substrings of one or a few symbols violate this assumption • Sequences are preprocessed to identify such low-complexity regions (LCR) • LCR are masked • Do not contribute to alignment or score • Appear as X’s in BLAST output • Tools include DUST (for DNA) and SEG (protein) • Other types of repeats may also be masked by BLAST

Complexity where L=window length fi is frequency of symbol i, i=1..n Example: GGGG L=4, fG=4,fA=fC=fT=0, 4!=4*3*2*1=24, 0!=1 Complexity=(1/4) * log4(24/4*1*1*1) = 0

Which matches should you care about? • Beware of redundancy • Search the right set of sequences and no more • Choosing a significance threshold requires judgement

Are you searching the right dB? • Protein • nr (Non-redundant, sort of) • Swissprot (Annotated) • Pat (Patented) • PDB (3D structures in Protein Data Bank) • Yeast, E. coli, Drosophila, Human, etc. • Month • Nucleotide • nt (nucleotide version of nr) • EST (Expressed sequence tags) • GSS (Genome survey sequence - low pass) • HTGS (High Throughput Genomic Sequence) • Chromosome (completed genomes, chromosomes) • etc.

BLAST parameters • -W wordsize [Integer] default = 11 nucleotides, 3 proteins • -G Cost to open gap [Integer] default = 5 nucleotides, 11 proteins • -E Cost to extend gap [Integer] default = 2 nucleotides, 1 proteins • Dropoffs for blast extensions • Scoring matrix for proteins • -q Penalty for nucleotide mismatch [Integer] default = -3 • -r reward for nucleotide match [Integer] default = 1 • Masking of low-complexity and repeat sequences • -e expect value [Real] default = 10

Making your own BLAST dB • Start with a MultiFASTA file • Formatdb program from NCBI converts MultiFASTA file into BLAST dB • Words are preprocessed • K and l are calculated for allowed scoring matrices • MultiFASTA file can also be used a query to do many searches against one database • All-by-all search - same file as query and dB

Don’t use BLAST blindly • When BLAST is not appropriate • Perfect matches (faster methods available) • Primer sequences (also too short) • Assembly (more specialized tools exist) • Really distant homology (pairwise alignment is not sufficiently sensitive) • Pay attention to program choice/parameters when • Aligning cDNA against a genome • Aligning two very long sequences • Aligning sequences with many repeats

Cautionary tale: BRCA1 • Initial BLAST search showed match to granins with E = 0.00175 • Granins are typically found in endocrine cell secretory vesicles • Is a cancer protein excreted outside of the cell?! • Now known to be a spurious match

Summary • To attain high speed when searching a dB, BLAST • Sacrifices some sensitivity by only extending HSPs • Preprocesses the database and holds it in memory • Meaningful statistics are key to BLAST’s widespread use • Conversion of • Raw score to bit score: depends on scoring matrix and amino acid frequencies • Bit score to E value: depends on database size • BLAST is easy to use and versatile, which makes it awfully tempting to misuse

Reading assignment • Gibson & Muse, Chapter 2: Genome Sequencing and Annotation, pgs. 63-101

Alignment to a database of sequences