390 likes | 473 Views
How do you gauge significance in sequence comparisons?. G G A GA C TG TAG A CA GCTAATGC TAT A G A A CG C CC TAG C CA CGAGCCCT TAT C Match at 11/26 positions…would expect 6.5/26 if random, equal probabilities
E N D
How do you gauge significance in sequence comparisons? • GGAGACTGTAGACAGCTAATGCTATA • GAACGCCCTAGCCACGAGCCCTTATC • Match at 11/26 positions…would expect 6.5/26 if random, equal probabilities • Probability of 11/26 matches given random, equal probabilities at each position is ~0.04
Understanding sequence variation • Evolution results in complex patterns of similarity among biological sequences. • Sequences change due to mutation, natural selection, and genetic drift. • Understanding sequence alignment strategies (ie. BLAST) requires a thorough understanding of these events
Mutations • Change in DNA sequence • TTA TTG Silent or synonymous (L) • TTA TTT Missense (L F) • Conservative or nonconservative substitution based on structure/function • TTA TAA Nonsense • Frameshift • Consequences of mutations?
Natural Selection • There must be variation within a population • The variation must be heritable • There must be differential reproduction based on variation • Selective advantages (ie. longer necks) lead to increased fitness, more offspring, etc.
The why and what of natural selection • Variation exists at the DNA level: alleles • This variation is inexhaustible (something important to remember when looking at new genome sequences) • These differences are subjected to selection: • Changes in protein structure are typically unfavorable and as a result, selected against • However, some changes in structure/function are selected for: sickle cell anemia/malaria
Genetic drift • Random genetic drift is a stochastic process (by definition). • One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next given that only a fraction of all possible zygotes become mature adults. • Begin with equal frequency of C or T at given position, next generation observe 60/40 in favor of C…greater chance of C making it into the next generation
Neutral Theory of Evolution - Kimura • Third position of a codon or a nucleotide in a non-coding, non-regulatory region are expected to be invisible to natural selection • Compare Fugu with humans..most conserved sequences are the genes • http://www.sciencemag.org/cgi/content/full/297/5585/1301 • Debate over importance of natural selection and neutral evolution in development of species
Sequence Alignment • Sequence alignment is simply the “optimal” assignment of substitution and indel events to a pair of sequences. • Global alignment: align entire sequences • Local alignment: find best matching regions of sequences
Measuring Alignment Quality • Good alignments should have … • “many” exact matches • “few” mismatches • “many” of the mismatches should be similar residues • “few” gaps
Measuring Alignment Quality Begin with... QTRPQNVLNPP ||| STRQNVINPWAAQ Longest Exact Match S = 3a S=alignment score a=match score
Measuring Alignment Quality … allow some mismatches QTRPQNVLNPP ||| || STRQNVINPWAAQ S=alignment score a=match score b=mismatch penalty S = 5a - 1b
Measuring Alignment Quality …and finally, introduce some gaps QTRPQNVLNPP || ||| || STR-QNVINPWAAQ S=alignment score a=match score b=mismatch penalty c=gap penalty S = 7a - 1b -1c
Global Dynamic Programming • Full sequence must be aligned • Gaps at ends are penalized as much as internal ones • F(n,m) is the best score for alignment • Traceback can give >1 correct alignment • Used to examine closely related sequences • http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html
Local alignments - How • Notice the top row and left column are now filled with 0 (if the best alignment has a negative score, it’s better to start a new one) • The alignment can end anywhere in the matrix • Instead of starting at F (n, m), start traceback at highest value of F (i, j); the traceback ends when you hit a 0
Local Alignment – Why? • This alignment strategy identifies islands of similarity amongst a sea of sequences
Overlap recursion is similar to global alignment • The top row is and left column are set to 0 at each position, choose the maximum point along the bottom row or right column, continue traceback until hit a 0
Scoring Issues • Relative costs of matches, mismatches, and gaps should depend on their probabilities (rare events receive higher penalties) • In practice, the appropriate costs are rarely known. • A variety of scoring matrices are available.
Notation for algorithmic complexity • It is useful to know how an algorithm’s performance in CPU time and required memory storage will scale with the size of the problem • In previous dynamic programming problems, we are storing (n +1) x (m +1) numbers and each number costs us three sums and a max • These algorithms work in O(nm) time, with O(nm) memory, since n and m are often comparable, will often times be described as O(n2) – feasible but slow
Heuristic alignment algorithms • Shortcuts are important • Searching a sequence length of 1000 against a database with 108 residues requires approximately 1011 matrix cells. At ten million matrix cells a second, it would take about 3 hours. • BLAST – the heuristic is based on that true match alignments are very likely to contain somewhere within them a sort stretch of identities. Look for short stretches to serve as seeds to extend.
Five BLAST programs • BLASTN (nucleotide to nucleotide) • Use: mapping oligos, CDNA or PCR products to genome, screening repeats, finding vector sequences • BLASTP (protein to protein) • Identifying related proteins and domains among proteins • BLASTX (nucleotide to protein) • Finding protein coding genes in genome DNA or cDNA • TBLASTN (protein to nucleotide) • Identifying transcripts similar to a given protein, mapping a protein to genomic DNA • TBLASTX (nucleotide to nucleotide via protein intermediate) • Cross species gene prediction at genome or transcript level, searching for missing genes
Seeding • BLAST takes your query and breaks it down into words of fixed length (3 for protein, 11 for nucleotide) • It scans through a database looking for a word from the query set with some minimum score T, when it finds it, it begins a “hit” extension to extend the possible match in both directions, stopping at the maximum scoring extension.
Extension • The seeds are extended to locally optimal pairs, whose scores cannot be improved by extension or trimming. • These locally optimal alignments are called high scoring segment pairs or HSP’s • Sometimes you return only a portion of a sequence – this is the reason you need to look carefully at your BLAST alignments
Alignment example • The quick brown fox jumps over the lazy dog. • The quiet brown cat purrs when she sees him. • Matches = +1; Mismatches = -1; ignore spaces and do not allow gaps. • Assume the seed is the capital T, extend the alignment • You’ll hit a mismatch c/e should you continue and how far? • Generate a variable X to measure how far the score drops off. • Set X = 5 and try the alignment… • Set X = 2 and try again … • A large X value will increase the speed, however, speed is often modulated by word size and other parameters…
Gapped BLAST – a time saver • Extension is costly, now have a two hit (gapped) BLAST where you require two hits within a distance (A) • A gapped extension takes much longer to execute than ungapped, but overall run fewer extensions – time saver • Gapped BLAST requires two non-overlapping hits of at least score (T) within distance A of one another before ungapped extension of second hit • T is adjustable, higher the T then the smaller the search space
Evaluation • Once seeds are extended to generate alignments, these alignments are tested for statistical significance. • Can establish thresholds for reporting
BLASTN variations • BLASTN seeds are always identical words; T is never used • To make BLASTN faster, increase word size, to make it more sensitive decrease word size • MegaBLAST increases word size to 28 • The minimum word size is 7 • http://monod.uwaterloo.ca/papers/02ph.pdf
BLASTP implementation • To make searches faster, set word size to 3 and T to a large value (999), which removes all potential neighborhood words (two-hit distance is 40 amino acids by default) • Affine gaps • Decreased penalty for gap extension relative to gap introduction
Also, FASTA • Similar to Gapped BLAST – except bigger neighborhood • Generates a lookup table to locate all identically matching words of length ktup protein 1-2, DNA 4-6 • Once identified, looks for diagonals with many mutually supporting word matches • Extensions similar to BLAST
Scoring Matrices • Scoring matrix specifies a score, sij, for aligning sequence Iwith sequence II. • Choice of matrix depends on the divergence level of desired/expected hits. • Examples: PAM, BLOSUM • Both can be modified for different divergence levels (eg, BLOSUM40, BLOSUM62) • Advice: try several matrices when possible.
Dayhoff Family of Matrices • Dayhoff model measures sequence evolution in units of “PAMs” • One PAM unit represents the evolutionary distance in which 1% of the amino acids have changed. • Mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change) • Mutability of alanine was defined to be 100.
Dayhoff Family of Matrices • Problems with the original Dayhoff scheme • It does not consider the genetic code. • Not all amino acid substitutions can occur by a single nucleotide substitution event. • Parameters were estimated from a small sample of closely related proteins. • Evolution at the “average site” of the “average protein” is being modeled.
BLOSUM Scoring Matrices Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies ofsubstitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff)
PSI-BLASTPosition Specific Iterated BLAST 1. BLAST with query 2. Keep hits w/ E < E* (adjustable constant) 3. Multiple alignment of HSPs from step (2) 4. Build profile 5. BLAST with profile 6. Iterate (1)-(5) until no new hits are found
PSI-BLASTPosition Specific Iterated BLAST Use with great caution!!! Once an unrelated sequence is mistakenly incorporated into the profile, subsequent iterations will incorporate homologues of the unrelated sequence (catastrophic transitivity). Human intervention is essential.