CSE182-L4: Database filtering

CSE182-L4: Database filtering CSE 182

Summary (through lecture 3) • A2 is online • We considered the basics of sequence alignment • Opt score computation • Reconstructing alignments • Local alignments • Affine gap costs • Space saving measures • The local alignment algorithm (for biological strings) was given by Temple Smith, and Mike Waterman, and is called the Smith-Waterman algorithm (SW) • Can we recreate Blast? CSE 182

The birthday question • True or False: 2 people in our class share a birthday? • How large should the class be that you can answer this reliably CSE 182

Basic Probability • Q: Toss a coin many times: If it is HEADS, you get a dollar. Otherwise you lose a dollar. • What is the money you expect to get (lose) after 100 tosses? CSE 182

Basic Probability:Expectation • Q: Toss a coin many times: If it is HEADS, and the previous time was HEADS too, you get a dollar. Otherwise you lose a dollar. • What is the money you expect to get (lose) after 100 tosses? CSE 182

Blast and local alignment • Concatenate all of the database sequences to form one giant sequence. • Do local alignment computation with the query. CSE 182

Database (n) Query (m) Large database search Database size n=100M, Querysize m=1000. O(nm) = 1011 computations CSE 182

Why is Blast Fast? CSE 182

Observations • For a typical query, there are only a few ‘real hits’ in the database. • Much of the database is random from the query’s perspective • Consider a random DNA string of length n. • Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25 • Assume for the moment that the query is all A’s (length m). • What is the probability that an exact match to the query can be found? CSE 182

Basic probability • Probability that there is a match starting at a fixed position i = 0.25m • What is the probability that some position i has a match. • Dependencies confound probability estimates. CSE 182

Total money you expect to earn Basic Probability:Expectation • Q: Toss a coin many times: If it is HEADS, and the previous time was HEADS too, you get a dollar. Otherwise you lose a dollar. • What is the money you expect to get after n tosses? • Let Xi be the amount earned in the i-th toss CSE 182

Expected number of matches • Expected number of matches can still be computed. i • Let Xi=1 if there is a match starting at position i, Xi=0 otherwise • Expected number of matches = CSE 182

Expected number of exact Matches is small! • Expected number of matches = n*0.25m • If n=107, m=10, • Then, expected number of matches = 9.537 • If n=107, m=11 • expected number of hits = 2.38 • n=107,m=12, • Expected number of hits = 0.5 < 1 • Bottom Line: An exact match to a substring of the query is unlikely just by chance. CSE 182

Suppose we are looking for a database string with greater than 90% identity to the query (length 100) • Partition the query into size 10 substrings. At least one much match the database string exactly Observation 2 • What is the pigeonhole principle? CSE 182

Why is this important? • Suppose we are looking for sequences that are 80% identical to the query sequence of length 100. • Assume that the mismatches are randomly distributed. • What is the probability that there is no stretch of 10 bp, where the query and the subject match exactly? • Rough calculations show that it is very low. Exact match of a short query substring to a truly similar subject is very high. • The above equation does not take dependencies into account • Reality is better because the matches are not randomly distributed CSE 182

Just the Facts • Consider the set of all substrings of the query string of fixed length W. • Prob. of exact match to a random database string is very low. • Prob. of exact match to a true homolog is very high. • Keyword Search (exact matches) is MUCH faster than sequence alignment CSE 182

BLAST Database (n) • Consider all (m-W) query words of size W (Default = 11) • Scan the database for exact match to all such words • For all regions that hit, extend using a dynamic programming alignment. • Can be many orders of magnitude faster than SW over the entire string CSE 182

Why is BLAST fast? • Assume that keyword searching does not consume any time and that alignment computation the expensive step. • Query m=1000, random Db n=107,no TP • SW = O(nm) = 1000*107 = 1010 computations • BLAST, W=11 • E(#11-mer hits)= 1000* (1/4)11 * 107=2384 • Number of computations = 2384*100*100=2.384*107 • Ratio=1010/(2.384*107)=420 • Further speed improvements are possible CSE 182

Keyword Matching • How fast can we match keywords? • Hash table/Db index? What is the size of the hash table, for m=11 • Suffix trees? What is the size of the suffix trees? • Trie based search. We will do this in class. AATCA 567 CSE 182

Silly Quiz • Name a famous Bioinformatics Researcher • Name a famous Bioinformatics Researcher who is a woman CSE 182

Scoring DNA • DNA has structure. CSE 182

DNA scoring matrices • So far, we considered a simple match/mismatch criterion. • The nucleotides can be grouped into Purines (A,G) and Pyrimidines. • Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions) CSE 182

Scoring proteins • Scoring protein sequence alignments is a much more complex task than scoring DNA • Not all substitutions are equal • Problem was first worked on by Pauling and collaborators • In the 1970s, Margaret Dayhoff created the first similarity matrices. • “One size does not fit all” • Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant • Different proteins might evolve at different rates and we need to normalize for that CSE 182

PAM 1 distance • Two sequences are 1 PAM apart if they differ in 1 % of the residues. 1% mismatch • PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart] CSE 182

PAM1 matrix • Align many proteins that are very similar • Is this a problem? • 1 PAM evolutionary distance represents the time in which 1% of the residues have changed • Estimate the frequency Pb|a of residue a being substituted by residue b. • PAM1(a,b) = Pa|b = Pr(b will mutate to an a after 1 PAM evolutionary distance) • Scoring matrix • S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb) CSE 182

PAM 1 • Top column shows original, and left column shows replacement residue = PAM1(a,b) = Pr(a|b) CSE 182

1 PAM 1 PAM PAM distance • Two sequences are 1 PAM apart when they differ in 1% of the residues. • When are 2 sequences 2 PAMs apart? 2 PAM CSE 182

Higher PAMs • PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b) • PAM2 = PAM1 * PAM1 (Matrix multiplication) • PAM250 • = PAM1*PAM249 • = PAM1250 CSE 182

Note: This is not the score matrix: What happens as you keep increasing the power? CSE 182

END of L4 CSE 182

CSE182-L4: Database filtering

CSE182-L4: Database filtering

Presentation Transcript

Cross-Selling with Collaborative Filtering

An Introduction to Collaborative Filtering

Filtering Geophysical Data: Be careful!

Internet Filtering

Database

Using Excel as a database tool

Image Filtering

Collaborative Filtering

Fingerprint Enhancement by Directional Filtering

CSE182-L11

Image Processing: Linear Filtering

CSE182-L6

Enterprise Internet Filtering

CSE182-L10

CSE182-L6

LOFAR RFI Mitigation spatial filtering at station level

CSE182-L6

Event Filtering

Data Filtering and Software

Value filtering

3.7 Adaptive filtering