BLAST: Algorithms in Bioinformatics

Bioinformatics Algorithms and Data Structures BLAST Lecturer: Dr. Rose BLAST Slides: Adaptation of Nir Friedman’s slides from the Computational Methods in Molecular Biology course (Spring 2001) at Hebrew University, Jerusalem, Israel February 21, 2007

BLAST Q: What is BLAST? A: BLAST is an acronym: Basic Local Alignment Search Tool, a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. You can find it at: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST • Q: Why do you care? • A: Because you are going to do a project. • U51112: Membrane protein that transports sodium and hydrogen • J03581: Tyrosinase; people lacking this are albino • NM_000245: MET, an oncogene; mutations in this cause cancer • NM_010849: MYC, another oncogene • NM_007409: Alcohol dehydrogenase; good to have when drinking • NM_002475: Myosin, one of the muscle proteins • XM_086788: Crystallin, the major protein in the lens • M30047: Myelin basic protein; protects the neurons • NM_000518: Hemoglobin, oxygen-carrying protein in RBC • NM_000477: Albumin, major serum protein; does a lot of things • NM_008476: Keratin, skin and integument protein

BLAST • BLAST is designed to efficiently find alignments of a target string s against large databases • Motivation: gain speed by finding fewer, better hot spots. • Idea: Find high-scoring matches using a substitution matrix rather than exact matches. • We are still searching only for gapless matches.

High-Scoring Pair • Two strings s and t are a high-scoring pair (HSP) if d(s,t) > T • Given a query s[1..n], BLAST constructs all words (fixed-length substrings) w such that w scores > T with a k-substring of s • Each match to such a word in the database is called a hit • Typical k: 12 for nucleotides, 3-5 for amino acids.
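
The word-construction step above can be sketched as follows. This is a minimal illustration, not BLAST's actual implementation: the DNA alphabet, the +2/-1 scoring function, and the names `score`, `word_score` and `seed_words` are assumptions made for the example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: a small DNA alphabet keeps the enumeration tiny


def score(a: str, b: str) -> int:
    """Toy substitution score (hypothetical): +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def word_score(u: str, v: str) -> int:
    """Gapless score of two equal-length words: sum of per-position scores."""
    return sum(score(a, b) for a, b in zip(u, v))


def seed_words(query: str, k: int, threshold: int) -> dict[str, list[int]]:
    """Map each word w to the query offsets whose k-substring scores > threshold with w."""
    seeds: dict[str, list[int]] = {}
    for i in range(len(query) - k + 1):
        sub = query[i:i + k]
        # Enumerate every possible word of length k and keep the high-scoring ones.
        for letters in product(ALPHABET, repeat=k):
            w = "".join(letters)
            if word_score(w, sub) > threshold:
                seeds.setdefault(w, []).append(i)
    return seeds


# With threshold 2, a word qualifies if it differs from a query 3-mer in at most one position.
seeds = seed_words("ACGTA", k=3, threshold=2)
```

Enumerating all |alphabet|^k words is only practical for small k, which is one reason k is kept short for amino acids.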

High-Scoring Pair • Try to extend each such hit to an alignment with maximal score (still with no gaps). Keep all HSPs. • The threshold is chosen so that a random match with such a score is unlikely.
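
A gapless extension of a hit might look like the sketch below. The +2/-1 scoring and the function name `extend_hit` are assumptions for illustration. The sketch scans to the end of both strings and keeps the best prefix on each side; real implementations typically stop early once the running score falls too far below the best score seen.

```python
def score(a: str, b: str) -> int:
    """Toy substitution score (hypothetical): +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def extend_hit(s: str, t: str, i: int, j: int, k: int) -> tuple[int, int, int, int]:
    """Extend the seed s[i:i+k] / t[j:j+k] along its diagonal without gaps.

    Returns (score, start in s, start in t, length) of the best segment
    that contains the seed.
    """
    seed_score = sum(score(s[i + d], t[j + d]) for d in range(k))

    # Extend to the right, remembering the best cumulative gain.
    best_right, right_len, run = 0, 0, 0
    d = k
    while i + d < len(s) and j + d < len(t):
        run += score(s[i + d], t[j + d])
        if run > best_right:
            best_right, right_len = run, d - k + 1
        d += 1

    # Extend to the left in the same way.
    best_left, left_len, run = 0, 0, 0
    d = 1
    while i - d >= 0 and j - d >= 0:
        run += score(s[i - d], t[j - d])
        if run > best_left:
            best_left, left_len = run, d
        d += 1

    total = seed_score + best_left + best_right
    return total, i - left_len, j - left_len, left_len + k + right_len


# The seed "CGT" at offset 3 in both strings grows by one matching letter on each side.
result = extend_hit("CCACGTAGG", "TTACGTATT", i=3, j=3, k=3)
```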

Finding Potential Matches We can locate seed words in a large database in a single pass • Construct an FSA (finite-state automaton) that recognizes seed words • Use hashing techniques to locate matching words
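
The hashing variant can be sketched with a Python dictionary standing in for the hash table. The database layout (a list of strings) and the names `find_hits` and `seed_table` are hypothetical; the point is only that each database position is visited once and looked up in constant expected time.

```python
def find_hits(database: list[str], seed_table: dict[str, list[int]], k: int) -> list[tuple[int, int, int]]:
    """Scan the database once and report (sequence id, query offset, database offset) hits.

    seed_table maps each seed word to the query offsets it was derived from.
    """
    hits = []
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            # One hash lookup per database k-mer.
            for query_offset in seed_table.get(seq[pos:pos + k], ()):
                hits.append((seq_id, query_offset, pos))
    return hits


seed_table = {"ACG": [0], "CGT": [1]}  # toy seed words with their query offsets
hits = find_hits(["TTACGTT", "GGGG"], seed_table, k=3)
```

Both hits in this toy run have the same difference between database offset and query offset, so they lie on the same diagonal.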

Extending Potential Matches • Once a seed is found, BLAST attempts to find a local alignment that extends the seed • Seeds on the same diagonal are combined (as in FASTA)

Which programs are used? • Originally BLAST did not allow gaps. • Now people use gapped BLAST. • Gapped BLAST joins different diagonals. • For proteins BLAST is superior. • For nucleotides FASTA is better.

Review: Unrelated Sequences • Our model of unrelated sequences is simple • Each position is sampled independently from a distribution over the alphabet $\Sigma$ • We assume there is a distribution $q(\cdot)$ that describes the probability of letters in such positions • Then: $P(s,t \mid R) = \prod_i q(s[i]) \prod_i q(t[i])$ • R denotes the assumption that s and t are random unrelated strings

Review: Related Sequences • We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor • Let p(a,b) be a distribution over pairs of letters. • p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters • Then: $P(s,t \mid M) = \prod_i p(s[i],t[i])$ • Here M denotes the assumption that s and t are related strings.

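Review: Ratio Test for Alignment • We compare the two hypotheses through their likelihood ratio: $\frac{P(s,t \mid M)}{P(s,t \mid R)} = \prod_i \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$ • Taking logarithm of both sides, we get $\log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \sum_i \log \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$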

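Review: Probabilistic Interpretation of Scoring Rule • If we take $\sigma(a,b) = \log \frac{p(a,b)}{q(a)\,q(b)}$ • then the score of an alignment is the log-ratio between the two models: $\mathrm{Score}(s,t) = \sum_i \sigma(s[i],t[i]) = \log \frac{P(s,t \mid M)}{P(s,t \mid R)}$ • Score > 0: M is more “probable” • Score < 0: R is more “probable”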
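
A small numeric check of this interpretation, under assumed distributions: a uniform background q over DNA letters and a pair distribution p that puts 0.2 on each matching pair and spreads the remaining 0.2 over the 12 mismatching pairs. Both distributions and all function names are hypothetical, chosen only so that the probabilities sum to 1.

```python
import math

ALPHABET = "ACGT"
q = {a: 0.25 for a in ALPHABET}  # assumed background distribution


def p(a: str, b: str) -> float:
    """Assumed pair distribution for related positions; sums to 1 over all 16 pairs."""
    return 0.2 if a == b else 0.2 / 12


def sigma(a: str, b: str) -> float:
    """Per-position log-odds score log(p(a,b) / (q(a) q(b)))."""
    return math.log(p(a, b) / (q[a] * q[b]))


def score(s: str, t: str) -> float:
    """Alignment score: sum of per-position log-odds."""
    return sum(sigma(a, b) for a, b in zip(s, t))


def log_ratio(s: str, t: str) -> float:
    """log P(s,t | M) / P(s,t | R), computed directly from the two models."""
    p_related = math.prod(p(a, b) for a, b in zip(s, t))
    p_random = math.prod(q[a] for a in s) * math.prod(q[b] for b in t)
    return math.log(p_related / p_random)


same = score("ACGT", "ACGT")       # positive: the related model is favoured
different = score("ACGT", "TGCA")  # negative: the random model is favoured
```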

Problems with Scoring Rule When searching for an optimal alignment in a big database, a number of problems arise with this simple scheme. • We are assuming P(M)=P(R), that is, that the database contains equal numbers of related and unrelated sequences. • When searching through a big database, there is a high probability that some unrelated sequence will receive a high score. • When searching for an optimal local alignment, we have many possible starting points, which heavily biases the score towards declaring the sequence related.

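Prior Probability on the models • What we really wish to calculate is: $P(M \mid s,t) = \frac{P(s,t \mid M)\,P(M)}{P(s,t)}$ • The log score being: $\log \frac{P(M \mid s,t)}{P(R \mid s,t)} = \log \frac{P(s,t \mid M)}{P(s,t \mid R)} + \log \frac{P(M)}{P(R)}$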

Prior Probability on the models • Our threshold should be: $\log \frac{P(s,t \mid M)}{P(s,t \mid R)} > \log \frac{P(R)}{P(M)}$
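
To get a feel for the size of this correction, the sketch below assumes the threshold comes from requiring the related model to have the higher posterior probability, which gives log(P(R)/P(M)). The function names and the prior value 1e-4 are illustrative assumptions.

```python
import math


def prior_threshold(p_related: float) -> float:
    """Score threshold log(P(R) / P(M)), with P(M) = p_related and P(R) = 1 - p_related."""
    return math.log((1.0 - p_related) / p_related)


def is_related(score: float, p_related: float) -> bool:
    """Declare 'related' only if the log-likelihood ratio beats the prior odds."""
    return score > prior_threshold(p_related)


even_prior = prior_threshold(0.5)   # equal priors: the threshold is 0
rare_prior = prior_threshold(1e-4)  # one related sequence in 10,000: about 9.2 (natural log)
```

With equal priors the rule reduces to "Score > 0"; the rarer related sequences are, the higher the score needed before a match is believed.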

The Hazard of Large Databases • Define $\alpha = P(\mathrm{Score}(s,t) > \theta \mid R)$ • This is the probability that two unrelated sequences will match with score > $\theta$ by chance • Assume there are N strings in our database • Assuming that they are independent of each other, and all are unrelated to s, we have $P(\text{at least one string scores} > \theta) = 1 - (1 - \alpha)^N$

The Hazard of Large Databases [Figure: curves f(x, 0.001), f(x, 0.0001), f(x, 0.00001) and f(x, 0.000001), plotted for x from 0 to 100000 on a vertical axis from 0 to 1]
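
The effect can be reproduced numerically. The sketch assumes N independent unrelated strings, each exceeding the threshold with probability alpha, so that at least one does with probability 1 - (1 - alpha)^N; reading the plotted f(x, alpha) as this function of database size x is an assumption.

```python
def prob_false_hit(alpha: float, n: int) -> float:
    """Probability that at least one of n independent unrelated strings scores above the threshold."""
    return 1.0 - (1.0 - alpha) ** n


# Same per-string probabilities as the plotted curves, at three database sizes.
for alpha in (0.001, 0.0001, 0.00001, 0.000001):
    row = [round(prob_false_hit(alpha, n), 3) for n in (1000, 10000, 100000)]
    print(alpha, row)
```

Even a per-string probability of 0.001 makes a chance hit more likely than not once the database holds about a thousand strings.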

Local Matching • Question: Which local alignment query is expected to give a higher score: to a short sequence or to a long sequence? • A local match can begin at any of the nm entries in the DP matrix. • The score is the best over all these starting points. • If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials.

Score Significance-Fasta • How meaningful is a score? • Calculate distribution of scores and related scores • Under reasonable assumptions the scores for un-gapped alignment behave according to the Extreme Value Distribution.

Extreme Value Distribution (BLAST) • We ask the following question: given a database of size n and a sequence of size m, what is the expected number of hits with score at least S? This number is called an E-score: $E(S) = K m n e^{-\lambda S}$ • Notice that the number of such hits follows a Poisson distribution with mean E(S). • K corrects for the dependencies • $\lambda$ depends on the scoring matrix • Doubling n, the size of the database, doubles the expectation • Doubling S, the score, causes E(S) to decrease exponentially

BLAST P-value • Recall the Poisson distribution: $P(X = x) = e^{-E} \frac{E^x}{x!}$ • Probability of finding no hits with a score ≥ S: $P(X = 0) = e^{-E}$ • Therefore the probability of finding at least one hit with score ≥ S is $1 - e^{-E}$ • This is called the P-value.
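
A short numeric sketch of the two quantities, assuming the expected hit count has the form E = K m n exp(-lambda S) and the hit count is Poisson with mean E. The values of K and lambda below are placeholders for illustration, not parameters of any particular scoring matrix.

```python
import math


def e_value(score: float, m: int, n: int, k_param: float, lam: float) -> float:
    """Expected number of chance hits with score at least `score`: K * m * n * exp(-lambda * S)."""
    return k_param * m * n * math.exp(-lam * score)


def p_value(e: float) -> float:
    """P(at least one hit) for a Poisson hit count with mean e: 1 - exp(-e)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


K, LAM = 0.1, 0.3  # placeholder parameters (assumptions)
e_small_db = e_value(60, m=300, n=1_000_000, k_param=K, lam=LAM)
e_large_db = e_value(60, m=300, n=2_000_000, k_param=K, lam=LAM)  # twice the database
p_small_db = p_value(e_small_db)
```

For small E the P-value is nearly equal to E, which is why the two are often used interchangeably for strong hits.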

A Typical GenBank entry

Sequence Information

The Sequence

BLAST programs • BLASTN - Nucleotide query searching a nucleotide database. • BLASTP - Protein query searching a protein database. • BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database. • TBLASTN - Protein query searching a translated nucleotide (6 frames) database. • TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database
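
The "6 frames" used by BLASTX, TBLASTN and TBLASTX are the three reading frames of the sequence plus the three of its reverse complement. A minimal sketch, assuming the standard genetic code, upper-case A/C/G/T input, and hypothetical function names:

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base and reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate consecutive codons, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translations of frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```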

BLAST Search

BLAST Output • List of hits • Database accession codes, name, description. • Score in bits (Usually >30 bits is significant ) • Expectation value E() • For each hit • A header including hit name, description, length • Each hit may contain several HSPs • Score and expectation value • how many identical residues • how many residues contributing positively to the score • The local alignment itself

BLAST Output

PSI-BLAST (Position Specific Iterated) • BLAST provides a new automatic “profile-like” search. • Iterative procedure: • Perform BLAST on database. • Use significant alignments to construct a “position specific” score matrix. • This matrix replaces the query sequence in the next round of database searching. • The program may be iterated until no new significant alignments are found. • Most commonly used search method today.

Multiple Alignment • Proteins can be classified into families: • Common structure. • Common function. • Common evolutionary origin. • For a set of sequences belonging to some family • Each pair has some differences • But, there are some common motifs in almost all sequences of the family • A multiple alignment carries more information than pairwise alignment

Protein Families • Consider Zinc Fingers: • All have the same function: • Bind to DNA • All have similar structure • They constitute a Protein Family • In a protein family some parts of the sequence (the functional parts) are more conserved than others.

Definition A multiple alignment of strings $S_1, S_2, \ldots, S_k$ is a series of strings with blanks $S'_1, S'_2, \ldots, S'_k$ such that: • $|S'_1| = |S'_2| = \ldots = |S'_k|$ • $S'_j$ is an extension of $S_j$ obtained by insertion of blanks.

Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...
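
The two conditions of the definition can be checked mechanically on the example. In this sketch "." is taken as the blank symbol, as in the example rows, and the function name is hypothetical.

```python
def is_multiple_alignment(originals: list[str], aligned: list[str], blank: str = ".") -> bool:
    """True if all aligned rows have equal length and each row, with blanks removed, equals its original."""
    if not aligned or len(originals) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) == 1
    restores = all(row.replace(blank, "") == s for row, s in zip(aligned, originals))
    return same_length and restores


rows = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]
originals = [row.replace(".", "") for row in rows]  # the unaligned strings
ok = is_multiple_alignment(originals, rows)
```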

Example

Sum of Pairs • The score is the sum of pairwise distances between all pairs of sequences, for some scoring matrix • This assumes not only that each column of the alignment is independent, but also that each pair of sequences is • Each sequence is scored as if descended from k-1 sequences instead of one common ancestor.
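
A sum-of-pairs score can be sketched in a few lines. The unit distance used here (0 for identical symbols, including two blanks, and 1 otherwise) is a hypothetical stand-in for a real scoring matrix.

```python
from itertools import combinations


def distance(a: str, b: str) -> int:
    """Hypothetical unit distance: 0 for identical symbols (blank vs blank included), 1 otherwise."""
    return 0 if a == b else 1


def sum_of_pairs(aligned: list[str]) -> int:
    """Sum the column-wise distances over every pair of rows in the alignment."""
    return sum(
        distance(a, b)
        for row1, row2 in combinations(aligned, 2)
        for a, b in zip(row1, row2)
    )


total = sum_of_pairs(["AC", "A.", "GC"])
```

The double loop makes the independence assumptions visible: every pair of rows and every column contributes separately to the total.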

Calculation of Multiple Alignment • The optimal alignment can be calculated exactly using k-dimensional dynamic programming. • Space complexity $O(n^k)$ • Time complexity $O(2^k n^k)$ • A heuristic program called ClustalW quickly finds a good multiple alignment.

Creating a PSSM • After aligning the sequences we see that there are some conserved regions. • We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix. • This matrix represents information from a whole family; it is stricter in highly conserved regions.
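
One simple way to turn alignment columns into a position-specific matrix is a log-odds score with pseudocounts, sketched below. The DNA alphabet, uniform background, add-one pseudocount and function names are assumptions; PSI-BLAST's actual sequence weighting and pseudocount scheme is more elaborate.

```python
import math
from collections import Counter


def build_pssm(aligned: list[str], alphabet: str = "ACGT",
               pseudocount: float = 1.0, blank: str = ".") -> list[dict[str, float]]:
    """One dict per column: log(observed frequency / background frequency) for each letter."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != blank)
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm


def score_word(pssm: list[dict[str, float]], word: str) -> float:
    """Score a word against the matrix, one column per position."""
    return sum(column[ch] for column, ch in zip(pssm, word))


pssm = build_pssm(["ACG", "ACT", "AC."])
conserved_penalty = pssm[0]["C"]  # column 0 is always A, so other letters are penalised more
variable_penalty = pssm[2]["A"]   # column 2 varies, so the penalty is milder
```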

