Using BLAST to Search Sequence Databases (Searching Biological Sequence Databases)
Cédric Notredame
Outline
- Evolution and Sequence Similarity
- The inside of BLAST
- Using BLAST
- Adapting BLAST to your needs
- Searching Protein Domains with BLAST
- Digging Genomes
An Alignment is a STORY
Starting sequence: ADKPKRPLSAYMLWLN
Mutations + Selection
Resulting sequences: ADKPKRPKPRLSAYMLWLN and ADKPRRPLS-YMLWLN
Alignment (showing a Mutation, an Insertion and a Deletion):
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
How Do Sequences Evolve? In a structure, each Amino Acid plays a Special Role. In the core, SIZE MATTERS. On the surface, CHARGE MATTERS. [Figure: OmpR, Cter Domain]
Why Does It Make Sense To Align Sequences? Same Sequence: Same Origin, Same Function, Same 3D Fold.
How Can We Compare Sequences? The Twilight Zone. Similar Sequence: Similar Structure. Different Sequence: Structure ???? [Plot: %Sequence Identity against Length, with regions labelled "Same 3D Fold" and "Twilight Zone"; axis marks at 30% identity and length 100]
Different molecular clocks for different proteins--another prediction
A few Definitions
Query: Your sequence
Subject: The database against which you search
Heuristic: Algorithm that does not guarantee the optimal solution
Other Important Definitions
Identity: Proportion of IDENTICAL residues between two sequences. Depends on the alignment. Unit: the % identity.
Similarity: Proportion of SIMILAR residues. Two residues are similar if their substitution score is higher than 0. Depends on the matrix. Unit: the % similarity.
Homology: Sequences SIMILAR enough are sometimes HOMOLOGOUS. HOMOLOGY means COMMON ANCESTOR. Unit: Yes or No! DIFFERENT sequences can also be Homologous.
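As a small illustration of the Identity and Similarity definitions, here is a sketch that computes both for the aligned pair from the "An Alignment is a STORY" slide. Two assumptions are made that the slides do not state: percentages are taken over the full alignment length (gap columns included), and the set of "similar" pairs is a one-entry excerpt (K/R, which scores +2 in BLOSUM62) rather than a full matrix.

```python
# Aligned pair taken from the "An Alignment is a STORY" slide.
SEQ_A = "ADKPRRP---LS-YMLWLN"
SEQ_B = "ADKPKRPKPRLSAYMLWLN"

# Assumption: a tiny excerpt of non-identical pairs with a positive
# substitution score (K/R is +2 in BLOSUM62). A real analysis would
# consult a complete substitution matrix instead.
POSITIVE_PAIRS = {frozenset("KR")}


def is_similar(a, b):
    """Identical residues count as similar; otherwise consult the excerpt."""
    return a == b or frozenset((a, b)) in POSITIVE_PAIRS


def percent_identity_and_similarity(a, b):
    """Return (% identity, % similarity) over the full alignment length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for x, y in zip(a, b):
        if "-" in (x, y):      # gap column: neither identical nor similar
            continue
        if x == y:
            identical += 1
        if is_similar(x, y):
            similar += 1
    length = len(a)            # assumption: gap columns count in the denominator
    return 100.0 * identical / length, 100.0 * similar / length


identity, similarity = percent_identity_and_similarity(SEQ_A, SEQ_B)
print(f"identity = {identity:.1f}%, similarity = {similarity:.1f}%")
```

Changing the alignment (for example, moving a gap) or the matrix changes both numbers, which is the point of "Depends on the alignment" and "Depends on the matrix".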
More Important Definitions
Hit: A sequence that matches your sequence and is reported by BLAST.
E-Value: Expectation value. How many times would you expect to find a hit by chance only?
• Depends on the alignment
• Depends on the matrix
• Depends on the database
• Sensitive to Low complexity regions
• Must be lower than 0.0001 to mean something
BLAST: Basic Local Alignment Search Tool. BLAST is a Program Designed for RAPIDLY Comparing Your Sequence With every Sequence in a database and REPORTING the most SIMILAR sequences.
Database Search: 1-Query, 2-Comparison Engine (LOCAL Alignment), 3-Database, 4-Statistical Evaluation (E-Value). PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW.
[Figure: Database Search. The query (Q) is searched against the database with SW and with BLAST; the hits are annotated with their scores and E-values.]
BLAST: Basic Local Alignment Search Tool
BLAST is a Heuristic Smith and Waterman
BLAST = 3 STEPS
1-Decide who will be compared (this is where BLAST SAVES TIME, this is where it LOSES HITS; most BLAST parameters refer to this step)
2-Check the most promising Hits
3-Compute the E-value of the most interesting Hits
BLAST Heuristic Algorithms: A Bit of History
• Smith and Waterman
  • Exact Local Dynamic Programming, 1981
• FASTA
  • Lipman and Pearson, 1985
  • Looks for similar words (k-tup) on the same diagonal.
  • Comparison of the sequences one by one…
• BLAST
  • Altschul et al., 1990
  • The most widely cited tool in Biology
  • www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
Inside BLAST Step 1: finding the worthy words. [Diagram: the query word (REL) is scored against the list of all the 3AA words that can be found in the database (AAA, AAC, AAD, ..., YYY). Words with a score > T are kept (ACT, RSL, TVF, ...); words with a score < T are dropped. Other example words shown: LKP.]
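Step 1 can be sketched as follows: every possible 3-letter word is scored against each word of the query, and only the words scoring above T are kept. The scoring function here is a deliberately crude toy (identity = 5, same hypothetical residue group = 2, otherwise -1), not a real substitution matrix, so the resulting word list is illustrative only; `GROUPS`, `toy_score` and `interesting_words` are names invented for this sketch.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy assumption: residues in the same group are "similar".
# These groups are hypothetical and stand in for a real matrix.
GROUPS = ["KR", "DE", "ILVM", "FYW", "ST", "NQ"]


def toy_score(a, b):
    """Toy substitution score: 5 if identical, 2 if same group, else -1."""
    if a == b:
        return 5
    if any(a in g and b in g for g in GROUPS):
        return 2
    return -1


def word_score(w1, w2):
    """Score two words of equal length position by position."""
    return sum(toy_score(a, b) for a, b in zip(w1, w2))


def query_words(query, k=3):
    """All overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def interesting_words(query, threshold, k=3):
    """Every k-letter word scoring > threshold against some query word."""
    words = set()
    for qw in query_words(query, k):
        for letters in product(AMINO_ACIDS, repeat=k):
            candidate = "".join(letters)
            if word_score(qw, candidate) > threshold:
                words.add(candidate)
    return words


print(sorted(interesting_words("REL", threshold=11)))
```

Raising the threshold shrinks the word list (faster, fewer hits); lowering it grows the list (slower, more sensitive), which is why this step is where most of the speed/sensitivity trade-off is decided.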
Inside BLAST Step 2: Eliminate the database sequences that do not contain any interesting word. Look for « interesting » words (the list of words scoring > T: ACT, RSL, TVF, ...) in the sequences within the database. • Sequences containing interesting words (Hits) are kept.
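A minimal sketch of Step 2, assuming the list of interesting words has already been built: each database sequence is scanned for those words, and sequences without any of them are discarded. The word list reuses the slide's examples (ACT, RSL, TVF); the database content and the names `DATABASE` and `find_word_hits` are hypothetical.

```python
# Words taken from the slide; the database below is made up for illustration.
INTERESTING_WORDS = {"ACT", "RSL", "TVF"}

DATABASE = {
    "seq1": "MKACTGG",
    "seq2": "GGGGGG",
    "seq3": "RSLAATVF",
}


def find_word_hits(words, database, k=3):
    """Map each sequence name to its (position, word) hits.

    Sequences that contain no interesting word are left out entirely,
    which is how this step eliminates most of the database.
    """
    hits = {}
    for name, seq in database.items():
        found = [
            (i, seq[i:i + k])
            for i in range(len(seq) - k + 1)
            if seq[i:i + k] in words
        ]
        if found:
            hits[name] = found
    return hits


for name, found in find_word_hits(INTERESTING_WORDS, DATABASE).items():
    print(name, found)
```

A production implementation would index the words rather than rescan every sequence, but the filtering logic is the same.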
Inside BLAST: the end. Step 3: Extension of the Hits. • 2 "Hits" on the same diagonal distant by less than X • Extension by limited Dynamic Programming [Diagram: Query against Database sequence]
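The trigger condition of Step 3 can be sketched like this: word hits are grouped by diagonal (subject position minus query position), and two hits on the same diagonal closer than X are flagged for extension. The sketch covers only the trigger; the extension by limited dynamic programming is not shown. The function name, the hit coordinates and the value of X are illustrative assumptions.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Return (diagonal, left_query_pos, right_query_pos) for each pair of
    neighbouring hits on the same diagonal less than max_distance apart.

    `hits` is a list of (query_pos, subject_pos) word matches.
    """
    by_diagonal = defaultdict(list)
    for query_pos, subject_pos in hits:
        by_diagonal[subject_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Compare each hit with the next one along the same diagonal.
        for left, right in zip(positions, positions[1:]):
            if 0 < right - left < max_distance:
                seeds.append((diagonal, left, right))
    return sorted(seeds)


# Hypothetical word hits: three on diagonal 10, one on diagonal 28.
example_hits = [(0, 10), (5, 15), (2, 30), (40, 50)]
print(two_hit_seeds(example_hits, max_distance=10))
```

Only the pair at query positions 0 and 5 qualifies here: the hit at 40 lies on the same diagonal but too far away, and the hit on diagonal 28 has no partner.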
BLAST Statistics: Raw Score • Evaluation of the score • Raw Score • Sum of the substitution scores and gap penalties. • Not very informative
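As a worked example of a raw score, the sketch below scores the aligned pair from the "An Alignment is a STORY" slide. The substitution values are meant to be the BLOSUM62 entries for the residue pairs that occur in this alignment, and the gap costs (11 to open, 1 per gapped position) are the usual blastp defaults; both are assumptions to check against the matrix and settings actually in use.

```python
SEQ_A = "ADKPRRP---LS-YMLWLN"
SEQ_B = "ADKPKRPKPRLSAYMLWLN"

# Assumption: excerpt of BLOSUM62 limited to the pairs in this alignment.
# Keys are the two residues in alphabetical order.
MATRIX_EXCERPT = {
    "AA": 4, "DD": 6, "KK": 5, "PP": 7, "RR": 5, "KR": 2,
    "LL": 4, "SS": 4, "YY": 7, "MM": 5, "WW": 11, "NN": 6,
}


def pair_score(a, b):
    """Look up a residue pair regardless of order."""
    return MATRIX_EXCERPT["".join(sorted((a, b)))]


def raw_score(a, b, gap_open=11, gap_extend=1):
    """Sum of substitution scores minus affine gap penalties.

    A gap of length n costs gap_open + n * gap_extend. Consecutive gap
    columns are treated as one gap, whichever sequence carries the dash.
    """
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if "-" in (x, y):
            score -= gap_extend + (0 if in_gap else gap_open)
            in_gap = True
        else:
            score += pair_score(x, y)
            in_gap = False
    return score


print(raw_score(SEQ_A, SEQ_B))
```

The number by itself says little, because it depends on the matrix and gap costs chosen, which is why the slides move on to derived statistics.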
BLAST Statistics: P Values • Derived Statistics • p-value • Probability of finding an alignment with such a score, by chance. • The lower, the better
BLAST Statistics: P-Values. Just as the sum of a large number of independent identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution. [Figure: extreme value distribution (Gumbel) compared with a normal distribution]
BLAST Statistics: P-Values. P-Value: Probability that a random alignment obtains a score greater than or equal to X. K must be calibrated with the database composition. Lambda is calibrated with the matrix being used.
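For reference, the usual Karlin-Altschul form of this P-value can be written as below. The symbols m and n (query length and database length) are notation assumed here; the slide text itself only names K and Lambda.

```latex
P(S \ge x) \;=\; 1 - \exp\!\left(-K\, m\, n\, e^{-\lambda x}\right)
```

The inner term $K\,m\,n\,e^{-\lambda x}$ is the expected number of chance alignments scoring at least $x$, so the P-value falls off exponentially as the score grows.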
BLAST Statistics: E-Values • Derived Statistics • E-value • Number of alignments expected by chance • The lower, the better: <0.00001 • For Values Lower than 0.0001, E-Value ~ P-Value • The E-Values are easier to compare than P-Values
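The relation "E-Value ~ P-Value" for small values can be checked numerically. The sketch assumes the standard Karlin-Altschul relations E = K·m·n·exp(-λS) and P = 1 - exp(-E); the λ and K defaults are values commonly quoted for BLOSUM62 with gap costs 11/1 and should be treated as illustrative, as should the query and database lengths.

```python
import math


def e_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance alignments scoring >= raw_score.

    lam and k are illustrative defaults (assumption), not calibrated values.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable for very small E


for e in (10.0, 1.0, 1e-2, 1e-4, 1e-6):
    print(f"E = {e:<8g} P = {p_value(e):.6g}")
```

For large E the P-value saturates near 1 and stops discriminating between hits, whereas E keeps growing, which is one reason E-values are easier to compare.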
BLAST Statistics: Bit-Score • Bit Score • Evaluates the amount of information in the alignment • Makes it possible to compare alignments
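A sketch of how a bit score is derived from a raw score, assuming the standard normalisation S' = (λS - ln K) / ln 2, under which E = m·n·2^(-S'). The two (λ, K) pairs are values commonly quoted for ungapped and gapped BLOSUM62 scoring and are used here only as illustrative assumptions.

```python
import math

# Assumption: illustrative statistical parameters for two scoring systems.
UNGAPPED = {"lam": 0.3176, "k": 0.134}
GAPPED = {"lam": 0.267, "k": 0.041}


def bit_score(raw_score, lam, k):
    """Normalise a raw score so it no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """E-value expressed with the bit score only: m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# The same raw score means different things under different systems.
for name, params in (("ungapped", UNGAPPED), ("gapped", GAPPED)):
    print(name, round(bit_score(55, **params), 1))
```

Because λ and K are folded into the bit score, two alignments produced with different matrices or gap costs can be compared directly through their bit scores.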
BLAST Statistics: Booby Trap! The E-Value depends on N, the Database size. If N increases, some Hits can be lost
[Screenshots: P31383 vs. YEAST; P31383 vs. UniProt]
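The booby trap can be reproduced with a few lines: the same alignment score is evaluated against a small and a large database. The database sizes, query length, score, and the λ and K values are all hypothetical numbers chosen for illustration; the 0.0001 cut-off is the one quoted in the definitions earlier.

```python
import math


def e_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """E = K * m * n * exp(-lambda * S); lam and k are illustrative."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


THRESHOLD = 1e-4             # cut-off quoted in the definitions slide
QUERY_LEN = 300              # hypothetical query length
SMALL_DB = 3_000_000         # hypothetical size, e.g. a single proteome
LARGE_DB = 300_000_000       # hypothetical size, e.g. a large protein database

score = 105                  # same alignment, same raw score in both searches
e_small = e_value(score, QUERY_LEN, SMALL_DB)
e_large = e_value(score, QUERY_LEN, LARGE_DB)

print(f"small database: E = {e_small:.2e}, kept = {e_small < THRESHOLD}")
print(f"large database: E = {e_large:.2e}, kept = {e_large < THRESHOLD}")
```

The E-value grows in direct proportion to the database size, so a hit that passes the cut-off in the small search can fall below significance in the large one even though the alignment itself is unchanged.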
The Many Flavors of BLAST
Database Against Database: « Farm-Blast » Genome 1 Genome 2 Ideal for finding Orthologues
The Classics: 1 Sequence vs. a Sequence Database
The Many Flavors of BLAST
Program   Query                               Database
blastp    protein                             protein
blastn    nucleotide                          nucleotide
blastx    nucleotide (translated to protein)  protein
tblastn   protein                             nucleotide (translated to protein)
tblastx   nucleotide (translated to protein)  nucleotide (translated to protein)
The Many Flavors of BLAST
Program     Query      Database
Psi-blast   protein    protein
RPS-blast   protein    Domain
DART-blast  protein    protein
mega-blast  Large DNA  DNA
Using BLAST: The Basic Way
Database Search. Result = Prediction: Protein X IS or IS NOT homologous to the QUERY.