Blastology and Open Source: Needs and Deeds Iddo Friedberg, Ph.D. The Burnham Institute February, 2003. Prologue. BLAST – Basic Local Alignment Search Tool: fast sequence similarity searching, query vs. database (1990) Gapped BLAST – now we can use gaps in the alignment (1996)
Blastology and Open Source: Needs and Deeds • Iddo Friedberg, Ph.D. • The Burnham Institute • February, 2003
Prologue • BLAST – Basic Local Alignment Search Tool: fast sequence similarity searching, query vs. database (1990) • Gapped BLAST – now we can use gaps in the alignment (1996) • PSI-BLAST – Position-Specific Iterated BLAST: iterated BLAST searches increase sensitivity (1997) • 7800 citations over 6 years
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing possibilities • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
PSI-BLAST 101 • Take a sequence • Search for similar sequences in a full sequence database • Sequences are multiply aligned (e.g. FGLGRT-I-T-YMTN / -GLVRT-I---LGLE / FGLLRT-I---YMTQ) • Construct a profile, and represent conservation in each position numerically • Profile holds more information than a single sequence: use the profile to retrieve additional sequences • New sequences in the multiple alignment (e.g. MGLLTREIF--ILQQ / FGLLRT-I-T-YMTN / -RLTRD-I---LGLY / FGLLRT-I---FMTS) • Construct a new profile • [Figure: example profiles, one row of per-position values for each residue type A, C, …, Y] • After several iterations of this procedure we have: • Sequence information, inc. links to annotation • Several sets of multiple alignments • Profiles, derived by us or by PSI-BLAST • Thresholding information (alignment statistics)
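The "construct a profile" step above can be sketched as a table of per-column residue frequencies. This is a minimal illustration only: the function name `column_frequencies` is hypothetical, gaps are simply ignored, and the sequence weighting and pseudocounts that PSI-BLAST itself applies are left out.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column.

    Gap characters ('-') are ignored; a column made only of gaps
    yields an empty dict. No sequence weighting or pseudocounts.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        if total:
            profile.append({res: n / total for res, n in counts.items()})
        else:
            profile.append({})
    return profile


# The three aligned sequences shown on the slide.
msa = [
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]
profile = column_frequencies(msa)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Each entry of `profile` plays the role of one profile column; a real PSSM would additionally convert these frequencies into log-odds scores.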
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
Post-BLAST Information Flow: Wetlab (Typical) • PSI-BLAST outputs: sequence alignments, statistics, annotations • Uses: locating homologs; function prediction (if function unknown)
Enter Bioinformatics, Stage Left… • Process many queries • More sophisticated post-processing, e.g. • Structure prediction • Phylogenetics • Function prediction: using annotation / structural data / phylogenetic data • “Unusual” searching: • Need to change parameter default values
Post-BLAST Information Flow: Bioinformatics • PSI-BLAST outputs: annotations, sequence alignments, statistics, profiles • Uses: locating homologs; function prediction (if function unknown); homology modeling; fold prediction; tree building
PDB-BLAST: Sensitive Fold Recognition (Li & Godzik) • PSI-BLAST against a large sequence database (nr85) • Outputs: statistics, sequence alignments, profiles • PSI-BLAST against a structure database (PDB) • Fold recognition
PSI-PRED Secondary Structure Prediction (David Jones) • PSI-BLAST against a filtered database: no transmembrane regions, no coiled coils • Profiles • Windows of length 15 • 1st neural network • 2nd neural network • 3-state prediction
PSI-BLAST is used for: • Distant homology detection • Fold assignment • Profile-profile comparison • Domain identification • Evolutionary analysis (e.g. tree building) • Sequence annotation / function assignment • Profile export to other programs • Sequence clustering • Structural genomics target selection • PSI-BLAST’s ability to do all of the above has been evaluated. So have competing programs, which used PSI-BLAST as a standard for comparison.
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
Why Profiles? • More informative than sequences • More accurate than regexps (“motifs”) • PSI-BLAST’s consecutive profiles enable us to obtain an “evolutionary vista” • PeCoP: illustrating the use of iterated profiles to detect Persistently Conserved Positions
PeCoP: Locating Important Residues (Friedberg & Margalit) • PSI-BLAST against a large sequence database (nr) • Outputs: sequence alignments, statistics, profiles • Find conserved positions • Locate important residues
What is a Conserved Position? • A conserved position has a high frequency of any single amino-acid type in the MSA column. • Conservation is usually measured by determining the information content or the relative entropy of a position
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing • PeCoP: getting profiles from PSI-BLAST • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
Information Content I: Uncertainty • Uncertainty: the number of “yes / no” questions needed to verify a state: • Coin toss: 1 question (“Is it heads?”) • Nucleotide in a DNA sequence: 2 questions (“Is it a purine?”) -> (“Is it an adenine?”) • Uncertainty is measured in bits • Maximum uncertainty: $\log_2(\text{number of possible states})$ • Coin toss: $\log_2 2 = 1$ bit • DNA: $\log_2 4 = 2$ bits • Proteins: $\log_2 20 = 4.32$ bits
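The three maximum-uncertainty values can be checked in a few lines. This is only a sketch; `max_uncertainty` is a name chosen here for illustration.

```python
import math


def max_uncertainty(n_states):
    """Maximum uncertainty, in bits, of a choice among n equally likely states."""
    return math.log2(n_states)


for name, n_states in [("coin toss", 2), ("DNA", 4), ("protein", 20)]:
    print(f"{name}: {max_uncertainty(n_states):.2f} bits")
```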
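Information Content II: Measuring Positional Conservation • Information content is the reduction in uncertainty • Uncertainty “before”: $\log_2 20 = 4.32$ bits • Uncertainty “after” (i.e. when we know the MSA position makeup): $H = -\sum_{a=1}^{20} p_a \log_2 p_a$, where $p_a$ is the frequency of amino-acid type $a$ in the column • Uncertainty difference is therefore: $IC = \log_2 20 + \sum_{a=1}^{20} p_a \log_2 p_a$ • Fully conserved position: IC = 4.32 – 20*0 = 4.32 (every term $p_a \log_2 p_a$ is 0) • Not conserved at all: IC = 4.32 – 4.32 = 0 • “The more conserved a position, the higher its information content”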
Information Content II: Measuring Positional Conservation (worked example) • MSA column: D, D, D, E, G • $P_D = 3/5 = 0.6$, $P_E = 1/5 = 0.2$, $P_G = 1/5 = 0.2$ • Uncertainty “after”: $-(0.6 \log_2 0.6 + 0.2 \log_2 0.2 + 0.2 \log_2 0.2) = 1.37$ bits • Information content (uncertainty “before” minus uncertainty “after”): $4.32 - 1.37 = 2.95$ bits
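The D, D, D, E, G column can be worked through in code. The sketch below assumes the Shannon-entropy definition of uncertainty, takes the column as a plain string of residues without gaps, and uses an illustrative function name, `information_content`.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Information content (bits) of one MSA column.

    IC = log2(alphabet_size) - H, where H is the Shannon entropy of the
    observed residue frequencies. Gaps are assumed to be absent.
    """
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


print(f"{information_content('DDDEG'):.2f}")  # 2.95
print(f"{information_content('DDDDD'):.2f}")  # 4.32 (fully conserved)
```

A column in which all 20 residue types are equally frequent gives an information content of 0, matching the "not conserved at all" case.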
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
Division by Prior Frequencies: “Conserved” vs. “Distinct” • A conserved position has a high frequency of any given amino-acid type in the MSA column. • “High frequency” meaning: • 1) a high frequency in the column? or • 2) a higher-than-expected frequency in the column? • Higher-than-expected: based on the frequencies of residue types in the “sequence universe” (SwissProt). • Question: “How conserved is a position?” Do not divide by priors. Use $IC = \log_2 20 + \sum_{a=1}^{20} p_a \log_2 p_a$ • Question: “How distinct is a position?” Divide by priors. Use $\sum_{a=1}^{20} p_a \log_2 (p_a / q_a)$, where $q_a$ is the prior frequency of amino-acid type $a$ • Surprise! When dividing by priors: relative entropy
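The difference between the two questions can be sketched with a relative-entropy function that takes the priors as an argument. Everything named here is illustrative: `relative_entropy` is a hypothetical helper, and the `toy_priors` values are made up for the example; they are not SwissProt frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def relative_entropy(column, priors):
    """Relative entropy (bits) of a column's residue frequencies vs. priors."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(
        (n / total) * math.log2((n / total) / priors[res])
        for res, n in counts.items()
    )


# With uniform priors (1/20 each), relative entropy reduces to the
# plain information content of the column.
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}
print(f"{relative_entropy('DDDEG', uniform):.2f}")  # 2.95

# Illustrative priors only (NOT real database frequencies): a rare
# residue that is fully conserved scores as more "distinct" than a
# common residue that is fully conserved.
toy_priors = {"W": 0.01, "L": 0.10}
print(relative_entropy("WWWWW", toy_priors) > relative_entropy("LLLLL", toy_priors))  # True
```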
20 Amino Acids… or Fewer? • A conserved position has a high frequency of any given amino-acid type in the MSA column. • “Amino acid type” meaning: • 1) There are 20 amino acid types • 2) There are fewer, because they can be grouped into similar physico-chemical types
20 Amino Acids… or Less?
Representative letter   Physico-chemical property   Included residue types
F                       Hydrophobic                 A, V, L, I, M, C
R                       Aromatic                    F, W, Y, H
O                       Polar                       S, T, N, Q
T                       Positive                    R, K
N                       Negative                    E, D
P                       Proline                     P
G                       Glycine                     G
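A reduced alphabet can be applied by mapping each residue to its group's representative letter before measuring conservation. The sketch below uses the grouping from the table above; taking the maximum uncertainty as log2 of the number of groups (7) is an assumption made here, and the function names are illustrative.

```python
import math
from collections import Counter

# Representative letter -> included residue types (from the table above).
GROUPS = {
    "F": "AVLIMC",  # hydrophobic
    "R": "FWYH",    # aromatic
    "O": "STNQ",    # polar
    "T": "RK",      # positive
    "N": "ED",      # negative
    "P": "P",       # proline
    "G": "G",       # glycine
}
TO_GROUP = {res: letter for letter, members in GROUPS.items() for res in members}


def reduce_column(column):
    """Rewrite a column of residues in the reduced alphabet."""
    return "".join(TO_GROUP[res] for res in column)


def information_content(column, alphabet_size):
    """log2(alphabet_size) minus the Shannon entropy of the column (bits)."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


reduced = reduce_column("DDDEG")
print(reduced)  # NNNNG: D and E fall into the same (negative) group
print(f"{information_content(reduced, alphabet_size=len(GROUPS)):.2f}")
```

In the reduced alphabet the D/E substitution no longer counts as variation, so only the glycine lowers the column's conservation.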
IC: Remember This • Information content == reduction in uncertainty. Used for measuring positional conservation • “The more conserved a position, the higher its information content” • We can divide (or not) by expected prior frequencies • We can group (or not) the 20 amino acids into a smaller alphabet
Possible Schemes for Calculating Positional Conservation • Alphabet: 20-letter alphabet or reduced alphabet • Priors: priors or no priors • Labels shown in the scheme grid: PSI-BLAST, nucleation center detection
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
PeCoP: Locating Important Residues (Friedberg & Margalit) • PSI-BLAST against a large sequence database (nr) • Outputs: sequence alignments, statistics, profiles • Find conserved positions • Locate important residues
Find Conserved Positions: Set a Threshold • Threshold is determined by normalizing the IC distribution over a sequence to mean == 0, SD == 1 • Then set a threshold
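The normalize-then-threshold step can be sketched as a z-score cut-off. The threshold of 1.0, the use of the population standard deviation, the function name `conserved_positions`, and the example IC values are all assumptions made for illustration; the slides do not state which values are used.

```python
import statistics


def conserved_positions(ic_values, threshold=1.0):
    """Return 0-based indices of positions whose normalized IC >= threshold.

    IC values are normalized to mean 0 and SD 1 over the sequence.
    """
    mean = statistics.mean(ic_values)
    sd = statistics.pstdev(ic_values)
    if sd == 0:
        return []  # every position is equally conserved; nothing stands out
    z_scores = [(value - mean) / sd for value in ic_values]
    return [i for i, z in enumerate(z_scores) if z >= threshold]


# Made-up per-position IC values for a short sequence.
ic_per_position = [0.5, 0.7, 4.3, 0.6, 2.9, 0.4, 0.8, 4.1]
print(conserved_positions(ic_per_position))  # [2, 7]
```

Normalizing per sequence makes the threshold comparable across families whose overall conservation levels differ.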
Find Conserved Positions: Conservation over Profiles • Determine conservation in a profile according to one of the four schemes discussed • But PSI-BLAST gives us several profiles (nIterations -1) • Therefore, a position is conserved if it retains conservation through successive iterations. • But retention does not have to be 100%
Retention Schemes • Majority vote: if a position is conserved in x out of n iterations, it is considered conserved. • Persistent conservation: conservation in the first & last iteration
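Both retention schemes can be written as small predicates over the per-iteration sets of conserved positions. This is a sketch with hypothetical names (`majority_vote`, `persistent`) and made-up position sets.

```python
def majority_vote(conserved_per_iteration, position, min_iterations):
    """True if the position is conserved in at least min_iterations iterations."""
    hits = sum(position in conserved for conserved in conserved_per_iteration)
    return hits >= min_iterations


def persistent(conserved_per_iteration, position):
    """True if the position is conserved in both the first and the last iteration."""
    return (
        position in conserved_per_iteration[0]
        and position in conserved_per_iteration[-1]
    )


# Made-up conserved positions from four successive profiles.
iterations = [{3, 10, 42}, {3, 42, 57}, {3, 57}, {3, 10, 57}]

print(majority_vote(iterations, 57, min_iterations=3))  # True: conserved in 3 of 4
print(persistent(iterations, 57))                       # False: absent from the first
print(persistent(iterations, 10))                       # True: first and last
```

The two schemes can disagree: position 57 passes the majority vote but is not persistent, while position 10 is persistent but conserved in only two of four iterations.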
Persistent Conservation • Positions conserved in close family members may be conserved due to evolutionary non-divergence, and not solely due to a structural / functional role. Hence, a supply of false positives. • Positions conserved in distant family members may be marked as such due to an observed drift from the original sequence. False positives again, but for a different reason. The intersection of the above two findings minimizes both types of errors
PeCoP • Determine conservation according to the following parameters: • Either one of the four IC schemes AND • Set a threshold AND • Choose a retention scheme • [Screenshots: PeCoP submission; PeCoP results]
Getting PSI-BLAST Profiles According to Different Conservation Schemes • In ncbitools: ncbi/tools/posit.c, lines 1826 – 2689 • #ifdef POSIT_DEBUG // the code here is concerned with matrix output, and normally commented out // play around with it… #endif • Can NCBI provide this output by use of a command-line argument?
Why Not Parse PSI-BLAST Alignments? • Speed: slow, esp. when using a scripting language • Not all alignments appear on output (default 250) • Sequence weighting and profile construction are already provided for • NCBI keeps changing the format: the programmer has to keep changing the parser
Why Parse PSI-BLAST Alignments? Gain more information: • Assign sequence weight and filtering parameters according to specific needs • Use annotation: inline or linked. • Realign sequences, and construct own profile • PSI-BLAST source code keeps changing • As of v. 2.1.2: XML and (2.2.1) tabulated (no alignment) output
Post-BLAST Information Flow: Bioinformatics • PSI-BLAST outputs: annotations, sequence alignments, statistics, profiles • Uses: locating homologues; function prediction (if function unknown); homology modeling; fold prediction; tree building
Post Blast Processing Many modules, but: • Most are application-specific. • Some are web-resources only. • Bad licenses, machine-specific, not written for distribution purposes, etc. Result: need to rewrite the same stuff over (and over.. and over..).
Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?
Bio*.org Projects • Collaborative projects aimed at providing programming tools for bioinformatics under an open-source license • Bio{Perl | Java | Python} : procedural • Bio{CORBA | MOBY}: interface, web access standardization The Open Bioinformatics Foundation
NCBI and Post Blast Processing • Language: C/C++ • ASN.1 was around long before XML • seqalign.asn • Now (v. 2.1.1) there is also XML output format, DTDs are there. • Web APIs, for WWW-based PSI-BLAST runs • Public domain, no license
What is Needed? • Annotation handling. PB output has rudimentary annotation only. The rest is served by links. Transfer into MySQL? • Translate parsed output into multiple sequence alignment objects, and then into PSSMs • Direct PB residue frequency output • CORBA: do we need a format-aware object? • Anything else you can think of………
Summary • PSI-BLAST profiles have become the method-of-choice for “doing things” when a high detection sensitivity is required BUT… • Profiles can and should be interpreted carefully • Results should be interpreted carefully • Do NOT write your own PSI-BLAST parser. Please write something we need!
Further Reading • http://www.ncbi.nlm.nih.gov • http://open-bio.org Books: • Durbin R. et al. Biological Sequence Analysis. Cambridge University Press (Chapter 9) Papers: • http://www.ncbi.nlm.nih.gov/BLAST/blast_references.html Blastology: • W. Li, F. Pio, K. Pawlowski and A. Godzik: Saturated Blast: detecting distant homology using automated multiple intermediate sequence Blast search. Bioinformatics (2000) 16:1105-1110 • W. Li, L. Jaroszewski and A. Godzik: Clustering of highly homologous sequences from large sequence protein databases. Bioinformatics (2001) 17:282-283 • W. Li, L. Jaroszewski and A. Godzik: Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics (2002) 18:77-82 • W. Li and A. Godzik: Discovering new genes with advanced homology detection. Trends in Biotech (2002) 20:315-6 • I. Friedberg, T. Kaplan, and H. Margalit: Evaluation of PSI-BLAST Alignment Accuracy in Comparison to Structural Alignments. Protein Science (2000) Nov;9(11):2278-84 • I. Friedberg and H. Margalit: Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function. Protein Science (2002) 11(2):350-360 • I. Friedberg and H. Margalit: PeCoP: automatic determination of persistently conserved positions in protein families. Bioinformatics (2002) 18(9):1276-77 Conserved positions: • Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol. 1999 Aug 6;291(1):177-96 • Reddy BV, Li WW, Shindyalov IN, Bourne PE: Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins. Proteins. 2001 Feb 1;42(2):148-63 • Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol. 2001 Apr 13;307(5):1487-502
Thanks to.. • Hanah Margalit • Adam Godzik • Bio{java | perl | python}.org folks • Jeff Bizzaro http://bioinformatics.org/pecop
Check the Following when Running PSI-BLAST for PBP: • Number of sequences printed (if making own profile from printed sequences). • E-value inclusion threshold for next iteration (rec: 0.001). • Low complexity masking? • Substitution matrix used?
PSI-BLAST 101 (contd.) Exports: • Multiple sequence alignments • Annotation links • Statistical data