5th Meeting on U.S. Government Chemical Databases and Open Chemistry

Frederick, Maryland, August 25-26, 2011. The PDBbind Database: A Comprehensive Collection of the Binding Data and Structures of the Complexes in the Protein Data Bank. Renxiao Wang

Outline • What is the PDBbind database and why was it developed? • How is the PDBbind database compiled? • What information is provided by the PDBbind database? • Possible applications of the PDBbind database

What is the PDBbind Database? Protein Data Bank → Biomolecular complexes → Complexes with binding data → PDBbind web site (structural information and binding data), http://www.pdbbind-cn.org/ Biomolecular complexes comprise (1) complexes formed between small-molecule ligands and biomacromolecules, and (2) those between biomacromolecules.

Why Create the PDBbind Database? Both structural and energetic information are indispensable for an in-depth understanding of the recognition between small molecules and biological macromolecules. Such information is especially important for the development and calibration of computational methods for estimating protein-ligand binding affinity.

Why Create the PDBbind Database? • Three-dimensional structures of biomolecular complexes are available from the Protein Data Bank: • More than 74,000 structures had been deposited in PDB by Aug 1st, 2011. Nearly half of them are complexes of all types. • However, binding affinity data for these complexes, where available, were scattered across the literature and thus difficult to access. • Before PDBbind, no other database had attempted to collect such binding data in a systematic manner. The PDBbind database aims to provide a comprehensive collection of the binding data for all types of biomolecular complexes in PDB.

Why Create the PDBbind Database? The old approach: Assemble the data sets reported by other researchers. For example, the X-Score scoring function was developed by using a set of 230 protein-ligand complexes with known binding data. This data set, the largest collection of its type at that time, was compiled by assembling several smaller data sets reported previously. Wang R. et al., J. Comput.-Aided Mol. Des. 2002, 16, 11-26. • Disadvantages of this approach • It is difficult to verify those binding data since original references are often not given: some data are IC50 values; some data are not binding affinity data; there are even typographical errors! • Regular updates are not possible.

History of the PDBbind Database • Apr, 2001: Preliminary trial & launch of the project (University of Michigan). • May, 2004: PDBbind v.2004 was publicly released at http://www.pdbbind.org/ (University of Michigan). v.2005 and v.2006 were released. • Nov, 2007: The PDBbind-CN server was launched at http://www.pdbbind-cn.org/ (Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences). v.2008, v.2009, and v.2010 were released. • Aug, 2011: The current version (v.2011), providing binding data for ~8,000 complexes in PDB. (1) Wang, R. et al. J. Med. Chem. 2004, 47, 2977-2980. (2) Wang, R. et al. J. Med. Chem. 2005, 48, 4111-4119. (3) Cheng, T. et al. J. Chem. Inf. Model. 2009, 49, 1079-1093.

How is the PDBbind Database Compiled? The entire PDB → (I. Classification of complexes) → Biomolecular complexes → (II. Collection of binding data from original references) → Complexes with binding data → (III. Data processing & web design) → Integrate into the PDBbind web site

Step I. Classification of Biomolecular Complexes For a given PDB entry: • Contains a protein and a nucleic acid: protein-nucleic acid complex. • Contains a protein, no nucleic acid, but a small molecule: protein-ligand complex. • Contains a protein, no nucleic acid, and no small molecule: protein-protein complex if it contains two proteins, otherwise apo-protein. • Contains no protein but a nucleic acid: nucleic acid-ligand complex if it contains a small molecule, otherwise apo-nucleic acid. • Contains neither a protein nor a nucleic acid: misc. oligomer. The entire classification process is automated by a set of computer programs.
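One reading of the classification flowchart above can be sketched in Python. This is a minimal illustration, not the PDBbind programs themselves: the `EntryContents` record, its field names, and the `classify_entry` function are all hypothetical, and the branch order is reconstructed from the flattened slide.

```python
from dataclasses import dataclass


@dataclass
class EntryContents:
    """Hypothetical summary of what one PDB entry contains."""
    n_proteins: int = 0
    has_nucleic_acid: bool = False
    has_small_molecule: bool = False


def classify_entry(entry: EntryContents) -> str:
    """Walk the yes/no questions of the flowchart in order."""
    if entry.n_proteins > 0:
        if entry.has_nucleic_acid:
            return "protein-nucleic acid complex"
        if entry.has_small_molecule:
            return "protein-ligand complex"
        if entry.n_proteins >= 2:
            return "protein-protein complex"
        return "apo-protein"
    # No protein in the entry.
    if entry.has_nucleic_acid:
        if entry.has_small_molecule:
            return "nucleic acid-ligand complex"
        return "apo-nucleic acid"
    return "misc. oligomer"


if __name__ == "__main__":
    print(classify_entry(EntryContents(n_proteins=1, has_small_molecule=True)))
```

Because the protein-ligand test comes before the two-protein test in this reading, an entry with two proteins and a small molecule lands in the protein-ligand category. Whether the original programs resolve that case the same way is not stated in the slides.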

Classification of the Entire Protein Data Bank* Categories: protein-ligand complexes; special protein-ligand complexes (cofactor-containing); protein-nucleic acid complexes; protein-protein complexes; nucleic acid-ligand complexes; apo-nucleic acids; apo-proteins. * Based on the PDB contents released by Jan 1st 2011, 70,224 entries in total.

Step II. Collection of Binding Data from Literature Binding affinity data of a given complex could be reported or cited in the “primary citation” of the PDB file (success rate  30%).

Collection of Binding Data from Literature Accepted binding affinity data include dissociation constants (Kd), inhibition constants (Ki), and concentrations at 50% inhibition (IC50). • A computer program was developed to process PDF files, filtering out the papers containing no binding data. • Each remaining paper is then examined independently by two persons. Consensus must be reached before the binding data are recorded.

Collection of Binding Data from Literature • Over 17,800 references have been processed so far. • Each primary reference is saved as a PDF file, in which the binding data are clearly marked. • Mistakes are still possible during manual data curation. Nevertheless, >98% of the binding data in PDBbind are correct. The primary reference for PDB entry 1BXO

Outcomes of Binding Data Collection [Figure: diagram linking Proteins, Small Molecules, and Nucleic Acids, labeled with the counts 6,070, 1,427, 66, and 428.] PDBbind v.2011 includes binding data for 7,991 complexes in PDB.

Updates of the PDBbind Database It is critical to update PDBbind regularly to keep up with the constant growth of PDB. PDBbind is now updated annually, and it grows by 20-30% each year.

Step III. Build the PDBbind-CN Web Site PDBbind-CN (http://www.pdbbind-cn.org/) brings together binding data and structures for biomolecular complexes in PDB. Functions: browse information; search information; download information; deposit binding data. Linked resources: RCSB PDB, PDBsum, PubMed, PubChem.

On-line Information @ PDBbind-CN The basic information of each complex is summarized on a single page.

Multiple display modes are provided by ChemAxon and Jmol Java applets on the web interface of PDBbind-CN.

On-line Search @ PDBbind-CN Various types of queries may be used in the searching of binding data. Results are given in well-organized forms, which can be output in either the PDF format or the Excel format.

On-line Search @ PDBbind-CN Substructure/similarity search among the small-molecule ligands in all protein-ligand complexes in PDB (>12,000 entities), not limited to those with known binding data.

On-line Search @ PDBbind-CN Similarity search among all protein and nucleic acid sequences in PDB, not limited to those with known binding data.

What can be downloaded from PDBbind-CN? • Tables of binding data for all categories of complexes. • “Clean” structural files of most of the protein-ligand complexes with known binding data (6,023 in v.2011), which can be readily utilized by most molecular modeling software. • A complete “biological unit” of each complex is split into a protein molecule and a ligand molecule. • The protein molecule is saved in the PDB format and the ligand molecule is saved in the SYBYL Mol2 format after necessary processing. • The “refined set” and the “core set” of selected protein-ligand complexes, providing a high-quality benchmark for docking/scoring studies.

Selection of the Refined Set The refined set consists of protein-ligand complexes meeting higher standards: • Quality of the structure: crystal structures with resolution < 2.5 Å and R-factor < 0.250; both the protein and the ligand structures must be complete. • Quality of the binding data: binding data are given as Kd or Ki, with 2.0 < -logKd < 12.0 (i.e. Kd between 10 mM and 1 pM); binding data cannot be an estimated value; the protein and the ligand used in the binding assay must match exactly the ones observed in the crystal structure. • Nature of the complex: binding must be non-covalent; the complex must be binary; ligand MW < 1000; the ligand must not contain B, Be, Si, or metal atoms. In v.2011, a total of 2,476 protein-ligand complexes are selected into the refined set, accounting for 41% of all protein-ligand complexes with known binding data.
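The refined-set criteria above translate naturally into a filter. The sketch below is illustrative only: the `ComplexRecord` fields, the `in_refined_set` function, and the short list of metals are assumptions made for the example, and the sample values are invented, apart from a Ki of 2.0 nM, which matches one of the CDK-2 entries quoted elsewhere in the slides.

```python
import math
from dataclasses import dataclass, replace

# Illustrative, non-exhaustive stand-in for "metal atoms" (an assumption).
EXAMPLE_METALS = frozenset({"Na", "K", "Mg", "Ca", "Mn", "Fe", "Co", "Ni", "Cu", "Zn"})
EXCLUDED_ELEMENTS = frozenset({"B", "Be", "Si"}) | EXAMPLE_METALS


@dataclass
class ComplexRecord:
    """Hypothetical record; the field names are not PDBbind's own."""
    resolution: float              # in Å
    r_factor: float
    structure_complete: bool       # protein and ligand both complete
    affinity_type: str             # "Kd", "Ki", or "IC50"
    affinity_molar: float          # in mol/L
    is_estimate: bool
    assay_matches_structure: bool
    is_covalent: bool
    is_binary: bool
    ligand_mw: float
    ligand_elements: frozenset


def neg_log_affinity(affinity_molar: float) -> float:
    """-log10 of a Kd or Ki given in mol/L."""
    return -math.log10(affinity_molar)


def in_refined_set(c: ComplexRecord) -> bool:
    structure_ok = c.resolution < 2.5 and c.r_factor < 0.250 and c.structure_complete
    data_ok = (
        c.affinity_type in {"Kd", "Ki"}
        and 2.0 < neg_log_affinity(c.affinity_molar) < 12.0
        and not c.is_estimate
        and c.assay_matches_structure
    )
    nature_ok = (
        not c.is_covalent
        and c.is_binary
        and c.ligand_mw < 1000
        and not (c.ligand_elements & EXCLUDED_ELEMENTS)
    )
    return structure_ok and data_ok and nature_ok


# Made-up example values, except Ki = 2.0 nM.
example = ComplexRecord(
    resolution=2.0, r_factor=0.20, structure_complete=True,
    affinity_type="Ki", affinity_molar=2.0e-9, is_estimate=False,
    assay_matches_structure=True, is_covalent=False, is_binary=True,
    ligand_mw=400.0, ligand_elements=frozenset({"C", "H", "N", "O", "S"}),
)
ic50_variant = replace(example, affinity_type="IC50")
```

Here `in_refined_set(example)` passes every test, while `ic50_variant` fails only because IC50 values are not accepted for the refined set.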

Selection of the Core Set The refined set (2,476) → Clustering → Selection → The core set (243). In v.2011, the core set consists of a total of 243 protein-ligand complexes in 81 families. The core set will be kept under 300 complexes (100 families) in the future.

Selection of the Core Set The core set is selected to provide a representative, non-redundant sampling of the refined set, so that it serves better as a benchmark for validating docking/scoring approaches. Methods • Clustering: Group the protein-ligand complexes in the refined set into families by protein sequence similarity (cutoff = 90%). • Selection of clusters: Only consider the families that have at least 5 members. The highest binding affinity in each valid family must be at least 100-fold higher than the lowest binding affinity. • Selection of representatives: For each remaining family, select the complex with the highest binding affinity (the “topper”), the lowest binding affinity (the “lower”), and the one closest to the mean value (the “middler”) as the representatives of this family.
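The cluster-selection and representative-selection steps above can be sketched as follows. The sketch assumes that clustering by sequence similarity has already been done and that each family arrives as a list of (PDB code, -log Ki) pairs; the `select_core_set` function and the `"CDK2"` family label are hypothetical. It also assumes the "mean value" is the mean of the -log affinities and that the middler is chosen from the members other than the topper and the lower, neither of which the slide spells out. The five Ki values are the CDK-2 inhibitors quoted elsewhere in this deck, grouped here purely for illustration.

```python
import math
from statistics import mean


def select_core_set(families, min_members=5, min_fold=100.0):
    """Pick topper / middler / lower for each family that qualifies.

    families: dict mapping a family label to a list of (pdb_code, pK) pairs,
    where pK is -log10 of Kd or Ki in mol/L.
    """
    min_span = math.log10(min_fold)  # 100-fold in affinity = 2 log units
    core = {}
    for label, members in families.items():
        if len(members) < min_members:
            continue
        ranked = sorted(members, key=lambda m: m[1])
        lower, topper = ranked[0], ranked[-1]
        if topper[1] - lower[1] < min_span:
            continue
        avg = mean(pk for _, pk in members)
        # Middler: closest to the mean among the remaining members.
        middler = min(ranked[1:-1], key=lambda m: abs(m[1] - avg))
        core[label] = {"topper": topper[0], "middler": middler[0], "lower": lower[0]}
    return core


# Ki values in nM, taken from the CDK-2 examples in the slides.
cdk2_ki_nm = {"1PXP": 220, "1E1X": 1300, "1E1V": 12000, "1PXN": 70, "1PXO": 2.0}
families = {
    "CDK2": [(code, -math.log10(ki * 1e-9)) for code, ki in cdk2_ki_nm.items()],
    "too_small": [("AAAA", 5.0), ("BBBB", 9.0)],  # placeholder codes, fewer than 5 members
}
core_set = select_core_set(families)
```

With these inputs the `"too_small"` family is dropped for having fewer than five members, and the `"CDK2"` family yields 1PXO as the topper, 1E1V as the lower, and 1PXP as the middler.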

Possible Applications of the PDBbind Database • Provide high-quality data sets for theoretical and computational studies on molecular recognition • Binding data available for protein-ligand, protein-protein, and protein-nucleic acid complexes • Specially compiled “refined set” and “core set” • Provide useful clues to medicinal chemists and other researchers for the discovery of bioactive small-molecule compounds or potential targets According to our literature survey, 30~40 applications of the PDBbind database are published each year.

A known target: • What ligands bind to it? • What do high-affinity ligands look like? • What do low-affinity ligands look like? Critical chemical moieties (pharmacophore): • Might these chemical moieties interact with other proteins (new targets or side effects)?

To Build an Integrated Platform for Data Mining Protein-Ligand & Nucleic Acid-Ligand Complexes [Diagram labels: Protein Data Bank; Data Compilation; 3D Structures; Binding Affinity Data; Complex Families; Data Mining Tools; Docking; Scoring; Binding site analysis; Pharmacophore Models; Chemical Databases; Useful Hits; Pharmaceutical Implications]

Answer to FAQ Why does PDBbind not provide experimental details in addition to the binding data? • Such information is not always given in the reference. • Retrieving such information takes a lot of extra effort, and it is difficult to format. • Users can always check the original reference if they really need such information.

Answer to FAQ What is the difference between PDBbind and Binding MOAD? Binding MOAD (Mother Of All Databases) was independently developed by Prof. Heather Carlson’s group at the University of Michigan, and was released to the public in 2005. Proteins: Structure, Function, and Bioinformatics, Volume 60, Issue 3, pages 333–340, 15 August 2005. Binding MOAD also collects binding data for protein-ligand complexes and is likewise based on a systematic mining of the Protein Data Bank. Thus, the contents of Binding MOAD overlap with part of PDBbind, and the two databases are similar in some technical aspects.

Summary: Significance of the PDBbind Database • More binding data: The latest version provides binding data for ~8,000 complexes • Systematic mining of the entire PDB • Covering all major categories of biomolecular complexes, not only selected protein-ligand complexes • Better in quality • Reasonable classification of biomolecular complexes • Binding affinity data carefully collected from original references • Regularly updated since the first public release in 2004. Binding data increase by 20~30% each year. • Widely popular: User-friendly web interface; over 1,600 registered users from some 40 countries across the world.

Acknowledgments Thanks to the following contributors in my group: Liu, Z.; Li, J.; Li, Y.; Li, X.; Lin, F. Special thanks to Prof. Shaomeng Wang and his group at the University of Michigan! The National Natural Science Foundation of China (NSFC), the Ministry of Science and Technology of China (MOST), and the Chinese Academy of Sciences (CAS).

Why Create the PDBbind Database? • Recognition and interactions between various types of molecules are essential, at the molecular level, to many biological processes: protein-small molecule binding; protein-protein binding; protein-nucleic acid binding.

Difficulty in Complex Classification As a matter of fact, most PDB entries contain multiple heterogen molecules in addition to the primary molecule (protein or nucleic acid). Is this a meaningful protein-ligand complex?

Difficulty in Complex Classification What are classified as “valid” small-molecule ligand molecules: • “Regular” organic molecules • Oligo-peptides containing < 10 amino acid residues • Oligo-nucleic acids containing < 4 nucleotides What are classified as “special” ligand molecules: • Cofactors/coenzymes: CoA, NAD, FAD, Heme & their derivatives What are classified as “junk” molecules: • Inorganic species • Organic solvents and buffer components • Saccharide molecules with high occurrences
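The three ligand categories above amount to a small set of rules. The following sketch shows one way to encode them. The `Heterogen` record, its `kind` labels, and the `classify_heterogen` function are hypothetical, and the cofactor codes are a short illustrative list that does not cover the "derivatives" mentioned on the slide.

```python
from dataclasses import dataclass

# Illustrative codes for CoA, NAD, FAD, and heme; derivatives are not covered.
COFACTOR_CODES = frozenset({"COA", "NAD", "FAD", "HEM"})
# Hypothetical labels for the three "junk" groups named on the slide.
JUNK_KINDS = frozenset({"inorganic", "solvent_or_buffer", "common_saccharide"})


@dataclass
class Heterogen:
    """Hypothetical description of one non-primary molecule in an entry."""
    kind: str          # e.g. "organic", "peptide", "nucleic_acid", or a junk kind
    code: str = ""     # residue code, if any
    n_units: int = 1   # amino acid residues or nucleotides


def classify_heterogen(h: Heterogen) -> str:
    if h.code.upper() in COFACTOR_CODES:
        return "special"
    if h.kind in JUNK_KINDS:
        return "junk"
    if h.kind == "organic":
        return "valid"
    if h.kind == "peptide":
        return "valid" if h.n_units < 10 else "not a small-molecule ligand"
    if h.kind == "nucleic_acid":
        return "valid" if h.n_units < 4 else "not a small-molecule ligand"
    return "unclassified"
```

For example, a 9-residue peptide is a valid ligand under these rules, while a 10-residue peptide is not treated as a small-molecule ligand.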

Difficulty in Complex Classification A complex may be classified into more than one category. [Figure: Protein A and Protein B together with a small-molecule ligand.] Is this a protein-protein complex or a protein-ligand complex?

www.pdbbind.org (Univ. Michigan) and www.pdbbind-cn.org (Shanghai Inst. Org. Chem.). The PDBbind database now has >1,600 registered users all over the world.

Process the Complex Structures • Split a complete “biological unit” of each complex into a protein molecule and a ligand molecule. • Save the protein molecule in the PDB format. • Remove redundant structural units; • Add hydrogen atoms; • Keep the water and metal ions with the protein. • Save the ligand molecule in the Mol2 format. • Correct atom types and bond types • Add hydrogen atoms and partial charges • Handle tautomers correctly These processed structural files can be readily utilized by most molecular modeling software.

My Scoring Function Tripod: Scoring Function Development; Scoring Function Assessment; The PDBbind-CN Database (a database of three-dimensional structures and binding affinities of protein-ligand complexes). (1) J. Comput.-Aided Mol. Des. 2002, 16, 11-26. (2) J. Med. Chem. 2003, 46, 2287-2303. (3) J. Med. Chem. 2004, 47, 2977-2980. (4) J. Chem. Inf. Comput. Sci. 2004, 44, 2114-2125. (5) J. Med. Chem. 2005, 48, 4111-4119. (6) Proteins 2006, 64, 1058-1068. (7) J. Chem. Theory Comput. 2008, 4, 1959–1973. (8) J. Chem. Inf. Model. 2009, 49, 1079-1093. (9) J. Chem. Inf. Model. 2009, 49, 1033-1048. (10) Mol. Informatics 2010, 29, 87-96. (11) J. Comp. Chem. 2010, 31, 2109-2125. (12) BMC Bioinformatics 2010, 11, 193-208. (13) J. Chem. Theory Comput. 2010, 6, 1852-1870. (14) J. Chem. Inf. Model. 2010, 50, 682–1692.

Some CDK-2 inhibitors recorded in PDBbind: 1PXP (Ki = 220 nM); 1E1X (Ki = 1300 nM); 1E1V (Ki = 12000 nM); 1PXN (Ki = 70 nM); 1PXO (Ki = 2.0 nM).
