Worldwide Protein Data Bank wwpdb

Worldwide Protein Data Bank www.wwpdb.org

Agenda • Welcome and Introductions • Overview of recent wwPDB progress • Introduction to the BMRB • Theoretical model policy • Issues for discussion and advice • Break • wwPDB group interactions • wwPDB plans for 2007 • Long term aims, funding, and stability • Executive session • Feedback to wwPDB • Set next meeting date (July 2007; Salt Lake City, UT?)

wwPDB Achievements August 2005-October 2006 • Continued growth of archive • Website updates • Publications and presentations • Time stamped archive • wwPDB team building • Annotation document • Remediation • BMRB formally a member of wwPDB

Deposition issues

The never ending story

Deposition since establishment of 3 sites

PDB entry processing • 1-1-2000: 10,997 entries in PDB • Today (1-Oct-2006): 39,323 entries in PDB • Total size is 3.6 times what it was when the 3 sites started • In 1999: 2361 entries deposited • In 2005: 6678 entries deposited • We handle 2.8 times as many entries per year with fewer staff, and all 3 sites produce high quality annotated PDB entries • NO CURRENT BACKLOG OF UN-PROCESSED ENTRIES
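The two ratios quoted on this slide can be rechecked from the counts it gives. A minimal sketch in Python; the variable names are chosen here for illustration only:

```python
# Counts quoted on the slide
entries_2000 = 10_997    # entries in PDB on 1-1-2000
entries_2006 = 39_323    # entries in PDB on 1-Oct-2006
deposited_1999 = 2_361   # entries deposited in 1999
deposited_2005 = 6_678   # entries deposited in 2005

# Ratios rounded to one decimal place, as on the slide
archive_growth = round(entries_2006 / entries_2000, 1)
throughput_growth = round(deposited_2005 / deposited_1999, 1)

print(f"Archive size ratio: {archive_growth}")
print(f"Annual deposition ratio: {throughput_growth}")
```

Both values agree with the 3.6 and 2.8 figures stated on the slide.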

Time-stamped copies of the archive • 24 Gbytes of data for 2005, released January 3, 2006 • Includes: • PDB format entries • mmCIF format entries • PDBML format entries • Experimental data • Dictionary, schema and format documentation

Outreach • wwPDB website • Publications and meetings

Joint publications and presentations • Nucleic Acids Research 2007 Database Issue • Ensuring a single, uniform archive of PDB data • Methods in Molecular Biology 2007 • Data deposition and annotation at the wwPDB • Nature Structural & Molecular Biology, 2006 • Is one solution good enough? (response) • CODATA (October 23-25, 2006; Beijing, China) • The Worldwide Protein Data Bank • Encyclopedia of Genomics, Proteomics, and Bioinformatics, 2005 • The Protein Data Bank and the wwPDB

The wwPDB Team

wwPDB interactions this year • Exchange visits • MSD/RCSB (6) (thanks to WT) • PDBj/RCSB (1) • BMRB/RCSB-PDB (3) • Phone conference with site directors twice a year • VTCs among staff • BMRB/RCSB twice a month (ADIT-NMR) • MSD/RCSB twice a week (annotation procedures, remediation) • Email among staff • MSD/RCSB ~2 per day • PDBj/RCSB ~2 per day

What is the PDB? • Content • Processes to ensure quality (annotation project)

Annotation project

Annotation project GOALS • Standardize annotation rules and policies among wwPDB sites • Document annotation rules and policies • Create venue to update annotation rules and policies as necessary

Annotation project How did we get there? • Review and discuss each PDB field by email and VTC • Write document and review by all staff • Final review by site directors • Implement software compliant to new annotation procedures • Test software and train annotators • Publish document on Web

Annotation project Resultant document • Specification of ALL fields in PDB file • Clarification of policies • Assignment of PDB IDs • Release of files and information • Changes to entries • Clarification of data representation • Chain ID for all atoms in the file • Multi-model representation for alternate conformation or disorder • Chimeras • Microheterogeneity

Remediation

Remediation: scope 34,528 Entries Checked • Primary citations • Sequences & taxonomy • Ligand stereochemistry and nomenclature • Symmetry and coordinate transformations for virus entries • Diffraction source & beamline • Miscellaneous uniformity issues

Remediation: statistics • Citations: • All primary citations checked • 8508 citations manually examined • 7037 citations confirmed and updated • Sequence and taxonomy: • 47917 sequences checked • 20068 updated sequence data references • 11087 taxonomic references updated • Virus entries • 250 entries checked and revised • Diffraction source • 10985 entries revised • Miscellaneous uniformity corrections • 1041 entries revised

Remediation: statistics • Ligand stereochemistry and nomenclature • 7568 ligand definitions checked • 1758 new ligand definitions added • 185 ligand definitions obsoleted • 152,000 ligand instances checked • 138,230 ligand instances OK • 6815 ligand instances renamed

Remediation process • Corrections contributed and reviewed by wwPDB members • Corrections on the archival mmCIF data files tracked in a version tracking system, CVS • New PDB exchange, PDBML and PDB format data files being produced now • Each wwPDB group will validate and load the resulting files into their database systems • Invited public testing will begin January 2007 • General availability will start April 2007

Remediation: Ligand dictionary rewrite • Model and idealized coordinates provided • Stereochemical configuration assignments • Aromatic atoms and bonds flagged • Definitions provided for “Chemistry Catalog” state with leaving atom candidates flagged • Nonstandard atom names revised (e.g. dinucleotides) • Duplicate ligand definitions marked as obsolete • Metal hydrate definitions obsoleted • Alternate atom name added to store legacy atom names • SMILES and InChI descriptors provided

Remediation: major entry level corrections • Citations: • PubMed identifiers provided where available • Unpublished citations checked and flagged • Sequence and taxonomy: • UniProt sequence database references • Taxonomies from NCBI Taxonomy database • Diffraction source • Synchrotron facility and beamline names consistently specified in coordination with BioSync

Remediation: major ATOM record changes • Nomenclature changes • IUPAC H-atom names for standard amino acids and nucleotides • DNA and RNA differentiated (AD (DNA) & A (RNA)) • Modified nucleotides expressed as 3-letter codes (removed +’s) • PDB asterisks replaced by single quotes in atom names • Noncompliant ligands flagged in data files
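As an illustration of the asterisk-to-single-quote change listed above, the sketch below rewrites only the atom-name field of a PDB-format coordinate record. It assumes the fixed-column PDB layout in which the atom name occupies columns 13-16; the function name is hypothetical, and this is not the software actually used for the remediation.

```python
def remediate_atom_name(record: str) -> str:
    """Replace '*' with a single quote in the atom-name field (columns 13-16).

    Illustrative helper only; records other than ATOM/HETATM are returned unchanged.
    """
    if not record.startswith(("ATOM", "HETATM")):
        return record
    atom_name = record[12:16].replace("*", "'")
    return record[:12] + atom_name + record[16:]


# Example record assembled field by field so the column positions are explicit
record = (
    "ATOM  "   # columns 1-6: record name
    "    1"    # columns 7-11: atom serial number
    " "        # column 12: blank
    " O5*"     # columns 13-16: atom name (legacy asterisk form)
    " "        # column 17: alternate location indicator
    "  A"      # columns 18-20: residue name
    " "        # column 21: blank
    "A"        # column 22: chain ID
    "   1"     # columns 23-26: residue sequence number
)

fixed = remediate_atom_name(record)
print(fixed[12:16])
```

Restricting the substitution to the atom-name columns keeps any asterisks elsewhere in the file (for example in REMARK text) untouched.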

Remediation: Major REMARK changes • Virus entries • Transformations from deposited frame to point symmetry and crystallographic frame provided • NCS and point symmetry transformations properly differentiated

EM standards • New dictionary for electron microscopy • MAP orientation conventions

BMRB John Markley

Introduction to the BMRB • BMRB is the worldwide archival site for biomolecular NMR data • NMR data related to structures are cross referenced to PDB entries • PDBj mirrors BMRB and supports external BMRB depositions • As RCSB members, BMRB and PDB have worked closely to capture and annotate NMR data associated with deposited coordinate sets • Recognizing that the biomolecular NMR community would be best served by having a “one stop” deposition system for NMR structures, BMRB has been pursuing this goal in collaboration with the RCSB-PDB • BMRB plans to institute the same policy with MSD EBI

wwPDB NMR experimental data flow [Diagram: deposited data move between sites as raw and processed NMR-STAR files. Sites shown: BMRB (deposition/processing/export; ADIT-NMR; central archive) • CERM-BMRB (export; mirror site) • PDBj-BMRB (deposition/processing/export; ADIT-NMR; mirror site) • MSD/EBI (deposition/export; CCPN) • RCSB-PDB (deposition; ADIT-NMR)]

Major developments related to BMRB’s role in the wwPDB • “One-stop” BMRB-PDB ADIT-NMR deposition site for structures and NMR data developed in collaboration with PDB is operational, with BMRB assigning PDB accession codes • Restraints database for legacy structures is nearing completion as part of the wwPDB “clean-up”; new tools to automate this process were developed in collaboration with MSD EBI • NMR-STAR v3 dictionary has been extended and released • Graphical interface with Jmol displays integrates PDB coordinate data with associated NMR parameters • BMRB is working with SG groups to improve efficiency of capturing protein NMR data • BMRB participates in the “PDB-BMRB Task Group on NMR”

New “one-stop” deposition of NMR structures/data

Deposition interface features • BMRB and RCSB-PDB depositions are now generated from a joint interface • BMRB interface has been streamlined • RCSB-PDB interface for NMR has been extended with optional fields for conformer and constraint statistics • Files in PDB format, mmCIF, and NMR-STAR can be uploaded to pre-populate a deposition • Many fields (e.g., experiment name, software name, software author) have pull-down lists to choose from for convenience and to improve uniformity • Fields common to multiple forms are linked to eliminate the need to retype information (e.g., uploaded data file names, author names, molecule names) • Help and examples have been improved

Restraints grid is keyed to NMR structural entries

Coordinated displays of NMR data and structures

Theoretical Models Policy Haruki Nakamura

Models • Define line between “pure” models and models based on data • Large experimental spectrum, e.g. X-ray, NMR, EM, SAXS, FRET models • Homology models, especially as derived from structural genomics • Need a way to archive models that is totally compatible with PDB

Defining a policy for models Workshop at Rutgers (November 19-20, 2005) • Attended by modelers, structural genomicists, electron microscopists • Policies and suggested implementations developed • Outcome published in Structure • “Outcome of a Workshop on Archiving Structural Models of Biological Macromolecules”, Helen M. Berman, Stephen K. Burley, Wah Chiu, Andrej Sali, Alexei Adzhubei, Philip E. Bourne, Stephen H. Bryant, Roland L. Dunbrack, Jr., Krzysztof Fidelis, Joachim Frank, Adam Godzik, Kim Henrick, Andrzej Joachimiak, Bernard Heymann, David Jones, John L. Markley, John Moult, Gaetano T. Montelione, Christine Orengo, Michael G. Rossmann, Burkhard Rost, Helen Saibil, Torsten Schwede, Daron M. Standley, John D. Westbrook, Structure, 2006 14/8:1211-1217.

Models Recommendations • PDB depositions will be restricted to atomic coordinates that are substantially determined by experimental measurements on specimens containing biological macromolecules. • A central, publicly available archive or portal should be established for models that are the explicit subject of peer review. • Methods for assessing model quality are essential for the integrity and long-term success of any publicly available model portal, either from a central repository or a set of linked resources. There was no consensus as to which single method or group of methods should be applied.

Proposed Portal for Multiple Databases for Protein Structures (Berman, H. et al. (2006) Structure, 14, 1211-1217) [Figure: a portal linking multiple theoretical model databases, labeled Theoretical Model DB 1 and Theoretical Model DB 2]

Characteristics of portal • Data Standards for Models • Access Models for a Central Portal of Models • The minimum contents for this portal require a unique identifier for each model registered with the system, each model's polypeptide chain sequence, and quality assessment information. • Additional information should be available, including: keywords, structural motifs, standard test sets of data, bound ligands, domains, flexibility, surface electrostatic properties, coding & noncoding SNPs, alternative splicing, oligomeric state, macromolecular interactions, literature references, subcellular localization, pathways, transcript profiling, & druggability. • Access to these data should be free and constantly available to a diverse worldwide user community of both model producers and users. Several levels of access are required for the different levels of users of the portal.
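One way to picture the minimum contents named above (a unique identifier, the polypeptide chain sequence, and quality assessment information) is as a small record type. This is only a sketch: the class, its field names, and the example values are hypothetical and do not come from the workshop report or any portal implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Hypothetical portal record holding the minimum contents listed on the slide."""
    model_id: str                # unique identifier for the registered model
    sequence: str                # polypeptide chain sequence
    quality_assessment: dict     # quality assessment information (format left open)
    annotations: dict = field(default_factory=dict)  # optional extras: keywords, ligands, ...

    def has_minimum_contents(self) -> bool:
        # A record is acceptable only if all three required items are non-empty
        return bool(self.model_id and self.sequence and self.quality_assessment)


# Made-up example values, for illustration only
complete = ModelRecord("MODEL-0001", "MKTAYIAKQR", {"method": "example-score", "value": 0.8})
incomplete = ModelRecord("MODEL-0002", "", {})

print(complete.has_minimum_contents(), incomplete.has_minimum_contents())
```

The optional `annotations` mapping stands in for the additional information on the slide, which the recommendations describe as desirable rather than required.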

Implementation of models policy • August 15, 2006: Policy announced with 60-day period of review • August 15-October 15, 2006: Transition Plan • All existing un-processed theoretical model entries, as well as entries deposited during this time, were not validated or processed. Entries will be released as-is without author review or corrections. • Authors had the choice of correcting their entries by withdrawing the original entry and then re-submitting the corrected version before October 15, 2006. • October 15, 2006: Theoretical model depositions no longer accepted

Discussion Issues Kim Henrick

SAXS - New EXP TYPE • Hamburg to provide templates for consideration

4-letter code? • The PDB 4-letter code can be extended by allowing alphanumeric characters in the 1st position, giving 35 x 36 x 36 x 36 = 1,632,960 combinations
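The count on this slide can be reproduced with a short calculation. The sketch below assumes that the factor of 35 comes from allowing 1-9 and A-Z in the first position (zero excluded) and that the other three positions take 0-9 or A-Z; the comparison with a digit-only first character is likewise an assumption about the existing scheme.

```python
import string

# Assumed alphabets (see lead-in): zero is excluded from the first position
first_extended = "123456789" + string.ascii_uppercase    # 35 symbols
first_digits_only = "123456789"                          # 9 symbols
other_positions = string.digits + string.ascii_uppercase # 36 symbols

extended_total = len(first_extended) * len(other_positions) ** 3
digits_only_total = len(first_digits_only) * len(other_positions) ** 3

print(f"Extended scheme: {extended_total:,} codes")
print(f"Digit-first scheme: {digits_only_total:,} codes")
```

Under these assumptions the extended scheme gives the 1,632,960 combinations stated on the slide, against 419,904 when the first character is restricted to a digit.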

Patent Office • The structures in the patent office may not represent a major loss of structures; current investigations indicate most patent structures are in the PDB. • A much larger set of structures, mostly ligand-bound, is held within pharmaceutical companies.

wwPDB SAC input request

What is a PDB Entry? Rules for the smallest structure that can be submitted • Carbohydrate chains? • How long is a peptide? (24) • Non-gene product macromolecular biological ligands (e.g. antibiotics)? Particular request from NMR depositors

Issues Annotation: Experimental Details • Twinning: twin factor in REMARK 3 requested, along with original un-twinned structure factors • TLS and conventional atomic B factor • Author-derived validation software and procedures/results: no longer accepted as in REMARK 42; now a REMARK to carry software used and function
