3D databases and data warehouse technology
Overview
• Overall strategy
• Terms and background
• Populating the databases
• Clean-up processes
• How can I use the database?
• What next
What is a database?
• By the term ‘database’ we refer to the system rather than the data
• Indexed file space
• Also used as a shorthand for a database management system (DBMS)
• Methods for accessing and changing data
• Controls for referential integrity
Normalisation
• Data fields in a normalised database appear only once

RESIDUE
CHAIN ID   SEQ   COMP ID
A          1     ASP
A          2     LYS
...        ...   ...

COMPONENT
ID    attr
ASP   -1
LYS   +1
...   ...

CHAIN
ID    attr
A     185
...   ...

• Data fields in a denormalised database are repeated in different places

RESIDUE
CHAIN ID   SEQ   COMP ID   CHAINattr   COMPattr
A          1     ASP       185         -1
A          2     LYS       185         +1
...        ...   ...       ...         ...

COMPONENT
ID    attr
ASP   -1
LYS   +1
...   ...

CHAIN
ID    attr
A     185
...   ...
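The relationship between the two layouts can be sketched in a few lines. This is a minimal illustration, assuming SQLite (through Python's sqlite3 module) purely so the example is self-contained; the lowercase table and column names are adapted from the slide, the meaning of the `attr` columns is not stated in the source, and the function name `denormalised_rows` is hypothetical.

```python
import sqlite3

# In-memory SQLite database: an assumption made so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chain     (id TEXT PRIMARY KEY, attr INTEGER);
CREATE TABLE component (id TEXT PRIMARY KEY, attr INTEGER);
CREATE TABLE residue (
    chain_id TEXT REFERENCES chain(id),
    seq      INTEGER,
    comp_id  TEXT REFERENCES component(id),
    PRIMARY KEY (chain_id, seq)
);
-- Normalised: each attribute value is stored exactly once.
INSERT INTO chain     VALUES ('A', 185);
INSERT INTO component VALUES ('ASP', -1), ('LYS', 1);
INSERT INTO residue   VALUES ('A', 1, 'ASP'), ('A', 2, 'LYS');
""")


def denormalised_rows(connection):
    """Join the normalised tables into the wide, repeated-value residue layout."""
    return connection.execute("""
        SELECT r.chain_id, r.seq, r.comp_id, ch.attr, co.attr
        FROM residue r
        JOIN chain ch     ON ch.id = r.chain_id
        JOIN component co ON co.id = r.comp_id
        ORDER BY r.chain_id, r.seq
    """).fetchall()


for row in denormalised_rows(conn):
    print(row)
```

In the joined result the chain attribute 185 appears on every residue row, which is the repetition the denormalised layout accepts in exchange for queries that need no joins.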
Structural hierarchy
• assembly
• molecule (entity)
• chain
• residue
ASU and assemblies
[Figure labels: assembly, ASU, chain, residues]
The pipeline
[Pipeline diagram; labels: PDB archive, manual edit, edited PDB, pdb, cif, archive DB, post-load processes, data warehouse, services, distribution]
The first steps
• A series of scripts
  • Parses non-standard header records
  • Fills in chain identifiers
  • Outputs a first cut clean file
• Manual editing
  • ~1000 entries require manual editing
• The result is a PDB format file that can be passed to the subsequent automatic steps
Bizarre errors … 1ew1
...
ATOM 47 N6 A A 2 2.068 5.433 -2.482
...
ATOM 59 1H6 A A 2 1.160 5.722 -2.818
ATOM 60 2H6 A A 2 2.901 5.700 -2.985
...

...
ATOM 47 N6 A A 2 2.068 5.433 -2.482
...
ATOM 59 1H6 A A 2 1.160 5.722 -2.818
ATOM 60 2H6 A A 2 2.901 5.700 2.985
...
Automatic processing
Process details
• Automatic cleanup (d2c)
  • Incorporates quaternary structure information
  • Runs a lot of checks and corrections
  • Outputs mmCIF file
• Loading
  • Metadata-driven custom loader
  • Load through views with insert triggers
  • Many heuristics also applied to data within these triggers
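Loading through a view with an insert trigger can be sketched as follows. This is an illustration only, assuming SQLite via Python's sqlite3 module rather than the database actually used; the table, view and trigger names, the entry identifier "1ABC" and the clean-up rule (trim whitespace and trailing full stops, normalise case) are all hypothetical stand-ins for the real heuristics, which the source does not describe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base table that holds the cleaned data (hypothetical schema).
CREATE TABLE entry_source (entry_id TEXT PRIMARY KEY, organism TEXT);

-- The loader writes to this view, never to the base table directly.
CREATE VIEW entry_source_load AS
    SELECT entry_id, organism FROM entry_source;

-- The INSTEAD OF trigger intercepts each insert on the view and applies
-- a simple heuristic before storing the row in the base table.
CREATE TRIGGER entry_source_load_insert
INSTEAD OF INSERT ON entry_source_load
BEGIN
    INSERT INTO entry_source (entry_id, organism)
    VALUES (lower(trim(NEW.entry_id)),
            upper(trim(NEW.organism, ' .')));
END;
""")

# The loader inserts raw legacy values; the trigger cleans them on the way in.
conn.execute(
    "INSERT INTO entry_source_load (entry_id, organism) VALUES (?, ?)",
    (" 1ABC ", "Escherichia coli. "),
)
stored = conn.execute("SELECT entry_id, organism FROM entry_source").fetchall()
print(stored)
```

Keeping the heuristics in the trigger means every load path that goes through the view gets the same clean-up, whichever script performs the insert.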
Using reference data
Variants found in legacy data:
$COLI
COLI
E. COLI
E.COLI
ESCHERCHIA COLI
ESCHERICHI $COLI
ESCHERICHIA $ COLI
ESCHERICHIA $COLI
ESCHERICHIA COLI
ESCHERICHIA COLI.
EXCHERICHIA COLI
EXPRESCHERICHIA COLI
• Variations in legacy data
  • Hinders accurate searches
  • Hinders links to other services
• Match data against controlled vocabularies
  • Within scripts
  • Within database during load
• Semi-automated
  • Use string matching algorithms
  • Effective when controlled vocabulary well maintained
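Matching a legacy organism string against a controlled vocabulary might look like the sketch below. It uses Python's standard difflib module as a stand-in for whatever string matching algorithm was actually used; the three-entry vocabulary, the 0.85 cutoff, the function name `match_organism` and the decision to treat "$" and "." as noise are all assumptions made for illustration.

```python
import difflib

# Hypothetical controlled vocabulary; a real one would come from a taxonomy resource.
CONTROLLED_VOCABULARY = [
    "ESCHERICHIA COLI",
    "HOMO SAPIENS",
    "SACCHAROMYCES CEREVISIAE",
]


def match_organism(raw, vocabulary=CONTROLLED_VOCABULARY, cutoff=0.85):
    """Return (term, score) for the best vocabulary match, or (None, 0.0)."""
    # Normalise case, drop '$' and '.' (assumed to be noise), collapse whitespace.
    cleaned = " ".join(raw.upper().replace("$", " ").replace(".", " ").split())
    if cleaned in vocabulary:
        return cleaned, 1.0
    candidates = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    if not candidates:
        # Nothing close enough: leave the value for manual review.
        return None, 0.0
    best = candidates[0]
    return best, difflib.SequenceMatcher(None, cleaned, best).ratio()


for legacy in ["EXCHERICHIA COLI", "ESCHERICHIA $ COLI", "E. COLI"]:
    print(legacy, "->", match_organism(legacy))
```

Misspellings fall within the cutoff, while an abbreviation such as "E. COLI" does not and is returned unmatched, which is where the semi-automated step (a synonym list or a curator) would take over.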
Chemical Components
• More difficult to deal with
• Where coordinates and nomenclature do not agree, we have to make a judgement on which, if either, is correct
• We maintain a curated database of compounds, against which legacy data is compared
• Atom nomenclature – ongoing; relatively easy to correct where the compound has been correctly identified
• Stereochemistry – may indicate that the compound name is incorrect
Ligand nomenclature
• Ligands are often named inconsistently or even entirely incorrectly, e.g. α-D-mannose (MAN) vs. β-D-mannose (BMA)
• Errors are detected using a graph-based structure comparison algorithm
[Figure: structures of MAN and BMA]
Not all cases resolvable
1d7t: DTY 4 in chain A, model 1 - is it D or L?

HEADER DE NOVO PROTEIN 19-OCT-99 1D7T
TITLE NMR STRUCTURE OF AN ENGINEERED CONTRYPHAN CYCLIC PEPTIDE
TITLE 2 (MOTIF CPXXPXC)
...
MODRES 1D7T DTY A 4 TYR D-TYROSINE
...
HET DTY A 4 21
...
HETNAM DTY D-TYROSINE
...
FORMUL 1 DTY C9 H11 N1 O3
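One common way to read the handedness of an amino acid from its coordinates is the signed chiral volume at the alpha carbon. The sketch below is not taken from the presentation; it assumes an alpha-amino acid with N, CA, C and CB atoms present, the function names and the 0.5 Å³ tolerance are hypothetical, and the coordinates are made-up idealised values, not taken from 1d7t.

```python
def chiral_volume(n, ca, c, cb):
    """Signed volume (N - CA) . ((C - CA) x (CB - CA)), in cubic angstroms."""
    a = tuple(p - q for p, q in zip(n, ca))
    b = tuple(p - q for p, q in zip(c, ca))
    d = tuple(p - q for p, q in zip(cb, ca))
    cross = (
        b[1] * d[2] - b[2] * d[1],
        b[2] * d[0] - b[0] * d[2],
        b[0] * d[1] - b[1] * d[0],
    )
    return a[0] * cross[0] + a[1] * cross[1] + a[2] * cross[2]


def classify_alpha_carbon(n, ca, c, cb, tolerance=0.5):
    """Return 'L', 'D' or 'ambiguous' from the sign of the chiral volume."""
    volume = chiral_volume(n, ca, c, cb)
    if abs(volume) < tolerance:
        # Near-planar geometry: the coordinates alone do not settle the question.
        return "ambiguous"
    # With this atom order, L-amino acids give a positive volume.
    return "L" if volume > 0 else "D"


# Made-up idealised tetrahedral geometry around a CA placed at the origin.
ca = (0.0, 0.0, 0.0)
n = (-1.225, -0.707, -0.5)
c = (0.0, 1.414, -0.5)
cb = (1.225, -0.707, -0.5)


def mirror(point):
    """Reflect a point through the xy plane, which inverts the handedness."""
    return (point[0], point[1], -point[2])


print(classify_alpha_carbon(n, ca, c, cb))
print(classify_alpha_carbon(mirror(n), ca, mirror(c), mirror(cb)))
```

A check like this only reports what the coordinates say. When the result disagrees with the residue name, deciding whether the name or the coordinates are wrong remains a judgement call, and distorted geometry lands in the "ambiguous" branch.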
Post-load processing
Process details
• Post-load processes derive data and build cross-links to other services
• Geometric information
• Analysing non-polymer components and assembling full entities from individual components
• Links to taxonomy and sequence databases
Transformation to DW
Process details
• Set of SQL scripts
• Supports Oracle (routinely) and MySQL (development)
• Periodically undertake full transform: takes a couple of weeks
• Provide weekly incremental patches: much faster
• Supports transforms into different data marts
Coming soon …
• Continuing cleanup
• HET group curation
• Sequence cross-references
• Citations
• More choice on downloads
• Data marts (even single tables)
• Groups of entries
• Release of clean PDB files (end 2006)
Who did what