3D databases and data warehouse technology

Overview
• Overall strategy
• Terms and background
• Populating the databases
• Clean up processes
• How can I use the database?
• What next

What is a database?
• By the term ‘database’ we refer to the system rather than the data
• Indexed file space
• Also used as a shorthand for a database management system (DBMS)
• Methods for accessing and changing data
• Controls for referential integrity

Normalisation

• Data fields in a normalised database appear only once

  RESIDUE
  CHAIN ID   SEQ   COMP ID
  A          1     ASP
  A          2     LYS
  ...        ...   ...

  COMPONENT
  ID    attr
  ASP   -1
  LYS   +1
  ...   ...

  CHAIN
  ID    attr
  A     185
  ...   ...

• Data fields in a denormalised database are repeated in different places

  RESIDUE
  CHAIN ID   SEQ   COMP ID   COMPattr   CHAINattr
  A          1     ASP       -1         185
  A          2     LYS       +1         185
  ...        ...   ...       ...        ...

  COMPONENT
  ID    attr
  ASP   -1
  LYS   +1
  ...   ...

  CHAIN
  ID    attr
  A     185
  ...   ...
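The difference between the two layouts can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names are adapted from the slide, the column types are assumptions, and SQLite stands in for whichever DBMS is actually used. The normalised tables store each attribute once, and the denormalised RESIDUE rows are derived by a join.

```python
import sqlite3

# Join that repeats the component and chain attributes on every residue row,
# i.e. the denormalised RESIDUE layout from the slide.
DENORMALISED_QUERY = """
SELECT r.chain_id, r.seq, r.comp_id, c.attr AS comp_attr, ch.attr AS chain_attr
FROM residue r
JOIN component c ON c.id = r.comp_id
JOIN chain ch ON ch.id = r.chain_id
ORDER BY r.chain_id, r.seq
"""

def build_normalised():
    """Create the three normalised tables in memory (column types are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE chain (id TEXT PRIMARY KEY, attr INTEGER);
    CREATE TABLE component (id TEXT PRIMARY KEY, attr TEXT);
    CREATE TABLE residue (
        chain_id TEXT REFERENCES chain(id),
        seq INTEGER,
        comp_id TEXT REFERENCES component(id),
        PRIMARY KEY (chain_id, seq)
    );
    INSERT INTO chain VALUES ('A', 185);
    INSERT INTO component VALUES ('ASP', '-1'), ('LYS', '+1');
    INSERT INTO residue VALUES ('A', 1, 'ASP'), ('A', 2, 'LYS');
    """)
    return conn

def denormalised_rows(conn):
    """Return residue rows with component and chain attributes repeated."""
    return conn.execute(DENORMALISED_QUERY).fetchall()

if __name__ == "__main__":
    conn = build_normalised()
    for row in denormalised_rows(conn):
        print(row)
```

In the normalised form, changing the chain attribute means updating a single row in `chain`. A stored denormalised table would need the same change on every residue row of that chain.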

Structural hierarchy
assembly → molecule (entity) → chain → residue

ASU and assemblies
[Diagram labels: assembly, ASU, chain, residues]

The pipeline
[Pipeline diagram; labels: archive PDB, services, pdb, data warehouse, edited PDB, archive DB, pdb, cif, post-load processes, manual edit, distribution]

The first steps
[Pipeline diagram]

The first steps
• A series of scripts
  • Parses non-standard header records
  • Fills in chain identifiers
  • Outputs a first-cut clean file
• Manual editing
  • ~1000 entries require manual editing
• The result is a PDB format file that can be passed to the subsequent automatic steps

Bizarre errors …

1ew1

  ...
  ATOM 47 N6 A A 2 2.068 5.433 -2.482
  ...
  ATOM 59 1H6 A A 2 1.160 5.722 -2.818
  ATOM 60 2H6 A A 2 2.901 5.700 -2.985
  ...

  ...
  ATOM 47 N6 A A 2 2.068 5.433 -2.482
  ...
  ATOM 59 1H6 A A 2 1.160 5.722 -2.818
  ATOM 60 2H6 A A 2 2.901 5.700 2.985
  ...
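The two 1ew1 excerpts differ only in the sign of the z coordinate of atom 2H6. A simple geometric check is enough to tell which version is implausible. The sketch below is an illustration only, not the check actually used in the pipeline; the coordinates are copied from the slide, and the 1.2 Å cutoff for an N-H bond is an assumed value.

```python
import math

# Coordinates (x, y, z in Å) copied from the two 1ew1 excerpts on the slide.
N6 = (2.068, 5.433, -2.482)
EXCERPT_NEGATIVE_Z = {"1H6": (1.160, 5.722, -2.818), "2H6": (2.901, 5.700, -2.985)}
EXCERPT_POSITIVE_Z = {"1H6": (1.160, 5.722, -2.818), "2H6": (2.901, 5.700, 2.985)}

MAX_NH_BOND = 1.2  # assumed upper limit for an N-H bond length, in Å

def suspicious_hydrogens(heavy_atom, hydrogens, max_bond=MAX_NH_BOND):
    """Return names of hydrogens lying too far from their parent atom."""
    return [
        name
        for name, xyz in hydrogens.items()
        if math.dist(heavy_atom, xyz) > max_bond
    ]

if __name__ == "__main__":
    print(suspicious_hydrogens(N6, EXCERPT_NEGATIVE_Z))
    print(suspicious_hydrogens(N6, EXCERPT_POSITIVE_Z))
```

With z = 2.985 the hydrogen sits more than 5 Å from N6, so it is flagged. With z = -2.985 both hydrogens are about 1 Å away and nothing is reported.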

Automatic processing
[Pipeline diagram]

Process details
• Automatic cleanup (d2c)
  • Incorporates quaternary structure information
  • Runs many checks and corrections
  • Outputs an mmCIF file
• Loading
  • Metadata-driven custom loader
  • Loads through views with insert triggers
  • Many heuristics are also applied to the data within these triggers
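Loading through a view with an insert trigger can be sketched as follows. This is a minimal illustration, not the actual loader. SQLite (via Python's sqlite3 module) stands in for the production database. The table, view and trigger names, the entry identifier and the clean-up rule inside the trigger are all hypothetical.

```python
import sqlite3

def build_loader():
    """Create a base table plus a loading view with an INSTEAD OF INSERT trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE source_organism (entry_id TEXT, organism TEXT);

    -- The loader writes to this view, never directly to the table.
    CREATE VIEW load_source_organism AS
        SELECT entry_id, organism FROM source_organism;

    -- The trigger applies a (hypothetical) clean-up heuristic on the way in.
    CREATE TRIGGER load_source_organism_insert
    INSTEAD OF INSERT ON load_source_organism
    BEGIN
        INSERT INTO source_organism (entry_id, organism)
        VALUES (lower(NEW.entry_id), upper(trim(NEW.organism)));
    END;
    """)
    return conn

def load(conn, entry_id, organism):
    """Insert one record through the loading view."""
    conn.execute(
        "INSERT INTO load_source_organism (entry_id, organism) VALUES (?, ?)",
        (entry_id, organism),
    )

if __name__ == "__main__":
    conn = build_loader()
    load(conn, "0XYZ", "  escherichia coli ")  # placeholder entry ID
    print(conn.execute("SELECT * FROM source_organism").fetchall())
```

Putting the heuristics in the trigger keeps the loader itself generic: it only needs to know which view to insert into, which fits a metadata-driven design.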

Using reference data

  $COLI
  COLI
  E. COLI
  E.COLI
  ESCHERCHIA COLI
  ESCHERICHI $COLI
  ESCHERICHIA $ COLI
  ESCHERICHIA $COLI
  ESCHERICHIA COLI
  ESCHERICHIA COLI.
  EXCHERICHIA COLI
  EXPRESCHERICHIA COLI

• Variations in legacy data
• Hinders accurate searches
• Hinders links to other services
• Match data against controlled vocabularies
• Within scripts
• Within database during load
• Semi-automated
• Use string matching algorithms
• Effective when controlled vocabulary well maintained
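Semi-automated matching against a controlled vocabulary can be sketched with Python's standard difflib module. This is an illustration under stated assumptions rather than the algorithm actually used. The three-entry vocabulary, the normalisation rules and the 0.8 similarity cutoff are all hypothetical. Exact matches after normalisation are accepted, close matches are returned but flagged for a curator, and anything else is left unresolved.

```python
import difflib

# Hypothetical controlled vocabulary; a real one would come from a
# maintained taxonomy resource.
VOCABULARY = ["ESCHERICHIA COLI", "HOMO SAPIENS", "SACCHAROMYCES CEREVISIAE"]

def match_organism(raw, vocabulary=VOCABULARY, cutoff=0.8):
    """Return (matched_term_or_None, needs_review) for a legacy organism string."""
    # Normalise case, stray '$' and '.' characters, and repeated whitespace.
    cleaned = " ".join(raw.upper().replace("$", " ").replace(".", " ").split())
    if cleaned in vocabulary:
        return cleaned, False
    candidates = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], True  # fuzzy hit: a curator should confirm it
    return None, True  # no acceptable match: needs manual resolution

if __name__ == "__main__":
    for raw in ["ESCHERICHIA $COLI", "EXCHERICHIA COLI", "COLI"]:
        print(raw, "->", match_organism(raw))
```

Short forms such as "COLI" fall below the cutoff and remain unresolved. This is one reason the process is semi-automated and depends on a well-maintained vocabulary.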

Chemical Components
• More difficult to deal with
• Where coordinates and nomenclature do not agree, we have to make a judgement on which, if either, is correct
• We maintain a curated database of compounds, against which legacy data is compared
• Atom nomenclature: ongoing; relatively easy to correct where the compound has been correctly identified
• Stereochemistry: may indicate that the compound name is incorrect

Ligand nomenclature
• Ligands are often named inconsistently or even entirely incorrectly, e.g. α-D-mannose (MAN) vs. β-D-mannose (BMA)
• Errors are detected using a graph-based structure comparison algorithm
[Figure labels: MAN, BMA]

Not all cases resolvable

1d7t: DTY 4 in chain A, model 1 - is it D or L?

  HEADER DE NOVO PROTEIN 19-OCT-99 1D7T
  TITLE NMR STRUCTURE OF AN ENGINEERED CONTRYPHAN CYCLIC PEPTIDE
  TITLE 2 (MOTIF CPXXPXC)
  ...
  MODRES 1D7T DTY A 4 TYR D-TYROSINE
  ...
  HET DTY A 4 21
  ...
  HETNAM DTY D-TYROSINE
  ...
  FORMUL 1 DTY C9 H11 N1 O3

Post-load processing
[Pipeline diagram]

Process details
• Involved in deriving data and building crosslinks to other services
• Geometric information
• Analysing non-polymer components and assembling full entities from individual components
• Links to taxonomy and sequence databases

Transformation to DW
[Pipeline diagram]

Process details
• Set of SQL scripts
• Supports Oracle (routinely) and MySQL (development)
• Periodically undertake a full transform
  • Takes a couple of weeks
• Provide weekly incremental patches
  • Much faster
• Supports transforms into different data marts

Coming soon …
• Continuing cleanup
  • HET group curation
  • Sequence cross-references
  • Citations
• More choice on downloads
  • Data marts (even single tables)
  • Groups of entries
• Release of clean PDB files (end 2006)

Who did what
[Pipeline diagram]
