A community database for biological research Christoph Best European Bioinformatics Institute, Cambridge, UK Matthew T. Dougherty NCMI - Baylor College of Medicine Houston, Texas. Digital Archives for Molecular Microscopy. Bioimage Informatics. Informatics in support of biological imaging
A community database for biological research
Christoph Best, European Bioinformatics Institute, Cambridge, UK
Matthew T. Dougherty, NCMI - Baylor College of Medicine, Houston, Texas
Digital Archives for Molecular Microscopy
Bioimage Informatics
Informatics in support of biological imaging
• Why? Image data rapidly increasing
  • (Confocal) fluorescence microscopy (Cellular Biology)
  • EMDB: electron microscopy (Structural Biology)
  • High-throughput methods (Genome Biology)
• Enabling science by making data accessible, reliable, and understandable
• Quality assessment, standards & conventions, public databases
Images: S. Haertel, U. Chile; EMDB, EBI; J. Swedlow, U. Dundee (Open Microscopy Environment)
Structural Databases at EBI
• Protein Data Bank (PDB)
  • Atomic structures (positions of atoms)
  • PDB file format, mmCIF
  • Derived from X-ray crystallography
  • Long tradition, curated database
  • Huge: 65,000+ entries, 3 wwPDB sites
• Electron Microscopy Data Bank (EMDB)
  • Part of PDB at EBI and Rutgers
  • 600 density maps of macromolecular structures and subcellular complexes
  • Started 2002
  • Curated, but limited metadata, experiment info
  • XML-based
Electron microscope From Schweikert, 2004 Biocenter, U Helsinki
Single-particle method
• Molecular structure
• Many images computationally combined
• 3D from 2D
• Resolution increase by averaging
Tripeptidyl-peptidase II (TPP II), courtesy of B. Rockel, Martinsried
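The benefit of combining many images can be sketched with a toy calculation. This is a minimal illustration, not the reconstruction method used in the work cited on the slide: it assumes perfectly aligned copies of a hypothetical 1D "particle" profile with independent Gaussian noise, and it shows only that the residual noise of the average falls roughly as 1/sqrt(N). All names and parameter values are made up for the example.

```python
import random

def noisy_copy(signal, sigma, rng):
    """Return the signal with independent Gaussian noise added to every pixel."""
    return [s + rng.gauss(0.0, sigma) for s in signal]

def average_images(images):
    """Pixel-wise mean of equally sized, already aligned images."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]

def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the noise-free signal."""
    return (sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)) ** 0.5

rng = random.Random(0)                           # fixed seed for repeatability
truth = [0.0] * 200 + [1.0] * 200 + [0.0] * 200  # hypothetical 1D particle profile
sigma = 2.0                                      # noise much larger than the signal

errors = {}
for n_images in (1, 25, 400):
    stack = [noisy_copy(truth, sigma, rng) for _ in range(n_images)]
    errors[n_images] = rms_error(average_images(stack), truth)

for n_images, err in errors.items():
    print(f"{n_images:4d} images: RMS error {err:.3f}, sigma/sqrt(N) = {sigma / n_images ** 0.5:.3f}")
```

Real single-particle processing also has to determine the orientation and alignment of each particle image before averaging, which this sketch deliberately leaves out.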
Single-particle analysis: GroEL to 4 Å (Ludtke et al., Structure 2008)
Data Management Issues
• Initial EM images: O(1000), 4k x 4k -> O(10 GPixel)
• Particle stacks: O(100,000), 256 x 256 -> O(10 GPixel)
• Final data set: 1 MVoxel (small)
• Processing power: O(100) cores, some weeks, lab-owned clusters
• Software: 1970s FORTRAN codes, 1990s C codes; fragmented communities, lack of standards
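The order-of-magnitude figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "4k" means 4096 pixels per side and takes the counts of 1000 micrographs and 100,000 particles literally; the helper name is made up for the example.

```python
def gigapixels(count, *dims):
    """Total number of pixels (in units of 1e9) for `count` images of the given dimensions."""
    total = count
    for d in dims:
        total *= d
    return total / 1e9

micrographs = gigapixels(1000, 4096, 4096)   # initial EM images, assuming 4k = 4096
particles = gigapixels(100_000, 256, 256)    # boxed-out particle stack
final_map_voxels = 1e6                       # "1 MVoxel" final density map

print(f"micrographs:    {micrographs:.1f} GPixel")
print(f"particle stack: {particles:.1f} GPixel")
print(f"raw pixels per final voxel: {micrographs * 1e9 / final_map_voxels:.0f}")
```

Under these assumptions the raw micrographs come to about 16.8 GPixel and the particle stack to about 6.6 GPixel, both consistent with the O(10 GPixel) estimates, while the deposited map is roughly four orders of magnitude smaller.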
Electron tomography
• 3D reconstruction by taking a series of images from different angles
• Difficulty: nanometer accuracy
• Problems:
  • Limited tilt range ↔ missing wedge ⇒ distortion
  • Imperfections of the tilt ↔ alignment ⇒ limited resolution
• Computational reconstruction algorithms
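The size of the missing wedge follows directly from the tilt range. As a rough sketch, assuming an idealized single-axis tilt series covering -θ to +θ, the unsampled fraction of Fourier space is (90° - θ)/90°. The ±60° range and 2° increment used below are illustrative values, not taken from the slides.

```python
def missing_wedge_fraction(max_tilt_deg):
    """Fraction of Fourier space left unsampled by a single-axis tilt series
    covering -max_tilt_deg .. +max_tilt_deg (idealized geometry)."""
    if not 0 <= max_tilt_deg <= 90:
        raise ValueError("max tilt must lie between 0 and 90 degrees")
    return (90.0 - max_tilt_deg) / 90.0

def tilt_angles(max_tilt_deg, step_deg):
    """Tilt angles of a symmetric series with a constant angular increment."""
    n = int(round(2 * max_tilt_deg / step_deg)) + 1
    return [-max_tilt_deg + i * step_deg for i in range(n)]

angles = tilt_angles(60, 2)   # hypothetical acquisition scheme
print(f"{len(angles)} projections from {angles[0]} to {angles[-1]} degrees")
print(f"missing wedge: {missing_wedge_fraction(60):.1%} of Fourier space")
```

With a ±60° range, one third of Fourier space is never measured, which is the origin of the distortion noted above.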
Tomography of eukaryotic cells: Dictyostelium discoideum (panels: projection, slice). O. Medalia et al., Science, 2002
Image enhancement (before): cytoskeleton of Spiroplasma melliferum. J. Kürner et al., Science, 2005
Image enhancement (after); yellow: geodetic line. J. Kürner et al., Science, 2005
Automated image analysis: automatic segmentation to identify points/lines/surfaces (panels: automatic, manual). A. Linaroudis, Ph.D. thesis, 2006
Data Management Issues
• Original data: 60 images, 8k x 8k -> O(4 GPixel)
• Reconstruction: 8k x 8k x 256 -> O(16 GPixel)?
• Software: 1970s algorithm in 1990s software
• Visualization: “let's buy more memory”
• Future: web-based applications (Google Maps)?
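A map-style web viewer avoids loading a whole image into memory by serving small tiles from a multi-resolution pyramid. The sketch below estimates the size of such a pyramid for one 8k x 8k slice. It assumes 8k = 8192 pixels, a 256-pixel tile, and halving of the resolution at each level; these are common choices for tiled viewers, not details given in the slides.

```python
import math

def pyramid_levels(width, height, tile=256):
    """List (width, height, tiles_x, tiles_y) for each level of a tile pyramid,
    from full resolution down to the first level that fits in a single tile."""
    levels = []
    w, h = width, height
    while True:
        levels.append((w, h, math.ceil(w / tile), math.ceil(h / tile)))
        if w <= tile and h <= tile:
            break
        w, h = math.ceil(w / 2), math.ceil(h / 2)
    return levels

levels = pyramid_levels(8192, 8192)
total_tiles = sum(tx * ty for _, _, tx, ty in levels)

for w, h, tx, ty in levels:
    print(f"{w:5d} x {h:5d}: {tx:3d} x {ty:3d} tiles")
print(f"{len(levels)} levels, {total_tiles} tiles per slice")
```

Under these assumptions a slice needs 6 levels and 1365 tiles, about one third more data than the full-resolution image alone, and a client only ever fetches the handful of tiles in view.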
The Electron Microscopy Data Bank
• contains EM-derived density maps
• complementary to coordinate sets in PDB
• established 2002 @ EBI (Kim Henrick)
• web-based submission and retrieval
• hand-curated (R. Newman)
• A bit like eBay, and you won't make any money, either
A Unified Data Resource for EM
• NIH-funded joint project
• Baylor College of Medicine, Houston, TX (W. Chiu, M. Baker)
• Rutgers University, Piscataway, NJ (H. Berman, C. Lawson)
• PDBe, European Bioinformatics Institute, Cambridge, UK (K. Henrick, C. Best, R. Newman)
Characteristics
• Curated community archive: PDB and EMDB
• NIH, EU (in past), and BBSRC funding (+ EMBL)
• Worldwide cooperation
• Advisory boards and task forces from the community
• Open deposition and retrieval → alternative access systems by other institutions
• 760 entries, 26 GB data
• ca. 100 entries/year
• Curation both in Europe and US
EMDep deposition system (Java/Tomcat/XML)
• 750 entries, current rate approx. 15-20/month
• Contents of an entry:
  • Metadata (XML header) → experimental metadata
  • Map (any format, converted to CCP4/MRC)
  • Additional files
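To give a feel for the map part of an entry, the sketch below reads the leading fields of an MRC-style header. It follows the MRC2014 layout as commonly documented (a 1024-byte header beginning with the 32-bit integers NX, NY, NZ, MODE, and the string "MAP " at byte offset 208) and assumes little-endian data. It is not the EMDB conversion code, and older or software-specific variants may deviate from this layout. The header is synthesized in memory so the example is self-contained.

```python
import struct

HEADER_SIZE = 1024
# Bytes per voxel for the real-valued MRC2014 modes: int8, int16, float32, uint16.
MODE_BYTES = {0: 1, 1: 2, 2: 4, 6: 2}

def make_header(nx, ny, nz, mode):
    """Build a minimal synthetic header for demonstration (all other fields zero)."""
    header = bytearray(HEADER_SIZE)
    struct.pack_into("<4i", header, 0, nx, ny, nz, mode)
    header[208:212] = b"MAP "
    return bytes(header)

def read_header(header):
    """Extract grid size and voxel mode; assumes little-endian MRC2014 layout."""
    if len(header) < HEADER_SIZE:
        raise ValueError("header shorter than 1024 bytes")
    if header[208:212] != b"MAP ":
        raise ValueError("missing 'MAP ' identifier (older or non-standard variant?)")
    nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
    if mode not in MODE_BYTES:
        raise ValueError(f"unsupported mode {mode}")
    return {"nx": nx, "ny": ny, "nz": nz, "mode": mode,
            "data_bytes": nx * ny * nz * MODE_BYTES[mode]}

info = read_header(make_header(100, 100, 100, 2))   # a 1 MVoxel float32 map
print(info)
```

A 100 x 100 x 100 float32 map occupies 4,000,000 bytes of voxel data, which is why the final deposited map is small compared with the raw images.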
EMDB search system Java/Tomcat
EMDB Atlas pages XSLT
Metadata management
• Difficult: many rounds of consulting the community
• Still, most fields remain empty
• Data harvesting
  • LIMS, PIMS -> rarely used
  • Processing pipelines, image processing software -> lack of standards, idiosyncrasies
• Image formats: appalling lack of standards
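The problem of empty fields is easy to quantify once the header is machine-readable. The following sketch counts filled and empty leaf elements in an XML header. The element names, attributes, and the entry id are hypothetical and do not reproduce the actual EMDB schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical header: tag names and the id are made up for this example.
SAMPLE_HEADER = """<entry id="EMD-0000">
  <title>Example map</title>
  <microscope></microscope>
  <voltage units="kV">300</voltage>
  <detector/>
  <processing>
    <software></software>
    <resolution units="A">8.0</resolution>
  </processing>
</entry>"""

def field_report(xml_text):
    """Return (filled, empty) lists of leaf element tags in document order."""
    root = ET.fromstring(xml_text)
    filled, empty = [], []
    for elem in root.iter():
        if len(elem) > 0:        # skip container elements that only hold children
            continue
        has_text = bool((elem.text or "").strip())
        (filled if has_text else empty).append(elem.tag)
    return filled, empty

filled, empty = field_report(SAMPLE_HEADER)
print("filled:", filled)
print("empty: ", empty)
```

A report like this could be run over all deposited headers to see which fields depositors skip most often, which is one input to deciding what harvesting from processing software should fill in automatically.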
Data issues
• Current: deposit final result of experiment and computation
• How much of original/intermediate data should be deposited?
• Issues:
  • Cost / practicability
  • Reproducibility of experiment
  • Intellectual property (un-exploited results?)
  • Usefulness
Non-data issues
• Embargo: image data can be withheld up to two years
  • Allows original researcher to further exploit them
  • Journals and funders must define what data must be deposited and when they are to be released
• Quality standards:
  • Require community acceptance
  • Technically difficult
  • Data bank does enrich/annotate, but does not do science → quality standards must be set by scientists
Image data formats
• Current: variety of historical ad hoc formats
  • Unclear definitions, variations in different software
• Need: interoperability standards
  • Technical level? Acceptance? → question for the community
• HDF5
  • Common container format to deal with numerical data
  • Heavyweight library, but widely available (but Java?)
  • Would at least solve low-level format problems
  • Metadata format still needs to be specified
Ontologies
• Systematic way to define
  • classes of objects
  • attributes of these objects
  • relationships between objects
• Provides framework for metadata models
• Advantage: powerful formal method
• Disadvantage: not yet widely used
Rich data sets
• Submissions consist of maps (increasingly more than one); relations between data sets → unexpressed
• XML-based standards for representing relationships between data: subject-predicate-object relationships (RDF framework)
• Harvesting interface to EM processing software
• Web-based visualization for submission and retrieval; complex submissions assembled interactively (AJAX)
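Subject-predicate-object relationships between the parts of a submission can be sketched with plain tuples. This is only an illustration of the idea: it does not use an RDF library or an RDF serialization, and the identifiers and predicate names ("derivedFrom", "fittedInto") are hypothetical rather than taken from an existing vocabulary.

```python
# Each relationship is a (subject, predicate, object) triple.
triples = [
    ("map:final", "derivedFrom", "stack:particles"),
    ("stack:particles", "derivedFrom", "images:micrographs"),
    ("map:masked", "derivedFrom", "map:final"),
    ("model:fitted", "fittedInto", "map:final"),
]

def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def provenance(triples, subject, predicate="derivedFrom"):
    """Follow a predicate transitively and return every ancestor of `subject`."""
    chain, seen, frontier = [], {subject}, [subject]
    while frontier:
        current = frontier.pop(0)
        for obj in objects(triples, current, predicate):
            if obj not in seen:          # guard against cycles
                seen.add(obj)
                chain.append(obj)
                frontier.append(obj)
    return chain

print(provenance(triples, "map:masked"))
print(objects(triples, "model:fitted", "fittedInto"))
```

Once relations are stored this way, a question such as "which raw data does this masked map ultimately come from?" becomes a simple graph traversal instead of something only the depositor knows.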
Bioimage informatics tools
• Current EMDB interface:
  • simple and efficient
  • but must be extended to accommodate more complex experiments
• OMERO interface:
  • geared at labs, not public databases
  • all the beauty of AJAX
  • high-performance visualization
Bioimage informatics tools
• BISQUE/BISQUIK (UCSB)
  • multichannel images
  • lab notebook
  • tagging
  • image markup
Current Imaging Workflow Paradigm No Standards Experiment? Image? Analytics? Annotations? Jason Swedlow (U. Dundee)
OMERO in 2007/8/9 Jason Swedlow (Univ. Dundee)
A Virtual Research Community (diagram). Components: grid/cloud computing and storage; imaging centers with in-house storage; databases; software; users. Labeled functions: storage distribution; quality assessment; acquisition, storage, and management of images; storage and computing engines; data submission; data harvesting.
CONCLUSIONS
• Community databases are a central part of the scientific data infrastructure
• Image databases rapidly growing
• Technical challenges: data formats, size
• Standards and interoperability
• Improve metadata collection
• Keep the community engaged