Data management, curation, and display

Data management, curation, and display. Bob Sinkovits, AfCS Bioinformatics Lab, San Diego Supercomputer Center, UC San Diego

AfCS quick overview • Cell/molecular biology project focusing on cellular signaling • Collaboration involving eight laboratories plus other investigators working on bridging projects and data analysis • Two main activities • High throughput generation of experimental data • Molecule page project (in collaboration with NPG) • http://www.signaling-gateway.org

The data management problem • Collecting and archiving data • Tracking meta-data associated with experiments (reagents, technicians, labs, dates, machine settings, protocols, etc.) • Processing raw data • Curation • Organization and display • Data distribution

Data collection Data acquisition for the AfCS involves the separate transfer of experimental data and the description of the experiment (meta-data). [Diagram: Experimental Lab to SDSC; meta-data entered through GUIs, data (results) fetched with wget]

Experimental data collection Experimental data files are transferred on a nightly basis using the UNIX wget utility under the control of a cron job. [Diagram: data flowing to SDSC from Myriad, UTSW, UCSF, Caltech, Stanford, and Vanderbilt; assay labels: Y2H, Ca++, cAMP, phosphoprotein, cytokine, microscopy, single cell Ca++, microarray, lipid MS]

Experimental data collection • The UNIX wget utility was ideal for our project, where data needs to be collected from a limited number of sites on a regular basis. Filters allow control over transfers. • One drawback is that file transfers are initiated whenever the timestamps on the remote files change. It may be worthwhile to make the effort to write a better wget that also compares checksums
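
The checksum comparison suggested above could look roughly like the following. This is a minimal sketch, not the AfCS implementation: it assumes the remote site publishes a SHA-256 checksum for each file (for example in a manifest), and the function names `file_checksum` and `needs_transfer` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_transfer(local_path, remote_checksum):
    """Hypothetical policy: fetch only if the local copy is missing or differs.

    remote_checksum is assumed to come from a manifest published by the
    remote site; wget itself does not provide it.
    """
    local_path = Path(local_path)
    if not local_path.exists():
        return True
    return file_checksum(local_path) != remote_checksum
```

With a rule like this, a file whose timestamp changed but whose contents did not would be skipped rather than transferred again.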

Meta-data collection • Meta-data are inserted directly into the AfCS Oracle database through a set of Java Swing GUIs • Sample, experiment, cell line, etc. IDs are generated automatically based on date, laboratory code, etc. • Error checking, the use of pull down menus, and database constraints ensure that only valid data are entered through the GUIs
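
As an illustration of automatic ID generation and database constraints, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Oracle, and everything specific in it is an assumption: the ID format, the laboratory codes, and the `lab` and `sample` tables are hypothetical and are not taken from the AfCS schema.

```python
import sqlite3
from datetime import date

# Hypothetical controlled vocabulary of laboratory codes.
LAB_CODES = ("UTSW", "UCSF", "CALTECH")


def make_sample_id(lab_code, day, sequence):
    """Build an ID from lab code, date, and a counter (illustrative format)."""
    if lab_code not in LAB_CODES:
        raise ValueError(f"unknown laboratory code: {lab_code!r}")
    return f"{lab_code}-{day:%Y%m%d}-{sequence:04d}"


def create_schema(conn):
    """Create tables whose constraints reject invalid rows."""
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE lab (code TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO lab (code) VALUES (?)", [(code,) for code in LAB_CODES]
    )
    conn.execute(
        """CREATE TABLE sample (
               sample_id TEXT PRIMARY KEY,
               lab_code  TEXT NOT NULL REFERENCES lab (code),
               volume_ul REAL NOT NULL CHECK (volume_ul > 0)
           )"""
    )
```

The foreign key plays the role of a pull down menu on the database side: a row naming a laboratory that is not in the `lab` table is rejected even if the GUI check is bypassed.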

Meta-data collection

Meta-data collection • All experimental samples and materials (protein extracts, gels, cell preps, plasmids, solutions, reagents, etc.) are physically labeled using a 2-d barcode. [Photos: Symbol Cyclone scanner; Zebra Z4M barcode printer]

Data/information flow [Diagram: Labs send meta-data through the GUIs and data to SDSC; components shown: Oracle 9i, parse.pl, postprocess.pl, curation, www, SRB, disk / tape silo, off-site backup (Caltech)]

Databasing • Each type/category of experimental data is stored in a separate database schema • Easier to work with schemas containing smaller numbers of tables • Minimizes possibility of data loss/corruption • Avoids confusion due to multiple developers working in a single schema (overlap of namespaces) • Easier recovery • Privileges granted as needed between schemas

Databasing • Strongly encourage using multiple instances • Production, test, and development for datasets modified by large numbers of users • Production and development may be suitable for datasets that are modified by one or a few users • Multiple instances • Provide test beds for new releases of RDBMS • Allow developers to make schema modifications without impacting production system

Databasing • Oracle has worked great for us, but it’s not cheap (even with the educational discount). • For large databases, need to think a lot about performance • Every table should have a primary key • Use indexes for columns that are frequently used in searches • Run ANALYZE on a regular basis • Use bind variables
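
The performance points above can be sketched in a few lines. This example uses Python's built-in sqlite3 module as a stand-in for Oracle, so it is only an analogy: the `measurement` table, its columns, and the function names are hypothetical, and SQLite's ANALYZE is not the same command as Oracle's, although both gather statistics for the query planner.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a primary key and a search index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement ("
        " measurement_id INTEGER PRIMARY KEY,"
        " ligand TEXT NOT NULL,"
        " response REAL NOT NULL)"
    )
    # Index the column that searches filter on.
    conn.execute("CREATE INDEX measurement_ligand_idx ON measurement (ligand)")
    rows = [("ligand_a", 1.5), ("ligand_b", 2.0), ("ligand_a", 0.5)]
    conn.executemany(
        "INSERT INTO measurement (ligand, response) VALUES (?, ?)", rows
    )
    conn.commit()
    # Refresh planner statistics after loading data.
    conn.execute("ANALYZE")
    return conn


def responses_for_ligand(conn, ligand):
    """Query with a named bind variable instead of string concatenation."""
    cursor = conn.execute(
        "SELECT response FROM measurement"
        " WHERE ligand = :ligand ORDER BY measurement_id",
        {"ligand": ligand},
    )
    return [row[0] for row in cursor]
```

Because the statement text stays the same for every ligand, the database can reuse the parsed statement, and the bound value is never interpreted as SQL.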

Data archives • For users who want to analyze complete data sets, downloading results one experiment at a time can be tedious and impractical • For projects that deal with large amounts of data, an ftp server is essential for distributing complete archives

Data archives Archives of data sets can be downloaded at ftp://ftp.afcs.org/pub/datacenter

Data curation • Need to provide a convenient way for the AfCS labs to curate data • By ligand (don’t release until replicated) • By experiment (flag bad experiments) • By sample (flag bad samples w/o discarding expt) • Web interfaces for curation have been developed and access is restricted by user

Data curation • Ligands, experiments, and samples can be annotated in three ways • Public – available to the public • Internal – restricted to internal use. Validity of data still being investigated or experimental conditions not yet replicated • Invalid – experiment or sample flagged as being bad; not available to anyone
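
The three annotation levels can be expressed as a small visibility rule. The sketch below is illustrative only. The rule that the most restrictive annotation among ligand, experiment, and sample decides what is shown is an assumption, not something stated for the AfCS system, and the names `Status`, `effective_status`, and `is_visible` are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Curation annotations, ordered from least to most restrictive."""
    PUBLIC = 0
    INTERNAL = 1
    INVALID = 2


def effective_status(ligand, experiment, sample):
    """Assumed rule: the most restrictive of the three annotations wins."""
    return max(ligand, experiment, sample)


def is_visible(status, internal_user):
    """Invalid data is hidden from everyone; internal data from the public."""
    if status == Status.INVALID:
        return False
    if status == Status.INTERNAL:
        return internal_user
    return True
```

Under this assumed rule, flagging a single sample as invalid hides that sample while the other samples of the same experiment stay available.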

Data curation

Data curation by ligand For curation by ligand, interface is based on the public display with additional features

Data curation by sample/expt [Screenshots: curate by experiment; curate by sample]

Data curation by sample/expt For some assays, such as cytokine and phosphoprotein, the large number of samples makes curation by sample ID impractical. Curation is limited to the experiment level

Summary • Catch/prevent data entry errors early • Build business rules (constraints) into database • Limit choices through controlled vocabularies • LIMS much more scalable, shareable than traditional laboratory notebooks • wget ideal for data transfers from limited number of sites • When presenting data to the public, provide access at different granularities

Summary • Use multiple database instances • Production and development at minimum • Plus test for projects where data is entered by large numbers of users • Use multiple schemas, with privileges granted as necessary • Safer • Ease of development • Compartmentalization of data

Acknowledgements • Madhusudan, Ilango Vadivelu – LIMS • Stephen Lyon – web master • Brad Kroeger – systems administration • Chic Barna, Ray Bean – database administration • Sylvain Pradervand – phosphoprotein display • Ron Taussig, Gil Sambrano, Richard Scheuermann - data center design • Paul Sternweis – Ca++, cAMP display • Susie Mumby – phosphoprotein, cytokine display • Lonnie Sorrels, Keng-Mean Lin, Sangdun Choi, Nick Wong, Robert Hsueh, Heping Han, Ruth Levitz
