Data Curation and Management activities within the UCT Computational Biology Group

Dr Nicky Mulder

Outline
• Activities at UCT:
  • High-throughput biology data
  • Sequence annotation
  • DAS annotation development
• Issues we face
• A note on standards and ontologies

High-throughput biology data
• Close ties with CPGR
• Microarray data storage: BASE
• Proteomics data:
  • Annotation: pipeline required
  • Storage: LIMS required

BASE
• BioArray Software Environment
• Open source database for storage of array-type data
• Manages raw data (images) and annotations
• Has limited LIMS options
• Can include specifications for MIAME compliance
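
To make the MIAME point concrete, the sketch below checks an experiment record against the six broad sections of the MIAME checklist. It is only an illustration: the key names and the `missing_miame_sections` helper are hypothetical and are not BASE identifiers or part of any BASE API.

```python
# Minimal sketch of a MIAME completeness check.
# The key names below are hypothetical; BASE does not use these identifiers.
MIAME_SECTIONS = {
    "raw_data": "raw data for each hybridisation",
    "processed_data": "final processed (normalised) data",
    "sample_annotation": "sample annotation and experimental factors",
    "experimental_design": "experimental design and sample-data relationships",
    "array_annotation": "annotation of the array design",
    "protocols": "laboratory and data-processing protocols",
}


def missing_miame_sections(record):
    """Return the checklist keys that are absent or empty in `record`."""
    return [key for key in MIAME_SECTIONS if not record.get(key)]


# Made-up example record: three sections are missing or empty.
example = {
    "raw_data": ["scan_01.tif"],
    "processed_data": [],
    "sample_annotation": {"organism": "Homo sapiens"},
    "protocols": ["labelling protocol v1"],
}

for key in missing_miame_sections(example):
    print(f"missing: {MIAME_SECTIONS[key]}")
```

A check of this kind could run before data are submitted to a public repository, so that gaps are found while the researcher is still available to fill them.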

BASE Sample Information

BASE experimental info

Proteomics Data
• Still in progress
• Peptide identification programs
• Additional cross-linking from results to public database annotations
• Storage of experimental data and resulting identifications
• Include MIAPE compliance
• Linking to genomics data: standards required

Sequence Annotation 1
• Paeano pipeline for annotation of cDNAs from non-model organisms
• Uses a collection of publicly available and custom software
• Results are stored under projects
• Links provided to array data in BASE

Sequence Annotation 2
• Glossina (Tsetse) EST annotation project
• Held annotation jamboree at UWC
• Worked with Twiki tool developed by JBIRC
• Data to be submitted to public databases

Twiki system

DAS Annotation Tool
• Distributed Annotation System: allows viewing of annotation from different sources
• Can overlay your own data/annotation
• Facilitates information sharing without issue of updates
• Repositories distributed in different geographical locations
• Extension of DASTy2: developed at NBN
• Development of DAS annotation tool underway
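
The overlay idea can be sketched without any network access. The example below is not the DAS protocol and is not DASTy2 code; it only illustrates how features from a public source and a local source, expressed in the same sequence coordinates, can be combined and queried together. The `Feature` class, the source names and the coordinates are all invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    source: str  # which annotation source supplied the feature
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    label: str


def overlay(*tracks):
    """Merge feature lists from several sources, sorted by position."""
    merged = [feature for track in tracks for feature in track]
    return sorted(merged, key=lambda f: (f.start, f.end, f.source))


def features_at(features, position):
    """Return every feature, from any source, that covers `position`."""
    return [f for f in features if f.start <= position <= f.end]


# Invented example data: one public track and one local track.
public = [
    Feature("public_server", 10, 50, "domain A"),
    Feature("public_server", 80, 120, "signal region"),
]
local = [Feature("my_lab", 40, 90, "candidate site")]

combined = overlay(public, local)
print([f.label for f in features_at(combined, 45)])
```

Because each source keeps its own list and the merge happens at viewing time, the local annotation never has to be copied into the public repository, which is the property the slide describes as sharing without the issue of updates.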

DASTy

Links to other DAS viewers

DAS annotation tool
Collaborative visual annotation tool
- Annotation
- Comments
- Sequences
- Features
- Non-positional features
- Methodology of trust for a collaborative annotation process

Data curation and management issues
• HTB software licenses are expensive
• Open Source not always maintained
• Ensuring regular backups (data size)
• Keeping data up to date
• Researchers leave data after project: not updated to new versions
• Privacy: researchers share data only with collaborators, patient data is private
• Sharing and linking data

Standards and ontologies
• Use a controlled vocabulary (controlled list of terms) or ontology (set of terms with relations)
• Enables easy data retrieval and sharing
• Easy comparison of results from different labs
• Compatibility with other labs/databases world-wide
• Ease of uploading data into public databases
• Unambiguous report of research
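
The difference between a controlled vocabulary and an ontology can be shown in a few lines. In this sketch the vocabulary is a plain set of allowed terms, while the ontology adds "is a" relations that can be followed to broader terms. All terms and relations here are made up for illustration and are not taken from any published ontology.

```python
# A controlled vocabulary: a flat list of allowed terms (illustrative only).
CONTROLLED_VOCABULARY = {"liver", "kidney", "heart"}


def is_valid_term(term):
    """Accept a term only if it is in the controlled list."""
    return term.strip().lower() in CONTROLLED_VOCABULARY


# An ontology: terms plus relations. Here each term maps to its
# "is a" parents (illustrative terms, not a real ontology).
IS_A = {
    "hepatocyte": ["liver cell"],
    "liver cell": ["cell"],
    "cell": [],
}


def ancestors(term):
    """Follow "is a" links upwards and collect every broader term."""
    seen = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen


print(is_valid_term(" Liver "))        # a free-text entry is normalised and checked
print(sorted(ancestors("hepatocyte")))  # broader terms reachable from one term
```

The vocabulary check keeps free-text variants out of the database, and the relations let a search for a broad term also retrieve records annotated with narrower terms.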

Open Biomedical Ontologies
• Central location for accessing well-structured controlled vocabularies and ontologies for use in the biological and medical sciences
• Provides simple format for ontologies
• Scope includes anatomy, phenotype, development, disease, “omics”, experiment, etc.
• http://obo.sourceforge.net
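
As a rough illustration of how simple the OBO flat-file format is, the sketch below reads `[Term]` stanzas made of `tag: value` lines. It handles only a small subset of the format (no escape characters, quoted strings or trailing modifiers), and the `EX:` identifiers in the sample are invented, so it should be read as a sketch under those assumptions and not as a full parser.

```python
# Invented sample in OBO flat-file style; the EX: identifiers are not real.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: cell

[Term]
id: EX:0000002
name: liver cell
is_a: EX:0000001 ! cell
"""


def parse_terms(text):
    """Collect the tag-value pairs of each [Term] stanza as a dict of lists."""
    terms = []
    current = None  # None while in the header or a non-Term stanza
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Text after " ! " is a human-readable comment; drop it.
            value = value.split(" ! ", 1)[0].strip()
            current.setdefault(tag, []).append(value)
    return terms


for term in parse_terms(SAMPLE):
    print(term["id"][0], term["name"][0], term.get("is_a", []))
```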

Data exchange standards
• Microarray standards: MIAME and MAGE
• Proteomics Standards Initiative (PSI)
• Systems Biology Markup Language (SBML): computer-readable format for representing models of networks
• Biological Pathways Exchange (BioPAX): format for representing pathways

Conclusions
• Some tools in place for curation and management of different data types
• Need better education of researchers to encourage this
• Ontologies and standards are important in digital data curation and management; need to encourage compliance with international standards

Acknowledgements
• Funding:
• Collaborations:
  • CPGR
  • Researchers at UCT
