190 likes | 252 Views
Rapid Development of an Ontology of Coriell Cell Lines. Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone. malone@ebi.ac.uk. Overview. Motivation – our EBI use cases, big stuff in little time Methods – normalization, axiomatisation, automation
E N D
Rapid Development of an Ontology of Coriell Cell Lines Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone malone@ebi.ac.uk
Overview • Motivation – our EBI use cases, big stuff in little time • Methods – normalization, axiomatisation, automation • Results – large flat ontology, adding structure • Desiderata • Future work Master headline
Motivation • Lot of large (often online) catalogues • Infeasible to manually ontologise all of them (and cost is not easy to justify) Master headline
High quality artefacts take time $money$ 33k classes 75k classes FMA Gene Ontology 30 person years Since 1999 3k classes OBI Since 2006 Funders Master headline
Motivation • Would like to include these in BioSample Database for curating data, querying, browsing • Ontology use has proven to add value (e.g. GXA) • RDF representations of data for integration Master headline
Methods: A Model for Cell Lines • Agreed with cell line ontology folk • With addition of is_model_for, to reflect use of cell lines as models for particular diseases (they can be used as models for diseases other than the original disease borne by the subject) Master headline
Methods: Programmatic Engineering • Lexical concept recognition (2) - Map the terms to existing ontologies by either class label or synonyms in order to get identifiers (e.g. label – Homo sapiens and synonym – human) • Produce a list of classes from reference ontologies (3) e.g. • Map organism terms: NCBI taxonomy ontology • Map disease terms: Disease ontology • Map these to an upper ontology and axiomatic structure (4) cell line model on previous slide Master headline
Methods: Our “raw” data • Coriell database dump • Extract information for cell line, cell type, anatomy, organism, sex and disease. (27002 cell lines)
Lexical Concept Recognition • Our mapper tool was used, adapted the Levenshtein distance algorithm <http://www.ebi.ac.uk/efo/tools> • Mapping result: • 93 organism - NCBI taxonomy ontology • 61 anatomy – MAT (species neutral), FMA and other anatomy • 23 cell type – cell ontology • 337 disease - Disease ontology (DO) (via OMIM) • Check the mapping file manually • Inconsistency e.g. fibroma as anatomy • Unmapped terms • Unclear information e.g. buttock-thigh
Using OWL-API to build ontology automatically Example: import class cervix from EFO It has parent class organism part It has class relations part_of some female reproductive system Mapping files Reference ontologies Step one: Adding classes buildingOntology.java Coriell cell line protocol Spreadsheet.txt Step two: Adding restriction Hela cell line Coriell cell line Ontology Master headline
Adding Structure to Normalized Hierarchy Query: human cancer cell line Coriell cell line ontology Master headline
Explanation of query: human cancer cell line Cancer Cell line any cell type any organism part human Master headline
Results In Brief • Large (~28,000) cell line ontology, lots of axiomatisation for the axes mentioned (disease, organism, cell type, anatomy some extra bits like gender) • Running code took ~5 minutes • Flat asserted hierarchy under cell line; structure is all inferred using axioms as per Rector Normalization • Makes more flexible to change and also querying more powerful Master headline
BioPortal Analysis by Matt Horridge et al • http://bio-ontologies.knowledgeblog.org/135 • “the Coriell Cell Line Ontology, which contains over 45,000 non-trivial entailments, and an average of 4 justifications per entailment, peaking out at 65 justifications for one particular entailment“ • Coriell is quite rich axiomatically – this is what we would expect given the normalization method • Suggestion rich ontologies can be created rapdily Master headline
Discussion • Good bits: • Adding new classes was simple once code was done • Re-running the code also simple if a part of pattern was changed • Can be reused for different ontology with similar sort of input • Axiomatisation gives richer querying and renders implicit knowledge explicit • Harder bits: • Some data clean up necessary before using the code (SQL dumps can be messy) • Eliminating errors from concept matching (though expansion on synonyms helped) • Textual definitions for all classes • Size of the ontology created a few issues, e.g. BioPortal not able to browse, most reasoners can not complete on the ontology Master headline
Conclusions • To build very big ontologies we need either • Lots of resource (i.e. money) • Lots of people (crowd-sourcing) • Good tools and methods • Each having pros and cons • Lots of money requires, well, lots of money • Lots of people requires community buy in and ‘trust’ (so audit trail crucial) – we’re not always good at that • Automation can introduce errors and require some curation Master headline
Desiderata: Querying and External URIs • Easier to use interfaces, users like it simple • One of top user feedbacks is “we’d like it to work like Google does” • Users also like putting URIs into a webpage and getting something sensible back i.e.: Master headline
Future Work • Creating text definitions • Will follow up on work by Stevens R, Malone, J et al (2011)* for creating natural language definitions from OWL • Text mining to extract missing disease information from textual descriptions from original catalogue • Begin curating data for BioSample Database at EBI using ontology • Integrate Coriell ontology with rest of process from sample to analysis, e.g. software ontology http://theswo.sourceforge.net *Robert Stevens, James Malone, Sandra Williams, Richard Power, Alan Third (2011) Automating generation of textual class definitions from OWL to English.Journal of Biomedical Semantics, 2011 May 17, Vol. 2 Suppl 2:S5. Master headline
Acknowledgements • Thank to Chao Pang, Helen Parkinson,Tomasz Adamusiak, Paulo Silva and all people from functional genomics production team. • Gen2phen • The Coriell Institute for Medical Research • Alan Ruttenberg and Science Commons for providing the Coriell SQL dump • Lynn Schriml and colleagues from the Disease Ontology for OMIM mappings • Sirarat Sarntivijai, Oliver He, Alexander Diehl and Terry Meehan for discussions on the cell line model. Master headline