1 / 19

Rapid Development of an Ontology of Coriell Cell Lines

Rapid Development of an Ontology of Coriell Cell Lines. Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone. malone@ebi.ac.uk. Overview. Motivation – our EBI use cases, big stuff in little time Methods – normalization, axiomatisation, automation

rose-stokes
Download Presentation

Rapid Development of an Ontology of Coriell Cell Lines

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Rapid Development of an Ontology of Coriell Cell Lines Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone malone@ebi.ac.uk

  2. Overview • Motivation – our EBI use cases, big stuff in little time • Methods – normalization, axiomatisation, automation • Results – large flat ontology, adding structure • Desiderata • Future work Master headline

  3. Motivation • Lot of large (often online) catalogues • Infeasible to manually ontologise all of them (and cost is not easy to justify) Master headline

  4. High quality artefacts take time $money$ 33k classes 75k classes FMA Gene Ontology 30 person years Since 1999 3k classes OBI Since 2006 Funders Master headline

  5. Motivation • Would like to include these in BioSample Database for curating data, querying, browsing • Ontology use has proven to add value (e.g. GXA) • RDF representations of data for integration Master headline

  6. Methods: A Model for Cell Lines • Agreed with cell line ontology folk • With addition of is_model_for, to reflect use of cell lines as models for particular diseases (they can be used as models for diseases other than the original disease borne by the subject) Master headline

  7. Methods: Programmatic Engineering • Lexical concept recognition (2) - Map the terms to existing ontologies by either class label or synonyms in order to get identifiers (e.g. label – Homo sapiens and synonym – human)‏ • Produce a list of classes from reference ontologies (3) e.g. • Map organism terms: NCBI taxonomy ontology • Map disease terms: Disease ontology • Map these to an upper ontology and axiomatic structure (4) cell line model on previous slide Master headline

  8. Methods: Our “raw” data • Coriell database dump • Extract information for cell line, cell type, anatomy, organism, sex and disease. (27002 cell lines)‏

  9. Lexical Concept Recognition • Our mapper tool was used, adapted the Levenshtein distance algorithm <http://www.ebi.ac.uk/efo/tools> • Mapping result: • 93 organism - NCBI taxonomy ontology • 61 anatomy – MAT (species neutral), FMA and other anatomy • 23 cell type – cell ontology • 337 disease - Disease ontology (DO) (via OMIM) • Check the mapping file manually • Inconsistency e.g. fibroma as anatomy • Unmapped terms • Unclear information e.g. buttock-thigh

  10. Using OWL-API to build ontology automatically Example: import class cervix from EFO It has parent class organism part It has class relations part_of some female reproductive system Mapping files Reference ontologies Step one: Adding classes buildingOntology.java Coriell cell line protocol Spreadsheet.txt Step two: Adding restriction Hela cell line Coriell cell line Ontology Master headline

  11. Adding Structure to Normalized Hierarchy Query: human cancer cell line Coriell cell line ontology Master headline

  12. Explanation of query: human cancer cell line Cancer Cell line any cell type any organism part human Master headline

  13. Results In Brief • Large (~28,000) cell line ontology, lots of axiomatisation for the axes mentioned (disease, organism, cell type, anatomy some extra bits like gender) • Running code took ~5 minutes • Flat asserted hierarchy under cell line; structure is all inferred using axioms as per Rector Normalization • Makes more flexible to change and also querying more powerful Master headline

  14. BioPortal Analysis by Matt Horridge et al • http://bio-ontologies.knowledgeblog.org/135 • “the Coriell Cell Line Ontology, which contains over 45,000 non-trivial entailments, and an average of 4 justifications per entailment, peaking out at 65 justifications for one particular entailment“ • Coriell is quite rich axiomatically – this is what we would expect given the normalization method • Suggestion rich ontologies can be created rapdily Master headline

  15. Discussion • Good bits: • Adding new classes was simple once code was done • Re-running the code also simple if a part of pattern was changed • Can be reused for different ontology with similar sort of input • Axiomatisation gives richer querying and renders implicit knowledge explicit • Harder bits: • Some data clean up necessary before using the code (SQL dumps can be messy) • Eliminating errors from concept matching (though expansion on synonyms helped) • Textual definitions for all classes • Size of the ontology created a few issues, e.g. BioPortal not able to browse, most reasoners can not complete on the ontology Master headline

  16. Conclusions • To build very big ontologies we need either • Lots of resource (i.e. money) • Lots of people (crowd-sourcing) • Good tools and methods • Each having pros and cons • Lots of money requires, well, lots of money • Lots of people requires community buy in and ‘trust’ (so audit trail crucial) – we’re not always good at that • Automation can introduce errors and require some curation Master headline

  17. Desiderata: Querying and External URIs • Easier to use interfaces, users like it simple • One of top user feedbacks is “we’d like it to work like Google does” • Users also like putting URIs into a webpage and getting something sensible back i.e.: Master headline

  18. Future Work • Creating text definitions • Will follow up on work by Stevens R, Malone, J et al (2011)* for creating natural language definitions from OWL • Text mining to extract missing disease information from textual descriptions from original catalogue • Begin curating data for BioSample Database at EBI using ontology • Integrate Coriell ontology with rest of process from sample to analysis, e.g. software ontology http://theswo.sourceforge.net *Robert Stevens, James Malone, Sandra Williams, Richard Power, Alan Third (2011) Automating generation of textual class definitions from OWL to English.Journal of Biomedical Semantics, 2011 May 17, Vol. 2 Suppl 2:S5. Master headline

  19. Acknowledgements • Thank to Chao Pang, Helen Parkinson,Tomasz Adamusiak, Paulo Silva and all people from functional genomics production team. • Gen2phen • The Coriell Institute for Medical Research • Alan Ruttenberg and Science Commons for providing the Coriell SQL dump • Lynn Schriml and colleagues from the Disease Ontology for OMIM mappings • Sirarat Sarntivijai, Oliver He, Alexander Diehl and Terry Meehan for discussions on the cell line model. Master headline

More Related