Anatomy ontology evaluation @ ArrayExpress Helen Parkinson, PhD
Content • ArrayExpress use cases • Fuzzy matching of ontology terms • Data driven ontology building • Wish list
ArrayExpress: Overview [Diagram; labels: Public/Private, ATLAS, Re-annotate, Summarize, Gene queries, Experiment queries, Submit, Hybs, Public Only, Cross expt/species queries, Genes]
Fuzzy matching of ontology terms – why? • Clean up ArrayExpress OE and synonym tables • OE based integration • Constrain OEs on data entry/validation • Improved searches in repository/DW web interface • Data integration across species, experiments and experimental designs • Automated mapping of free text to ontology terms for data import
Phonetic Matching • Precompute phonetic encodings of all terms in the ontology • Match each target term by comparing these encodings • Soundex: Robert Russell and Margaret Odell (1918), famously described by Donald Knuth • Double Metaphone: Lawrence Philips (2000) • Metaphone: Lawrence Philips • Most matches are single • Highest success rate
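The precompute-then-compare idea on this slide can be sketched in a few lines of Python. This is a minimal illustration using classic American Soundex only, not the ArrayExpress implementation; the helper names (`soundex`, `phonetic_key`, `build_index`, `match`) and the sample terms are assumptions made for the example. Multi-word terms are encoded word by word, which is one possible choice among several.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, y, h and w have no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code of a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first.upper() + "".join(encoded) + "000")[:4]


def phonetic_key(term):
    """Encode a (possibly multi-word) term word by word."""
    return " ".join(soundex(w) for w in term.split())


def build_index(ontology_terms):
    """Precompute the phonetic key of every ontology term once."""
    index = defaultdict(list)
    for term in ontology_terms:
        index[phonetic_key(term)].append(term)
    return index


def match(index, target):
    """Return all ontology terms whose phonetic key equals the target's."""
    return index.get(phonetic_key(target), [])


# Hypothetical sample terms, for illustration only.
index = build_index(["hypothalamus", "cerebral cortex", "liver"])
print(match(index, "hypothalmus"))      # misspelling of an indexed term
print(match(index, "cerebrel cortex"))  # misspelling inside a two-word term
```

Soundex keeps only the first letter and three digits, so it tends to return several candidates per target; the slide's remark that most matches are single suggests a finer-grained encoding such as Metaphone would be substituted for `soundex` in practice.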
Failures to match • Species (or Kingdom)-specific terms (e.g. plant anatomy) • Conflated terms (e.g. diseased cell types) • Compound terms (e.g. "cerebral cortex and hypothalamus") • Genuinely missing terms • Esoteric terms less of a priority • Most trivial misspellings, however, were matched • Dirty input data
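One way the compound-term failure could be reduced is to split a compound annotation into its parts before matching. The sketch below is an assumption for illustration, not something the slides describe: the function names, the separator list and the exact, case-insensitive lookup are all hypothetical, and a real pipeline would need to guard against terms that legitimately contain "and".

```python
import re


def split_compound(term):
    """Split a compound annotation on 'and', commas, semicolons and slashes."""
    parts = re.split(r"\s+and\s+|\s*[,;/]\s*", term.strip(), flags=re.IGNORECASE)
    return [p for p in parts if p]


def match_compound(term, ontology_terms):
    """Match each part of a compound term against the ontology.

    Returns (matched, unmatched): ontology terms that were found, and the
    parts that still have no match and need manual curation.
    """
    known = {t.lower(): t for t in ontology_terms}
    matched, unmatched = [], []
    for part in split_compound(term):
        hit = known.get(part.lower())
        if hit is not None:
            matched.append(hit)
        else:
            unmatched.append(part)
    return matched, unmatched


# Hypothetical ontology content, for illustration only.
terms = ["cerebral cortex", "hypothalamus"]
print(match_compound("cerebral cortex and hypothalamus", terms))
```

Parts that remain unmatched after splitting fall into the other categories on the slide (species-specific, conflated or genuinely missing terms).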
Implications • Need more terms in some commonly-used ontologies • Synonyms are important • generating less noise • better coverage • Choice of ontology can limit expressivity - this will be frustrating to biologists
Why? • Clean up ArrayExpress OE and synonym tables • Add accessions/DB links to these tables • Constrain OEs on data entry/validation • Improved searches in repository/DW web interface • Generate suggestions for new OE terms • Evaluate domain coverage by a given ontology
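The last two goals on this slide (suggesting new terms and evaluating domain coverage) can be illustrated with a simple count. The sketch below is an assumption about how such an evaluation might look, not the method used at ArrayExpress: `coverage_report`, its parameters and the sample data are hypothetical, and matching is by exact, case-insensitive string comparison against terms and synonyms.

```python
from collections import Counter


def coverage_report(annotations, ontology_terms, synonyms=None):
    """Estimate how much of the annotation data an ontology covers.

    annotations    -- free-text values as used in the data (repeats allowed)
    ontology_terms -- term names from the ontology under evaluation
    synonyms       -- optional mapping of synonym -> ontology term
    Returns (coverage, missing): the usage-weighted fraction of annotations
    found in the ontology, and the unmatched values with their counts,
    most frequent first, as candidates for new terms.
    """
    vocab = {t.lower() for t in ontology_terms}
    vocab.update(s.lower() for s in (synonyms or {}))
    counts = Counter(a.strip().lower() for a in annotations if a.strip())
    total = sum(counts.values())
    covered = sum(n for term, n in counts.items() if term in vocab)
    missing = [(t, n) for t, n in counts.most_common() if t not in vocab]
    return (covered / total if total else 0.0), missing


# Hypothetical data, for illustration only.
coverage, missing = coverage_report(
    ["liver", "Liver", "hepatic tissue", "leaf", "leaf", "leaf"],
    ontology_terms=["liver"],
    synonyms={"hepatic tissue": "liver"},
)
print(coverage, missing)
```

Weighting by usage count means a frequently used but unmatched value lowers coverage more than a rare one, which also ranks it higher in the list of suggested additions.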
ArrayExpress Ontology Development and Future Directions Developing the Ontology • Define Scope: ArrayExpress already has some useful structure given the current database plus rich source of use cases and competency questions. • Build: Ontology Capture: Identify key concepts and relationships within our domain and give explicit definitions to these features: • Middle-out approach – specify core of basic terms then specialise and generalise as required • Mappings – text mining approach to do initial semi-automated mappings to external resources for rapid coverage • Manual mapping for data warehouse data, and selected data sets
ArrayExpress Ontology Development and Future Directions Capture to Code: Definitions and Hierarchy
ArrayExpress Ontology Development and Future Directions Semantic Roadmap • Position of the ArrayExpress Experimental Factor Ontology in the ‘bigger picture’ • Key is orthogonal coverage, reuse of existing resources and shared frameworks Chemical Entities of Biological Interest (ChEBI) NCI Cell Type Ontology Various Species Anatomy Ontologies Common Anatomy Reference Ontology Disease Ontology AE Ontology
Wish list • NOT to build our own anatomy ontology • CARO extension • CARO evaluation • Mapping CARO to relevant multi-species ontologies • Application of CARO to ArrayExpress data • Use of CARO in ArrayExpress tools
Acknowledgments • Anna Farne • Ele Holloway • James Malone • Margus Lukk ArrayExpress Production Team • Helen Parkinson • Tim Rayner • Faisal Rezwan • Eleanor Williams • Mengyao Zhao • Holly Zheng