Sub-language Processing for phenotype curation. Hong Cui, University of Arizona. Phenotype RCN, Feb 25-27, 2013. Agenda: CharaParser, Methodology, Evaluation, Applications, CharaParser for Phenoscape, New modules, Evaluations, Challenges. “Fine-Grained Semantic Mark-up”.
Sub-language Processing for Phenotype Curation • Hong Cui, University of Arizona • Phenotype RCN, Feb 25-27, 2013
Agenda • CharaParser • Methodology • Evaluation • Applications • CharaParser for Phenoscape • New modules • Evaluations • Challenges
“Fine-Grained Semantic Mark-up” • To annotate factual information from textual morphological descriptions of biodiversity in such detail that the machine-readable annotation itself provides information equivalent to the original text.
Previous Research • Syntactic parsing approach (Taylor, 1995; Abascal & Sánchez, 1999; Vanel, 2004) • Interactive extraction (Diederich, Fortuner & Milton, 1999) • Semi-supervised bootstrapping for lexicons (Riloff, 1999) • Supervised regular expression rule learning (Soderland, 1999; Tang & Heidorn, 2008) • Ontology-driven and parallel text (Woods et al., 2004) • Supervised association rule learning (Cui & Heidorn, 2007)
CharaParser Approach • 1. Unsupervised machine learning to find anatomy and character terms from descriptions automatically • No need to prepare training examples • 50%-80% of terms learned • 2. General-purpose syntactic parser (e.g., Stanford Parser) to parse the syntactic structure of sentences • No need to create a special-purpose, domain-dependent parser • Learned lexicon from step 1 is used to adapt the parser for biodiversity domains • 3. Intuitive rules to produce annotations from parse trees.
Unsupervised lexicon learning • If it is known that “roots” is an organ: • Roots yellow to medium brown or black, thin. • Petals yellow or white • Petals absent; • Subtending bracts absent; • Abaxial hastula absent;
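The chain of inferences on this slide can be sketched in a few lines of Python. This is a deliberately simplified illustration, not CharaParser's actual learning algorithm: it assumes that the word directly after a known organ is a character state, and that the word directly before a known state is an organ. The function name `bootstrap_lexicon` and both rules are assumptions made for this sketch.

```python
import re

def bootstrap_lexicon(clauses, seed_organs):
    """Grow organ and state word lists from seed organs (simplified sketch)."""
    organs = {o.lower() for o in seed_organs}
    states = set()
    tokenized = [re.findall(r"[a-z]+", c.lower()) for c in clauses]
    changed = True
    while changed:  # repeat until no new term is learned
        changed = False
        for tokens in tokenized:
            for left, right in zip(tokens, tokens[1:]):
                if left in organs and right not in organs and right not in states:
                    # Assumed rule 1: word after a known organ is a state.
                    states.add(right)
                    changed = True
                elif right in states and left not in states and left not in organs:
                    # Assumed rule 2: word before a known state is an organ.
                    organs.add(left)
                    changed = True
    return organs, states

clauses = [
    "Roots yellow to medium brown or black, thin.",
    "Petals yellow or white",
    "Petals absent;",
    "Subtending bracts absent;",
    "Abaxial hastula absent;",
]
organs, states = bootstrap_lexicon(clauses, seed_organs=["roots"])
print(sorted(organs))  # organs learned from the single seed "roots"
print(sorted(states))  # states learned along the way
```

Starting from “roots” alone, the sketch learns “yellow” as a state, then “petals” as an organ, then “absent” as a state, and finally “bracts” and “hastula” as organs. A real learner would need many more safeguards (stop words, modifiers, plural forms).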
Compared against a Heuristics-Based Method • Parser performance evaluated on the same data sets. • CharaParser: unsupervised learning + Stanford Parser • Heuristics-based: unsupervised learning + regular expression rules
Annotation Problems • Chunk errors: • Leaves oblanceolate to lanceolate, largest 14–20(–40) × 3–4(–5) mm, pliant; • Attachment errors: • on outer cypselae, crowns of bristlelike scales ca. 0.5 mm; on inner, of dusky white or pale yellow, plumose bristles 5–6 mm. • Semantics: • straight posterolateral bounding ridges to subtriangular, bilobed ventral muscle field;
Applications at Various Development Stages • Convert XML markup to • SDD for identification key generation • Character matrices for tree of life • RDF for the Semantic Web and search • Use marked-up descriptions to support search • FNA Experimental Search • Data source is RDF triples • Allow character-based search • E.g., plants that give yellow flowers at 200-400 meter elevation in April in North Carolina
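To make the character-based search concrete, here is a minimal sketch of the example query run over a handful of in-memory triples. It does not reflect the FNA Experimental Search implementation or its RDF vocabulary; the taxon names, predicate names, and the `search` function are all hypothetical.

```python
# Toy (subject, predicate, object) triples. All names are hypothetical.
TRIPLES = [
    ("taxon_A", "flower_color", "yellow"),
    ("taxon_A", "elevation_min_m", 100),
    ("taxon_A", "elevation_max_m", 500),
    ("taxon_A", "flowering_month", "April"),
    ("taxon_A", "distribution", "North Carolina"),
    ("taxon_B", "flower_color", "white"),
    ("taxon_B", "elevation_min_m", 200),
    ("taxon_B", "elevation_max_m", 400),
    ("taxon_B", "flowering_month", "April"),
    ("taxon_B", "distribution", "North Carolina"),
    ("taxon_C", "flower_color", "yellow"),
    ("taxon_C", "elevation_min_m", 800),
    ("taxon_C", "elevation_max_m", 1200),
    ("taxon_C", "flowering_month", "April"),
    ("taxon_C", "distribution", "North Carolina"),
]

def values(triples, subject, predicate):
    """All objects stored for one subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def search(triples, color, elev_low, elev_high, month, region):
    """Return taxa whose characters satisfy every criterion."""
    hits = []
    for taxon in sorted({s for s, _, _ in triples}):
        low = min(values(triples, taxon, "elevation_min_m"), default=None)
        high = max(values(triples, taxon, "elevation_max_m"), default=None)
        if low is None or high is None:
            continue
        # The taxon's elevation range must overlap the requested range.
        overlaps = low <= elev_high and high >= elev_low
        if (overlaps
                and color in values(triples, taxon, "flower_color")
                and month in values(triples, taxon, "flowering_month")
                and region in values(triples, taxon, "distribution")):
            hits.append(taxon)
    return hits

print(search(TRIPLES, "yellow", 200, 400, "April", "North Carolina"))
```

Only `taxon_A` matches: `taxon_B` has white flowers and `taxon_C` grows above the requested elevation range.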
To-Dos • Tighter integration of ontologies in annotation process. • Currently internal glossaries are used in place of ontologies to link a character state (e.g., “red”) to a character (“color”) • Synonyms are not controlled • “Petiolate” = “with petiole” • Continue to reduce annotation errors • Accommodate various syntactic styles • Diagnosis paragraphs • Comparison among different taxa
Phenotype Curation • Convert character and character state information from natural language descriptions to EQ statements
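As a rough illustration of the target format, an EQ (Entity-Quality) statement can be modeled as a small record. The class name `EQStatement`, its fields, and the example labels are assumptions for this sketch; real EQ statements reference ontology class identifiers rather than plain strings.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EQStatement:
    """One Entity-Quality statement (hypothetical, label-only model)."""
    entity: str                           # anatomical entity label
    quality: str                          # quality label
    related_entity: Optional[str] = None  # second entity for relational qualities

    def render(self) -> str:
        text = f"E: {self.entity} | Q: {self.quality}"
        if self.related_entity is not None:
            text += f" | related E: {self.related_entity}"
        return text

# A simple statement and a relational one. The labels are illustrative only.
simple = EQStatement("dorsal fin", "absent")
relational = EQStatement("bone A", "in contact with", "bone B")
print(simple.render())
print(relational.render())
```

The optional `related_entity` field is there because some qualities are relational and need a second entity.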
Curator Mental Process • ontologies
Adapted CharaParser • Ontologies
Evaluations • Internal evaluation: • The development corpus (three publications on fishes and archosaurs) provided 1,200 character descriptions; 100 of them were included in the internal evaluation benchmark. • Raw EQ performance: 90% • Final EQ performance: 50% • BioCreative 2012 evaluation: • 50 descriptions independently selected by the organizer (>50% of Qs were not in ontologies) • Gold standard created by the chief Phenoscape curator (raw and final) • Three biocurators worked in two modes (Phenex vs. Phenex+CharaParser) • Raw EQ performance: CharaParser better than biocurators • Final EQ performance: biocurators better than CharaParser • Inter-curator agreements:
Error Analyses • Various fixable syntactic problems • E.g., “digits I-III” • Curation granularity • CharaParser generated more candidate EQs than curators • “Preopercular latero-sensory canal leaves preopercle at first exit and enters a plate: yes/no” • Annotating relations (relational quality) • “contact between …”
Ontology Access • Currently use keyword-based search • Class labels and exact, narrower, and related synonyms • False positives • acute (shape) =? acute (process) • "margin" is a broad synonym of "marginal zone of embryo" in UBERON • Pre-composed terms in ontology • “ceratobranchial 5 tooth”, “rib of vertebra 5”, “body of humerus” • Ambiguous term use in descriptions • ‘epibranchial 1’ => epibranchial 1 element? bone? cartilage? • No matching
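The false-positive problem can be reproduced with a toy keyword lookup over class labels and synonyms. This is only a sketch of the general technique, not the lookup code used by CharaParser; the `OntologyClass` record, the `source` tags, and the `keyword_search` function are assumptions, while the two examples are taken from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyClass:
    label: str
    source: str  # informal tag for this sketch, not a real identifier
    synonyms: dict = field(default_factory=dict)  # scope -> list of synonyms

CLASSES = [
    OntologyClass("marginal zone of embryo", "UBERON", {"broad": ["margin"]}),
    OntologyClass("acute", "shape quality"),
    OntologyClass("acute", "process quality"),
]

def keyword_search(term, classes, scopes=("exact", "narrow", "related")):
    """Return classes whose label or in-scope synonym equals the term."""
    term = term.lower()
    hits = []
    for cls in classes:
        names = [cls.label]
        for scope in scopes:
            names.extend(cls.synonyms.get(scope, []))
        if term in (name.lower() for name in names):
            hits.append(cls)
    return hits

# "acute" matches two unrelated classes, so the keyword alone cannot decide.
print([c.source for c in keyword_search("acute", CLASSES)])
# Admitting broad synonyms maps "margin" to an embryonic structure.
all_scopes = ("exact", "narrow", "related", "broad")
print([c.label for c in keyword_search("margin", CLASSES, scopes=all_scopes)])
```

Both results are valid string matches, which is why string matching alone is not enough to pick the intended class.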
Exploration of Solutions • Experimented with • Word sense disambiguation: • “crinkly” not in PATO • Candidate matches: [undulate->1.00000000000002] [obovate->1.00000000000001] [flat->1.00000000000001] [flattened->1] [circinate->0.884697579551583] • Experimenting with • Subsets • Specify included classes: e.g., classes related to vertebrates • Specify excluded classes: e.g., exclude certain developmental stages • Ideas to try out: • Bootstrapping to narrow down the search space • Starting from known classes • Evaluating candidate matches based on their distances to the known classes and other sources of evidence.
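The subset idea can be sketched as a filter over candidate matches: keep a candidate only if it falls under an included class and under no excluded class. The toy hierarchy, the class labels, and the function names below are hypothetical and do not come from any real ontology.

```python
# child -> parent links of a hypothetical toy hierarchy.
PARENT = {
    "dorsal fin": "fin",
    "fin margin": "fin",
    "fin": "vertebrate structure",
    "marginal zone of embryo": "embryonic structure",
    "embryonic structure": "developmental-stage structure",
}

def ancestors(cls, parent):
    """Return the class itself followed by all of its ancestors."""
    chain = []
    while cls is not None and cls not in chain:
        chain.append(cls)
        cls = parent.get(cls)
    return chain

def filter_candidates(candidates, parent, included, excluded):
    """Keep candidates under an included class and under no excluded class."""
    kept = []
    for cand in candidates:
        chain = set(ancestors(cand, parent))
        if chain & included and not chain & excluded:
            kept.append(cand)
    return kept

candidates = ["fin margin", "marginal zone of embryo"]
print(filter_candidates(
    candidates,
    PARENT,
    included={"vertebrate structure"},
    excluded={"developmental-stage structure"},
))
```

With these settings only "fin margin" survives, which narrows the search space before any scoring of candidate matches takes place.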
Annotation consistency • Instructions given to human curators are helpful to CharaParser • Restricted relation list: • http://phenoscape.org/wiki/Guide_to_Character_Annotation#Relations_used_for_post-compositions
Feed more info to EQ generation module • Ontologies
Recent Improvements • Explorer of Taxon Concepts project • Making it a pure-Java program/web-based application • Currently requires MySQL + Perl • Making it faster • Optimization of the program • Removing MySQL and reducing I/O • “Parallel” computing using Java threads • Preliminary evaluation shows • 20 times faster: 2 sec/taxon description • Memory requirements increased threefold
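The thread-based speed-up amounts to annotating independent taxon descriptions concurrently. The slide refers to Java threads; the sketch below shows the same pattern in Python with a thread pool, purely for illustration. The `annotate` function is a hypothetical stand-in that only splits a description into clauses.

```python
from concurrent.futures import ThreadPoolExecutor

def annotate(description):
    """Stand-in for annotating one taxon description (hypothetical)."""
    clauses = [c.strip() for c in description.split(";") if c.strip()]
    return {"clauses": clauses, "n_clauses": len(clauses)}

def annotate_all(descriptions, max_workers=4):
    """Annotate descriptions concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(annotate, descriptions))

descriptions = [
    "Petals absent; Subtending bracts absent;",
    "Petals yellow or white",
]
for result in annotate_all(descriptions):
    print(result["n_clauses"], result["clauses"])
```

Each description is processed independently, so no shared state needs locking. In standard CPython, threads mainly help with I/O-bound work, so a CPU-bound annotator would more likely use processes there; Java threads do not have that limitation.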
Acknowledgements • Fine-Grained Semantic Markup Project (current and past) • James Macklin: Agriculture and Agri-Food Canada • Robert (Bob) Morris, Alex Dusenbery: UMass-Boston • Hariharan Gopalakrishnan, Zilong Chang, Thomas Rodenhausen, Mohan Krishna Gowda, Partha Pratim Sanyal, Chunshui Yu: University of Arizona • Phenoscape Project • Chris Mungall: Lawrence Berkeley National Lab • Melissa Haendel: Oregon Health & Science University • Paula Mabee, Alex Dececchi: University of South Dakota • Jim Balhoff, Wasila Dahdul, Hilmar Lapp, Todd Vision: NESCent • NSF ABI and EF Programs • The Flora of North America Project