Entity Recognition: Current Status and Summer Plan
Jing Jiang
May 12, 2006
Update since last meeting
• Met with Nyla (the biologist) to talk about training/evaluation data
  • Most annotated genes in the BioCreative data set are reasonable
  • To manually annotate a sample set of bee literature for evaluation and tuning purposes
• Tagged some other collections (fly-bcb, songbird, Wnt pathway)
• Identified some common errors and came up with some heuristics to fix them
Current performance
• On BIOSIS honey bee: waiting to hear from Nyla for judgment on the honey bee sample
• On Wnt pathway full-text articles (a sample of 100 sentences, judged by Xin)
  • Precision: 92% (207 / 224)
  • Recall: 84% (207 / 245)
• Examples:
  • fly, songbird, Wnt pathway
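As a check on the arithmetic, the Wnt pathway figures can be recomputed from the raw counts on the slide. This is a minimal sketch; the function name and argument names are illustrative, not part of the tagger.

```python
def precision_recall(correct, predicted, gold):
    """Return (precision, recall) from raw counts.

    correct   -- predicted gene mentions judged correct (207 on the slide)
    predicted -- all gene mentions the tagger produced (224)
    gold      -- all gene mentions the judge identified (245)
    """
    return correct / predicted, correct / gold

p, r = precision_recall(correct=207, predicted=224, gold=245)
print(f"precision = {p:.1%}, recall = {r:.1%}")
```

The counts also imply 224 - 207 = 17 false positives and 245 - 207 = 38 missed mentions in the 100-sentence sample.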
Common errors and heuristics
• Same word/phrase tagged differently within the same article
  • Because of the different contexts
  • Heuristic: force the tagging to be consistent
• Long form and its abbreviation tagged differently
  • E.g.: …a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and…
  • Heuristic: force the tagging to be consistent
• Easily detectable false positives
  • E.g.: Roughly half of Drosophila genes currently…
  • Heuristic: compile a list (of species names, chemical names, etc.) and some heuristic rules
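One possible reading of the "force the tagging to be consistent" heuristic is a per-article majority vote over each phrase's labels. The slides do not say which label should win, so the majority rule, the tie-break toward "GENE", and the (phrase, label) input format below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def force_consistent(mentions):
    """Relabel every occurrence of a phrase with its majority label.

    `mentions` is a list of (phrase, label) pairs from ONE article, where
    label is "GENE" or "O" (not a gene). Ties go to "GENE" (an assumption).
    """
    votes = defaultdict(Counter)
    for phrase, label in mentions:
        votes[phrase][label] += 1
    majority = {
        phrase: "GENE" if counts["GENE"] >= counts["O"] else "O"
        for phrase, counts in votes.items()
    }
    return [(phrase, majority[phrase]) for phrase, _ in mentions]

# Hypothetical tagger output for one article: AmTRP is missed once.
article = [("AmTRP", "GENE"), ("worker", "O"), ("AmTRP", "O"), ("AmTRP", "GENE")]
consistent = force_consistent(article)
```

The same vote could be extended to long form/abbreviation pairs by mapping both strings to one key before counting, though detecting such pairs is a separate step not shown here.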
Common errors and heuristics (cont.)
• Conjunctive words/phrases tagged differently
  • E.g.: …three cbl genes (c-cbl , cblb , and cblc) which…
  • Heuristic: use some rules to capture such conjunctive words, and tag them consistently
• Tokenization errors
  • E.g.: There is no difference in AmTRP-expressing cells among worker, …
  • Heuristic: compile a list of typical suffixes (such as “-expressing”, “-dependent”, etc.) that should be separated from their prefixes
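The suffix heuristic for tokenization errors can be sketched as a small re-tokenization pass. The two suffixes come from the slide; a real list would be longer, and the function names and whitespace tokenization are assumptions for illustration only.

```python
# Suffixes named on the slide; the full list would need to be compiled.
SUFFIXES = ("-expressing", "-dependent")

def split_suffix(token):
    """Split 'AmTRP-expressing' into ['AmTRP', '-expressing']."""
    for suffix in SUFFIXES:
        # Require a non-empty prefix so a bare suffix is left alone.
        if token.endswith(suffix) and len(token) > len(suffix):
            return [token[: -len(suffix)], suffix]
    return [token]

def retokenize(tokens):
    """Apply split_suffix to every token and flatten the result."""
    out = []
    for token in tokens:
        out.extend(split_suffix(token))
    return out

tokens = "There is no difference in AmTRP-expressing cells".split()
print(retokenize(tokens))
```

After the split, the gene name "AmTRP" stands as its own token and can be tagged without the suffix attached.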
Common errors and heuristics (cont.)
• Mistakes caused by citations
  • Only in certain collections (the Wnt pathway collection has this problem; the BIOSIS collections don’t)
  • E.g.: Among the downstream targets of PI 3-kinase are phospholipase C (6-9) , protein kinase C (10, 11) , Rac (12-14) , and…
  • Heuristic: remove these citations(?)
• Controversial cases: domain, subunit, etc.
  • E.g.: Alternating proline / alanine sequence of beta B1 subunit originates…
  • BioCreative data set tags these as part of gene names
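If the citation-removal heuristic is adopted, one simple form is a regular expression that deletes parenthesized lists or ranges of numbers. This is only a sketch: the pattern and function name are assumptions, and the pattern would also delete legitimate parenthesized numbers, so it would need checking against real text.

```python
import re

# Matches "(6-9)", "(10, 11)", "(12-14)" plus any whitespace before them.
CITATION = re.compile(r"\s*\(\d+(?:\s*[-,]\s*\d+)*\)")

def strip_citations(sentence):
    """Remove numeric citation markers such as '(6-9)' or '(10, 11)'."""
    return CITATION.sub("", sentence)

text = ("Among the downstream targets of PI 3-kinase are phospholipase C (6-9) , "
        "protein kinase C (10, 11) , Rac (12-14) , and")
print(strip_citations(text))
```

Unparenthesized numbers such as the "3" in "PI 3-kinase" are untouched, because the pattern requires the surrounding parentheses.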
Summer plan
• Evaluate the performance on honey bee data based on Nyla’s judgments
• Implement and tune the heuristics to capture the common errors, and evaluate their effectiveness
  • Some heuristics may cause new errors
  • Tune on the annotated sample honey bee data
• Based on the needs of BeeSpace, find a good balance between precision and recall
• Work with Todd on the input/output format of the entity recognizer
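One standard way to express a precision/recall balance as a single tuning target is the F-beta score, where beta > 1 favors recall and beta < 1 favors precision. The slides do not say which measure will be used, so this is an illustrative assumption; the counts are the Wnt pathway figures reported earlier in the deck.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Wnt pathway sample: 207 correct out of 224 predicted and 245 judged mentions.
precision, recall = 207 / 224, 207 / 245

for beta in (0.5, 1.0, 2.0):
    print(f"beta = {beta}: F = {f_beta(precision, recall, beta):.3f}")
```

Because precision is currently higher than recall, a precision-weighted setting (beta = 0.5) scores the tagger higher than a recall-weighted one (beta = 2); which setting matches the needs of BeeSpace is an open question in the plan.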