
Entity Recognition: Current Status and Summer Plan




  1. Entity Recognition: Current Status and Summer Plan
     Jing Jiang, May 12, 2006

  2. Update since last meeting
     • Met with Nyla (the biologist) to talk about training/evaluation data
       • Most annotated genes in the BioCreative data set are reasonable
       • Plan to manually annotate a sample set of bee literature for evaluation and tuning purposes
     • Tagged some other collections (fly-bcb, songbird, Wnt pathway)
     • Identified some common errors and came up with some heuristics to fix them

  3. Current performance
     • On BIOSIS honey bee: waiting to hear from Nyla for judgment on the honey bee sample
     • On Wnt pathway full-text articles (a sample of 100 sentences, judged by Xin)
       • Precision: 92% (207 / 224)
       • Recall: 84% (207 / 245)
     • Examples:
       • fly, songbird, Wnt pathway
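The percentages on this slide can be rechecked from the raw counts. The sketch below assumes the usual reading of the two fractions: 207 correctly tagged mentions, 224 mentions produced by the tagger, and 245 mentions marked by the judge. The function and variable names are illustrative only, and the F1 value is a derived figure that does not appear on the slide.

```python
def precision_recall(correct: int, predicted: int, gold: int) -> tuple[float, float]:
    """Return (precision, recall) from raw mention counts."""
    precision = correct / predicted  # share of tagged mentions that are right
    recall = correct / gold          # share of judged mentions that were found
    return precision, recall


# Counts taken from the slide (Wnt pathway sample of 100 sentences).
p, r = precision_recall(correct=207, predicted=224, gold=245)

# F1 is not reported on the slide; it is computed here only as a summary.
f1 = 2 * p * r / (p + r)

print(f"precision={p:.1%} recall={r:.1%} F1={f1:.1%}")
```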

  4. Common errors and heuristics
     • Same word/phrase tagged differently within the same article
       • Because of the different contexts
       • Heuristic: force the tagging to be consistent
     • Long form and its abbreviation tagged differently
       • E.g.: …a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and…
       • Heuristic: force the tagging to be consistent
     • Easily detectable false positives
       • E.g.: Roughly half of Drosophila genes currently…
       • Heuristic: compile a list (of species names, chemical names, etc.) and some heuristic rules
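One possible way to "force the tagging to be consistent" within an article is a per-phrase majority vote. The slide does not say how conflicts are resolved, so the majority rule, the tie-break toward "gene", and the `force_consistent` helper below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def force_consistent(mentions):
    """Relabel every occurrence of a phrase with its majority label.

    mentions: list of (phrase, tagged_as_gene) pairs from ONE article.
    Assumption: the majority label wins, and ties resolve to "gene".
    """
    votes = defaultdict(Counter)
    for phrase, is_gene in mentions:
        votes[phrase][is_gene] += 1
    decided = {phrase: c[True] >= c[False] for phrase, c in votes.items()}
    return [(phrase, decided[phrase]) for phrase, _ in mentions]


# Hypothetical tagger output for one article: AMUSP is missed once.
article = [("AMUSP", True), ("AMUSP", False), ("AMUSP", True), ("worker", False)]
fixed = force_consistent(article)
print(fixed)
```

The same vote could be shared between a long form and its abbreviation once the two are linked, for example by pooling the counts for "Apis mellifera ultraspiracle" and "AMUSP" under one key.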

  5. Common errors and heuristics (cont.)
     • Conjunctive words/phrases tagged differently
       • E.g.: …three cbl genes (c-cbl, cblb, and cblc) which…
       • Heuristic: use some rules to capture such conjunctive words, and tag them consistently
     • Tokenization errors:
       • E.g.: There is no difference in AmTRP-expressing cells among worker, …
       • Heuristic: compile a list of typical suffixes (such as “-expressing”, “-dependent”, etc.) that should be separated from their prefixes
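The suffix heuristic for tokenization errors could be sketched as follows. Only the two suffixes named on the slide are listed; a working list would be longer. The whitespace tokenizer and the helper names are simplifying assumptions, not the recognizer's actual tokenizer.

```python
import re

# Suffixes named on the slide; a fuller list is assumed to exist in practice.
SUFFIXES = ("expressing", "dependent")
_SUFFIX_RE = re.compile(r"^(.+)-(" + "|".join(SUFFIXES) + r")$")


def split_suffix(token):
    """Split 'AmTRP-expressing' into ['AmTRP', '-expressing']; leave other tokens alone."""
    match = _SUFFIX_RE.match(token)
    if match:
        return [match.group(1), "-" + match.group(2)]
    return [token]


def tokenize(sentence):
    """Whitespace tokenization followed by suffix splitting (a deliberate simplification)."""
    tokens = []
    for token in sentence.split():
        tokens.extend(split_suffix(token))
    return tokens


tokens = tokenize("There is no difference in AmTRP-expressing cells")
print(tokens)
```

Hyphenated gene names such as c-cbl are left intact because "cbl" is not in the suffix list.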

  6. Common errors and heuristics
     • Mistakes caused by citations:
       • Only in certain text (Wnt pathway collection has this problem. BIOSIS collections don’t.)
       • E.g.: Among the downstream targets of PI 3-kinase are phospholipase C (6-9), protein kinase C (10, 11), Rac (12-14), and…
       • Heuristic: remove these citations(?)
     • Controversial cases: domain, subunit, etc.
       • E.g.: Alternating proline / alanine sequence of beta B1 subunit originates…
       • BioCreative data set tags these as part of gene names
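If numeric citations are removed before tagging, a regular expression is one simple option. The pattern below is an assumption about what the citations look like (parenthesized numbers, ranges, and comma-separated lists, as in the example sentence). It would also delete any other parenthesized number, which may be part of why the slide marks this heuristic with a question mark.

```python
import re

# Matches "(6-9)", "(10, 11)", "(12-14)" together with the whitespace before them.
CITATION_RE = re.compile(r"\s*\(\d+(?:\s*[-,]\s*\d+)*\)")


def strip_citations(text):
    """Remove parenthesized numeric citations from a sentence."""
    return CITATION_RE.sub("", text)


sentence = ("Among the downstream targets of PI 3-kinase are phospholipase C (6-9), "
            "protein kinase C (10, 11), Rac (12-14), and")
cleaned = strip_citations(sentence)
print(cleaned)
```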

  7. Summer plan
     • Evaluate the performance on honey bee data based on Nyla’s judgments
     • Implement and tune the heuristics to capture the common errors, and evaluate their effectiveness
       • Some heuristics may cause new errors
       • Tune on the annotated sample honey bee data
       • Based on the needs of BeeSpace, find a good balance between precision and recall
     • Work with Todd on the input/output format of the entity recognizer

  8. Discussion
