A Probabilistic Term Variant Generator for Biomedical Terms Yoshimasa Tsuruoka and Jun’ichi Tsujii CREST, JST The University of Tokyo
Outline • Probabilistic Term Variant Generator • Generation Algorithm • Application: Dictionary expansion
Background • Information extraction from biomedical documents • Recognizing technical terms (e.g. DNA, protein names) • Example: We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated…
Technical Term Recognition • Machine learning based • Identifying the regions of terms ⇒ No ID information • Dictionary-based • Comparing the strings with each entry in the dictionary ⇒ ID information
Problems of Dictionary-based approaches • Spelling variation degrades recall ⇒ Approximate string searching • False positives degrade precision ⇒ Filtering by machine learning
Exact String Searching • Example • Text: Phorbol myristate acetate induced Egr-1 mRNA… • Dictionary: EGP, EGR-1, EGR-1 binding protein, … ⇒ None of the entries matches
Edit Distance • Defines the distance between two strings as the minimum cost of a sequence of three kinds of operations • Substitution • Insertion • Deletion • Ex.) board → abord • Cost = 2 (delete 'a' and insert 'a')
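The edit distance on this slide can be illustrated with the standard dynamic-programming (Levenshtein) formulation. This is a minimal sketch that assumes unit cost for each of the three operations; the function name `edit_distance` is chosen for illustration and is not taken from the slides.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit cost for substitution, insertion, deletion."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # 2 (delete 'a', insert 'a')
print(edit_distance("EGR-1", "Egr-1"))  # 2 (two case substitutions)
```

With uniform costs, the dictionary entry EGR-1 is two substitutions away from the text string Egr-1, which is why exact matching misses it while approximate matching can recover it.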
Automatic Generation of Spelling Variants • Variant Generator • Input: NF-Kappa B • Output: NF-Kappa B (1.0), NF Kappa B (0.9), NF kappa B (0.6), NF kappaB (0.5), NFkappaB (0.3), … • Each generated variant is associated with its generation probability
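Generation Algorithm • Recursive generation: P = P' × P_op (P' is the probability of the variant the operation is applied to, P_op the probability of the operation) • Example: T cell (1.0) → T-cell (0.5) with P_op = 0.5 • T cell (1.0) → T cells (0.2) with P_op = 0.2 • T-cell (0.5) → T-cells (0.1) with P_op = 0.2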
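The recursion on this slide can be sketched as follows. The two operations and their probabilities (0.5 for replacing a space with a hyphen, 0.2 for appending "s") are toy assumptions chosen to reproduce the T cell example; in the slides, operations and probabilities are learned from data and depend on character context. The cut-off `min_prob` and all function names are likewise hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (label, function returning the rewritten term or None, operation probability)
Operation = Tuple[str, Callable[[str], Optional[str]], float]


def hyphenate(term: str) -> Optional[str]:
    """Toy operation: replace the first space with a hyphen."""
    return term.replace(" ", "-", 1) if " " in term else None


def pluralize(term: str) -> Optional[str]:
    """Toy operation: append 's' unless the term already ends in 's'."""
    return None if term.endswith("s") else term + "s"


TOY_OPERATIONS: List[Operation] = [
    ("space->hyphen", hyphenate, 0.5),
    ("append s", pluralize, 0.2),
]


def generate_variants(term: str, operations: List[Operation],
                      min_prob: float = 0.05) -> Dict[str, float]:
    """Recursively apply operations, scoring each variant with P = P' * P_op."""
    best: Dict[str, float] = {}

    def expand(current: str, prob: float) -> None:
        # Stop below the cut-off, or if this variant was already reached
        # with an equal or higher probability.
        if prob < min_prob or best.get(current, 0.0) >= prob:
            return
        best[current] = prob
        for _label, apply_op, op_prob in operations:
            variant = apply_op(current)
            if variant is not None:
                expand(variant, prob * op_prob)

    expand(term, 1.0)
    return best


for variant, prob in sorted(generate_variants("T cell", TOY_OPERATIONS).items(),
                            key=lambda item: -item[1]):
    print(f"{variant} ({prob:.1f})")
```

When a variant can be reached along several paths, this sketch keeps the highest probability; the slides do not say how such cases are handled.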
Collecting Examples of Spelling Variation • Abbreviation Extraction (Schwartz 2003) • Extracts short and long form pairs
Learning Operation Rules • Operations for generating variants • Substitution • Deletion • Insertion • Context • Character-level context: preceding (following) two characters • Operation Probability
Application: Dictionary Expansion • Expanding each entry in the dictionary • Threshold of Generation Probability: 0.1 • Max number of variants for each entry: 20
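A minimal sketch of the expansion step, using the threshold (0.1) and per-entry cap (20) from this slide. The generator is passed in as a function; `toy_generate` simply replays the NF-Kappa B probabilities shown on the earlier slide and stands in for the real generator. The entry ID "ID-1", the function names, and the choice to count the original spelling toward the cap are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Probabilities copied from the NF-Kappa B slide; a stand-in for a real generator.
SLIDE_VARIANTS: Dict[str, Dict[str, float]] = {
    "NF-Kappa B": {
        "NF-Kappa B": 1.0,
        "NF Kappa B": 0.9,
        "NF kappa B": 0.6,
        "NF kappaB": 0.5,
        "NFkappaB": 0.3,
    }
}


def toy_generate(term: str) -> Dict[str, float]:
    """Return known variants, or just the term itself with probability 1.0."""
    return SLIDE_VARIANTS.get(term, {term: 1.0})


def expand_dictionary(entries: Dict[str, str],
                      generate: Callable[[str], Dict[str, float]],
                      threshold: float = 0.1,
                      max_variants: int = 20) -> Dict[str, Tuple[str, float]]:
    """Map each kept variant string to (entry ID, generation probability)."""
    expanded: Dict[str, Tuple[str, float]] = {}
    for entry_id, term in entries.items():
        candidates = [(v, p) for v, p in generate(term).items() if p >= threshold]
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        for variant, prob in candidates[:max_variants]:
            # If two entries produce the same variant, keep the likelier one.
            if variant not in expanded or prob > expanded[variant][1]:
                expanded[variant] = (entry_id, prob)
    return expanded


expanded = expand_dictionary({"ID-1": "NF-Kappa B"}, toy_generate)
for variant, (entry_id, prob) in expanded.items():
    print(variant, entry_id, prob)
```

Keeping the entry ID alongside each variant preserves the main advantage of dictionary-based recognition noted earlier: a match still yields ID information.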
Protein Name Recognition • Information Extraction • Longest match • GENIA corpus
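Longest-match lookup against an (expanded) dictionary can be sketched as below. This is a simplified character-level scan written for illustration, not the system evaluated on the GENIA corpus: it ignores word boundaries, so an entry may match inside a longer token. The dictionary contents and the IDs "ID-A" and "ID-B" are placeholders.

```python
from typing import Dict, List, Tuple


def longest_match(text: str, dictionary: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Scan left to right, preferring the longest entry at each position.

    Returns (start, end, entry ID) triples for non-overlapping matches.
    """
    # Try longer entries first so that "Egr-1 binding protein" beats "Egr-1".
    entries = sorted((e for e in dictionary if e), key=len, reverse=True)
    matches: List[Tuple[int, int, str]] = []
    pos = 0
    while pos < len(text):
        for entry in entries:
            if text.startswith(entry, pos):
                matches.append((pos, pos + len(entry), dictionary[entry]))
                pos += len(entry)
                break
        else:
            pos += 1  # no entry starts here; move on one character
    return matches


toy_dictionary = {"Egr-1": "ID-A", "Egr-1 binding protein": "ID-B"}
text = "Phorbol myristate acetate induced Egr-1 mRNA"
for start, end, entry_id in longest_match(text, toy_dictionary):
    print(text[start:end], entry_id)
```

A production implementation would typically tokenize the text and use a trie or similar index instead of testing every entry at every position.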
Conclusion • Probabilistic Variant Generator • Learning from actual examples • Dictionary expansion by the generator improves recall without loss of precision.