Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing. Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS University of California, Berkeley http://biotext.berkeley.edu.
Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing • Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst • Computer Science Division and SIMS, University of California, Berkeley • http://biotext.berkeley.edu • Supported by NSF DBI-0317510 and a gift from Genentech
Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation
Overview • Motivation: Need to re-use results of NLP processing: • for additional processing • for end applications: data mining etc. • Proposed solution: • Layers of annotations over text • Illustration: • Application to noun compound bracketing
Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing) (b) [ liver [ cell line ] ] (right bracketing) • In (a), the antibody targets the liver cell. • In (b), the cell line is derived from the liver.
Related Work • Pustejovsky et al. (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3) • Lauer (1995) • dependency model: Pr(w1|w3) vs. Pr(w2|w3) • Keller & Lapata (2004): • use the Web • unigrams and bigrams • Nakov & Hearst (2005): will be presented at CoNLL! • use the Web, chi-squared • n-grams • paraphrases • surface features
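The adjacency comparison Pr(w1|w2) vs. Pr(w2|w3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each conditional probability is estimated from raw counts as #("wi wj") / #(wj), that the larger value decides the bracketing, and that ties are left undecided. The function names and the toy counts are hypothetical.

```python
def conditional(bigram_count, head_count):
    """Estimate Pr(w_i | w_j) as #(w_i w_j) / #(w_j); 0.0 if w_j was never seen."""
    return bigram_count / head_count if head_count else 0.0


def adjacency_bracketing(w1, w2, w3, unigrams, bigrams):
    """Compare Pr(w1|w2) with Pr(w2|w3) and return 'left', 'right' or 'undecided'.

    Assumption: the stronger association wins; ties are left undecided.
    """
    left = conditional(bigrams.get((w1, w2), 0), unigrams.get(w2, 0))
    right = conditional(bigrams.get((w2, w3), 0), unigrams.get(w3, 0))
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undecided"


# Toy counts, invented for illustration only (not measured on any corpus).
TOY_UNIGRAMS = {"cell": 100, "antibody": 50, "line": 80}
TOY_BIGRAMS = {("liver", "cell"): 40, ("cell", "antibody"): 5, ("cell", "line"): 60}

decision = adjacency_bracketing("liver", "cell", "antibody", TOY_UNIGRAMS, TOY_BIGRAMS)
```

With these toy counts, Pr(liver|cell) = 0.4 and Pr(cell|antibody) = 0.1, so the sketch chooses the left bracketing for "liver cell antibody".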
Nakov & Hearst (2005) • Web page hits: proxy for n-gram frequencies • Sample surface features • amino-acid sequence left • brain stem’s cell left • brain’s stem cell right • Majority vote to combine different models • Accuracy 89.34%
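A majority vote over several models can be sketched as follows. The slide does not say how abstentions or ties are handled, so this sketch assumes that a model may abstain by returning None and that a tie yields no decision; both choices are assumptions, and `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions ('left', 'right' or None) by majority vote.

    Assumptions: None means the model abstained; a tie returns None.
    """
    votes = Counter(p for p in predictions if p in ("left", "right"))
    if votes["left"] > votes["right"]:
        return "left"
    if votes["right"] > votes["left"]:
        return "right"
    return None


# Three hypothetical models vote, one abstains.
combined = majority_vote(["left", "right", "left", None])
```

Returning None on a tie keeps undecided cases visible instead of silently falling back to one side.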
Web Counts: Problems • The Web lacks linguistic annotation • Pr(health|care) = #(“health care”) / #(care) • “health”: returns nouns • “care”: returns both verbs and nouns • can be adjacent by chance • can come from different sentences • Cannot find: • stem cells VERB PREPOSITION brain • protein synthesis’ inhibition • Page hits are inaccurate
Solution: MEDLINE+LQL • MEDLINE: ~13 million abstracts • We annotated: • 1.4 million abstracts • ~10 million sentences • ~320 million annotations • Layered Query Language: demo at ACL! • http://biotext.berkeley.edu/lql/
The System • Built on top of an RDBMS • Supports layers of annotations over text • hierarchical, overlapping • cannot be represented in a single XML file • Specialized query language • LQL (Layered Query Language)
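To make the idea of annotation layers stored in an RDBMS concrete, here is a small sketch using Python's built-in sqlite3 module. The single-table schema (layer, tag type, character offsets) is a guess made for illustration and is not the schema of the actual system; the query is plain SQL standing in for what an LQL query might compile to.

```python
import sqlite3


def nouns_in_noun_phrases(text, annotations):
    """Return the noun tokens that lie inside a shallow-parse NP span.

    annotations: iterable of (layer, tag_type, start_pos, end_pos) tuples,
    with character offsets into `text` (hypothetical schema).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        " layer TEXT, tag_type TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", annotations)
    # Self-join: a 'pos' annotation is kept when its span is contained
    # in the span of an NP annotation from the 'shallow_parse' layer.
    rows = conn.execute(
        """
        SELECT tok.start_pos, tok.end_pos
        FROM annotation AS np
        JOIN annotation AS tok
          ON tok.start_pos >= np.start_pos AND tok.end_pos <= np.end_pos
        WHERE np.layer = 'shallow_parse' AND np.tag_type = 'NP'
          AND tok.layer = 'pos' AND tok.tag_type = 'noun'
        ORDER BY tok.start_pos
        """
    ).fetchall()
    conn.close()
    return [text[start:end] for start, end in rows]


TEXT = "the liver cell line grew"
ANNOTATIONS = [
    ("shallow_parse", "NP", 0, 19),  # "the liver cell line"
    ("pos", "DT", 0, 3),
    ("pos", "noun", 4, 9),
    ("pos", "noun", 10, 14),
    ("pos", "noun", 15, 19),
    ("pos", "verb", 20, 24),
]

nouns = nouns_in_noun_phrases(TEXT, ANNOTATIONS)
```

Because every annotation is just a row with offsets, spans from different layers can overlap freely, which is the property a single XML tree cannot express.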
Noun Compound Extraction (1)
    FROM
      [layer='shallow_parse' && tag_type='NP'
        ^ [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"] $
      ] AS compound
    SELECT compound.content
• ^ : layers' beginnings should match • $ : layers' endings should match
Noun Compound Extraction (2)
    SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            ^ [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"] $
          ] AS compound
        SELECT compound.content
      END_LQL
    GROUP BY lc
    ORDER BY freq DESC
Noun Compound Extraction (3)
    SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            ^ ( { ALLOW GAPS }
                ![layer='pos' && tag_type="noun"]
                ( [layer='pos' && tag_type="noun"]
                  [layer='pos' && tag_type="noun"]
                  [layer='pos' && tag_type="noun"] ) $
              ) $
          ] AS compound
        SELECT compound.content
      END_LQL
    GROUP BY lc
    ORDER BY freq DESC
• ![...] : layer negation • ( ... ) : artificial range
Finding Bigram Counts
    SELECT COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            [layer='pos' && tag_type="noun" && content="immunodeficiency"] AS word1
            [layer='pos' && tag_type="noun" && (content="virus" || content="viruses")]
          ]
        SELECT word1.content
      END_LQL
Paraphrases • Types of paraphrases (Warren, 1978): • Prepositional • immunodeficiency virus in humans right • Verbal • virus causing human immunodeficiency left • immunodeficiency virus found in humans left • Copula • immunodeficiency virus that is human right
Prepositional Paraphrases
    SELECT LOWER(prep.content) lp, COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='sentence'
            [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
            [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
            [layer='pos' && tag_type='IN'] AS prep
            ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
            [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
          ]
        SELECT prep.content
      END_LQL
    GROUP BY lp
    ORDER BY freq DESC
• ?[...] : optional layer
Evaluation • obtained 418,678 noun compounds (NCs) • annotated the top 232 NCs (after cleaning) • agreement 88% • kappa .606 • baseline (left): 83.19% • n-grams: Pr, #, χ² • prepositional paraphrases • for inflections, we used UMLS
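The slide reports raw agreement together with kappa. Assuming the statistic is Cohen's kappa for two annotators (the slide does not say which variant was used), it corrects observed agreement for the agreement expected by chance. A minimal sketch, with a hypothetical function name and invented toy labels:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
annotator_1 = ["left", "left", "right", "right"]
annotator_2 = ["left", "left", "right", "left"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy example, observed agreement is 0.75 and chance agreement is 0.5, giving kappa = 0.5. When one label dominates, as left bracketing does here, chance agreement is high, so kappa can be much lower than raw agreement.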
Results • [chart not recoverable from the transcript; legend: wrong, N/A, correct]
Discussion • Semantics of bone marrow cells • top verbal paraphrases • cells derived from bone marrow (22 instances) • cells isolated from bone marrow (14 instances) • top prepositional paraphrases • cells in bone marrow (456 instances) • cells from bone marrow (108 instances) • Finding hard examples for NC bracketing • w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
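The last bullet describes a simple filter: keep a three-word compound when both of its bigrams appear in a term list. A minimal sketch, assuming the term list is available as a set of lowercase strings; the function name and the toy term set below are hypothetical and are not taken from MeSH.

```python
def hard_examples(compounds, terms):
    """Return the three-word compounds whose bigrams w1 w2 and w2 w3 are both in `terms`."""
    hard = []
    for compound in compounds:
        words = compound.lower().split()
        if len(words) != 3:
            continue  # only three-word compounds are considered
        w1, w2, w3 = words
        if f"{w1} {w2}" in terms and f"{w2} {w3}" in terms:
            hard.append(compound)
    return hard


# Toy term set standing in for a real vocabulary (illustration only).
TOY_TERMS = {"bone marrow", "marrow cells", "cell line"}
candidates = ["bone marrow cells", "liver cell line", "stem cells"]
hard = hard_examples(candidates, TOY_TERMS)
```

Such compounds are hard because both possible bracketings are backed by an established term, so n-gram evidence alone pulls in both directions.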
The End Thank you!