Supporting Annotation Layers for Natural Language Processing
Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
CS & SIMS, UC Berkeley
Project overview

We demonstrate a system for flexible querying of text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. We present the Layered Query Language (LQL) and its use on examples taken from the NLP literature.

Project url: http://biotext.berkeley.edu/lql
Project support: NSF-DBI-0317510 & Genentech

Annotation Layers Example

Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.

Word            Word ID  POS
Overexpression  185      NN
of              8        IN
Bcl-2           51112    NN
results         23017    VBZ
in              7        IN
insufficient    5874     JJ
white           2791     JJ
blood           8952     NN
cell            1263     NN
death           5632     NN
and             17       CC
activation      8252     NN
of              8        IN
p53             12523    NN

Shallow parse: NP PP NP VP PP NP NP PP NP
Gene/protein: 596, 12043, 24224, 281020, 42722, 397276
Ontology: D007962, D016923, D001773, D044465, D001769, D002477, D003643, D016158, D019254

Full parse, sentence and section layers are not shown.

Example: Chemical–Disease Interactions

"Adherence to statin prevents one coronary heart disease event for every 429 patients."

• Goal: extract the relation that statin (potentially) prevents coronary heart disease.
• The MeSH C subtree contains diseases.
• MeSH supplementary concepts represent chemicals.
• LQL query to find potentially useful sentences:

FROM
  [layer='sentence' { NO ORDER, ALLOW GAPS }
    [layer='shallow_parse' && tag_name='NP'
      [layer='chemicals'] AS chemical $
    ]
    [layer='shallow_parse' && tag_name='NP'
      [layer='MeSH' && tree_number BELOW "C"] AS disease $
    ]
  ] AS sent
SELECT chemical.content, disease.content, sent.content
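The BELOW test on MeSH tree numbers in the query above can be pictured as a prefix check on dot-separated tree numbers. The sketch below is only an illustration of that idea, not the system's implementation: the function name is_below is hypothetical, the tree numbers are illustrative values, and it assumes BELOW means strictly below (the poster does not say whether the ancestor itself matches).

```python
def is_below(tree_number: str, ancestor: str) -> bool:
    """Return True if tree_number lies strictly below ancestor.

    Assumption: tree numbers are dot-separated paths such as "C14.280.647",
    and a one-letter ancestor such as "C" names a whole top-level category.
    """
    if tree_number == ancestor:
        return False  # assumed: a node is not below itself
    if len(ancestor) == 1:
        # Children of a category look like "C14": same letter, no dot.
        return tree_number.startswith(ancestor)
    # Require the dot so that "C14.2801" is not treated as below "C14.280".
    return tree_number.startswith(ancestor + ".")


# Illustrative tree numbers only (hypothetical annotation values).
candidates = ["C14.280.647", "D02.455", "C14.2801"]
diseases = [t for t in candidates if is_below(t, "C")]
```

Requiring the trailing dot for multi-level ancestors is the design choice that keeps sibling nodes with a shared numeric prefix from matching.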
The chemical–disease query extracts sentences containing two NPs in any order, without overlaps (NO ORDER), separated by any number of intervening elements (ALLOW GAPS). It requires one of the NPs to end with a chemical ($) and the other to end with a MeSH term from the C subtree (BELOW).

Framework

• Annotations are stored independently of text in an RDBMS
• Declarative query language for annotation retrieval
• Indexing structure designed for efficient query processing
• Layered Query Language for easy retrieval
• Object Oriented API for annotations: insertion, deletion and modification

Indexing Architectures

Based on benchmarking, we use Architecture 5.

PMID  SECTION   LAYER ID     START CHAR POS  END CHAR POS  TAG TYPE
3345  b (body)  0 (word)     34              39            59571
3345  b         0            41              48            55608
3345  b         0            50              54            89985
3345  b         1 (POS)      34              39            27 (NN)
3345  b         1            41              48            53 (VB)
3345  b         1            50              54            27
3345  b         3 (s.parse)  34              39            31 (NP)

[Figure: the table also has the columns SENTENCE, FIRST WORD POS, LAST WORD POS, SEQUENCE POS and WORD ID. A legend marks each column as part of the basic architecture or as added in architecture 2, 3, 4 or 5.]

Key Contributions

• Multiple overlapping layers (cannot be expressed in a single XML file)
• Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text
• Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet)
• Specialized query language
• Flexible results format
• Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations

Related Work

• Tree systems. Overview: see (Bird et al., 2005). Examples: TGrep2, TIGERSearch, LPath, CorpusSearch, GSearch, Linguist's Search Engine, Netgraph, TIQL, VIQTORIA, etc.
• Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined (Cassidy & Harrington, 2001).
• NiteQL (the query language of MATE): highly expressive, allows querying of intersecting hierarchies; stored in XML (McKelvie et al., 2001).
• TIQL: queries manipulate intervals of text, indicated by XML tags; supports set operations (Nenadic et al., 2002).
• Annotation graphs: directed acyclic graph; nodes can have time stamps, constrained via paths to labeled parents and children (Bird and Liberman, 2001).

Example: Protein-Protein Interactions

Goal: Find all sentences that consist of a noun phrase containing a gene, followed by a morphological variant of the verb "activate", "inhibit", or "bind", followed by another NP containing a gene.

The LQL Query

SELECT p1.content, verb.content, p2.content, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    FROM
      [layer='sentence' { ALLOW GAPS }
        [layer='shallow_parse' && tag_name='NP'
          [layer='gene'] $
        ] AS p1
        [layer='pos' && tag_name="verb" &&
          (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
        ] AS verb
        [layer='shallow_parse' && tag_name='NP'
          [layer='gene'] $
        ] AS p2
      ]
    SELECT p1.content, verb.content, p2.content
  END_LQL
)
GROUP BY p1.content, verb.content, p2.content
ORDER BY cnt DESC

• 1.4 million MEDLINE abstracts
• 10 million sentences annotated
• 320 million multi-layered annotations
• 70 GB database size
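To make the storage idea under Framework and Indexing Architectures more concrete, here is a minimal sketch that loads the annotation rows shown there into an in-memory SQLite table and finds word annotations that end exactly where an NP ends, which is roughly what the $ operator asks for. SQLite, the table name annotation, the index, and the function names are all assumptions made for illustration; the poster only says that annotations live in an RDBMS.

```python
import sqlite3

# (pmid, section, layer_id, start_char_pos, end_char_pos, tag_type)
# Values copied from the poster's table; layer 0 = word, 1 = POS, 3 = shallow parse.
ROWS = [
    (3345, "b", 0, 34, 39, 59571),
    (3345, "b", 0, 41, 48, 55608),
    (3345, "b", 0, 50, 54, 89985),
    (3345, "b", 1, 34, 39, 27),
    (3345, "b", 1, 41, 48, 53),
    (3345, "b", 1, 50, 54, 27),
    (3345, "b", 3, 34, 39, 31),
]


def build_db() -> sqlite3.Connection:
    """Create the hypothetical annotation table and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE annotation (
               pmid INTEGER, section TEXT, layer_id INTEGER,
               start_char_pos INTEGER, end_char_pos INTEGER, tag_type INTEGER)"""
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    # Hypothetical index; the poster's actual index layout is not recoverable here.
    conn.execute(
        "CREATE INDEX idx_span ON annotation "
        "(pmid, section, layer_id, start_char_pos, end_char_pos)"
    )
    return conn


def words_ending_np(conn: sqlite3.Connection) -> list:
    """Word tag ids nested inside an NP (tag 31) and ending where the NP ends."""
    sql = """
        SELECT w.tag_type
        FROM annotation AS np
        JOIN annotation AS w
          ON w.pmid = np.pmid AND w.section = np.section
         AND w.start_char_pos >= np.start_char_pos
         AND w.end_char_pos = np.end_char_pos
        WHERE np.layer_id = 3 AND np.tag_type = 31 AND w.layer_id = 0
        ORDER BY w.start_char_pos
    """
    return [row[0] for row in conn.execute(sql)]


conn = build_db()
np_final_words = words_ending_np(conn)
```

Because every layer shares the same character coordinates, nesting across layers reduces to comparisons on start and end positions in a self-join.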
[Figure: Sample Output]

Layers of Annotations

• Each annotation represents an interval spanning a sequence of characters, with absolute start and end positions
• Each layer corresponds to a conceptually different kind of annotation
• Layers can be
  • Sequential
  • Overlapping (e.g., two multiple-word concepts sharing a word)
  • Hierarchical
    • spanning, when the intervals are nested as in a parse tree, or
    • ontological, when the token itself is derived from a hierarchical ontology

Summary

• A mechanism to effectively store and query layers of textual annotations.
• Evaluated various structures for data storage and arrived at an efficient and simple one.
• Implemented a concise and powerful annotation query language (LQL).
• Built a web interface.
• Planning to release the software to the research community.
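The interval view of annotations described under Layers of Annotations can be sketched in a few lines. This is an illustration under stated assumptions, not the system's data model: the Annotation class, the span helper, and the relation function are hypothetical names, end positions are taken to be inclusive, and the character offsets are computed from a short phrase rather than taken from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str
    start: int  # absolute start character position (inclusive)
    end: int    # absolute end character position (inclusive, assumed)
    tag: str


def span(text: str, phrase: str, layer: str, tag: str) -> Annotation:
    """Build an annotation covering the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(layer, start, start + len(phrase) - 1, tag)


def relation(a: Annotation, b: Annotation) -> str:
    """Classify two intervals as sequential, nested, or overlapping."""
    if a.end < b.start or b.end < a.start:
        return "sequential"
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    if a_contains_b or b_contains_a:
        return "nested"
    return "overlapping"


text = "white blood cell death"
noun_phrase = span(text, text, "shallow_parse", "NP")
wbc = span(text, "white blood cell", "ontology", "concept")
cell_death = span(text, "cell death", "ontology", "concept")
white = span(text, "white", "word", "white")
blood = span(text, "blood", "word", "blood")

kinds = (
    relation(white, blood),     # two words in sequence
    relation(wbc, cell_death),  # two concepts sharing the word "cell"
    relation(noun_phrase, wbc), # concept nested inside the NP
)
```

The two concept annotations share the word "cell", which is the overlapping case the list mentions and the reason a single tree of tags cannot hold every layer at once.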