Supporting Annotation Layers for Natural Language Processing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
http://biotext.berkeley.edu
Supported by NSF DBI-0317510 and a gift from Genentech

Project overview
• A system for flexible querying of text that has been annotated with the results of NLP processing.
• Supports
  • self-overlapping and parallel layers,
  • integration of syntactic and ontological hierarchies,
  • and tight integration with SQL.
• Designed to scale to very large corpora.
• Demo of LQL (Layered Query Language) on examples taken from the NLP literature.

Key Contributions
• Multiple overlapping layers (cannot be expressed in a single XML file)
• Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text
• Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet)
• Specialized query language
• Flexible results format
• Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations
  • 1.4 million MEDLINE abstracts
  • 10 million sentences annotated
  • 320 million multi-layered annotations
  • 70 GB database size

Layers of Annotations
• Each annotation represents an interval spanning a sequence of characters
  • absolute start and end positions
• Each layer corresponds to a conceptually different kind of annotation
• Layers can be
  • Sequential
  • Overlapping (e.g., two multiple-word concepts sharing a word)
  • Hierarchical
    • by spanning, when the intervals are nested as in a parse tree, or
    • ontologically, when the token itself is derived from a hierarchical ontology
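The interval view of annotations can be sketched in a few lines of Python. This is only an illustration, not the system's actual data model: the `Annotation` class, the helper names, and the half-open `[start, end)` offset convention are assumptions, and the "heart disease event" concept is made up to show two multi-word spans sharing words. The sentence itself is the one used in Example query II.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    layer: str   # kind of annotation, e.g. "shallow_parse", "mesh"
    tag: str     # label within the layer (hypothetical field)
    start: int   # absolute character offset, inclusive (assumed convention)
    end: int     # absolute character offset, exclusive (assumed convention)

def annotate(text, phrase, layer, tag):
    """Build an annotation over the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(layer, tag, start, start + len(phrase))

def overlaps(a, b):
    """True if the two intervals share at least one character."""
    return a.start < b.end and b.start < a.end

def contains(outer, inner):
    """True if inner is nested inside outer, as in a parse tree."""
    return outer.start <= inner.start and inner.end <= outer.end

text = ("Adherence to statin prevents one coronary heart disease "
        "event for every 429 patients.")

statin = annotate(text, "statin", "chemicals", "chemical")
np = annotate(text, "one coronary heart disease event", "shallow_parse", "NP")
chd = annotate(text, "coronary heart disease", "mesh", "disease")
# Made-up second concept, only to illustrate overlap without nesting.
hde = annotate(text, "heart disease event", "mesh", "illustrative")

print(contains(np, chd))    # nested: the concept lies inside the NP
print(overlaps(chd, hde))   # overlapping: the spans share "heart disease"
print(overlaps(statin, chd))  # sequential: the spans are disjoint
```

Keeping absolute offsets on every annotation means layers never need to reference each other; nesting and overlap are derived from the intervals alone.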

Annotation Layers Example

System Architecture (Main table)

System Architecture (Indexes)
• (Forward) +doc_id+section+layer_id+sentence+first_word_pos+last_word_pos+tag_type
• (Inverted) +layer_id+tag_type+doc_id+section+sentence+first_word_pos+last_word_pos
• (Inverted) +word_id+layer_id+tag_type+doc_id+section+sentence+first_word_pos
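As a rough sketch of how the three composite indexes might be declared, the following uses SQLite through Python's `sqlite3` module as a stand-in; the slides do not name the database engine. The table name `annotation`, the index names, and the column types are assumptions, and the `+`-separated lists are read here as the ordered key columns of each index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical main table: column names come from the index lists,
-- the types and the table name are assumptions.
CREATE TABLE annotation (
    doc_id         INTEGER,
    section        TEXT,
    layer_id       INTEGER,
    sentence       INTEGER,
    first_word_pos INTEGER,
    last_word_pos  INTEGER,
    tag_type       INTEGER,
    word_id        INTEGER
);

-- Forward index: start from a document, find its annotations.
CREATE INDEX idx_forward ON annotation
    (doc_id, section, layer_id, sentence,
     first_word_pos, last_word_pos, tag_type);

-- Inverted index: start from a layer and tag, find where it occurs.
CREATE INDEX idx_inverted_layer ON annotation
    (layer_id, tag_type, doc_id, section, sentence,
     first_word_pos, last_word_pos);

-- Inverted index: start from a word, find where it occurs.
CREATE INDEX idx_inverted_word ON annotation
    (word_id, layer_id, tag_type, doc_id, section, sentence,
     first_word_pos);
""")

# Ask the planner how it would answer a layer/tag lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT doc_id, sentence FROM annotation "
    "WHERE layer_id = ? AND tag_type = ?",
    (3, 7),
).fetchall()
plan_details = [row[-1] for row in plan]  # last column holds the description
for detail in plan_details:
    print(detail)
```

The column order matters: a query constrained by layer and tag type can use the leading columns of the layer-first inverted index, while a query that starts from a known document is served by the forward index.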

Example query I
• Protein-Protein Interactions
• Goal: Find all sentences that consist of a noun phrase containing a gene followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.

Example query I - LQL
SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p1
      [layer='pos' && tag_name="verb" &&
        (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
      ] AS verb
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p2
    ]
    SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
  END_LQL
) lql
GROUP BY p1_text, verb_content, p2_text
ORDER BY count(*) DESC

Example query I – Sample output

Example query II
• Chemical–Disease Interactions
• “Adherence to statin prevents one coronary heart disease event for every 429 patients.”
• Goal: extract the relation that statin (potentially) prevents coronary heart disease.
• MeSH C subtree contains diseases
• MeSH supplementary concepts represent chemicals.

Example query II - LQL
[layer='sentence' { NO ORDER, ALLOW GAPS }
  [layer='shallow_parse' && tag_name='NP'
    [layer='chemicals'] AS chemical $
  ]
  [layer='shallow_parse' && tag_name='NP'
    [layer='mesh' && tree_number ~ 'C%'] AS disease $
  ]
] AS sent
SELECT sent.pmid, chemical.text, disease.text, sent.text
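To make the intent of this query concrete, here is a minimal in-memory Python sketch of what it asks for. It is not how LQL is evaluated. The `Ann` class and `query_two` function are hypothetical, the tree number `C99.999` is a placeholder rather than a real MeSH code, and the NP boundaries are hand-assigned. The sketch assumes the two NP patterns must match different NPs, treats `NO ORDER, ALLOW GAPS` as "anywhere in the sentence, in either order", and deliberately ignores the `$` marker and the `pmid` column.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ann:
    layer: str
    start: int             # character offset, inclusive (assumed)
    end: int               # character offset, exclusive (assumed)
    tag_name: str = ""
    tree_number: str = ""

def inside(inner, outer):
    """True if inner's interval is nested within outer's interval."""
    return outer.start <= inner.start and inner.end <= outer.end

def span(text, phrase):
    """Offsets of the first occurrence of phrase in text."""
    start = text.index(phrase)
    return start, start + len(phrase)

def query_two(anns, text):
    """Return (chemical, disease, sentence) triples, any order, gaps allowed."""
    results = []
    for sent in (a for a in anns if a.layer == "sentence"):
        in_sent = [a for a in anns if a is not sent and inside(a, sent)]
        nps = [a for a in in_sent
               if a.layer == "shallow_parse" and a.tag_name == "NP"]
        chemicals = [a for a in in_sent if a.layer == "chemicals"]
        # tree_number ~ 'C%' is modelled as a prefix test.
        diseases = [a for a in in_sent
                    if a.layer == "mesh" and a.tree_number.startswith("C")]
        for chem in chemicals:
            chem_nps = [n for n in nps if inside(chem, n)]
            for dis in diseases:
                dis_nps = [n for n in nps if inside(dis, n)]
                # Assumption: the chemical and disease sit in different NPs.
                if any(c != d for c in chem_nps for d in dis_nps):
                    results.append((text[chem.start:chem.end],
                                    text[dis.start:dis.end],
                                    text[sent.start:sent.end]))
    return results

text = ("Adherence to statin prevents one coronary heart disease "
        "event for every 429 patients.")
anns = [
    Ann("sentence", 0, len(text)),
    Ann("shallow_parse", *span(text, "statin"), tag_name="NP"),
    Ann("shallow_parse", *span(text, "one coronary heart disease event"),
        tag_name="NP"),
    Ann("chemicals", *span(text, "statin")),
    Ann("mesh", *span(text, "coronary heart disease"),
        tree_number="C99.999"),  # placeholder tree number
]

for chemical, disease, sentence in query_two(anns, text):
    print(chemical, "|", disease)
```

The nested loops here are for readability only; at the scale quoted in the slides the same containment tests would be pushed down to indexed lookups.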
