What’s “NEXT”? Navigating through Dense Annotation Spaces
Branimir K. Boguraev, Mary S. Neff
Language Engineering for Content Analysis
IBM T.J. Watson Research Center, Yorktown Heights, NY
Dense Annotation Spaces
[Figure: the sample sentence “Service Reps can read customer name, in order to contact the customer.” annotated at several layers at once: part-of-speech tags ({np} {nps} {md} {vb} {nn} {in} {to} {dt}), phrases ([NP] [VG] [PP]), grammatical functions ([SUB] [OBJ]), clauses ([SC]), and the sentence span itself ([SENT]).]
Annotation ‘trees’
[Figure: the same sentence with the annotation layers arranged as a tree: [SENT] dominating [SC], [SUB], [OBJ], and [PP], which in turn dominate the [NP] and [VG] phrases over the POS-tagged tokens.]
Annotation lattice
[Figure: the same layers drawn as a lattice over the token sequence, with overlapping and alternative spans at each level.]
Navigational Challenges
[Figure: a nested name annotation: [PName] containing [Title] and [Name], with [Name] containing [First], [Middle], and [Last].]
What is visible to the lattice traversal engine?
Annotation-Based Finite State Transducer (AFst)
• UIMA-based
• A finite state calculus over typed feature structures
• Cf. “grep” over a sequence of annotations, specified as types and features

np = <E>/[NP . ( Token[pos=~”DT”] | <E> ) . Token[pos=~”JJ”]* .
     ( Token[pos=~”NN”] | Token[pos=~”NNS”] ) . <E>/]NP ;
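On this reading, the rule scans left to right: the <E>/[NP step emits the opening NP boundary over empty input, the alternation with the empty transition <E> makes the determiner optional, any number of adjectives may follow, then a singular or plural head noun, and <E>/]NP closes the bracket. Over the sample sentence it would, for instance, bracket “the customer” ({dt} {nn}) as an NP.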
Pitching the Iterator: support for navigational control
[Figure: the annotation lattice over the sample sentence, repeated as backdrop for the iterator discussion.]
AFst Traversal Regime
• Defining a particular path through the annotation space requires a lattice traversal engine that can focus, simultaneously, on:
  • Sequential constraints ~ pattern matching (horizontal: e.g. a prenominal modifier followed by a nominal head)
  • Structural constraints (vertical: e.g. iterate over NPs that stand in a specific configurational relationship, such as not sentence-initial and not inside a PP)
  • Configurational constraints
  • Type prioritization
Linearizing the Lattice: what’s “next”?
• Unambiguous Typeset iterator, inferred from the grammar: … [SUB] . [VG] . [OBJ] . [PP] …
• UIMA natural annotation sort order:
  • Start position ascending
  • Length descending
  • Type priority, defined in UIMA descriptors
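As a minimal illustration (the character offsets are hypothetical), this sort order would linearize the left edge of the sample lattice as:

[SENT](0,62) > [SUB](0,12) > [NP](0,12) > {np}(0,7) > {nps}(8,12) > [VG](13,21) > …

[SENT] comes first because it is the longest annotation starting at offset 0; [SUB] and [NP] share a span, so type priority decides their relative order.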
Linearizing the Lattice: what’s “next”?
Grammar-wide declarations:

boundary % Sentence[] ;
honour % Address[] ;

month = Token[lemma=~”January”] | Token[lemma=~”February”] | … ;

date = <E>/[Year . ( :month | <E> ) . Token[string=~:^[12]\d{3}$:] . <E>/]Year ;
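On a plain reading of these declarations (an interpretation, not spelled out on the slide): boundary % Sentence[] confines any match to a single sentence, and honour % Address[] tells the engine to treat existing Address annotations as opaque units rather than descending into their tokens. The date rule then brackets a four-digit year token, optionally preceded by a month, as a Year annotation, so it accepts both “January 1999” and a bare “1999”.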
Focus: Selecting Nested Boundary Annotations

<nameValuePair>
  <name>Focus</name>
  <value>
    <array>
      <string>Section[label=~:Education:]</string>
      <string>Sentence[number==1]</string>
    </array>
  </value>
</nameValuePair>
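Read this way, the Focus setting narrows the engine’s attention to annotations nested inside the listed boundary types: here, sections labelled “Education” and first sentences, the kind of targeting suited to the resume-extraction application listed later.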
Linearizing the Lattice: what’s “next”?
Grammar-wide declarations:

match % first, last, longest, shortest, all
advance % skip, step
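The keywords suggest the intended semantics (this reading is inferred from the names, not stated on the slide): match selects which of several competing matches survives, e.g. longest keeps only the widest of the overlapping candidate spans, while advance controls where scanning resumes, with skip jumping past the match and step moving ahead one position at a time, allowing overlaps.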
What’s “next”?: Switching Levels, Mixed Iterator
Refocus the iterator to examine an inner contour: @descend, @ascend

findDrSmith = <E>/PName[@descend] . Title[string=~”Dr.”] .
              <E>/Name[@descend] . ( First[] | <E> ) . Last[string==“Smith”] .
              <E>/Name[@ascend] . <E>/PName[@ascend] ;
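Read off the rule: descend into a PName annotation, require a Title token “Dr.”, descend further into the embedded Name, allow an optional First (the First[] | <E> alternation), require Last to be “Smith”, then ascend back out of Name and PName. It would thus accept both “Dr. John Smith” and “Dr. Smith”.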
Alternate Multiple Level Access
Upper/lower context without switching levels:

Token[_costarts=~Sentence[number==1]] ;
Subject[_covers=~PName[]] ;
PName[_costarts=~NP[], _coends=~NP[]] ;
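The predicate names are suggestive: _costarts holds when the two annotations begin at the same offset, _covers when the outer annotation spans the inner one, and _coends when they end together. The three patterns would thus pick out a sentence-initial Token of the first sentence, a Subject containing a PName, and a PName exactly coextensive with an NP, all without @descend/@ascend moves.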
Grammar cascading
• From simpler to more complex analyses
• Lower levels of output feed as inputs into higher levels (see the sketch below):
  • Small noun phrases & verb groups
  • Prepositional, possessive & adjectival phrases
  • More complex noun phrases
  • Variety of clause types
  • Grammatical relations (subject, object)
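A minimal sketch of one cascade step in the same rule notation (the pp rule and its details are an illustrative assumption, not from the talk): a later stage consumes the NP annotations posted by an earlier stage as atomic symbols instead of re-deriving them from tokens:

pp = <E>/[PP . Token[pos=~”IN”] . NP[] . <E>/]PP ;

Because the earlier stage has already posted the NPs, the prepositional-phrase grammar matches over them directly, which is what lets each level of the cascade stay small.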
Implementations
• Shallow parsing
• Named entity detection interleaved with shallow parsing
• Terminology identification in new domains
• Temporal expression parsing
• Privacy policy rules
• Information extraction from resumes
• Information extraction from contact center telephone calls
Future work list
• Alternate (semi-ambiguous) iterator, useful for “disambiguator” grammars: Actor[] Director[]
• Tree-walk iterator for tree representations where children are explicitly referenced in features
Performance Notes
Performance is a function of:
• How the grammar is written
• Optimisation of the fst graph (grammar compiler)
• Optimisation of the symbol compiler
• Optimisation of the executor
However … for the benefit of the curious … IBM Software Group (Dublin) optimised the last two, and …
IBM LanguageWare (Dublin) text analysis performance results
The analysis:
• AFst rules and FST dictionary
• 26 rules, 7 dictionaries (things like first names, indicators like Corp., etc.)
• Creating Person and Company annotations
The test:
• Test set: Enron, 924 files (4.5 Mb)
The results:
• Precision, Company annotations only: 0.81
• Recall, Company annotations only: 0.67
• Precision, Person annotations only: 0.93
• Recall, Person annotations only: 0.91
• Processing time: 3.4 seconds
This is 10 times faster than the best-of-breed internal reference annotators.
Perpetrators … er … Responsible parties
Bran Boguraev, Mary Neff, Bran Lambov, D.J. McCloskey, Thilo Goetz, Thomas Hampp, Oliver Suhre, Roy Byrd, Herb Chong, Albert Eskenazi, Paul Kaye, Son Bao Pham, Lokesh Shresta, Max Silberztein
For more on AFst and tools:
Tomorrow, 12:25 in Fez 1: “A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Environment”
Youssef Drissi, Branimir Boguraev, David Ferrucci, Paul Keyser, and Anthony Levas