Annotation Types for UIMA
Edward Loper
UIMA
• Unstructured Information Management Architecture
• Analytics framework
• Consists of components that perform specific tasks (tagging, parsing, etc.)
• Each component declares its own interface (input/output, requirements, workflow metadata, etc.)
• All information is communicated using a single standard data format: CAS
• Built-in support for network distribution, clustering, etc.
CAS
• Common Analysis Structure
• Tends to fall on the “weakly-merged” side of the spectrum (does not require annotations to be modified to ensure consistency).
• Annotations are encoded using typed feature structures.
• But the type definitions are left unspecified.
  • Cf. XML, which defines a syntax but leaves the element vocabulary to each application.
• Components can only work together if they use the same type system.
Standard CAS Types
• Goal: design standard CAS types for ULA annotations.
• In particular, we’re currently looking at Treebank, Propbank, & Timebank.
• Issues:
  • Redundancy of information
  • Coupling between annotations
  • Discontinuous constituents
CAS Types: background
• UIMA does provide a couple of top-level types (e.g. Annotation).
• These make it clear that UIMA intends:
  • Standoff annotations…
  • defined using spans…
  • with character-based offsets
• Cf. AGTK
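As an illustration of what "standoff, span-based, character-offset" means in practice, here is a minimal Python sketch. It is not UIMA code: the `Span` class and its `label` field are hypothetical names chosen for this example, and the end offset is assumed to be exclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A standoff annotation: a label attached to a character range.

    Hypothetical class for illustration; the text itself is never modified.
    """
    begin: int   # offset of the first covered character
    end: int     # offset one past the last covered character (assumed exclusive)
    label: str

    def covered_text(self, text: str) -> str:
        """Look up the annotated characters in the (unchanged) source text."""
        return text[self.begin:self.end]


text = "Asbestos was used in filters."
subject = Span(begin=0, end=8, label="NP")
covered = subject.covered_text(text)
```

Because the annotation only points into the text, any number of layers (parse, predicate-argument, temporal) can be stacked over the same characters without touching each other.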
Treebank
• Typical representation for treebank:
  <TreeBankConstituent id="8" start="5" end="23" type="NP" children="12 28 38" parent="94">
• Questions:
  • Should children be explicitly marked?
  • Should parents be explicitly marked?
• These questions have consequences…
Treebank: Explicit children?
• How could we not mark children?
• They can be mostly reconstructed, if we assume…
  • All constituents are properly nested
  • Unary branch direction can be determined based on node type.
    • Not quite true: SBAR/FRAG; S/NP; NP/FRAG; NP/PRN.
• Theoretical consequences of (not) marking children:
  • Have to assume proper nesting of constituents
  • Alternatively, allow for multiple coexisting bracketings (à la chart parse), which is probably not what we want.
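The reconstruction of children from spans alone can be sketched as follows. This is a rough Python illustration under stated assumptions, not a proposed implementation: the `Constituent` class and `rebuild_children` function are hypothetical, spans are assumed non-empty with exclusive end offsets, and unary branches (identical spans) are resolved by input order, which is exactly the arbitrary choice the slide warns about.

```python
from dataclasses import dataclass, field


@dataclass
class Constituent:
    id: int
    start: int
    end: int
    type: str
    children: list = field(default_factory=list)  # ids, filled in below


def rebuild_children(constituents):
    """Attach each constituent to the smallest constituent containing it.

    Returns the ids of the root constituents. Raises ValueError if two
    spans cross, i.e. if the proper-nesting assumption does not hold.
    """
    # Outer spans first. Python's sort is stable, so constituents with
    # identical spans keep their input order (an arbitrary tie-break).
    ordered = sorted(constituents, key=lambda c: (c.start, -c.end))
    stack, roots = [], []
    for c in ordered:
        # Drop finished constituents that end before this one starts.
        while stack and stack[-1].end <= c.start:
            stack.pop()
        if stack and stack[-1].end < c.end:
            raise ValueError(f"constituents {stack[-1].id} and {c.id} cross")
        if stack:
            stack[-1].children.append(c.id)
        else:
            roots.append(c.id)
        stack.append(c)
    return roots


nodes = [
    Constituent(1, 0, 29, "S"),
    Constituent(2, 0, 8, "NP"),
    Constituent(3, 9, 28, "VP"),
    Constituent(4, 18, 28, "PP"),
]
roots = rebuild_children(nodes)
```

Here `S` ends up with children `[2, 3]` and `VP` with `[4]`. Swapping the input order of two same-span nodes would flip which one is the parent, which is why node type (or explicit child marking) is needed for unary branches.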
Treebank: Explicit parents?
• Parent pointers are redundant: they can be reconstructed from the child lists.
• But they can be very handy to have when working with structures.
• Theoretical consequence of marking parents:
  • Every constituent has exactly one parent.
  • Rules out multi-parented trees. (Fine.)
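To make the redundancy concrete, parent pointers can be derived from child lists in a few lines. The sketch below is illustrative Python; `parent_map` is a hypothetical helper, and the ids reuse the example constituent (id 8, children 12 28 38, parent 94).

```python
def parent_map(children_of):
    """Derive child -> parent pointers from parent -> children lists.

    Raises ValueError if a constituent is listed under two parents,
    which is the "exactly one parent" constraint.
    """
    parents = {}
    for parent_id, kids in children_of.items():
        for kid in kids:
            if kid in parents:
                raise ValueError(
                    f"constituent {kid} has two parents: "
                    f"{parents[kid]} and {parent_id}"
                )
            parents[kid] = parent_id
    return parents


children_of = {94: [8], 8: [12, 28, 38]}
parents = parent_map(children_of)
```

Storing the pointer explicitly saves this pass, at the cost of a second copy of the same fact that can fall out of sync.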
Propbank
• Propbank’s current annotation…
  • Is strongly coupled to treebank
  • Argument locations are specified using “tree pointers”
  • Includes trace chain information
Propbank: Tree Pointers
• Each propbank argument is specified using a tree pointer w:h
  • The h-th constituent above the w-th word.
• Problems with this strong coupling:
  • Propbank can’t be used without trees.
  • New propbanking can’t be done unless parsing has been done.
  • Changes to trees are annoying to propagate to propbank.
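A tree pointer can only be interpreted by walking the parse tree, which is the coupling in question. The following Python sketch shows one way to resolve w:h. The `Node` class and `resolve` function are hypothetical, and the indexing convention (w counts words from 0, and h = 0 names the part-of-speech node directly above the word) is an assumption made for this example.

```python
class Node:
    """Minimal parse-tree node (hypothetical); words are childless nodes."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def words(node):
    """Return the word nodes under `node`, left to right."""
    if not node.children:
        return [node]
    return [w for child in node.children for w in words(child)]


def resolve(root, w, h):
    """Resolve tree pointer w:h to a node.

    Assumed convention: w is a 0-based word index and h = 0 is the
    part-of-speech node immediately above that word.
    """
    node = words(root)[w].parent
    for _ in range(h):
        if node.parent is None:
            raise ValueError(f"pointer {w}:{h} climbs above the root")
        node = node.parent
    return node


root = Node("S",
            Node("NP", Node("NNP", Node("Asbestos"))),
            Node("VP",
                 Node("VBD", Node("was")),
                 Node("VP", Node("VBN", Node("used")))))

subject = resolve(root, 0, 1)   # the NP above "Asbestos"
inner_vp = resolve(root, 2, 1)  # the VP directly above "used"
```

Note that `resolve` needs `root`: without the tree, or after the tree is re-bracketed, the pointer means nothing or means something different.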
Propbank: spans
• Can we get away with using spans instead (UIMA’s preferred approach)?
• Do we lose any information?
  • Potentially yes, for unary-branching nodes, where two constituents cover the same span.
• In practice:
  • 99.92% of non-trace args select the low constituent.
  • 97.9% of trace args select the high constituent.
  • The differences appear to just be errors.
  • … so no (important) lost info!
• About 50-55% of split arguments go away.
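A span fails to identify a constituent only when several constituents cover exactly the same text. A quick way to find those cases in a treebank is to group constituents by span, as in this Python sketch (`ambiguous_spans` and the `(id, start, end)` tuple layout are assumptions for illustration):

```python
from collections import defaultdict


def ambiguous_spans(constituents):
    """Return the spans covered by two or more constituents.

    `constituents` is an iterable of (id, start, end) tuples. For every
    span in the result, a span-based argument cannot say which of the
    listed constituents was meant.
    """
    by_span = defaultdict(list)
    for cid, start, end in constituents:
        by_span[(start, end)].append(cid)
    return {span: ids for span, ids in by_span.items() if len(ids) > 1}


# Constituents 1 and 2 cover the same characters; 3 is unambiguous.
shared = ambiguous_spans([(1, 0, 8), (2, 0, 8), (3, 9, 28)])
```

For each ambiguous span, a fixed rule (for example, always the low constituent for non-trace arguments) recovers the pointer in nearly all cases, according to the percentages above.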
Propbank: trace chains
• For arguments that have undergone movement, propbank explicitly marks the trace chain.
• But isn’t this something the tree should give us anyway?
• Treebank & propbank have somewhat different notions of what gets included in trace chains.
• 1/3 of the Propbank annotation guidelines talk about null elements.
Propbank: trace chains
• How much can we recover?
  • Using very simple heuristics (e.g., link “NP-2” with “*t*-2”), ~60%
  • Using more advanced heuristics, maybe 80%.
  • Not close enough to 100% to throw them away.
• Some differences are harder to automate: e.g., propbank (usually) only marks traces that interact with the predicate in some way.
  • “Asbestos_i was used t_i … and replaced t_i …”
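The "very simple heuristic" amounts to matching numeric indices between constituent labels and null elements. A minimal Python sketch of that idea follows. The function name, the input layout, and the two regular expressions are assumptions for illustration; they cover labels such as NP-2 or NP-SBJ-1 and null elements such as *T*-2 or *-1, not the full Treebank inventory.

```python
import re
from collections import defaultdict

# Assumed patterns: a trailing "-<digits>" on a constituent label, and a
# null element of the form "*-<digits>" or "*T*-<digits>" (any case).
LABEL_INDEX = re.compile(r"-(\d+)$")
TRACE_INDEX = re.compile(r"^\*(?:T\*)?-(\d+)$", re.IGNORECASE)


def link_by_index(labels, null_tokens):
    """Group antecedents and traces that share a co-index.

    `labels` is a list of (node_id, label) pairs, `null_tokens` a list of
    (leaf_id, token) pairs. Only indices with at least one antecedent and
    at least one trace are returned.
    """
    chains = defaultdict(lambda: {"antecedents": [], "traces": []})
    for node_id, label in labels:
        match = LABEL_INDEX.search(label)
        if match:
            chains[int(match.group(1))]["antecedents"].append(node_id)
    for leaf_id, token in null_tokens:
        match = TRACE_INDEX.match(token)
        if match:
            chains[int(match.group(1))]["traces"].append(leaf_id)
    return {i: c for i, c in chains.items()
            if c["antecedents"] and c["traces"]}


# "Asbestos_i was used t_i ... and replaced t_i ..."
labels = [("n1", "NP-SBJ-1"), ("n2", "VP"), ("n3", "NP-2")]
nulls = [("t1", "*-1"), ("t2", "*-1"), ("t3", "*U*")]
chains = link_by_index(labels, nulls)
```

This links every co-indexed trace, whereas propbank usually marks only the traces relevant to the predicate, so a heuristic of this kind overgenerates relative to the existing annotation.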
Propbank: trace chains (questions for discussion)
• Should marking trace chains be part of the propbanking task?
  • Or should we leave it up to the treebankers?
• If it should be part of propbanking, should it be split off as a separate subtask?
  • Would that help annotation speed any?
• Should the annotation be split off as a separate layer?
Discontinuous constituents
• Propbank has provisions for discontinuous constituents: w1:h1,w2:h2
• Discontinuous constituents can appear almost anywhere:
  • Temporal expressions
  • Named entities
  • Parse constituents (?)
• Want: a uniform way to handle them.
Discontinuous constituents
• Goals:
  • Make the common case easy
  • Make the uncommon case possible
• Preferred approach:
  • Add an optional property (e.g. “pieces”) that can be used to specify discontinuous chunks.
  • If used, then the start/end properties should be treated with appropriate care.
• Open question:
  • Should this property be defined on the top-level type, or on individual types (e.g. PropBankArgument)?
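One possible reading of the optional "pieces" property is sketched below in Python. Everything here is an assumption for illustration: the class name, the choice to store pieces as (start, end) pairs with exclusive ends, and in particular the decision to set start/end to the envelope of all pieces.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    start: int
    end: int
    # Optional list of (start, end) chunks; None means one contiguous span.
    pieces: Optional[List[Tuple[int, int]]] = None

    def chunks(self):
        """Common case: one chunk. Uncommon case: the explicit pieces."""
        return list(self.pieces) if self.pieces else [(self.start, self.end)]

    def covered_text(self, text):
        return " ".join(text[s:e] for s, e in self.chunks())


def discontinuous(pieces):
    """Build an annotation whose start/end are the envelope of its pieces
    (an assumed convention; other choices are possible)."""
    pieces = sorted(pieces)
    return Annotation(start=pieces[0][0],
                      end=max(e for _, e in pieces),
                      pieces=pieces)


text = "He called the meeting off."
verb = discontinuous([(22, 25), (3, 9)])   # "called ... off"
obj = Annotation(start=10, end=21)         # "the meeting"
```

Code that only reads `start` and `end` still works on `verb`, but it sees the envelope "called the meeting off", which is the sense in which those properties need care once pieces are present.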
A note on consistency
• CAS is “weakly merged”: it doesn’t enforce consistency.
• But that doesn’t mean we can’t enforce consistency ourselves.
• For weakly merged formats, it will be important to:
  • Define consistencies that we want
    • Both within annotations & between annotations
  • Actively check those consistencies during annotation.
• Weakly coupled annotations are a good thing.
• But the more weakly coupled the annotations are, the more we’ll need to check consistency.
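Two examples of checks that could be run during annotation are sketched below in Python: one within a layer (parent and child pointers agree, children nest inside parents) and one between layers (every argument span matches some constituent). The function names and the dictionary layout are hypothetical; which consistencies to require is the open design question.

```python
def check_tree(nodes):
    """Within-annotation check on a dict of
    id -> {"start", "end", "children", "parent"}. Returns problem strings."""
    problems = []
    for nid, node in nodes.items():
        for kid in node["children"]:
            child = nodes.get(kid)
            if child is None:
                problems.append(f"{nid}: missing child {kid}")
                continue
            if child["parent"] != nid:
                problems.append(
                    f"{kid}: parent is {child['parent']}, expected {nid}")
            if not (node["start"] <= child["start"]
                    and child["end"] <= node["end"]):
                problems.append(f"{kid}: span not nested inside {nid}")
    return problems


def check_arguments(arg_spans, nodes):
    """Between-annotation check: each argument span should coincide with
    the span of some constituent."""
    spans = {(n["start"], n["end"]) for n in nodes.values()}
    return [f"argument {tuple(s)} matches no constituent"
            for s in arg_spans if tuple(s) not in spans]


nodes = {
    1: {"start": 0, "end": 29, "children": [2, 3], "parent": None},
    2: {"start": 0, "end": 8, "children": [], "parent": 1},
    3: {"start": 9, "end": 28, "children": [], "parent": 2},  # wrong parent
}
tree_problems = check_tree(nodes)
arg_problems = check_arguments([(0, 8), (9, 20)], nodes)
```

The checks report problems rather than repairing them, in keeping with a weakly merged format: the layers stay independent and the inconsistencies go back to the annotators.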
Questions/discussion
• Strongly vs weakly merged
• (When) is redundancy good?
• How strongly coupled should annotations be?
• Handling discontinuous constituents?
• Where is there information overlap between annotations (e.g. coref chains)? What should be done about it?
• Any principled way to decide when to mark heads vs spans?
• Token offset vs character offset vs tree pointer