A Knowledge-Level Type System for UIMA
Technical Presentation
Semantic Analysis & Integration, IBM Watson (Hawthorne)
Last Update: 5/17/06
Project Coordinators: J. William Murdock, Christopher Welty, David Ferrucci
Other Project Participants: Jennifer Chu-Carroll, Krzysztof Czuba, Edward Epstein, Bhavani Iyer, Adam Lally, John Lenchner, Anthony Levas, Mary Neff
Part 1 Background and Main Ideas
Goal
• Representing knowledge-level information.
  • Many common analysis processes involve entities and relations.
  • For such processes to interoperate, they need a common representation for this information.
• This mechanism should be:
  • Compatible with UIMA's existing annotation-level type system (TOP, Annotation, primitive types)
  • Application neutral
  • Capable of supporting provenance, i.e., of encoding how entities and relations were obtained
Processes to Support
• Some common text analysis processes:
  1. Entity Recognition (annotating a span with an entity type)
  2. Relation Recognition (annotating a span with a relation type)
  3. Relation Annotation Argument Identification (connecting a relation annotation to annotations of arguments)
  4. Entity Identification (deciding which entity an annotation refers to)
  5. Relation Identification (deciding which relation an annotation refers to)
  6. Entity Classification (determining the type of an entity)
  7. Canonical Form Identification (determining the most standard name)
  8. Variant Form Identification (determining all names)
• The current UIMA upper-level type system is well suited to handling annotation processes (1 & 2).
• It does not have a standard representation for assigning arguments to relation annotations (3), performing coreference analysis (4 & 5), classification (6), or naming (7 & 8).
Example Content
[Figure. Key: Annotation Layer, Referent Layer.
Referent layer: Entity:Person, Relation:OwnerOf, Entity:Organization.
Annotation layer, over three text fragments:
• "Joseph Gradgrind is the owner of Gradgrind Foods.": Person (Entity Annotation), OwnerOf (Relation Annotation), Organization (Entity Annotation)
• "Gradgrind, owner of GF, ...": Person (Entity Annotation), OwnerOf (Relation Annotation), Organization (Entity Annotation)
• "Mr. Gradgrind is ...": Person (Entity Annotation)]
Example Content: Detail
[Figure. Key: Annotation Layer, Referent Layer, Reified Links.
Text: "Joseph Gradgrind is the owner of Gradgrind Foods." and "Mr. Gradgrind is ..."
• Referent layer: Relation:OwnerOf is connected to Entity:Person and Entity:Organization by Argument links, each with a from end and a to end.
• HasOccurrence links (from a referent, to an annotation) connect each referent to the Person, Organization, and OwnerOf annotations that are its occurrences.
• The OwnerOf relation annotation has a relationArgs feature holding a BinaryRelationArgs (RelationArgs), whose domain and range features point to entity annotations.
• Each annotation has a span over the text.]
Type Hierarchy
[Figure. Key: Annotation Layer, Referent Layer, Reified Links, Existing Type.]
uima.cas.TOP (existing type)
  uima.tcas.Annotation (existing type)
    EntityAnnotation
    RelationAnnotation
  Referent
    Entity
    Relation
  Link
    HasOccurrence
    HasEvidence
    Argument
  RelationArgs
    BinaryRelationArgs
    GenericRelationArgs
Simplified UML Class Diagram
[Figure. Key: Annotation Layer, Referent Layer, Reified Links.]
uima.cas.TOP
  Referent
    • componentId : String
    • classes : String [*]
    • canonicalForm : String
    • variantForms : String [*]
    • id : String
    Subclasses: Entity, Relation
  uima.tcas.Annotation
    • begin : int
    • end : int
    EntityAnnotation
      • componentId : String
      • mentionType : String
    RelationAnnotation
      • componentId : String
Associations labeled Argument [*] and HasOccurrence [*] connect the referent and annotation classes.
Full UML Class Diagram
[Figure. Key: Annotation Layer, Referent Layer, Reified Links.]
uima.cas.TOP
  Referent (links : Link [*])
    • componentId : String
    • classes : String [*]
    • canonicalForm : String
    • variantForms : String [*]
    • id : String
    Subclasses: Entity, Relation
  uima.tcas.Annotation
    • begin : int
    • end : int
    EntityAnnotation (links : Link [*])
      • componentId : String
      • mentionType : String
    RelationAnnotation (links : Link [*]; predicate; relationArgs : RelationArgs)
      • componentId : String
  Link (from, to)
    • componentId : String
    Subclasses: HasOccurrence; Argument (• role : String)
  RelationArgs
    Subclasses: BinaryRelationArgs; GenericRelationArgs (args [*])
Links Among Types
[Figure. Key: Annotation Layer, Referent Layer, Reified Links.
• Relation to Entity: Argument (role)
• Relation to Relation Annotation: HasOccurrence
• Entity to Entity Annotation: HasOccurrence
• Relation Annotation to Entity Annotation: RelationArgs (role)]
Part 2 Open Technical Issues
Links between Elements
• Issue: Some connections between elements need to be:
  • Bidirectional
  • Many-to-many
  • Capable of being marked up with additional information (e.g., the ID of the component that determined that this link exists)
• Proposed mechanism:
  • There is a Link class and a set of subclasses for different kinds of links (e.g., HasOccurrence, Argument, HasEvidence).
  • Each element that can be linked has a list of links.
  • When a Link is created, it points to both elements, and it is added to the link list of both elements.
• Currently we do not use the link type for relation annotation arguments even though we use it for relation arguments (etc.). Revisit?
• Another alternative would be for UIMA to supply us with bidirectional, many-to-many, annotatable features.
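The proposed mechanism can be sketched in plain Java. This is only an illustration under stated assumptions, not the project's CAS types: Element, attach, LinkSketch, and the component ID string are hypothetical names, and ordinary Java lists stand in for CAS list features.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for any linkable element (Referent, EntityAnnotation, ...). */
class Element {
    final String name;
    final List<Link> links = new ArrayList<>(); // every linkable element keeps a list of links

    Element(String name) { this.name = name; }
}

/** A reified link: an object in its own right, so it can carry extra information. */
class Link {
    final Element from;
    final Element to;
    final String componentId; // ID of the component that decided this link exists

    Link(Element from, Element to, String componentId) {
        this.from = from;
        this.to = to;
        this.componentId = componentId;
    }

    /** Adds the link to the link list of both endpoints, so it can be followed either way. */
    static <T extends Link> T attach(T link) {
        link.from.links.add(link);
        link.to.links.add(link);
        return link;
    }
}

/** One kind of link; Argument and HasEvidence would be declared the same way. */
class HasOccurrence extends Link {
    HasOccurrence(Element from, Element to, String componentId) {
        super(from, to, componentId);
    }
}

class LinkSketch {
    public static void main(String[] args) {
        Element person = new Element("Entity:Person");
        Element mention1 = new Element("Joseph Gradgrind");
        Element mention2 = new Element("Mr. Gradgrind");
        Link.attach(new HasOccurrence(person, mention1, "hypotheticalCorefComponent"));
        Link.attach(new HasOccurrence(person, mention2, "hypotheticalCorefComponent"));
        // One entity, many occurrences; each mention can also reach the entity through its own list.
        for (Link link : person.links) {
            System.out.println(link.from.name + " -> " + link.to.name + " [" + link.componentId + "]");
        }
    }
}
```

Because the link is an object rather than a plain feature, it is reachable from both ends, one element can take part in any number of links, and the componentId travels with the link itself.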
Type-specific entity properties (1)
• Some types of entities have specific properties commonly associated with them.
  • e.g., Person: gender, first name, last name, title, etc.
• It is possible to make these features in an annotation type.
  • This does not provide a place to put conclusions about the entity that these annotations are occurrences of
  • Especially when there is disagreement within the annotations
[Figure: three Person annotations and the entity they are occurrences of.
• "Joe Gradgrind": Person, gender=?, firstName=Joe, lastName=Gradgrind
• "Mr. Gradgrind": Person, gender=M, firstName=?, lastName=Gradgrind
• "Joseph": Person, gender=M, firstName=Joseph, lastName=?
• Entity: class=Person, gender=M, firstName=Joseph, lastName=Gradgrind]
Type-specific entity properties (2)
• We could subclass Entity
  • You can double the entire type system (awkward, complex integrity constraint, may be tricky to do the mapping, no support for multiple types)
  • You can use subclasses of Entity only for those entities that require additional properties (more compact, but with the above limitations, and requires standardization effort)
• We could put annotations in the classes feature of Referent
  • A coreference component could choose an existing representative annotation or create a new "prototypical" annotation.
• We could use existing mechanisms (relations, links, etc.)
  • E.g., gender encoded by Male and Female unary relations; first name + last name as subclasses of HasEvidence links. (awkward)
• We could have a generic "other features" container in Entity (or Referent)
  • Duplicates the work done in the design of CAS feature structures
  • Makes these features undiscoverable from metadata
Alias Relationship
e.g., "Muhammed bin Harazi entered the USA in March, 1993 and uses the alias Abdul Ramazi."
• Proposal 1: One entity, two HasOccurrence links
  • Loses some info, i.e., that one is the real name and the other is a fake name
  • Proposal 1.1: Canonical form of entity holds the "real" name
    • Consider this artificial example: "Muhammed bin Harazi entered the USA in March, 1993. Bin Harazi uses the alias Abdul Ramazi." There is clearly a difference between "Bin Harazi" and "Abdul Ramazi" as non-canonical variants.
  • Proposal 1.2: HasAliasOccurrence subclass of HasOccurrence
  • Proposal 1.3: New mention type for alias
• Proposal 2: Separate entities, with an alias relation
  • Violates the "semantics" of entities and relations (e.g., it implies that Bin Harazi and Ramazi are separate entities)
  • Limits usefulness of fact search over CAS information (maybe we should only be doing fact search over integrator results, anyway?)
  • Makes it easier for downstream components to retract an alias
    • Particularly if we don't want to support retracting arbitrary coreference
Modality of Annotations (1)
• Idea 1: Subclassing (i.e., rCAS draft standard) [current solution]
  • EntityAnnotation and RelationAnnotation both extend uima.tcas.Annotation
  • Each entity or relation annotation type extends EntityAnnotation or RelationAnnotation
  • Problem: TCAS-specific; does not handle non-text annotations
• Idea 2: Multiple Inheritance
  • For text, types (e.g., Person, At) extend uima.tcas.Annotation and either EntityAnnotation or RelationAnnotation
  • For other modalities, types extend that modality's annotation type and either EntityAnnotation or RelationAnnotation
  • Doesn't work: Java and the CAS don't support multiple inheritance.
• Idea 3: Entity/Relation Interfaces
  • Have EntityAnnotation and RelationAnnotation interfaces
  • Types extend the modality's annotation type and implement the appropriate interface
  • Doesn't work: JCasGen does not allow interfaces.
    • Persuade UIMA architects/developers to change this?
    • Work around JCasGen? E.g., hand-code JCas classes or use CAS but not JCas?
Modality of Annotations (2)
• Idea 4: Modality Interfaces
  • Persuade UIMA architects/developers to alter mode-specific annotations such that they are interfaces, not classes
    • Fairly radical change
    • Might be handy for multimodal CASes anyway
    • E.g., could define a generic Person annotation type and then mode-specific subclasses of that type.
  • Problem: Interfaces can't contain any implementation
    • Need methods, e.g., uima.tcas.Annotation.getCoveredText()
    • Some unpleasant workarounds (static methods, methods inserted by JCasGen)
  • Would have EntityAnnotation and RelationAnnotation classes
    • Each type would extend one of those classes and implement a mode-specific interface
Modality of Annotations (3)
• Idea 5: Distinct, unconnected types for each modality
  • e.g., EntityTextAnnotation extends uima.tcas.Annotation
  • Problem: Links from other levels must now refer to some vague superclass: java.lang.Object, uima.cas.TOP, or some (to be created?) mode-independent annotation superclass
  • Problem: Can't build mode-independent code that manipulates annotations and accesses internal structure.
• Idea 6: Entity/relation field in annotation
  • By convention, every annotation in a KR CAS would have a feature knowledgeType with three possible values: ENTITY, RELATION, and OTHER
  • Roughly the same problems as above, but arguably less messy and perhaps slightly better with a mode-independent annotation superclass
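Idea 6, in its variant with a mode-independent annotation superclass, might look roughly like the following plain-Java sketch. Every class here (ModeIndependentAnnotation, TextAnnotationSketch, AudioAnnotationSketch, KnowledgeTypeSketch) is hypothetical and is not an existing UIMA type; only the knowledgeType feature and its three values come from the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** The three values proposed for the knowledgeType feature. */
enum KnowledgeType { ENTITY, RELATION, OTHER }

/** Hypothetical mode-independent annotation superclass; not an existing UIMA type. */
abstract class ModeIndependentAnnotation {
    final String typeName;             // e.g., "Person", "OwnerOf"
    final KnowledgeType knowledgeType; // set by convention on every annotation

    ModeIndependentAnnotation(String typeName, KnowledgeType knowledgeType) {
        this.typeName = typeName;
        this.knowledgeType = knowledgeType;
    }
}

/** Hypothetical text annotation: a character span. */
class TextAnnotationSketch extends ModeIndependentAnnotation {
    final int begin;
    final int end;

    TextAnnotationSketch(String typeName, KnowledgeType knowledgeType, int begin, int end) {
        super(typeName, knowledgeType);
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical audio annotation: a time interval in milliseconds. */
class AudioAnnotationSketch extends ModeIndependentAnnotation {
    final long startMs;
    final long endMs;

    AudioAnnotationSketch(String typeName, KnowledgeType knowledgeType, long startMs, long endMs) {
        super(typeName, knowledgeType);
        this.startMs = startMs;
        this.endMs = endMs;
    }
}

class KnowledgeTypeSketch {
    /** Mode-independent code: selects annotations by knowledgeType without knowing the modality. */
    static List<ModeIndependentAnnotation> select(
            List<? extends ModeIndependentAnnotation> all, KnowledgeType wanted) {
        List<ModeIndependentAnnotation> result = new ArrayList<>();
        for (ModeIndependentAnnotation a : all) {
            if (a.knowledgeType == wanted) {
                result.add(a);
            }
        }
        return result;
    }
}
```

The selection code can read the shared field, but anything modality-specific (spans, time intervals) still requires a downcast, which is the residual problem the slide points to.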
Encoding Provenance
• Approach 1: No provenance goes into the type system. Instead, CAS handling mechanisms store provenance in system-controlled features
  • Pro: No effort from writers of CAS Processors (annotators, etc.)
  • Con: Major internal UIMA change
• Approach 2: No provenance goes into the CAS. Instead, the CPM stores provenance in separate data structures (e.g., each intermediate CAS or a change log)
  • Pro: No effort from writers of CAS Processors (annotators, etc.); simpler than Approach 1
  • Con: Requires much space and time; omits instance-level dependencies
• Approach 3: Every element in the CAS has a single "Component ID" that it came from
  • Pro: Very simple
  • Con: Omits instance-level dependencies; encodes creating or modifying but not both
• Approach 4: Every element in the CAS has a sequence of "Component IDs" that it came from
  • Con: Omits instance-level dependencies
• Approach 5: Every element in the CAS has a "Component ID" that it came from and a set of links to elements that influenced its existence [current solution]
  • E.g., entities have links not only to those spans that mention them but also to annotations that are occurrences of them
  • Pro: Very general
  • Con: Encodes creating but not modifying
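The last approach (one component ID per element plus links to the elements that influenced it) can be illustrated with a small plain-Java sketch. ProvenancedElement, influencedBy, and contributingComponents are hypothetical names, and following the influence links transitively is an assumption of this sketch rather than something the slides specify.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical CAS element carrying the two provenance features described above. */
class ProvenancedElement {
    final String label;
    final String componentId; // component that created this element
    final List<ProvenancedElement> influencedBy = new ArrayList<>(); // instance-level dependencies

    ProvenancedElement(String label, String componentId) {
        this.label = label;
        this.componentId = componentId;
    }

    /** Collects the IDs of every component this element depends on, directly or transitively. */
    Set<String> contributingComponents() {
        Set<String> ids = new LinkedHashSet<>();
        collect(this, ids, new HashSet<ProvenancedElement>());
        return ids;
    }

    private static void collect(ProvenancedElement element, Set<String> ids,
                                Set<ProvenancedElement> seen) {
        if (!seen.add(element)) {
            return; // already visited; guards against cyclic influence links
        }
        ids.add(element.componentId);
        for (ProvenancedElement source : element.influencedBy) {
            collect(source, ids, seen);
        }
    }
}
```

For example, an entity created by a coreference component and influenced by annotations from two annotators would report all three component IDs. As the slide notes, a single componentId records who created an element but not who later modified it.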
Cross-Document Coreference Resolution
• Coreference Resolution takes as input individual single-document CASes at the SofA and Annotation layers
• Its output should be one corpus-wide CAS at the Annotation layer only
  • Annotations are tied to the SofA layer; e.g., uima.tcas.Annotation.getCoveredText()
• Solution 1: Don't encode cross-document coreference in CAS
• Solution 2: Document ID for entity and relation annotation
  • Can copy an entity or relation annotation into a document-free CAS unchanged.
  • Probably interacts in a complicated way with the six proposals for encoding multimodal annotations (earlier slides)
  • E.g., something like uima.tcas.Annotation.getCoveredText() will probably need to be reimplemented (or dropped) for the cross-document CAS
• Solution 3: Cross-document coreference handled by having document CASes within a cross-document CAS
  • Unclear whether our type system would support that
• Solution 4: Employ the upcoming standard for CASes with multiple SofAs.
  • More investigation is required to judge feasibility.
[Callout: Current solution]
Mention Types
• Many collection processing components have "mention types"
  • ACE format: NAME, NOMINAL, PRONOMINAL
  • EAnnotator: NAME, NOMINAL, PRONOMINAL, TITLE (and others?)
  • TAF/Talent encodes this information in the type system (e.g., PersonName as a type of annotation)
• Can be very important for coreference resolution
• We should probably encode it
  • Approach 1: Fixed, numbered set of types
    • Pro: Facilitates integration
    • Con: Restrictive
  • Approach 2: String "mention type" field in annotations [current solution]
    • Pro: Flexible
    • Con: Requires consensus to integrate
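The trade-off between the two approaches can be made concrete with a short Java sketch. AceMentionType, MentionTypeSketch, and toFixed are hypothetical names; the fixed set is assumed to be the three ACE values listed above.

```java
import java.util.Locale;
import java.util.Optional;

/** Approach 1: a fixed set, here assumed to be the three ACE mention types. */
enum AceMentionType { NAME, NOMINAL, PRONOMINAL }

class MentionTypeSketch {
    /**
     * Maps an Approach 2 free-form string onto the fixed set.
     * Returns empty when the string has no counterpart in the fixed set.
     * Assumes mentionType is non-null.
     */
    static Optional<AceMentionType> toFixed(String mentionType) {
        try {
            String normalized = mentionType.trim().toUpperCase(Locale.ROOT);
            return Optional.of(AceMentionType.valueOf(normalized));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"NAME", "nominal", "TITLE", "PRONOUN"}) {
            System.out.println(s + " -> " + toFixed(s));
        }
    }
}
```

"TITLE" cannot be represented in the fixed set (Approach 1 is restrictive), while a component that writes "PRONOUN" where another expects "PRONOMINAL" shows why the string field (Approach 2) needs consensus before components can integrate.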
Common Supertype for Knowledge-Level Types
• It might make sense to have a common supertype for all knowledge-level types
  • Conceptually, they are all related.
  • Pragmatically, they share some common features (all have componentId, most have links); code which manipulates these features would benefit from a common type for these instances.
• However, EntityAnnotation and RelationAnnotation are both already subtypes of Annotation, and there is no multiple inheritance in the CAS.
• One way to handle this issue would be to add the common features to TOP (they seem appropriate there).
  • Would this introduce incompatibilities with earlier UIMA releases?
Referent Identification and Coreference Resolution
• Referent Identification: a span refers to a particular concept (entity, relation, etc.)
  • Span-concept bindings
  • The main link between the annotation level and the knowledge level
• Coreference Resolution: a set of spans refers to the same concept
  • Span-span bindings
  • Exist within the annotation level
  • We want to support this across documents, however
• Coreference can be inferred from referent identification
  • Given span-concept bindings, infer a span-span binding for each pair that shares a concept
  • Thus we can represent only referent identification [current solution]
• However:
  • They are typically produced by different kinds of processes
  • Coreference resolution naturally occurs at the annotation & text layers. Forcing it to interact with a higher layer is problematic.
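The inference from span-concept bindings to span-span bindings can be sketched as follows. This is a simplified illustration: spans and concepts are represented by plain strings (an assumption, so two spans with identical text would collide), and CorefInference, groupByConcept, and corefPairs are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CorefInference {
    /** Groups spans by the concept they are bound to (referent identification output). */
    static Map<String, List<String>> groupByConcept(Map<String, String> spanToConcept) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> binding : spanToConcept.entrySet()) {
            groups.computeIfAbsent(binding.getValue(), k -> new ArrayList<>())
                  .add(binding.getKey());
        }
        return groups;
    }

    /** Emits one span-span binding for each unordered pair of spans that share a concept. */
    static List<String[]> corefPairs(Map<String, String> spanToConcept) {
        List<String[]> pairs = new ArrayList<>();
        for (List<String> spans : groupByConcept(spanToConcept).values()) {
            for (int i = 0; i < spans.size(); i++) {
                for (int j = i + 1; j < spans.size(); j++) {
                    pairs.add(new String[] {spans.get(i), spans.get(j)});
                }
            }
        }
        return pairs;
    }
}
```

A concept with n occurrences yields n(n-1)/2 pairs, so storing only the span-concept bindings is the more compact representation; the pairs can be derived on demand.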
Part 3 Preliminary implementation in support of NIMD and other projects [using old ".cts" notation]
Implementation: Entity Annotations
public casType EntityAnnotation extends Annotation {
    FSList<Link> links;
    String componentId;
    String mentionType;   // e.g., "NAME", "PRONOUN"
    {
        public String toString() {}
    }
}
Slide notes:
• Can/should we move this into the CAS infrastructure?
• From ACE. Better name?
Implementation: Relation Annotations
casType RelationAnnotation extends Annotation {
    FSList<Link> links;
    String componentId;
    Annotation predicate;        // predicate mention, may be null
    RelationArgs relationArgs;   // collection of arguments
    {
        public String toString() {}
    }
}
casType RelationArgs {
    String componentId;
}
casType BinaryRelationArgs extends RelationArgs {
    Annotation domain;   // aka "arg1"
    Annotation range;    // aka "arg2"
    {
        public String toString() {}
    }
}
Slide note:
• Not used in NIMD
Implementation: Referents
public casType Referent {
    FSList<Link> links;
    StringArray types;
    String componentId;
    String id;
    {
        public java.util.LinkedList getOccurrences() {}
        public java.util.LinkedList getEvidence() {}
        public void addOccurrence(TOP o, String compId) {}
        public void addEvidence(TOP o, String compId) {}
        public void setType(String typeName) {}
        public boolean hasType(String typeName) {}
        public String toString() {}
    }
}
Slide notes:
• Should be StringList?
• Do we really want to allow multiple types? Should we encode type by subtyping Entity/Relation? By pointing to "reference annotations"?
Implementation: Entities and Relations
public casType Entity extends Referent {}
public casType Relation extends Referent {
    {
        public java.util.LinkedList getArgumentLinks() {}
        public void addArgument(TOP o, String role, String compId) {}
        public String toString() {}
    }
}
Implementation: Reified Links Class
public casType Link {
    TOP from;
    TOP to;
    String componentId;
    {
        public static java.util.LinkedList getFromValues(FSList links, Class linkClass) {}
        public static java.util.LinkedList getToValues(FSList links, Class linkClass) {}
        public static FSList getLinks(TOP obj) {}
        public static void setLinks(TOP obj, FSList links) {}
        public static Link addLink(TOP from, TOP to, Class linkClass, String compId) {}
        public static void addLink(TOP obj, Link link) {}
        public String toString() {}
    }
}
Slide note:
• Static Link methods should really be non-static methods for TOP or some common supertype of EntityAnnotation, Referent, etc.
Implementation: Reified Links Subclasses
public casType HasOccurrence extends Link {
    // from Entity to EntityAnnotation
    // OR from Relation to RelationAnnotation
}
public casType HasEvidence extends Link {
    // from EntityAnnotation or RelationAnnotation to Entity or Relation
    // OR from Instance or Assertion to Entity or Relation
}
public casType Argument extends Link {
    String role;
    // from Relation to Entity or Relation
    // OR from Assertion to Instance or Assertion
}
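The method bodies on the Link slide are left empty, so the following is only a guess at how a helper in the spirit of getToValues could filter a link list by link subclass. It uses plain Java collections instead of FSList and TOP, and LinkedObject, LinkStub, HasOccurrenceStub, HasEvidenceStub, and LinkQuerySketch are hypothetical stand-ins, not the project's classes.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical stand-in for TOP: any object that owns a list of links. */
class LinkedObject {
    final String name;
    final List<LinkStub> links = new ArrayList<>();

    LinkedObject(String name) { this.name = name; }
}

/** Hypothetical stand-in for Link. */
class LinkStub {
    final LinkedObject from;
    final LinkedObject to;
    final String componentId;

    LinkStub(LinkedObject from, LinkedObject to, String componentId) {
        this.from = from;
        this.to = to;
        this.componentId = componentId;
    }
}

class HasOccurrenceStub extends LinkStub {
    HasOccurrenceStub(LinkedObject from, LinkedObject to, String componentId) {
        super(from, to, componentId);
    }
}

class HasEvidenceStub extends LinkStub {
    HasEvidenceStub(LinkedObject from, LinkedObject to, String componentId) {
        super(from, to, componentId);
    }
}

class LinkQuerySketch {
    /** Returns the "to" end of every link in the list that is an instance of linkClass. */
    static LinkedList<LinkedObject> getToValues(List<? extends LinkStub> links,
                                                Class<? extends LinkStub> linkClass) {
        LinkedList<LinkedObject> result = new LinkedList<>();
        for (LinkStub link : links) {
            if (linkClass.isInstance(link)) {
                result.add(link.to);
            }
        }
        return result;
    }
}
```

Under this reading, Referent.getOccurrences() would amount to a getToValues call on the referent's links with the HasOccurrence class, and getEvidence() to a corresponding call with HasEvidence.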
Part 4 Discussion
Interconnected Efforts
• Semantic Integration
  • Maps knowledge-level results into external (domain) ontologies
• Inference Tracking
  • Taxonomy of inferences that are supported by UIMA-based extraction and a mechanism for tracking those inferences
  • Outputs can be sent to Inference Web, etc.
  • Type system should store the content needed for tracking and use the same vocabulary.
• Extracted Knowledge Database (EKDB)
  • Database for storing entities and relations
  • Type system should be sufficient to populate all of EKDB
  • May also contain additional content for next-generation EKDB
• Avatar project at Almaden
  • Automatically building a database schema from a type system and populating that schema from a CAS.
  • Knowledge TS should also be compatible with that effort.
    • Trivial constraint? (i.e., the Almaden system should take any CAS)
    • More investigation is required
Summary
• We propose a knowledge-level type system to encode:
  • Entity Annotations
  • Relation Annotations
  • Entities
  • Relations
  • and links among these
• This proposal would facilitate interoperation among UIMA-compliant components that perform knowledge-level reasoning
  • e.g., relation annotators, coreference resolvers, entity classifiers