Semantic Annotation for Interlingual Representation of Multilingual Texts Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell (NMSU), Nizar Habash (Columbia), Stephen Helmreich (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Owen Rambow (Columbia), Flo Reeder (MITRE), Advaith Siddharthan (Columbia) LREC 2004 Workshop: “Beyond Named Entity Recognition: Semantic labelling for NLP tasks”
IAMTC (Interlingua Annotation of Multilingual Corpora) Project • Collaboration: • New Mexico State University • University of Maryland • Columbia University • MITRE • Carnegie Mellon University • ISI, University of Southern California
Goals of IAMTC • Interlingua design • Three levels of depth • Annotation methodology • manuals, tools, evaluations • Annotated multi-parallel texts • Foreign language original and multiple English translations • Foreign languages: Arabic, French, Hindi, Japanese, Korean, Spanish
Getting at Meaning (Two translations of a Korean original text) • Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. … The subscribers cannot switch again to another provider for the first 3 months, but they can cancel the switch in 14 days if they are not satisfied with services like voice quality. • Starting January 1st of next year customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they can cancel the change anytime within 14 days if problems such as poor call quality are experienced.
Color Key • Black: same meaning and same expression • Green: small syntactic difference • Blue: lexical difference • Red: not contained in the other text • Purple: larger difference • Need to use some inference to know that the meaning is the same
Getting at Meaning (Two translations of a Japanese original text) • This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers." • This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the "year of the merger" for all we know. • More lexical similarity. More differences in dependency relations.
Toward a ‘Theory of Annotation’ • Recently, sharp increase in number of annotated resources being built: • Penn Treebank, PropBank, many others… • For annotation, need • Theory behind the phenomena being annotated • Annotation termsets (even WordNet, FrameNet, VerbNet, HowNet…) • Standard (?) annotation corpus (same old Treebank?) • Annotation tools—they make an immense difference • Carefully considered annotation procedure (interleaving per text vs. per sentence, etc.) • Reconciliation and consistency checking procedures • Evaluation measures, appropriately defined
Corpus and Data • Initial Corpus • 10+ texts in each language • 2+ translations each into English • Interlingua designed for MT • Multiple English translations of same source show translation divergences. Some phenomena: • Lexical level: word changes • Syntactic level: phrasing, thematization, nominalization • Semantic level: additional/different content • Discourse level: multi-clause structure, anaphora • Pragmatic level: speech acts, implicatures, style, interpersonal • Causes of divergence • Genuine ambiguity/vagueness of source meaning • Translator error/reinterpretation
IL Development: Staged, deepening • IL0: simple dependency tree gives structure • IL1: semantic annotations for nouns, verbs, adjectives, adverbs, and theta roles • Not yet ‘semantic’: “buy” ≠ “sell”; many remaining simplifications • Concept ‘senses’ from ISI’s Omega ontology • Theta roles from Dorr’s LCS work • Elaborate annotation manuals • Tiamat annotation interface • Post-annotation reconciliation process and interface • Evaluation scores: annotator agreement • IL2: that comes next…
Details of IL0 • Deep syntactic dependency representation: • Removes auxiliary verbs, determiners, and some function words • Normalizes passives, clefts, etc. • Includes syntactic roles (Subj, Obj) • Construction: • Dependency parsed using Connexor for English (Tapanainen and Järvinen, 1997) • Hand-corrected • Extensive manual and instructions on the IAMTC Wiki website
Example of IL0 • Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center” • [Figure: IL0 dependency tree displayed in TrEd (Pajas, 1998)]
Example of IL0 • Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center” • IL0 nodes as (word, part of speech, syntactic relation): announced (V, Root); Mohamed (PN, Subj); Sheikh (PN, Mod); Defense_Minister (PN, Mod); who (Pron, Subj); also (Adv, Mod); of (P, Mod); UAE (PN, Obj); at (P, Mod); ceremony (N, Obj); inauguration (N, Mod)
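Because IL0 is just a labelled dependency tree, it fits in a very small data structure. The sketch below is a hypothetical illustration rather than the project's file format, and the head links are assumptions added so the example above can be printed as a tree.

```python
# A minimal sketch of an IL0-style dependency record (hypothetical format;
# the head links below are assumptions added for illustration).
from dataclasses import dataclass
from typing import Optional

@dataclass
class IL0Node:
    index: int           # node position
    word: str            # surface form (function words already removed at IL0)
    pos: str             # part of speech (V, N, PN, Pron, Adv, P, ...)
    rel: str             # syntactic relation to the head (Root, Subj, Obj, Mod)
    head: Optional[int]  # index of the governing node; None for the root

nodes = [
    IL0Node(1, "announced", "V", "Root", None),
    IL0Node(2, "Mohammed", "PN", "Subj", 1),
    IL0Node(3, "Sheikh", "PN", "Mod", 2),
    IL0Node(4, "Defense_Minister", "PN", "Mod", 2),
    IL0Node(5, "who", "Pron", "Subj", 4),
    IL0Node(6, "also", "Adv", "Mod", 4),
    IL0Node(7, "of", "P", "Mod", 4),
    IL0Node(8, "UAE", "PN", "Obj", 7),
    IL0Node(9, "at", "P", "Mod", 1),
    IL0Node(10, "ceremony", "N", "Obj", 9),
    IL0Node(11, "inauguration", "N", "Mod", 10),
]

def print_tree(nodes, head=None, depth=0):
    """Print the dependency tree with indentation showing depth."""
    for n in nodes:
        if n.head == head:
            print("  " * depth + f"{n.word} [{n.pos}/{n.rel}]")
            print_tree(nodes, n.index, depth + 1)

print_tree(nodes)
```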
Details of IL1 • Intermediate semantic representation: • Annotations performed manually by each annotator working alone • Associate open-class lexical items with Omega ontology concepts • Replace syntactic relations with one of approx. 20 semantic (theta) roles (from Dorr), e.g., AGENT, THEME, GOAL, INSTR… • No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure… • Nodes may receive more than one concept • Average: about 1.2 • Manual under development; annotation tool built
Example of IL1 Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”
Example of IL1: internal representation • The study led them to ask the Czech government to recapitalize CSA at this level.
[3, lead, V, lead, Root, LEAD<GET, GUIDE]
[2, study, N, study, AGENT, SURVEY<WORK, REPORT]
[4, they, N, they, THEME, ---, ---]
[6, ask, V, ask, PROPOSITION, ---, ---]
[9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION]
[8, Czech, Adj, Czech, MOD, CZECH~CZECHOSLOVAKIA, ---]
[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]
[12, csa, N, csa, THEME, AIRLINE<LINE, ---]
[16, at, P, value_at, GOAL, ---, ---]
[15, level, N, level, ---, DEGREE, MEASURE]
[14, this, Det, this, ---, ---, ---]
• The last three fields of each record hold the semantic role and the concept(s) assigned from the Omega ontology.
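The records above can be read back into a structured form with a few lines of code. A minimal sketch follows; the field order assumed here (token index, surface form, part of speech, lexical/sense form, semantic role, then Omega concepts) is inferred from the example and the slide's column labels, not taken from project documentation.

```python
# Minimal sketch: parse IL1 records of the form shown above into dicts.
# Field order is an assumption inferred from the example, not a documented spec.
import re

RECORD = re.compile(r"\[(.*)\]")

def parse_il1_record(line: str) -> dict:
    fields = [f.strip() for f in RECORD.search(line).group(1).split(",")]
    index, word, pos, lex, role = fields[:5]
    concepts = [c for c in fields[5:] if c and c != "---"]  # drop empty slots
    return {
        "index": int(index),
        "word": word,
        "pos": pos,
        "lex": lex,            # lexical/sense form (assumed meaning of field 4)
        "role": None if role == "---" else role,  # semantic (theta) role
        "concepts": concepts,  # Omega ontology concepts (0, 1 or 2 per node)
    }

example = "[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]"
print(parse_il1_record(example))
# {'index': 11, 'word': 'recapitalize', 'pos': 'V', 'lex': 'recapitalize',
#  'role': 'PROP', 'concepts': ['CAPITALIZE<SUPPLY', 'INVEST']}
```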
Details of IL2 – In development • Start capturing meaning: • Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION…) • Conversives (buy vs. sell) at the FrameNet level • Non-literal language usage (open the door to customers vs. start doing business) • Extended paraphrases involving syntax, lexicon, grammatical features • Possible incorporation of other ‘standardized’ notations for temporal and spatial expressions • Still excluded: • Quantification and negation • Discourse structure • Pragmatics
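To make the conversive idea concrete: both members of a pair such as buy/sell can be mapped onto one frame-like event with shared participant roles, so the two verbs no longer yield different interlingua structures. The sketch below is only illustrative; the frame name and role inventory are assumptions, not the IL2 or FrameNet inventory.

```python
# Hypothetical sketch of conversive normalization for IL2.
# Both "buy" and "sell" are mapped to one commercial-transfer event;
# the frame name and role names are illustrative, not the IL2 spec.
CONVERSIVES = {
    "buy":  {"frame": "COMMERCE_GOODS_TRANSFER",
             "roles": {"AGENT": "buyer", "SOURCE": "seller", "THEME": "goods"}},
    "sell": {"frame": "COMMERCE_GOODS_TRANSFER",
             "roles": {"AGENT": "seller", "GOAL": "buyer", "THEME": "goods"}},
}

def normalize(verb: str, args: dict) -> dict:
    """Rewrite verb-specific theta roles into shared frame roles."""
    entry = CONVERSIVES[verb]
    return {"frame": entry["frame"],
            **{entry["roles"][r]: filler for r, filler in args.items()}}

# "Mary bought a book from John" and "John sold a book to Mary"
print(normalize("buy",  {"AGENT": "Mary", "SOURCE": "John", "THEME": "book"}))
print(normalize("sell", {"AGENT": "John", "GOAL": "Mary", "THEME": "book"}))
# Both yield the same event: frame COMMERCE_GOODS_TRANSFER with
# buyer=Mary, seller=John, goods=book.
```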
Omega ontology • Single set of all semantic terms, taxonomized and interconnected (http://omega.isi.edu) • Merger of existing ontologies and other resources: • Manually built top structure from ISI • WordNet (110,000 nodes) from Princeton • Mikrokosmos (6000 nodes) from NMSU • Penman Upper Model (300 nodes) from ISI • 1 million+ instances (people, locations) from ISI • TAP domain relations from Stanford… • Undergoing constant reconciliation and pruning • Used in several past projects (metadata formation for database integration; MT; QA; summarization)
Dependency parser and Omega ontology • Dependency parser (Prague) • Omega (ISI): 110,000 concepts (WordNet, Mikrokosmos, etc.), 1.1 million instances • URL: http://omega.isi.edu
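Since WordNet supplies most of Omega's 110,000 concepts, the concept-lookup step can be approximated with NLTK's WordNet interface. This is only a stand-in for illustration: it queries plain WordNet synsets, not Omega itself.

```python
# Rough stand-in for Omega concept lookup, using NLTK's WordNet interface.
# Omega itself merges WordNet with other resources; this sketch only shows
# how candidate concepts for a lemma might be listed for an annotator.
import nltk
nltk.download("wordnet", quiet=True)   # fetch WordNet data if not present
from nltk.corpus import wordnet as wn

def candidate_concepts(lemma: str, pos: str):
    """Return (synset name, gloss) pairs an annotator could choose among."""
    wn_pos = {"N": wn.NOUN, "V": wn.VERB, "Adj": wn.ADJ, "Adv": wn.ADV}[pos]
    return [(s.name(), s.definition()) for s in wn.synsets(lemma, pos=wn_pos)]

for name, gloss in candidate_concepts("study", "N"):
    print(f"{name}: {gloss}")
```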
Tiamat: annotation interface • For each new sentence: • Step 1: find Omega concepts for objects and events (the interface displays candidate concepts) • Step 2: select event frame (theta roles)
Evaluation webpage
Evaluation • Three approaches to evaluation: • Inter-annotator agreement — completed • Sentence generation from extracted annotation structure — to be completed • Comparison of interlingual structures (graph comparisons) — not planned • Inter-annotator agreement: Is the IL sufficiently defined to permit consistent annotation? • Impacts ontology, theta-roles: coverage and precision
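The inter-annotator agreement measure can be made concrete as per-node agreement on the chosen concept between annotator pairs. The sketch below computes raw pairwise agreement over aligned nodes; the data layout and the use of raw agreement instead of a chance-corrected coefficient are assumptions for illustration.

```python
# Sketch: raw pairwise inter-annotator agreement on per-node concept choices.
# The data layout (node id -> chosen concept) and the use of raw agreement
# rather than a chance-corrected coefficient are illustrative assumptions.
from itertools import combinations

def pairwise_agreement(annotations: dict) -> dict:
    """annotations maps annotator name -> {node_id: concept}."""
    scores = {}
    for a, b in combinations(sorted(annotations), 2):
        shared = annotations[a].keys() & annotations[b].keys()
        agree = sum(annotations[a][n] == annotations[b][n] for n in shared)
        scores[(a, b)] = agree / len(shared) if shared else 0.0
    return scores

annotations = {
    "ann1": {1: "LEAD<GET", 2: "SURVEY<WORK", 9: "AUTHORITIES"},
    "ann2": {1: "LEAD<GET", 2: "REPORT",      9: "AUTHORITIES"},
}
print(pairwise_agreement(annotations))   # {('ann1', 'ann2'): 0.666...}
```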
Annotation Issues • 1. Post-annotation consistency checking • Novice annotators may make inconsistent annotations within the same text • Intra-annotator consistency checking procedure • e.g., if two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences • 2. Post-annotation reconciliation
2. Post-annotation reconciliation Question: How much can annotators be brought into agreement? • Procedure: • Annotator sees all annotations, votes Yes/Maybe/No on each • Annotators then discuss all differences (telephone conference) • Annotators then vote again, independently • We collapse all Yes and Maybe votes, compare them with No to identify all serious disagreements • Result: • Annotators derive common methodology • Small errors and oversights removed during discussion • Inter-annotator agreement improved • Serious problems of interpretation or error identified
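The vote-collapsing step is straightforward to implement: treat Yes and Maybe as acceptance and flag any item that still attracts a No. A minimal sketch, assuming one Yes/Maybe/No vote per annotator per annotation item:

```python
# Minimal sketch of the reconciliation vote collapse: Yes and Maybe count as
# acceptance; any remaining No marks a serious disagreement to be discussed.
def serious_disagreements(votes: dict) -> list:
    """votes maps annotation item -> list of 'Yes' / 'Maybe' / 'No' votes."""
    flagged = []
    for item, item_votes in votes.items():
        accepted = sum(v in ("Yes", "Maybe") for v in item_votes)
        rejected = sum(v == "No" for v in item_votes)
        if rejected > 0:          # at least one annotator still objects
            flagged.append((item, accepted, rejected))
    return flagged

votes = {
    "node 3 = LEAD<GET":   ["Yes", "Maybe", "Yes"],
    "node 2 role = AGENT": ["Yes", "No", "Maybe"],
}
print(serious_disagreements(votes))   # [('node 2 role = AGENT', 2, 1)]
```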
Annotation across Translations Question: How different are the translations? • Procedure: • Annotator sees annotations across both translations, identifies differences of form and meaning • Annotator selects ‘true’ meaning(s) • Results (work still in progress): • Impacts ontology richness/conciseness • Improvement in Interlingua representation ‘depth’ • Useful for IL2 design development • Observations: • This is very hard work • Methodology unclear: what is seen first, how to show alternatives, what to do with results…
Principal problems to date • Proper nouns • Proposed solution: automatically tag with one of 6 types (Person, Location, Org, DateTime, etc.) • Noun compounds • Alternatives: tag head only; parse and tag whole structure • Omega is too rich • Hard to distinguish one candidate concept from the others • Granularity of concept selection • Light verbs • Proposed solution: rephrase to remove the light verb where possible (“take a shower” → “shower”), though not every case can be rephrased • Vagueness and ambiguity • Annotate all plausible senses (“propose” as Urge and Suggest) • Idioms and metaphors • Proposed solution: ?
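Two of the proposed work-arounds, keeping every plausible sense on a vague node and tagging proper names with a coarse entity type, fit naturally into the node representation. A hypothetical sketch (the field names and the type list are illustrative, not the project's inventory):

```python
# Hypothetical sketch: a node may carry several plausible concepts (for vague
# or ambiguous words) and proper names carry a coarse entity type instead of
# a fine-grained Omega concept. Field names and the type list are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

NAME_TYPES = {"PERSON", "LOCATION", "ORGANIZATION", "DATE_TIME", "OTHER"}

@dataclass
class AnnotatedNode:
    word: str
    concepts: List[str] = field(default_factory=list)  # all plausible senses
    name_type: Optional[str] = None                    # set only for proper names

    def __post_init__(self):
        if self.name_type is not None and self.name_type not in NAME_TYPES:
            raise ValueError(f"unknown name type: {self.name_type}")

# "propose" annotated with both plausible senses, as suggested above
propose = AnnotatedNode("propose", concepts=["URGE", "SUGGEST"])
# a proper noun tagged only with a coarse type
dubai = AnnotatedNode("Dubai", name_type="LOCATION")
print(propose, dubai, sep="\n")
```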
Discussion and conclusion • Results are encouraging • But more work must be done to solidify them • Outcomes: how have we done? • IL design: partly, and IL2 in the works • Annotation methodology, manuals, tools, evaluations: yes • Annotated parallel texts: approx. 150 done • Six texts, two translations, 10-12 annotators • Next steps • Foreign language annotation standards and tools • Development of IL2 • Addressing coverage gaps (1/3 of open-class words marked as having no concept) • Generation of surface structure from deep structure • Is it possible?
Contact information • URLs and Wiki pages: • Project website: http://aitc.aitcnet.org/nsf/iamtc/