Word sense disambiguation and annotation transfer in parallel texts
Dan TUFIŞ 1,2, Radu ION 1, Verginica MITITELU 1
1 RACAI – Research Institute for Artificial Intelligence, Bucharest
2 University „A.I. Cuza”, Iaşi
{tufis, radu, vergi}@racai.ro
Outline of the talk
• WSD: monolingual / multilingual approaches
• WSD based on parallel texts and aligned wordnets
• Brief account of the Romanian wordnet (part of BalkaNet)
• WSDtool+ main processing steps
  • Word alignment (COWAL)
  • Sense labelling (four sense inventories: synset IDs, SUMO/MILO concepts, IRST Domains, Explanatory Dictionary of Romanian)
  • Clustering
  • Annotation generation
• Annotation import in aligned and WSDed corpora
• Conclusions and further work
WSD based on parallel texts and aligned wordnets
Our approach: a mixture of unsupervised and knowledge-based (KB) approaches with multiple processing steps:
• word alignment in parallel corpora (COWAL) and extraction of translation equivalents (translation model);
• sense labeling based on aligned wordnets (BalkaNet), which covers ~75% of the word occurrences in a corpus, using:
  • the Princeton WordNet sense inventory;
  • the SUMO/MILO ontology;
  • the IRST Domains classes;
  • EXPD labels;
• sense clustering based on translation equivalents extracted from parallel corpora (takes care of the cases left uncovered by the previous step);
• generation of the WSD annotation in the parallel corpus.
Brief account of the Romanian wordnet
Conceptually dense: for any RO-synset aligned to an ILI code (EN-synset), all the hierarchical ancestors of the EN-synset are also linked to synsets present in the RO-wordnet.
Virtually, more synsets and more literals (monosemous literals; “coverage heuristics”).
Brief account of the Romanian wordnet: structure of a synset
<SYNSET>
  <ID>ENG20-04992592-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>organ<SENSE>1</SENSE></LITERAL>
  </SYNONYM>
  <DEF>Parte a unui organism animal sau vegetal care îndeplineşte anumite funcţii</DEF>
  <STAMP>Dan Tufis</STAMP>
  <BCS>1</BCS>
  <ILR>ENG20-04919813-n<TYPE>hypernym</TYPE></ILR>
  <DOMAIN>anatomy</DOMAIN>
  <SUMO>Organ<TYPE>=</TYPE></SUMO>
</SYNSET>
Callout labels on the slide: PWN id, EXPD id, IRST id, SUMO id.
WSD MAIN STEPS
1. Word Alignment and Filtering of the Translation Equivalents
• The word alignment system (COWAL) produces N-M cross-POS alignment links with an average F-measure above 80% (F=82.52%, AER=17.48%).
• Only the translation links that preserve the major POS (V, N, A, R) are retained. In this case, F is better than 92% (F=92.04%, AER=7.96%).
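The POS-preserving filter described above can be sketched as follows. This is only an illustration, not the COWAL implementation: the tuple layout, the function names and the assumption that the first letter of a morpho-syntactic tag encodes the major POS are all choices made for this example. The tags are taken from the “1984” annotation sample in this talk.

```python
# Major POS classes named on the slide: verb, noun, adjective, adverb.
MAJOR_POS = {"V", "N", "A", "R"}

def major_pos(tag):
    """Return the first letter of a tag (assumed here to encode the POS)."""
    return tag[:1].upper()

def filter_links(links):
    """Keep only the links whose two ends share the same major POS."""
    kept = []
    for en_word, en_tag, ro_word, ro_tag in links:
        en_pos, ro_pos = major_pos(en_tag), major_pos(ro_tag)
        if en_pos == ro_pos and en_pos in MAJOR_POS:
            kept.append((en_word, en_tag, ro_word, ro_tag))
    return kept

# Hypothetical alignment links: (English word, tag, Romanian word, tag).
links = [
    ("patrols", "Ncnp", "patrulele", "Ncfpry"),  # N-N: kept
    ("matter", "Vmn", "contau", "Vmii3p"),       # V-V: kept
    ("however", "Rmp", "totuşi", "Rgp"),         # R-R: kept
    ("The", "Dd", "patrulele", "Ncfpry"),        # D-N: cross-POS, dropped
]
filtered = filter_links(links)
```

Dropping cross-POS links trades recall for precision, which is consistent with the higher F-measure reported for the filtered alignment.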
WSD MAIN STEPS
2. Sense Labeling
• Aligned wordnets (lexical ontologies)
• Conceptual knowledge structuring (upper and mid-level ontologies): SUMO/MILO
• Domain taxonomies (UDC, the librarians’ taxonomy): IRST-DOMAINS
• Explanatory Dictionary of Romanian (3 sense levels)
• Coverage heuristic: if one of the words in a translation pair is not a member of any synset in its language’s wordnet, but the other word is present in the aligned wordnet and is monosemous, the first word gets the sense of the monosemous word. If one of the languages is English, any other language can benefit from this heuristic (approx. 80% of the PWN literals are monosemous). Examples:
  hilarious <-> hilar => ENG20-01221243-a
  burp <-> râgâi => ENG20-00003374-v
  prospicience <-> clarviziune => ENG20-05469664-n
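A minimal sketch of the coverage heuristic, assuming each wordnet is a plain dictionary from literals to sets of ILI codes. The function name and the toy dictionaries are hypothetical; only the hilarious/hilar pair and the ILI codes come from the slides.

```python
def coverage_label(senses_a, senses_b):
    """Apply the coverage heuristic to one translation pair.

    senses_a and senses_b are the sets of ILI codes of the two words in
    their own wordnets (an empty set means the word is not in the wordnet).
    Returns the ILI code both words receive, or None if the heuristic
    does not apply.
    """
    if not senses_a and len(senses_b) == 1:
        return next(iter(senses_b))
    if not senses_b and len(senses_a) == 1:
        return next(iter(senses_a))
    return None

# Toy wordnets (assumption: "hilar" is not yet in the Romanian wordnet).
pwn = {
    "hilarious": {"ENG20-01221243-a"},                  # monosemous
    "lamp": {"ENG20-03500372-n", "ENG20-03500773-n"},  # polysemous
}
rown = {}

label = coverage_label(pwn.get("hilarious", set()), rown.get("hilar", set()))
no_label = coverage_label(pwn.get("lamp", set()), rown.get("hilar", set()))
```

The second call returns `None` because the known word is polysemous, so the heuristic cannot decide which sense to pass on.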
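Sense labeling with aligned wordnets: Example (I)
[Diagram: words W(i) in language L1 and w(j) in language Lk, translation equivalents (TR-EQV), linked through their wordnets WN1 and WNk (EQ-SYN relations) to the ILI and, through it, to SUMO and Domains.]
Example (I): <lamp, lampă>
PWN2.0 (lamp) = {03500372-n, 03500773-n}
RoWN (lampă) = {03500773-n, 03500872-n}
EQ-SYN: ILI = 03500773-n => <lamp(2), lampă(1)>
SUMO (03500773-n) = +Device
DOMAINS (03500773-n) = furniture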
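Example (I) amounts to intersecting the ILI sets of the two translation equivalents and reading the SUMO and DOMAINS labels off the shared code. The sketch below assumes dictionary-based lookups; the function and variable names are illustrative, while the data is copied from the example.

```python
def label_pair(en_word, ro_word, pwn, rown, sumo, domains):
    """Return one labelled record per ILI code shared by the two words."""
    shared = pwn.get(en_word, set()) & rown.get(ro_word, set())
    return [
        {"ili": ili, "sumo": sumo.get(ili), "domain": domains.get(ili)}
        for ili in sorted(shared)
    ]

pwn = {"lamp": {"03500372-n", "03500773-n"}}
rown = {"lampă": {"03500773-n", "03500872-n"}}
sumo = {"03500773-n": "+Device"}
domains = {"03500773-n": "furniture"}

labels = label_pair("lamp", "lampă", pwn, rown, sumo, domains)
```

When the intersection is empty, the list is empty and another strategy is needed to pick a sense pair.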
Sense labeling with aligned wordnets: Example (II)
[Diagram: as in Example (I), with the two wordnets WN1 and WNk linked to the ILI by EQ-SYN relations, and the ILI linked to SUMO and Domains.]
Example (II): <lamp, felinar>
PWN2.0 (lamp) = {03500372-n, 03500773-n}
RoWN (felinar) = {003505057-n}
δ (03500372-n, 003505057-n) = 0.5
δ (03500373-n, 003505057-n) = 0.125
ILI = 03500773-n => <lamp(1), felinar(1.1)>
SUMO (03500773-n) = IlluminationDevice
DOMAINS (03500773-n) = factotum
WSD MAIN STEPS
3. Word Sense Clustering
[Figure: dendrogram produced by the clustering step; the leaves carry the labels (1), (2), (3), (4) and (6).]
An agglomerative hierarchical algorithm using a vector space model, Euclidean distance and cardinality-weighted computation of the centroid (the “center of weight” of a new class). The problem of how many classes to produce, undecidable in general, gets hints from the work already done in steps 1 and 2.
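The clustering step can be sketched as below: a plain agglomerative procedure that merges the two clusters with the closest centroids (Euclidean distance) and computes the new centroid as the cardinality-weighted mean of the old ones. The vector encoding (one dimension per translation equivalent) and the stopping criterion (a target number of classes supplied by the caller) are assumptions of this sketch, not details confirmed by the slides.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def merge_centroids(c1, n1, c2, n2):
    """Cardinality-weighted centroid of two clusters of sizes n1 and n2."""
    total = n1 + n2
    return [(n1 * a + n2 * b) / total for a, b in zip(c1, c2)]

def agglomerate(vectors, target_clusters):
    """Merge the closest clusters until target_clusters remain."""
    # Each cluster is a pair (centroid, list of member indices).
    clusters = [(list(v), [i]) for i, v in enumerate(vectors)]
    while len(clusters) > max(target_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (c1, m1), (c2, m2) = clusters[i], clusters[j]
        merged = (merge_centroids(c1, len(m1), c2, len(m2)), m1 + m2)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for _, members in clusters]

# Hypothetical occurrence vectors over three translation equivalents.
vectors = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
classes = agglomerate(vectors, target_clusters=2)
```

Here `target_clusters` stands in for the hint about the number of classes that, according to the slide, comes from steps 1 and 2.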
4. WSD annotation in the parallel corpus (“1984”)
<tu id="Ozz20">
  <seg lang="en">
    <s id="Oen.1.1.4.9">
      <w lemma="the" ana="Dd">The</w>
      <w lemma="patrol" ana="Ncnp" sn="3" oc="Group" dom="military">patrols</w>
      <w lemma="do" ana="Vais">did</w>
      <w lemma="not" ana="Rmp" sn="1" oc="not" dom="factotum">not</w>
      <w lemma="matter" ana="Vmn" sn="1" oc="SubjAssesAttr" dom="factotum">matter</w><c>,</c>
      <w lemma="however" ana="Rmp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">however</w><c>.</c>
    </s>
  </seg>
  <seg lang="ro">
    <s id="Oro.1.2.5.9">
      <w lemma="şi" ana="Crssp">Şi</w>
      <w lemma="totuşi" ana="Rgp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">totuşi</w><c>,</c>
      <w lemma="patrulă" ana="Ncfpry" sn="1.1.x" oc="Group" dom="military">patrulele</w>
      <w lemma="nu" ana="Qz" sn="1.x" oc="not" dom="factotum">nu</w>
      <w lemma="conta" ana="Vmii3p" sn="2.x" oc="SubjAssesAttr" dom="factotum">contau</w><c>.</c>
    </s>
  </seg>
  …
</tu>
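The sense annotation can be read back with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the English sentence above; the function name is illustrative, and the attribute names (`sn`, `oc`, `dom`) are those visible in the sample.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the English sentence from the sample above.
SAMPLE = """<s id="Oen.1.1.4.9">
  <w lemma="the" ana="Dd">The</w>
  <w lemma="patrol" ana="Ncnp" sn="3" oc="Group" dom="military">patrols</w>
  <w lemma="do" ana="Vais">did</w>
  <w lemma="matter" ana="Vmn" sn="1" oc="SubjAssesAttr" dom="factotum">matter</w>
  <c>.</c>
</s>"""

def sense_tags(sentence_xml):
    """Return (lemma, sense number, ontology category, domain) per tagged word."""
    root = ET.fromstring(sentence_xml)
    rows = []
    for w in root.iter("w"):
        if "sn" in w.attrib:  # words without a sense tag are skipped
            rows.append((w.get("lemma"), w.get("sn"), w.get("oc"), w.get("dom")))
    return rows

rows = sense_tags(SAMPLE)
```

Each row carries the three sense inventories side by side: the sense number, the SUMO/MILO category and the IRST domain.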
Evaluation (I)
• “Lexical sample” and 1-tag annotation evaluation (with k-tag annotation, the performance would essentially be that of the filtered word alignment, i.e. 92.04%).
• 216 ambiguous English words (at least two senses per POS) with 2081 occurrences in “1984” were semantically disambiguated by three experts in terms of the PWN2.0 sense inventory. The experts negotiated all the disagreement cases, which yielded the Gold Standard annotation (GS).
  - This is the “lexical sample/lexical choice” evaluation type of SENSEVAL (much harder than “all words”, which includes monosemous words and homographs as well).
• For each PWN2.0 sense number, the GS was deterministically enriched with the SUMO category and the DOMAINS label.
• Thus, we had three sense inventories in the GS and could evaluate the system’s WSD accuracy in terms of each of them.
Evaluation (II)
• Automatic WSD was performed in three ways:
  • using only the RO-EN aligned BalkaNet wordnets (AWN)
  • combining AWN with clustering (AWN+C)
  • combining AWN+C with a simple heuristic (AWN+C+MFS)
• Out of 2081 total occurrences, 61 (34 words) could not receive a sense tag because the target literal was wrongly aligned by the translation equivalents extractor module of the WSDtool, or because it was not translated or was wrongly translated by the human translator. In these cases we used MFS, a simple heuristic that assigns the most frequent sense label (42 occurrences were correctly tagged).
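The counts above imply two figures that are easy to derive. The short calculation below uses only the numbers quoted on this slide; the variable names are illustrative, and the derived percentages are not figures reported by the authors.

```python
total_occurrences = 2081   # occurrences of the 216 target words
untagged = 61              # occurrences left without a tag before MFS
mfs_correct = 42           # MFS-tagged occurrences that match the GS

# Occurrences that received a tag before the MFS fallback.
tagged = total_occurrences - untagged

# Share of occurrences tagged before MFS, and MFS accuracy on the rest.
coverage = tagged / total_occurrences
mfs_accuracy = mfs_correct / untagged

print(f"tagged before MFS: {tagged} ({coverage:.2%})")
print(f"MFS accuracy on the remaining cases: {mfs_accuracy:.2%})")
```

So roughly 97% of the occurrences are tagged before the fallback, and MFS is right in about 69% of the remaining cases.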
Evaluation (III) WSD based on WN2.0+RoWN (PWN2.0 id)
Evaluation depends on the sense inventories
PWN2.0+RoWN (AWN+C+MFS)
Annotation import in aligned and WSDed corpora
AIM: Test whether it is possible (and to what degree) to automatically transfer syntactic relations (as they are lexicalized in a corpus) from a resource-rich language (English) into another language with fewer resources (Romanian), using parallel corpora.
Resources
• Parallel corpus: George Orwell’s 1984
  • XCES XML encoded
  • Sentence aligned (only 1-1 alignments retained)
  • Tokenized
  • Morpho-syntactically annotated
  • Chunked
  • Word aligned
• The English version, parsed with an FDG parser (Tapanainen and Järvinen, 1997)
• A set of generic rules for transferring syntactic relations from English to Romanian (hand-written, language-pair dependent)
Transfer cases
• Perfect transfer
• Transfer with amendments
• Language specific phenomena
• Impossibility of transfer
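The simplest case, perfect transfer, can be pictured as copying a relation through the word alignment when both of its ends have a Romanian counterpart. The sketch below is a simplification under stated assumptions: the relation labels, the triple layout and the dictionary-based alignment are illustrative and are not the FDG label set or the authors' rule format.

```python
def project_relations(en_relations, alignment):
    """Copy (relation, head, dependent) triples through a word alignment.

    Triples with an unaligned end are returned separately, since they
    need amendments or cannot be transferred at all.
    """
    transferred, untransferred = [], []
    for rel, head, dep in en_relations:
        if head in alignment and dep in alignment:
            transferred.append((rel, alignment[head], alignment[dep]))
        else:
            untransferred.append((rel, head, dep))
    return transferred, untransferred

# Hypothetical relations for "The patrols did not matter".
en_relations = [
    ("subj", "matter", "patrols"),
    ("det", "patrols", "The"),
]
# Hypothetical alignment; "The" has no separate Romanian token because
# the definite article is part of "patrulele".
alignment = {"matter": "contau", "patrols": "patrulele"}

transferred, untransferred = project_relations(en_relations, alignment)
```

The other three cases would be handled on top of this step, for instance by rules that relabel a relation after it has been copied.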
2. Transfer with amendments (I)
Active-passive inversion: ‘Lucrul’ is not the object of ‘sugerat’ but its subject.
2. Transfer with amendments (II)
Dummy anticipatory ‘it’: in English, ‘it’ is the subject and ‘book’ is the complement, yet in Romanian ‘carte’ is the subject.
3. Language Specific Phenomena • Pro-drop phenomenon
4. Impossibility of transfer (I) • Equivalent verbs with different syntactic behaviour: ‘like’ – ‘plăcea’
For details, see: Verginica Mititelu, Radu Ion: Cross-Language Transfer of Syntactic Relations Using Parallel Corpora, 2005
Conclusions (I)
• WSD results from different experiments are hard to compare when sense inventories of different granularity are used (PWN2.0: 115424 meanings vs. SUMO-MILO: 2066 categories vs. IRST DOMAINS: 163).
• Considering the fine granularity of the WSD annotation, our results are superior to those reported by the few researchers who used the same sense inventory (e.g. G. Rigau and his colleagues). This is not surprising: most WSD experiments were carried out in monolingual environments, whereas word alignment reveals the mental lexicon used by professional human translators in parallel texts and as such is an invaluable knowledge source.
Conclusions (II)
• One of the greatest advantages of applying such methods to parallel data: they may be used to automatically sense-tag corpora not in one language only, but in several at once.
• The automatic procedure for transferring syntactic relations (as shown in the Romanian experiment) is reliable, provided that all the necessary resources are present with the required level of annotation.
• Language-specific structures and grammatical phenomena require pre- and post-processing of the data.
• Given that syntactic relations can be imported, we started experiments on importing FrameNet annotations from English into Romanian.
Further work
• Further development of the Romanian wordnet
  • about 15,000 “hard synsets”
  • about 70,000 “easy synsets” (monosemous and/or mono-literals)
• Development of new word-aligned multilingual corpora (including AqC21), improving current translation models and building new ones (see the XCES sample of the EN-RO AqC: 80 docs, tagged, lemmatised, chunk-parsed, word-aligned, WSDed, approx. 1MB)
• Extending the annotation import experiment and improving the rule set governing the transfer:
  • FrameNet project
  • Dependency grammar for Romanian
• Development of a multilingual SMT system (we started En-Ro experiments with very encouraging results!)
Recent papers with details for this talk
• Dan Tufiş, Verginica Mititelu, Luigi Bozianu, Cătălin Mihăilă: Romanian WordNet: New Developments and Applications. In Piek Vossen and Christiane Fellbaum (eds.) Proceedings of the 3rd International Wordnet Conference, Jeju Island, Korea, South Jeju, January 2006, 10 pages
• Dan Tufiş, Radu Ion, Alexandru Ceauşu, Dan Stefănescu: Combined Aligners. In Proceedings of the ACL2005 Workshop on “Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond”, Ann Arbor, Michigan, June 2005, Association for Computational Linguistics, pp. 107-110
• Verginica Mititelu, Radu Ion: Cross-Language Transfer of Syntactic Relations Using Parallel Corpora. In Diana Inkpen & Carlo Strapparava (eds.) Proceedings of the Workshop on Cross-Language Knowledge Induction, 2-4 August, Cluj-Napoca, 2005, pp. 46-51, ISBN 973-703-139-9
• Dan Tufiş, Radu Ion: Evaluating the word sense disambiguation accuracy with three different sense inventories. In Proceedings of the Natural Language Understanding and Cognitive Systems Symposium, Miami, Florida, May 2005, pp. 118-127, ISBN 972-8865-23-6
• Dan Tufiş, Radu Ion, Nancy Ide: Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets. In Proceedings of the 20th International Conference on Computational Linguistics, COLING2004, Geneva, 2004, pp. 1312-1318, ISBN 1-9324432-48-5
• Dan Tufiş: Word Sense Disambiguation: A Case Study on the Granularity of Sense Distinctions. In WSEAS Transactions on Information Science and Applications, vol. 2, no. 2, February 2005, pp. 183-188, ISSN 1790-0032
• Dan Tufiş, Eduard Barbu: A Methodology and Associated Tools for Building Interlingual Wordnets. In Proceedings of the 5th LREC Conference, Lisbon, 2004, pp. 1067-1070
• Dan Tufiş, Dan Cristea, S. Stamou: BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. In Romanian Journal on Information Science and Technology, Dan Tufiş (ed.) Special Issue on BalkaNet, Romanian Academy, vol. 7, no. 2-3, 2004, pp. 9-34, ISSN 1453-8245
• Dan Tufiş, Eduard Barbu, Verginica Mititelu, Radu Ion, Luigi Bozianu: The Romanian Wordnet. In Romanian Journal on Information Science and Technology, Dan Tufiş (ed.) Special Issue on BalkaNet, Romanian Academy, vol. 7, no. 2-3, 2004, pp. 105-122, ISSN 1453-8245
• Dan Tufiş, Ana Maria Barbu, Radu Ion: Extracting Multilingual Lexicons from Parallel Corpora. Computers and the Humanities, Volume 38, Issue 2, May 2004, pp. 163-189, ISSN 0010-4817