Morphosyntactic correspondence: a progress report on bitext parsing

  1. Morphosyntactic correspondence: a progress report on bitext parsing Alexander Fraser, Renjing Wang, Hinrich Schütze Institute for NLP INFuture2009: Digital Resources and Knowledge Sharing Nov 4th 2009, Zagreb University of Stuttgart

  2. Outline • The Institute for Natural Language Processing at the University of Stuttgart • Bitext parsing • Using morphosyntactic correspondence

  3. IfNLP Stuttgart • The Institute for Natural Language Processing (IfNLP/IMS) at the University of Stuttgart • Dogil (Phonetics and Speech) • Large department • Kuhn/Rohrer (LFG syntax and semantics) • Cahill (LFG generation) • Heid (Terminology extraction, morphology) • Padó (Semantics, lexical semantics) • Schütze (Statistical NLP and Information Retrieval) • More on next slide

  4. IfNLP – Statistical NLP Group • Hinrich Schütze (director since 2004) • Bernd Möbius – Speech recognition and synthesis • Helmut Schmid – Parsing, morphology (known for TreeTagger, BitPar) • Sabine Schulte im Walde – NLP and cognitive modeling of lexical semantics • Michael Walsh – Speech, exemplar-theoretic syntax • Alex Fraser – Statistical machine translation, parsing, cross-lingual information retrieval • General department areas of research • New statistical NLP models and methods • Semi-supervised and active learning • Cognitive/linguistic representation models • Applied to: NLP, retrieval, MT, speech, e-learning, …

  5. IfNLP - Partnerships • Partnerships • Stuttgart: large projects with linguistics, computer science, EE signal processing, high performance computing • Germany: Darmstadt, Tübingen, DSPIN/CLARIN consortium (UIMA-based German processing) • International: large French-led European project (6 universities, 4 industrial partners), collaborations on South African languages, Edinburgh, CLARIN • Industrial: various projects with publishers (many focusing on terminology)

  6. Outline • The Institute for Natural Language Processing at the University of Stuttgart • Bitext parsing • Using morphosyntactic correspondence

  7. What is bitext parsing? • Bitext: a text and its translation • Sentences and their translations are aligned • Sometimes called a parallel corpus • Syntactic parsing: automatically find the syntactic structure of a sentence (syntactic parse) • Bitext parsing: automatically find the syntactic structure of the parallel sentences in a bitext • We will use the complementarity of the syntax of the two languages to obtain improved parses
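
As a concrete illustration of what a word-aligned sentence pair in a bitext looks like as data, here is a minimal sketch; the sentence pair and the alignment are invented for illustration and are not taken from any actual corpus.

```python
# Illustrative only: a tiny aligned sentence pair, not actual corpus data.
sentence_pair = {
    "en": ["the", "woman", "had", "a", "dog"],
    "de": ["die", "Frau", "hatte", "einen", "Hund"],
    # word alignment: English token index -> aligned German token indices
    "alignment": {0: [0], 1: [1], 2: [2], 3: [3], 4: [4]},
}
```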

  8. Motivation for bitext parsing • Many advances in syntactic parsing come from better modeling • But the overall bottleneck is the size of the treebank • Our research asks a different question: • Where can we (cheaply) obtain additional information, which helps to supplement the treebank? • A new information source for resolving ambiguity is a translation • The human translator understands the sentence and disambiguates for us! • Our research goal was to build large databases of improved parses to help establish preferences for difficult phenomena like PP-attachment

  9. Clause attachment ambiguity • Parse 1: high attachment (wrong) • Parse 2: low attachment (correct)

  10. Not ambiguous in German • Number agreement disambiguates • FRAU (woman) and HATTE (had) agree • Unambiguous low attachment

  11. Parse reranking of bitext • Goal: improve English parsing accuracy • Parse English sentence, obtain list of 100 best parse candidates • Parse German sentence, obtain single best parse • Determine the correspondence of German to English words using a word alignment • Calculate syntactic divergence of each English parse candidate and the projection of the German parse • Choose probable English parse candidate with low syntactic divergence
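
The reranking pipeline on this slide can be sketched roughly as follows. The helpers parse_nbest, parse_best, word_align, and divergence_score are hypothetical stand-ins for the baseline parsers, the word aligner, and the divergence model, and the single weight is a simplification of the feature combination described on the next slide; this is a sketch, not the actual Stuttgart implementation.

```python
# Minimal sketch of the bitext reranking loop (hypothetical helper functions).

def rerank(english_sentence, german_sentence,
           parse_nbest, parse_best, word_align, divergence_score,
           n=100, weight=1.0):
    """Pick the English parse candidate that balances the baseline parser
    probability against syntactic divergence from the projected German parse."""
    candidates = parse_nbest(english_sentence, n)   # list of (parse, log_prob)
    german_parse = parse_best(german_sentence)      # single best German parse
    alignment = word_align(english_sentence, german_sentence)

    def score(candidate):
        parse, log_prob = candidate
        # Lower divergence is better, so subtract it from the parser score.
        return log_prob - weight * divergence_score(parse, german_parse, alignment)

    return max(candidates, key=score)[0]
```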

  12. Measuring syntactic divergence • Define features to capture different (overlapping) aspects of syntactic divergence. Functions of: • Candidate English parse e • German parse g • Word alignment a • Combine in a log-linear model: P(e | g) = exp(∑m λm hm(g, e, a)) / ∑e′ exp(∑m λm hm(g, e′, a)) • Discriminatively train the λ parameters to maximize parsing accuracy on a training set (minimum error rate training)
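
A minimal sketch of that log-linear combination, assuming the feature functions hm(g, e, a) and weights λm are given and normalizing over the n-best list of candidates:

```python
import math

def loglinear_probabilities(candidates, german_parse, alignment, features, weights):
    """P(e | g) = exp(sum_m lambda_m * h_m(g, e, a)) / sum_e' exp(...),
    normalized over the n-best list of English parse candidates."""
    scores = [sum(lam * h(german_parse, e, alignment)
                  for lam, h in zip(weights, features))
              for e in candidates]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]
```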

  13. Rich bitext projection features • Defined 36 features by looking at common English parsing errors • No monolingual features, except baseline parser probability • General features • Is there a probable label correspondence between German and the hypothesized English parse? • How expected is the size of each constituent in the hypothesized English parse given the German parse? • Specific features • Are coordinations realized identically? • Is the NP structure the same? • Mix of probabilistic and heuristic features
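
As an example of what a bitext projection feature might look like, here is a rough sketch in the spirit of the "expected constituent size" feature above. The span representation, the alignment format, and the ratio itself are assumptions for illustration, not the actual feature definitions from the paper.

```python
def span_size_ratio(english_span, alignment, german_spans):
    """Rough proxy for 'how expected is the size of an English constituent
    given the German parse'.  english_span and the entries of german_spans
    are (start, end) token spans with end exclusive; alignment maps each
    English token index to a list of aligned German token indices."""
    aligned = sorted({g for i in range(*english_span) for g in alignment.get(i, [])})
    if not aligned:
        return 0.0
    # Smallest German constituent covering all aligned tokens.
    covering = [(s, e) for s, e in german_spans
                if s <= aligned[0] and e >= aligned[-1] + 1]
    if not covering:
        return 0.0
    german_len = min(e - s for s, e in covering)
    english_len = english_span[1] - english_span[0]
    return min(english_len, german_len) / max(english_len, german_len)
```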

  14. Training • Use BitPar syntactic forest parser • English BitPar trained on Penn Treebank • German BitPar trained on Tiger Treebank • Probabilistic feature functions built using large parallel text (Europarl) • Weights on feature functions (lambda vector) trained on portion of the Penn Treebank together with its translation into German • Minimum error rate training using F score

  15. Reranking English parses • Difficult task • German is difficult to parse • Our knowledge source, the German parser, is out-of-domain (poor performance) • Baseline English parser we are trying to improve is in-domain (good performance) • Test set has long sentences • Result: 0.70% F1 improvement on test data (stat. significant)

  16. New results • Reranking German parses • We needed German gold-standard parses (and English translations) • Sebastian Padó has made a small parallel treebank for Europarl available • No engineering on German yet • We are using the same syntactic divergence features that were designed to improve English parsing • There are German-specific ambiguities that could be modeled, such as subject-object ambiguity (e.g., Die Maus jagt die Katze, “the mouse chases the cat” or “the cat chases the mouse”) • But this is an easier task because the parser we are trying to improve is weaker (German is hard to parse, and Europarl is out of domain) • 2.3% F1 improvement currently; we think this can be further improved

  17. Summary: bitext parsing • I showed you an approach for bitext parsing • Reranking the parses of English to minimize syntactic divergence with an automatically generated German parse • I then showed our first results for reranking German parses using a single English parse • The approach we used for this kind of morphosyntactic correspondence is more general than just parse reranking • Machine translation involves morphosyntactic correspondence • And this is where we are interested in looking at Croatian

  18. Outline • The Institute for Natural Language Processing at the University of Stuttgart • Bitext parsing • Using morphosyntactic correspondence

  19. Morphosyntactic processing • I am co-PI of a new IfNLP project funded by the DFG (German Science Foundation) • Project: morphosyntactic modeling for statistical machine translation (SMT) • SMT research, up until recently, has been dominated by translation into English • English expresses a lot of information through word order, very little through inflection • Approaches to translating morphologically rich languages to English are preprocessing based

  20. Present: linguistic preprocessing • Linguistic preprocessing for SMT (stat. machine translation) • From: freer syntax, morphologically rich language • To: rigid syntax, morphologically poor language • Existing examples: German to English, Czech to English

  21. Present: linguistic preprocessing • How this works • Produce a morphosyntactic analysis of the German (or Czech) sentence • Reorder the words of the German/Czech sentence into English order • Reduce morphological inflection (for instance, remove case marking, remove all agreement on adjectives, etc.) • For Czech: insert pseudo-words (e.g., to indicate dropped pro-drop pronouns) • Use statistics on this “simplified” German or Czech to map directly to English using SMT
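
A schematic sketch of such a preprocessing pipeline; the three helper functions stand in for language-specific reordering rules, inflection reduction, and pseudo-word insertion, and are placeholders rather than the actual Stuttgart or Prague components.

```python
def preprocess_source(tokens, analyses,
                      reorder_rules, reduce_inflection, insert_pseudo_words):
    """Simplify a morphologically rich source sentence before SMT:
    1) reorder the tokens toward English word order,
    2) strip inflection that English does not express (e.g. case marking),
    3) insert pseudo-words (e.g. for dropped pronouns in Czech)."""
    reordered = reorder_rules(list(zip(tokens, analyses)))          # (token, analysis) pairs
    reduced = [(reduce_inflection(tok, ana), ana) for tok, ana in reordered]
    return [tok for tok, _ana in insert_pseudo_words(reduced)]
```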

  22. Present: linguistic preprocessing • How well does this work? • German to English SMT with linguistic preprocessing (Stuttgart system) • Results from the 2008 ACL workshop on machine translation (extensive human evaluation) • The only system limited to the organizers’ data that was competitive with: • The best of 5 rule-based MT systems • The Saarbrücken hybrid rule-based/SMT system • Google Translate, which does not use linguistic preprocessing but does use vastly more data

  23. Future: modeling • What about translating from English to German or to Slavic languages? • Problem: morphological generation is more difficult • It is easy to reduce multiple inflections to one (for instance, stemming) • Harder to learn to generate the right inflection

  24. Future: modeling • Current work on morphological generation • Work at Charles University in Prague on Czech • The tectogrammatical representation is not (yet) competitive with simple statistics (little explicit knowledge of morphology or syntax) • The best English to German SMT systems also use little or no morphological knowledge • And they are much worse than rule-based English to German systems • Challenge: using morphosyntactic knowledge with statistical approaches requires more than just linguistic preprocessing • It requires morphosyntactic modeling

  25. Morphosyntactic correspondence • In fact, all multilingual problems involve morphosyntactic correspondence: • If we have a source parse tree, and source text, and we would like a target text, this is machine translation • If we have a source parse tree, source text and target text, and we would like a target parse, this is bitext parsing • If we would like to know which word in the target text is a translation of a particular word in the source text and we use morphosyntactic analysis, this is syntactic word alignment • The same thinking can be used for cross-lingual information retrieval • Very relevant when one of the languages is morphologically rich

  26. Conclusion • I introduced the IfNLP Stuttgart • I presented a new approach to improving parsing using morphosyntactic correspondence: bitext parsing • I discussed the general challenge of using morphosyntactic correspondence, focusing on statistical machine translation • The biggest challenge is translating into freer word order, morphologically rich languages (e.g., German and particularly the Slavic languages) • We are interested in the challenge of building systems to translate into Croatian • To do this, we need partners who are working on Croatian analysis! • We also ask that you think about multilingual applications when producing Croatian NLP resources • The type of approach I showed for bitext parsing is useful for other multilingual applications

  27. Thank you!

  29. Statistical Approach • Using statistical models • Create many alternatives, called hypotheses • Give a score to each hypothesis • Find the hypothesis with the best score through search • Disadvantages • Difficulties handling structurally rich models (math and computation) • Need data to train the model parameters • Difficult to understand the decision process made by the system • Advantages • Avoid hard decisions • Speed can be traded off against quality, not all-or-nothing • Works better in the presence of unexpected input • Learns automatically as more data becomes available (Modified from Vogel)
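
The "create hypotheses, score, search" pattern can be written down generically; this sketch uses exhaustive search over a hypothesis list, whereas real systems prune the space, e.g. with beam search.

```python
def best_hypothesis(input_item, generate, score):
    """Generic statistical decoding pattern: enumerate alternative hypotheses,
    score each one, and return the highest-scoring hypothesis."""
    return max(generate(input_item), key=score)
```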

  30. Morphosyntactic knowledge • We use morphological analyzers and treebanks, combined in parsing models learned from the treebanks • English models have little morphological analysis (suffix analysis to determine POS for unknown words) • The German syntactic parser BitPar (Schmid) uses SMOR (Stuttgart Morphological Analyzer) • Given an inflected form, SMOR returns possible fine-grained POS tags • E.g., for nouns/adjectives: POS, case, gender, number, definiteness • BitPar puts the possible analyses in the chart and disambiguates • Slavic languages require even more morphological knowledge than German
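
To illustrate the kind of fine-grained analyses a morphological analyzer such as SMOR returns, here is a toy sketch; the entries are hand-written for illustration and do not reflect SMOR's actual output format or interface.

```python
# Toy stand-in for a morphological analyzer lookup; not actual SMOR output.
ANALYSES = {
    "Frau": [
        {"pos": "NN", "case": "Nom", "gender": "Fem", "number": "Sg"},
        {"pos": "NN", "case": "Dat", "gender": "Fem", "number": "Sg"},
        {"pos": "NN", "case": "Akk", "gender": "Fem", "number": "Sg"},
    ],
}

def candidate_analyses(form):
    """Return all fine-grained analyses for an inflected form; a chart parser
    such as BitPar can place each analysis in its chart and let the grammar
    disambiguate."""
    return ANALYSES.get(form, [{"pos": "UNKNOWN"}])
```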

  31. Transferring syntactic knowledge • Need knowledge source! • English syntactic parser • About 90% bracketing accuracy • Mapping • Requires bitext • Work discussed here uses German/English Europarl (European Parliament Proceedings) • Resource for Croatian: Acquis Communautaire • Automatically generated word alignment

  32. Additional details in the paper • Formalization of bitext parsing as a parse reranking task • Definitions of bitext feature functions • Analysis of feature functions through feature selection • Comparison of MERT (minimum error rate training) with SVM-Rank
