ECMT-Oriented Generation Treebank


Presentation Transcript


  1. ECMT-Oriented Generation Treebank. Beijing Language and Culture University, Oct. 2006

  2. Rationalism and Empiricism: MT Approaches • Rationalism: rules • Empiricism: statistics • Each approach has advantages and shortcomings

  3. Rationalism • Advantages • Human knowledge applied • Shortcomings • The knowledge acquisition bottleneck • The difficulty of controlling rules

  4. Empiricism • Advantages • Robust • Trainable and grounded in mathematical theory • Shortcomings • Data sparseness • Cannot apply human knowledge

  5. How to combine their virtues? • We can acquire as much translation knowledge as possible at every level of language, such as the word and the phrase. • We can use statistics and avoid the painful problem of controlling linguistic rules.

  6. Our exploratory research • A treebank with 30,000 sentences • The structure of the source language • Human-labeled translations of words and phrases • Translation patterns between source and target language • The structure of source-language sentences is based on the Penn Treebank, with some changes: • combination • reduction

  7. Our exploratory research • Every word in the source-language sentences has been labeled with its part of speech and Chinese translation. • Phrase translation: • composing mode of the target language • collocation • Sentence translation: root

  8. Example: The judge declined to discuss his salary in detail.

  9. Tagged English sentences • Extracted from the WSJ, the NIST corpus, and dictionary examples. • Each sentence has an original Chinese translation. • The average sentence length is 13 words.

  10. Structure of source language sentence • The tag set is mainly based on the Penn Treebank tag set. • The POS tag set is the same as the one used by the Penn Treebank. • The syntactic tag set we adopt differs slightly: some Penn Treebank tags, such as SINV and SBAR, have been discarded. • Syntactic trees are not as deep as those in the Penn Treebank, because some components of the syntactic tree, such as NP and VP, have been compressed.

  11. Example

  12. Word and Collocation • Every word in the source-language sentences has been labeled with its POS and Chinese translation, with some exceptions such as prepositions and personal names. • Collocation is very important to syntactic analysis and language generation.
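
The word-level labels described above could be held in a small record like the one below. This is only a sketch: the class name `Token`, its fields, the helper `untranslated`, and the sample Chinese glosses are assumptions for illustration, not the treebank's actual storage format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One source-language word with its labels (hypothetical layout)."""
    word: str
    pos: str                            # Penn Treebank POS tag
    translation: Optional[str] = None   # None for exceptions such as prepositions


# Sample glosses are illustrative assumptions, not taken from the treebank.
tokens = [
    Token("judge", "NN", "法官"),
    Token("salary", "NN", "薪水"),
    Token("in", "IN"),                  # preposition: left without a translation
]


def untranslated(items: List[Token]) -> List[str]:
    """Return the words that carry no word-level translation."""
    return [t.word for t in items if t.translation is None]
```

Keeping the translation optional makes the stated exceptions explicit instead of encoding them as empty strings.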

  13. Two kinds of collocations • Adjoining and non-adjoining • The first kind mainly refers to compound nouns such as "human resource", which can't be translated into Chinese word by word. • The second kind consists of collocations whose parts can be separated by inserted words.
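
The difference between the two kinds of collocation can be illustrated with a small matcher. This is a minimal sketch, not the method used to build the treebank: the function name `find_collocation`, the `max_gap` parameter, and the "take ... into account" sentence are assumptions chosen for illustration.

```python
def find_collocation(words, parts, max_gap=0):
    """Return the word indices of the first match of `parts`, or None.

    max_gap=0 models an adjoining collocation (parts must be consecutive).
    max_gap>0 allows up to `max_gap` inserted words between consecutive parts.
    `parts` is assumed to be non-empty.
    """
    def match_from(start, part_idx):
        if part_idx == len(parts):
            return ()
        # Try each allowed position for the next part.
        for pos in range(start, min(start + max_gap + 1, len(words))):
            if words[pos] == parts[part_idx]:
                rest = match_from(pos + 1, part_idx + 1)
                if rest is not None:
                    return (pos,) + rest
        return None

    for i, word in enumerate(words):
        if word == parts[0]:
            rest = match_from(i + 1, 1)
            if rest is not None:
                return (i,) + rest
    return None


adjoining = find_collocation(
    "the company values human resource planning".split(),
    ["human", "resource"],
)
separated = find_collocation(
    "we take the cost into account".split(),
    ["take", "into", "account"],
    max_gap=2,
)
```

Here `adjoining` finds "human resource" as two consecutive words, while `separated` only matches because two inserted words are tolerated between "take" and "into".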

  14. Example

  15. Example

  16. Chinese Translation • in+detail -> 详细地: this is a collocation. • NP<1+2> ->: its translation is composed of the translations of 1 and 2. • decline to VP ->: this is a VP collocation. • S<1+2> ->: we can get its translation recursively.
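
The recursive reading of patterns such as NP<1+2> can be sketched as follows. The dictionary layout, the key names (`order`, `children`, `translation`), the labels on the sample nodes, and the glosses for "his" and "salary" are assumptions for illustration; only the in+detail -> 详细地 pair comes from the slide.

```python
def translate(node):
    """Recursively build the target-language string for a tree node."""
    # A word or a collocation carries a fixed translation of its own.
    if "translation" in node:
        return node["translation"]
    # A composition pattern such as <1+2> joins the children's
    # translations in the listed (1-based) order.
    return "".join(translate(node["children"][i - 1]) for i in node["order"])


# NP<1+2>: "his salary" (glosses are illustrative assumptions)
his_salary = {
    "label": "NP",
    "order": [1, 2],
    "children": [
        {"label": "PRP$", "word": "his", "translation": "他的"},
        {"label": "NN", "word": "salary", "translation": "薪水"},
    ],
}

# Collocation with a fixed translation, as on the slide.
in_detail = {"label": "PP", "words": ["in", "detail"], "translation": "详细地"}

# A placeholder parent node, used only to show the recursion through two levels.
demo = {"label": "X", "order": [1, 2], "children": [in_detail, his_salary]}
```

Calling `translate(demo)` descends into both children and concatenates their results, which is the sense in which a sentence-level pattern like S<1+2> can be resolved recursively.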

  17. Future work • Enlarge the scale of the Treebank • Research based on the Treebank • Research on MT integrating parsing and generation • Parsing and translation knowledge acquisition • How to apply the Treebank to SMT work

  18. Browsing the Treebank
