ECMT-Oriented Generation Treebank
Beijing Language and Culture University
Oct. 2006
Rationalism and Empiricism: MT Approaches
[Diagram labels: MT; Rationalism (rules); Empiricism (statistics); Advantage; Shortcoming]
Rationalism
• Advantages
  • Human knowledge is applied
• Shortcomings
  • The bottleneck of knowledge acquisition
  • The difficulty of controlling rules
Empiricism
• Advantages
  • Robust
  • Trainable and based on mathematical theory
• Shortcomings
  • Data sparseness
  • Cannot apply human knowledge
How to combine their virtues?
• We can capture as much translation knowledge as possible at every level of language, such as the word and the phrase.
• We can use statistics and avoid the painful problem of controlling linguistic rules.
Our exploratory research
• A treebank with 30,000 sentences
  • The structure of the source language
  • Human-labeled translations of words and phrases
  • Translation patterns between source and target language
• The structure of source language sentences is based on the Penn Treebank, with some changes:
  • Combination
  • Reduction
Our exploratory research
• Every word in the source language sentences has been labeled with its part of speech and Chinese translation.
• Phrase translation:
  • Composing mode of the target language
  • Collocation
• Sentence translation: root
Example: the judge declined to discuss his salary in detail.
Tagged English sentences
• Extracted from the WSJ, the NIST corpus, and dictionary examples.
• Each sentence has its original Chinese translation.
• The average sentence length is 13 words.
Structure of source language sentences
• The tag set is mainly based on the Penn Treebank tag set.
• The POS tag set is the same as the one used by the Penn Treebank.
• The syntactic tag set we adopt has changed a little. Some tags that the Penn Treebank uses, such as SINV and SBAR, have been discarded.
• Syntactic trees are not as deep as those in the Penn Treebank, because some components of the syntactic tree, such as NP and VP, have been compressed.
Word and Collocation
• Every word in the source language sentences has been labeled with its POS and its Chinese translation, with some exceptions such as prepositions and personal names.
• Collocation is very important to syntactic analysis and language generation.
Two kinds of collocations
• Adjoining and non-adjoining
• The first kind mainly refers to compound nouns such as "human resource", which cannot be translated into Chinese word by word.
• The second kind consists of collocations whose parts can be separated by intervening words.
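The distinction between the two kinds of collocation can be illustrated with a small matcher. This is only a sketch under stated assumptions: the function name `find_collocation`, the `max_gap` parameter, and the sample sentences are hypothetical and are not taken from the Treebank or its tools. A gap of 0 corresponds to an adjoining collocation; a positive gap lets other words intervene between the parts.

```python
from typing import List, Optional, Tuple


def find_collocation(
    tokens: List[str], parts: Tuple[str, ...], max_gap: int = 0
) -> Optional[List[int]]:
    """Return the token indices that realize `parts` in order, or None.

    max_gap=0 requires the parts to be adjacent (adjoining collocation).
    max_gap=k allows up to k other words between consecutive parts.
    The search is greedy: it takes the first match inside each window
    and does not backtrack.
    """
    for start, tok in enumerate(tokens):
        if tok != parts[0]:
            continue
        idxs = [start]
        for part in parts[1:]:
            lo = idxs[-1] + 1
            window = tokens[lo : lo + max_gap + 1]
            if part in window:
                idxs.append(lo + window.index(part))
            else:
                break  # this start position fails; try the next one
        else:
            return idxs  # every part was found
    return None


# Illustrative sentences (not from the Treebank).
adjoining = "the company manages human resource well".split()
separated = "we take his opinion into account".split()

print(find_collocation(adjoining, ("human", "resource")))
print(find_collocation(separated, ("take", "into", "account"), max_gap=2))
print(find_collocation(separated, ("take", "into", "account"), max_gap=0))
```

With `max_gap=0` the second sentence does not match, because "his opinion" sits between "take" and "into"; allowing a gap of 2 recovers the collocation.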
Chinese Translation
• in+detail -> 详细地: this is a collocation.
• NP<1+2>: the translation of the NP equals the translations of components 1 and 2, in that order.
• decline to VP: this is a VP collocation.
• S<1+2>: its translation can be obtained recursively.
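The recursive reading of patterns such as NP<1+2> and S<1+2> can be sketched in a few lines. This is a minimal illustration, not the Treebank's actual format or tooling: the `Node` class, its field names, the tree shape, the <3+1+2> pattern on the inner VP, and every Chinese gloss except 详细地 (which appears in the slide) are assumptions made for the example. A translation stored on a node (a labeled word or a collocation) is used directly; otherwise the node's translation is composed from its children in the order given by the pattern.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                            # POS tag or syntactic tag
    word: Optional[str] = None            # surface word, leaves only
    zh: Optional[str] = None              # labeled translation, if any
    pattern: Optional[List[int]] = None   # 1-based child order, e.g. [1, 2] for <1+2>
    children: List["Node"] = field(default_factory=list)


def translate(node: Node) -> str:
    """Compose a translation bottom-up from word, collocation, and pattern labels."""
    if node.zh is not None:
        # A labeled word or a collocation: use its translation as a unit.
        return node.zh
    if not node.children:
        # An unlabeled leaf (assumed here for "the" and "to") adds nothing.
        return ""
    order = node.pattern or list(range(1, len(node.children) + 1))
    return "".join(translate(node.children[i - 1]) for i in order)


# "the judge declined to discuss his salary in detail" (hypothetical annotation)
tree = Node("S", pattern=[1, 2], children=[
    Node("NP", pattern=[1, 2], children=[
        Node("DT", word="the"),
        Node("NN", word="judge", zh="法官"),
    ]),
    Node("VP", pattern=[1, 2, 3], children=[
        Node("VBD", word="declined", zh="拒绝"),
        Node("TO", word="to"),
        Node("VP", pattern=[3, 1, 2], children=[   # adverbial moves before the verb
            Node("VB", word="discuss", zh="讨论"),
            Node("NP", pattern=[1, 2], children=[
                Node("PRP$", word="his", zh="他的"),
                Node("NN", word="salary", zh="薪水"),
            ]),
            Node("PP", zh="详细地", children=[     # collocation: in+detail
                Node("IN", word="in"),
                Node("NN", word="detail"),
            ]),
        ]),
    ]),
])

print(translate(tree))
```

Storing the reordering on each node keeps the translation knowledge local to the phrase, so the sentence-level result falls out of one recursive pass over the tree.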
Future work
• Enlarge the scale of the Treebank
• Research based on the Treebank
  • Research on MT that integrates parsing and generation
  • Parsing and translation knowledge acquisition
  • How to apply the Treebank to SMT work