Vietnamese-English Cross Language Search Information Retrieval (CLIR) - Discovering Noun Phrases for Translation CSC 177 Presentation Nguyen Doan H, Ph.D
Outline • Motivations • Crosslingual Query • Noun phrase translation extraction • Experiments and results • Conclusion and next steps
Motivations – Unknown Translations • Words that fall outside the scope of a bilingual dictionary • Brand names, place names, personal names • Titles (music, book, video) • Terminology (science, computer, medical, space, farming, etc.) • Compound nouns • Meaning might not be inferable from individual components • Might require expert knowledge for translation • Might have multiple correct translations • Applicability • Cross-Language Information Retrieval (CLIR) • Machine Translation (MT) • Machine-Readable Dictionary (MRD) • Most of these words are Out-Of-Vocabulary (OOV)
Examples Example 1: Computer Terminology (phần mềm -> software)
Examples Example 2: Personal Name (ca sĩ Quang Dũng -> Singer Quang Dung)
Searching the web for translation? • Parallel Data on the Web: Vietnamese to English Translation
Searching the web for translation? • Comparable corpus on the web:
Searching the web for translation? • Mixed language web pages: English Translation
Our Approach • Extensions to the 2005 paper by CMU’s Ying Zhang (credit) • Addresses issues specific to Vietnamese-English OOV translation • Proper name translation uses pattern recognition rather than phonetic similarity and string alignment • Detection of borrowed English words • Improves translation suggestions by utilizing contextual information
Crosslingual Query to Obtain Mixed-Language Web Pages • Extend the source query, VS, with extension words/phrases, VEX, that tend to co-occur with it frequently: • VS : phần mềm → ? • VSVEX : phần mềm miễn phí • Translate the extension words/phrases, VEX, to English, EEX: • VEX : miễn phí → EEX : free • Submit both the source query and the translated words/phrases to a search engine: • VSEEX : phần mềm free
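The query construction on this slide can be sketched in a few lines. This is only an illustration: the `EXTENSION_DICTIONARY` table and the function name are hypothetical stand-ins for whatever bilingual dictionary and code the original system used.

```python
from typing import Optional

# Hypothetical one-entry dictionary; a real system would consult a
# bilingual Vietnamese-English dictionary for the extension phrase.
EXTENSION_DICTIONARY = {"miễn phí": "free"}


def crosslingual_query(source: str, extension: str) -> Optional[str]:
    """Build the mixed-language query VS + EEX.

    source    -- Vietnamese phrase whose translation is unknown (VS)
    extension -- Vietnamese phrase that co-occurs with it (VEX)
    Returns None when the extension itself cannot be translated.
    """
    translated = EXTENSION_DICTIONARY.get(extension)  # VEX -> EEX
    if translated is None:
        return None
    return f"{source} {translated}"


print(crosslingual_query("phần mềm", "miễn phí"))
```

The resulting string is what would be submitted to the search engine in order to pull back pages that mix Vietnamese and English.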
How to Find This VEX? Overture Search Log • Find co-occurring terms in the web log • Use co-occurring terms in the search query (in CLIR) • Search Google with VS and select Vietnamese words, VEX, with high frequency
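One simple way to pick high-frequency extension candidates is to count the words that directly follow VS in the returned snippets. The sketch below assumes whitespace tokenization and a fixed two-word window, neither of which is stated on the slide, and it does not filter out non-Vietnamese words.

```python
from collections import Counter


def candidate_extensions(source, snippets, width=2, top_n=3):
    """Return the most frequent `width`-word phrases that follow `source`.

    `width` and `top_n` are illustrative parameters, not values from the slides.
    """
    counts = Counter()
    src = source.lower().split()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for i in range(len(tokens) - len(src) + 1):
            if tokens[i:i + len(src)] == src:
                following = tokens[i + len(src):i + len(src) + width]
                if len(following) == width:
                    counts[" ".join(following)] += 1
    return [phrase for phrase, _ in counts.most_common(top_n)]


# Made-up snippets, for illustration only.
snippets = [
    "tải phần mềm miễn phí tại đây",
    "phần mềm miễn phí cho máy tính",
    "phần mềm diệt virus tốt",
]
print(candidate_extensions("phần mềm", snippets))
```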
Our Approach: Noun Phrase Translation Extraction • Proper noun recognition & Transliteration • Preprocessing • Frequency-Distance Model • Contextual Ordering Model & Result Ranking
Proper name recognition & Transliteration • Extract and concatenate Title, Summary, and URL • Recognize the proper-name text pattern: each word is likely to appear with its first letter capitalized • Compute the likelihood that the query text is a proper name • Once recognized, map accented Vietnamese vowels to plain English vowels, e.g. á → a, à → a, … , ũ → u, … • Suggest a translation candidate, VN: Quang Dũng → Eng: Quang Dung • Compute and assign a weight to the translation candidate
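A minimal sketch of these two steps follows. It assumes that the likelihood is simply the fraction of occurrences in which every word of the phrase is capitalized; the slides do not give the actual formula. The handling of đ/Đ is an addition beyond the vowel mapping listed on the slide.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Map accented Vietnamese letters to their unaccented base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # đ/Đ have no Unicode decomposition, so they are replaced explicitly
    # (an assumption; the slide only mentions vowels).
    return stripped.replace("đ", "d").replace("Đ", "D")


def proper_name_likelihood(phrase, text):
    """Fraction of occurrences of `phrase` in which every word is capitalized."""
    phrase = unicodedata.normalize("NFC", phrase)
    text = unicodedata.normalize("NFC", text)
    matches = re.findall(re.escape(phrase), text, flags=re.IGNORECASE)
    if not matches:
        return 0.0
    capitalized = sum(all(w[0].isupper() for w in m.split()) for m in matches)
    return capitalized / len(matches)


# Made-up concatenated title/summary/URL text, for illustration only.
text = "Ca sĩ Quang Dũng ra mắt album. Nghe nhạc Quang Dũng. quang dũng mp3"
if proper_name_likelihood("Quang Dũng", text) > 0.5:  # 0.5 is a hypothetical threshold
    print(strip_diacritics("Quang Dũng"))
```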
Preprocessing (Query: Thuật toán genetic) • Extract and concatenate Title, Summary, and URL: Thuật toán-Cấu trúc dữ liệu ... (Reserve Polish Notation – RPN), một thuật toán "kinh điển" trong lĩnh vực trình biên dịch. ... THUẬT GIẢI DI TRUYỀN – GENETIC ALGORITHM - Kỳ 2 ... ity.vnuit.edu.vn/thuattoan/index.htm • Mark the query, normalize the text, and remove noise text: ~123456789 cấu trúc dữ liệu reserve polish notation – rpn một ~123456789 kinh điển trong lĩnh vực trình biên dịch thuẬt giẢi di truyỀn – ~987654321 algorithm kỳ 2 ity vnuit edu vn thuattoan index htm • Tag recognized Vietnamese words with VNW: ~123456789 VNW VNW VNW VNW reserve polish notation VNW rpn ~123456789 VNW VNW trong VNW VNW VNW VNW VNW VNW di VNW VNW ~987654321 algorithm VNW ity vnuit edu vn thuattoan index htm • Group contiguous English words and build the word list: ['~123456789', 'VNW', 'VNW', 'VNW', 'VNW', '', '', 'reserve_polish_notation', 'VNW', 'rpn', '~123456789', 'VNW', 'VNW', 'trong', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'di', 'VNW', 'VNW', '~987654321', 'algorithm', 'VNW', 'ity', 'vnuit', 'edu', 'vn', 'thuattoan', 'index', 'htm']
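The pipeline above can be approximated as follows. Two assumptions are made here: a token counts as Vietnamese if it contains any non-ASCII character, and every run of consecutive non-Vietnamese tokens is joined with underscores. The original grouping rule is evidently more selective than that, so the output will not match the slide's word list exactly.

```python
import re
import unicodedata

VN_MARK = "~123456789"  # placeholder for the Vietnamese part of the query
EN_MARK = "~987654321"  # placeholder for the English part of the query


def preprocess(text, vn_query, en_query):
    """Mark the query, normalize, tag Vietnamese words, and group English runs."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.replace(vn_query.lower(), f" {VN_MARK} ")
    text = text.replace(en_query.lower(), f" {EN_MARK} ")
    # Noise removal: keep only word characters and the marker prefix "~".
    text = re.sub(r"[^\w~]+", " ", text)
    # Assumption: any token with a non-ASCII character is Vietnamese.
    tagged = ["VNW" if not t.isascii() else t for t in text.split()]

    grouped, run = [], []
    for token in tagged:
        if token == "VNW" or token.startswith("~"):
            if run:  # close the current run of English-looking tokens
                grouped.append("_".join(run))
                run = []
            grouped.append(token)
        else:
            run.append(token)
    if run:
        grouped.append("_".join(run))
    return grouped


print(preprocess("Thuật toán di truyền – Genetic Algorithm (GA)", "Thuật toán", "genetic"))
```

As in the slide's output, unaccented Vietnamese words such as "di" slip through untagged under this heuristic.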
Frequency-Distance Model • Frequency-Distance model: • Frequency of co-occurrence • Distance to either VS or EEX within a snippet's text • Computed over all returned document summaries • Example: Thuật toán genetic
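The slide's formula did not survive, so the following is only one plausible way to combine frequency and distance: each occurrence of a candidate contributes the inverse of its token distance to the nearest query marker, summed over all snippets. The scoring function is an assumption, not the authors' model.

```python
from collections import defaultdict

MARKERS = {"~123456789", "~987654321"}  # VS and EEX placeholders from preprocessing


def frequency_distance_scores(token_lists):
    """Score each English candidate by summed inverse distance to the nearest marker."""
    scores = defaultdict(float)
    for tokens in token_lists:
        marker_positions = [i for i, t in enumerate(tokens) if t in MARKERS]
        if not marker_positions:
            continue  # snippet does not mention the query at all
        for i, token in enumerate(tokens):
            if token in MARKERS or token == "VNW" or not token:
                continue
            distance = min(abs(i - p) for p in marker_positions)
            scores[token] += 1.0 / distance  # closer and more frequent scores higher
    return dict(scores)


# Made-up token lists, for illustration only.
example = [
    ["~123456789", "VNW", "~987654321", "algorithm", "VNW", "index"],
    ["VNW", "~123456789", "algorithm"],
]
print(frequency_distance_scores(example))
```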
Contextual Ordering Model & Result Ranking • Estimate closeness probability • Compute an overall score for each candidate • Sort by score and present the top 5 suggestions
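The final ranking step might look like the sketch below. The slides do not say how the closeness probability and the frequency-distance score are combined, so the weighted sum and the `alpha` parameter are purely hypothetical.

```python
def rank_candidates(fd_scores, closeness, alpha=0.5, top_k=5):
    """Combine two per-candidate scores and return the top_k (candidate, score) pairs.

    fd_scores -- frequency-distance score per candidate
    closeness -- estimated closeness probability per candidate (0.0 if missing)
    alpha     -- hypothetical mixing weight, not taken from the slides
    """
    overall = {
        cand: alpha * fd + (1 - alpha) * closeness.get(cand, 0.0)
        for cand, fd in fd_scores.items()
    }
    return sorted(overall.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up scores, for illustration only.
fd = {"software": 0.9, "free": 0.4, "download": 0.2, "a": 0.1, "b": 0.05, "c": 0.01}
cl = {"software": 0.8, "free": 0.1}
for candidate, score in rank_candidates(fd, cl):
    print(candidate, round(score, 3))
```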
Sample Program Output #1 (dân ca -> folk or traditional music)
Conclusion and Next Steps • Contributions • Recognize and translate important phrases • Translate persons, locations, and concepts • Low implementation cost with reasonable performance • Future work • Experiment with a larger set of test data • Integrate with Vietnamese-English CLIR work • Automate the generation of extension words/phrases and the derivation of their English equivalents • Experiment with the “Refine Result” concept for search engines