Multilingual Synchronization focusing on Wikipedia 2011-03-17
Goal of M-Sync
• Multilingual Synchronization
  • Synchronizing the contents of Wikipedia across multiple languages
  • Linking content across languages
  • Combining the linked content into a synthesis
• The Wikipedia editions in different languages
  • can offer more precise and detailed information, since they reflect different intentions, backgrounds, and cultures
  • can fill the gaps between languages and yield integrated knowledge
Multilingual Resource Synthesis
• Constructing an association network from the hyperlink structures of multiple languages
  • based on occurrence data
  • A Web page makes a direct reference to another Web page via a hyperlink
• Definitions
  • Association
    • A relation between any two words or concepts, with a strength
  • Association Network
    • Nodes and edges
    • Node: an entity (a concept or a named entity)
    • Edge: a link between entities, with a weight (the strength of association)
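The definitions above can be sketched as a small data structure. This is a minimal illustration, not the system's actual implementation: the function name `build_association_network` is hypothetical, edges are treated as directed (following the direction of the hyperlink), and the raw link count is an assumed stand-in for the strength of association.

```python
from collections import defaultdict

def build_association_network(links):
    """Build a weighted, directed network from (source, target) hyperlink pairs.

    Assumption: the edge weight is the raw number of times `source`
    links to `target`; the slides do not specify the weighting here.
    """
    network = defaultdict(lambda: defaultdict(int))
    for source, target in links:
        if source == target:
            continue  # ignore self-links
        network[source][target] += 1
    return network

# Illustrative hyperlinks, not real extraction output.
links = [
    ("Baldness", "Hair"),
    ("Baldness", "Hair"),
    ("Baldness", "Hamilton-Norwood_scale"),
]
network = build_association_network(links)
print(dict(network["Baldness"]))
```

Nodes are the page titles and each inner dictionary entry is one weighted edge, so the structure maps directly onto the node and edge definitions.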
Motivating Example
[Figure, built up over several slides: the English pages Baldness and Hair shown alongside their Korean counterparts 탈모증 and 털, with the annotation "Increase Strength"; a further English page, Hamilton-Norwood_scale, is then added next to Baldness with the annotation "Link Prediction or Keyword Recommendation".]
Multilingual Resource Synthesis: Construction of Association Network
• Hypothesis
  • If X is associated with Y in L1, then X’ should be associated with Y’ in L2
  • where X’ is the term corresponding to X, and Y’ the term corresponding to Y, in the other language
  • The strength of the association between X and Y is synthesized from the respective per-language occ-scores
• Assumption
  • Inter-language links are accurate links that connect two pages about the same entity or concept in different languages
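As a rough sketch of the hypothesis, each association (X, Y) in L1 can be projected onto (X’, Y’) in L2 by looking up both terms in the inter-language links. The function name `project_associations` is hypothetical, and the strength values are invented for illustration only.

```python
def project_associations(assoc_l1, interlanguage):
    """Map each (X, Y) association in L1 to (X', Y') in L2.

    `interlanguage` maps an L1 page title to its L2 counterpart.
    Pairs are skipped when either side has no counterpart.
    """
    projected = {}
    for (x, y), strength in assoc_l1.items():
        x2 = interlanguage.get(x)
        y2 = interlanguage.get(y)
        if x2 is not None and y2 is not None:
            projected[(x2, y2)] = strength
    return projected

# Illustrative strengths, not measured values.
en_assoc = {
    ("Baldness", "Hair"): 0.8,
    ("Baldness", "Hamilton-Norwood_scale"): 0.5,
}
en_to_ko = {"Baldness": "탈모증", "Hair": "털"}

predicted = project_associations(en_assoc, en_to_ko)
print(predicted)
```

In this toy data Hamilton-Norwood_scale has no Korean counterpart, so only the Baldness and Hair pair is projected. Pairs that cannot be projected are the kind of gap the link prediction or keyword recommendation step is meant to address.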
Proposed System Workflow
• Input: English documents, Korean documents, Chinese documents
• For each language:
  • Link Extraction
  • Association set for each word (English, Korean, Chinese)
  • Calculation of Correlations
• Across languages, using a Bilingual Dictionary for each language pair:
  • Association for each pair of English and Korean words
  • Association for each pair of Korean and Chinese words
  • Calculation of Correlations
• Output: Selection of highly associated pairs of words
Example of Association Set
Association set of “Baldness”
• androgenic alopecia
• 머리카락 (hair)
• 荷爾蒙 (hormone)
• 남성형 탈모증 (male-pattern baldness)
• 頭髮 (hair)
• 心理 (psychology)
• adult
• alopecia areata
• 원형 탈모증 (alopecia areata)
• 營養 (nutrition)
Preprocessing: Selecting Target
• Source languages (5)
  • English, Spanish, French, Chinese, Korean
  • Slide note: a subset of UN official languages
• Extracting target pages that form a 5-clique via inter-language links
  • Example: en:Badminton, fr:Badminton, es:Bádminton, ko:배드민턴, zh:羽毛球
• Assumption:
  • Pages found in all 5 languages are key pages and the targets to sync
• Enforcing consistency of a link path
  • If a path from X (L1) to X’ (L2) is found once, its inverse path (X’, X) is automatically added to the output
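The two preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions: page identifiers use a hypothetical `lang:Title` form, the helper names are invented, and a page is assumed to have at most one counterpart per language.

```python
from itertools import combinations

LANGS = ("en", "es", "fr", "zh", "ko")

def symmetric_closure(links):
    """Add the inverse (X', X) of every inter-language link (X, X')."""
    closed = set(links)
    closed.update((b, a) for a, b in links)
    return closed

def is_clique(pages, links):
    """True if every pair of pages is connected in the closed link set."""
    return all((a, b) in links for a, b in combinations(pages, 2))

def find_target_cliques(en_pages, raw_links):
    """Return the 5-cliques reachable from the given English pages."""
    links = symmetric_closure(raw_links)
    # Index: page -> {language code: linked page in that language}
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, {})[b.split(":", 1)[0]] = b
    cliques = []
    for page in en_pages:
        linked = neighbors.get(page, {})
        candidate = [page] + [linked[l] for l in LANGS[1:] if l in linked]
        if len(candidate) == len(LANGS) and is_clique(candidate, links):
            cliques.append(tuple(candidate))
    return cliques

pages = ["en:Badminton", "es:Bádminton", "fr:Badminton", "zh:羽毛球", "ko:배드민턴"]
raw_links = list(combinations(pages, 2))  # each link recorded in one direction only
cliques = find_target_cliques(["en:Badminton"], raw_links)
print(cliques)
```

Because the closure adds every inverse path first, a link recorded in only one direction still counts toward the clique, which is the consistency rule stated on the slide.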
Link types of Wikipedia
• Internal links to other pages in the wiki
  • Syntax usage: [[Main Page]]
• External links to other websites
• Interwiki links to other websites registered to the wiki in advance
  • Unlike internal links, interwiki links do not use page existence detection
  • Syntax usage: [[wikipedia:Sunflower]]
• Interlanguage links to other websites registered as other language versions of the wiki
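A simplified sketch of how these four link types might be told apart in raw wikitext. The regular expressions and prefix sets are assumptions made for illustration: a real wiki keeps its interwiki prefixes in a registered table, and this sketch treats namespace links such as Category: as internal links.

```python
import re

# [[Target]], [[Target|label]], [[Target#section]]
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# [http://example.org optional label]
EXTERNAL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)[^\]]*\]")

LANGUAGE_PREFIXES = {"en", "es", "fr", "zh", "ko"}  # the five source languages
INTERWIKI_PREFIXES = {"wikipedia"}                  # assumed prefix set

def classify_links(wikitext):
    """Group the links found in a wikitext string by link type."""
    found = {"internal": [], "interwiki": [], "interlanguage": [], "external": []}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix, sep, _ = target.partition(":")
        prefix = prefix.strip().lower()
        if sep and prefix in LANGUAGE_PREFIXES:
            found["interlanguage"].append(target)
        elif sep and prefix in INTERWIKI_PREFIXES:
            found["interwiki"].append(target)
        else:
            found["internal"].append(target)
    found["external"] = EXTERNAL.findall(wikitext)
    return found

text = ("See [[Main Page]] and [[Hair|hair]]. [[wikipedia:Sunflower]] "
        "[http://example.org site] [[ko:배드민턴]]")
result = classify_links(text)
print(result)
```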
Experiment
• Link Extraction
  • Input:
    • 40,000 pages in each language
  • Output:
    • en: 5,453,959 links
    • fr: 3,965,279 links
    • es: 3,368,949 links
    • zh: 2,397,959 links
    • ko: 1,347,811 links
[Figure label: Association Pairs]
Experiment: respective mode
• Compute the strength of association: LF-IDF
  • motivated by the TF-IDF heuristic
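The slides name LF-IDF but do not give its formula, so the following is only a guess at one plausible form, built by direct analogy with TF-IDF. Assumed here: LF is the share of a page's outgoing links that point to a given target, and IDF is the log of the number of pages divided by the number of pages linking to that target. The function name `lf_idf` and the sample data are hypothetical.

```python
import math
from collections import Counter

def lf_idf(page_links):
    """Score (page, target) pairs with an assumed TF-IDF-style weighting.

    page_links: dict mapping a page to its list of link targets (repeats allowed).
    """
    n_pages = len(page_links)
    # Number of distinct pages that link to each target.
    page_freq = Counter()
    for targets in page_links.values():
        page_freq.update(set(targets))
    scores = {}
    for page, targets in page_links.items():
        counts = Counter(targets)
        total = sum(counts.values())
        for target, count in counts.items():
            lf = count / total                             # link frequency
            idf = math.log(n_pages / page_freq[target])    # inverse document frequency
            scores[(page, target)] = lf * idf
    return scores

# Illustrative link lists, not real extraction output.
page_links = {
    "Baldness": ["Hair", "Hair", "Hamilton-Norwood_scale"],
    "Wig": ["Hair"],
    "Scalp": ["Hair", "Skin"],
}
scores = lf_idf(page_links)
print(scores[("Baldness", "Hamilton-Norwood_scale")])
```

Under this assumed weighting, a target that every page links to (Hair in the toy data) scores zero, while a rarely linked target scores higher, mirroring how TF-IDF discounts common terms.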
Experiment: synthesis mode
• M: Simple average
• NM: Average with standard deviation
• Demo
  • http://nlplab.kaist.ac.kr/~kekeeo/term
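A small sketch of the two synthesis modes. M is a plain mean of the per-language strengths. The slides do not say how NM combines the average with the standard deviation, so the version below (mean minus population standard deviation, which penalizes disagreement between languages) is purely an assumption. The function name `synthesize` and the sample scores are hypothetical.

```python
from statistics import mean, pstdev

def synthesize(scores, mode="M"):
    """Combine per-language association strengths into one value.

    mode "M":  simple average.
    mode "NM": average adjusted by the standard deviation; the exact
               combination is assumed here to be mean - pstdev.
    """
    if not scores:
        raise ValueError("at least one per-language score is required")
    if mode == "M":
        return mean(scores)
    if mode == "NM":
        return mean(scores) - pstdev(scores)
    raise ValueError(f"unknown mode: {mode}")

# Illustrative strengths of one association in two languages.
per_language = [0.8, 0.6]
m_score = synthesize(per_language, "M")
nm_score = synthesize(per_language, "NM")
print(m_score, nm_score)
```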