Multilingual Synchronization Eun-kyung Kim 2011-02-10
Introduction • Wikipedia • Supports over 270 languages • Allows cross-lingual navigation through inter-language links • The quantity of data differs across language editions • Goal • Synchronize multilingual Wikipedia data to fill the gaps between languages and to acquire integrated knowledge
Methodology (base) • Hypothesis • If X is a key fact in L1, then X’ should be a key fact in L2 • where X’ is the term corresponding to X in the other language • Assumption • Inter-language links accurately connect two pages about the same entity or concept in different languages • Key facts come from structured data such as: • Infobox • Category • Hyperlink text (rather than normal text)
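The hypothesis can be illustrated with a small sketch. It assumes key facts are already available as sets of page titles and that inter-language links are given as a plain dictionary; the function name `missing_key_facts` and the toy data are hypothetical and are not taken from the slides.

```python
def missing_key_facts(facts_l1, facts_l2, interlang):
    """Return the L2 terms X' that the hypothesis predicts but L2 lacks.

    facts_l1  -- set of key facts (titles) on the L1 page
    facts_l2  -- set of key facts (titles) on the L2 page
    interlang -- dict mapping an L1 title X to its L2 counterpart X'
    """
    missing = set()
    for x in facts_l1:
        x_prime = interlang.get(x)  # None when no inter-language link exists
        if x_prime is not None and x_prime not in facts_l2:
            missing.add(x_prime)
    return missing


# Hypothetical toy data for an English/Korean page pair.
en_facts = {"Racquet sport", "Shuttlecock", "Olympic sport"}
ko_facts = {"라켓 스포츠"}
en_to_ko = {"Racquet sport": "라켓 스포츠", "Shuttlecock": "셔틀콕"}

candidates = missing_key_facts(en_facts, ko_facts, en_to_ko)
```

Facts without an inter-language link ("Olympic sport" in the toy data) are skipped, because the assumption on this slide only covers terms that have a linked counterpart.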
Methodology • Basic methodology • Infobox synchronization between English & Korean • Duplicate resolution & conflict resolution • Comments from the committee (Ph.D. proposal) • Improve the multilingualism • Do not ignore multiple viewpoints • Hard to evaluate • Extended methodology • Sync target selection in 5 languages • Key-fact synchronization covering not only Infobox but also LinkText • Filling in missing information according to each language's background knowledge and characteristics • Focus on how to add new information from other language resources
Example of Infobox Synchronization • Drawback of Infobox • Infobox content is sometimes not meaningful for synchronization • Solution • Add link information to the synchronization (figure: Infobox from the Arthritis page)
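Links on the Web • Links • navigate to a web page with more detailed information • point to previously published web pages with similar or related content • Understanding the influence of each link can substantially benefit many applications • e.g., multilingual sync • Figure labels (Korean Wikipedia page titles): 베짱이 (katydid), 귀뚜라미 (cricket), 방아깨비 (long-headed grasshopper), 메뚜기목 (Orthoptera), 메뚜기 (grasshopper), 해충 (pest), 풀무치 (migratory locust), 벼메뚜기 (rice grasshopper), 여치 (bush cricket), 사우디아라비아 (Saudi Arabia), 농업 (agriculture), 예멘 (Yemen)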
Multilingual Synchronization Process • Input: Wikipedia data for L1, L2, …, LN • Preprocessing (target page selection) • Extracting links • Modeling on influence links • Translating links into target languages to sync • Computing similarity between existing and new links • Finding missing links according to the model • Unifying synchronized data
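A minimal sketch of the translate, compare, and unify stages follows. The slides do not specify the similarity measure or the influence model, so Jaccard similarity and a plain set difference stand in for them here as assumptions; all function names and the toy data are hypothetical.

```python
def translate_links(links, interlang):
    """Map source-language link titles to the target language; drop unmapped ones."""
    return {interlang[title] for title in links if title in interlang}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the slides)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sync_links(source_links, target_links, interlang):
    """Translate source links, compare with existing target links, and unify."""
    translated = translate_links(source_links, interlang)
    missing = translated - target_links  # stand-in for the model-based step
    return {
        "similarity": jaccard(translated, target_links),
        "missing": missing,
        "unified": target_links | missing,
    }


# Hypothetical toy data: English links synced into a Korean page.
en_links = {"Orthoptera", "Pest", "Yemen"}
ko_links = {"메뚜기목", "농업"}
en_to_ko = {"Orthoptera": "메뚜기목", "Pest": "해충"}

result = sync_links(en_links, ko_links, en_to_ko)
```

In this toy run "Yemen" has no inter-language mapping and is dropped, while "해충" is reported as missing and added to the unified set.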
Preprocessing: Selecting Target • Source languages (5) • English, Spanish, French, Chinese, Korean • Extracting target pages that form a complete graph (clique) of inter-language links • Assumption: • Pages found in all 5 languages are key pages and are the targets to sync • Enforcing consistency of a link path • If a path from X (L1) to X’ (L2) is found once, its inverse path (X’, X) is automatically added to the output • Example clique: en:Badminton, fr:Badminton, es:Bádminton, ko:배드민턴, zh:羽毛球 • Figure annotation: a subset of UN official languages
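The target selection can be sketched as follows, assuming inter-language links arrive as directed `(source, target)` pairs of `lang:Title` strings and candidate groups as one title per language. The function names and the second candidate group are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

LANGS = ("en", "es", "fr", "zh", "ko")


def symmetrize(links):
    """Add the inverse of every link, so (X, X') also yields (X', X)."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def is_clique(pages, graph):
    """True when every pair of pages is connected in the symmetrized graph."""
    return all(b in graph[a] for a, b in combinations(pages, 2))


def select_targets(candidates, links):
    """Keep candidate groups that cover all 5 languages and form a clique."""
    graph = symmetrize(links)
    selected = []
    for group in candidates:  # group: dict mapping language code -> title
        if set(group) != set(LANGS):
            continue
        pages = [f"{lang}:{title}" for lang, title in group.items()]
        if is_clique(pages, graph):
            selected.append(group)
    return selected


badminton = {"en": "Badminton", "fr": "Badminton", "es": "Bádminton",
             "ko": "배드민턴", "zh": "羽毛球"}
partial = {"en": "Example", "ko": "예시"}  # hypothetical page missing 3 languages

# Only one direction per pair is listed; symmetrize() supplies the inverse paths.
pages = [f"{lang}:{title}" for lang, title in badminton.items()]
links = list(combinations(pages, 2)) + [("en:Example", "ko:예시")]

targets = select_targets([badminton, partial], links)
```

Symmetrizing before the clique test mirrors the path-consistency rule: a link seen in one direction counts for both.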
Preprocessing: Selected Pages • Total 42,077 pages • Example: page-length comparison • Badminton (배드민턴) • en (52,098) > fr (26,508) > ko (22,960) > zh (19,050) > es (17,594) • Suncheon,_Jeollanam-do (순천시_(전라남도)) • ko (20,816) > en (8,910) > zh (1,688) > es (1,600) > fr (1,503)
Modeling on influence links • Example of links in multiple language Wikipedias • Different Wikipedias have different viewpoints and different concerns (fig) • Some links are newly added and others are deleted by users over time • We need to know the permutation distance of links in each language Wikipedia (ongoing)
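The slide does not say which permutation distance is intended. As one possible reading, the sketch below uses the Kendall tau distance (the number of link pairs whose relative order differs) restricted to links present in both orderings. The choice of measure, the function name, and the toy orderings are all assumptions.

```python
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Count pairs of shared links that appear in opposite order in the two lists."""
    in_b = set(order_b)
    shared = [x for x in order_a if x in in_b]  # keeps the order from order_a
    rank_b = {x: i for i, x in enumerate(order_b)}
    # x precedes y in order_a; the pair is discordant if y precedes x in order_b.
    return sum(1 for x, y in combinations(shared, 2) if rank_b[x] > rank_b[y])


# Hypothetical link orderings for the same page in two language editions.
links_l1 = ["Orthoptera", "Pest", "Agriculture"]
links_l2 = ["Pest", "Agriculture", "Orthoptera", "Yemen"]

distance = kendall_tau_distance(links_l1, links_l2)
```

A distance of 0 means the shared links appear in the same order in both editions; links that exist in only one edition ("Yemen" here) are ignored by this particular sketch.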
Evaluation Plan • Compare how much useful information can be filled in from other language resources • Links of the Featured article in L1 vs. unified links from M-Sync in L2, …, LN • Compare how much relevant information can be filled in from other language resources • NGD (Normalized Google Distance)
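For reference, the Normalized Google Distance between terms x and y is NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), where f counts pages containing the term(s) and N is the total number of indexed pages. A minimal sketch, assuming the counts are already available as numbers (how they would be obtained is not covered in the slides, and the parameter names are hypothetical):

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw page counts.

    fx, fy -- number of pages containing term x, term y
    fxy    -- number of pages containing both terms
    n      -- total number of indexed pages
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # terms that never (co-)occur are treated as unrelated
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical counts: the terms always co-occur, so the distance is 0.
same = ngd(100, 100, 100, 10_000)
# Hypothetical counts with only partial co-occurrence.
partial = ngd(1_000, 100, 10, 1_000_000)
```

Lower values indicate more closely related terms, so a filled-in link could be scored by its NGD to the page title.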
Task & Schedule • Target conference: • Web Intelligence 2011 (3/4, 3/11) • Task • Modeling on influence links to synchronize • Link category analysis for each language • Using Wikipedia link information • Using Wikipedia templates and categories (CAT2ISA) • Link evolution analysis for each language • Using Wikipedia edit history • Making an evaluation dataset