Corpus Exploitation from Wikipedia for Ontology Construction • Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen • The Department of Computing, The Hong Kong Polytechnic University
Outline • Introduction • Related Works • Algorithm design • Classification Tree Traversal • Ranking nodes in the classification tree • Experiments and Evaluations • Conclusion and Future Works
Background • Ontology Construction • Manual construction • Corpus is not necessary • Small scale • Automatic or semiautomatic construction • Domain specific corpus • Good domain knowledge coverage
Related Works • Corpus Selection • Corpus by linguists • British National Corpus (BNC) [Collin F. Baker et al., 1998] • Corpus from Publications • Reuters News Corpus [Latifur Khan, Feng Luo, 2002] • Corpus from Internet • Search Results from the Web as Corpus [P Cimiano et al., 2004]
Use of Wikipedia as a Resource • Statistics and analysis work • [A Lih, 2004], [Jakob Voss, 2005] • Link structure and cultural bias analysis of Wiki • [M Völkel, M Krötzsch, D Vrandecic, H Haller and R Studer, 2006], [F Bellomi and R Bonato, 2005] • Add semantic links • Add semantic links between concepts in Wiki pages • [M Völkel, 2006], [Michael Strube, Simone Paolo Ponzetto, 2006] • Corpus for XML retrieval • [L Denoyer, P Gallinari, 2006]
Problems • Manually Selected Corpus • Domain experts needed • Time and labor intensive • Corpus Collection from Publications • Limited in time and region • Internet Exploitation • Domain-specific data is difficult to identify
Wikipedia Overview • Established in 2001 • 500,000 articles in 2005 • 1 million articles in Nov. 2006 • More than 2 million articles to date • Different types of data • Abundance of domain-specific data • Availability of category information • Too many reachable nodes
Algorithm Design • Basic Idea • Use the classification tree to restrict the corpus to certain qualified reachable nodes • Classification Tree Traversal • Given a root node Pr (a category node) • Breadth-First Search algorithm • Initialization • Wr = 1 for the root node Pr • Wi = 0 if Pi is not on the current traversal path
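The traversal step can be sketched as a plain breadth-first search. This is a minimal sketch, not the authors' implementation: the function name `bfs_classification_tree` and the adjacency dictionary `children` are hypothetical, and the slides give only the initial weights (1 for the root, 0 elsewhere), so no weight-update rule is shown.

```python
from collections import deque


def bfs_classification_tree(children, root):
    """Breadth-first traversal from a root category node.

    `children` is a hypothetical adjacency dict mapping a node to its
    sub-categories and pages. Returns the visit order and the initial
    weights: 1 for the root, 0 for every other reached node.
    """
    weights = {root: 1.0}  # Wr = 1 for the root node Pr
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            # The category graph can share children, so visit each node once.
            if child not in seen:
                seen.add(child)
                weights[child] = 0.0  # initial weight; update rule not given in the slides
                queue.append(child)
    return order, weights
```

Each node is visited once even if several categories link to it, which keeps the traversal finite on a graph that is not strictly a tree.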
Tree traversal and weights • Wiki Graph • Classification Tree • In-edge • Out-edge • Nin(P) • Nout(P)
Ranking Schemes (1) • S1 • Considers the sum of the scores of Pc's out-edges that point into the classification tree, relative to the total number of Pc's out-edges • The 1 in the denominator prevents it from being 0
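The S1 formula itself is not reproduced in this transcript, so the following is only one possible reading of the description above, not the authors' definition. It assumes the "score" of an out-edge is the weight of its target node in the classification tree and that the denominator is 1 plus the number of out-edges. The names `s1_score`, `out_edges`, and `tree_weights` are hypothetical.

```python
def s1_score(out_edges, tree_weights):
    """One possible reading of S1 for a candidate node Pc.

    `out_edges` lists the targets of Pc's out-edges; `tree_weights` maps
    nodes in the classification tree to their weights (both hypothetical).
    """
    # Sum the weights of the targets that belong to the classification tree.
    in_tree = sum(tree_weights[t] for t in out_edges if t in tree_weights)
    # The added 1 keeps the denominator non-zero when Pc has no out-edges.
    return in_tree / (1 + len(out_edges))
```

Under this reading, a page whose links mostly stay inside the classification tree scores higher than one whose links mostly leave it.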
Ranking Schemes (2) • S2 • Considers the sum of Pc's in-edges from the classification tree, relative to the total number of in-edges of the nodes Pi, which are Pc's upper-level nodes
Ranking Schemes (3) • S3 • Considers the sum over the out-edge nodes in the classification tree, divided by both Pc's out-edge scores and the in-edge scores of its upper-level nodes Pi
Data • Wikipedia Resource • English version in XML • 1,100,000 articles • Cutoff date: Nov. 30, 2006 • Domain Connected Branches • 549,486 nodes for IT • 549,433 nodes for biology
Evaluation on Scheme Selection • Evaluation by sampling • For Top 20,000 nodes • 10 nodes in every 1,000 nodes • For Remaining nodes • 10 nodes in every 10,000 nodes • Corpus size • Top 20,000 • 98M for IT • 101M for Biology
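The sampling plan above can be sketched as follows. This is a sketch under stated assumptions: the slides do not say how the 10 nodes are chosen within each block, so uniform random sampling with a fixed seed is assumed here, and `sample_for_evaluation` and its parameters are hypothetical names.

```python
import random


def sample_for_evaluation(ranked, top_n=20_000, seed=0):
    """Draw 10 nodes per 1,000 in the top `top_n`, then 10 per 10,000 after.

    `ranked` is a list of nodes sorted by score, best first. Uniform random
    choice inside each block is an assumption, not stated in the slides.
    """
    rng = random.Random(seed)
    sample = []

    def take(block, k=10):
        # Guard against a short final block.
        return rng.sample(block, min(k, len(block)))

    # Top-ranked region: blocks of 1,000 nodes.
    for start in range(0, min(top_n, len(ranked)), 1_000):
        sample.extend(take(ranked[start:min(start + 1_000, top_n)]))
    # Remaining nodes: blocks of 10,000 nodes.
    for start in range(top_n, len(ranked), 10_000):
        sample.extend(take(ranked[start:start + 10_000]))
    return sample
```

Under this reading, a ranked list of 549,486 nodes yields 200 samples from the top 20,000 and 530 from the remaining 53 blocks, 730 in total.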
Sampling Results of Different Schemes • Table 1: Evaluation Results of Different Schemes in the IT Domain • Table 2: Evaluation Results of Different Schemes in the Biology Domain
Root Node Identification • Different root nodes lead to different classification structures • E.g. "Category: Electronics" • For electronics • For IT • Compare with the Library of Congress Classification (LACC) • Widely used library classification in most research and academic areas
Comparisons with LACC • Comparisons of Classification Trees with Root Nodes from Respective Domains
Comparisons with LACC (2) • Comparisons of Classification Tree Structures with LACC with Root Node: Electronics
Conclusion • Acquire leaf nodes through qualified classification tree branches in Wiki • The best-performing ranking takes both in-edges and out-edges into consideration • Selection of a proper root node does affect the results • Pick the most common term as the root node
Future Works • Improve Ranking Functions • Using page contents • Using hyperlinks in contexts of pages • Set different weight parameters for different domains
Thanks! Q & A