Corpus Exploitation from Wikipedia for Ontology Construction

Corpus Exploitation from Wikipedia for Ontology Construction Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen The Department of Computing The Hong Kong Polytechnic University

Outline • Introduction • Related Works • Algorithm design • Classification Tree Traversal • Ranking nodes in the classification tree • Experiments and Evaluations • Conclusion and Future Works

Background • Ontology Construction • Manual construction • Corpus is not necessary • Small scale • Automatic or semiautomatic construction • Domain specific corpus • Good domain knowledge coverage

Related Works • Corpus Selection • Corpus by linguists • British National Corpus (BNC) [Collin F. Baker et al., 1998] • Corpus from Publications • Reuters News Corpus [Latifur Khan, Feng Luo, 2002] • Corpus from Internet • Search Results from the Web as Corpus [P Cimiano et al., 2004]

Use of Wikipedia as a Resource • Statistical and analysis work • [A Lih, 2004], [Jakob Voss, 2005] • Link structure and cultural bias analysis of Wiki • [M Völkel, M Krötzsch, D Vrandecic, H Haller and R Studer, 2006], [F Bellomi and R Bonato, 2005] • Add semantic links • Add semantic links between concepts in Wiki pages • [M Völkel, 2006], [Michael Strube, Simone Paolo Ponzetto, 2006] • Corpus for XML retrieval • [L Denoyer, P Gallinari, 2006]

Problems • Manually Selected Corpus • Domain experts needed • Time and labor intensive • Corpus Collection from Publications • Limitation in time and region • Internet Exploitation • Difficulty in domain specific data identification

Wikipedia Overview • Established in 2001 • 500,000 articles in 2005 • 1 million articles in Nov. 2006 • More than 2 million articles to date • Different types of data • Abundance of domain specific data • Availability of category information • Too many reachable nodes

Algorithm Design • Basic Idea • Use the classification tree to reach only certain qualified nodes • Classification Tree Traversal • Given a Root node: Pr (category node) • Breadth-First-Search Algorithm • Initialization • Wr = 1 for root node Pr • Wi = 0 if Pi is not on the current traversal path
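The traversal on this slide can be sketched as a plain breadth-first search. This is only an illustration under assumptions: the function name `bfs_classification_tree` and the `out_links` dictionary format are hypothetical, and the slides do not give the rule for updating weights during traversal, so the sketch only performs the stated initialization (1 for the root, 0 for every other node).

```python
from collections import deque


def bfs_classification_tree(out_links, root):
    """Breadth-first traversal from `root` over a link graph.

    `out_links` maps a page to the pages it links to (hypothetical format).
    Returns the visit order and the initial weights: 1.0 for the root and
    0.0 for every other reached node. No weight update rule is applied here.
    """
    weights = {root: 1.0}
    order = [root]
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for child in out_links.get(page, []):
            if child not in weights:  # first visit only, so cycles terminate
                weights[child] = 0.0
                order.append(child)
                queue.append(child)
    return order, weights


# Toy graph with a back-link to the root, to show that cycles are handled.
toy_links = {
    "Category:IT": ["Software", "Hardware"],
    "Software": ["Compiler", "Category:IT"],
    "Hardware": ["Compiler"],
}
order, weights = bfs_classification_tree(toy_links, "Category:IT")
```

Using a queue rather than recursion keeps the traversal level by level, which matches the breadth-first order named on the slide.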

Tree traversal and weights • Wiki Graph • Classification Tree • In-edge • Out-edge • Nin(P) • Nout(P)
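As a small illustration of the Nin(P) and Nout(P) counts listed above, the following sketch derives both from a list of directed links. The `(source, target)` edge-list format and the name `degree_counts` are assumptions for the example, not something taken from the slides.

```python
from collections import Counter


def degree_counts(edges):
    """Return (n_in, n_out) counters for a list of (source, target) links."""
    n_out = Counter(src for src, _ in edges)
    n_in = Counter(dst for _, dst in edges)
    return n_in, n_out


edges = [("A", "B"), ("A", "C"), ("B", "C")]
n_in, n_out = degree_counts(edges)
# A Counter returns 0 for pages that never appear, e.g. n_in["A"].
```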

Ranking Schemes (1) • S1 • Considering the sum of scores of Pc’s out-edges pointing to the classification tree against the total number of Pc’s out-edges • The 1 in the denominator keeps it from being 0
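The formula itself did not survive in this transcript, so the following is one possible reading of the S1 description, not the authors' exact definition: the scores of the out-link targets that lie in the classification tree are summed and divided by the total number of out-links plus 1. The names `s1_score`, `out_links`, and `weights` are hypothetical; `weights` is assumed to hold scores only for nodes already in the tree.

```python
def s1_score(page, out_links, weights):
    """One reading of S1: in-tree out-edge scores over (total out-edges + 1)."""
    targets = out_links.get(page, [])
    # Targets outside the classification tree contribute a score of 0.
    in_tree_sum = sum(weights.get(t, 0.0) for t in targets)
    return in_tree_sum / (len(targets) + 1)


out_links = {"Pc": ["A", "B", "X"]}   # "X" is not in the tree
weights = {"A": 1.0, "B": 0.5}        # scores of in-tree nodes only
score = s1_score("Pc", out_links, weights)
score_no_links = s1_score("Isolated", out_links, weights)
```

The `+ 1` means a page with no out-links scores 0 instead of raising a division error.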

Ranking Schemes (2) • S2 • Considering the summation of Pc’s in-edges in the classification tree against the total number of in-edges of the nodes Pi, which are Pc’s upper level nodes
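The S2 formula is also missing from the transcript, and the description admits more than one reading. The sketch below takes one of them as an assumption: count Pc's in-edges that come from tree nodes (its upper level nodes Pi) and divide by the total number of in-edges of those Pi. The `+ 1` guard in the denominator is borrowed from S1 and is an assumption here, as are the names `s2_score`, `in_links`, and `tree_nodes`.

```python
def s2_score(page, in_links, tree_nodes):
    """One reading of S2: in-tree in-edges of `page` over the in-edges of its parents."""
    # Upper level nodes: pages in the tree that link to `page`.
    parents = [p for p in in_links.get(page, []) if p in tree_nodes]
    parent_in_edges = sum(len(in_links.get(p, [])) for p in parents)
    return len(parents) / (parent_in_edges + 1)  # +1 guard is an assumption


in_links = {
    "Pc": ["P1", "P2", "X"],  # "X" is outside the tree
    "P1": ["R"],
    "P2": ["R", "P1"],
}
tree_nodes = {"R", "P1", "P2", "Pc"}
score = s2_score("Pc", in_links, tree_nodes)
```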

Ranking Schemes (3) • S3 • Considering the summation of the out-edge nodes in the classification tree divided by both Pc’s out-edge scores and its upper level nodes Pi’s in-edge scores

Data • Wikipedia Resource • English version in XML • 1,100,000 articles • Cutoff date: Nov. 30, 2006 • Domain Connected Branches • 549,486 nodes for IT • 549,433 nodes for biology

Evaluation on Scheme Selection • Evaluation by sampling • For Top 20,000 nodes • 10 nodes in every 1,000 nodes • For Remaining nodes • 10 nodes in every 10,000 nodes • Corpus size • Top 20,000 • 98M for IT • 101M for Biology
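The sampling plan above can be sketched as follows. The slides do not say how the 10 nodes in each block were chosen, so uniform random selection is an assumption, as are the function name `sample_for_evaluation` and its parameters.

```python
import random


def sample_for_evaluation(ranked, top_n=20_000, top_block=1_000,
                          rest_block=10_000, per_block=10, seed=0):
    """Pick `per_block` nodes from each block of a ranked node list.

    The top `top_n` nodes are split into blocks of `top_block`; the
    remaining nodes are split into blocks of `rest_block`.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sampled = []

    def sample_blocks(nodes, block):
        for start in range(0, len(nodes), block):
            chunk = nodes[start:start + block]
            sampled.extend(rng.sample(chunk, min(per_block, len(chunk))))

    sample_blocks(ranked[:top_n], top_block)
    sample_blocks(ranked[top_n:], rest_block)
    return sampled


# Toy ranking of 50,000 node ids: 20 top blocks and 3 remaining blocks.
sample = sample_for_evaluation(list(range(50_000)))
```

Sampling the top of the ranking ten times more densely than the tail puts most of the manual judging effort where the corpus is actually drawn from.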

Sampling Results of Different Schemes Table 1 Evaluation Result of Different Schemes in the IT Domain Table 2 Evaluation Result of Different Schemes in the Biology Domain

Overall Precision on sampled data

Root Node Identification • Different root nodes lead to different classification structures • E.g. “Category: Electronics” • For electronics • For IT • Comparison with the Library of American Congress Classification (LACC) • Widely used library classification in most research and academic areas

Comparisons with LACC Comparisons of Classification Trees with Root Nodes from Respective Domains

Comparisons with LACC (2) Comparisons of Classification Tree Structures with LACC with Root Node: Electronics

Conclusion • Acquire leaf nodes through qualified classification tree branches in Wiki • The best performance comes from considering both in-edges and out-edges • Selection of proper nodes does affect the results • Pick the most common term as the root node

Future Works • Improve Ranking Functions • Using page contents • Using hyperlinks in contexts of pages • Set different weight parameters for different domains

Thanks! Q & A
