A Knowledge-Based Search Engine Powered by Wikipedia

A Knowledge-Based Search EnginePowered by Wikipedia David Milne, Ian H. Witten, David M. Nichols (CIKM 2007)

Introduction • Whenever we seek out new knowledge — how can one describe the unknown? • What knowledge seekers need is a bridge between what they know and what they wish to know. • Knowledge seekers can benefit from a thesaurus that covers the terminology of both documents and potential queries, and describes relations that bridge between them.

KORU

Creating A Relevant Knowledge Base • To work well, Koru relies on a large and comprehensive thesaurus. • Our own technique automatically extracts thesauri from a huge manually defined information structure. • From Wikipedia, we derive a thesaurus that is specific to each particular document collection.

Creating A Relevant Knowledge Base • The basic idea is to use Wikipedia’s articles as building blocks for the thesaurus. • Each article describes a single concept. • Concepts are often referred to by multiple terms and Wikipedia handles these using “redirects”.

Measuring semantic relatedness • Semantic relatedness concerns the strength of the relations between concepts. • The measure that we use quantifies the strength of the relation between two Wikipedia articles by weighting and comparing the links found within them.

Measuring semantic relatedness • Links are weighted by their probability of occurrence. • Links are less significant for judging the similarity between articles if many other articles also link to the same target. • We simply sum the weights of the links that are common to both articles.

Disambiguating unrestricted text • To identify the concepts relevant to a particular document collection, we work through each document in turn, identifying the significant terms and matching them to individual Wikipedia articles. • Disambiguate each term using the context surrounding it.

Identifying relations between concepts • Wikipedia contains many more links than the redirects we use to identify synonymy. • We gather all relations from article and category links, but weight them so that only the strongest are emphasized.

Weighing topics, occurrences and relations • Every occurrence of every topic is weighted within the thesaurus. • 1. tf-idf：A significant topic for a document should occur many times. • 2. average semantic relatedness measure：A significant topic should relate strongly to other topics in the document.

KORU in action

Evaluation • 2種版本：Topic browsing：有使用thesaurusKeyword Searching：沒有使用thesaurus • 12名參加者、10個task

Evaluation

A Knowledge-Based Search Engine Powered by Wikipedia

A Knowledge-Based Search Engine Powered by Wikipedia

Presentation Transcript

Choosing a Search Engine

Choosing a Search Engine

Social Services Knowledge Scotland powered by The Knowledge Network

Search By Frompo Engine

Search Engine

Search Engine

FEDSAS MEMBER SCHOOLS SEARCH ENGINE PROJECT, POWERED BY RSA SEARCH Project Information

Multi-Language Ontology-based Search Engine

MedEx : Medical Expert Powered by a Context Targeted Search Engine

Wikitology: A Wikipedia Derived Knowledge Base

LyricSearch: A Text-Based Search Engine for Lyrics

N-gram Search Engine on Wikipedia

Solaris A Solar Powered Stirling Engine

On-line Knowledge Management Search Engine

Search Engine

Search engine

Search Engine

search engine

SLXi , NOW powered by a lean combustion engine

SEARCH ENGINE

A Knowledge-Based Search Engine Powered by Wikipedia