Improving the performance of personal name disambiguation using web directories

Improving the performance of personal name disambiguation using web directories • Quang Minh Vu, Atsuhiro Takasu, Jun Adachi • IPM, 2008 • Presented by Hung-Yi Cai, 2010/09/01

Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

Motivation • Searching for information about a person on the internet is an increasingly common requirement in information retrieval. • Search results returned by search engines for a personal name query often contain documents relevant to several people, because a name is usually shared by several people. • Due to this name ambiguity problem, users have to manually inspect the result documents to filter out people in whom they have no interest.

Previous Studies

Objectives • Propose Similarity via Knowledge Base (SKB), which uses web directories to improve disambiguation performance in a Name Disambiguation System (NDS). • SKB can be divided into two components: • Using web directories as a knowledge base to find common contexts in documents by TF-IDF. • Then, using the common-context measure to determine document similarities.

TF-IDF • Term weights are calculated using the terms’ occurrences in the document concerned and in a set of documents. • Tf (t, doc) is the number of times term t appears in the document doc.
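A minimal sketch of the TF-IDF weighting described on this slide. The slide defines only the term frequency; the inverse document frequency form log(N / df) used here is an assumption (a common variant), and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf(term, corpus):
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tfidf_vector(doc_tokens, corpus):
    """Map each term of a document to tf(t, doc) * idf(t)."""
    counts = Counter(doc_tokens)  # tf(t, doc): occurrences of t in doc
    return {t: c * idf(t, corpus) for t, c in counts.items()}


# Toy corpus of already tokenized documents (illustrative only).
corpus = [
    ["professor", "university", "professor", "research"],
    ["guitar", "band", "research"],
    ["university", "research", "lab"],
]
weights = tfidf_vector(corpus[0], corpus)
```

In this toy corpus "research" occurs in every document, so its weight is zero, while "professor" occurs twice in a single document and gets the largest weight.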

Methodology • SKB uses web directories to measure features of terms in a document. • Measurement of term weights using a knowledge base • A knowledge base • Modification of term weights in documents • Modification of term weights in directories • Measurement of document similarities • Find directories close in topic to the document • Measure document similarities

Name Disambiguation System • The operational details are as follows: • Preprocessing documents • Calculation of document similarities • Discrimination by reranking documents
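The reranking step can be sketched as follows, assuming that one document of the target person serves as a seed and that the remaining documents are ordered by descending similarity to it. The slide does not spell out this procedure, so the function name `rerank`, the seed convention, and the similarity values are illustrative assumptions.

```python
def rerank(seed_id, similarity):
    """Return the other document ids, most similar to the seed first."""
    scores = similarity[seed_id]
    others = [doc_id for doc_id in scores if doc_id != seed_id]
    return sorted(others, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical precomputed similarities between the seed d1 and other documents.
similarity = {
    "d1": {"d1": 1.0, "d2": 0.8, "d3": 0.1, "d4": 0.5},
}
ranked = rerank("d1", similarity)
```

Any similarity measure (SKB, VSM, or NER based) could fill the `similarity` table; only the ordering step is shown here.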

Experiments • Step 1. Data Sets • Documents of people • Creation of pseudo namesake document sets and real namesake document sets

Experiments • Step 2. Web directory structures

Experiments • Step 3. Baseline methods • Comparing SKB with two conventional methods: • VSM: • Calculating the weights of terms by TF-IDF • Building the feature vectors of documents • NER: • Extracting the entity names in the documents with the LingPipe software • Using these names to construct feature vectors of the documents (the vector components are binary values)
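A rough sketch of how the NER baseline's binary feature vectors could be compared. Cosine similarity is an assumption here (the slide does not name the similarity measure), and the entity names are hard-coded placeholders rather than LingPipe output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def binary_vector(entity_names):
    """Binary features: 1.0 for every distinct entity name in the document."""
    return {name: 1.0 for name in set(entity_names)}


# Placeholder entity names standing in for the output of an NER tool.
doc_a = binary_vector(["Tokyo", "University", "ACM"])
doc_b = binary_vector(["Tokyo", "University", "Kyoto"])
sim = cosine(doc_a, doc_b)
```

The same `cosine` function would also accept TF-IDF weighted dicts, which corresponds to the VSM baseline under the same assumption.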

Experiments • Step 4. Evaluation metrics • We recorded the precision values at 11 recall points: 0%, 10%, ..., 90%, and 100%, and denoted these as P(doc_i, 0%), P(doc_i, 10%), ..., P(doc_i, 90%), and P(doc_i, 100%), respectively.
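One way to compute precision at the 11 recall points for a single ranked list is sketched below. It assumes the standard interpolated precision (the highest precision at any recall level at or above the target); the slide does not say whether interpolation was used, and the function name and sample ranking are hypothetical.

```python
def precision_at_recall_points(ranked_relevance, total_relevant):
    """Interpolated precision at recall 0%, 10%, ..., 100%.

    ranked_relevance: booleans in rank order, True if the document
    belongs to the same person as the seed document.
    """
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    result = []
    for i in range(11):
        level = i / 10
        # Small tolerance guards against floating-point rounding of recall.
        candidates = [p for r, p in points if r >= level - 1e-12]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Sample ranking: relevant documents at ranks 1 and 3, two relevant in total.
precisions = precision_at_recall_points([True, False, True, False], 2)
```

For this sample the precision is 1.0 up to 50% recall and 2/3 from 60% to 100% recall.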

Experiments • Step 5. Experimental results • The overall performance for each method • In this experiment, we set the window size n = 50 and the number of representative directories k = 20. We set the frequency document ratio threshold for SKB2 r = 5.

Experiments • Step 5. Experimental results • Performance of SKB2 when varying the frequency ratio threshold

Experiments • Step 5. Experimental results • Performance of SKB systems when varying the window size

Experiments • Step 5. Experimental results • Performance of SKBs when varying the number of representative directories

Experiments • Step 5. Experimental results • Performance for each method on real namesake document sets

Conclusions • Disambiguation of people will be a trend in web search, and we propose a new method that uses web directories as a knowledge base to improve the disambiguation performance. • The experimental results showed a significant improvement with our system over the other methods, and we also verified the robustness of our methods experimentally with different web directory structures and with different parameter values.

Comments • Advantages • Requires little preparation • Covers a broad range of people • Disadvantages • Cost of computation is proportional • Some mistakes • Applications • Information retrieval
