230 likes | 252 Views
This paper outlines the importance of similarity thesauruses in information retrieval systems, focusing on Arabic language. It discusses the drawbacks of manual thesaurus construction and proposes an improvement using a similarity matrix. The experiment evaluates two query expansion methods - "SUM" and "MEAN" - with the "MEAN" method showing better relevance strength. Future work includes word stemming, collocation production, and integration of similarity thesauruses in question answering systems.
E N D
An Automatic Construction of Arabic Similarity Thesaurus Abdulaziz Al-Qabbany AbdulMalik Al-Salman Abdulrahman Almuhareb CITALA 2009
Outline • Introduction • Thesauruses • Similarity Thesaurus • Proposed Improvement • The Experiment • Evaluation • Discussion • Conclusions and Future Work
Introduction • Thesaurus importance • Effective Information Retrieval systems • Vocabulary mismatch problem • Query Expansion
Thesauruses • Arabic thesauruses • Manual construction drawbacks: • cost • time • subjectivity • Automatic construction approaches
Similarity Thesaurus • Qiu and Frei (1993) presented their query expansion model using similarity thesaurus. • Zazo et al. (2005) used the same approach for constructing a Spanish similarity thesaurus. • Expanding queries based on similarity to their concepts rather than similarity to the individual terms.
قرص الشمس قـرص قرص الدواء قرص ضوئي Similarity Thesaurus (cont.) • Using similarity thesaurus is analogous to the translation from a language to another. • Example
Similarity Thesaurus construction • The similarity thesaurus is a matrix that represents terms similarities. • Each term is represented by a vector that determines its relation with each document. • The matrix is generated through calculating similarities between terms vectors.
Query Expansion using Similarity Thesaurus • Similarity between the query q and any term t is computed as the sum of the similarities values between each query term and t. SIM_QT(q, t) = • As a response to any query, the terms can be ranked in descending order according to their SIM_QT values.
“Sum” method • “SUM” method is appropriate when the similarity values between the query terms and the indexed term are consistent within the same range. • When similarity values are inconsistent, the differences between the values will not be reflected on the total sum. • Similarity values are considered to be inconsistent when they contain outliers.
Outliers • Outlier is a value that is considerably dissimilar or inconsistent with the majority of the data. Y outlier X
Proposed Improvement • A given term should have a high similarity value with each individual term in the query in order to be considered related. • The dispersion between the similarity values is one of the factors that needed to be considered in query expansion. • The total similarity value should remain as the main factor in query expansion.
Proposed Improvement (cont.) • Instead of using the sum of the similarity values, we use the mean of the values subtracted by the standard error of the mean (SE). SIM_QT(q, t) = • The standard error of the mean is a measure of data dispersion. SE = where, α is the standard deviation and n is the number of values.
The Experiment • we used the France Press Agency Arabic news of years 2004, 2005 and 2006 as the document collection. • This document collection can be found in LDC Arabic Gigaword corpus (Third Edition). • After examining the high frequency terms in the collection, we had chosen 150 stop words.
Evaluation • The objective of the evaluation was to assess the relevance strength of the produced terms. • The evaluation process was applied for both the “SUM” and “MEAN” methods. • We have selected twenty common topics that belong to five different domains.
Evaluation • For each topic, the top ten related terms were presented to five expert evaluators. • Each evaluator was asked to study these twenty topics carefully and then specify if the produced terms are relevant or not. • Levels of relevance: • Relevant • Somewhat Relevant • Irrelevant
Evaluation Results • The relevance strength of the standard “SUM” method was 95.0%, while the Relevance strength of the “MEAN” method was 98.1%.
Discussion • We believe that the main reason that makes the “MEAN” method a better method is its ability to detect and exclude outliers. • Adding a single term to the query may completely change the concept of the query. • The candidate related term should have consistent similarities with all of the query terms.
Example • The response to a query about the former French president “جاك شيراك”:
Conclusions • The relevance strength of the standard “SUM” method was 95.0%, while the Relevance strength of the “MEAN” method was 98.1% • “MEAN” method shows an improvement of about 3.3% over “SUM” method. • We conclude that the “MEAN” method is more accurate mainly because it can detect and exclude the outliers.
Future Work • Applying word stemming. • Producing collocations. • Constructing a single word-category thesaurus. • Using similarity thesaurus in question answering.