230 likes | 250 Views
An Automatic Construction of Arabic Similarity Thesaurus. Abdulaziz Al-Qabbany AbdulMalik Al-Salman Abdulrahman Almuhareb CITALA 2009. Outline. Introduction Thesauruses Similarity Thesaurus Proposed Improvement The Experiment Evaluation Discussion Conclusions and Future Work.
E N D
An Automatic Construction of Arabic Similarity Thesaurus Abdulaziz Al-Qabbany AbdulMalik Al-Salman Abdulrahman Almuhareb CITALA 2009
Outline • Introduction • Thesauruses • Similarity Thesaurus • Proposed Improvement • The Experiment • Evaluation • Discussion • Conclusions and Future Work
Introduction • Thesaurus importance • Effective Information Retrieval systems • Vocabulary mismatch problem • Query Expansion
Thesauruses • Arabic thesauruses • Manual construction drawbacks: • cost • time • subjectivity • Automatic construction approaches
Similarity Thesaurus • Qiu and Frei (1993) presented their query expansion model using similarity thesaurus. • Zazo et al. (2005) used the same approach for constructing a Spanish similarity thesaurus. • Expanding queries based on similarity to their concepts rather than similarity to the individual terms.
قرص الشمس قـرص قرص الدواء قرص ضوئي Similarity Thesaurus (cont.) • Using similarity thesaurus is analogous to the translation from a language to another. • Example
Similarity Thesaurus construction • The similarity thesaurus is a matrix that represents terms similarities. • Each term is represented by a vector that determines its relation with each document. • The matrix is generated through calculating similarities between terms vectors.
Query Expansion using Similarity Thesaurus • Similarity between the query q and any term t is computed as the sum of the similarities values between each query term and t. SIM_QT(q, t) = • As a response to any query, the terms can be ranked in descending order according to their SIM_QT values.
“Sum” method • “SUM” method is appropriate when the similarity values between the query terms and the indexed term are consistent within the same range. • When similarity values are inconsistent, the differences between the values will not be reflected on the total sum. • Similarity values are considered to be inconsistent when they contain outliers.
Outliers • Outlier is a value that is considerably dissimilar or inconsistent with the majority of the data. Y outlier X
Proposed Improvement • A given term should have a high similarity value with each individual term in the query in order to be considered related. • The dispersion between the similarity values is one of the factors that needed to be considered in query expansion. • The total similarity value should remain as the main factor in query expansion.
Proposed Improvement (cont.) • Instead of using the sum of the similarity values, we use the mean of the values subtracted by the standard error of the mean (SE). SIM_QT(q, t) = • The standard error of the mean is a measure of data dispersion. SE = where, α is the standard deviation and n is the number of values.
The Experiment • we used the France Press Agency Arabic news of years 2004, 2005 and 2006 as the document collection. • This document collection can be found in LDC Arabic Gigaword corpus (Third Edition). • After examining the high frequency terms in the collection, we had chosen 150 stop words.
Evaluation • The objective of the evaluation was to assess the relevance strength of the produced terms. • The evaluation process was applied for both the “SUM” and “MEAN” methods. • We have selected twenty common topics that belong to five different domains.
Evaluation • For each topic, the top ten related terms were presented to five expert evaluators. • Each evaluator was asked to study these twenty topics carefully and then specify if the produced terms are relevant or not. • Levels of relevance: • Relevant • Somewhat Relevant • Irrelevant
Evaluation Results • The relevance strength of the standard “SUM” method was 95.0%, while the Relevance strength of the “MEAN” method was 98.1%.
Discussion • We believe that the main reason that makes the “MEAN” method a better method is its ability to detect and exclude outliers. • Adding a single term to the query may completely change the concept of the query. • The candidate related term should have consistent similarities with all of the query terms.
Example • The response to a query about the former French president “جاك شيراك”:
Conclusions • The relevance strength of the standard “SUM” method was 95.0%, while the Relevance strength of the “MEAN” method was 98.1% • “MEAN” method shows an improvement of about 3.3% over “SUM” method. • We conclude that the “MEAN” method is more accurate mainly because it can detect and exclude the outliers.
Future Work • Applying word stemming. • Producing collocations. • Constructing a single word-category thesaurus. • Using similarity thesaurus in question answering.