Automatic Construction of Arabic Similarity Thesaurus

An Automatic Construction of Arabic Similarity Thesaurus Abdulaziz Al-Qabbany AbdulMalik Al-Salman Abdulrahman Almuhareb CITALA 2009

Outline • Introduction • Thesauruses • Similarity Thesaurus • Proposed Improvement • The Experiment • Evaluation • Discussion • Conclusions and Future Work

Introduction • Thesaurus importance • Effective Information Retrieval systems • Vocabulary mismatch problem • Query Expansion

Thesauruses • Arabic thesauruses • Manual construction drawbacks: • cost • time • subjectivity • Automatic construction approaches

Similarity Thesaurus • Qiu and Frei (1993) presented their query expansion model using similarity thesaurus. • Zazo et al. (2005) used the same approach for constructing a Spanish similarity thesaurus. • Expanding queries based on similarity to their concepts rather than similarity to the individual terms.

قرص الشمس قـرص قرص الدواء قرص ضوئي Similarity Thesaurus (cont.) • Using similarity thesaurus is analogous to the translation from a language to another. • Example

Similarity Thesaurus construction • The similarity thesaurus is a matrix that represents terms similarities. • Each term is represented by a vector that determines its relation with each document. • The matrix is generated through calculating similarities between terms vectors.

Query Expansion using Similarity Thesaurus • Similarity between the query q and any term t is computed as the sum of the similarities values between each query term and t. SIM_QT(q, t) = • As a response to any query, the terms can be ranked in descending order according to their SIM_QT values.

“Sum” method • “SUM” method is appropriate when the similarity values between the query terms and the indexed term are consistent within the same range. • When similarity values are inconsistent, the differences between the values will not be reflected on the total sum. • Similarity values are considered to be inconsistent when they contain outliers.

Outliers • Outlier is a value that is considerably dissimilar or inconsistent with the majority of the data. Y outlier X

Proposed Improvement • A given term should have a high similarity value with each individual term in the query in order to be considered related. • The dispersion between the similarity values is one of the factors that needed to be considered in query expansion. • The total similarity value should remain as the main factor in query expansion.

Proposed Improvement (cont.) • Instead of using the sum of the similarity values, we use the mean of the values subtracted by the standard error of the mean (SE). SIM_QT(q, t) = • The standard error of the mean is a measure of data dispersion. SE = where, α is the standard deviation and n is the number of values.

The Experiment • we used the France Press Agency Arabic news of years 2004, 2005 and 2006 as the document collection. • This document collection can be found in LDC Arabic Gigaword corpus (Third Edition). • After examining the high frequency terms in the collection, we had chosen 150 stop words.

Document collection characteristics

Evaluation • The objective of the evaluation was to assess the relevance strength of the produced terms. • The evaluation process was applied for both the “SUM” and “MEAN” methods. • We have selected twenty common topics that belong to five different domains.

Evaluation • For each topic, the top ten related terms were presented to five expert evaluators. • Each evaluator was asked to study these twenty topics carefully and then specify if the produced terms are relevant or not. • Levels of relevance: • Relevant • Somewhat Relevant • Irrelevant

Evaluation Results • The relevance strength of the standard “SUM” method was 95.0%, while the Relevance strength of the “MEAN” method was 98.1%.

Discussion • We believe that the main reason that makes the “MEAN” method a better method is its ability to detect and exclude outliers. • Adding a single term to the query may completely change the concept of the query. • The candidate related term should have consistent similarities with all of the query terms.

Example • The response to a query about the former French president “جاك شيراك”:

Example (cont.)

Conclusions • The relevance strength of the standard “SUM” method was 95.0%, while the Relevance strength of the “MEAN” method was 98.1% • “MEAN” method shows an improvement of about 3.3% over “SUM” method. • We conclude that the “MEAN” method is more accurate mainly because it can detect and exclude the outliers.

Future Work • Applying word stemming. • Producing collocations. • Constructing a single word-category thesaurus. • Using similarity thesaurus in question answering.

End

Automatic Construction of Arabic Similarity Thesaurus

Automatic Construction of Arabic Similarity Thesaurus

Presentation Transcript

Thesaurus Construction and Use

Automatic Assessment of Spoken Modern Standard Arabic

Construction of Arabic WordNet in Parallel with an Ontology

An Information-Theoretic Definition of Similarity

Automatic Derivation of Thesauri Thesauri Construction

Automatic Part-of-Speech Tagging of Arabic Text

Visual Thesaurus

Visual Thesaurus

CHAPTER 9 THESAURUS CONSTRUCTION Prof.R.C.Tripathi

Thesaurus

Automatic Construction of Indoor Floor plans

CROWDINSIDE: AUTOMATIC CONSTRUCTION OF INDOOR FLOOR PLANS

An Overview of Similarity Query Processing

Semi-automatic construction of topic ontology

THESAURUS IMPLEMENTATION

Topology Matching For Fully Automatic Similarity Matching of 3D Shapes

Automatic RAID Construction

Multilingual Thesaurus of Geosciences

Halal is an Arabic word

Automatic Assessment of Spoken Modern Standard Arabic

Construction of Automatic Solar Lawn Mower

best thesaurus