1 / 33

N -Gram and Local Context Analysis for Persian text retrieval

Tehran University. N -Gram and Local Context Analysis for Persian text retrieval. Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani School of Electrical and Computer Engineering University of Tehran Farhad Oroumchian University of Wollongong in Dubai. Outline. The Persian Language

yoko
Download Presentation

N -Gram and Local Context Analysis for Persian text retrieval

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Tehran University N-Gram and Local Context Analysis for Persian text retrieval Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani School of Electrical and Computer Engineering University of Tehran Farhad Oroumchian University of Wollongong in Dubai

  2. Outline • The Persian Language • Used Methods • Pivoted normalization • N-Gram approach • Local Context Analysis • The test collections • Our experiment and the results • Conclusion University of Tehran - Database Research Group

  3. Outline • The Persian Language • Used Methods • Pivoted normalization • N-Gram approach • Local Context Analysis • Our test collections • Our experiment and the results • Conclusion  University of Tehran - Database Research Group

  4. The Persian Language • It is Spoken in countries like Iran, Tajikistan and Afghanistan • It has Arabic like script for writing and consists of 32 characters that are written continuously from right to left • It’s morphological analyzers need to deal with many forms of words that are not actually Farsi • Example • The word “کافر” (singular)  “کفار” (plural) • Or “عادت” that has two plural forms in Farsi: • Farsi form“عادت ها” • Arabic form“عادات” • So N-Grams are a solution University of Tehran - Database Research Group

  5. Our Study • We investigated vector space model on the Persian language: • unstemmed single term • N-gram based • Local Context Analysis • Using HAMSHAHRI collection which contains 160,000+ news articles University of Tehran - Database Research Group

  6. Outline • The Persian Language • Used Methods • Pivoted normalization • N-Gram approach • Local Context Analysis • Our test collections • Our experiment and the results • Conclusion  University of Tehran - Database Research Group

  7. Vector Space Model List of Weights that produced the best results Best University of Tehran - Database Research Group

  8. Problem with Document length normalization • It is supposed to remove the difference between the document's lengths • Under cosine normalization shorter documents get higher weights but they are less relevant. Average probability of Relevance/Retrieval Average of median bin length University of Tehran - Database Research Group

  9. Lnu.ltu weighting scheme • A good weight proposed by Amit Singhal, et al. and tested on TREC collections • Based on reducing the gap between relevance and retrieval Lnu = ltu = University of Tehran - Database Research Group

  10. Probability Final Normalization Factor Document Length Old Normalization Factor Pivoted Normalization Source: A. Singhal, et al. “Pivoted Document Length Normalization” University of Tehran - Database Research Group

  11. Outline • The Persian Language • Used Methods • Pivoted normalization • N-Gram approach • Local Context Analysis • Our test collections • Our experiment and the results • Conclusion  University of Tehran - Database Research Group

  12. NGRAM Approach (Cont.) • NGRAMS are strings of length n. • In this approach the whole text is considered as a stream of characters and then it is broken down to substrings of length n. • It is remarkably resistant to textual errors (e.g. OCR) and no linguistic knowledge is needed. • Example: “مخابرات” for n=4 مخاب خابر ابرا برات رات University of Tehran - Database Research Group

  13. Outline • The Persian Language • Used Methods • Pivoted normalization • N-Gram approach • Local Context Analysis • Our test collections • Our experiment and the results • Conclusion  University of Tehran - Database Research Group

  14. Word Mismatch Problem • Automatic query expansion is a good solution for the issue of word mismatch in IR: • Local Analysis • + Expansion based on high ranking documents • - Needs an extra search • - Some queries may retrieve few relevant documents • Global Analysis • + It has robust average performance • - Expensive in terms of disk space and CPU • - Individual Queries can be significantly degraded University of Tehran - Database Research Group

  15. Local Context Analysis • Local Context Analysis is an automatic query expansion method • combines global analysis (use of context & phrase structure) and local feedback (Top ranked documents) • LCA is fully automated and there is no need to collect any information from user other than the initial query • + It is computationally practical • - But has the extra search to retrieve top ranked documents University of Tehran - Database Research Group

  16. Local Context Analysis (Cont.) • LCA has three main steps: • Run user’s query, break the top N retrieved documents into passages and rank them again. • Calculate similarity of each concept in the top ranked passages with the entire original query using similarity function: • the top M ranked concepts are added to the original query and initial retrieval method is done with the expanded query University of Tehran - Database Research Group

  17. Outline • The Persian Language • Used Methods • Pivoted normalization • N-Gram approach • Local Context Analysis • Our test collections • Our experiment and the results • Conclusion  University of Tehran - Database Research Group

  18. Test Collections • Qvanin Collection • Documents: Iranian Law Collection • 177089 passages • 41 queries and Relevance Judgments • Hamshari Collection • Documents: 600+ MB News from Hamshari Newspaper • 160000+ news articles • 60 queries and Relevance Judgments • BijanKhan Tagged Collection • Documents: 100+ MB from different sources • A tag set of 41 tags • 2590000+ tagged words University of Tehran - Database Research Group

  19. Hamshahri Collection • We used HAMSHAHRI (a test collection for Persian text prepared and distributed by DBRG (IR team) of University of Tehran) • The 3rd version: • contains about 160000+ distinct textual news articles in Farsi • 60 queries and relevance judgments for top 20 relevant documents for each query University of Tehran - Database Research Group

  20. Some examples of Queries University of Tehran - Database Research Group

  21. Outline • The Persian Language • Used Methods • Pivoted normalization • N-Gram approach • Local Context Analysis • Our test collections • Our experiment and the results • Conclusion  University of Tehran - Database Research Group

  22. Term-based vector space model • A. Singhal, et al. in their paper “Pivoted Document Length Normalization” reported that the following two configurations have the best performance: • Slope=0.25 and using pivoted unique normalization (P.U.N.). • Pivot = average no. of unique terms in a document • Slope=0.75 and using pivoted cosine normalization (P.C.N.). • Pivot = average cosine factor for 1+log(tf) University of Tehran - Database Research Group

  23. Our experiment results • Comparison of vector space model slope=0.25 and slope=0.75 University of Tehran - Database Research Group

  24. Our experiment results • Comparison of vector space model and LCA: • In LCA we used Lnu.ltu (slope=0.25 and P.U.N.) University of Tehran - Database Research Group

  25. N-Gram Experiments • Next, we assessed N-gram based vector space model for N = 3,4,5 on the HAMSHAHRI collection. • In addition to Lnu.ltu we assessed atc.atc in which both query and documents are weighted as follows: atc = University of Tehran - Database Research Group

  26. N-Gram experiment results • N-Grams using atc.atc and lnu.ltu (slope=0.25) weighting schemes University of Tehran - Database Research Group

  27. Conclusion Previous Works: Comparison of Vector Space System with FuFaIR They used the first version of HAMSHAHRI collection (300+ MB) in Their experiments. It has 30 Queries In vector space model the Slope set to 0.75 and the Pivot set to 13.36 University of Tehran - Database Research Group

  28. Conclusion Comparison of vector space systems with BM25 University of Tehran - Database Research Group

  29. Conclusion Comparison of Best Vector Space With Best N-grams Source: F. Oroumchian, F. Mazhar Garamaleki. “An Evaluation of Retrieval performance Using Farsi Text”. First Eurasia Conference on Advances in Information and Communication Technology, Tehran, Iran, October 2002. Experiments on Qavanin Collection University of Tehran - Database Research Group

  30. Our experiment best results • Experiments using atc.atc and lnu.ltu (slope=0.25) weighting schemes University of Tehran - Database Research Group

  31. Conclusion Results Analysis (N-Gram) • AS It was shown, 4-gram based vector space with Lnu.ltu weighting scheme has better performance than FuFaIR and other vector space models: • It is in contradiction with the performance of them in English. • The rational is that most Farsi words' roots are about 4 characters. • Our results are more valid than previous works because we used a better collection University of Tehran - Database Research Group

  32. Conclusion Results Analysis (LCA) • Local Context Analysis only marginally improved the results over the Lnu.ltu method • Lnu.ltu weighting method is performing very well on the Farsi language • It’s better to tune LCA parameters for the HAMSHAHRI collection University of Tehran - Database Research Group

  33. Thanks, Questions ? http://ece.ut.ac.ir/dbrg University of Tehran - Database Research Group

More Related