University of Tehran Database Research Group. Mono & Cross Language Experiments on Persian Text. Persian@CLEF 2008. Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian. Database Research Group, School of Electrical and Computer Engineering, University of Tehran. 18 Sep 2008.
Outline • Persian Language • Persian Test Collections • Hamshahri in CLEF 2008 • UT Participants • Using Part of Speech Tagging in Persian Information Retrieval • Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track • Local Cluster Analysis Using Part of Speech Tagging • Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text • Cross Language Experiments at Persian@CLEF 2008 • Next Year
The Persian Language • A branch of the Indo-European languages • Official language of Iran, Afghanistan and Tajikistan • Its morphological analysis is comparatively difficult • The word “خبر” has two plural forms: • Persian rules: “خبرها” • Arabic rules: “اخبار”
Some Processing Issues • Writing Style Issues: • e.g. ”می شود“ and “میشود” are the same • e.g. ”کتابها“ and ”کتاب ها“ are the same • KASRE: • e.g. چراغ علی خانه را سوزاند has two different meanings: • CheraghAli burned the house • Ali’s lantern burned the house
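The writing-style variants above can be collapsed to a single form before indexing. The following is a minimal sketch, assuming a whitespace tokenizer and two small hypothetical affix lists (the verb prefixes mi-/nemi- and the plural suffixes -ha/-haye, written as Unicode escapes). The rules and the function name `normalize` are illustrative and not taken from the slides. A standalone word spelled like one of the prefixes would be joined incorrectly, so a real system would need a lexicon check.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, often typed instead of a space

# Hypothetical affix lists (assumptions, not from the slides).
PREFIXES = {"\u0645\u06cc", "\u0646\u0645\u06cc"}  # mi-, nemi-
SUFFIXES = {"\u0647\u0627", "\u0647\u0627\u06cc"}  # -ha, -haye


def normalize(text):
    """Join detached verb prefixes and plural suffixes to their stems."""
    # Treat ZWNJ like a space so both spellings tokenize the same way.
    tokens = text.replace(ZWNJ, " ").split()
    out = []
    join_next = False
    for tok in tokens:
        if join_next:
            # Previous token was a detached prefix: glue this stem to it.
            out[-1] += tok
            join_next = False
        elif tok in SUFFIXES and out:
            # Detached plural suffix: glue it to the preceding stem.
            out[-1] += tok
        else:
            out.append(tok)
            join_next = tok in PREFIXES
    return " ".join(out)


spaced = "\u0645\u06cc \u0634\u0648\u062f"  # "mi shavad" written with a space
joined = "\u0645\u06cc\u0634\u0648\u062f"   # the same word written without a space
print(normalize(spaced) == normalize(joined))
```

Joining to the closed-up form is only one possible convention; normalizing to the ZWNJ-separated form would work equally well as long as documents and queries go through the same function.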
Some Processing Issues • Encoding
Persian in the Middle East • [Figure: user population growth on the Web (2000-2008), data as of December 31, 2007] • Source: Internet World Statistics, http://internetworldstats.com/
Persian Test Collections • IR Domain • Ghavanin (domain specific) • Hamshahri (news) WEB: http://ece.ut.ac.ir/dbrg/hamshahri • NLP Domain • Bijankhan (2 million words) WEB: http://ece.ut.ac.ir/dbrg/bijankhan
Hamshahri in CLEF 2008 • News articles of Hamshahri newspaper from year 1996 to 2002 • Size of the documents varies from short news (under 1 KB) to rather long articles (e.g. 140 KB) • 22 assessors • Evaluation based on DIRECT System
Implementation of our methods • We submitted the top 100 results for each run
Using Part of Speech Tagging in Persian Information Retrieval • Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track • Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian • Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track • And two other variations of this operator: IOWA and NOWA
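The operator that this slide refers to is not named in the surviving text. Assuming it is the Ordered Weighted Averaging (OWA) operator, from which IOWA (induced OWA) takes its name, a fusion step could look roughly like the sketch below. The function names, the weight vector, and the choice to score missing documents as 0 are all assumptions for illustration, not the authors' configuration; the IOWA and NOWA variants are not shown.

```python
def owa(scores, weights):
    """Ordered weighted average: weights attach to rank positions, not to sources."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ranked = sorted(scores, reverse=True)  # largest score gets the first weight
    return sum(w * s for w, s in zip(weights, ranked))


def fuse(runs, weights):
    """Fuse several runs (dicts of doc_id -> normalized score) into one ranking.

    Assumption: a document missing from a run contributes a score of 0.
    """
    docs = set().union(*runs)
    fused = {d: owa([run.get(d, 0.0) for run in runs], weights) for d in docs}
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy scores from three hypothetical retrieval models.
runs = [
    {"d1": 0.9, "d2": 0.4},
    {"d1": 0.2, "d2": 0.8},
    {"d2": 0.6, "d3": 0.5},
]
print(fuse(runs, [0.5, 0.3, 0.2]))
```

Because the weights follow the sorted order, a weight vector that front-loads mass behaves like a soft maximum over the models, while uniform weights reduce to a plain average of the scores.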
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track • Post hoc Results
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text • Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text • However, the results were not good on the test set
Cross Language Experiments at Persian@CLEF 2008 • Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
Cross Language Experiments at Persian@CLEF 2008 • Query Translation • Probabilistic Structured Queries (PSQ) • Combinatorial Translation Probability (CTP)
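As a rough illustration of the PSQ idea: the frequency of a source-language query term in a target-language document is estimated as the probability-weighted sum of the frequencies of its translations. The translation table, the transliterated Persian tokens, and the function name `psq_tf` below are hypothetical toy values, not data from these experiments. CTP is not sketched because the slides do not define it.

```python
from collections import Counter


def psq_tf(query_term, doc_tokens, translations):
    """Estimate the term frequency of a source-language term in a
    target-language document as sum over translations t of p(t | term) * tf(t)."""
    counts = Counter(doc_tokens)  # missing tokens count as 0
    candidates = translations.get(query_term, {})
    return sum(prob * counts[t] for t, prob in candidates.items())


# Hypothetical translation probabilities for one English query term.
translations = {"news": {"khabar": 0.7, "akhbar": 0.3}}

# A toy Persian document, transliterated for readability.
doc = ["akhbar", "khabar", "khabar", "varzesh"]

print(psq_tf("news", doc, translations))
```

The estimated frequency can then be passed to any standard ranking function in place of a monolingual term frequency, which keeps every translation alternative in play rather than committing to a single best translation.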
Cross Language Experiments at Persian@CLEF 2008 • Query Translation Results
Cross Language Experiments at Persian@CLEF 2008 • Document Translation • Using the Shiraz machine translation system from the CRL of NMSU • Took 10 days to translate 130,000+ docs from Persian to English
Cross Language Experiments at Persian@CLEF 2008 • Document Translation & Hybrid Results
Next Year • Ham2 for the Next Year • Extended Version of Hamshahri Collection • 2 times larger (~1.5 GB)
Questions? • Thanks for your attention • Database Research Group http://ece.ut.ac.ir/dbrg