SINAI-GIR: A Multilingual Geographical IR System
José Manuel Perea Ortega
Computer Science Department, University of Jaén (Spain)
CLEF 2008, 18 September, Aarhus (Denmark)
Introduction
• Preliminary work of SINAI in GeoCLEF:
  • 2006: query expansion using gazetteers and a thesaurus [García-Vega et al., 2007]
  • 2007: filtering documents based on manual rules [Perea-Ortega et al., 2007]
• GeoCLEF 2008:
  • Filtering documents using new manual rules and new approaches (query reformulation, keyword and hyponym extraction, query geo-expansion)
GeoCLEF 2008, Aarhus
SINAI-GIR System overview
[Figure: system architecture diagram. Labels: English Query (Q); Multilingual Query; Translator; Query Analyzer (GeoNames); queries Q1, Q2, Q3; keywords and geo-information extracted; IR Subsystem; documents retrieved; Validator (GeoNames); English collection; Collection Preprocessing subsystem; final re-ranked documents retrieved.]
SINAI-GIR System overview: Translator
• Translates the queries from other languages into English
• We have used SINTRAM (SINai TRAnslation Module) [García-Cumbreras et al., 2007]
• It works with different online machine translators
SINAI-GIR System overview: Collection Preprocessing subsystem
• Preprocessing: stemming, stopword removal, POS tagging
• The toponyms are extracted (NER)
• Two indexes are generated:
  • Locations
  • Keywords
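The two-index idea can be sketched as follows. This is only a minimal illustration, not the SINAI-GIR implementation: the toponym set stands in for a real NER step, the stopword list is a toy one, and stemming and POS tagging are omitted. All names (`TOPONYMS`, `STOPWORDS`, `build_indexes`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: a real system would detect toponyms with NER.
TOPONYMS = {"jaén", "aarhus", "spain", "denmark"}
STOPWORDS = {"the", "in", "of", "a", "and", "is"}


def build_indexes(docs):
    """Return (keyword_index, location_index), each mapping term -> set of doc ids."""
    keyword_index = defaultdict(set)
    location_index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;:()")
            if not token or token in STOPWORDS:
                continue
            # Toponyms go to the locations index, everything else to keywords.
            if token in TOPONYMS:
                location_index[token].add(doc_id)
            else:
                keyword_index[token].add(doc_id)
    return keyword_index, location_index


docs = {
    "d1": "Floods in Spain and Denmark.",
    "d2": "The university of Jaén is in Spain.",
}
keyword_index, location_index = build_indexes(docs)
```

Keeping locations apart from keywords lets a later step check the geographical content of a document without touching its thematic terms.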
SINAI-GIR System overview: Query Analyzer
• Query preprocessing: stemming, stopword removal, removal of irrelevant information
• The toponyms are extracted (NER)
• Spatial relations finder based on manual rules
• Query reformulation based on POS tagging and the query parsing subtask
• Geo-expansion using a gazetteer
• Keyword/hyponym detection
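Geo-expansion with a gazetteer might look roughly like the sketch below. It assumes a toy in-memory gazetteer that maps a region to places it contains; the slides name GeoNames as the resource but do not describe how it is queried, so `GAZETTEER` and `geo_expand` are hypothetical.

```python
# Hypothetical toy gazetteer: region -> places it contains.
GAZETTEER = {
    "andalusia": ["jaén", "seville", "granada"],
    "denmark": ["aarhus", "copenhagen"],
}


def geo_expand(keywords, toponyms, gazetteer=GAZETTEER):
    """Append gazetteer entries for each query toponym, preserving order, without duplicates."""
    expanded = list(keywords) + list(toponyms)
    for toponym in toponyms:
        for place in gazetteer.get(toponym, []):
            if place not in expanded:
                expanded.append(place)
    return expanded


query = geo_expand(["wine", "production"], ["andalusia"])
```

A toponym missing from the gazetteer is simply left unexpanded, so the query never shrinks.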
SINAI-GIR System overview: IR Subsystem
• Lemur as the indexing and search engine
• Okapi with PRF (pseudo-relevance feedback) as the weighting function
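For readers unfamiliar with Okapi weighting, here is a minimal BM25 sketch. It is not Lemur's implementation and it leaves out the PRF step; the IDF variant (with +1 inside the logarithm) and the defaults `k1=1.2`, `b=0.75` are common choices assumed here, not values taken from the slides.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenized docs (doc_id -> list of tokens) against query_terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    "d1": ["flood", "spain", "river"],
    "d2": ["election", "denmark"],
    "d3": ["flood", "flood", "denmark", "coast"],
}
scores = bm25_scores(["flood", "denmark"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy collection the only document containing both query terms, `d3`, ranks first.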
SINAI-GIR System overview: Validator
• Filters the list of documents retrieved by the IR subsystem, applying different manual rules and using the geographical data detected in the query
• Re-ranks the documents using predefined weights for each rule and the keywords/hyponyms detected in the query
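The filter-then-re-rank step could be sketched as below. The slides do not list the manual rules or their weights, so everything here is an assumption for illustration: a single rule (the document must mention at least one query toponym), a fixed bonus `rule_weight` for satisfying it, and a bonus `keyword_weight` per query keyword found in the document.

```python
def filter_and_rerank(retrieved, doc_locations, doc_keywords,
                      query_toponyms, query_keywords,
                      rule_weight=0.5, keyword_weight=0.1):
    """retrieved: list of (doc_id, ir_score). Returns a re-ranked list of (doc_id, score)."""
    reranked = []
    for doc_id, ir_score in retrieved:
        matched = doc_locations.get(doc_id, set()) & set(query_toponyms)
        if not matched:
            continue  # hypothetical rule: drop documents with no query toponym
        hits = len(doc_keywords.get(doc_id, set()) & set(query_keywords))
        reranked.append((doc_id, ir_score + rule_weight + keyword_weight * hits))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


retrieved = [("d1", 2.0), ("d2", 1.8), ("d3", 1.7)]
doc_locations = {"d1": {"france"}, "d2": {"spain"}, "d3": {"spain", "jaén"}}
doc_keywords = {"d2": {"flood"}, "d3": {"flood", "river", "damage"}}
result = filter_and_rerank(retrieved, doc_locations, doc_keywords,
                           {"spain"}, {"flood", "river", "damage"})
```

Here `d1` is filtered out despite having the highest IR score, and `d3` overtakes `d2` because it matches more query keywords.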
Experiments description
• SINAI participated in the monolingual and bilingual tasks with a total of 15 experiments:
  • MONO-EN: 9 experiments
  • BILI-X2EN: 6 experiments
• Topic fields combined: TD (title and description) or TDN (title, description and narrative)
• Baseline: Q1 without applying any filtering or re-ranking process
• Other experiments:
  • Filtering and re-ranking of the fused list of documents retrieved by Q1, Q2 and Q3
  • Using keywords and/or hyponyms in the re-ranking process
MONO-EN results
• Best results using only the TD topic labels
• Best result: baseline (no filtering and no re-ranking)
• In some filtering experiments the use of keywords improves the results
BILI-X2EN results
• Best result: baseline (no filtering and no re-ranking) with Portuguese topics
• Best results using only the TD topic labels
Conclusions
• The baseline experiment seems to work well because we include the geo-information in the retrieval process
• The filtering of documents does not seem to work well because the geo-information is already included in the query, and we are re-ranking documents which may not be relevant with respect to their content
• The use of keywords for re-ranking the retrieved documents could be interesting because in some experiments it improves on the results obtained without them
• Query reformulation could also be interesting because for some topics it retrieves valid documents which are not retrieved with the default query
TextMESS at GeoCLEF 2008
• Spanish TextMESS project (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies): joint participation by the Polytechnic University of Valencia and the University of Jaén (SINAI)
• Method employed: a merging algorithm based on a fuzzy Borda voting scheme, taking as input the document lists returned by the two systems
• Second best result in the monolingual English task
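To give a feel for Borda-style merging of two result lists, the sketch below uses the classic (crisp) Borda count. It is not the fuzzy Borda scheme used in the TextMESS runs, which works with graded preferences; the function name `borda_merge` and the tie-breaking by document id are assumptions made for the example.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked lists of doc ids with a classic Borda count."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, doc_id in enumerate(ranking):
            points[doc_id] += n - position  # the top rank earns n points
    # Sort by points (descending), then by doc id for a deterministic order.
    return sorted(points, key=lambda d: (-points[d], d))


merged = borda_merge([["d1", "d2", "d3"], ["d2", "d3", "d4"]])
```

Documents returned by both systems accumulate points from each list, so `d2` moves to the top of the merged ranking.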
Thank you
sinai.ujaen.es
References
• García-Vega, Manuel; García-Cumbreras, Miguel A.; Ureña-López, L. Alfonso; Perea-Ortega, José M. GEOUJA System. The first participation of the University of Jaén at GEOCLEF 2006. In LNCS, volume 4730, pages 913-917. Springer-Verlag, 2007.
• Perea-Ortega, José M.; García-Cumbreras, Miguel A.; García-Vega, Manuel; Montejo-Ráez, Arturo. GEOUJA System. University of Jaén at GEOCLEF 2007. In Proceedings of the Cross Language Evaluation Forum (CLEF 2007), page 52, 2007.
• García-Cumbreras, Miguel A.; Ureña-López, L. Alfonso; Martínez-Santiago, Fernando; Perea-Ortega, José M. BRUJA System. The University of Jaén at the Spanish task of QA@CLEF 2006. In LNCS, volume 4730, pages 328-338. Springer-Verlag, 2007.
http://sinai.ujaen.es