
SINAI-GIR A Multilingual Geographical IR System

José Manuel Perea Ortega. Computer Science Department, University of Jaén (Spain). CLEF 2008, 18 September, Aarhus (Denmark).



Presentation Transcript

  1. SINAI-GIR: A Multilingual Geographical IR System. José Manuel Perea Ortega, Computer Science Department, University of Jaén (Spain). CLEF 2008, 18 September, Aarhus (Denmark)

  2. Introduction • Preliminary work of SINAI in GeoCLEF: • 2006: query expansion using gazetteers and a thesaurus [García-Vega et al., 2007] • 2007: filtering documents based on manual rules [Perea-Ortega et al., 2007] • GeoCLEF 2008: • Filtering documents using new manual rules and new approaches (query reformulation, keyword and hyponym extraction, query geo-expansion) GeoCLEF 2008, Aarhus

  3. SINAI-GIR System overview • [Architecture diagram. Labels: Multilingual Query, TRANSLATOR, English Query (Q), QUERY ANALYZER, GeoNames, Q1, Q2, Q3, Keywords and geo-information extracted, English collection, Collection Preprocessing subsystem, IR Subsystem, Documents retrieved, VALIDATOR, Final Re-Ranked Documents retrieved]

  4. SINAI-GIR System overview (Translator) • Translates the queries from other languages into English • We have used SINTRAM (SINai TRAnslation Module) [García-Cumbreras et al., 2007] • It works with different online machine translators

  5. SINAI-GIR System overview (Collection Preprocessing subsystem) • Preprocessing: stemming, stopword removal, POS tagging • The toponyms are extracted (NER) • Two indexes are generated: • Locations • Keywords
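
The two-index idea on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the stopword list, the `naive_stem` placeholder, and the `TOPONYMS` set (standing in for real NER output) are hypothetical and are not taken from the SINAI-GIR system.

```python
from collections import defaultdict

# Hypothetical toy resources. A real system would use a full stopword list,
# a proper stemmer, and a NER tool instead of a fixed toponym set.
STOPWORDS = {"the", "in", "of", "a", "and", "near"}
TOPONYMS = {"aarhus", "denmark", "spain"}


def naive_stem(token: str) -> str:
    """Crude suffix stripping, only a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_indexes(docs: dict[str, str]):
    """Return (location_index, keyword_index), each mapping term -> doc ids."""
    locations = defaultdict(set)
    keywords = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            token = raw.strip(".,;:()")
            if not token or token in STOPWORDS:
                continue
            if token in TOPONYMS:
                # Toponyms go to the location index, unstemmed.
                locations[token].add(doc_id)
            else:
                # Everything else is stemmed and goes to the keyword index.
                keywords[naive_stem(token)].add(doc_id)
    return dict(locations), dict(keywords)


docs = {"d1": "Floods in Denmark", "d2": "Conferences near Aarhus, Denmark"}
location_index, keyword_index = build_indexes(docs)
```

Keeping locations apart from keywords lets a later stage compare the places mentioned in a document with the places detected in the query without re-parsing the text.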

  6. SINAI-GIR System overview (Query Analyzer) • Query preprocessing: stemming, stopword removal, removal of irrelevant information • The toponyms are extracted (NER) • Spatial relations finder based on manual rules • Query reformulation based on POS tagging and the query parsing subtask • Geo-expansion using a gazetteer • Keywords/hyponyms detection
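
As a rough illustration of a rule-based spatial relation finder combined with gazetteer geo-expansion, consider the sketch below. The regular expressions, the relation labels, and the tiny `GAZETTEER` dictionary are all hypothetical; the slides do not specify the actual rules or the way GeoNames is queried.

```python
import re

# Hypothetical manual rules: each pattern maps a surface form to a relation label.
SPATIAL_RULES = [
    (re.compile(r"\b(?:in|within|inside)\s+(?P<place>[A-Z]\w+)"), "IN"),
    (re.compile(r"\b(?:near|close to|around)\s+(?P<place>[A-Z]\w+)"), "NEAR"),
    (re.compile(r"\b(?:north|south|east|west) of\s+(?P<place>[A-Z]\w+)"), "DIRECTION_OF"),
]

# Hypothetical toy gazetteer: place -> places contained in it.
GAZETTEER = {"Denmark": ["Copenhagen", "Aarhus", "Odense"]}


def analyze_query(query: str) -> dict:
    """Split a query into thematic keywords, a spatial relation and a place."""
    for pattern, relation in SPATIAL_RULES:
        match = pattern.search(query)
        if match:
            place = match.group("place")
            # Text before the spatial expression is treated as the thematic part.
            keywords = query[: match.start()].strip()
            # In this sketch, only containment relations trigger geo-expansion.
            expansion = GAZETTEER.get(place, []) if relation == "IN" else []
            return {"keywords": keywords, "relation": relation,
                    "place": place, "expansion": expansion}
    return {"keywords": query, "relation": None, "place": None, "expansion": []}


analysis = analyze_query("Wind farms in Denmark")
```

Here `analysis["expansion"]` holds the contained places that a geo-expanded query could add to the original toponym.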

  7. SINAI-GIR System overview (IR Subsystem) • Lemur as index-search engine • Okapi with pseudo-relevance feedback (PRF) as weighting function
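
For readers unfamiliar with Okapi weighting and PRF, the following sketch shows the general technique. It is not Lemur's implementation: the BM25 variant (with the non-negative `+ 1` IDF form), the parameter values `k1` and `b`, and the frequency-based feedback term selection are simplifying assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (doc id -> list of tokens) with a BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def expand_with_prf(query_terms, docs, top_docs=1, new_terms=2):
    """Pseudo-relevance feedback: add frequent terms from the top-ranked docs."""
    scores = bm25_scores(query_terms, docs)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_docs]
    counts = Counter(t for d in ranked for t in docs[d] if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(new_terms)]


docs = {
    "d1": ["flood", "denmark", "river", "flood"],
    "d2": ["election", "spain"],
    "d3": ["river", "denmark", "bridge"],
}
scores = bm25_scores(["flood"], docs)
expanded = expand_with_prf(["flood"], docs)
```

The expanded query is then run again, so documents such as `d3` that share feedback terms with the top result can be retrieved even though they do not contain the original query term.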

  8. SINAI-GIR System overview (Validator) • Filters the list of documents recovered by the IR subsystem, applying different manual rules and using the geographical data detected in the query • Re-ranks the documents using predefined weights for each rule and the keywords/hyponyms detected in the query
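
The filter-then-re-rank step could look roughly like the sketch below. The two rules, their weights in `RULE_WEIGHTS`, the `KEYWORD_BONUS` value, and the multiplicative score update are hypothetical choices for illustration; the slides only say that each manual rule has a predefined weight.

```python
# Hypothetical rule weights and keyword bonus; the real values are not given.
RULE_WEIGHTS = {"mentions_query_place": 1.0, "mentions_expanded_place": 0.5}
KEYWORD_BONUS = 0.25


def filter_and_rerank(retrieved, doc_locations, doc_keywords,
                      query_place, expanded_places, query_keywords):
    """retrieved: list of (doc_id, ir_score). Returns the filtered, re-ranked list."""
    reranked = []
    for doc_id, ir_score in retrieved:
        places = doc_locations.get(doc_id, set())
        boost = 0.0
        if query_place in places:
            boost += RULE_WEIGHTS["mentions_query_place"]
        if places & expanded_places:
            boost += RULE_WEIGHTS["mentions_expanded_place"]
        if boost == 0.0:
            continue  # no geographic rule fired: the document is filtered out
        # Keywords/hyponyms shared with the query add a further bonus.
        shared = doc_keywords.get(doc_id, set()) & query_keywords
        boost += KEYWORD_BONUS * len(shared)
        reranked.append((doc_id, ir_score * (1.0 + boost)))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


result = filter_and_rerank(
    retrieved=[("d1", 2.0), ("d2", 1.8), ("d3", 1.5)],
    doc_locations={"d1": {"spain"}, "d2": {"aarhus"}, "d3": {"denmark", "aarhus"}},
    doc_keywords={"d2": {"wind"}, "d3": {"wind", "farm"}},
    query_place="denmark",
    expanded_places={"aarhus", "odense"},
    query_keywords={"wind", "farm"},
)
```

In this toy run `d1` is dropped despite its top IR score, which shows how a relevant document can be lost when it does not satisfy any geographic rule.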

  9. Experiments description • SINAI participated in the monolingual and bilingual tasks with a total of 15 experiments: • MONO-EN: 9 experiments • BILI-X2EN: 6 experiments • Combining the content of topic labels: TD (title and description) or TDN (title, description and narrative) • Baseline: Q1 without applying any filtering or re-ranking process • Other experiments: • Filtering and re-ranking of the fused list of documents retrieved by Q1, Q2 and Q3 • Using keywords and/or hyponyms in the re-ranking process

  10. MONO-EN results • Best results using only the TD topic labels • Best result: baseline (no filtering and no re-ranking) • In some filtering experiments the use of keywords improves the results

  11. BILI-X2EN results • Best result: baseline (no filtering and no re-ranking) with Portuguese topics • Best results using only the TD topic labels

  12. Conclusions • The baseline experiment seems to work well because we include the geo-information in the retrieval process • The filtering of documents does not seem to work well: the geo-information is already included in the query, and we are re-ranking documents which may not be relevant with respect to their content • The use of keywords for re-ranking the retrieved documents could be interesting because in some experiments it improves on the results obtained without them • Query reformulation could also be interesting because for some topics it retrieves valid documents which are not retrieved with the default query

  13. TextMESS at GeoCLEF 2008 • Spanish TextMESS project (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies): joint participation by the Polytechnic University of Valencia and the University of Jaén (SINAI) • Method employed: merging algorithm based on a fuzzy Borda voting scheme, taking as input the two document lists returned by the two systems • Second best result in the monolingual English task
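
The slide names the merging scheme but gives no formula, so the sketch below rests on an assumption: it follows the usual description of the fuzzy Borda count, in which each system expresses a pairwise preference r_ij = w_i / (w_i + w_j) from its document scores, and a document collects r_ij only for the pairs where r_ij > 0.5. Treating a document missing from a list as having weight 0 is a further assumption, and the function name `fuzzy_borda` is hypothetical.

```python
def fuzzy_borda(score_lists):
    """Merge several {doc_id: score} dicts with a fuzzy Borda count.

    Assumption: documents absent from a list get weight 0 for that list.
    """
    candidates = sorted(set().union(*score_lists))
    totals = {doc: 0.0 for doc in candidates}
    for scores in score_lists:
        for i in candidates:
            w_i = scores.get(i, 0.0)
            for j in candidates:
                if i == j:
                    continue
                w_j = scores.get(j, 0.0)
                if w_i + w_j == 0:
                    continue  # neither document was returned by this system
                r_ij = w_i / (w_i + w_j)
                if r_ij > 0.5:
                    # Only strict preferences contribute to the Borda score.
                    totals[i] += r_ij
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


system_a = {"d1": 0.9, "d2": 0.6, "d3": 0.3}
system_b = {"d2": 0.8, "d3": 0.4}
merged = fuzzy_borda([system_a, system_b])
```

In this toy input `d2` ends up first: it is preferred over other documents by both systems, whereas `d1` is returned by only one of them.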

  14. Thank you • sinai.ujaen.es

  15. References • García-Vega, Manuel and García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Perea-Ortega, José M. GEOUJA System. The first participation of the University of Jaén at GEOCLEF 2006. In LNCS, volume 4730, pages 913-917. Springer-Verlag, 2007. • Perea-Ortega, José M. and García-Cumbreras, Miguel A. and García-Vega, Manuel and Montejo-Ráez, Arturo. GEOUJA System. University of Jaén at GEOCLEF 2007. In Proceedings of the Cross Language Evaluation Forum (CLEF 2007), page 52, 2007. • García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Martínez-Santiago, Fernando and Perea-Ortega, José M. BRUJA System. The University of Jaén at the Spanish task of QA@CLEF 2006. In LNCS, volume 4730, pages 328-338. Springer-Verlag, 2007. • http://sinai.ujaen.es
