The Domain-Specific Track at CLEF 2007
Vivien Petras, Stefan Baerisch & Max Stempfhuber
GESIS Social Science Information Centre, Bonn, Germany
Budapest, September 19, 2007
Outline
• The Domain-Specific Task
• Collections & Controlled Vocabularies
• Topics
• Participants, Runs & Relevance Assessments
• Themes
• Summary & Outlook
The Domain-Specific Task
• Cross-language information retrieval (CLIR) on structured scientific document collections:
  • social science domain
  • bibliographic metadata
  • controlled vocabularies for subject description
• Leverage bibliographic metadata & controlled vocabularies for:
  • search
  • translation
The Domain-Specific Task
• Tasks:
  • Monolingual against German, English or Russian
  • Bilingual against German, English or Russian
  • Multilingual against combined collection
Controlled Vocabularies
• 5 different subject-describing terminologies:
  • Thesaurus for the Social Sciences (GIRT-DE, -EN)
  • Thesaurus of Sociological Indexing Terms (CSA-SA)
  • INION Thesaurus (ISISS)
  • Social Sciences Classification (GIRT-DE, -EN)
  • Sociological Abstracts Classification (CSA-SA)
Controlled Vocabularies – Mapping Tools
• Translation:
  • GIRT German – GIRT English
• Intellectual term mappings (cross-walks):
  • equivalent terms in vocabularies
  • GIRT German – CSA-SA English
  • GIRT English – CSA-SA English
  • Example:
    • original-term: agricultural area
    • mapped-term: Rural areas
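A cross-walk of this kind can be thought of as a lookup table from a term in one vocabulary to its equivalent terms in another. The following is a minimal sketch, not the format of the distributed mapping files: the table layout, the vocabulary labels `GIRT-EN` and `CSA-SA`, and the function name `map_term` are assumptions for illustration. Only the single entry (agricultural area to Rural areas) comes from the slide, and its assignment to the GIRT English to CSA-SA mapping is inferred from the slide order.

```python
from typing import Dict, List, Tuple

# Hypothetical cross-walk table: (source vocabulary, lower-cased source term)
# -> list of mapped terms in the target vocabulary (here assumed to be CSA-SA).
# Only this one entry is taken from the slide; the layout is an assumption.
CROSSWALK: Dict[Tuple[str, str], List[str]] = {
    ("GIRT-EN", "agricultural area"): ["Rural areas"],
}


def map_term(vocab: str, term: str) -> List[str]:
    """Return the mapped terms for a source term, or an empty list if unmapped."""
    key = (vocab, term.strip().lower())
    return CROSSWALK.get(key, [])


if __name__ == "__main__":
    print(map_term("GIRT-EN", "Agricultural area"))
    print(map_term("GIRT-EN", "some unmapped term"))
```

Returning a list rather than a single string leaves room for one-to-many mappings, which a real cross-walk may or may not contain.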
Topics
• 25 topics in standard TREC format (title, desc, narr):
  • 15 volunteers (social scientists)
  • 2-5 suggestions from 28 subject specialties
  • checked for:
    • coverage in collections
    • variance from previous years
  • translated into English, Russian
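To make the three topic fields concrete, here is a small sketch of reading title, desc and narr from one topic. The exact tag layout (paired opening and closing tags inside a `<top>` element) is an assumption, and the sample topic contains placeholder text only; it is not one of the 25 track topics.

```python
import re
from typing import Dict

# The three fields named on the slide.
FIELDS = ("title", "desc", "narr")

# Placeholder topic for illustration only; the tag layout is assumed.
SAMPLE_TOPIC = """<top>
<title> placeholder title </title>
<desc> placeholder description </desc>
<narr> placeholder narrative </narr>
</top>"""


def parse_topic(raw: str) -> Dict[str, str]:
    """Extract title, desc and narr from one topic; missing fields become ''."""
    topic: Dict[str, str] = {}
    for field in FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", raw, flags=re.DOTALL)
        # Collapse internal whitespace so multi-line fields become one line.
        topic[field] = " ".join(match.group(1).split()) if match else ""
    return topic


if __name__ == "__main__":
    print(parse_topic(SAMPLE_TOPIC))
```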
Participants
• 5 groups
Relevance Assessments
• All assessments were done with the Univ. of Padova's DIRECT system.
• In the Russian collection: 3 topics without relevant documents
Themes – Retrieval Models
• Lucene
• Language Modelling
• Logistic Regression
• Comparison: Vector Space, LM, Probabilistic (Okapi, DFR)
• Data fusion
• Russian
  • word-based vs. N-gram retrieval
  • new light-weight stemmer
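The difference between word-based and N-gram indexing can be illustrated with a short sketch. It assumes whitespace tokenization, lower-casing, and character n-grams taken within word boundaries with n = 4; the participants' actual settings are not given on the slide, and the function names are illustrative.

```python
from typing import List


def word_tokens(text: str) -> List[str]:
    """Word-based indexing units: lower-cased, whitespace-separated tokens."""
    return text.lower().split()


def char_ngrams(text: str, n: int = 4) -> List[str]:
    """Character n-grams within each word; words of length <= n are kept whole."""
    grams: List[str] = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            # Slide a window of width n across the word.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


if __name__ == "__main__":
    print(word_tokens("Social Science"))
    print(char_ngrams("Social Science", n=4))
```

Overlapping n-grams let inflected forms of the same word share index terms without a stemmer, which is one reason the approach is often tried for morphologically rich languages such as Russian.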
Themes – Query Expansion
• Entry Vocabulary Modules
  • query terms associated with thesaurus terms from documents
• Thesaurus Lookup
  • combined thesaurus from all CVs
  • GIRT Thesaurus Index
• Lexical Entailment
  • find document terms in relation to query terms
• Blind Feedback
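As an illustration of the last technique, blind (pseudo-relevance) feedback can be sketched as follows: assume the top-ranked documents of an initial run are relevant, then add their most frequent new terms to the query. This is a simplified sketch using raw term counts; the term weighting, the number of feedback documents and the number of added terms used by the participants are not stated on the slide, and all names and defaults here are assumptions.

```python
from collections import Counter
from typing import List


def blind_feedback(
    query: List[str],
    ranked_docs: List[List[str]],
    top_docs: int = 2,
    new_terms: int = 2,
) -> List[str]:
    """Expand a query with the most frequent terms from the top-ranked documents."""
    counts: Counter = Counter()
    for doc in ranked_docs[:top_docs]:
        # Count only terms that are not already part of the query.
        counts.update(term for term in doc if term not in query)
    expansion = [term for term, _ in counts.most_common(new_terms)]
    return query + expansion


if __name__ == "__main__":
    # Toy tokenized documents, ordered by an assumed initial ranking.
    docs = [
        ["rural", "youth", "migration"],
        ["rural", "migration", "policy"],
        ["unrelated"],
    ]
    print(blind_feedback(["rural"], docs, top_docs=2, new_terms=1))
```

In practice the candidate terms are usually weighted (for example by a relevance-weighting formula) rather than ranked by raw frequency; raw counts keep the sketch short.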
Themes – Translation
• Lucene plug-in
• Babelfish, Google, PROMT, Reverso
• Bilingual thesaurus mapping
• Dictionary adaptation
  • disambiguate term translation given language context of feedback documents
• Statistical machine translation
  • MATRAX
• Commercial Software
Summary & Outlook
• Extension of Russian materials
  • Translation table DE-EN-RU for GIRT Thesaurus
  • Translation table RU-EN for INION Thesaurus
  • Mapping between GIRT and INION Thesaurus
• More tools for Terminology mapping
  • different relationships (0T, SYN, BT, NT, RT)
  • GESIS-IZ project: > 40 mappings
  • 25 controlled vocabularies / 11 disciplines
  • ~ 125,000 terms & phrases
  • ~ 400,000 relations
Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds_2007.htm
Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm
Email: vivien.petras@gesis.org