This report presents joint work on cross-language retrieval between Hungarian and English using Wikipedia. The approach combines term-by-term query translation, a bigram language model that selects the most probable English translation, and Wikipedia, which is used to discard off-topic terms. The algorithm requires some preparation (constructing a dictionary and generating a concept network from Wikipedia), then maps terms to the concept space, ranks the concepts, and maps them back to words. The report also discusses the role of Wikipedia in concept mapping, the difficulties encountered, and planned improvements.
Performing Cross-Language Retrieval with Wikipedia. Participation report for Ad Hoc bilingual Hungarian → English. Péter Schönhofen, joint work with András A. Benczúr, István Bíró, Károly Csalogány. Data Mining and Web Search Group, Computer and Automation Research Institute, Hungarian Academy of Sciences.
Our Approach • Term-by-term query translation by dictionaries • Bigram language model helps select the most probable English translation • Using Wikipedia to discard off-topic terms. IR System: Hungarian Academy of Sciences Search Engine (http://search.sztaki.hu) • TF×IDF-based • OR query, heavily weighted by the number of matched terms • Also takes proximity and term location into account. Only the query title is used for retrieval; the description and narrative contribute only to mapping the title to Wikipedia concepts
Outline of the algorithm • Preparations • construct a dictionary • generate concept network from Wikipedia • pre-process queries and documents • Raw translation • disambiguation with bigram model • Improve translation quality with Wikipedia • map terms to concept space • rank concepts • map concepts to words
Dictionary Construction • Two sources of Hungarian-English term pairs: • On-line dictionary of the Institute (official + community-edited entries) • cross-language links present in Wikipedia • Conflicting entries are resolved in the above order of priority (official, community, Wikipedia) • 100,510 dictionary entries in total (however, a large portion of them are idioms)
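The priority rule above can be illustrated with a minimal Python sketch. It assumes that a "conflict" means the same Hungarian term appearing in more than one source, and that the highest-priority source then wins outright; the slide does not spell this out. The function name `merge_dictionaries`, the source labels and the toy entries are all hypothetical.

```python
from typing import Dict, List

# Priority order taken from the slide: official entries first, then
# community-edited entries, then Wikipedia cross-language links.
PRIORITY = ("official", "community", "wikipedia")


def merge_dictionaries(sources: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge per-source dictionaries (Hungarian term -> English translations).

    Assumption: when a Hungarian term occurs in several sources, only the
    translations from the highest-priority source are kept.
    """
    merged: Dict[str, List[str]] = {}
    for name in PRIORITY:
        for hungarian, english in sources.get(name, {}).items():
            # setdefault keeps the entry written by an earlier (higher-priority) source.
            merged.setdefault(hungarian, list(english))
    return merged


# Toy data for illustration only, not taken from the real dictionary.
toy_sources = {
    "official": {"rák": ["cancer"]},
    "community": {"rák": ["crab"], "kutatás": ["research"]},
    "wikipedia": {"kutatás": ["Research"], "onkológia": ["oncology"]},
}
merged = merge_dictionaries(toy_sources)
```

Keeping the winning source's translations as a list leaves later disambiguation free to choose among several candidates for the same term.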
Raw translation • Find Hungarian dictionary terms in queries • Hungarian terms may overlap • Select best translations based on bigram model • a translation is better if it joins to other translations through bigrams with higher probability • the bigram model was built from Wikipedia, but any other large corpus would suffice [Figure: each Hungarian word in the query has several translation candidates; candidates are scored by the bigram model and the maximum is selected]
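A minimal sketch of bigram-based candidate selection follows. It is not the authors' implementation: it assumes single-word translation candidates, add-one smoothing, and an exhaustive search over all candidate combinations, none of which is stated on the slide. The names `train_bigrams`, `bigram_logprob` and `best_translation` and the three-sentence corpus are hypothetical.

```python
import itertools
import math
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and adjacent word pairs in a (toy) English corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """log P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    vocab_size = len(unigrams) or 1
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def best_translation(candidates, unigrams, bigrams):
    """Pick one candidate per query term so that adjacent picks form likely bigrams.

    candidates: one list of English candidates per Hungarian term, in query order.
    Exhaustive search is used because query titles are short.
    """
    best, best_score = None, -math.inf
    for combo in itertools.product(*candidates):
        score = sum(
            bigram_logprob(a, b, unigrams, bigrams)
            for a, b in zip(combo, combo[1:])
        )
        if score > best_score:
            best, best_score = list(combo), score
    return best


# Toy corpus standing in for the Wikipedia text.
corpus = ["cancer research is funded", "crab fishing season", "research on cancer"]
unigrams, bigrams = train_bigrams(corpus)
choice = best_translation([["crab", "cancer"], ["research"]], unigrams, bigrams)
```

For longer queries the exhaustive product could be replaced by dynamic programming over adjacent terms, which finds the same maximum without enumerating every combination.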
Concept network • Regular Wikipedia articles represent concepts • article title is concept name • links to other articles describe semantic relations • redirections are handled as additional concept names (sort of synonyms) • Category assignments are ignored • Wikipedia is in fact converted to an ontology • less formal than a proper ontology (e.g. WordNet) • only one type of relationship exists
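The construction described above might look roughly like the following Python sketch. The input format (a mapping from article title to outgoing link targets, plus a mapping from redirect title to target article) is an assumption, as is treating links as undirected; the slide only says that a single relationship type exists. `build_concept_network` is a hypothetical name.

```python
from collections import defaultdict


def build_concept_network(articles, redirects):
    """Build a concept graph and a name index from Wikipedia-like data.

    articles:  article title -> list of titles it links to
    redirects: redirect title -> target article title
    Returns (neighbours, names):
      neighbours: concept -> set of related concepts
      names:      lower-cased name -> set of concepts it may denote
    """
    neighbours = defaultdict(set)
    for title, links in articles.items():
        neighbours[title]  # make sure isolated articles still appear as concepts
        for target in links:
            target = redirects.get(target, target)  # resolve links that point at redirects
            if target in articles and target != title:
                # Assumption: the single relation type is treated as symmetric.
                neighbours[title].add(target)
                neighbours[target].add(title)

    names = defaultdict(set)
    for title in articles:
        names[title.lower()].add(title)
    for alias, target in redirects.items():
        if target in articles:
            names[alias.lower()].add(target)  # redirects act as extra concept names
    return dict(neighbours), dict(names)


# Toy data for illustration only.
toy_articles = {"Oncology": ["Cancer", "Tumour"], "Cancer": [], "Chemotherapy": ["Cancer"]}
toy_redirects = {"Tumour": "Cancer", "Malignancy": "Cancer"}
neighbours, names = build_concept_network(toy_articles, toy_redirects)
```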
Map terms to concepts • Match Wikipedia article titles with query terms • Concepts behind Wikipedia article titles: • the same title may represent multiple concepts • another layer of disambiguation is introduced • Concepts are recognized through terms, and are carried by text locations occupied by the term
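As a rough illustration of this step, the sketch below matches word n-grams of the text against a name index and records which token positions ("text locations") carry each candidate concept. The index format, the `max_len` limit and the function name `map_terms_to_concepts` are assumptions made for the example.

```python
def map_terms_to_concepts(tokens, names, max_len=3):
    """Return concept -> set of token positions that carry it.

    tokens: the (translated) text as a list of words
    names:  lower-cased article title or redirect -> set of concepts
    An ambiguous title yields several candidate concepts for the same
    positions; choosing among them is left to the ranking step.
    """
    carried = {}
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            phrase = " ".join(tokens[start:end]).lower()
            for concept in names.get(phrase, ()):
                carried.setdefault(concept, set()).update(range(start, end))
    return carried


# Toy index: "cancer" is ambiguous between two concepts.
toy_names = {
    "cancer": {"Cancer", "Cancer (constellation)"},
    "cancer research": {"Cancer research"},
}
carried = map_terms_to_concepts(["cancer", "research", "funding"], toy_names)
```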
Rank concepts • Select concepts which are the most tightly connected to other candidate concepts • Score of concept C computed from three factors: • L: # text locations carrying concepts semantically related to C; • M: # concepts carried by the same text locations as C; • F: # text locations carrying C
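The three factors can be computed as in the sketch below. The slide does not say how L, M and F are combined, so the formula `(1 + L) * F / M` is purely a placeholder assumption: it rewards related context and frequency and penalises ambiguity. Counting C itself in M (so that M is at least 1) is also an assumption, and the function names are hypothetical.

```python
def concept_factors(concept, carried, neighbours):
    """Compute (L, M, F) for one candidate concept.

    carried:    concept -> set of text locations carrying it
    neighbours: concept -> set of semantically related concepts
    """
    locations = carried[concept]
    related = neighbours.get(concept, set())
    related_locations = set()
    for other, locs in carried.items():
        if other != concept and other in related:
            related_locations |= locs
    L = len(related_locations)
    # M counts every concept sharing a location with C, including C itself.
    M = sum(1 for locs in carried.values() if locs & locations)
    F = len(locations)
    return L, M, F


def rank_concepts(carried, neighbours):
    """Rank concepts by a placeholder score; the real combination is not given."""
    scores = {}
    for concept in carried:
        L, M, F = concept_factors(concept, carried, neighbours)
        scores[concept] = (1 + L) * F / M  # hypothetical formula
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy data for illustration only.
toy_carried = {"Cancer": {0}, "Cancer (constellation)": {0}, "Oncology": {2}}
toy_neighbours = {"Cancer": {"Oncology"}, "Oncology": {"Cancer"}}
ranking = rank_concepts(toy_carried, toy_neighbours)
```

In this toy case the constellation sense of "cancer" ranks last because no related concept appears elsewhere in the text.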
Map concepts to words • Concepts → titles (word sequences); pasting titles together would yield overly long queries • Titles → set of words • Words are ranked based on the scores of the concepts behind them; the same word may represent many concepts • Query title words are required: if all translations of a title word are discarded, the word is forcefully injected into the translated query
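A minimal sketch of this final step is given below. Several details are assumptions not stated on the slide: a word's score is the sum of the scores of the concepts whose titles contain it, only the top `max_words` words are kept, and "forceful injection" adds the first raw translation of a title word whose translations were all dropped. The name `concepts_to_query` is hypothetical.

```python
def concepts_to_query(concept_scores, title_translations, max_words=5):
    """Turn scored concepts into a bag-of-words query.

    concept_scores:     concept title -> score
    title_translations: for each query title word, its list of raw translations
    """
    word_scores = {}
    for title, score in concept_scores.items():
        # Split the title into words, dropping disambiguation parentheses.
        cleaned = title.lower().replace("(", " ").replace(")", " ")
        for word in cleaned.split():
            # Assumption: a word inherits the summed score of its concepts.
            word_scores[word] = word_scores.get(word, 0.0) + score

    ranked = sorted(word_scores, key=lambda w: (-word_scores[w], w))
    query = ranked[:max_words]

    # Title words are mandatory: re-insert one translation if none survived.
    for translations in title_translations:
        if translations and not any(t in query for t in translations):
            query.append(translations[0])
    return query


# Toy scores for illustration only.
query = concepts_to_query(
    {"Oncology": 2.0, "Cancer treatment": 1.0},
    title_translations=[["cancer"], ["research"]],
    max_words=3,
)
```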
Why use Wikipedia? • Advantages • freely available (snapshots are downloadable) • relatively high-quality • wide range of subjects covered • rapidly growing, up-to-date • Disadvantages • articles do not always link to other relevant articles • category assignments are not always consistent • basic verbs and nouns are not covered
Example query • Original query title: “cancer research” • Raw translation: “oncology” • Improved translation: “oncology cancer treatment”
Difficulties • Hungarian stemmer is not perfect • language is complex • pronouns not always recognized as such • Dictionary is small • In short: raw translation is of very low quality • Retrieval is not performed on the concept level • Context is not large enough to support the reliable selection of relevant Wikipedia concepts
Future work • Running German queries against English corpora • Richer dictionary • Improved mechanism • raw translation is used for retrieval • Wikipedia concept network is used for determining the relevance of documents in hit lists: query-document matching is carried out in the space of Wikipedia concepts • Improved matching • POS information is also taken into account