Primož Jakopin, Birte Lönneker
Scientific Research Center ZRC SAZU, Ljubljana, Slovenia
Query-driven dictionary enhancement
Motivation
• Log files of online dictionaries provide direct access to the users' requests.
• Make use of them!
Dictionary author's question: What are the needs of the users of my dictionary?
11th EURALEX Congress - Query-driven dictionary enhancement
Overview
• The dictionary: Online SLO-DE-SLO
• The log file
• Use of the log file
  • to evaluate current dictionary contents
  • to choose the most promising corpus type for enlarging the dictionary
• Conclusions
Dictionary: Online SLO-DE-SLO
• Bidirectional online dictionary
  • German-Slovenian
• On the Web since 2001
• Initially a learners' dictionary
  • for German-speaking learners of Slovenian
Online SLO-DE-SLO user interface
Online SLO-DE-SLO contents
Evaluated version (October 2003)
• Textbook corpus: 5,172 entries
• Newspaper corpus: 729 entries
• Total: 5,901 entries
Current version (June 2004)
• Textbook corpus: 5,544 entries
• Newspaper corpus: 743 entries
• Technical corpus: 829 entries
• Total: 7,116 entries
Online SLO-DE-SLO entry concept
• Each entry is bilingual
• Exactly one equivalence per entry
• An entry can describe
  • a basic word form
  • an inflected word form
  • an example sentence or phrase
  • a collocation
Online SLO-DE-SLO query results
The log file
• When a user submits a query to the dictionary, a program writes data about the query into the log file, e.g.
  • Source language
  • Submitted query string
  • Selected search options (exact string match, match at beginning of word, match anywhere)
  • Time stamp
The log file: details
• Evaluation period: 6 January 2002 to 10 October 2003
• Number of queries stored in log file: 131,674
• Number of queries with exact string match: 88,879
• Only exact string match queries are evaluated
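The restriction to exact-match queries could be sketched as follows. The slides do not show the log format, so the tab-separated layout, the field names, the language codes and the mode labels below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRecord:
    source_language: str  # hypothetical codes, e.g. "de" or "sl"
    query: str            # the submitted query string
    match_mode: str       # hypothetical labels: "exact", "prefix", "anywhere"
    timestamp: str


def parse_line(line: str) -> QueryRecord:
    """Parse one log line, assuming four tab-separated fields."""
    source_language, query, match_mode, timestamp = line.rstrip("\n").split("\t")
    return QueryRecord(source_language, query, match_mode, timestamp)


def exact_match_queries(lines):
    """Keep only the records whose search option was exact string match."""
    records = (parse_line(line) for line in lines if line.strip())
    return [record for record in records if record.match_mode == "exact"]


# Invented sample lines, not taken from the real log file.
sample_log = [
    "de\tHaus\texact\t2002-01-06T10:15:00",
    "sl\thiš\tprefix\t2002-01-06T10:16:30",
    "sl\tmiza\texact\t2002-01-07T08:02:11",
]
kept = exact_match_queries(sample_log)
print([record.query for record in kept])
```

Applied to the real log, a filter of this kind would reduce the 131,674 stored queries to the 88,879 exact-match queries that are evaluated.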
The log file: preprocessing
• Preprocessing has to take into account how the dictionary matches a query against its entries
• Example 1:
  • Dictionary matching: Case insensitive (user enters A for a)
  • Preprocessing: Downcase all letters in log file (and in dictionary evaluation file)
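Example 1 can be illustrated with a minimal sketch. The sample words are invented, and the use of Python's `str.lower` is an assumption about how downcasing might be done, not the authors' implementation.

```python
def normalize(text: str) -> str:
    """Downcase a query or dictionary entry.

    str.lower is used rather than str.casefold because casefold would also
    rewrite German "ß" as "ss", which goes beyond plain downcasing.
    """
    return text.lower()


# Invented sample data.
queries = ["Haus", "HAUS", "Čaj"]
dictionary_entries = ["Haus", "čaj"]

normalized_queries = [normalize(q) for q in queries]
normalized_entries = {normalize(e) for e in dictionary_entries}

# After normalization, every sample query finds its entry.
print(all(q in normalized_entries for q in normalized_queries))
```

The same function is applied to both sides, so the evaluation compares like with like, just as the dictionary does at query time.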
The log file: preprocessing
• Example 2a:
  • Dictionary matching: Substitution of special characters for easier access (user enters ae for ä)
  • Preprocessing version I:
    • Make a second version of log file
    • Replace ae by ä in second version
    • Use spell checker word list to find valid versions
    • Check ambiguous cases manually
The log file: preprocessing
• Example 2b:
  • Dictionary matching: Substitution of special characters for easier access (user enters c for č)
  • Preprocessing version II:
    • Make a second version of log file
    • Replace c by č in second version
    • Use frequencies of parallel spellings to find valid versions
    • Check ambiguous cases manually
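One possible reading of preprocessing version II is sketched below. The slide names only c/č; the s/š and z/ž pairs, the source of the frequency counts, the counts themselves and the `ratio` threshold are assumptions, not details given by the authors.

```python
from collections import Counter
from itertools import product

# Only "c" -> "č" appears on the slide; "s" and "z" are assumed by analogy.
CARON = {"c": "č", "s": "š", "z": "ž"}


def spellings(query):
    """Return all parallel spellings of a query (2**k variants for k letters)."""
    options = [(ch, CARON[ch]) if ch in CARON else (ch,) for ch in query]
    return {"".join(choice) for choice in product(*options)}


def pick_spelling(query, freq, ratio=10.0):
    """Choose the most frequent attested spelling.

    `ratio` is a hypothetical threshold: the winner must be at least this
    many times more frequent than the runner-up, otherwise the case is
    flagged for manual checking.
    """
    attested = sorted(
        ((freq[s], s) for s in spellings(query) if freq[s] > 0), reverse=True
    )
    if not attested:
        return ("unknown", query)
    if len(attested) == 1 or attested[0][0] >= ratio * attested[1][0]:
        return ("resolved", attested[0][1])
    return ("ambiguous", attested[0][1])


# Invented frequency counts for illustration only.
freq = Counter({"čaj": 120, "caj": 2, "cena": 300, "sok": 90, "šok": 60})
for query in ["caj", "cena", "sok"]:
    print(query, pick_spelling(query, freq))
```

With these invented counts, "caj" resolves to "čaj", "cena" stays as it is, and "sok" is flagged because both "sok" and "šok" are well attested.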
The log file: preprocessing
• Users sometimes select the wrong source language (SL)
• The correct SL can be found using spell checker lists for both languages
• In our case, "spell checker lists" taken from Online SLO-DE-SLO detect
  • wrongly selected SL Slovenian: 378
  • wrongly selected SL German: 593
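The source-language check could be sketched as a lookup in both word lists. The language codes, the function name and the sample words are hypothetical; the rule of correcting only when the query is found in exactly one other list is an assumption.

```python
def check_source_language(query, selected, word_lists):
    """Return the corrected source language for a downcased query.

    word_lists maps a language code to a set of known words. The selected
    language is kept unless the query is missing from its list and found
    in exactly one other list.
    """
    if query in word_lists[selected]:
        return selected
    others = [
        lang for lang, words in word_lists.items()
        if lang != selected and query in words
    ]
    return others[0] if len(others) == 1 else selected


# Invented word lists standing in for lists derived from the dictionary.
word_lists = {
    "de": {"haus", "tisch"},
    "sl": {"hiša", "miza"},
}
print(check_source_language("miza", "de", word_lists))
print(check_source_language("haus", "de", word_lists))
```

Queries that appear in neither list keep the language the user selected, since there is no evidence for a correction.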
Evaluation I: Queries against dictionary
• Question: To what extent does the dictionary satisfy users' requests?
• Method: match preprocessed queries against downcased dictionary entries, language by language
Evaluation I
• Dictionary entries: 5,901
  • German distinct (downcased): 5,289
  • Slovenian distinct: 5,103
• Compare these lists with queries
• Result I ("tokens"):
  • German: 40.7% of queries match
  • Slovenian: 38.3% of queries match
Evaluation I
• Result II ("types")
  • types: distinct queries
  • German: 10.4% of types match
  • Slovenian: 12.7% of types match
• The well-known frequency distribution also holds in the query log file:
  • a few types occur very often and many types occur rarely
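The difference between the token and type results can be made concrete with a small sketch. The function name and the sample queries are invented; only the definitions (tokens are all queries, types are distinct queries) come from the slides.

```python
from collections import Counter


def match_rates(queries, entries):
    """Return (token match rate, type match rate) for a non-empty query list."""
    counts = Counter(queries)
    matched_tokens = sum(n for query, n in counts.items() if query in entries)
    matched_types = sum(1 for query in counts if query in entries)
    return matched_tokens / len(queries), matched_types / len(counts)


# Invented, preprocessed sample: two frequent types match, two rare ones do not.
queries = ["haus"] * 5 + ["tisch"] * 3 + ["kuss", "schatz"]
entries = {"haus", "tisch"}

token_rate, type_rate = match_rates(queries, entries)
print(f"tokens: {token_rate:.1%}, types: {type_rate:.1%}")
```

Because frequent queries are more likely to be covered, the token rate exceeds the type rate, which is the same pattern as in the reported figures.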
Evaluation I: Qualitative results
• Online SLO-DE-SLO still lacks some expressions and words used in social relations and everyday life, e.g.
  • Slovenian "top unmatched" queries: regard, offer, confirmation, cow, payment, kiss, oak, to miss, fond of, to teach, sale,...
  • German "top unmatched" queries: kiss, welcome, regard, regards, good morning, treasure, to fuck, good evening,...
Corpus-based enlargement
• Log file entries alone are not enough
• The enlargement of the dictionary should stay corpus-based, because the dictionary author wants to
  • find appropriate examples of use
  • also find collocations and idioms
  • find more words that are likely to be of interest to typical users
Evaluation II: Outline
• Which corpus should be used next to enlarge Online SLO-DE-SLO?
• Which corpus best reflects the structure of the entire vocabulary entered by the users?
• Evaluation of Slovenian queries using Slovenian corpora (subcorpora of the Nova Beseda corpus)
Evaluation II: Queries against corpora
• Evaluate corpora of three text types
  • Newspaper: 88 million
  • Fiction: 5.7 million
  • Technical: 6.3 million
Evaluation II. Method
• Compare lemmas in user queries with relative frequencies in the three corpora.
• Lemmatize Slovenian queries and assign POS
• Retain lemmatized content words and interjections: 7,246 "query lemmas"
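The selection of query lemmas might be sketched as a filter over already lemmatized and tagged queries. The slides name neither the lemmatizer nor the tag set, so the POS labels, the choice of which tags count as content words, and the sample data are assumptions.

```python
from collections import Counter

# Assumed tag set: which POS labels count as content words is a guess here;
# the slides only say "content words and interjections".
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "INTJ"}


def query_lemma_counts(tagged_queries):
    """Count queries per lemma, keeping content words and interjections.

    tagged_queries is a sequence of (query, lemma, pos) triples produced
    by some lemmatizer and tagger that is outside this sketch.
    """
    return Counter(
        lemma for _query, lemma, pos in tagged_queries if pos in CONTENT_POS
    )


# Invented sample of tagged Slovenian queries.
tagged = [
    ("hiše", "hiša", "NOUN"),
    ("hiša", "hiša", "NOUN"),
    ("in", "in", "CCONJ"),
    ("živjo", "živjo", "INTJ"),
    ("delam", "delati", "VERB"),
]
lemma_counts = query_lemma_counts(tagged)
print(lemma_counts.most_common())
```

The resulting counts give, for each query lemma, the number of queries that the weighting step then multiplies by a corpus frequency.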
Evaluation II. Method
• Lemmatize each corpus (currently with ambiguities)
• Calculate relative frequencies (per 1 million) of lemmas in the three corpora
• Assign a "weight" to each lemma: for each query lemma and corpus, multiply the number of queries by the relative frequency
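The weighting step can be written down as a short sketch. The function names and all counts are invented for illustration; only the formula (number of queries times relative frequency per million) is taken from the slide.

```python
def relative_frequency(count, corpus_size):
    """Occurrences of a lemma per one million corpus words."""
    return count * 1_000_000 / corpus_size


def lemma_weights(query_counts, corpus_counts, corpus_size):
    """Weight = number of queries for a lemma * its relative corpus frequency."""
    return {
        lemma: n_queries
        * relative_frequency(corpus_counts.get(lemma, 0), corpus_size)
        for lemma, n_queries in query_counts.items()
    }


# Invented numbers: query counts per lemma and raw counts in one corpus.
query_counts = {"hiša": 40, "računalnik": 10, "poljub": 25}
corpus_counts = {"hiša": 2_850, "računalnik": 57}
corpus_size = 5_700_000

weights = lemma_weights(query_counts, corpus_counts, corpus_size)
print(weights)
```

A lemma that users ask for but that never occurs in the corpus, such as "poljub" in this invented sample, receives a weight of zero and so contributes nothing to that corpus.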
Evaluation II. Example
• First seven lines of fiction corpus evaluation (alphabetical order)
Evaluation II. Result
• All lemma weights are summed up for each of the three corpora separately
  • Fiction: 10,262,558
  • Newspaper: 9,694,125
  • Technical: 9,369,494
• The fiction corpus reflects the user queries best
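The final comparison amounts to summing the weights per corpus and ranking the sums. In the sketch below the function names are hypothetical, while the three totals are the ones reported on this slide.

```python
def corpus_score(weights):
    """Sum all lemma weights computed for one corpus."""
    return sum(weights.values())


def rank_corpora(scores):
    """Sort corpora from highest to lowest summed weight."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Summed weights as reported on the slide.
scores = {
    "Newspaper": 9_694_125,
    "Fiction": 10_262_558,
    "Technical": 9_369_494,
}
ranking = rank_corpora(scores)
best_corpus = ranking[0][0]
print(best_corpus)
```

Because relative frequencies are used, the much larger newspaper corpus gains no advantage from its size alone.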
Evaluation II. Top twenty weights
Lemmas in at least two corpora (translated):
• to be, to have, to give, to go, good, day, house, beautiful, table, to come, light (ADJ), to know, to see, big, year, to work/to do
Top 20 weight in fiction:
• to think, to say, to look, fond of
Top 20 weight in newspaper:
• town/place; Slovenian
Top 20 weight in technical:
• computer, picture, data item
Evaluation II. Improvements and Variations
• Improvement: unambiguously lemmatized corpora (work in progress for Slovenian)
• Variation: evaluate only non-matched queries
  • Not: overall structure of all queries
  • But: overall structure of unsuccessful queries (might change after enhancements)
Conclusion
• We have shown
  • query-driven methods of evaluation for online dictionaries
  • query-driven methods for finding adequate corpora as sources for enhancing dictionaries
• Result: the example dictionary Online SLO-DE-SLO should and will be enlarged based on literary texts first
Thank you for your attention
http://www.rrz.uni-hamburg.de/slowenisch