
Query-driven dictionary enhancement

Primož Jakopin, Birte Lönneker. Scientific Research Center ZRC SAZU, Ljubljana, Slovenia. Query-driven dictionary enhancement. Motivation. Log files of online dictionaries provide direct access to the users' requests. Make use of them! Dictionary author's question:


Presentation Transcript


  1. Primož Jakopin, Birte Lönneker. Scientific Research Center ZRC SAZU, Ljubljana, Slovenia. Query-driven dictionary enhancement

  2. Motivation • Log files of online dictionaries provide direct access to the users' requests. • Make use of them! Dictionary author's question: What are the needs of the users of my dictionary? 11th EURALEX Congress - Query-driven dictionary enhancement

  3. Overview • The dictionary: Online SLO-DE-SLO • The log file • Use of the log file • to evaluate current dictionary contents • to choose the most promising corpus type for enlarging the dictionary • Conclusions

  4. Dictionary: Online SLO-DE-SLO • Bidirectional online dictionary • German-Slovenian • On the Web since 2001 • Initially a learners' dictionary for German-speaking learners of Slovenian

  5. Online SLO-DE-SLO user interface

  6. Online SLO-DE-SLO contents • Evaluated version (October 2003): Textbook corpus 5,172 entries; Newspaper corpus 729 entries; Total 5,901 entries • Current version (June 2004): Textbook corpus 5,544 entries; Newspaper corpus 743 entries; Technical corpus 829 entries; Total 7,116 entries

  7. Online SLO-DE-SLO entry concept • Each entry is bilingual • Exactly one equivalence per entry • An entry can describe • a basic word form • an inflected word form • an example sentence or phrase • a collocation

  8. Online SLO-DE-SLO query results

  9. The log file • When a user submits a query to the dictionary, a program writes data about the query into the log file, e.g. • Source language • Submitted query string • Selected search options (exact string match, match at beginning of word, match anywhere) • Time stamp
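
A log record with the fields listed on this slide might be modeled as in the sketch below. The slides do not specify the actual log format, so the tab-separated layout, the ISO time stamp, the language codes and the match-mode labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryRecord:
    source_language: str  # hypothetical codes: "sl" or "de"
    query: str            # submitted query string
    match_mode: str       # hypothetical labels: "exact", "prefix", "anywhere"
    timestamp: datetime


def parse_line(line: str) -> QueryRecord:
    """Parse one tab-separated log line (assumed layout, not the real one)."""
    lang, query, mode, stamp = line.rstrip("\n").split("\t")
    return QueryRecord(lang, query, mode, datetime.fromisoformat(stamp))


lines = [
    "sl\thiša\texact\t2002-01-06T10:15:00\n",
    "de\thaus\tprefix\t2002-01-06T10:16:30\n",
]
records = [parse_line(line) for line in lines]
record = records[0]

# Keep only the exact string match queries, as the evaluation does.
exact_records = [r for r in records if r.match_mode == "exact"]
```

Filtering on the match mode at this stage keeps the later steps simple, since only exact string match queries enter the evaluation.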

  10. The log file: details • Evaluation period: 6 January 2002 to 10 October 2003 • Number of queries stored in log file: 131,674 • Number of queries with exact string match: 88,879 • Only exact string match queries are evaluated

  11. The log file: preprocessing • Preprocessing has to take into account how the dictionary matches a query against its entries • Example 1: • Dictionary matching: Case insensitive (user enters A for a) • Preprocessing: Downcase all letters in log file (and in dictionary evaluation file)
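
The downcasing step of Example 1 can be sketched as follows. The function name and the trimming of surrounding whitespace are assumptions. `str.lower()` is used rather than `str.casefold()` so that German ß is not rewritten to ss; whether the dictionary's own matching behaves the same way is not stated in the slides.

```python
def normalize(text: str) -> str:
    """Downcase a query or dictionary entry; whitespace trimming is an assumption."""
    return text.strip().lower()


queries = ["Hiša", "HIŠA", "hiša "]
normalized = [normalize(q) for q in queries]
```

Applying the same function to the log file and to the dictionary evaluation file keeps the two sides comparable.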

  12. The log file: preprocessing • Example 2a: • Dictionary matching: Substitution of special characters for easier access (user enters ae for ä) • Preprocessing version I: • Make a second version of log file • Replace ae by ä in second version • Use spell checker word list to find valid versions • Check ambiguous cases manually

  13. The log file: preprocessing • Example 2b: • Dictionary matching: Substitution of special characters for easier access (user enters c for č) • Preprocessing version II: • Make a second version of log file • Replace c by č in second version • Use frequencies of parallel spellings to find valid versions • Check ambiguous cases manually
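
The two-version idea of Examples 2a and 2b can be sketched for a single query as below. This follows preprocessing version I (lookup in a word list); the function name, the return convention and the tiny word lists are assumptions made for illustration. For version II, the membership test would be replaced by a comparison of the frequencies of the two spellings.

```python
def resolve(query, substitutions, valid_words):
    """Return (spelling, needs_manual_check) for one downcased query."""
    replaced = query
    for plain, special in substitutions.items():
        replaced = replaced.replace(plain, special)
    # Both log versions of the query; a set, because they may be identical.
    valid = sorted({query, replaced} & valid_words)
    if len(valid) == 1:
        return valid[0], False  # exactly one valid spelling: accept it
    if len(valid) == 2:
        return query, True      # both spellings exist: ambiguous case
    return query, False         # unknown word: keep what the user typed


# Tiny illustrative word lists, not real spell checker data.
slovenian_words = {"čas", "cena", "cesar", "česar"}
german_words = {"mädchen"}

results = {
    q: resolve(q, {"c": "č"}, slovenian_words)
    for q in ["cas", "cena", "cesar"]
}
german_result = resolve("maedchen", {"ae": "ä"}, german_words)
```

In this toy data, "cas" resolves to "čas" and "cena" stays unchanged, while "cesar" is flagged for manual checking because both "cesar" and "česar" are valid words.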

  14. The log file: preprocessing • Users sometimes select the wrong source language (SL) • The correct SL could be found using spell checker lists for both languages • In our case, "spell checker lists" taken from Online SLO-DE-SLO detect: • wrongly determined SL Slovenian: 378 • wrongly determined SL German: 593
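
A minimal sketch of the source language check, assuming one word list per language. The function name, the language codes and the rule of only correcting when exactly one other language contains the query are assumptions; the slides only state that word lists for both languages were used.

```python
def corrected_language(query, declared, word_lists):
    """Return the most plausible source language for a downcased query."""
    if query in word_lists[declared]:
        return declared  # the declared language already fits
    others = [
        lang
        for lang, words in word_lists.items()
        if lang != declared and query in words
    ]
    # Correct only when exactly one other language knows the word.
    return others[0] if len(others) == 1 else declared


# Tiny illustrative word lists, not the dictionary's real contents.
word_lists = {"sl": {"hiša", "dan"}, "de": {"haus", "tag"}}

fixed = corrected_language("haus", "sl", word_lists)
kept = corrected_language("dan", "sl", word_lists)
unknown = corrected_language("xyz", "de", word_lists)
```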

  15. Evaluation I: Queries against dictionary • Question: To what extent does the dictionary satisfy users' requests? • Method: match preprocessed queries against downcased dictionary entries, language by language

  16. Evaluation I • Dictionary entries: 5,901 • Distinct German entries (downcased): 5,289 • Distinct Slovenian entries: 5,103 • Compare these lists with queries • Result I ("tokens"): • German: 40.7% of queries match • Slovenian: 38.3% of queries match

  17. Evaluation I • Result II (types) • types: distinct queries • German: 10.4% of types match • Slovenian: 12.7% of types match • The well-known frequency distribution also holds in the query log file: • a few types occur very often and many types occur rarely
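
The difference between the token result and the type result can be made concrete with a small sketch. The function name and the toy data are assumptions; the numbers below are invented and unrelated to the reported percentages.

```python
from collections import Counter


def match_rates(queries, entries):
    """Return (token_rate, type_rate) for a non-empty list of queries."""
    counts = Counter(queries)
    matched_tokens = sum(n for q, n in counts.items() if q in entries)
    matched_types = sum(1 for q in counts if q in entries)
    return matched_tokens / len(queries), matched_types / len(counts)


# Invented example: two frequent queries are covered, two rare ones are not.
queries = ["hiša"] * 6 + ["dan"] * 2 + ["poljub", "krava"]
entries = {"hiša", "dan"}

token_rate, type_rate = match_rates(queries, entries)
```

Here 8 of 10 queries match but only 2 of 4 distinct queries do, which mirrors the pattern on the slides: frequent queries are covered more often than rare ones, so the token rate exceeds the type rate.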

  18. Evaluation I: Qualitative results • Online SLO-DE-SLO still lacks some expressions and words used in social relations and everyday life, e.g. • Slovenian "top unmatched" queries: • regard, offer, confirmation, cow, payment, kiss, oak, to miss, fond of, to teach, sale,... • German "top unmatched" queries: • kiss, welcome, regard, regards, good morning, treasure, to fuck, good evening,...
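
A list of "top unmatched" queries like the ones above could be produced along these lines. The function name and the sample data are assumptions for illustration.

```python
from collections import Counter


def top_unmatched(queries, entries, n):
    """Return the n most frequent queries that have no dictionary entry."""
    counts = Counter(q for q in queries if q not in entries)
    return counts.most_common(n)


# Invented example data.
queries = ["hiša"] * 5 + ["poljub"] * 3 + ["krava"] * 2 + ["hrast"]
entries = {"hiša"}

ranking = top_unmatched(queries, entries, 2)
```

The same filter also yields the input for the variation that evaluates only unsuccessful queries.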

  19. Corpus-based enlargement • Log file entries alone are not enough • The enlargement of the dictionary should stay corpus-based, because the dictionary author wants to • find appropriate examples of use • also find collocations and idioms • find more words that are likely to be of interest to typical users

  20. Evaluation II: Outline • Which corpus should be used next to enlarge Online SLO-DE-SLO? • Which corpus best reflects the structure of the entire vocabulary entered by the users? • Evaluation of Slovenian queries using Slovenian corpora (subcorpora of the Nova Beseda corpus)

  21. Evaluation II: Queries against corpora • Evaluate corpora of three text types: Newspaper 88 million; Fiction 5.7 million; Technical 6.3 million

  22. Evaluation II: Method • Compare lemmas in user queries with relative frequencies in the three corpora. • Lemmatize Slovenian queries and assign POS • Retain lemmatized content words and interjections: 7,246 "query lemmas"

  23. Evaluation II: Method • Lemmatize each corpus (currently with ambiguities) • Calculate relative frequencies (per 1 million) of lemmas in the three corpora • Assign a "weight" to each lemma: for each query lemma and corpus, multiply the number of queries by the relative frequency
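
The weighting scheme described on this slide, together with the per-corpus summation used for the result, can be sketched as follows. The function names and all counts are invented for illustration; only the formula (number of queries times relative frequency per million, summed over query lemmas) comes from the slides.

```python
def relative_frequency(count, corpus_size):
    """Occurrences per one million corpus tokens."""
    return count * 1_000_000 / corpus_size


def corpus_score(query_counts, lemma_counts, corpus_size):
    """Sum of lemma weights: number of queries times relative frequency."""
    return sum(
        n_queries * relative_frequency(lemma_counts.get(lemma, 0), corpus_size)
        for lemma, n_queries in query_counts.items()
    )


# Invented numbers: how often each lemma was queried ...
query_counts = {"hiša": 3, "računalnik": 1}
# ... and (lemma counts, corpus size) for two hypothetical corpora.
corpora = {
    "fiction": ({"hiša": 500, "računalnik": 10}, 1_000_000),
    "technical": ({"hiša": 100, "računalnik": 800}, 2_000_000),
}

scores = {
    name: corpus_score(query_counts, counts, size)
    for name, (counts, size) in corpora.items()
}
best = max(scores, key=scores.get)
```

Using relative rather than absolute frequencies keeps corpora of very different sizes comparable, which matters here because the newspaper corpus is far larger than the other two.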

  24. Evaluation II: Example • First seven lines of fiction corpus evaluation (alphabetical order)

  25. Evaluation II: Result • All lemma weights are summed up for each of the three corpora separately • Fiction 10,262,558 • Newspaper 9,694,125 • Technical 9,369,494 • The fiction corpus reflects the user queries best

  26. Evaluation II: Top twenty weights • Lemmas in at least two corpora (transl.): to be, to have, to give, to go, good, day, house, beautiful, table, to come, light (ADJ), to know, to see, big, year, to work/to do • Top 20 weight in fiction: to think, to say, to look, fond of • Top 20 weight in newspaper: town/place; Slovenian • Top 20 weight in technical: computer, picture, data item

  27. Evaluation II: Improvements and variations • Improvement: unambiguously lemmatized corpora (work in progress for Slovenian) • Variation: evaluate only non-matched queries • Not: overall structure of all queries • But: overall structure of unsuccessful queries (might change after enhancements)

  28. Conclusion • We have shown • query-driven methods of evaluation for online dictionaries • query-driven methods for finding adequate corpora as sources for enhancing dictionaries • Result: the example dictionary Online SLO-DE-SLO should and will be enlarged based on literary texts first

  29. Thank you for your attention http://www.rrz.uni-hamburg.de/slowenisch
