Free construction of a free dictionary of synonyms using computer science

Viggo Kann and Magnus Rosell, KTH, Stockholm. Talk given by Viggo at Amherst College, November 11, 2006.


Presentation Transcript


  1. Free construction of a free dictionary of synonyms using computer science • Viggo Kann and Magnus Rosell, KTH, Stockholm • Talk given by Viggo at Amherst College, November 11, 2006

  2. Examples of English synonyms • Smith: A Dictionary of Synonymous Words in the English Language [1889]: CLASS. Order. Rank. Degree. Classification. Grade. • Webster’s Dictionary of Synonyms [1942]: classify. Alphabetize, pigeonhole, assort, sort. Ana. Order, arrange, systematize, methodize, marshal.

  3. Goals • To construct a Swedish dictionary of synonyms as a list of synonymous pairs • I don’t want to work a lot • I don’t want to pay anyone to work • The resulting list should be free

  4. Ideas • Automatically construct a large set of word pairs that might be synonyms • Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs

  5. More ideas • Use the Lexin on-line Swedish-English dictionary web site, which had 9 million (now 17 million) lookups each month • Users visit Lexin to translate words, and are thus probably motivated to help me • Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not

  6. My plan • Construct lots of possible synonyms • Sort out bad synonym pairs automatically • Ask lots of users if the rest of the pairs are good synonyms • Analyze the gradings done by the users and decide which pairs to keep

  7. Step 1: Construct lots of possible synonyms • If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish • {(w,v) : ∃y : y ∈ SE(w) ∧ v ∈ ES(y)} or {(w,v) : ∃y : y ∈ SE(w) ∧ y ∈ SE(v)} • 616 000 word pairs were generated
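
The round-trip idea on this slide can be sketched as follows. This is only an illustration, not the authors' code: the toy dictionaries `SE` and `ES`, the function name `candidate_pairs`, and the choice to store pairs unordered and to skip a word paired with itself are all assumptions made here.

```python
from collections import defaultdict

def candidate_pairs(se, es):
    """Return unordered candidate synonym pairs from two toy dictionaries.

    se maps a Swedish word to a set of English translations;
    es maps an English word to a set of Swedish translations.
    """
    pairs = set()

    # Variant 1: translate w to English and back (y in SE(w), v in ES(y)).
    for w, english in se.items():
        for y in english:
            for v in es.get(y, ()):
                if v != w:  # assumption: a word is not its own synonym
                    pairs.add(frozenset((w, v)))

    # Variant 2: w and v share an English translation (y in SE(w), y in SE(v)).
    by_english = defaultdict(set)
    for w, english in se.items():
        for y in english:
            by_english[y].add(w)
    for words in by_english.values():
        for w in words:
            for v in words:
                if w < v:
                    pairs.add(frozenset((w, v)))
    return pairs

# Hypothetical toy data, not taken from Lexin.
SE = {"klass": {"class", "grade"}, "rang": {"rank", "grade"}, "grad": {"degree"}}
ES = {"class": {"klass"}, "grade": {"klass", "årskurs"},
      "rank": {"rang"}, "degree": {"grad"}}
PAIRS = candidate_pairs(SE, ES)
```

With the toy data, "klass" and "rang" become a candidate pair because both translate to "grade". Many candidates produced this way will be poor synonyms, which is why the later steps filter them.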

  8. Step 2: Sort out bad synonym pairs automatically • Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space • Keep pairs that have small enough distance in the vector space

  9. Random Indexing • Each word w is assigned a random label vector Lw of a thousand elements • For each word w, construct a context vector Cw by adding the random label vectors of the words appearing in the context of each occurrence of w in a large corpus

  10. Random Indexing settings • Context: 4 words to the left and 4 to the right • Stop words were removed • Dimensionality: 1800 • 5 corpora from different domains were used, for example newspapers and medical texts
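
A minimal sketch of building context vectors with the settings above (window of 4 on each side, dimensionality 1800). Several details are assumptions not stated on the slides: the label vectors are sparse with a few +1/-1 entries (a common choice in Random Indexing), the parameter `nonzero` is hypothetical, and stop words are removed before the window is applied.

```python
import random

def label_vector(dim, nonzero, rng):
    """Sparse random label: `nonzero` positions set to +1 or -1, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def context_vectors(tokens, dim=1800, window=4, stop_words=frozenset(),
                    nonzero=8, seed=0):
    """Return (labels, context) dictionaries for a tokenized corpus."""
    rng = random.Random(seed)
    # Assumption: stop words are dropped before the window is counted.
    tokens = [t for t in tokens if t not in stop_words]

    labels = {}
    for w in tokens:
        if w not in labels:
            labels[w] = label_vector(dim, nonzero, rng)

    context = {w: [0] * dim for w in labels}
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Add the neighbour's label vector to the context vector of w.
            label = labels[tokens[j]]
            cw = context[w]
            for k in range(dim):
                cw[k] += label[k]
    return labels, context
```

Words that tend to occur among the same neighbours accumulate similar sums of label vectors, so their context vectors end up close to each other.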

  11. Number of pairs for different cos thresholds (435 000 of 616 000 pairs occurred in corpus)
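
Filtering candidate pairs by a cosine threshold might look like the sketch below. The function names and the threshold value are hypothetical, and dropping pairs where a word has no context vector is an assumption; the slides do not say how pairs missing from the corpus were handled.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_pairs(pairs, context, threshold):
    """Keep pairs whose context vectors have cosine >= threshold."""
    kept = []
    for w, v in pairs:
        # Assumption: pairs with a word missing from the corpus are dropped.
        if w in context and v in context:
            if cosine(context[w], context[v]) >= threshold:
                kept.append((w, v))
    return kept
```

A higher threshold keeps fewer but more plausible pairs, which is the trade-off the slide's pair counts per threshold describe.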

  12. Step 3: Ask lots of users if the rest of the pairs are good synonyms • When a user has sent a word to the Lexin dictionary, he receives the translation followed by a question like: "Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5, where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'."

  13. After answering, the user may • grade a new randomly chosen word pair • look up a word in the synonym dictionary • suggest a new synonymous word pair • download the synonym dictionary in XML

  14. Step 4: Analyzing the gradings done by the users • 1.2 million gradings were made in less than 2 months • Grading statistics were analyzed on several occasions • Some users sent comments

  15. Keeping the users happy! • Many users said that there were too many bad pairs • Lots of pairs were graded 0 (not at all synonyms) by all users. After some weeks 25 000 such pairs were removed. Later 60 000 more pairs were removed, improving the quality of the remaining pairs considerably.
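
Finding the pairs that every user graded 0 can be sketched like this. The data layout (a dictionary from pair to a list of grades, with `None` standing for "I don't know") and the `min_votes` cutoff are assumptions for illustration; the slides do not say how many gradings a pair needed before it was removed.

```python
def all_zero_pairs(gradings, min_votes=3):
    """Return pairs graded 0 by every user who gave a grade.

    gradings maps a pair to a list of grades 0..5, with None for
    "I don't know". min_votes is a hypothetical safety margin.
    """
    bad = []
    for pair, grades in gradings.items():
        known = [g for g in grades if g is not None]
        if len(known) >= min_votes and all(g == 0 for g in known):
            bad.append(pair)
    return bad
```

Removing such pairs early shrinks the pool, so users see fewer obviously wrong questions.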

  16. User gradings first two months

  17. More interesting gradings 2006

  18. Distribution of mean gradings of word pairs after two months

  19. Distribution of mean gradings of word pairs 2006

  20. Analysis of the pairs graded 0 • Distance (cosine) in RI space

  21. Some statistics (November 2006) • 2.5 M user gradings done • 67 000 pairs (graded ≥ 2) in dictionary • 90 000 pairs suggested by users • 50 000 unique pairs suggested • 14 000 of them have been accepted

  22. Example: Synonyms to klass (class) • 5: rang (grade), rank (rank), slag (kind) • 4: kategori (category), stånd (social class), årskurs (grade) • 3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), ordning (order), skikt (layer), sort (sort), standard (standard), stil (style) • 2: storleksordning (magnitude), typ (type) • 1: poäng (point), stadga (stability) • 0: uppdrag (mission), utbilda (educate)

  23. How to prevent abuse? • Many gradings of a word pair are needed before it’s considered to be good • The pair to be graded is randomly picked from a very large list • Word pairs suggested by users are spell checked before they are added to the very large list
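
The first two safeguards can be sketched as below. The cutoffs are hypothetical: `min_gradings` is invented for illustration, and `min_mean=2.0` assumes that "graded ≥ 2" in the statistics refers to the mean grade, which the slides do not state explicitly.

```python
import random

def pick_pair(candidates, rng=random):
    """Pick the next pair to grade uniformly at random from a large list."""
    return rng.choice(candidates)

def accepted(grades, min_gradings=5, min_mean=2.0):
    """Accept a pair only after enough gradings with a high enough mean.

    grades is a list of values 0..5, with None for "I don't know".
    Both thresholds are assumptions, not the authors' settings.
    """
    known = [g for g in grades if g is not None]
    if len(known) < min_gradings:
        return False
    return sum(known) / len(known) >= min_mean
```

Because the pair is drawn at random from a very large list, a single user is unlikely to see the same pair often enough to push it over the threshold alone.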

  24. People's definition of synonymy • The exact meaning of 'synonym' wasn’t defined • Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair • The produced dictionary will use the people's own definition of synonymy • Hopefully this is exactly what they want!

  25. The people’s synonym dictionary on the web http://lexin.nada.kth.se/cgi-bin/synlex

  26. Lessons learned • The list of suggested synonyms should be huge • Try to improve the quality of the list automatically as much as possible. Random indexing is useful for this; also try tagging and using other dictionaries • Use the 0 answers early to remove bad pairs that only irritate the users
