Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist & Magnus Gunnarsson
Presentation for the GSLT course: Statistical Methods 1
Växjö University, 2002-05-02, 16:00


Presentation Transcript


  1. Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities. Leif Grönqvist & Magnus Gunnarsson. Presentation for the GSLT course: Statistical Methods 1. Växjö University, 2002-05-02, 16:00

  2. Background
  • NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus
  • Comparable Danish and Swedish corpora
  • 1.3 million tokens each, natural spoken interaction
  • We work mainly with spoken language, not written

  3. Peter Juel Henrichsen’s ideas
  • Words with similar context distributions are called Siblings
  • Some pairs (seed pairs) of Swedish and Danish words with ”the same” meaning are carefully selected: Cousins
  • Groups of siblings in each corpus, together with the seed pairs, yield new probable cousins

  4. Siblings as word groups
  • Drop the Cousins for now and focus on Siblings
  • Traditional parts of speech are not necessarily valid
  • What we have is the corpus. Only the corpus
  • We take information from the 1+1 word context (the immediate left and right neighbours)
  • Nothing else, such as morphology or lexica
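The 1+1 context described above can be collected in a few lines. This is a minimal sketch, not the authors' program; the function name `context_counts` and the `("L", …)`/`("R", …)` key encoding are illustrative assumptions:

```python
from collections import Counter

def context_counts(tokens):
    """Count immediate left/right neighbours (the 1+1 context) of each token.

    Returns {word: Counter of ("L", neighbour) / ("R", neighbour) keys},
    i.e. one context distribution per word type.
    """
    ctx = {}
    for i, w in enumerate(tokens):
        c = ctx.setdefault(w, Counter())
        if i > 0:
            c[("L", tokens[i - 1])] += 1   # left neighbour
        if i + 1 < len(tokens):
            c[("R", tokens[i + 1])] += 1   # right neighbour
    return ctx

# Tiny illustrative corpus (not from the NordTalk data)
corpus = "ja det var bra ja det var fint".split()
ctx = context_counts(corpus)
# "var" is preceded by "det" twice, followed by "bra" and "fint" once each
```

Words with near-identical context distributions under such counts are the candidate Siblings.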

  5. The original Sibling formula (the formula itself appears only as an image on the slide and is not reproduced in the transcript)

  6. Improvements of the Sibling measure
  • Symmetry: sib(x1, x2) = sib(x2, x1)
  • Similarity should be possible even if the context on one of the sides is different
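The slides do not reproduce the ggsib formula, so as an illustration only: any measure defined over the two words' context-count vectors with a commutative combination is symmetric by construction. A standard stand-in (not the authors' measure) is cosine similarity:

```python
import math
from collections import Counter

def cosine_sim(c1: Counter, c2: Counter) -> float:
    """Symmetric similarity between two context-count vectors.

    cosine_sim(a, b) == cosine_sim(b, a) holds because both the dot
    product and the norms are order-independent.
    """
    dot = sum(v * c2[k] for k, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Because missing keys simply contribute zero to the dot product, the measure still yields a score when one side's context differs, which is the second requirement on the slide.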

  7. Trees instead of groups
  • Iterative use of the ggsib similarity measure:
  • 1. Calculate ggsib between all word pairs above a frequency threshold
  • 2. Collect pairs with similarity above a rather high score threshold Sth in a list L
  • 3. For each pair in L: replace the less frequent of the two words with the other, throughout the corpus

  8. Trees instead of groups (cont.)
  • 4. If L is empty: decrement Sth slightly
  • 5. Start again from step 1 if Sth is still above a lowest score threshold
  • The result may be interpreted as trees
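The iterative scheme on slides 7 and 8 can be sketched as follows. This is one reading of the algorithm, not the authors' implementation; `merge_iteration` and its parameters are hypothetical names, and the similarity function `sim` (ggsib in the talk) is assumed to be supplied by the caller:

```python
from collections import Counter

def merge_iteration(tokens, freq_threshold, s_th, s_min, decrement, sim):
    """Repeatedly merge highly similar word pairs, lowering the score
    threshold s_th whenever no pair qualifies, until s_th < s_min.

    sim(tokens, w1, w2) -> float is assumed given (ggsib in the talk).
    Returns the merge history [(dropped_word, kept_word), ...], which
    can be read off as a set of trees.
    """
    tokens = list(tokens)
    history = []
    while s_th >= s_min:
        freq = Counter(tokens)
        words = [w for w, f in freq.items() if f >= freq_threshold]
        # Step 1-2: all sufficiently similar pairs above the thresholds
        pairs = [(w1, w2) for i, w1 in enumerate(words)
                 for w2 in words[i + 1:]
                 if sim(tokens, w1, w2) >= s_th]
        if not pairs:
            s_th -= decrement            # step 4: lower the threshold
            continue
        # Step 3: replace the less frequent word with the other
        for w1, w2 in pairs:
            keep, drop = (w1, w2) if freq[w1] >= freq[w2] else (w2, w1)
            tokens = [keep if t == drop else t for t in tokens]
            history.append((drop, keep))  # one edge of the result tree
    return history

# Toy run with a stub similarity that only links "hund" and "katt"
toy = ["hund"] * 3 + ["katt"] * 2
merged = merge_iteration(
    toy, freq_threshold=2, s_th=0.9, s_min=0.5, decrement=0.2,
    sim=lambda t, a, b: 1.0 if {a, b} == {"hund", "katt"} else 0.0)
# merged records that "katt" was folded into the more frequent "hund"
```

Each merge is an edge from the dropped word to the word it was replaced by, so the accumulated history forms the trees mentioned on the slide.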

  9. An example tree

  10. Implementation
  • Easy to implement: Peter wrote a Perl script
  • But one step of the iteration with ~5000 word types took 100 hours
  • Our heavily optimized C program ran one step in under 60 minutes, and 100 iterations in under 100 hours

  11. Most important optimizations
  Starting point: we have enough memory, but not enough time
  • A compiled low-level language instead of an interpreted high-level one
  • Frequencies for words and word pairs are stored in letter trees (tries) instead of hash tables
  • Move computation and counting outward in the loop hierarchy

  12. Optimizations (letter trees)
  • Retrieving information from a letter tree takes time independent of the size of the lexicon (a balanced search tree would need O(log n); a hash table lookup is also expected O(1), but must hash the entire key and resolve collisions)
  • The lookup is linear in the length of the word, but average word length stays roughly constant as the lexicon grows
  • Another drawback: our example needs 1 GB to run (each node in the tree is an array over all possible characters), but who cares
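A letter tree with fixed child arrays per node can be sketched as below. This is a minimal illustration assuming a lowercase a-z alphabet; the real C program covered the full character set, which is what makes each node large and drives the 1 GB footprint mentioned above:

```python
class TrieNode:
    """One node of the letter tree: a fixed array of children plus a count."""
    __slots__ = ("children", "count")
    ALPHABET = 26  # assumption: a-z only; the real program used all characters

    def __init__(self):
        self.children = [None] * TrieNode.ALPHABET  # array per node: fast but memory-hungry
        self.count = 0

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Increment the frequency of `word`; cost O(len(word))."""
        node = self.root
        for ch in word:
            i = ord(ch) - ord("a")
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.count += 1

    def count(self, word):
        """Look up the frequency of `word`; cost is independent of how
        many other words the lexicon holds."""
        node = self.root
        for ch in word:
            node = node.children[ord(ch) - ord("a")]
            if node is None:
                return 0
        return node.count
```

The child-array indexing is why lookup time depends only on the word's length: each character is resolved in one array access, with no hashing and no comparisons against other lexicon entries.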

  13. Optimizations (more)
  • An example of moving computation to an outer loop: calculate the set of all context words once, and reuse it for comparisons with all other words
  • The set may be stored as an array of pointers to nodes (between words in word pairs) in the letter tree
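The loop-hoisting idea can be shown in miniature. This is a hypothetical sketch (the names `pairwise_overlaps` and the overlap measure are illustrative, not the authors' code): the context-key set of the outer word is built once per outer iteration instead of once per pair:

```python
from collections import Counter

def pairwise_overlaps(contexts):
    """contexts: {word: Counter of context items}.

    For each outer word w1, materialise its context-key set once (the
    hoisted computation) and reuse it against every inner word w2,
    instead of rebuilding it inside the inner loop.
    """
    words = list(contexts)
    overlaps = {}
    for i, w1 in enumerate(words):
        keys1 = set(contexts[w1])          # computed once per outer iteration
        for w2 in words[i + 1:]:
            overlaps[(w1, w2)] = len(keys1 & contexts[w2].keys())
    return overlaps
```

With n words, the hoist turns n*(n-1)/2 set constructions into n, which is the kind of saving that, combined with the letter trees, brought one iteration from 100 hours down to under an hour.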

  14. Personal pronouns

  15. (figure slide)

  16. Colours

  17. Problems
  • Sparse data
  • Homonyms
  • When to stop
  • Memory and time complexity

  18. Conclusions
  • Our method is an interesting way of finding word groups
  • It works for all kinds of words (syncategorematic as well as categorematic)
  • Low-frequency words and homonyms are difficult to handle

  19. (figure slide)

  20. (figure slide)
