Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist & Magnus Gunnarsson
Presentation for the GSLT course: Statistical Methods 1
Växjö University, 2002-05-02, 16:00
Background
• NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus
• Comparable Danish and Swedish corpora
• 1.3 Mtokens each, natural spoken interaction
• We are mainly working with spoken language, not written
Peter Juel Henrichsen's ideas
• Words with similar context distributions are called Siblings
• Some pairs (seed pairs) of Swedish and Danish words with "the same" meaning are carefully selected: Cousins
• Groups of siblings in each corpus, together with the seed pairs, give new probable cousins
Siblings as word groups
• Drop the Cousins for now and focus on Siblings
• Traditional parts of speech are not necessarily valid
• What we have is the corpus, and only the corpus
• We take our information from the 1+1-word context (one word to the left and one to the right)
• Nothing else, such as morphology or lexica, is used
The original Sibling formula
[The formula itself was given on the slide and is not reproduced in this transcript.]
Improvements of the Sibling measure
• Symmetry: sib(x1, x2) = sib(x2, x1)
• Similarity should be detectable even if the context on one of the two sides (left or right) is different
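The improved measure is called ggsib on the next slide, but its definition is not reproduced in this transcript. Purely as an illustration of the two requirements above, and with notation that is ours rather than the authors' (P_L and P_R for the left- and right-neighbour distributions over the vocabulary V), a measure with both properties could score the two sides separately and let a good match on either side suffice:

% Illustrative sketch only -- not the authors' ggsib definition.
\[
  \mathrm{side}_d(x_1, x_2) = \sum_{w \in V}
      \min\bigl( P_d(w \mid x_1),\; P_d(w \mid x_2) \bigr),
  \qquad d \in \{\mathrm{L}, \mathrm{R}\},
\]
\[
  \mathrm{sim}(x_1, x_2) =
      \max\bigl( \mathrm{side}_{\mathrm{L}}(x_1, x_2),\;
                 \mathrm{side}_{\mathrm{R}}(x_1, x_2) \bigr)
\]
% sim is symmetric in x1 and x2, and rewards similarity on either
% the left or the right side alone.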
Trees instead of groups
• Iterative use of the ggsib similarity measure:
• 1. Calculate ggsib between all word pairs above a frequency threshold
• 2. Collect the pairs with a similarity above a rather high score threshold Sth in a list L
• 3. For each pair in L: replace the less frequent of the two words with the other, throughout the corpus
Trees instead of groups (cont.)
• 4. If L is empty: decrement Sth slightly
• 5. Run from step 1 again if Sth is still above a lowest score threshold
• The result may be interpreted as trees
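A minimal sketch of this loop in C (the language of the optimized implementation), on a toy corpus of word ids. The similarity function is only a stand-in (a plain cosine over 1+1-word context counts), not the real ggsib, and all names, thresholds and data are invented for the illustration; compile with -lm.

#include <math.h>
#include <stdio.h>

#define CORPUS_LEN 12
#define VOCAB      6       /* word types, numbered 0..VOCAB-1            */
#define FREQ_MIN   2       /* frequency threshold                        */
#define S_START    0.80    /* initial (rather high) score threshold Sth  */
#define S_STOP     0.30    /* lowest score threshold                     */
#define S_STEP     0.05    /* decrement when no pair qualifies           */

/* Toy corpus of word ids; the real corpora have 1.3 Mtokens each. */
static int corpus[CORPUS_LEN] = {0, 1, 2, 0, 3, 2, 0, 1, 4, 0, 3, 5};
static int freq[VOCAB];

static void count_freq(void)
{
    for (int v = 0; v < VOCAB; v++) freq[v] = 0;
    for (int i = 0; i < CORPUS_LEN; i++) freq[corpus[i]]++;
}

/* Stand-in for ggsib: cosine similarity of the 1+1-word context counts. */
static double similarity(int a, int b)
{
    double ca[2 * VOCAB] = {0}, cb[2 * VOCAB] = {0};
    for (int i = 0; i < CORPUS_LEN; i++) {
        double *c = corpus[i] == a ? ca : corpus[i] == b ? cb : NULL;
        if (!c) continue;
        if (i > 0)              c[corpus[i - 1]]         += 1.0;  /* left  */
        if (i < CORPUS_LEN - 1) c[VOCAB + corpus[i + 1]] += 1.0;  /* right */
    }
    double dot = 0, na = 0, nb = 0;
    for (int k = 0; k < 2 * VOCAB; k++) {
        dot += ca[k] * cb[k]; na += ca[k] * ca[k]; nb += cb[k] * cb[k];
    }
    return (na > 0 && nb > 0) ? dot / (sqrt(na) * sqrt(nb)) : 0.0;
}

int main(void)
{
    double s_th = S_START;
    while (s_th >= S_STOP) {
        count_freq();

        /* Steps 1-2: collect pairs above both thresholds in a list L. */
        int La[VOCAB * VOCAB], Lb[VOCAB * VOCAB], nL = 0;
        for (int a = 0; a < VOCAB; a++)
            for (int b = a + 1; b < VOCAB; b++)
                if (freq[a] >= FREQ_MIN && freq[b] >= FREQ_MIN &&
                    similarity(a, b) >= s_th) {
                    La[nL] = a; Lb[nL] = b; nL++;
                }

        if (nL == 0) {      /* step 4: nothing qualified, lower Sth...   */
            s_th -= S_STEP; /* step 5: ...and run from step 1 again      */
            continue;
        }

        /* Step 3: for each pair, replace the rarer word in the corpus. */
        for (int p = 0; p < nL; p++) {
            int a = La[p], b = Lb[p];
            if (freq[a] == 0 || freq[b] == 0) continue;  /* already merged */
            int keep = freq[a] >= freq[b] ? a : b;
            int drop = (keep == a) ? b : a;
            for (int i = 0; i < CORPUS_LEN; i++)
                if (corpus[i] == drop) corpus[i] = keep;
            printf("Sth=%.2f: merged word %d into word %d\n", s_th, drop, keep);
            count_freq();
            /* Recording each merge as an edge is what lets the result
             * be read as trees.                                        */
        }
    }
    return 0;
}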
An example tree
[The tree shown on this slide is not reproduced in this transcript.]
Implementation
• Easy to implement: Peter made a Perl script
• But one step of the iteration with ~5000 word types took 100 hours
• Our heavily optimized C program ran one step in less than 60 minutes, and 100 iterations in less than 100 hours
Most important optimizations
Starting point: we have enough memory but not enough time
• A compiled low-level language instead of an interpreted high-level one
• Frequencies for words and word pairs are stored in letter trees instead of hash tables
• Computation and counting are moved as far out in the loop hierarchy as possible
Optimizations (letter trees)
• Retrieving information from the letter trees takes constant time with respect to the size of the lexicon (compared to log(n) for hash tables)
• It is linear in the average word length, but that stays roughly constant as the lexicon grows
• Another drawback: our example needs 1 GB to run (each node in the tree is an array over all possible characters), but who cares
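A minimal sketch of such a letter tree, assuming byte strings and one child slot per possible byte value as described above; the node layout and function names are invented for the illustration, not taken from the real program.

#include <stdio.h>
#include <stdlib.h>

#define ALPHABET 256   /* one child slot per possible byte value */

/* Each node holds a full child array, which is what makes the structure
 * memory-hungry but keeps lookup time independent of the lexicon size.  */
typedef struct Node {
    struct Node *child[ALPHABET];
    long count;        /* word frequency, stored at the end-of-word node */
} Node;

static Node *new_node(void)
{
    Node *n = calloc(1, sizeof *n);
    if (!n) { perror("calloc"); exit(1); }
    return n;
}

/* Add one occurrence of `word`, creating nodes along the path as needed. */
static void trie_add(Node *root, const unsigned char *word)
{
    Node *n = root;
    for (; *word; ++word) {
        if (!n->child[*word]) n->child[*word] = new_node();
        n = n->child[*word];
    }
    n->count++;
}

/* Frequency of `word`; the walk costs one step per character, nothing more. */
static long trie_count(const Node *root, const unsigned char *word)
{
    const Node *n = root;
    for (; *word && n; ++word) n = n->child[*word];
    return n ? n->count : 0;
}

int main(void)
{
    Node *root = new_node();
    trie_add(root, (const unsigned char *)"ja");
    trie_add(root, (const unsigned char *)"ja");
    trie_add(root, (const unsigned char *)"nej");
    printf("ja: %ld  nej: %ld  mm: %ld\n",
           trie_count(root, (const unsigned char *)"ja"),
           trie_count(root, (const unsigned char *)"nej"),
           trie_count(root, (const unsigned char *)"mm"));
    return 0;
}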
Optimizations (more)
• An example of moving computation to an outer loop: the set of all context words of a word is calculated once, and then reused for the comparisons with all other words
• The set may be stored as an array of pointers into the letter tree, to the nodes reached between the two words of a word pair
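The pointer-caching idea might look roughly like this (again with invented names): word pairs are stored in the letter tree as the string "w1 w2", the shared prefix "ja " is walked once, and the saved node pointer is reused for every right-hand candidate instead of starting from the root each time.

#include <stdio.h>
#include <stdlib.h>

#define ALPHABET 256

/* Same kind of letter tree as above, here holding word-pair strings "w1 w2". */
typedef struct Node { struct Node *child[ALPHABET]; long count; } Node;

static Node *new_node(void)
{
    Node *n = calloc(1, sizeof *n);
    if (!n) { perror("calloc"); exit(1); }
    return n;
}

/* Walk `s` from node `n`, creating missing nodes; return the final node. */
static Node *extend(Node *n, const unsigned char *s)
{
    for (; *s; ++s) {
        if (!n->child[*s]) n->child[*s] = new_node();
        n = n->child[*s];
    }
    return n;
}

/* Walk `s` from node `n` without creating nodes; NULL if the path is absent. */
static Node *walk(Node *n, const unsigned char *s)
{
    for (; *s && n; ++s) n = n->child[*s];
    return n;
}

int main(void)
{
    Node *root = new_node();
    extend(root, (const unsigned char *)"ja visst")->count++;
    extend(root, (const unsigned char *)"ja visst")->count++;
    extend(root, (const unsigned char *)"ja ja")->count++;

    /* Moving work out of the inner loop: resolve the shared prefix "ja "
     * once, keep the node pointer, and look up many right-hand words
     * from there instead of re-walking from the root each time.         */
    Node *after_ja = walk(root, (const unsigned char *)"ja ");
    const char *cand[] = {"visst", "ja", "nej"};
    for (int i = 0; i < 3; i++) {
        Node *n = after_ja ? walk(after_ja, (const unsigned char *)cand[i]) : NULL;
        printf("ja %s: %ld\n", cand[i], n ? n->count : 0L);
    }
    return 0;
}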
Personal pronouns
[The personal-pronoun example shown on this slide is not reproduced in this transcript.]
Colours
[The colour-word example shown on this slide is not reproduced in this transcript.]
Problems
• Sparse data
• Homonyms
• When to stop
• Memory and time complexity
Conclusions
• Our method is an interesting way of finding word groups
• It works for all kinds of words (syncategorematic as well as categorematic)
• Low-frequency words and homonyms are difficult to handle