Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist & Magnus Gunnarsson
Presentation for the GSLT course: Statistical Methods 1
Växjö University, 2002-05-02, 16:00
Background
• NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus
• Comparable Danish and Swedish corpora
• 1.3 million tokens each, natural spoken interaction
• We work mainly with spoken language, not written
Peter Juel Henrichsen’s ideas
• Words with similar context distributions are called Siblings
• Some pairs (seed pairs) of Swedish and Danish words with ”the same” meaning are carefully selected: Cousins
• Groups of siblings in each corpus, together with the seed pairs, give new probable cousins
Siblings as word groups
• Drop the Cousins for now and focus on Siblings
• Traditional parts of speech are not necessarily valid
• What we have is the corpus. Only the corpus
• We take information from the 1+1 word context (one word to the left and one to the right)
• Nothing else, such as morphology or lexica
The original Sibling formula (shown as an equation image on the slide)
Improvements of the Sibling measure
• Symmetry: sib(x1, x2) = sib(x2, x1)
• Similarity should still be possible even if the context on one of the sides is different
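The exact sibling/ggsib formula appears only as an image on the original slide, so the following is a minimal C sketch of one possible symmetric, context-overlap style measure with the two properties listed above. It is an illustrative stand-in, not the formula actually used in the talk, and all names and numbers in it are invented.

```c
/* Minimal sketch of a symmetric, context-overlap style similarity.
 * This is NOT the actual ggsib/sibling formula from the talk; it only
 * illustrates the two listed properties: the measure is symmetric, and
 * a word pair can still score when only one side of the 1+1 context
 * matches. All numbers are made up. */
#include <stdio.h>

#define NCTX 4                  /* toy inventory of possible context words */

/* Relative frequencies of each context word around one target word,
 * kept separately for the left and the right neighbour position. */
typedef struct {
    double left[NCTX];
    double right[NCTX];
} ContextDist;

/* Overlap of two distributions: sum over context words of min(p, q).
 * The result lies in [0, 1] and is symmetric in its arguments. */
static double overlap(const double *p, const double *q)
{
    double s = 0.0;
    for (int i = 0; i < NCTX; i++)
        s += (p[i] < q[i]) ? p[i] : q[i];
    return s;
}

/* sib(x1, x2) == sib(x2, x1): the average of the left- and right-context
 * overlaps, so one matching side alone already gives a nonzero score. */
static double sib(const ContextDist *a, const ContextDist *b)
{
    return 0.5 * (overlap(a->left, b->left) + overlap(a->right, b->right));
}

int main(void)
{
    /* Toy context distributions for two hypothetical words. */
    ContextDist w1 = { { 0.5, 0.3, 0.2, 0.0 }, { 0.6, 0.4, 0.0, 0.0 } };
    ContextDist w2 = { { 0.4, 0.4, 0.1, 0.1 }, { 0.1, 0.2, 0.3, 0.4 } };
    printf("sib(w1, w2) = %.3f\n", sib(&w1, &w2));
    return 0;
}
```

Averaging the two sides (rather than multiplying them) is what lets one matching context side carry the score on its own.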
Trees instead of groups
• Iterative use of the ggsib similarity measure (see the sketch after this list):
1. Calculate ggsib between all word pairs above a frequency threshold
2. Pairs with similarity above a rather high score threshold Sth are collected in a list L
3. For each pair in L: replace the less frequent of the two words with the other, throughout the corpus
Trees instead of groups (cont.)
4. If L is empty: decrement Sth slightly
5. Run from step 1 again if Sth is still above a lowest score threshold
• The result may be interpreted as trees
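A compact sketch of this control flow in C might look as follows. The similarity matrix is a toy stand-in for ggsib (in the real program the scores would be recomputed from the 1+1 contexts after each round of merges), and the words, frequencies and threshold values are invented for illustration.

```c
/* Sketch of the iterative merging loop described above. The similarity
 * scores are a canned toy matrix standing in for ggsib; in the real
 * program they would be recomputed from the corpus after each round of
 * merges. Vocabulary, frequencies and thresholds are made up. */
#include <stdio.h>

#define NWORDS 4

static const char *word[NWORDS] = { "jag", "du", "han", "hon" };
static long freq[NWORDS]        = { 900, 700, 300, 250 };
static int  merged_into[NWORDS] = { -1, -1, -1, -1 };   /* tree edges */

/* Toy stand-in for ggsib(x1, x2); symmetric by construction. */
static const double sim[NWORDS][NWORDS] = {
    { 1.0, 0.8, 0.5, 0.4 },
    { 0.8, 1.0, 0.6, 0.5 },
    { 0.5, 0.6, 1.0, 0.9 },
    { 0.4, 0.5, 0.9, 1.0 },
};

int main(void)
{
    /* Thresholds kept as integer hundredths to avoid floating-point
     * drift when decrementing. */
    int sth = 85;               /* current score threshold, Sth */
    const int sth_min = 55;     /* lowest score threshold */
    const int step = 5;         /* decrement when no pair qualifies */

    while (sth >= sth_min) {
        int merged_any = 0;
        for (int i = 0; i < NWORDS; i++) {
            if (merged_into[i] != -1) continue;     /* already merged away */
            for (int j = i + 1; j < NWORDS; j++) {
                if (merged_into[j] != -1) continue;
                if (sim[i][j] * 100.0 < sth) continue;
                /* Replace the less frequent word with the other one and
                 * remember the merge as an edge in the result tree. */
                int lo = (freq[i] < freq[j]) ? i : j;
                int hi = (lo == i) ? j : i;
                merged_into[lo] = hi;
                freq[hi] += freq[lo];
                printf("merge %s -> %s at Sth=0.%02d\n", word[lo], word[hi], sth);
                merged_any = 1;
            }
        }
        if (!merged_any)
            sth -= step;        /* L was empty: lower the threshold slightly */
    }
    return 0;
}
```

Each recorded merge is an edge from the absorbed word to its survivor, which is how the result can be read off as trees.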
An example tree (shown as a figure on the slide)
Implementation
• Easy to implement: Peter made a Perl script
• But… one step of the iteration with ~5000 word types took 100 hours
• Our heavily optimized C program ran in less than 60 minutes, and 100 iterations in less than 100 hours
Most important optimizations
Starting point: we have enough memory but not enough time
• A compiled low-level language instead of an interpreted high-level one
• Frequencies for words and word pairs are stored in letter trees instead of hash tables
• Try to move computation and counting outwards in the loop hierarchy
Optimizations (letter trees)
• Retrieving information from the letter trees takes constant time with respect to the size of the lexicon (compared to log(n) for hash tables)
• It is linear in the average word length, but that stays roughly constant as the lexicon grows
• Another drawback: our example needs 1 GB to run (each node in the tree is an array over all possible characters), but who cares
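A minimal letter tree (trie) of the kind described might look like the sketch below: each node holds one child pointer per possible byte value, which is what makes lookups independent of the number of word types, and also what makes the structure memory-hungry (256 pointers, roughly 2 KB per node on a 64-bit machine, consistent with the ~1 GB figure above). The words and counts are illustrative, and nodes are never freed in this sketch.

```c
/* Minimal letter tree (trie) for word frequencies. Lookup time depends
 * only on word length, not on the number of word types, at the cost of
 * one 256-entry pointer array per node. */
#include <stdio.h>
#include <stdlib.h>

#define ALPHABET 256

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    long count;                 /* frequency of the word ending here */
} TrieNode;

static TrieNode *node_new(void)
{
    TrieNode *n = calloc(1, sizeof(TrieNode));
    if (!n) { perror("calloc"); exit(1); }
    return n;
}

/* Walk/extend the path for `word` and add `delta` to its count.
 * Returns the final node, so callers can keep a direct pointer to it. */
static TrieNode *trie_add(TrieNode *root, const char *word, long delta)
{
    TrieNode *n = root;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        if (!n->child[*p])
            n->child[*p] = node_new();
        n = n->child[*p];
    }
    n->count += delta;
    return n;
}

static long trie_count(const TrieNode *root, const char *word)
{
    const TrieNode *n = root;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        n = n->child[*p];
        if (!n) return 0;
    }
    return n->count;
}

int main(void)
{
    TrieNode *root = node_new();
    trie_add(root, "och", 3);
    trie_add(root, "och", 2);
    trie_add(root, "ja", 7);
    printf("och: %ld  ja: %ld  hmm: %ld\n",
           trie_count(root, "och"), trie_count(root, "ja"),
           trie_count(root, "hmm"));
    return 0;
}
```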
Optimizations (more)
• An example of moving computation to an outer loop is to calculate the set of all context words once and reuse it for comparisons with all other words
• The set may be stored as an array of pointers to nodes in the letter tree (at the positions between the words in the word pairs), as sketched below
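Assuming a letter tree like the one sketched above, the hoisting idea could be illustrated as follows. The context entries, the dummy pair-frequency lookup and all numbers are invented for the sketch; in the real program the cached array would hold pointers into the pair letter tree instead.

```c
/* Sketch of moving work out of the inner loop: the context set of the
 * word under comparison is collected once, as an array of pointers,
 * and every comparison partner then reuses that cached set. */
#include <stdio.h>

typedef struct {
    const char *context_word;   /* a word seen next to the target word */
    long pair_freq;             /* how often that pair occurs */
} ContextEntry;

/* The target word's 1+1 context set, built once per outer iteration. */
static ContextEntry target_ctx[] = {
    { "och", 40 }, { "ja", 15 }, { "precis", 5 },
};
enum { NCTX = sizeof target_ctx / sizeof target_ctx[0] };

/* Placeholder for looking up how often `other` occurs next to `ctx`;
 * in the real program this would be a letter-tree lookup. */
static long pair_freq(const char *other, const char *ctx)
{
    (void)other; (void)ctx;
    return 7;                   /* dummy value for the sketch */
}

int main(void)
{
    const char *others[] = { "du", "han", "hon" };

    /* Outer-loop work: cache pointers to the target's context entries. */
    const ContextEntry *cached[NCTX];
    for (int c = 0; c < NCTX; c++)
        cached[c] = &target_ctx[c];

    /* Inner loop: every comparison reuses the cached context set
     * instead of re-collecting it from the corpus each time. */
    for (int w = 0; w < 3; w++) {
        long shared = 0;
        for (int c = 0; c < NCTX; c++) {
            long a = cached[c]->pair_freq;
            long b = pair_freq(others[w], cached[c]->context_word);
            shared += (a < b) ? a : b;
        }
        printf("%s: shared context mass %ld\n", others[w], shared);
    }
    return 0;
}
```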
Personal pronouns (example result shown as a figure on the slide)
Colours (example result shown as a figure on the slide)
Problems
• Sparse data
• Homonyms
• When to stop
• Memory and time complexity
Conclusions
• Our method is an interesting way of finding word groups
• It works for all kinds of words (syncategorematic as well as categorematic)
• It has difficulties handling low-frequency words and homonyms