The String Edit Distance (SED) heuristic for morpheme discovery: a look at Swahili

Yu Hu, Irina Matveeva, John Goldsmith, Colin Sprague
SED heuristic: morpheme discovery

A new heuristic for morpheme discovery
• Overall goal: to understand the process of going from untagged corpora in an unknown language to a parsing of each word in the corpus into its component morphemes.
• That is, morphology induction.

Linguistica
• http://linguistica.uchicago.edu

General structure of morphology induction
• Search method within morphology space
• Objective function to evaluate the goodness of any given morphology for a specific corpus
Search method divided into two parts:
• Initial, or bootstrapping, heuristic
• Incremental heuristics

Zellig Harris (1909-1992)
• Proposed successor frequency (SF) as a method for finding morpheme breaks.
[Figure: letters that can follow the prefix "govern": e, i, m, o, s, # (end of word)]
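A minimal sketch of successor frequency, to make the idea concrete. The word list is a hypothetical toy example and the function name is an assumption, not part of Linguistica. For each prefix we collect the distinct symbols that can follow it, with "#" standing for end of word. A prefix with many successors is a candidate morpheme break.

```python
from collections import defaultdict

def successor_frequencies(words):
    """Map each prefix to the set of symbols that can follow it ('#' = word end)."""
    successors = defaultdict(set)
    for word in words:
        for i in range(len(word) + 1):
            next_symbol = word[i] if i < len(word) else "#"
            successors[word[:i]].add(next_symbol)
    return successors

# Hypothetical toy corpus, chosen only for illustration.
words = ["govern", "governed", "governing", "government", "governor", "governs"]
sf = successor_frequencies(words)

print(sorted(sf["govern"]))  # ['#', 'e', 'i', 'm', 'o', 's']
print(len(sf["gover"]))      # 1
```

The successor count jumps from 1 after "gover" to 6 after "govern", which is the kind of peak SF treats as a morpheme boundary.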

SF (Z. Harris) works reasonably well for European languages, though it produces too many false positives.
• It does not work well for languages with rich morphologies, where the average number of morphemes per word is high.

SF: false positives
• Most of SF’s false positives can be weeded out by looking for signatures: multiple stems co-occurring with multiple suffixes.
• That is:

SF peaks as FSA

Signature: reduces false positives of SF
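A rough sketch of the signature filter, assuming candidate cuts arrive as (stem, suffix) pairs. The pair list and function name are hypothetical. Stems are grouped by the exact set of suffixes they take, and a group is kept only if it has multiple stems and multiple suffixes.

```python
from collections import defaultdict

def signatures(cuts, min_stems=2, min_suffixes=2):
    """Group stems by their suffix set; keep groups with enough stems and suffixes."""
    suffixes_of = defaultdict(set)
    for stem, suffix in cuts:
        suffixes_of[stem].add(suffix)
    stems_of = defaultdict(set)
    for stem, suffix_set in suffixes_of.items():
        stems_of[frozenset(suffix_set)].add(stem)
    return {sig: stems for sig, stems in stems_of.items()
            if len(sig) >= min_suffixes and len(stems) >= min_stems}

# Hypothetical candidate cuts; ("th", "e") and ("th", "is") are SF false positives.
cuts = [("jump", ""), ("jump", "ed"), ("jump", "ing"),
        ("walk", ""), ("walk", "ed"), ("walk", "ing"),
        ("th", "e"), ("th", "is")]
sigs = signatures(cuts)
print(sigs)
```

Only the signature {"", "ed", "ing"} survives, with stems "jump" and "walk". The cut of "the" and "this" is dropped because no second stem shares that suffix set.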

Generalize the signature…
Sequential FSA: each state has a unique successor.

Here is how we do it.

1. Alignments

1.1 Alignments: String edit distance algorithm
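For reference, a sketch of the standard string edit distance (Levenshtein) dynamic program with a backtrace that returns the alignment columns. Unit costs for insertion, deletion and substitution are an assumption here; the slides do not state the cost scheme used. The word pair is illustrative.

```python
def align(a, b):
    """Return (edit distance, alignment columns); '-' marks a gap."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a[i-1]
                          d[i][j - 1] + 1,         # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # match or substitute
    # Walk back from the bottom-right cell, preferring the diagonal.
    columns = []
    i, j = m, n
    while i > 0 or j > 0:
        diag_cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + diag_cost:
            columns.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            columns.append((a[i - 1], "-"))
            i -= 1
        else:
            columns.append(("-", b[j - 1]))
            j -= 1
    columns.reverse()
    return d[m][n], columns

distance, columns = align("alipenda", "anapenda")
print(distance)  # 2
print(columns)
```

The alignment pairs off the shared material ("a", "penda") and isolates the differing region ("li" against "na"), which is where cuts would be placed.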

SED is slow, and there are many pairs of words in a corpus.
• So we make an effort to avoid applying SED when it’s futile.
• Alphabetize the letters of each word to quickly count the overlap in the bag of letters in each word.
• Set a minimum threshold of 3 letters.
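The prefilter above could look roughly like this. The function names are hypothetical, and reading the threshold as "at least 3 shared letters" is an assumption. Sorting each word's letters lets a single merge pass count the size of the bag intersection.

```python
def letter_overlap(a, b):
    """Count letters shared by two words as bags (multisets), via a sorted merge."""
    x, y = sorted(a), sorted(b)
    i = j = shared = 0
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            shared += 1
            i += 1
            j += 1
        elif x[i] < y[j]:
            i += 1
        else:
            j += 1
    return shared

def worth_aligning(a, b, min_shared=3):
    """Run SED only on pairs that share at least min_shared letters."""
    return letter_overlap(a, b) >= min_shared

print(letter_overlap("alipenda", "anapenda"))  # 6
print(worth_aligning("kitabu", "soma"))        # False
```

In practice the sorted form of each word would be computed once and cached, so each pair costs only the linear merge.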

1.2 Alignments: make cuts

1.3 Result: elementary alignment

2.1 Collapsing elementary alignments
[Figure labels: context, context]

2.2 Two or more sequential FSAs with identical contexts are collapsed:

3. Further collapsing FSAs

4.1 Evaluating the robustness of these templates (sequential FSAs)
• Measure: How many letters do we save by expressing words in a template rather than by writing each one out individually?
Answer: 36 - 17 = 19.
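A sketch of the letter-savings measure, assuming a template is a sequence of morpheme sets and generates every combination of one morpheme per slot. The template below is a hypothetical example, not the one behind the 36 - 17 figure.

```python
from itertools import product

def robustness(template):
    """Letters needed to spell out every generated word, minus letters in the template."""
    words = ["".join(parts) for parts in product(*template)]
    spelled_out = sum(len(word) for word in words)
    in_template = sum(len(morpheme) for slot in template for morpheme in slot)
    return spelled_out - in_template

# Hypothetical three-slot template (illustration only).
template = [["a", "wa"], ["li", "na"], ["soma", "penda"]]
print(robustness(template))  # 48
```

Here the 8 generated words take 64 letters written out individually, while the template itself holds 16 letters, so 48 letters are saved.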

4.2 In practice…
• Significant templates save from 200 to 5,000 letters.
• Ranking templates by this measure gives a good indication of how significant each one is in the overall morphology of the language.

Swahili (Bantu, East Africa)

The goal…
• Is to learn an FSA that matches what we know the morphology of Swahili to be.
• As a first approximation:

Swahili verb

Swahili verb: Subject marker

Swahili verb: Subject marker, Tense marker

Swahili verb: Subject marker, Tense marker, Object marker

Swahili verb: Subject marker, Tense marker, Object marker, Root

Swahili verb: Subject marker, Tense marker, Object marker, Root, Voice (active/passive)

Swahili verb: Subject marker, Tense marker, Object marker, Root, Voice (active/passive), Final vowel

Swahili verb: Subject marker, Tense marker, Object marker, Root, Voice (active/passive), choyoye, Final vowel

4.3 Top templates: 8,200 Swahili words

4.4 Precision and recall

5.1 Improvements through disambiguation
• When all of the final letters of the production of a (non-final) state S are identical, then there is an uncertainty in the analysis:

Can we distinguish grammatical from lexical morphemes?
In general, yes, based on the number of distinct morphemes generated by a state transition: more than 5 means lexical.
Better: use morpheme length and morpheme frequency in addition to the size of the arc production.
• When one set of morphemes is lexical and the other is grammatical, put the ambiguous material in the grammatical morphemes. Why? This keeps the number of letters in the morphology small(er).
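The simple size-based test can be written in a few lines. This sketch covers only the "more than 5" rule. The refinement using morpheme length and frequency is not specified in enough detail to sketch, and the function and parameter names are hypothetical.

```python
def classify_transition(morphemes, max_grammatical=5):
    """Label a transition by the number of distinct morphemes it generates."""
    return "lexical" if len(set(morphemes)) > max_grammatical else "grammatical"

# Hypothetical arc productions (illustration only).
print(classify_transition(["a", "wa", "ni", "tu"]))  # grammatical
print(classify_transition(["soma", "penda", "cheza", "lala", "imba", "pika"]))  # lexical
```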

When both sets are grammatical:
• Now we think of the cost of the labels on the FSA edges in terms of the encoding length of the pointer to the morpheme.
• We would rather have edges that point to high-frequency morphemes than low-frequency morphemes.
• The overall use of a string in the grammar plays the crucial role here.
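To illustrate why high-frequency morphemes are cheaper to point to, here is a sketch using the usual information-theoretic convention that a pointer costs the negative base-2 log of the morpheme's relative frequency, in bits. This formula is an assumption about what "encoding length" means here, and the counts are made up.

```python
import math

def pointer_cost(count, total):
    """Bits needed to point to a morpheme used `count` times out of `total` uses."""
    return -math.log2(count / total)

print(pointer_cost(4, 8))  # 1.0
print(pointer_cost(1, 8))  # 3.0
```

A morpheme that accounts for half of all uses costs 1 bit per pointer, while one used an eighth of the time costs 3 bits.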

Actual implementation
• We do not have access to any frequencies or probabilities yet (by construction).
• We associate with each morpheme m the total robustness of each of the templates in which it appears so far.
• If a word can be parsed in two ways, we choose the parse for which the sum of the robustness of the pieces is the greatest.
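The scoring rule above can be sketched as follows. The templates, their robustness values and the candidate parses are hypothetical, and the function names are assumptions. Each morpheme accumulates the robustness of every template containing it, and the parse with the highest total wins.

```python
from collections import defaultdict

def morpheme_scores(templates):
    """Sum, for each morpheme, the robustness of every template it appears in."""
    score = defaultdict(int)
    for slots, template_robustness in templates:
        for morpheme in {m for slot in slots for m in slot}:
            score[morpheme] += template_robustness
    return score

def best_parse(parses, score):
    """Pick the parse whose pieces have the greatest total robustness."""
    return max(parses, key=lambda parse: sum(score[m] for m in parse))

# Hypothetical templates paired with made-up robustness values.
templates = [
    ([["a", "wa"], ["li", "na"], ["soma", "penda"]], 48),
    ([["a", "wa"], ["na"], ["cheza", "lala"]], 20),
]
score = morpheme_scores(templates)
print(best_parse([("a", "n", "asoma"), ("a", "na", "soma")], score))
```

The parse a + na + soma scores 68 + 68 + 48 = 184, against 68 for a + n + asoma, so it is chosen.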

Example 1: Swahili

Collapsing templates to generate unseen words
Label each transition as grammatical or lexical. We consider collapsing pairs of 4-state FSAs.
Our conditions for collapsing:
• The two lexical transitions must share at least two stems.
• One pair of grammatical transitions must be identical.
• Other pair: symmetric difference
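One possible reading of these conditions, as a sketch. It assumes a template is a list of (kind, morpheme set) transitions and that a successful collapse takes the union of each pair of transitions. How the symmetric difference of the non-identical grammatical pair is used is not spelled out above, so the union step is an assumption, as are all names and data.

```python
def try_collapse(t1, t2, min_shared_stems=2):
    """Merge two templates if the collapsing conditions hold; otherwise return None."""
    if [kind for kind, _ in t1] != [kind for kind, _ in t2]:
        return None
    identical_grammatical = 0
    for (kind, m1), (_, m2) in zip(t1, t2):
        if kind == "lexical" and len(m1 & m2) < min_shared_stems:
            return None  # lexical transitions share too few stems
        if kind == "grammatical" and m1 == m2:
            identical_grammatical += 1
    if identical_grammatical < 1:
        return None  # no identical pair of grammatical transitions
    # Assumption: the collapsed template takes the union of each transition pair.
    return [(kind, m1 | m2) for (kind, m1), (_, m2) in zip(t1, t2)]

# Hypothetical 4-state templates (three transitions each).
t1 = [("grammatical", {"a", "wa"}), ("grammatical", {"li"}),
      ("lexical", {"soma", "penda", "cheza"})]
t2 = [("grammatical", {"a", "wa"}), ("grammatical", {"na"}),
      ("lexical", {"soma", "penda", "lala"})]
merged = try_collapse(t1, t2)
print(merged)
```

The merged template now generates combinations such as wa + li + lala that neither input template produced, which is how collapsing predicts unseen words.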


Adding “incomplete” stems
• Try to reparse each word in the corpus according to the current templates. Success will (may) mean hypothesizing a new stem T (a lexical morpheme). If creating stem T predicts the existence of 3+ words that truly exist, then we admit stem T.
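A simplified sketch of this step, assuming templates consist of grammatical prefix slots followed by a single stem slot with no suffixes. The corpus, slots and names are hypothetical. Each word is stripped of a known prefix sequence, the remainder becomes a candidate stem, and the stem is admitted if it combines with the known prefixes into at least 3 words found in the corpus.

```python
from itertools import product

def admit_new_stems(corpus, prefix_slots, known_stems, min_attested=3):
    """Return {stem: attested words} for new stems that predict enough real words."""
    corpus = set(corpus)
    prefixes = ["".join(parts) for parts in product(*prefix_slots)]
    candidates = set()
    for word in corpus:
        for prefix in prefixes:
            if word.startswith(prefix) and len(word) > len(prefix):
                stem = word[len(prefix):]
                if stem not in known_stems:
                    candidates.add(stem)
    admitted = {}
    for stem in candidates:
        attested = {prefix + stem for prefix in prefixes} & corpus
        if len(attested) >= min_attested:
            admitted[stem] = attested
    return admitted

# Hypothetical corpus and template slots (illustration only).
corpus = ["alisoma", "anasoma", "alicheza", "anacheza", "wanacheza", "walilala"]
admitted = admit_new_stems(corpus, [["a", "wa"], ["li", "na"]], {"soma"})
print(admitted)
```

The stem "cheza" is admitted because three predicted words are attested, while "lala" is rejected with only one.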

Results: Disambiguation
Training corpus: 7,180 distinct words of Swahili (50,000 running words).

Collapsed templates

Next steps:
• Integrate these sub-FSAs into a single FSA.
• Split some of the single state-transitions into sequences, e.g.,
