Words in puddles of sound • Padraic Monaghan, University of York • Morten Christiansen, Cornell University
Words in a “sea of sound” (Saffran, 2001) • Discovering words • from continuous speech • with no reliable cues to word boundaries (Jones, 1918; Liberman et al., 1967) • where words are realised variably (Pollack & Pickett, 1964)
Segmentation and sublexical cues • Final syllables of words are longer (Klatt, 1975) • hamster v. ham (Saffran, Newport, & Aslin, 1996; Salverda & McQueen, 2004) • First syllables of words are stressed ~60% of the time in English (Crystal & House, 1990; Pierrehumbert, 1981) • Johnson & Jusczyk (2001); Thiessen & Saffran (2003) • Certain diphones are more likely to occur across words than within words (Mattys et al., 2005) (see the sketch below)
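The diphone statistic in the last bullet can be estimated directly from a word-segmented corpus. Below is a minimal sketch (not from the slides) of one way to do it, assuming utterances are represented as lists of words and words as lists of phoneme symbols; the phoneme symbols in the toy corpus are purely illustrative.

```python
from collections import Counter

def diphone_boundary_stats(utterances):
    """For each diphone (pair of adjacent phonemes), estimate the proportion
    of its occurrences that straddle a word boundary rather than falling
    word-internally. `utterances` is a list of utterances; each utterance is
    a list of words; each word is a list of phoneme symbols."""
    within = Counter()
    across = Counter()
    for words in utterances:
        for word in words:
            for a, b in zip(word, word[1:]):        # word-internal diphones
                within[(a, b)] += 1
        for w1, w2 in zip(words, words[1:]):        # diphones spanning a word boundary
            across[(w1[-1], w2[0])] += 1
    return {d: across[d] / (across[d] + within[d])
            for d in set(within) | set(across)}

# Toy segmented corpus ("the kitty", "look kitty"); symbols are illustrative only
corpus = [[["dh", "ah"], ["k", "ih", "t", "iy"]],
          [["l", "uh", "k"], ["k", "ih", "t", "iy"]]]
print(diphone_boundary_stats(corpus))
```

Diphones with a high boundary proportion are the kind of sublexical cue a listener could exploit to hypothesise word breaks.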
Multiple cues in speech segmentation • Hierarchical model (Mattys, White, & Melhorn, 2005)
Puddles • whosalovelybabyyesyouareyourealovelybabyarentyouyesyouare • In 5.5M words of child-directed speech, many such unsegmented utterances recur verbatim (see the sketch below)
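As a rough illustration of the "puddle" idea, the sketch below counts how often whole unsegmented utterances recur in a transcript. The input format (one utterance per line) and the toy lines are assumptions, not the slides' corpus or counts.

```python
from collections import Counter

def utterance_recurrence(transcript_lines):
    """Collapse each utterance to its unsegmented form (spaces removed,
    lower-cased) and count how often the same 'puddle' recurs."""
    return Counter("".join(line.lower().split())
                   for line in transcript_lines if line.strip())

lines = ["whos a lovely baby", "yes you are",
         "youre a lovely baby arent you", "yes you are"]
for utt, n in utterance_recurrence(lines).most_common():
    print(n, utt)
```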
Lexical approach to segmentation • Once you’ve got the words, segmentation is easy (Norris, 1994; 2007) • Assume each utterance is a word • until you know differently • if it’s repeated, you keep it • if it doesn’t occur again, you lose it
Aims of Modelling • Utterances can't simply be used as words, because we don't know when an utterance is a single word and when it is several (Brent & Cartwright, 1996) • utterance boundaries are sufficient to get started • single-word utterances are useful anchors for segmentation • It is possible to distinguish (most) single-word from (most) multiple-word utterances • Proper nouns have a special role • Frequent multiple-word sequences will be "lexicalised" (Tomasello, 2001)
Lexical approach to segmentation • Familiar words are used for segmentation, e.g. the known name "maggie" (Bortfeld et al., 2005): • "maggie's bike had big, black wheels" • "hannah's cup was bright and shiny" • infants were familiarised to "bike" more quickly than to "cup" • Proper nouns often occur as utterances on their own: • 3.3% of utterances in the "naomi" corpus in CHILDES • Very high-frequency words are useful for categorising content words (Monaghan, Chater, & Christiansen, 2005; Redington, Chater, & Finch, 1998)
Corpora • 6 corpora from CHILDES: • child-directed speech to children aged < 2;6 • Orthographic transcriptions were run through the Festival speech synthesiser (Black et al., 1990)
The model: a worked example
• Utterances (unsegmented): kitty • thatsrightkitty • kitty • sayitagain • lookkitty
• The lexicon after each utterance, with counts decaying slightly between utterances (sketched in code below):
• "kitty" → LEXICON: kitty 1.00
• "thatsrightkitty" → "kitty" is recognised inside it and the residue "thatsright" is stored → LEXICON: kitty 1.99, thatsright 1.00
• "kitty" → LEXICON: kitty 2.98, thatsright 0.99
• "sayitagain" → no known word inside, stored whole → LEXICON: kitty 2.97, thatsright 0.98, sayitagain 1.00
• "lookkitty" → "kitty" recognised, residue "look" stored → LEXICON: kitty 3.96, thatsright 0.97, sayitagain 0.99, look 1.00
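A minimal Python sketch of the procedure traced above, assuming a per-utterance decay of 0.01 (the value implied by the counts in the trace) and greedy, longest-first matching of known candidates. Neither detail is spelled out on the slides, so treat this as one possible reading rather than the authors' implementation.

```python
DECAY = 0.01  # per-utterance decay, read off the counts in the trace above (an assumption)

def process_utterance(utterance, lexicon):
    """Update lexicon (candidate word -> activation) from one unsegmented utterance:
    1. decay every existing entry (and forget entries that reach zero),
    2. chunk out already-known candidates, longest first,
    3. increment recognised candidates and store left-over residues as new ones."""
    for w in list(lexicon):                                  # step 1: decay
        lexicon[w] -= DECAY
        if lexicon[w] <= 0:                                  # "if it doesn't occur again, you lose it"
            del lexicon[w]
    chunks = [utterance]
    for known in sorted(lexicon, key=len, reverse=True):     # step 2: longest known candidates first
        new_chunks = []
        for chunk in chunks:
            if chunk in lexicon:                             # already a recognised candidate
                new_chunks.append(chunk)
                continue
            while known in chunk:                            # split out every occurrence
                left, _, chunk = chunk.partition(known)
                if left:
                    new_chunks.append(left)
                new_chunks.append(known)
            if chunk:
                new_chunks.append(chunk)
        chunks = new_chunks
    for chunk in chunks:                                     # step 3: store chunks and residues
        lexicon[chunk] = lexicon.get(chunk, 0.0) + 1.0

lexicon = {}
for utt in ["kitty", "thatsrightkitty", "kitty", "sayitagain", "lookkitty"]:
    process_utterance(utt, lexicon)
print(lexicon)  # ~{'kitty': 3.96, 'thatsright': 0.97, 'sayitagain': 0.99, 'look': 1.0}
```

Running it over the five example utterances reproduces the final counts shown in the trace, up to floating-point rounding.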
More constraints in the model: Phonological glue
• Problem: recognising a known word (e.g. "oh") inside longer utterances such as okay, noway, nevertheless would fill the lexicon with spurious residues (kay, n, way, evertheless)
• Two constraints:
• Candidate words with recognised beginnings and endings are admitted
• Candidate words which divide a recognised word-internal diphone ("glue") are rejected
• Worked example with utterances oh • okay • no • nevertheless:
• "oh" → LEXICON: oh; GLUE: beginnings {oh}, endings {oh}, word-internal {oh}
• "okay" → splitting out "oh" would leave "kay", whose beginning is not yet recognised, so the split is rejected and "okay" is stored whole → LEXICON: oh, okay; GLUE: beginnings {oh, ka}, endings {oh, ay}, word-internal {oh, ok, ka, ay}
• "no" and "nevertheless" → likewise stored whole → LEXICON: oh, okay, no, nevertheless
• The admissibility check is sketched in code below
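The two glue constraints can be expressed as a simple admissibility check on a proposed split. The sketch below is an illustration under assumed data structures (sets of attested word-initial diphones, word-final diphones, and word-internal "glue" diphones); the phoneme symbols in the example are placeholders, not the slides' transcription.

```python
def admitted(candidate, beginnings, endings):
    """A candidate word is admitted if its initial and final diphones have
    been seen at the beginning/end of previously accepted candidates."""
    return tuple(candidate[:2]) in beginnings and tuple(candidate[-2:]) in endings

def legal_split(utterance, i, beginnings, endings, glue):
    """Splitting `utterance` (a list of phonemes) before position i is allowed
    only if the diphone spanning the split is not recognised word-internal
    'glue', and both halves have recognised beginnings and endings."""
    if tuple(utterance[i - 1:i + 1]) in glue:
        return False
    return (admitted(utterance[:i], beginnings, endings) and
            admitted(utterance[i:], beginnings, endings))

# Toy state after accepting "oh" and "okay" (symbols are placeholders)
beginnings = {("ow",), ("ow", "k")}
endings    = {("ow",), ("k", "ey")}
glue       = {("ow", "k"), ("k", "ey")}
print(legal_split(["ow", "k", "ey"], 1, beginnings, endings, glue))        # False: cuts glue inside "okay"
print(legal_split(["ow", "ow", "k", "ey"], 1, beginnings, endings, glue))  # True: "oh" + "okay"
```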
Testing the model
• Decisions (settings used):
• Internal diphone glue constraint: included
• Legal beginnings/endings constraint: included
• Decay rate: 0
• Ordering of lexicon: by length
• Accuracy: proportion of segmented words that are real words
• Completeness: proportion of real words that are segmented
• Baseline segmentation: correct number of words per utterance, with boundaries randomly positioned (Brent & Cartwright, 1996)
• Accuracy and completeness are sketched in code below
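Accuracy and completeness as defined on this slide can be computed by comparing the spans of segmented word tokens against a gold segmentation. The sketch below assumes a particular scoring convention, which the slides do not specify: a segmented token counts as correct only if both of its boundaries match the gold.

```python
def word_spans(words):
    """Convert a word sequence into (start, end) character spans over the
    unsegmented utterance."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    return set(spans)

def accuracy_completeness(segmented, gold):
    """Accuracy: proportion of segmented word tokens that are true words.
    Completeness: proportion of true word tokens that were segmented.
    Inputs are lists of utterances, each a list of words; a word token counts
    as correct when its span matches a gold span exactly (assumed convention)."""
    hits = seg_total = gold_total = 0
    for seg_utt, gold_utt in zip(segmented, gold):
        seg_spans, gold_spans = word_spans(seg_utt), word_spans(gold_utt)
        hits += len(seg_spans & gold_spans)
        seg_total += len(seg_spans)
        gold_total += len(gold_spans)
    return hits / seg_total, hits / gold_total

seg  = [["look", "kitty"], ["thatsright", "kitty"]]
gold = [["look", "kitty"], ["thats", "right", "kitty"]]
print(accuracy_completeness(seg, gold))  # (0.75, 0.6)
```

Under this convention, accuracy plays the role of precision and completeness the role of recall.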
Results: Accuracy t(5) = 19.637, p < .0001
Results: Completeness t(5) = 28.969, p < .0001
Results: Naomi’s Lexicon • Top 10 after 1K utterances: • Nomi • Say • No • Yes • The • Okay • Whatsthis • Blanket • Is • What
Results: Naomi’s Lexicon • Top 10 after 8K utterances: • You • Nomi • The • It • To • What • I • That’s • No • Your
Results: Naomi's Lexicon • [figures comparing the lexicon under 0.05 decay and under 0.01 decay]
Summary • Model based on puddles of sound • accurate and complete segmentation • relies on proper nouns as anchors • frequent words "pop" out • the same words are useful for grammatical categorisation • No mechanism for alternative, competing parses of speech • a first, cognitively plausible step for how a lexicon may be generated • Relative roles of phonological glue, legal boundaries, and sorting the lexicon by length/frequency remain to be explored