190 likes | 199 Views
Unsupervised Segmentation of Audio Speech using the Voting Experts Algorithm. Matthew Miller, Alexander Stoytchev Developmental Robotics Lab Department of Electrical and Computer Engineering Iowa State University mamille@cs.iastate.edu, alexs@iastate.edu www.cs.iastate.edu/~mamille/.
E N D
Unsupervised Segmentation of Audio Speech using the Voting Experts Algorithm Matthew Miller, Alexander Stoytchev Developmental Robotics Lab Department of Electrical and Computer Engineering Iowa State University mamille@cs.iastate.edu, alexs@iastate.edu www.cs.iastate.edu/~mamille/
Language: A Grand Challenge • A working example • Automatically acquires language • Well studied
Statistical Learning Experiments Saffran et. al. (1996): 8-month-olds can segment speech. Artificial Language: Acclimate tupiro golabu bedaku padoti Language: tu pi ro go la bu be da ku Transition Prob: 1.0 1.0 .25 1.0 1.0 .25 1.0 1.0 ... NovelWord • Hypothesis: Infants use local minima in single syllable transition probabilities to segment speech streams.
Voting Experts An algorithm for unsupervised segmentation Key Idea: Natural “chunks” have: Low Internal Information High Boundary Entropy itwasabrightcolddayinaprilandtheclockswere
Voting Experts An algorithm for unsupervised segmentation Key Idea: Natural “chunks” have: Low Internal Information High Boundary Entropy itwasabrightcolddayinaprilandtheclockswere
VE Implementation (Cohen 2006) Build an n-gram trie from text. Slide a window along the text sequence Two experts vote how to break the window One minimizes internal info Other maximizes boundary entropy Window i t w a s a b r i g h t c o l d d a y i n a p r i l 1
VE Implementation (Cohen 2006) Build an n-gram trie from text. Slide a window along the text sequence Two experts vote how to break the window One minimizes internal info Other maximizes boundary entropy Window i t w a s a b r i g h t c o l d d a y i n a p r i l 2
VE Implementation (Cohen 2006) Build an n-gram trie from text. Slide a window along the text sequence Two experts vote how to break the window One minimizes internal info Other maximizes boundary entropy Break at vote peaks 1 1 0 0 6 1 0 2 0 1 0 0 3 0 3 i | t | w | a | s | a | b | r | i | g | h | t | c | o | l | d i t w a s a b r i g h t c o l d d a y i n a p r i l
VE Results • Results are surprisingly good on text • Especially giving its simplicity • Accuracy and Hit rate about 75% • Seems to capture something about the nature of “chunks” • Can we use this algorithm to segment real audio? It was a br igh t
Acoustic Model • Cluster spectral features using a GGSOM
Acoustic Model • Cluster spectral features using a GGSOM • Collapse state sequence
Acoustic Model • Cluster spectral features using a GGSOM • Collapse state sequence • Run VE to get breaks
Experiments and Results • Used the model to segment “1984” • CD 1 of audio book (40 mins) • Chosen for length, consistency • Evaluation: Human graders
New Experiments • Trained on infant datasets • Tested on manually generated keys Train Train Stream A: Acoustic Model A VE Model A Key A tupiro golabu bedaku padoti Test Test Test Test Stream B: Acoustic Model B VE Model B Key B dapiku tilado pagotu burobi Train Train
New Experiments • Trained on infant datasets • Tested on manually generated keys Stream A: Acoustic Model A VE Model A Key B tupiro golabu bedaku padoti Test Test Test Test Stream B: Acoustic Model B VE Model B Key A dapiku tilado pagotu burobi
Results • Experiment 1 • Accuracy: 50% on all induced breaks • Hit Rate: 75% of word breaks • Significantly better than chance • Experiment 2 • Accuracy: 16% on all induced breaks • Hit Rate: 1% of word breaks • Worse than chance • 18 breaks, 3 correct
Conclusions and Future Work • VE Model can be used to segment audio • Can reproduce the results of Infant studies • May model part of the human chunking mechanism • Have built more sophisticated acoustic models • Better results (nearly perfect)
Thank You • www.cs.iastate.edu/~mamille/