
Unsupervised Segmentation of Audio Speech using the Voting Experts Algorithm

An overview of the Voting Experts (VE) algorithm for unsupervised segmentation of audio speech streams, replicating infant language-acquisition studies. An acoustic model converts raw audio into a discrete symbol sequence, and a VE model then segments it. Key applications include speech analysis and artificial language learning.


Presentation Transcript


  1. Unsupervised Segmentation of Audio Speech using the Voting Experts Algorithm Matthew Miller, Alexander Stoytchev Developmental Robotics Lab Department of Electrical and Computer Engineering Iowa State University mamille@cs.iastate.edu, alexs@iastate.edu www.cs.iastate.edu/~mamille/

  2. Language: A Grand Challenge • We have a working example: infants automatically acquire language • And that example is well studied

  3. Statistical Learning Experiments • Saffran et al. (1996): 8-month-olds can segment speech • Artificial language words: tupiro golabu bedaku padoti • Test items: words vs. novel words
     Syllables:         tu   pi   ro   go   la   bu   be   da   ku  ...
     Transition prob.:     1.0  1.0  .25  1.0  1.0  .25  1.0  1.0  ...
     • Hypothesis: Infants use local minima in syllable-to-syllable transition probabilities to segment speech streams.
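
To make the transition-probability cue concrete, here is a minimal Python sketch (not from the talk): it builds a syllable stream from the four artificial words and estimates P(next syllable | current syllable). The helper name transition_probs, the stream length, and the random seed are illustrative choices.

```python
# Sketch of the transition-probability cue from Saffran et al. (1996).
# The stream construction and helper names are illustrative, not from the slides.
import random
from collections import Counter

def transition_probs(syllables):
    """P(next syllable | current syllable), estimated from adjacent pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

words = ["tupiro", "golabu", "bedaku", "padoti"]       # artificial words from the slide
random.seed(0)
sequence = [random.choice(words) for _ in range(300)]  # continuous stream, no pauses
syllables = [w[i:i + 2] for w in sequence for i in (0, 2, 4)]

tp = transition_probs(syllables)
# Within-word transitions (tu->pi, pi->ro, ...) come out near 1.0; transitions that
# cross a word boundary drop toward 0.25, the local minima the hypothesis points to.
for pair, p in sorted(tp.items(), key=lambda kv: -kv[1]):
    print(f"{pair[0]} -> {pair[1]}: {p:.2f}")
```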

  4. Voting Experts • An algorithm for unsupervised segmentation • Key idea: natural “chunks” have low internal information and high boundary entropy • Example text: itwasabrightcolddayinaprilandtheclockswere

  5. Voting Experts • An algorithm for unsupervised segmentation • Key idea: natural “chunks” have low internal information and high boundary entropy • Example text: itwasabrightcolddayinaprilandtheclockswere
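
A rough illustration of the two scores, assuming character n-gram counts as the underlying statistics. The simplified formulas (negative log probability for internal information, successor-character entropy for boundary entropy) and the function names are mine, not the authors' exact definitions.

```python
# Sketch of the two chunk scores named on the slide, estimated from character
# n-gram counts. Simplified formulas; names are illustrative.
import math
from collections import Counter

text = "itwasabrightcolddayinaprilandtheclockswere"
MAX_N = 4
ngrams = Counter(text[i:i + n] for n in range(1, MAX_N + 1)
                 for i in range(len(text) - n + 1))
total_chars = len(text)

def internal_information(chunk):
    """Simplified internal information: -log P(chunk). Low = characters belong together."""
    return -math.log(ngrams[chunk] / total_chars) if ngrams[chunk] else float("inf")

def boundary_entropy(chunk):
    """Entropy of the character that follows the chunk. High = likely chunk boundary."""
    followers = Counter(text[i + len(chunk)]
                        for i in range(len(text) - len(chunk))
                        if text[i:i + len(chunk)] == chunk)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in followers.values())

# The corpus here is tiny, so the counts are sparse; the point is only the shape
# of the two computations, not the particular values they return.
for chunk in ("was", "sab", "the"):
    print(chunk, internal_information(chunk), boundary_entropy(chunk))
```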

  6. VE Implementation (Cohen 2006) • Build an n-gram trie from the text • Slide a window along the text sequence • Two experts vote on how to break the window: one minimizes internal info, the other maximizes boundary entropy
     Window (step 1): i t w a s a b r i g h t c o l d d a y i n a p r i l

  7. VE Implementation (Cohen 2006) • Build an n-gram trie from the text • Slide a window along the text sequence • Two experts vote on how to break the window: one minimizes internal info, the other maximizes boundary entropy
     Window (step 2): i t w a s a b r i g h t c o l d d a y i n a p r i l

  8. VE Implementation (Cohen 2006) • Build an n-gram trie from the text • Slide a window along the text sequence • Two experts vote on how to break the window: one minimizes internal info, the other maximizes boundary entropy • Break at vote peaks
     Candidate boundaries: i|t|w|a|s|a|b|r|i|g|h|t|c|o|l|d
     Votes per boundary:    1 1 0 0 6 1 0 2 0 1 0 0 3 0 3
     Full text: i t w a s a b r i g h t c o l d d a y i n a p r i l
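
The voting loop on slides 6-8 can be sketched end to end as follows. This is a simplified reading of Cohen's procedure: the window length, the lack of score standardization, the threshold, and the peak rule for placing breaks are assumptions made for brevity, not the published implementation, and the function and variable names are mine.

```python
# Compact sketch of the Voting Experts loop: slide a window, let two experts
# vote for the best split inside it, then break at vote peaks. Simplified.
import math
from collections import Counter

def voting_experts(text, window=7, max_n=7):
    ngrams = Counter(text[i:i + n] for n in range(1, max_n + 1)
                     for i in range(len(text) - n + 1))
    total = len(text)

    def info(s):                      # internal information of a chunk (simplified)
        return -math.log(ngrams[s] / total)

    def entropy(s):                   # boundary entropy after a chunk
        followers = Counter(text[i + len(s)] for i in range(len(text) - len(s))
                            if text[i:i + len(s)] == s)
        n = sum(followers.values())
        return -sum(c / n * math.log(c / n) for c in followers.values()) if n else 0.0

    votes = [0] * (len(text) + 1)
    for start in range(len(text) - window + 1):
        w = text[start:start + window]
        # Each expert votes for one split point inside the window.
        best_info = min(range(1, window), key=lambda k: info(w[:k]) + info(w[k:]))
        best_ent = max(range(1, window), key=lambda k: entropy(w[:k]))
        votes[start + best_info] += 1
        votes[start + best_ent] += 1

    # Break wherever the vote count is a local peak above a small threshold.
    breaks = [i for i in range(1, len(text))
              if votes[i] >= 3 and votes[i] >= votes[i - 1] and votes[i] >= votes[i + 1]]
    pieces, prev = [], 0
    for b in breaks + [len(text)]:
        pieces.append(text[prev:b])
        prev = b
    return pieces

print(voting_experts("itwasabrightcolddayinaprilandtheclockswerestrikingthirteen"))
```

On a string this short the n-gram statistics are thin, so the exact splits are not meaningful; with a larger corpus the vote peaks line up much better with word boundaries, which is the behavior the next slide reports.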

  9. VE Results • Results are surprisingly good on text • Especially given its simplicity • Accuracy and hit rate are both about 75% • Seems to capture something about the nature of “chunks” • Can we use this algorithm to segment real audio? Example segmentation: It was a br igh t

  10. Acoustic Model

  11. Acoustic Model • Cluster spectral features using a GGSOM

  12. Acoustic Model • Cluster spectral features using a GGSOM • Collapse state sequence

  13. Acoustic Model • Cluster spectral features using a GGSOM • Collapse state sequence • Run VE to get breaks
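
A hypothetical sketch of this front end, with stand-ins for the pieces the slides name: MFCCs computed with librosa stand in for the spectral features, and scikit-learn's k-means stands in for the GGSOM clustering used in the talk. The file name speech.wav, the sample rate, and the cluster count are placeholders.

```python
# Sketch of the acoustic front end on slides 10-13, with stand-ins:
# MFCCs via librosa for the spectral features, k-means instead of the GGSOM.
import itertools
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("speech.wav", sr=16000)            # placeholder audio file
feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # one feature vector per frame

# Step 1: cluster the spectral feature vectors into discrete "states".
labels = KMeans(n_clusters=32, n_init=10, random_state=0).fit_predict(feats)

# Step 2: collapse runs of repeated states into a single symbol each.
symbols = [k for k, _ in itertools.groupby(labels)]

# Step 3: hand the symbol sequence to Voting Experts (e.g. the sketch above)
# to induce break points, which map back to time via the frame positions.
print(len(labels), "frames ->", len(symbols), "symbols")
```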

  14. Experiments and Results • Used the model to segment “1984” • CD 1 of the audiobook (40 mins) • Chosen for its length and consistency • Evaluation: human graders

  15. New Experiments • Trained on infant datasets • Tested against manually generated keys
     Stream A (tupiro golabu bedaku padoti): train Acoustic Model A and VE Model A, test against Key A
     Stream B (dapiku tilado pagotu burobi): train Acoustic Model B and VE Model B, test against Key B

  16. New Experiments • Trained on infant datasets • Tested against manually generated keys, this time swapped across streams
     Stream A (tupiro golabu bedaku padoti): Acoustic Model A and VE Model A tested against Key B
     Stream B (dapiku tilado pagotu burobi): Acoustic Model B and VE Model B tested against Key A

  17. Results • Experiment 1 • Accuracy: 50% of induced breaks fell on true word breaks • Hit rate: 75% of word breaks were found • Significantly better than chance • Experiment 2 • Accuracy: 16% of induced breaks • Hit rate: 1% of word breaks • Worse than chance (18 induced breaks, 3 correct)
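
The accuracy and hit-rate figures read like precision- and recall-style scores over break positions. Below is a minimal sketch of scoring in that style; it is an assumed formulation (the audiobook evaluation on slide 14 used human graders), and the break positions are invented for illustration.

```python
# Sketch of the two scores on slide 17: "accuracy" as the fraction of induced
# breaks that land on true word boundaries, "hit rate" as the fraction of true
# boundaries that were found. Break positions below are illustrative only.
def score(induced, true, tolerance=0):
    hit = lambda b, ref: any(abs(b - r) <= tolerance for r in ref)
    accuracy = sum(hit(b, true) for b in induced) / len(induced) if induced else 0.0
    hit_rate = sum(hit(r, induced) for r in true) / len(true) if true else 0.0
    return accuracy, hit_rate

acc, hr = score(induced=[3, 7, 12, 20], true=[3, 9, 12, 15, 20])
print(f"accuracy {acc:.0%}, hit rate {hr:.0%}")
```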

  18. Conclusions and Future Work • The VE model can be used to segment audio • It can reproduce the results of the infant studies • It may model part of the human chunking mechanism • We have since built more sophisticated acoustic models, with better (nearly perfect) results

  19. Thank You • www.cs.iastate.edu/~mamille/
