Toward Zero Resources (or how to get something from nothing) • Towards Spoken Term Discovery at Scale with Zero Resources • Jansen, Church & Hermansky • Interspeech-2010 • NLP on Spoken Documents Without ASR • Dredze, Jansen, Coppersmith & Church • EMNLP-2010
We Don’t Need Speech Recognition To Process Speech • At least for some tasks
Linking without Labeling • ASR = Linking + Labeling • Linking: find repetitions • Labeling: assign text strings • BOW (Bag of Words) → BOP (Bag of Pseudo-Terms) • Pseudo-Terms: Linking (without Labeling) • BOP: Sufficient for many NLP tasks
Speech Processing Chain • [Pipeline diagram: Speech Collection → Speech Recognition → Full / Manual Transcripts → Text Processing → Bag of Words Representation → Information Retrieval, Corpus Organization, Information Extraction, Sentiment Analysis] • This Talk: the Bag of Words representation is good enough for many tasks
Our Goal • Instead of Speech Recognition (label segments with text → full transcripts → text processing → extract features → BOW), link audio segments directly • Link segments: find long (1 s) repetitions (Interspeech-2010) • Extract features from the linked segments: BOW → BOP, e.g. 0 0 1 0 0 1 1 1 (EMNLP-2010)
Definitions • Towards: • Not there yet • Zero Resources: • No nothing (no knowledge of language/domain) • The next crisis will be where we are least prepared • No training data, no dictionaries, no models, no linguistics • Low Resources: A little more than zero • Spoken Term Discovery (Linking without Labeling) • Spoken Term Detection (Word Spotting): Standard • Find instances of spoken phrase in spoken document • Input: spoken phrase + spoken document • Spoken Term Discovery: Non-standard task • Input: spoken document (without spoken phrase) • Output: spoken phrases (interesting repeated intervals in document)
What makes an interval of speech interesting? • Cues from text processing: • Long (~ 1 sec such as “The Ed Sullivan Show”) • Repeated • Bursty (tf * IDF) • tf: lots of repetitions within a particular document • IDF: with relatively few repetitions across other documents • Unique to speech processing: • Given-New: • First mention is articulated more carefully than subsequent • Dialog between two parties (A & B): • A: utters an important phrase • B: what? • A: repeats the important phrase
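The burstiness cue above is the familiar tf * IDF weighting from text retrieval. A minimal sketch in Python, with illustrative names and smoothing (this is not code from the papers):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    """Burstiness score for one term in one document.

    doc_tokens: list of tokens for the document of interest.
    all_docs:   list of token lists, one per document in the collection.
    The (1 + df) smoothing is an illustrative choice, not from the papers.
    """
    tf = Counter(doc_tokens)[term]                 # lots of repetitions within this document
    df = sum(1 for d in all_docs if term in d)     # how many documents contain the term
    idf = math.log(len(all_docs) / (1 + df))       # few repetitions elsewhere -> high IDF
    return tf * idf
```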
Related Work (Mostly Speech Literature and Mostly from Boston) • Other approaches • Phone recognition (Lincoln Labs) • Use existing phone recognizers to create phone n-grams for topic classification • Hazen et al., 2007, 2008 • Self-organizing units (BBN) • Unsupervised discovery of phone-like units for topic classification • Garcia and Gish, 2006; Siu et al., 2010 • Find recurring patterns of speech (MIT-CSAIL) • Park and Glass, 2006, 2008 • Similar goals • Audio summarization without ASR • Finds similar regions to include in summary • Zhu, 2009 (ACL)
n² Time & Space • But the constants are attractive • Sparsity • Redesigned algorithms to take advantage of sparsity • Median Filtering • Hough Transform • Line Segment Search
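For context, the Interspeech-2010 system scans a frame-by-frame acoustic similarity matrix for long near-diagonal runs, which is where the n² cost and the sparsity tricks above come in. Below is a toy, dense O(n²) sketch of that diagonal-run search, assuming cosine-similarity frame features at roughly 100 frames/sec; it does not reproduce the paper's median filtering or Hough-style line-segment search, and the function and parameter names are hypothetical:

```python
import numpy as np

def find_diagonal_runs(feats_a, feats_b, sim_thresh=0.9, min_len=100):
    """Toy diagonal-run search over a thresholded similarity matrix.

    feats_a, feats_b: (n_frames, dim) arrays of per-frame acoustic features
    (~100 frames/sec, so min_len=100 is about 1 second). This dense version
    is only an illustration; the paper exploits sparsity to make it tractable.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = (a @ b.T) > sim_thresh                   # sparse binary "dotplot"
    runs = []
    for offset in range(-sim.shape[0] + 1, sim.shape[1]):
        diag = np.diagonal(sim, offset=offset)
        start = None
        for i, hit in enumerate(np.append(diag, False)):
            if hit and start is None:
                start = i                          # a run of matching frames begins
            elif not hit and start is not None:
                if i - start >= min_len:           # keep only long (~1 s) repetitions
                    runs.append((offset, start, i))
                start = None
    return runs
```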
Representations for Learning • Back to NLP… • Group matched segments into Pseudo-Terms • BOW (bag of words) → BOP (bag of pseudo-terms) • [Diagram: matched segments → feature vectors, e.g. 0 0 1 0 0 1 1 1]
Creating Pseudo-Terms • [Diagram: matched segment pairs grouped into pseudo-terms P1, P2, P3]
Example Pseudo-Terms • term_5: our_life_insurance • term_6: term • term_63: life_insurance • term_113: how_much_we • term_114: long_term • term_115: budget_for • term_116: our_life_insurance • term_117: budget • term_118: end_of_the_month • term_119: stay_within_a_certain • term_120: you_know • term_121: have_to • term_122: certain_budget
Graph Based Clustering • Nodes: each matched audio segment • Edges: edge between two segments if fractional overlap exceeds a threshold • Extract connected components of the graph (a sketch follows below) • This work: one pseudo-term for each connected component • Future work: better graph clustering algorithms • [Example: matched segments such as "keep track", "keep track of", "a paper", "newspaper", "newspapers" grouped into Pseudo-term 1 and Pseudo-term 2]
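A minimal sketch of the connected-components step described above, assuming matched segments arrive as (start, end) intervals and that fractional overlap is measured against the shorter segment (an assumption, not specified on the slide). Union-find stands in for the "better graph clustering algorithms" left to future work:

```python
def cluster_pseudo_terms(segments, overlap_thresh=0.5):
    """Group matched segments into pseudo-terms via connected components.

    segments: list of (start, end) intervals in seconds (hypothetical format).
    Each connected component of the overlap graph becomes one pseudo-term.
    """
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]          # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    def frac_overlap(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        return inter / min(a[1] - a[0], b[1] - b[0])

    # Add an edge whenever fractional overlap exceeds the threshold (O(n^2) here)
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if frac_overlap(segments[i], segments[j]) > overlap_thresh:
                union(i, j)

    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())                 # one pseudo-term per component
```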
Tradeoff in Cluster Quality • We need to find the right tradeoff for our task • Similarity threshold: smaller → fewer pseudo-terms (e.g. term_5, term_63, term_116 — our_life_insurance / life_insurance — collapse together); larger → more pseudo-terms • Select tradeoff based on dev data (a threshold-sweep sketch follows below)
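One hedged way to "select the tradeoff based on dev data" is a simple sweep over candidate thresholds, scoring each by downstream dev-set accuracy. The callables below (build_features, dev_accuracy) are hypothetical stand-ins for the pseudo-term pipeline and a dev-set evaluation, not the papers' actual selection procedure:

```python
def select_threshold(thresholds, build_features, dev_accuracy):
    """Pick the similarity threshold that maximizes dev-set accuracy.

    build_features(t): builds BOP features for threshold t (hypothetical).
    dev_accuracy(features): scores those features on dev data (hypothetical).
    """
    scored = [(dev_accuracy(build_features(t)), t) for t in thresholds]
    best_score, best_t = max(scored)
    return best_t, best_score
```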
Feature Vectors: BOW → BOP • Example: "Four score and seven years is a lot of years." • BOW: counts over transcript words (four: 1, score: 1, seven: 1, years: 2, …) • BOP: counts over discovered pseudo-terms (e.g. term_12: 2, term_5: 1, …) • Question: are pseudo-terms good enough?
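A tiny illustration of the two representations for the example sentence above; the pseudo-term IDs and counts are illustrative, not actual output from the papers:

```python
from collections import Counter

# Bag of words over the manual transcript of the example sentence
transcript = "four score and seven years is a lot of years".split()
bow = Counter(transcript)            # e.g. bow["years"] == 2

# Bag of pseudo-terms over the discovered (linked but unlabeled) segments;
# the IDs below are illustrative placeholders
discovered = ["term_12", "term_5", "term_12"]
bop = Counter(discovered)            # e.g. bop["term_12"] == 2

print(bow)
print(bop)
```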
Evaluation: Data • Switchboard telephone speech corpus • 600 conversation sides, 6 topics, 60+ hours of audio • Topics: family life, news media, public education, exercise, pets, taxes • Identify all pairs of matched regions • Graph clustering to produce pseudo-terms • O(n²) on 60+ hours is a lot! • Efficient algorithms and sparsity: not as bad as you think • 500-terapixel dotplot from 60+ hours of speech • Compute time: 100 cores, 5 hours
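A back-of-envelope check on the dotplot size, assuming roughly 100 acoustic frames per second (a common frame rate, not stated on the slide):

```python
# 60+ hours of audio at ~100 frames/sec, compared all-against-all
hours = 60
frames = hours * 3600 * 100          # ~21.6 million frames
pixels = frames ** 2                 # full n^2 frame-by-frame similarity matrix
print(pixels / 1e12, "terapixels")   # ~467, i.e. roughly 500 terapixels
```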
Evaluation • Representations • Manual transcripts as bag of words • Requires full speech recognition • Pseudo-terms • Requires acoustic model
Two Evaluation Tasks • Topic clustering (unsupervised) • Automatically discover latent topics in conversations • Standard clusterer given correct number of topics • Topic classification (supervised) • Learn topic labels from supervised data • Several classification algorithms • CW (Dredze et al., 2008) • MaxEnt • 10-fold CV
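A hedged sketch of the supervised task with 10-fold cross-validation, using scikit-learn's LogisticRegression as a stand-in for MaxEnt (the CW confidence-weighted learner from the paper is not part of scikit-learn and is not sketched here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_topic_classification(X, y):
    """10-fold cross-validated topic classification accuracy.

    X: (n_conversation_sides, n_features) BOW or BOP count matrix.
    y: topic labels (6 Switchboard topics in the paper's setup).
    """
    clf = LogisticRegression(max_iter=1000)      # MaxEnt stand-in
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV
    return scores.mean(), scores.std()
```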
Future Directions (More something from nothing) • Extend NLP of speech to new areas • Languages, domains, settings where we have little data for speech recognition • BOW (BOP) sufficient for many NLP tasks • BOW (BOP) → TF*IDF! • Lingering Questions • What else can we do? • Topic models? • Information extraction? • Information retrieval? • …