Toward Zero Resources (or how to get something from nothing) • Towards Spoken Term Discovery at Scale with Zero Resources • Jansen, Church & Hermansky • Interspeech-2010 • NLP on Spoken Documents Without ASR • Dredze, Jansen, Coppersmith & Church • EMNLP-2010
We Don’t Need Speech Recognition To Process Speech • At least for some tasks
Linking without Labeling • ASR = Linking + Labeling • Linking: find repetitions • Labeling: assign text strings • BOW (Bag of Words) → BOP (Bag of Pseudo-Terms) • Pseudo-Terms: Linking (without Labeling) • BOP: Sufficient for many NLP tasks
Speech Processing Chain • [Pipeline diagram: Speech Collection → Speech Recognition → Full / Manual Transcripts → Text Processing → Bag of Words Representation → Information Retrieval, Corpus Organization, Information Extraction, Sentiment Analysis] • This Talk: the Bag of Words representation is good enough for many tasks
Our Goal • Instead of Speech Recognition (label segments with text → full transcripts → text processing → extract features → BOW), link audio segments directly • Link segments: find long (1 s) repetitions (Interspeech-2010) • Extract features from the linked segments: BOW → BOP, e.g. 0 0 1 0 0 1 1 1 (EMNLP-2010)
Definitions • Towards: • Not there yet • Zero Resources: • No nothing (no knowledge of language/domain) • The next crisis will be where we are least prepared • No training data, no dictionaries, no models, no linguistics • Low Resources: A little more than zero • Spoken Term Discovery (Linking without Labeling) • Spoken Term Detection (Word Spotting): Standard • Find instances of spoken phrase in spoken document • Input: spoken phrase + spoken document • Spoken Term Discovery: Non-standard task • Input: spoken document (without spoken phrase) • Output: spoken phrases (interesting repeated intervals in document)
What makes an interval of speech interesting? • Cues from text processing: • Long (~ 1 sec such as “The Ed Sullivan Show”) • Repeated • Bursty (tf * IDF) • tf: lots of repetitions within a particular document • IDF: with relatively few repetitions across other documents • Unique to speech processing: • Given-New: • First mention is articulated more carefully than subsequent • Dialog between two parties (A & B): • A: utters an important phrase • B: what? • A: repeats the important phrase
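The burstiness cue above is the familiar tf * IDF weighting from text retrieval. A minimal sketch in Python, with illustrative names and smoothing (this is not code from the papers):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    """Burstiness score for one term in one document.

    doc_tokens: list of tokens for the document of interest.
    all_docs:   list of token lists, one per document in the collection.
    The (1 + df) smoothing is an illustrative choice, not from the papers.
    """
    tf = Counter(doc_tokens)[term]                 # lots of repetitions within this document
    df = sum(1 for d in all_docs if term in d)     # how many documents contain the term
    idf = math.log(len(all_docs) / (1 + df))       # few repetitions elsewhere -> high IDF
    return tf * idf
```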
Related Work (Mostly Speech Literature and Mostly from Boston) • Other approaches • Phone recognition (Lincoln Labs) • Use existing phone recognizers to create phone n-grams for topic classification • Hazen et al., 2007, 2008 • Self-organizing units (BBN) • Unsupervised discovery of phone-like units for topic classification • Garcia and Gish, 2006; Siu et al., 2010 • Find recurring patterns of speech (MIT-CSAIL) • Park and Glass, 2006, 2008 • Similar goals • Audio summarization without ASR • Finds similar regions to include in summary • Zhu, 2009 (ACL)
n² Time & Space • But the constants are attractive • Sparsity • Redesigned algorithms to take advantage of sparsity • Median Filtering • Hough Transform • Line Segment Search
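For context, the Interspeech-2010 system scans a frame-by-frame acoustic similarity matrix for long near-diagonal runs, which is where the n² cost and the sparsity tricks above come in. Below is a toy, dense O(n²) sketch of that diagonal-run search, assuming cosine-similarity frame features at roughly 100 frames/sec; it does not reproduce the paper's median filtering or Hough-style line-segment search, and the function and parameter names are hypothetical:

```python
import numpy as np

def find_diagonal_runs(feats_a, feats_b, sim_thresh=0.9, min_len=100):
    """Toy diagonal-run search over a thresholded similarity matrix.

    feats_a, feats_b: (n_frames, dim) arrays of per-frame acoustic features
    (~100 frames/sec, so min_len=100 is about 1 second). This dense version
    is only an illustration; the paper exploits sparsity to make it tractable.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = (a @ b.T) > sim_thresh                   # sparse binary "dotplot"
    runs = []
    for offset in range(-sim.shape[0] + 1, sim.shape[1]):
        diag = np.diagonal(sim, offset=offset)
        start = None
        for i, hit in enumerate(np.append(diag, False)):
            if hit and start is None:
                start = i                          # a run of matching frames begins
            elif not hit and start is not None:
                if i - start >= min_len:           # keep only long (~1 s) repetitions
                    runs.append((offset, start, i))
                start = None
    return runs
```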
Representations for Learning • Back to NLP… • Group matched segments into Pseudo-Terms • BOW (bag of words) → BOP (bag of pseudo-terms) • [Diagram: matched segments → feature vectors, e.g. 0 0 1 0 0 1 1 1]
Creating Pseudo-Terms • [Diagram: matched segment pairs grouped into pseudo-terms P1, P2, P3]
Example Pseudo-Terms • term_5: our_life_insurance • term_6: term • term_63: life_insurance • term_113: how_much_we • term_114: long_term • term_115: budget_for • term_116: our_life_insurance • term_117: budget • term_118: end_of_the_month • term_119: stay_within_a_certain • term_120: you_know • term_121: have_to • term_122: certain_budget
Graph Based Clustering • Nodes: each matched audio segment • Edges: edge between two segments if fractional overlap exceeds a threshold • Extract connected components of the graph (a sketch follows below) • This work: one pseudo-term for each connected component • Future work: better graph clustering algorithms • [Example: matched segments such as "keep track", "keep track of", "a paper", "newspaper", "newspapers" grouped into Pseudo-term 1 and Pseudo-term 2]
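A minimal sketch of the connected-components step described above, assuming matched segments arrive as (start, end) intervals and that fractional overlap is measured against the shorter segment (an assumption, not specified on the slide). Union-find stands in for the "better graph clustering algorithms" left to future work:

```python
def cluster_pseudo_terms(segments, overlap_thresh=0.5):
    """Group matched segments into pseudo-terms via connected components.

    segments: list of (start, end) intervals in seconds (hypothetical format).
    Each connected component of the overlap graph becomes one pseudo-term.
    """
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]          # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    def frac_overlap(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        return inter / min(a[1] - a[0], b[1] - b[0])

    # Add an edge whenever fractional overlap exceeds the threshold (O(n^2) here)
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if frac_overlap(segments[i], segments[j]) > overlap_thresh:
                union(i, j)

    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())                 # one pseudo-term per component
```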
Tradeoff in Cluster Quality • We need to find the right tradeoff for our task • Similarity threshold: smaller → fewer pseudo-terms (e.g. term_5, term_63, term_116 — our_life_insurance / life_insurance — collapse together); larger → more pseudo-terms • Select tradeoff based on dev data (a threshold-sweep sketch follows below)
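One hedged way to "select the tradeoff based on dev data" is a simple sweep over candidate thresholds, scoring each by downstream dev-set accuracy. The callables below (build_features, dev_accuracy) are hypothetical stand-ins for the pseudo-term pipeline and a dev-set evaluation, not the papers' actual selection procedure:

```python
def select_threshold(thresholds, build_features, dev_accuracy):
    """Pick the similarity threshold that maximizes dev-set accuracy.

    build_features(t): builds BOP features for threshold t (hypothetical).
    dev_accuracy(features): scores those features on dev data (hypothetical).
    """
    scored = [(dev_accuracy(build_features(t)), t) for t in thresholds]
    best_score, best_t = max(scored)
    return best_t, best_score
```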
Feature Vectors: BOW → BOP • Example: "Four score and seven years is a lot of years." • BOW: counts over transcript words (four: 1, score: 1, seven: 1, years: 2, …) • BOP: counts over discovered pseudo-terms (e.g. term_12: 2, term_5: 1, …) • Question: are pseudo-terms good enough?
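A tiny illustration of the two representations for the example sentence above; the pseudo-term IDs and counts are illustrative, not actual output from the papers:

```python
from collections import Counter

# Bag of words over the manual transcript of the example sentence
transcript = "four score and seven years is a lot of years".split()
bow = Counter(transcript)            # e.g. bow["years"] == 2

# Bag of pseudo-terms over the discovered (linked but unlabeled) segments;
# the IDs below are illustrative placeholders
discovered = ["term_12", "term_5", "term_12"]
bop = Counter(discovered)            # e.g. bop["term_12"] == 2

print(bow)
print(bop)
```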
Evaluation: Data • Switchboard telephone speech corpus • 600 conversation sides, 6 topics, 60+ hours of audio • Topics: family life, news media, public education, exercise, pets, taxes • Identify all pairs of matched regions • Graph clustering to produce pseudo-terms • O(n²) on 60+ hours is a lot! • Efficient algorithms and sparsity: not as bad as you think • 500-terapixel dotplot from 60+ hours of speech • Compute time: 100 cores, 5 hours
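A back-of-envelope check on the dotplot size, assuming roughly 100 acoustic frames per second (a common frame rate, not stated on the slide):

```python
# 60+ hours of audio at ~100 frames/sec, compared all-against-all
hours = 60
frames = hours * 3600 * 100          # ~21.6 million frames
pixels = frames ** 2                 # full n^2 frame-by-frame similarity matrix
print(pixels / 1e12, "terapixels")   # ~467, i.e. roughly 500 terapixels
```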
Evaluation • Representations • Manual transcripts as bag of words • Requires full speech recognition • Pseudo-terms • Requires acoustic model
Two Evaluation Tasks • Topic clustering (unsupervised) • Automatically discover latent topics in conversations • Standard clusterer given correct number of topics • Topic classification (supervised) • Learn topic labels from supervised data • Several classification algorithms • CW (Dredze et al., 2008) • MaxEnt • 10-fold CV
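A hedged sketch of the supervised task with 10-fold cross-validation, using scikit-learn's LogisticRegression as a stand-in for MaxEnt (the CW confidence-weighted learner from the paper is not part of scikit-learn and is not sketched here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_topic_classification(X, y):
    """10-fold cross-validated topic classification accuracy.

    X: (n_conversation_sides, n_features) BOW or BOP count matrix.
    y: topic labels (6 Switchboard topics in the paper's setup).
    """
    clf = LogisticRegression(max_iter=1000)      # MaxEnt stand-in
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV
    return scores.mean(), scores.std()
```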
Future Directions (More something from nothing) • Extend NLP of speech to new areas • Languages, domains, settings where we have little data for speech recognition • BOW (BOP) sufficient for many NLP tasks • BOW (BOP) → TF*IDF! • Lingering Questions • What else can we do? • Topic models? • Information extraction? • Information retrieval? • …