The goal of the MALACH project is to improve access to large multilingual spoken word collections by leveraging the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews. This project focuses on acquisition, segmentation, description, synchronization, rights management, and preservation of spoken word collections.
Multilingual Access to Large Spoken Archives Douglas W. Oard University of Maryland, College Park, MD, USA
MALACH Project’s Goal Dramatically improve access to large multilingual spoken word collections … by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.
Spoken Word Collections • Broadcast programming • News, interview, talk radio, sports, entertainment • Scripted stories • Books on tape, poetry reading, theater • Spontaneous storytelling • Oral history, folklore • Incidental recording • Speeches, oral arguments, meetings, phone calls
Some Statistics • 2,000 U.S. radio stations webcasting • 250,000 hours of oral history in British Library • 35 million audio streams indexed by SingingFish • Over 1 million searches per day • ~100 billion hours of phone calls each year
Economics of the Web in 1995 • Affordable storage • 300,000 words/$ • Adequate backbone capacity • 25,000 simultaneous transfers • Adequate “last mile” bandwidth • 1 second/screen • Display capability • 10% of US population • Effective search capabilities • Lycos, Yahoo
Spoken Word Collections Today • Affordable storage • 300,000 words/$ • Adequate backbone capacity • 25,000 simultaneous transfers • Adequate “last mile” bandwidth • 1 second/screen • Display capability • 10% of US population • Effective search capabilities • Lycos, Yahoo • Callouts overlaid on the 1995 figures, in the order they appear on the slide: 1.5 million words/$; 30 million; 20% of capacity; 38% recent use
MALACH Research Issues • Acquisition • Segmentation • Description • Synchronization • Rights management • Preservation
Description Strategies • Transcription • Manual transcription (with optional post-editing) • Annotation • Manually assign descriptors to points in a recording • Recommender systems (ratings, link analysis, …) • Associated materials • Interviewer’s notes, speech scripts, producer’s logs • Automatic • Create access points with automatic speech processing
Key Results from TREC/TDT • Recognition and retrieval can be decomposed • Word recognition/retrieval works well in English • Retrieval is robust with recognition errors • Up to 40% word error rate is tolerable • Retrieval is robust with segmentation errors • Vocabulary shift/pauses provide strong cues
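The 40% figure above is a word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The two sentences are made up for illustration and are not from any collection discussed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (invented) example: one substitution and one deletion in 8 words.
reference = "we left berlin by train in the spring"
hypothesis = "we left berlin by rain in spring"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 25.0%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.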
Supporting Information Access [process diagram] • Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery • Items passed between stages: Query, Ranked List, Recording • Feedback loops: Query Reformulation and Relevance Feedback, Source Reselection • The Search stage is labeled “Search System”
Broadcast News Retrieval Study • NPR Online • Manually prepared transcripts • Human cataloging • SpeechBot • Automatic Speech Recognition • Automatic indexing
Study Design • Seminar on visual and sound materials • Recruited 5 students • After training, we provided 2 topics • 3 searched NPR Online, 2 searched SpeechBot • All then tried both systems with a 3rd topic • Each choosing their own topic • Rich data collection • Observation, think aloud, semi-structured interview • Model-guided inductive analysis • Coded to the model with QSR NVivo
Some Useful Insights • Recognition errors may not bother the system, but they do bother the user! • Segment-level indexing can be useful
Shoah Foundation’s Collection • Enormous scale • 116,000 hours; 52,000 interviews; 180 TB • Grand challenges • 32 languages, accents, elderly, emotional, … • Accessible • $100 million collection and digitization investment • Annotated • 10,000 hours (~200,000 segments) fully described • Users • A department working full time on dissemination
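A quick back-of-the-envelope check helps put the scale figures above in proportion. This sketch uses only the numbers on the slide; the one added assumption is that "TB" means decimal terabytes (10^12 bytes).

```python
# Figures taken from the slide.
total_hours = 116_000
interviews = 52_000
storage_tb = 180
described_hours = 10_000
described_segments = 200_000

hours_per_interview = total_hours / interviews                  # about 2.2 hours
minutes_per_segment = described_hours * 60 / described_segments  # 3.0 minutes
fraction_described = described_hours / total_hours               # about 8.6%
# Assumption: 1 TB = 10**12 bytes, 1 GB = 10**9 bytes.
gb_per_hour = storage_tb * 10**12 / total_hours / 10**9          # about 1.55 GB

print(f"Average interview length: {hours_per_interview:.1f} hours")
print(f"Average described segment: {minutes_per_segment:.1f} minutes")
print(f"Share of collection fully described: {fraction_described:.1%}")
print(f"Storage per hour of video: {gb_per_hour:.2f} GB")
```

So the fully described portion is under a tenth of the collection, with segments averaging about three minutes each.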
Existing Annotations • 72 million untranscribed words • From ~4,000 speakers • Interview-level ground truth • Pre-interview questionnaire (names, locations, …) • Free-text summary • Segment-level ground truth • Topic boundaries: average ~3 min/segment • Labels: Names, topic, locations, year(s) • Descriptions: summary + cataloguer’s scratchpad
Annotated Data Example [segments listed in order of interview time]
Location-Time   Subject                           Person
Berlin-1939     Employment                        Josef Stein
Berlin-1939     Family life                       Gretchen Stein, Anna Stein
Dresden-1939    Relocation, Transportation-rail   -
Dresden-1939    Schooling                         Gunter Wendt, Maria
MALACH Overview [diagram] • Components: Query Formulation, Speech Recognition, Automatic Search, Boundary Detection, Content Tagging, Interactive Selection • User Needs: observational studies, formative evaluation, summative evaluation • ASR: spontaneous, accented, language switching • NLP Components: multi-scale segmentation, multilingual classification, entity normalization • Prototype: evidence integration, translingual search, spatial/temporal
MALACH Overview: ASR • Spontaneous • Accented • Language switching
ASR Research Focus • Accuracy • Spontaneous speech • Accented/multilingual/emotional/elderly • Application-specific loss functions • Affordability • Minimal transcription • Replicable process
Application-Tuned ASR • Acoustic model • Transcribe short segments from many speakers • Unsupervised adaptation • Language model • Transcribed segments • Interpolation
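The "Interpolation" bullet above can be illustrated with a toy example. The slide does not say which interpolation scheme is used, so this is a sketch under assumptions: simple linear interpolation of two unigram models, one estimated from in-domain transcribed segments and one from a background corpus. The function names, the example texts, and the weight `lam` are all hypothetical.

```python
from collections import Counter
from typing import Dict


def unigram_model(text: str) -> Dict[str, float]:
    """Maximum-likelihood unigram probabilities from whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(in_domain: Dict[str, float],
                background: Dict[str, float],
                lam: float) -> Dict[str, float]:
    """P(w) = lam * P_in_domain(w) + (1 - lam) * P_background(w)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    vocabulary = set(in_domain) | set(background)
    return {
        word: lam * in_domain.get(word, 0.0) + (1.0 - lam) * background.get(word, 0.0)
        for word in vocabulary
    }


# Invented toy texts standing in for transcribed segments and a background corpus.
in_domain = unigram_model("the ghetto the camp")
background = unigram_model("the news the weather")
mixed = interpolate(in_domain, background, lam=0.6)

for word in sorted(mixed):
    print(f"{word:8s} {mixed[word]:.2f}")
```

In practice the weight would be tuned on held-out in-domain text, and the models would be higher-order n-grams with smoothing; the toy version only shows how a small amount of transcribed data can shift probability toward domain vocabulary.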
ASR Game Plan (as of May 2003)
Language   Hours Transcribed   Word Error Rate
English    200                 39.6%
Czech      84                  39.4%
Russian    20 (of 100)         66.6%
Polish     -                   -
Slovak     -                   -
English Transcription Time • ~2,000 hours to manually transcribe 200 hours from 800 speakers • [Histogram axes: instances (N=830); hours to transcribe 15 minutes of speech]
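The transcription figures above imply a few averages worth spelling out. This sketch only divides the slide's numbers; since the 2,000-hour figure is approximate, the results are rough. Reading the 830 instances as 15-minute units is an assumption based on the chart's axis label.

```python
# Figures taken from the slide (the first is approximate).
transcription_hours = 2000
speech_hours = 200
speakers = 800
instances = 830

# Hours of transcriber effort per hour of speech.
effort_ratio = transcription_hours / speech_hours      # 10.0
# Average amount of speech transcribed per speaker, in minutes.
minutes_per_speaker = speech_hours * 60 / speakers     # 15.0
# Average effort for one 15-minute unit of speech, in hours.
hours_per_unit = effort_ratio * 15 / 60                # 2.5
# Assumption: each of the 830 instances is one 15-minute unit.
hours_covered = instances * 15 / 60                    # 207.5

print(f"Effort ratio: {effort_ratio:.0f}x real time")
print(f"Speech per speaker: {minutes_per_speaker:.0f} minutes")
print(f"Effort per 15-minute unit: {hours_per_unit:.1f} hours")
print(f"Speech covered by {instances} units: {hours_covered:.1f} hours")
```

The numbers are mutually consistent: about 15 minutes per speaker, and 830 such units cover roughly the 200 hours quoted.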
English ASR Error Rate Training: 65 hours (acoustic model)/200 hours (language model)
MALACH Overview: User Needs • Observational studies • Formative evaluation • Summative evaluation
Who Uses the Collection? • Discipline • History, linguistics, journalism, material culture, education, psychology, political science, law enforcement • Products • Book, documentary film, research paper, CDROM, study guide, obituary, evidence, personal use • Based on analysis of 280 access requests
Question Types • Content • Person, organization • Place, type of place (e.g., camp, ghetto) • Time, time period • Event, subject • Mode of expression • Language • Displayed artifacts (photographs, objects, …) • Affective reaction (e.g., vivid, moving, …) • Age appropriateness
Observational Studies • Workshop 1 (June) • Four searchers: History/Political Science, Holocaust studies, Holocaust studies, documentary filmmaker • Sequential observation • Rich data collection: intermediary interaction, semi-structured interviews, observational notes, think-aloud, screen capture • Workshop 2 (August) • Four searchers: Ethnography, German Studies, Sociology, high school teacher • Simultaneous observation • Opportunistic data collection: intermediary interaction, semi-structured interviews, observational notes, focus group discussions
Observed Selection Criteria • Topicality (57%) • Judged based on: Person, place, … • Accessibility (23%) • Judged based on: Time to load video • Comprehensibility (14%) • Judged based on: Language, speaking style
MALACH Overview: NLP Components • Multi-scale segmentation • Multilingual classification • Entity normalization
Topic Segmentation • “True” segmentation: transcripts aligned with cataloguer scratchpad-based boundaries
Rethinking the Problem • Segment-then-label models planned speech well • Producers assemble stories to create programs • Stories typically have a dominant theme • The structure of natural speech is different • Creation: digressions, asides, clarification, … • Use: intended use may affect desired granularity • Documentary film: brief snippet to illustrate a point • Classroom teacher: longer self-contextualizing story
OntoLog: Labeling Unplanned Speech • Manually assigned labels; start and end at any time • Ontology-based aggregation helps manage complexity
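To make "ontology-based aggregation" concrete, here is a small sketch of the general idea: counts for specific labels are rolled up to their ancestors, so a browser can show a few broad categories before drilling down. This is not a description of how OntoLog itself works. The label names come from the annotated example in this deck, but the parent categories ("Daily life", "Movement", "Transportation") are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical fragment of a label hierarchy: child -> parent.
PARENT: Dict[str, str] = {
    "Employment": "Daily life",
    "Family life": "Daily life",
    "Schooling": "Daily life",
    "Transportation-rail": "Transportation",
    "Transportation": "Movement",
    "Relocation": "Movement",
}


def with_ancestors(label: str) -> List[str]:
    """Return the label followed by each of its ancestors, nearest first."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def aggregate(segment_labels: Iterable[str]) -> Counter:
    """Count every label and credit each occurrence to all of its ancestors."""
    counts: Counter = Counter()
    for label in segment_labels:
        counts.update(with_ancestors(label))
    return counts


labels = ["Employment", "Family life", "Relocation", "Transportation-rail", "Schooling"]
counts = aggregate(labels)
print(counts["Daily life"], counts["Movement"])  # prints 3 2
```

Five distinct labels collapse to two top-level categories here, which is the complexity reduction the slide refers to.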
Goal Use available data to estimate the temporal extent of labels in a way that optimizes the utility of the resulting estimates for interactive searching and browsing
Multi-Scale Segmentation • [Figure: labels plotted against time]
Characteristics of the Problem • Clear sequential dependencies • Living in Dresden negates living in Berlin • Heuristic basis for class models • Persons, based on type of relationship • Date/Time, based on part-whole relationship • Topics, based on a defined hierarchy • Heuristic basis for guessing without training • Text similarity between labels and spoken words • Heuristic basis for smoothing • Sub-sentence retrieval granularity is unlikely
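The "guessing without training" heuristic above, text similarity between labels and spoken words, can be sketched with bag-of-words cosine similarity. This is one possible reading, not the project's method. The transcript windows are invented, and real labels would rarely share exact words with speech, so a practical version would need stemming, synonyms, or expansion from label descriptions.

```python
import math
from collections import Counter
from typing import List


def bag(text: str) -> Counter:
    """Lowercased bag of whitespace-separated words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_label(window: str, labels: List[str]) -> str:
    """Pick the label whose words are most similar to the transcript window."""
    window_bag = bag(window)
    return max(labels, key=lambda label: cosine(bag(label), window_bag))


labels = ["family life", "schooling", "transportation rail"]
# Invented transcript windows for illustration.
print(best_label("my family lived a quiet life in berlin", labels))  # family life
print(best_label("we went by rail to dresden", labels))              # transportation rail
```

Scores computed this way could serve as a starting estimate that the sequential-dependency and smoothing heuristics then refine.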
Manually Assigned Onset Marks [figure: label onsets marked along interview time] • Location-Time: Berlin-1939, Dresden-1939 • Subject: Employment, Family Life, Relocation, Transportation-rail, Schooling • Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria
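Onset marks give only the time at which a label starts. For a mutually exclusive class such as location (living in Dresden negates living in Berlin), one simple baseline is to end each label's extent where the next label of the same class begins. The sketch below shows that baseline only; it is an assumption, not the project's estimator, and the times are hypothetical since the slide gives none.

```python
from typing import List, Tuple

Onset = Tuple[float, str]           # (onset time in minutes, label)
Extent = Tuple[str, float, float]   # (label, start, end)


def extents_from_onsets(onsets: List[Onset], end_time: float) -> List[Extent]:
    """Close each label at the onset of the next label in the same exclusive class.

    The last label runs until end_time (for example, the end of the interview).
    """
    ordered = sorted(onsets)
    boundaries = [time for time, _ in ordered[1:]] + [end_time]
    return [(label, start, end) for (start, label), end in zip(ordered, boundaries)]


# Hypothetical onset times for the Location-Time labels in the example.
location_onsets = [(0.0, "Berlin-1939"), (6.0, "Dresden-1939")]
for label, start, end in extents_from_onsets(location_onsets, end_time=12.0):
    print(f"{label}: {start:.1f} to {end:.1f} min")
```

Subject and person labels are not mutually exclusive, so their extents cannot be closed this way and need other evidence, such as text similarity or class-specific duration models.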
Some Additional Results • Named entity recognition • F > 0.8 (on manual transcripts) • Cross-language ranked retrieval (on news) • Czech/English similar to other language pairs
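The named entity result above is reported as an F score. The slide does not say which variant, so the sketch below assumes the common balanced F1 (the harmonic mean of precision and recall) with exact-match scoring of (text, type) pairs. The gold and predicted entities are invented, using names from the annotated example.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)


def f_measure(gold: Set[Entity], predicted: Set[Entity], beta: float = 1.0) -> float:
    """F_beta over exact-match entities; beta=1 gives the balanced F1 score."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Invented example: three of four entities found, plus one spurious entity.
gold = {
    ("Josef Stein", "PERSON"),
    ("Gunter Wendt", "PERSON"),
    ("Berlin", "LOCATION"),
    ("Dresden", "LOCATION"),
}
predicted = {
    ("Josef Stein", "PERSON"),
    ("Maria", "PERSON"),
    ("Berlin", "LOCATION"),
    ("Dresden", "LOCATION"),
}
print(f"F1 = {f_measure(gold, predicted):.2f}")  # prints F1 = 0.75
```

An F above 0.8 therefore requires both precision and recall to be high; neither alone can carry the score.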
Looking Forward: 2003 • Component development • ASR, segmentation, classification, retrieval • Ranked retrieval test collection • 1,000 hours of English recognition • 25 judged topics in English and Czech • Interactive retrieval • Integrating free text and thesaurus-based search