Multilingual Access to Large Spoken Archives

Douglas W. Oard, University of Maryland, College Park, MD, USA

MALACH Project’s Goal: Dramatically improve access to large multilingual spoken word collections … by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.

Spoken Word Collections • Broadcast programming: news, interview, talk radio, sports, entertainment • Scripted stories: books on tape, poetry reading, theater • Spontaneous storytelling: oral history, folklore • Incidental recording: speeches, oral arguments, meetings, phone calls

Some Statistics • 2,000 U.S. radio stations webcasting • 250,000 hours of oral history in British Library • 35 million audio streams indexed by SingingFish • Over 1 million searches per day • ~100 billion hours of phone calls each year

Economics of the Web in 1995 • Affordable storage: 300,000 words/$ • Adequate backbone capacity: 25,000 simultaneous transfers • Adequate “last mile” bandwidth: 1 second/screen • Display capability: 10% of US population • Effective search capabilities: Lycos, Yahoo

Spoken Word Collections Today • Same five criteria as the Web in 1995 (storage, backbone capacity, “last mile” bandwidth, display capability, search capabilities) • Updated figures overlaid on the 1995 values, in slide order: 1.5 million words/$; 30 million; 20% of capacity; 38% recent use

MALACH Research Issues • Acquisition • Segmentation • Description • Synchronization • Rights management • Preservation

Description Strategies • Transcription: manual transcription (with optional post-editing) • Annotation: manually assign descriptors to points in a recording; recommender systems (ratings, link analysis, …) • Associated materials: interviewer’s notes, speech scripts, producer’s logs • Automatic: create access points with automatic speech processing

Key Results from TREC/TDT • Recognition and retrieval can be decomposed • Word recognition/retrieval works well in English • Retrieval is robust with recognition errors • Up to 40% word error rate is tolerable • Retrieval is robust with segmentation errors • Vocabulary shift/pauses provide strong cues
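The 40% figure above is a word error rate (WER). As a minimal sketch of how that measure is usually computed (word-level edit distance divided by the number of reference words), the Python below may help. The function name and the example sentences are illustrative assumptions and are not taken from the TREC/TDT evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of eight reference words.
wer = word_error_rate("we moved from berlin to dresden in 1939",
                      "we moved from berlin to dressed in 1939")
print(f"WER = {wer:.1%}, within the 40% bound: {wer <= 0.40}")
```

Because insertions count as errors, WER computed this way can exceed 100% on very poor output.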

Supporting Information Access (diagram) • Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery • Items passed between stages: Query, Ranked List, Recording • Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection • Part of the diagram is labeled “Search System”

Broadcast News Retrieval Study • NPR Online: manually prepared transcripts; human cataloging • SpeechBot: automatic speech recognition; automatic indexing

NPR Online

SpeechBot

Study Design • Seminar on visual and sound materials • Recruited 5 students • After training, we provided 2 topics • 3 searched NPR Online, 2 searched SpeechBot • All then tried both systems with a 3rd topic • Each choosing their own topic • Rich data collection • Observation, think aloud, semi-structured interview • Model-guided inductive analysis • Coded to the model with QSR NVivo

Criterion-Attribute Framework

Some Useful Insights • Recognition errors may not bother the system, but they do bother the user! • Segment-level indexing can be useful

Shoah Foundation’s Collection • Enormous scale: 116,000 hours; 52,000 interviews; 180 TB • Grand challenges: 32 languages, accents, elderly, emotional, … • Accessible: $100 million collection and digitization investment • Annotated: 10,000 hours (~200,000 segments) fully described • Users: a department working full time on dissemination

Example Video

Existing Annotations • 72 million untranscribed words • From ~4,000 speakers • Interview-level ground truth • Pre-interview questionnaire (names, locations, …) • Free-text summary • Segment-level ground truth • Topic boundaries: average ~3 min/segment • Labels: Names, topic, locations, year(s) • Descriptions: summary + cataloguer’s scratchpad

Annotated Data Example (segments in order of interview time; columns: Location-Time; Subject; Person) • Berlin-1939; Employment; Josef Stein • Berlin-1939; Family life; Gretchen Stein, Anna Stein • Dresden-1939; Relocation, Transportation-rail; (none) • Dresden-1939; Schooling; Gunter Wendt, Maria

MALACH Overview (diagram) • System stages: Query Formulation, Speech Recognition, Automatic Search, Boundary Detection, Content Tagging, Interactive Selection • User Needs: observational studies, formative evaluation, summative evaluation • ASR: spontaneous, accented, language switching • NLP Components: multi-scale segmentation, multilingual classification, entity normalization • Prototype: evidence integration, translingual search, spatial/temporal

MALACH Overview: ASR • Spontaneous • Accented • Language switching

ASR Research Focus • Accuracy: spontaneous speech; accented/multilingual/emotional/elderly; application-specific loss functions • Affordability: minimal transcription; replicable process

Application-Tuned ASR • Acoustic model • Transcribe short segments from many speakers • Unsupervised adaptation • Language model • Transcribed segments • Interpolation
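The slide lists "Interpolation" under the language model without saying what is combined. A minimal sketch, assuming a linear interpolation between an in-domain model estimated from the transcribed segments and a broader background model, is shown below with unigram models for brevity. The function names, the weight `lam`, and the toy texts are hypothetical; production ASR systems interpolate higher-order n-gram models and tune the weight on held-out data.

```python
from collections import Counter


def unigram_model(text):
    """Maximum-likelihood unigram probabilities for whitespace-separated text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(in_domain, background, lam):
    """P(w) = lam * P_in_domain(w) + (1 - lam) * P_background(w)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocabulary = set(in_domain) | set(background)
    return {word: lam * in_domain.get(word, 0.0)
                  + (1.0 - lam) * background.get(word, 0.0)
            for word in vocabulary}


# Toy texts standing in for transcribed interview segments and general text.
in_domain = unigram_model("ghetto camp camp train")
background = unigram_model("train news news news")
mixed = interpolate(in_domain, background, lam=0.5)
for word in sorted(mixed):
    print(word, round(mixed[word], 3))
```

Since both inputs are probability distributions, the interpolated values also sum to one for any `lam` in [0, 1].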

ASR Game Plan (as of May 2003) • Columns: language; hours transcribed; word error rate • English: 200; 39.6% • Czech: 84; 39.4% • Russian: 20 (of 100); 66.6% • Polish: no figures listed • Slovak: no figures listed

English Transcription Time • ~2,000 hours to manually transcribe 200 hours from 800 speakers • [Histogram: instances (N=830) by hours to transcribe 15 minutes of speech]

English ASR Error Rate • Training: 65 hours (acoustic model) / 200 hours (language model)

MALACH Overview: User Needs • Observational studies • Formative evaluation • Summative evaluation

Who Uses the Collection? (based on analysis of 280 access requests) • Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement • Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use

Question Types • Content • Person, organization • Place, type of place (e.g., camp, ghetto) • Time, time period • Event, subject • Mode of expression • Language • Displayed artifacts (photographs, objects, …) • Affective reaction (e.g., vivid, moving, …) • Age appropriateness

Observational Studies • Workshop 1 (June): four searchers (History/Political Science, Holocaust studies, Holocaust studies, documentary filmmaker); sequential observation; rich data collection (intermediary interaction, semi-structured interviews, observational notes, think-aloud, screen capture) • Workshop 2 (August): four searchers (Ethnography, German Studies, Sociology, high school teacher); simultaneous observation; opportunistic data collection (intermediary interaction, semi-structured interviews, observational notes, focus group discussions)

Segment Viewer

Observed Selection Criteria • Topicality (57%), judged based on: person, place, … • Accessibility (23%), judged based on: time to load video • Comprehensibility (14%), judged based on: language, speaking style

References to Named Entities

Functionality

MALACH Overview: NLP Components • Multi-scale segmentation • Multilingual classification • Entity normalization

Topic Segmentation • “True” segmentation: transcripts aligned with the cataloguer’s scratchpad-based boundaries

Effect of ASR Errors

Rethinking the Problem • Segment-then-label models planned speech well • Producers assemble stories to create programs • Stories typically have a dominant theme • The structure of natural speech is different • Creation: digressions, asides, clarification, … • Use: intended use may affect desired granularity • Documentary film: brief snippet to illustrate a point • Classroom teacher: longer self-contextualizing story

OntoLog: Labeling Unplanned Speech • Manually assigned labels; start and end at any time • Ontology-based aggregation helps manage complexity

Goal: Use available data to estimate the temporal extent of labels in a way that optimizes the utility of the resulting estimates for interactive searching and browsing.

Multi-Scale Segmentation [figure: labels plotted against time]

Characteristics of the Problem • Clear sequential dependencies • Living in Dresden negates living in Berlin • Heuristic basis for class models • Persons, based on type of relationship • Date/Time, based on part-whole relationship • Topics, based on a defined hierarchy • Heuristic basis for guessing without training • Text similarity between labels and spoken words • Heuristic basis for smoothing • Sub-sentence retrieval granularity is unlikely
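The "guessing without training" heuristic above (text similarity between labels and spoken words) could be sketched as a bag-of-words cosine similarity between each label and a segment's recognized words. This is only one plausible reading of the bullet, not the project's implementation. The function names and the segment text are hypothetical, and the label strings are borrowed from the annotated data example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; hyphens are treated as word separators."""
    return Counter(text.lower().replace("-", " ").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_labels(segment_text, labels):
    """Return (score, label) pairs, most similar label first."""
    segment = bag_of_words(segment_text)
    scored = [(cosine(bag_of_words(label), segment), label) for label in labels]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical recognizer output for one segment.
segment_text = "they put us on a rail car and the rail line went east"
for score, label in rank_labels(segment_text,
                                ["Family life", "Transportation-rail", "Schooling"]):
    print(f"{score:.3f}  {label}")
```

Exact word overlap is a weak signal when labels are abstract (for example "Schooling" versus the spoken word "teacher"), which is one reason a trained classifier or a thesaurus would normally supplement it.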

Manually Assigned Onset Marks [figure: onset marks along interview time] • Location-Time: Berlin-1939, Dresden-1939 • Subject: Employment, Family Life, Relocation, Transportation-rail, Schooling • Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria
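One simple way to turn onset marks into temporal extents, for label types that are mutually exclusive (for instance Location-Time, where living in Dresden negates living in Berlin), is to let each label run until the next onset of the same type. The sketch below assumes exactly that. The function name, the onset times in seconds, and the interview length are hypothetical, and labels that can overlap (such as subjects or persons) would need a different model.

```python
def extents_for_exclusive_labels(onsets, interview_end):
    """Convert (start_time, label) onset marks into (label, start, end) extents.

    Assumes the labels are mutually exclusive, so each one ends where the
    next begins and the last one runs to the end of the interview.
    """
    ordered = sorted(onsets)
    extents = []
    for index, (start, label) in enumerate(ordered):
        if index + 1 < len(ordered):
            end = ordered[index + 1][0]
        else:
            end = interview_end
        extents.append((label, start, end))
    return extents


# Hypothetical onset times (seconds) for the two Location-Time labels.
location_onsets = [(360, "Dresden-1939"), (0, "Berlin-1939")]
for label, start, end in extents_for_exclusive_labels(location_onsets, 720):
    print(f"{label}: {start}s to {end}s")
```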

Some Additional Results • Named entity recognition • F > 0.8 (on manual transcripts) • Cross-language ranked retrieval (on news) • Czech/English similar to other language pairs
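The slide reports F > 0.8 for named entity recognition without stating which F-measure is meant. Assuming the common balanced F (the harmonic mean of precision and recall over exactly matching entity spans), a minimal sketch follows. The function name and the toy entity spans are hypothetical.

```python
def f_measure(predicted, gold, beta=1.0):
    """F-measure over sets of entity spans; beta=1 weights precision and recall equally."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy spans: (start_token, end_token, entity_type).
gold = {(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 10, "LOCATION"), (12, 14, "PERSON")}
predicted = {(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 10, "LOCATION"), (20, 21, "LOCATION")}
print(f"F = {f_measure(predicted, gold):.2f}")
```

Exact-match scoring is strict: a span that is off by one token counts as both a false positive and a false negative.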

Looking Forward: 2003 • Component development • ASR, segmentation, classification, retrieval • Ranked retrieval test collection • 1,000 hours of English recognition • 25 judged topics in English and Czech • Interactive retrieval • Integrating free text and thesaurus-based search
