Explore the enormous scale of the MALACH Project, a $100 million investment in digitizing and annotating a collection of 52,000 interviews in 32 languages. Discover how this collection is accessible and its potential uses across disciplines.
Talking to the Future: The MALACH Project
Douglas W. Oard
Joanne Archer, Ammie Feijoo, Xiaoli Huang
College of Information Studies
CLIS Alumni Chapter
Shoah Foundation’s Collection
• Enormous scale
  • 116,000 hours; 52,000 interviews; 180 TB
• Grand challenges
  • 32 languages, accents, elderly, emotional, …
• Accessible
  • $100 million collection and digitization investment
• Annotated
  • 10,000 hours (~200,000 segments) fully described
• Users
  • A department working full time on dissemination
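To give a feel for the scale figures above, the short sketch below derives a few averages from the numbers on the slide. The derived values are plain arithmetic and are not stated in the talk; the variable names are illustrative, and the storage figure assumes decimal units (1 TB = 1000 GB).

```python
# Figures quoted on the slide.
total_hours = 116_000
interviews = 52_000
storage_tb = 180
described_hours = 10_000
described_segments = 200_000  # the slide says "~200,000", so this is approximate

# Derived averages (simple division, not figures from the talk).
avg_interview_hours = total_hours / interviews
avg_segment_minutes = described_hours * 60 / described_segments
described_share = described_hours / total_hours
gb_per_hour = storage_tb * 1000 / total_hours  # assumes 1 TB = 1000 GB

print(f"Average interview length: {avg_interview_hours:.2f} hours")
print(f"Average described segment: {avg_segment_minutes:.1f} minutes")
print(f"Share of collection fully described: {described_share:.1%}")
print(f"Storage per recorded hour: {gb_per_hour:.2f} GB")
```

On these numbers, an interview averages a little over two hours, a fully described segment averages about three minutes, and under a tenth of the collection has been fully described.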
Who Uses the Collection?
• Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement
• Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use
Based on analysis of 280 access requests
Question Types • Content • Person, organization • Place, type of place (e.g., camp, ghetto) • Time, time period • Event, subject • Mode of expression • Language • Displayed artifacts (photographs, objects, …) • Affective reaction (e.g., vivid, moving, …) • Age appropriateness
Full-Description Cataloguing
[Figure: descriptors in three columns (Location-Time; Subject; Person), listed in interview-time order]
• Berlin-1939; Employment; Josef Stein
• Berlin-1939; Family life; Gretchen Stein, Anna Stein
• Dresden-1939; Relocation, Transportation-rail
• Dresden-1939; Schooling; Gunter Wendt, Maria
“Real-Time” Cataloguing
[Figure: descriptors in three columns, placed along the interview-time axis]
• Location-Time: Berlin-1939, Dresden-1939
• Subject: Employment, Family Life, Relocation, Transportation-rail, Schooling
• Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria
The Goal
Dramatically improve access to large multilingual spoken word collections … by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.
Observational Studies
Workshop 1 (June)
• Four searchers: History/Political Science, Holocaust studies, Holocaust studies, Documentary filmmaker
• Sequential observation
• Rich data collection: Intermediary interaction, Semi-structured interviews, Observational notes, Think-aloud, Screen capture
Workshop 2 (August)
• Four searchers: Ethnography, German Studies, Sociology, High school teacher
• Simultaneous observation
• Opportunistic data collection: Intermediary interaction, Semi-structured interviews, Observational notes, Focus group discussions
Observed Selection Criteria
• Topicality (57%)
  • Judged based on: Person, place, …
• Accessibility (23%)
  • Judged based on: Time to load video
• Comprehensibility (14%)
  • Judged based on: Language, speaking style
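Supporting Information Access
[Figure: information access process diagram]
• Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery
• Items passed between stages: Query, Ranked List, Recording
• Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection
• Also labeled: Search System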
[Figure: project overview diagram; the labels appear to group as follows]
• User Needs: Observational studies, Formative evaluation, Summative evaluation
• ASR: Spontaneous, Accented, Language switching
• NLP Components: Multi-scale segmentation, Multilingual classification, Entity normalization
• Prototype: Evidence integration, Multilingual search, Spatial/temporal
• Processing steps: Speech Recognition, Boundary Detection, Content Tagging, Query Formulation, Automatic Search, Interactive Selection
Description Strategies
• Transcription
  • Manual transcription (with optional post-editing)
• Annotation
  • Manually assign descriptors to points in a recording
  • Recommender systems (ratings, link analysis, …)
• Associated materials
  • Interviewer’s notes, speech scripts, producer’s logs
• Automatic
  • Create access points with automatic speech processing
English ASR Error Rate
Training: 65 hours (acoustic model) / 200 hours (language model)
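The slide does not say how the error rate was measured. ASR results are commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below assumes that metric; the function name and the example sentences are made up for illustration and are not taken from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("moved" -> "move") and one deletion ("in").
reference = "we moved to dresden in 1939"
hypothesis = "we move to dresden 1939"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because insertions count as errors, WER computed this way can exceed 100% when the recognizer output is much longer than the reference.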
Building a Test Collection
• Overall relevance
  • Assessment is informed by the assessments for the individual reasons for relevance (categories of relevance), but the relationship is not straightforward
• Provides direct evidence
• Provides indirect / circumstantial evidence
• Provides context (e.g., causes for the phenomenon of interest)
• Provides comparison (similarity or contrast, same phenomenon in different environment, similar phenomenon)
• Provides pointer to source of information
Some Statistics • 2,000 U.S. radio stations Webcasting • 250,000 hours of oral history in British Library • 35,000,000 audio streams on the Web
Spoken Word Collections
• Broadcast programming
  • News, interview, talk radio, sports, entertainment
• Scripted stories
  • Books on tape, poetry reading, theater
• Spontaneous storytelling
  • Oral history, folklore
• Incidental recording
  • Speeches, oral arguments, meetings, phone calls
Building a Web of Spoken Words • Affordable storage • For $1, you can store 1.5 million spoken words • Adequate network capacity • Internet capacity: 30 million simultaneous programs • Works with any modem • You can even read email while playing audio • Replay capabilities • 38% of US users recently used streaming audio • Effective search capabilities • Not quite yet …
Looking Forward: 2006 • Working systems in five languages • Real users searching real data • Rich experience beyond broadcast news • Frameworks, components, systems • Affordable application-tuned systems • Oral history, lectures, speeches, meetings, …
For More Information
• The MALACH project
  • http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Group
  • http://www.dcs.shef.ac.uk/spandh/projects/swag/
• Speech-based retrieval
  • http://www.glue.umd.edu/~dlrg/speech/