
Talking to the Future: The MALACH Project

Explore the enormous scale of the MALACH Project, a $100 million investment in digitizing and annotating a collection of 52,000 interviews in 32 languages. Discover how this collection is accessible and its potential uses across disciplines.


Presentation Transcript


  1. Talking to the Future: The MALACH Project • Douglas W. Oard; Joanne Archer, Ammie Feijoo, Xiaoli Huang • College of Information Studies • CLIS Alumni Chapter

  2. Telling Our Stories

  3. Shoah Foundation’s Collection
     • Enormous scale: 116,000 hours; 52,000 interviews; 180 TB
     • Grand challenges: 32 languages, accents, elderly, emotional, …
     • Accessible: $100 million collection and digitization investment
     • Annotated: 10,000 hours (~200,000 segments) fully described
     • Users: a department working full time on dissemination
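The totals on this slide imply a few per-unit ratios that help put the scale in perspective. The sketch below only divides the slide's own figures; the function and key names are illustrative, the segment count is approximate (the slide says "~200,000"), and the storage ratio assumes decimal units (1 TB = 1,000 GB).

```python
# Figures quoted on the slide.
TOTAL_HOURS = 116_000
INTERVIEWS = 52_000
STORAGE_TB = 180
DESCRIBED_HOURS = 10_000
DESCRIBED_SEGMENTS = 200_000  # the slide says "~200,000", so this is approximate


def collection_ratios():
    """Return rough per-unit ratios derived from the slide's totals."""
    return {
        "hours_per_interview": TOTAL_HOURS / INTERVIEWS,
        "minutes_per_segment": DESCRIBED_HOURS * 60 / DESCRIBED_SEGMENTS,
        "fraction_fully_described": DESCRIBED_HOURS / TOTAL_HOURS,
        # Assumption: 1 TB = 1,000 GB (decimal units).
        "gb_per_hour": STORAGE_TB * 1_000 / TOTAL_HOURS,
    }


if __name__ == "__main__":
    for name, value in collection_ratios().items():
        print(f"{name}: {value:.2f}")
```

On these numbers an interview averages a little over two hours, a fully described segment averages about three minutes, and under a tenth of the collection has been fully described.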

  4. Who Uses the Collection?
     • Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement
     • Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use
     • Based on analysis of 280 access requests

  5. Question Types • Content • Person, organization • Place, type of place (e.g., camp, ghetto) • Time, time period • Event, subject • Mode of expression • Language • Displayed artifacts (photographs, objects, …) • Affective reaction (e.g., vivid, moving, …) • Age appropriateness

  6. Full-Description Cataloguing Location-Time Subject Person Berlin-1939 Employment Josef Stein Berlin-1939 Family life Gretchen Stein Anna Stein interview time Dresden-1939 Relocation Transportation-rail Dresden-1939 Schooling Gunter Wendt Maria

  7. “Real-Time” Cataloguing Location-Time Subject Person Berlin-1939 Employment Josef Stein Gretchen Stein Family Life Anna Stein interview time Relocation Transportation-rail Dresden-1939 Gunter Wendt Schooling Maria

  8. Thesaurus-Based Search

  9. The Goal • Dramatically improve access to large multilingual spoken word collections … • … by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.

  10. Joanne Archer

  11. Observational Studies
      • Workshop 1 (June)
        • Four searchers: History/Political Science, Holocaust studies, Holocaust studies, Documentary filmmaker
        • Sequential observation
        • Rich data collection: intermediary interaction, semi-structured interviews, observational notes, think-aloud, screen capture
      • Workshop 2 (August)
        • Four searchers: Ethnography, German Studies, Sociology, High school teacher
        • Simultaneous observation
        • Opportunistic data collection: intermediary interaction, semi-structured interviews, observational notes, focus group discussions

  12. Observed Selection Criteria
      • Topicality (57%): judged based on person, place, …
      • Accessibility (23%): judged based on time to load video
      • Comprehensibility (14%): judged based on language, speaking style

  13. Functionality

  14. Xiaoli Huang

  15. Supporting Information Access (process diagram)
      • Stages: Source Selection, Query Formulation, Search (the search system), Selection, Examination, Delivery
      • Intermediate products: Query, Ranked List, Recording
      • Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection

  16. Observational studies Formative evaluation Summative evaluation ASR Spontaneous Accented Language switching User Needs NLP Components Evidence integration Multilingual search Spatial/temporal Multi-scale segmentation Multilingual classification Entity normalization Prototype Query Formulation Speech Recognition Automatic Search Boundary Detection Content Tagging Interactive Selection

  17. Description Strategies
      • Transcription
        • Manual transcription (with optional post-editing)
      • Annotation
        • Manually assign descriptors to points in a recording
        • Recommender systems (ratings, link analysis, …)
      • Associated materials
        • Interviewer’s notes, speech scripts, producer’s logs
      • Automatic
        • Create access points with automatic speech processing
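The annotation strategy, assigning descriptors to points in a recording, can be pictured as a small index from descriptor to time offsets. This is a minimal sketch under stated assumptions: the `Recording` class, its method names, and the use of offsets in seconds are hypothetical and are not taken from the MALACH systems. The descriptor labels are borrowed from the cataloguing slides purely for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Recording:
    """Hypothetical container: descriptors attached to time offsets (seconds)."""

    recording_id: str
    # Maps each descriptor to a sorted list of offsets where it was assigned.
    access_points: dict = field(default_factory=lambda: defaultdict(list))

    def annotate(self, offset_s: float, descriptor: str) -> None:
        """Attach a descriptor at a point in the recording, keeping offsets sorted."""
        points = self.access_points[descriptor]
        points.insert(bisect_right(points, offset_s), offset_s)

    def find(self, descriptor: str) -> list:
        """Return the offsets (access points) where the descriptor was assigned."""
        return list(self.access_points.get(descriptor, []))


if __name__ == "__main__":
    rec = Recording("demo-interview")
    rec.annotate(120.0, "Employment")
    rec.annotate(30.0, "Employment")
    rec.annotate(300.0, "Schooling")
    print(rec.find("Employment"))
```

The same structure would serve the automatic strategy as well, with offsets produced by speech processing instead of a human cataloguer.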

  18. English ASR Error Rate Training: 65 hours (acoustic model)/200 hours (language model)
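The slide does not say which error measure it plots. Word error rate (WER) is the usual measure for speech recognition, so the sketch below assumes that is what is meant. It computes WER as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (sub + del + ins) divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences: one substitution and one deletion against six words.
    print(word_error_rate("we moved to dresden in 1939", "we move to dresden 1939"))
```

Because insertions count as errors, WER can exceed 1.0 when the recognizer emits many more words than the reference contains.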

  19. Effect of ASR Errors

  20. Building a Test Collection
      • Overall relevance: assessment is informed by the assessments for the individual reasons for relevance (categories of relevance), but the relationship is not straightforward
      • Provides direct evidence
      • Provides indirect / circumstantial evidence
      • Provides context (e.g., causes for the phenomenon of interest)
      • Provides comparison (similarity or contrast, same phenomenon in different environment, similar phenomenon)
      • Provides pointer to source of information

  21. Ammie Feijoo

  22. Some Statistics • 2,000 U.S. radio stations Webcasting • 250,000 hours of oral history in British Library • 35,000,000 audio streams on the Web

  23. Spoken Word Collections
      • Broadcast programming: news, interview, talk radio, sports, entertainment
      • Scripted stories: books on tape, poetry reading, theater
      • Spontaneous storytelling: oral history, folklore
      • Incidental recording: speeches, oral arguments, meetings, phone calls

  24. Building a Web of Spoken Words
      • Affordable storage: for $1, you can store 1.5 million spoken words
      • Adequate network capacity: Internet capacity of 30 million simultaneous programs
      • Works with any modem: you can even read email while playing audio
      • Replay capabilities: 38% of US users recently used streaming audio
      • Effective search capabilities: not quite yet …

  25. Looking Forward: 2006 • Working systems in five languages • Real users searching real data • Rich experience beyond broadcast news • Frameworks, components, systems • Affordable application-tuned systems • Oral history, lectures, speeches, meetings, …

  26. For More Information
      • The MALACH project: http://www.clsp.jhu.edu/research/malach/
      • NSF/EU Spoken Word Access Group: http://www.dcs.shef.ac.uk/spandh/projects/swag/
      • Speech-based retrieval: http://www.glue.umd.edu/~dlrg/speech/
