Cross-Language Access to Recorded Speech in the MALACH Project

Presentation Transcript


  1. Cross-Language Access to Recorded Speech in the MALACH Project • Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne, Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka

  2. Outline • The MALACH project • Searching speech • A cross-language retrieval experiment • Next steps

  3. The MALACH Project • 52,000 interviews with Holocaust survivors • 116,000 hours (180 TB MPEG-1) • 32 languages, recorded in 67 countries • Present: Manual indexing • 14,000 controlled vocabulary terms • Future: Automatic indexing • Speech recognition • Translation

  4. Who Uses the Collection? • Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement • Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use • Based on analysis of 280 access requests

  5. Research Challenges • Speech Recognition • Spontaneous, accented, elderly, language switching • Computational Linguistics • Segmentation, classification, summarization, extraction • Information Retrieval • Query formulation, search, selection, examination, use • Slide annotations: "Today"; "Tomorrow (Josef Psutka)"

  6. Supporting Information Access • [Flow diagram] Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery • Items passed between stages: Query, Ranked List, Recording • Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection • The Search stage is labeled "Search System"

  7. Key Issues in Speech Retrieval • Recognition accuracy • Content-based retrieval works when WER<40% • Topic segmentation • Average MALACH interview is 2.3 hours! • Multi-scale summarization • Brief summaries: selection from a ranked list • Detailed summaries: minimize audio replay

  8. English Recognition Accuracy • 60% WER for off-the-shelf systems! • 3 systems (broadcast news, dictation, telephone) • MLLR adaptation helps • 33% WER for fluent speech • 46% WER for heavy accents/disfluent speech • Next step: retrain on transcribed interviews • 200 hours from 800 speakers
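Note on the WER figures in slides 7 and 8: word error rate is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The slides do not say how scoring was done, so the following is only a minimal sketch of that conventional definition; the function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: (substitutions + deletions + insertions) / reference length.

    Assumption: both transcripts are already normalized and whitespace-tokenized.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic program).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a transcript with WER below the 40% threshold quoted in slide 7 has fewer than four word errors per ten reference words. WER can exceed 100% when the hypothesis contains many insertions.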

  9. Cross-Language Search • Query formulation • Spoken words (free text) • Thesaurus descriptors • Segment selection • Speech-to-text translation • Multi-scale indicative summaries • Use of retrieved segments • Query reformulation • Incorporation in projects

  10. Ranked Retrieval System Design • [Block diagram] Inputs: Documents, Query, Translation Lexicon • Steps: Compute Term Weights (documents), Compute Term Weights (query), Build Index, Compute Document Score, Sort Scores • Output: Ranked List
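Note on slide 10: the pipeline of term weighting, index building, document scoring, and sorting can be sketched as below. The slide does not name a weighting scheme, so the TF-IDF weights, the unit query-term weights, and the names `build_index` and `rank` are all assumptions chosen for illustration, not the system's actual design. Query terms are assumed to have been translated into the document language already.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: weight}.

    docs maps doc_id -> text. Weights are a simple TF-IDF variant
    (an assumption; the slide does not specify the weighting).
    """
    term_counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)

    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, count in counts.items():
            index[term][doc_id] = count * math.log(1 + n_docs / doc_freq[term])
    return index


def rank(index, query_terms):
    """Score each document by summing the weights of matching query terms,
    then sort by descending score (ties broken by doc_id)."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight  # query-term weight is taken as 1 here
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "d1": "berlin architecture berlin",
    "d2": "berlin travel",
    "d3": "violin music",
}
ranked = rank(build_index(docs), ["architecture", "berlin"])
```

In this toy collection `d1` ranks above `d2` because it matches both query terms, and `d3` does not appear in the ranked list at all because it shares no term with the query.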

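  11. Evaluation Framework • [Block diagram] Inputs to Ranked Retrieval: Czech Queries, Czech/English Translation Lexicon, English Documents • Ranked Retrieval produces a Ranked List • Evaluation combines the Ranked List with Relevance Judgments • Output: Measure of Effectiveness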

  12. Czech/English Test Collection • 113,000 English newspaper stories • Two sets of 33 Czech queries • S: Very short (1-3 words) • L: Sentence-length • Human “ground truth” relevance judgments • Pooled assessment methodology (CLEF-2000)

  13. Translation Lexicon • Machine-readable dictionary • Lemmatized Czech query words • Looked each up in “PC Translator” • Bilingual term list • Downloaded 800 term pairs from Ergane • Retained untranslatable terms • Stripped diacritics to match proper names • Optionally, made minor corrections (by hand) • e.g., “afrika” to “africa”

  14. Example Query • Original Czech query (S) • Architektura v Berlíně • Word-by-word translation into English • architecture architecture • at below beneath by embattled in inside into on per under upon upstairs v within at below beneath by embattled in inside into on per under upon upstairs v within • berlin
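Note on slides 13 and 14: the word-by-word translation step (look up each lemmatized query word, keep words with no translation, strip diacritics so proper names can match) might look roughly like the sketch below. The lexicon here is a tiny hypothetical stand-in, not the "PC Translator" or Ergane data; lemmatization is assumed to have happened upstream; and whether lookup used the accented or the stripped form is not stated in the slides.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining accents, e.g. 'berlín' -> 'berlin'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def translate_query(lemmas, lexicon):
    """Replace each lemma with all of its lexicon translations.

    Lemmas with no lexicon entry are retained, with diacritics stripped,
    so that proper names can still match English documents.
    """
    english_terms = []
    for lemma in lemmas:
        key = lemma.lower()
        translations = lexicon.get(key)
        if translations:
            english_terms.extend(translations)
        else:
            english_terms.append(strip_diacritics(key))
    return english_terms


# Hypothetical toy lexicon for illustration only.
TOY_LEXICON = {
    "architektura": ["architecture"],
    "v": ["in", "at", "within"],
}

translated = translate_query(["architektura", "v", "Berlín"], TOY_LEXICON)
```

Including every translation of a function word such as "v" inflates the query, which is visible in the long preposition list on slide 14. Stripping diacritics alone does not repair spelling differences such as "afrika" versus "africa", which is why slide 13 mentions optional hand corrections.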

  15. Example Search Results • Creating a new architectural vocabulary for a democratic Berlin • UCLA merges architecture and arts into a new school • Best of Berlin for young travelers • Who owns the Nazi paper trail? • A commitment to change the world; No place like utopia: Modern Architecture and the Company we Kept … • On the record: Sanderling's dark take on Sibelius • Max Bill, 85; Controversial Swiss artist, sculptor and writer • The week ahead: Berlin; Farewell to allies • Roll over Beethoven; Jeff Berlin leaves the violin and classical … • Californians had right stuff for airlift; Europe: former pilots …

  16. Precision-Recall Graph • Average Precision = 0.477 • Czech title query 1, LA Times documents, CLEF 2000 relevance assessments

  17. Mean Average Precision = 0.188 • Chart label: Average Precision 0.477 • Czech title queries, LA Times documents, CLEF 2000 relevance assessments
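Note on slides 16 and 17: average precision summarizes one query's ranked list, and mean average precision averages that value over all queries. The sketch below uses the common uninterpolated definition; the slides do not state which variant was used, so treat this as an assumption, and the function names as illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one query.

    Precision is sampled at the rank of each relevant document retrieved,
    and the sum is divided by the total number of relevant documents, so
    relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)
```

A single well-served query can score far above the collection-wide mean, which is consistent with query 1 reaching 0.477 while the mean over all title queries is 0.188.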

  18. Results

  19. Results • Czech seems to pose no unusual problems • 55% of monolingual effectiveness with simple techniques • Suitable Czech/English resources exist • Czech morphology • Czech/English bilingual lexicon • Multiword expression handling would help • Named entities, non-compositional phrases

  20. Some Next Steps • Integrate Czech/English statistical MT • Johns Hopkins (Summer 2002 Workshop) • Integrate with English and Czech ASR • IBM and Univ of West Bohemia/Charles Univ • Integrate into an interactive retrieval system • University of Maryland and Shoah Foundation

  21. For More Information • Cross-language and speech retrieval • http://www.clis.umd.edu/~dlrg/clir/ • http://www.clis.umd.edu/~dlrg/speech/ • The MALACH project • http://www.clsp.jhu.edu/research/malach/ • NSF/EU Spoken Word Access Working Group • http://www.dcs.shef.ac.uk/spandh/projects/swag/
