Cross-Language Access to Recorded Speech in the MALACH Project
Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne, Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka
Outline
• The MALACH project
• Searching speech
• A cross-language retrieval experiment
• Next steps
The MALACH Project
• 52,000 interviews with Holocaust survivors
  • 116,000 hours (180 TB MPEG-1)
  • 32 languages, recorded in 67 countries
• Present: Manual indexing
  • 14,000 controlled vocabulary terms
• Future: Automatic indexing
  • Speech recognition
  • Translation
Who Uses the Collection?
• Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement
• Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use
Based on analysis of 280 access requests
Research Challenges
• Speech Recognition
  • Spontaneous, accented, elderly, language switching
• Computational Linguistics
  • Segmentation, classification, summarization, extraction
• Information Retrieval
  • Query formulation, search, selection, examination, use
Slide labels: Tomorrow (Josef Psutka); Today
Supporting Information Access
[Diagram: Source Selection -> Query Formulation -> Query -> Search (Search System) -> Ranked List -> Selection -> Recording -> Examination -> Recording -> Delivery; feedback loops: Query Reformulation and Relevance Feedback, Source Reselection]
Key Issues in Speech Retrieval
• Recognition accuracy
  • Content-based retrieval works when WER < 40%
• Topic segmentation
  • Average MALACH interview is 2.3 hours!
• Multi-scale summarization
  • Brief summaries: selection from a ranked list
  • Detailed summaries: minimize audio replay
English Recognition Accuracy
• 60% WER for off-the-shelf systems!
  • 3 systems (broadcast news, dictation, telephone)
• MLLR adaptation helps
  • 33% WER for fluent speech
  • 46% WER for heavy accents/disfluent speech
• Next step: retrain on transcribed interviews
  • 200 hours from 800 speakers
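The WER figures above are word error rates. A minimal sketch of how such a figure is typically computed (word-level edit distance divided by the number of reference words), assuming whitespace tokenization and equal cost for substitutions, insertions, and deletions. The function name and the sample sentences are illustrative and are not taken from the MALACH data or from the systems named on the slide.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) transcript pair: one substitution, one deletion.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs many extra words.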
Cross-Language Search
• Query formulation
  • Spoken words (free text)
  • Thesaurus descriptors
• Segment selection
  • Speech-to-text translation
  • Multi-scale indicative summaries
• Use of retrieved segments
  • Query reformulation
  • Incorporation in projects
Ranked Retrieval System Design
[Diagram: Documents -> Compute Term Weights -> Build Index; Query -> Compute Term Weights (using the Translation Lexicon); both feed Compute Document Score -> Sort Scores -> Ranked List]
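The design above can be sketched in a few lines. This is only an illustration under stated assumptions, not the system used in the experiment: the TF-IDF weighting formula, the uniform weight given to every translated query term, and all names (`build_index`, `translate`, `rank`) and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Compute term weights per document and build an inverted index.

    Returns {term: {doc_id: weight}} using a simple TF-IDF weight
    (an assumed formula, chosen only for illustration).
    """
    n_docs = len(docs)
    term_freqs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    doc_freq = Counter(t for tf in term_freqs.values() for t in tf)
    index = defaultdict(dict)
    for doc_id, tf in term_freqs.items():
        for term, freq in tf.items():
            idf = math.log(1 + n_docs / doc_freq[term])
            index[term][doc_id] = (1 + math.log(freq)) * idf
    return index


def translate(query, lexicon):
    """Replace each query word by all of its translations.

    Words missing from the lexicon are passed through unchanged.
    """
    terms = []
    for word in query.lower().split():
        terms.extend(lexicon.get(word, [word]))
    return terms


def rank(query, lexicon, index):
    """Score documents by summed term weights and sort best-first."""
    scores = defaultdict(float)
    for term in translate(query, lexicon):
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for this sketch.
docs = {
    "d1": "new architecture for berlin",
    "d2": "berlin travel guide",
    "d3": "football results",
}
lexicon = {"architektura": ["architecture"], "v": ["in", "at"]}
for doc_id, score in rank("architektura v berlin", lexicon, build_index(docs)):
    print(doc_id, round(score, 3))
```

Documents that match no translated query term receive no score and are left out of the ranked list.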
Evaluation Framework
[Diagram: Czech Queries, the Czech/English Translation Lexicon, and English Documents feed Ranked Retrieval -> Ranked List -> Evaluation (using Relevance Judgments) -> Measure of Effectiveness]
Czech/English Test Collection
• 113,000 English newspaper stories
• Two sets of 33 Czech queries
  • S: Very short (1-3 words)
  • L: Sentence-length
• Human “ground truth” relevance judgments
  • Pooled assessment methodology (CLEF-2000)
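In a pooled assessment, assessors typically judge only the union of the top-ranked documents returned by the participating systems for each query, and unjudged documents are treated as not relevant. A minimal sketch of pool construction follows; the pool depth, the system names, and the document IDs are assumptions made for illustration, not CLEF-2000 parameters.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from every system's ranked list."""
    pool = set()
    for ranked_list in runs.values():
        pool.update(ranked_list[:depth])
    return pool


# Hypothetical ranked lists from two systems for a single query.
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d5"],
}
print(sorted(build_pool(runs, depth=2)))
```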
Translation Lexicon
• Machine-readable dictionary
  • Lemmatized Czech query words
  • Looked each up in “PC Translator”
• Bilingual term list
  • Downloaded 800 term pairs from Ergane
• Retained untranslatable terms
  • Stripped diacritics to match proper names
• Optionally, made minor corrections (by hand)
  • e.g., “afrika” to “africa”
Example Query
• Original Czech query (S)
  • Architektura v Berlíně
• Word-by-word translation into English
  • architecture architecture
  • at below beneath by embattled in inside into on per under upon upstairs v within at below beneath by embattled in inside into on per under upon upstairs v within
  • berlin
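The lexicon lookup described above (translate each lemma word by word, keep untranslatable terms, strip their diacritics, optionally apply hand corrections) can be sketched as follows. The input is assumed to be already lemmatized, since the Czech morphological analysis is outside this sketch. The function names and the tiny lexicon are hypothetical and stand in for the PC Translator and Ergane resources.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, e.g. 'berlín' -> 'berlin'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def translate_lemmas(lemmas, lexicon, corrections=None):
    """Word-by-word translation that retains untranslatable terms.

    lexicon:     {czech_lemma: [english alternatives]}
    corrections: optional hand-made fixes applied to retained terms
    """
    corrections = corrections or {}
    english = []
    for lemma in lemmas:
        key = lemma.lower()
        if key in lexicon:
            # Keep every alternative; no disambiguation is attempted.
            english.extend(lexicon[key])
        else:
            kept = strip_diacritics(key)
            english.append(corrections.get(kept, kept))
    return english


# Toy lexicon, invented for this sketch.
lexicon = {"architektura": ["architecture"], "v": ["at", "in", "within"]}
print(translate_lemmas(["architektura", "v", "Berlín"], lexicon))
print(translate_lemmas(["Afrika"], lexicon, corrections={"afrika": "africa"}))
```

Keeping every alternative for a common word such as "v" inflates the query with function words, which is one reason term weighting matters in the retrieval step.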
Example Search Results
• Creating a new architectural vocabulary for a democratic Berlin
• UCLA merges architecture and arts into a new school
• Best of Berlin for young travelers
• Who owns the Nazi paper trail?
• A commitment to change the world; No place like utopia: Modern Architecture and the Company we Kept …
• On the record: Sanderling's dark take on Sibelius
• Max Bill, 85; Controversial Swiss artist, sculptor and writer
• The week ahead: Berlin; Farewell to allies
• Roll over Beethoven; Jeff Berlin leaves the violin and classical …
• Californians had right stuff for airlift; Europe: former pilots …
Precision-Recall Graph
[Figure: precision-recall curve for Czech title query 1, LA Times documents, CLEF 2000 relevance assessments; Average Precision = 0.477]
Mean Average Precision = 0.188
[Figure: average precision for the Czech title queries, LA Times documents, CLEF 2000 relevance assessments; labeled value: Average Precision 0.477]
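For readers unfamiliar with the two measures, a minimal sketch follows, assuming the usual uninterpolated definition: average precision is the mean of the precision values at the rank of each relevant document (relevant documents never retrieved contribute zero), and mean average precision is the mean of that value over all queries. The function names and the toy rankings are illustrative and do not reproduce the 0.477 or 0.188 figures.

```python
def average_precision(ranked, relevant):
    """Uninterpolated average precision for one query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Divide by all relevant documents, retrieved or not.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy rankings, invented for this sketch.
runs = [
    (["a", "x", "b"], {"a", "b"}),
    (["x", "a"], {"a"}),
]
print(round(mean_average_precision(runs), 3))
```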
Results
• Czech seems to pose no unusual problems
  • 55% of monolingual with simple techniques
• Suitable Czech/English resources exist
  • Czech morphology
  • Czech/English bilingual lexicon
• Multiword expression handling would help
  • Named entities, non-compositional phrases
Some Next Steps
• Integrate Czech/English statistical MT
  • Johns Hopkins (Summer 2002 Workshop)
• Integrate with English and Czech ASR
  • IBM and Univ of West Bohemia/Charles Univ
• Integrate into an interactive retrieval system
  • University of Maryland and Shoah Foundation
For More Information
• Cross-language and speech retrieval
  • http://www.clis.umd.edu/~dlrg/clir/
  • http://www.clis.umd.edu/~dlrg/speech/
• The MALACH project
  • http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Working Group
  • http://www.dcs.shef.ac.uk/spandh/projects/swag/