Combining Multiple Models for Speech Information Retrieval
Muath Alzghool and Diana Inkpen, University of Ottawa, Canada
LREC 2008
Presentation Outline • Task: Speech Information Retrieval. • Data: the Malach collection (Oard et al., 2004). • System description. • Model fusion. • Experiments using model fusion. • Results of the cross-language experiments. • Results of manual keywords and summaries. • Conclusion and future work.
The Malach collection • Used in the Cross-Language Speech Retrieval (CLSR) task at the Cross-Language Evaluation Forum (CLEF) 2007. • 8104 "documents" (segments) from 272 interviews with Holocaust survivors, totaling 589 hours of speech: ASR transcripts with a word error rate of 25-38%. • Additional metadata: automatically-assigned keywords, manually-assigned keywords, and a manual 3-sentence summary. • A set of 63 training topics and 33 test topics, created in English from actual user requests and translated into Czech, German, French, and Spanish by native speakers. • Relevance judgments were generated using standard pooling.
Segments
Example topic (English)
Example topic (French)
System Description • SMART: Vector Space Model (VSM). • Terrier: Divergence from Randomness models (DFR). • Two query expansion methods: • Based on a thesaurus (novel technique). • Blind relevance feedback (12 terms from the top 15 documents), based on the Bose-Einstein 1 model (Bo1 from Terrier). • Model fusion: sum of normalized weighted similarity scores (novel way to compute the weights). • Combined output of 7 machine translation tools.
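The blind relevance feedback step above (adding 12 terms drawn from the top 15 retrieved documents) can be sketched as follows. This is a simplification: the real system uses Terrier's Bo1 Bose-Einstein term weighting to rank candidate terms, for which plain in-document frequency stands in here.

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_terms=12):
    """Blind relevance feedback (simplified): append the n_terms most
    frequent terms from the top-ranked documents to the query.
    top_docs is a list of tokenized documents (lists of terms)."""
    counts = Counter()
    for doc in top_docs:
        # Count only terms not already in the query.
        counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]
```

In the actual system the 15 top documents come from a first retrieval pass, and the expanded query is then re-run against the collection.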
Model Fusion • Combine the results of different retrieval strategies from SMART (14 runs) and Terrier (1 run). • Each technique retrieves a different set of relevant documents, so combining the results can produce a better ranking than any individual technique.
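The fusion scheme (sum of normalized weighted similarity scores) can be sketched as below, assuming min-max normalization per run and per-run weights supplied by the caller; the paper's novel contribution is how those weights are computed from training-data performance, which is not reproduced here.

```python
def minmax_normalize(scores):
    """Scale one run's scores into [0, 1] so runs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def fuse(runs, weights):
    """Weighted sum of normalized scores across runs (CombSUM-style).
    runs: list of {doc_id: score} dicts; weights: one weight per run.
    Returns (doc_id, fused_score) pairs, best first."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, s in minmax_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

A document retrieved by several runs accumulates score from each, which is why fusion can rank relevant documents above what any single run achieves.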
Experiments using Model Fusion • Applied the data fusion methods to 14 runs produced by SMART and one run produced by Terrier. • % change is given with respect to the better-performing run in each combination on the training data. • Model fusion improves performance (MAP and Recall) on the test data. • Monolingual (English): 6.5% improvement (not statistically significant). • Cross-language experiments (French): 21.7% improvement (significant).
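MAP, the metric behind the improvement figures above, is the mean over topics of average precision; a minimal reference implementation:

```python
def average_precision(ranked, relevant):
    """AP for one topic: mean of precision@k at each rank k where a
    relevant document appears, over all relevant documents."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(ranked_by_topic, qrels):
    """MAP: average AP across all judged topics."""
    return sum(average_precision(ranked_by_topic[t], qrels[t])
               for t in qrels) / len(qrels)
```

Because AP rewards relevant documents ranked early, a fused run that promotes documents found by several component runs tends to raise MAP even when total recall is unchanged.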
Experiments using Model Fusion (MAP)
Experiments using Model Fusion (Recall)
Results of the cross-language experiments • The cross-language results for French are very close to the monolingual (English) results on the training data (the difference is not significant), but not on the test data (the difference is significant). • For Spanish, the difference from monolingual (English) is significant on the training data but not on the test data.
Results of manual keywords and summaries • Experiments on manual keywords and manual summaries showed large improvements compared to the Auto-English runs. • Our results (for manual and automatic runs) are the highest to date on this data collection in CLEF/CLSR.
Conclusion and future work • Model fusion improves retrieval significantly for some experiments (Auto-French) and not significantly for others (Auto-English). • Using multiple translations proved beneficial (based on previous experiments). • Future work: • Investigate more methods of model fusion. • Remove or correct some of the speech recognition errors in the content words of the ASR transcripts.