
13.0 Voice-based Information Retrieval


Presentation Transcript


  1. 13.0 Voice-based Information Retrieval
References:
1. "Speech and Language Technologies for Audio Indexing and Retrieval", Proceedings of the IEEE, Aug. 2000
2. "Discriminating Capabilities of Syllable-based Features and Approaches of Utilizing Them for Voice Retrieval of Speech Information in Mandarin Chinese", IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, July 2002, pp. 303-314
3. Baeza-Yates & Ribeiro-Neto, "Modern Information Retrieval", ACM Press, 1999
4. ACM Special Interest Group on Information Retrieval, http://www.acm.org/sigir
5. "A Hidden Markov Model Information Retrieval System", ACM SIGIR, 1999
6. "Improved Spoken Document Retrieval with Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis (PLSA)", International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006
7. "Position Specific Posterior Lattices for Indexing Speech", 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005, pp. 443-450

  2. Voice-enabled Web-based Applications
[Diagram: Public Information and Services / Private and Personal Services connected through Future Networks — text information, voice information, text-to-speech synthesis, voice input/output, spoken dialogue, voice-based information retrieval]
• Network Access is Primarily Text-based Today, but almost all Roles of Texts can be Replaced by Voice in the Future
• Human-Network Interactions can be Accomplished by Spoken Dialogues
• Voice-based Information Retrieval needs to be Integrated with Spoken Dialogues
• More Multimedia Information including Voice but not including Enough Text will be Available on the Web in the Future

  3. Voice-based Information Retrieval
[Diagram: Voice Queries / Text Queries, e.g. 我想找有關紐約受到恐怖攻擊的新聞? (I want to find news about the terrorist attack on New York.) matched against Text Information (d1, d2, d3) and Voice Information (d1, d2, d3), e.g. 美國總統布希今天早上… (U.S. President Bush this morning…)]
• Speech/Text Queries, Speech/Text Documents
• Mobile/Office User Environments with Multi-modality
• Speech Provides Better User Interface in Wireless Environment

  4. Information Retrieval
Indexing: document representation d
Query formation: user request representation q
Retrieval: matching query to documents, returning a list of relevant documents in order
Relevance feedback: assessing retrieved results, modifying the initial query; iterated retrieval: automatic (blind)/manual
Performance evaluation: performance measure
[Diagram: documents → Indexing → d; user request → Query Formation → q; Retrieval → list of relevant documents in order → Feedback; Evaluation → performance]

  5. Performance Measures
Recall and Precision Rates
Precision rate = A / (A + B), Recall rate = A / (A + C)
(A: retrieved and relevant documents; B: retrieved but irrelevant documents; C: relevant but not retrieved documents)
Non-Interpolated Average Precision
- averaged at all relevant documents retrieved and over all queries
- e.g. relevant documents ranked at 1, 5, 10; precisions are 1/1, 2/5, 3/10; non-interpolated average precision = (1/1 + 2/5 + 3/10) / 3
• similar to missing/false alarm rates
• recall-precision plot similar to ROC curves
• recall rate may be difficult to evaluate, while precision rate is directly perceived by users
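
The slide's worked example of non-interpolated average precision can be computed directly; a minimal sketch (function and variable names are illustrative, not from the slides):

```python
def average_precision(ranked_relevance):
    """Non-interpolated average precision for one query.
    ranked_relevance: list of 0/1 relevance flags, index i = rank i+1."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant documents ranked at 1, 5, 10, as in the slide's example:
flags = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(round(average_precision(flags), 4))  # (1/1 + 2/5 + 3/10) / 3 -> 0.5667
```

Averaging this quantity over all queries gives the mean average precision commonly reported for retrieval systems.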

  6. Approaches to Speech-based Information Retrieval
Indexing Elements
- Words: Large-vocabulary Based
  create text transcriptions of spoken documents/queries by speech recognition; use text retrieval methods; error propagation, out-of-vocabulary (OOV) problems, special terms
- Subword Units: Subword Based
  subword units: phones/syllables/something similar; a segment of one to a few subword units may carry some indexing information; not limited by the vocabulary; small size/handles some OOV/probably more ambiguity
- Keywords: Keyword Based
  based on a set of keywords; keyword selection: user-specified/a priori/fixed/automatically generated; special terms for dynamic documents
- Hybrid: Fusion of Information
• Indexing Features
  a single element, or different combinations of more than one element; pre-defined, or automatically selected by data-driven approaches; each such feature is called an "indexing term"
• Retrieval Model Examples
  vector space models; latent semantic indexing (LSI); statistical (probabilistic) models; hidden Markov model (HMM); combinations/hybrid models

  7. Vector Space Model
Vector Representations of query q and document d
- for each type j of indexing feature a vector is generated
- each component of this vector is the weighted statistic z_jt of a specific indexing term t, e.g. z_jt = c_t · IDF_t (Term Frequency × Inverse Document Frequency)
- c_t (Term Frequency, TF): frequency count of the indexing term t present in the query q or document d (for text), or the sum of normalized recognition scores or confidence measures for the indexing term t (for speech)
- IDF_t = log(N / N_t) (Inverse Document Frequency): the significance (or importance) or indexing power of the indexing term t
  N: total number of documents in the database
  N_t: total number of documents in the database which include the indexing term t
The Overall Relevance Score is the Weighted Sum of the Relevance Scores for all Types of Indexing Features
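
For a single type of indexing feature, the TF-IDF weighting and cosine relevance scoring above can be sketched as follows (the toy collection and all names are made up for illustration):

```python
import math
from collections import Counter

def tfidf_vector(term_counts, doc_freq, n_docs):
    """Weights z_t = c_t * IDF_t with IDF_t = log(N / N_t); terms absent
    from the collection are dropped (an illustrative choice)."""
    return {t: c * math.log(n_docs / doc_freq[t])
            for t, c in term_counts.items() if doc_freq.get(t)}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

docs = [["speech", "retrieval", "speech"], ["text", "retrieval"], ["image"]]
df = Counter(t for d in docs for t in set(d))        # document frequencies N_t
vecs = [tfidf_vector(Counter(d), df, len(docs)) for d in docs]
query = tfidf_vector(Counter(["speech", "retrieval"]), df, len(docs))
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(query, vecs[i]), reverse=True)
print(ranking)  # [0, 1, 2]
```

For speech, the raw counts `c_t` would be replaced by summed recognition confidence scores, and the per-feature-type scores would then be combined by a weighted sum.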

  8. Improved Retrieval Technique Examples
Blind Relevance Feedback
- the information from the relevant and irrelevant documents retrieved in the previous stage is used to identify more helpful indexing terms
- the initial query is reformulated accordingly:
  q' = α · q + β · Σ_{d ∈ Dr} d − γ · Σ_{d ∈ Dirr} d
  q, d: vector representations of the query and documents
  Dr: selected set of relevant documents retrieved in the previous stage
  Dirr: selected set of irrelevant documents deleted in the previous stage
  q': new query representation
  α, β, γ: weighting coefficients
Query Expansion by Term Association
- indexing terms co-occurring frequently in the same documents are assumed to have some synonymity association
- build an association matrix for each type of indexing feature, in which each entry (i, j) stands for the association between indexing terms t_i and t_j
- reformulate the query expression by adding indexing terms with higher synonymity
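
The feedback reformulation above is the classic Rocchio update; a minimal sketch over sparse vectors (the default coefficients are a common textbook choice, not values from the slides):

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Blind relevance feedback: q' = a*q + b*sum(Dr) - g*sum(Dirr).
    Vectors are dicts mapping term -> weight."""
    new_q = {t: alpha * w for t, w in query.items()}
    for d in relevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for d in irrelevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w
    # negative weights are usually clipped to zero
    return {t: w for t, w in new_q.items() if w > 0}

q = rocchio({"speech": 1.0},
            [{"speech": 1.0, "lattice": 2.0}],   # Dr
            [{"image": 4.0}])                    # Dirr
print(q)  # {'speech': 1.75, 'lattice': 1.5}
```

Note how `lattice`, absent from the initial query, is pulled in from the relevant set — exactly the "more helpful indexing terms" effect the slide describes.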

  9. Difficulties in Speech-based Information Retrieval for Chinese Language
Even for Text-based Information Retrieval, Flexible Wording Structure Makes it Difficult to Search by Comparing the Character Strings Alone
- name/title: 李登輝 → 李前總統登輝, 李前主席登輝 (President T.H. Lee)
- arbitrary abbreviation: 北二高 → 北部第二高速公路 (Second Northern Freeway)
- similar phrases: 中華文化 → 中國文化 (Chinese culture)
- translated terms: 巴塞隆那 → 巴瑟隆納 (Barcelona)
Word Segmentation Ambiguity Even for Text-based Information Retrieval
- 腦科 (human brain studies) → 電腦科學 (computer science)
- 土地公 (God of earth) → 土地公有政策 (policy of public sharing of the land)
Uncertainties in Speech Recognition
- errors (deletion, substitution, insertion)
- out-of-vocabulary (OOV) words, etc.; very often the key phrases for retrieval are OOV

  10. Syllable-Level Indexing Features for Chinese Language
A Whole Class of Syllable-Level Indexing Features with Complete Phonological Coverage and Better Discriminating Functions
- Overlapping syllable segments with length N: S(N)
- Syllable pairs separated by M syllables: P(M)
[Diagram: syllable sequence S1 S2 S3 S4 S5 … S10 with S(N) for N = 1, 2, 3 and P(M) for M = 1, 2 marked]
Character- or Word-Level Features can be Similarly Defined

  11. Syllable-Level Statistical Features
Single Syllables
- each syllable is usually shared by more than one character with different meanings, thus causing ambiguity
- all words are composed of syllables, thus partially handling the OOV problem
- very often relevant words have some syllables in common
Overlapping Syllable Segments with Length N
- capturing the information of polysyllabic words or phrases with flexible wording structures
- the majority of Chinese words are bi-syllabic
- not too many polysyllabic words share the same pronunciation
Syllable Pairs Separated by M Syllables
- tackling the problems arising from the flexible wording structure, abbreviations, and deletion, insertion, substitution errors in speech recognition
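
Extracting the two feature families S(N) and P(M) from a recognized syllable sequence is straightforward; a sketch assuming "separated by M syllables" means M intervening syllables (one possible reading of the slides):

```python
def syllable_segments(syls, n):
    """S(N): overlapping syllable segments of length n."""
    return [tuple(syls[i:i + n]) for i in range(len(syls) - n + 1)]

def syllable_pairs(syls, m):
    """P(M): syllable pairs with m syllables in between (assumption)."""
    return [(syls[i], syls[i + m + 1]) for i in range(len(syls) - m - 1)]

seq = ["s1", "s2", "s3", "s4"]
print(syllable_segments(seq, 2))  # [('s1', 's2'), ('s2', 's3'), ('s3', 's4')]
print(syllable_pairs(seq, 1))     # [('s1', 's3'), ('s2', 's4')]
```

The counts of these tuples then play the role of the term frequencies `c_t` in the vector space model, one vector per feature type (each N and each M).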

  12. Improved Syllable-level Indexing Features
Syllable Lattice and Syllable-level Utterance Verification
- including multiple syllable hypotheses to construct syllable-aligned lattices for both query and documents
- generating multiple syllable-level indexing features from syllable lattices
- filtering out indexing terms with lower acoustic confidence scores
Infrequent Term Deletion (ITD)
- syllable-level statistics trained with a text corpus used to prune infrequent indexing terms
Stop Terms (ST)
- indexing terms with the lowest IDF scores are taken as the stop terms
[Diagram: syllable lattice with syllables of higher/lower acoustic confidence scores; syllable pairs S(N), N=2 pruned by ITD and by ST]

  13. Hidden Markov Model (HMM) for Speech-based Information Retrieval
• Modeling the Query q as a Sequence of Input Observations (Indexing Terms), q = t1 t2 ... tn ... tN, and each Document d as an HMM (1-state at the moment) Composed of Distributions of N-gram Parameters
• MAP Principle (as a simple example)
  d* = arg max_d P(d is R | q) ∝ P(q | d is R) P(d is R)
  ("is R": is relevant; q: input query; d: all documents in the database; reduced to maximum likelihood without prior knowledge)
• Observation Probability in the HMM State (as a simple example)
  P(q | d is R) = Π_n [ m1 P(tn | d) + m2 P(tn | C) + m3 P(tn | tn−1, d) + m4 P(tn | tn−1, C) ], m1 + m2 + m3 + m4 = 1
• m1, m2, m3, m4 trained by EM/MCE
• P(tn | d), P(tn | tn−1, d): unigram/bigram trained from the document d
• P(tn | C), P(tn | tn−1, C): unigram/bigram trained from a large corpus, especially helpful for missing terms in the documents
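
The document-HMM score can be sketched with just the unigram part of the mixture (setting m3 = m4 = 0 for brevity — a simplification of the slide's model; the toy corpus and documents are made up):

```python
import math
from collections import Counter

def hmm_score(query_terms, doc_terms, corpus_unigram, m1=0.5, m2=0.5):
    """Log P(q|d) for a 1-state document HMM, unigram part only:
        P(tn) = m1 * P(tn|d) + m2 * P(tn|C)
    The corpus model P(tn|C) smooths terms missing from the document."""
    doc_counts = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / len(doc_terms)
        p_cor = corpus_unigram.get(t, 0.0)
        score += math.log(m1 * p_doc + m2 * p_cor)
    return score

corpus = {"speech": 0.2, "lattice": 0.1, "image": 0.1}
query = ["speech", "lattice"]
d1 = ["speech", "lattice", "speech"]   # on-topic document
d2 = ["image", "image"]                # off-topic, scored via corpus smoothing
print(hmm_score(query, d1, corpus) > hmm_score(query, d2, corpus))  # True
```

Ranking documents by this score implements the maximum-likelihood reduction of the MAP principle above; the full model would add the document and corpus bigram terms with weights m3 and m4.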

  14. Latent Semantic Indexing (LSI) Model for Speech-based Information Retrieval
• Term-Document Matrix W = [w_ij]
  - M indexing terms {t1, t2, ..., tM} and N documents {d1, d2, ..., dN}
  - w_ij = l_ij · g_i; l_ij: local weight, g_i: global weight (e.g. normalized with document length and term entropy)
• Singular Value Decomposition (SVD)
  - W = U S Vᵀ, S diagonal with singular values
  - reduced to R-dimensional space of "latent semantic concepts"
  - u_i S: term vector; v_i S: document vector
• Query q considered as a new document "folded in"
  - relevance score: similarity between the folded-in query vector and each document vector
• Concept Matching rather than Term Matching
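
The SVD reduction and query fold-in can be sketched with NumPy (the tiny term-document matrix and raw-count weighting are illustrative; the slide applies local/global weighting first):

```python
import numpy as np

# Toy term-document matrix W (M=4 terms x N=3 documents), raw counts as weights.
W = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = 2                                   # keep R latent "semantic concepts"
U_r, S_r, Vt_r = U[:, :R], np.diag(s[:R]), Vt[:R, :]

doc_vecs = (S_r @ Vt_r).T               # rows: document vectors v_i S

# Fold in the query as a pseudo-document: (q^T U_r S_r^{-1}) S_r = q^T U_r
q = np.array([1.0, 1.0, 0.0, 0.0])      # query containing terms t1 and t2
q_vec = q @ U_r

cos = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1)
                          * np.linalg.norm(q_vec))
print(int(np.argmax(cos)))              # 0: d1 shares t1, t2 with the query
```

Because similarity is computed in the reduced concept space, a query and a document can match even when they share few exact indexing terms — the "concept matching" the slide contrasts with term matching.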

  15. Speech-based Information Retrieval by Keywords — An Example
Automatic Keyword Extraction from Texts Integrated with Keyword Spotting
[Flowchart: Text Documents → Automatic Keyword Extraction → Extracted Keywords → Keyword Set → Indexing File; input speech query → Keyword Spotting → Spotted Keywords → Keyword-based Retrieval → Retrieved Text/Speech Documents; Speech Documents → Speech Recognition → transcription of speech documents]
• Integration with Other Approaches

  16. Voice-based Information Retrieval — how far are we from text-based information retrieval?
Lin-shan Lee, National Taiwan University, Taipei, Taiwan, ROC

  17. Introduction: Voice-based Information Retrieval

  18. Text/Voice-based Information Retrieval
Text-based Information Retrieval Extremely Successful
- information desired by the users can be obtained efficiently in real time
- all users like it
- producing very successful applications and industry
All Roles of Texts can be Accomplished by Voice
- spoken information or multimedia information with voice in the audio part
- voice instructions/queries via handheld devices
How about Voice-based Information Retrieval?
[Diagram: user instructions/queries ↔ Internet ↔ servers holding documents/information]

  19. Voice-based Information Retrieval (1/2)
[Diagram: Voice Instructions/Queries and Text Instructions/Queries, e.g. "Newly elected president of US?", against Text Information and Voice Information (multimedia including audio part)]
• If Voice Documents/Queries could be Accurately Recognized
  - voice-based reduced to text-based information retrieval
  - correct in principle, but perfect recognition is never possible

  20. Voice-based Information Retrieval (2/2)
[Diagram: Voice Instructions/Queries and Text Instructions/Queries, e.g. "Newly elected president of US?", against Text Information and Voice Information (multimedia including audio part)]
• User Instructions and/or Network Content Can be in the Form of Voice
  - text queries/spoken documents
  - spoken queries/text documents
  - spoken queries/spoken documents

  21. Text Queries/Spoken Documents
• Spoken Document Retrieval
  - started with longer documents/queries at relatively higher ASR accuracies
  - started with text-based approaches applied on 1-best transcriptions
  - inadequate for short documents/queries with relatively poor ASR accuracies
• Spoken Term Detection
  - emerged probably from the successful term matching paradigm for text-based approaches
  - considering multiple alternatives from ASR output (e.g. lattices) to handle ASR errors
  - different from the traditional task of Keyword Spotting in that the query set is open
[Chelba, Hazen, Saraclar, IEEE SPM 08][Vergyri, et al, Interspeech 07][Saraclar & Sproat, HLT 04][Mamou, et al, SIGIR 06][Chelba & Acero, ACL 05]

  22. Spoken Queries/Text Documents
• Voice Search
  - information to be retrieved exists in a large text database (e.g. directory assistance)
  - out-of-vocabulary (OOV) words in the database
  - disambiguated by dialogues
[Diagram: user query → ASR → Search over Database → n-best results → Dialogue Manager → Disambiguation]
[Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08][Yu, Wang, Acero, Interspeech 07]
• Spoken Query Processing
  - using a lattice of possible terms as the query
  - more semantic analysis performed during retrieval [Moreno-Daniel, Juang, Wilpon, ICASSP 07, Interspeech 08]

  23. Spoken Queries/Spoken Documents
• Uncertainty on Both Sides
• Query-by-Example [Chia, et al, SIGIR 08]
• Comparing Two Lattices of Queries/Documents by Graphical Model [Lin et al, Interspeech 08]

  24. Wireless and Multimedia Technologies are Creating an Environment for Voice-based Information Retrieval
[Diagram: Internet connecting text content and multimedia content; text-to-speech synthesis, multimedia content analysis, text information retrieval, voice-based information retrieval, spoken and multi-modal dialogue, voice input/output]
• Many Hand-held Devices with Multimedia Functionalities Commercially Available Today
• Unlimited Quantities of Multimedia Content Available over the Internet
• User-Content Interaction Necessary for Information Retrieval can be Accomplished by Spoken and Multi-modal Dialogues
• Network Access is Primarily Text-based Today, but almost all Roles of Texts can be Accomplished by Voice

  25. Why Is Text-based Information Retrieval Useful and Attractive? How about Voice-based Information Retrieval?
Resources
- Text-based: rich resources — huge quantities of text documents available over the Internet; quantity continues to increase exponentially due to convenient access
- Voice-based: spoken/multimedia content is the new trend; can be realized even sooner given mature technologies
Accuracy
- Text-based: retrieval accuracy acceptable to users; retrieved documents properly ranked and filtered
- Voice-based: problems with speech recognition errors, especially for spontaneous speech under adverse environments
User-System Interaction
- Text-based: retrieved documents easily summarized on-screen, thus easily scanned and selected by user; users may easily select query terms suggested for next-iteration retrieval in an interactive process
- Voice-based: spoken/multimedia documents not easily summarized on-screen, thus difficult to scan and select; lacks efficient user-system interaction

  26. Accuracy for Voice-based Information Retrieval

  27. Accuracy for Voice-based Information Retrieval
Low Recognition Accuracies for Spontaneous Speech including Out-of-Vocabulary (OOV) Words under Adverse Environments
- considering lattices with multiple alternatives rather than 1-best output
- higher probability of including correct words, but also including more noisy words
- correct words may still be excluded (OOV and others)
- huge memory and computation requirements
- other approaches: confusion matrix, fuzzy matching, …
[Diagram: word lattice from start node to end node over a time index; Wi: word hypotheses]
[Mamou & Ramabhadran, Interspeech 08]

  28. Efficient Forms of Lattices for Indexing Purposes – Indexing Structures
Lattices as an Example of an Indexing Structure
- reduced memory and computation requirements (still huge…)
- added possible paths
- noisy words discriminated by posterior probabilities or similar scores
- n-grams matched and accumulated for all possible n
[Diagram: word lattice from start node to end node over a time index]

  29. Examples of Indexing Structures
• Position Specific Posterior Lattices (PSPL) [Chelba & Acero, ACL 05]
• Confusion Networks (CN) [Mamou, et al, SIGIR 06][Hori, Hazen, Glass, ICASSP 07]
• Time-based Merging for Indexing (TMI) [Zhou, Chelba, Seide, HLT 06][Seide, et al, ASRU 07]
• Time-anchored Lattice Expansion (TALE) [Seide, et al, ASRU 07]
• WFST: directly compile the lattice into a weighted finite-state transducer [Allauzen, et al, HLT 04][Saraclar & Sproat, HLT 04]

  30. Two Examples of Indexing Structures: Position Specific Posterior Lattices (PSPL), Confusion Networks (CN)
[Diagram: a lattice whose paths are W1 W2, W3 W4 W5, W6 W8 W9 W10, and W7 W8 W9 W10; the corresponding PSPL and CN structures list word:probability entries in segments 1-4]
• PSPL: locating a word in a segment according to the order of the word in a path
• CN: clustering several words in a segment according to similar time spans and word pronunciation
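
The PSPL positional binning can be sketched as follows, assuming the lattice has already been expanded into (path, posterior) pairs — a real system would accumulate posteriors directly on the lattice rather than enumerating paths:

```python
from collections import defaultdict

def build_pspl(paths):
    """Position Specific Posterior Lattice sketch: each word's posterior
    mass is accumulated in the bin given by its position in a path.
    paths: list of (word_sequence, path_posterior) pairs."""
    pspl = defaultdict(lambda: defaultdict(float))  # segment -> word -> prob
    for words, post in paths:
        for pos, w in enumerate(words):
            pspl[pos][w] += post
    return {seg: dict(ws) for seg, ws in pspl.items()}

# The four paths of the slide's example lattice, with made-up posteriors:
paths = [(["W1", "W2"], 0.4),
         (["W3", "W4", "W5"], 0.3),
         (["W6", "W8", "W9", "W10"], 0.2),
         (["W7", "W8", "W9", "W10"], 0.1)]
pspl = build_pspl(paths)
print(round(pspl[1]["W8"], 2))  # 0.3 - W8 appears at position 2 in two paths
```

A query n-gram is then matched segment by segment, multiplying the stored posteriors; CN construction differs in that it bins by time span and pronunciation similarity instead of path position.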

  31. OOV or Rare Words Handled by Subword Units
OOV Word W = w1 w2 w3 w4
- wi: subword units (phonemes, syllables, …); a, b, c, d, e: other subword units
W can't be Recognized and never Appears in the Lattice
- can't be found at the word level
- W = w1 w2 w3 w4 hidden at the subword level
- can be matched at the subword level without being recognized
Subword-based PSPL (S-PSPL) or CN (S-CN), for Example
[Diagram: lattice containing subword strings such as aw1w2, w1w2, w2w3, w3w4b, w3w4bcd, w3w4e over a time index]

  32. Subword-based Indexing Structures (1/2)
• Constructed from Phone Lattices (assuming the subword unit is the phone) from a Phone Decoder
  - relatively higher phone error rates [Ng, MIT 00][Wallace, et al, Interspeech 07]
• Word Lattices Represented by Subword Arcs
  - only sub-strings of subword units for in-vocabulary words can be generated [Saraclar & Sproat, HLT 04][Vergyri, et al, Interspeech 07]
[Diagram: a word lattice with each word arc expanded into its subword arcs (w1_1, w1_2, …) over a time index]

  33. Subword-based Indexing Structures (2/2)
• Subword-based PSPL and CN (S-PSPL, S-CN)
  - strings of subword units are no longer constrained by in-vocabulary words [Pan & Lee, Interspeech 07][Pan & Lee, ASRU 07]
• Hybrid Word-based and Subword-based Structures [Yu & Seide, HLT 05]
[Diagram: S-PSPL and S-CN structures listing subword:probability entries per segment]

  34. Frequently Used Subword Units – Language Dependent (1/2)
• Phonemes
  - English and many alphabetic languages
  - phone n-grams
  - particles: groups of phonemes obtained in a data-driven manner [Ng, MIT 00][Wallace, et al, Interspeech 07][Logan, et al, IEEE T. Multimedia 05]
• Graphemes [Wang & King, ICASSP 08]
• Graphones [Bisani & Ney, Interspeech 05][Akbacak & Vergyri, ICASSP 08]
• Morphs
  - morph-based languages: Finnish, Turkish, etc.
  - morpheme-like units [Turunen & Kurimo, SIGIR 07][Parlak & Saraclar, ICASSP 08]

  35. Frequently Used Subword Units – Language Dependent (2/2)
• Phonetic Word Fragments
  - derived bottom-up in a data-driven manner [Yu & Seide, HLT 05]
• Syllables/Characters
  - Mandarin Chinese and similar monosyllable-based languages
  - syllable/character n-grams
  - syllable/character pairs separated by a syllable/character [Chen & Lee, IEEE T. SAP 02][Pan & Lee, ASRU 07][Meng & Seide, ASRU 07, Interspeech 08][Shao & Seide, Interspeech 08]

  36. User-System Interaction for Voice-based Information Retrieval

  37. Issues in User-System Interaction — Difficulties in Browsing, Scanning, and Selecting Multimedia/Spoken Documents
Text Documents (including those for voice search, etc.) are Better Structured and Easier to Browse
— in paragraphs with titles, or in well-structured databases
— easily summarized on-screen
— easily scanned and selected by user
Multimedia/Spoken Documents are just Video/Audio Signals
— not easily summarized on-screen
— difficult to scan and select
— lack an efficient scenario for user-system interaction

  38. Proposed Approach — Multimedia/Spoken Document Understanding and Organization for Multi-modal User Interfaces
Semantic Analysis for Spoken Documents
— analyzing the semantic content of the spoken documents
Key Term Extraction from Multimedia/Spoken Documents
— very often out-of-vocabulary (OOV) words such as person/organization/location names
Multimedia/Spoken Document Segmentation
— automatically segmenting a spoken document into short paragraphs, each with a central topic
Summarization and Title Generation for Multimedia/Spoken Documents
— automatically generating a summary and a title (in text or speech form) for each short paragraph
Topic Analysis and Organization for Multimedia/Spoken Documents
— analyzing the subject topics of the short paragraphs and organizing them into graphic structures
[Lee & Chen, IEEE SPM 05][Lee, et al, Interspeech 06]

  39. An Example Approach of Semantic Analysis for Spoken Documents: Probabilistic Latent Semantic Analysis (PLSA)
- creating a set of latent topics between a set of terms and a set of documents
- modeling the relationships by probabilistic models trained with the EM algorithm
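
The latent-topic decomposition P(t|d) = Σ_k P(t|T_k) P(T_k|d) can be trained with a few lines of EM; a toy sketch (the two-document collection, term names, and topic count are invented for illustration, and real systems use tempered EM and much larger corpora):

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Minimal PLSA trained with EM. counts[d][t] = term frequency in
    document d. Returns (P(t|topic), P(topic|d))."""
    rng = random.Random(seed)
    docs = list(counts)
    terms = sorted({t for d in docs for t in counts[d]})
    # random initialization, normalized into distributions
    p_t_k = []
    for _ in range(n_topics):
        w = {t: rng.random() for t in terms}
        s = sum(w.values())
        p_t_k.append({t: v / s for t, v in w.items()})
    p_k_d = {}
    for d in docs:
        w = [rng.random() for _ in range(n_topics)]
        s = sum(w)
        p_k_d[d] = [v / s for v in w]
    for _ in range(n_iter):
        acc_tk = [{t: 1e-12 for t in terms} for _ in range(n_topics)]
        acc_kd = {d: [1e-12] * n_topics for d in docs}
        for d in docs:
            for t, c in counts[d].items():
                # E-step: posterior P(topic | d, t)
                post = [p_k_d[d][k] * p_t_k[k][t] for k in range(n_topics)]
                z = sum(post) or 1.0
                for k in range(n_topics):
                    g = c * post[k] / z          # expected counts
                    acc_tk[k][t] += g            # M-step accumulation
                    acc_kd[d][k] += g
        p_t_k = [{t: v / sum(a.values()) for t, v in a.items()} for a in acc_tk]
        p_k_d = {d: [v / sum(a) for v in a] for d, a in acc_kd.items()}
    return p_t_k, p_k_d

counts = {"d1": {"speech": 4, "lattice": 2}, "d2": {"stock": 3, "market": 3}}
p_t_k, p_k_d = plsa(counts, n_topics=2)
print(round(sum(p_k_d["d1"]), 6))  # 1.0 - a proper distribution over topics
```

The trained P(T_k|d) and P(t|T_k) distributions are exactly what the key term extraction and topic organization slides that follow build on.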

  40. Key Term Extraction (1/2)
Latent Topic Entropy
- the entropy of a term's distribution over the latent topics, e.g. E(t_j) = −Σ_k P(T_k | t_j) log P(T_k | t_j)
- higher entropy: the term carries less topical information
- lower entropy: the term carries more topical information
[Kong & Lee, ICASSP 06][Hsieh & Lee, ICASSP 06]
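
The entropy criterion is easy to compute from the PLSA posteriors; a sketch using plain Shannon entropy (the exact normalization in the cited papers may differ):

```python
import math

def latent_topic_entropy(p_topic_given_term):
    """Entropy of a term's topic distribution P(Tk|tj); low entropy
    marks a topically focused term, i.e. a good key term candidate."""
    return -sum(p * math.log(p) for p in p_topic_given_term if p > 0)

# A term spread evenly over topics carries little topical information...
print(round(latent_topic_entropy([0.25] * 4), 3))  # 1.386 (= ln 4, the maximum)
# ...while a term concentrated on one topic carries much more:
print(round(latent_topic_entropy([0.97, 0.01, 0.01, 0.01]), 3) < 0.5)  # True
```

Ranking terms by ascending entropy (possibly combined with frequency weighting) then yields the key term candidates.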

  41. Key Term Extraction (2/2)
Latent Topic Significance
- of a term t_j with respect to a topic T_k, e.g. the term's counts over all documents D_i weighted by P(T_k | D_i), relative to its counts weighted by [1 − P(T_k | D_i)]
- P(T_k | D_i): how much each document D_i is focused on the topic T_k
- [1 − P(T_k | D_i)]: the probability that each document D_i addresses all topics other than T_k
[Kong & Lee, ICASSP 06]

  42. Spoken Document Summarization
Selecting Important Sentences to be Concatenated into a Summary
- sentence scoring and selection
- given a summarization ratio
Selected Sentences Collectively Represent Some Concepts Closest to those of the Complete Document
- removing the concepts already mentioned previously
- concepts presented smoothly
[Furui, et al, ICASSP 05, IEEE T. SAP 04][Hirschberg, et al, Interspeech 05][Murray, Renals, et al, ACL 05, HLT 06][Kawahara, et al, ICASSP 04][Nakagawa, et al, SLT 06][Zhu & Penn, Interspeech 06][Fung, et al, ICASSP 08][Kong & Lee, ICASSP 06, SLT 06]
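
The "score, select, and suppress already-covered concepts" loop can be sketched as a greedy MMR-style heuristic (an illustrative stand-in for the cited systems, which use richer scoring models; all names and the toy sentences are invented):

```python
def summarize(sentences, score, sim, ratio=0.3, lam=0.7):
    """Greedy sentence selection under a summarization ratio: favor
    high-scoring sentences while penalizing similarity to sentences
    already chosen, so repeated concepts are removed."""
    budget = max(1, int(len(sentences) * ratio))
    chosen, rest = [], list(sentences)
    while rest and len(chosen) < budget:
        best = max(rest, key=lambda s: lam * score(s)
                   - (1 - lam) * max((sim(s, c) for c in chosen), default=0.0))
        chosen.append(best)
        rest.remove(best)
    return [s for s in sentences if s in chosen]   # keep original order

sents = ["a b c", "a b", "d e"]
words = lambda s: set(s.split())
jaccard = lambda x, y: len(words(x) & words(y)) / len(words(x) | words(y))
length = lambda s: len(words(s))
print(summarize(sents, length, jaccard, ratio=0.67))  # ['a b c', 'd e']
```

Note how "a b", although high-scoring, is skipped because its concepts are already covered by "a b c" — the redundancy removal the slide asks for. For spoken documents, `score` would also fold in ASR confidence.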

  43. Title Generation for Spoken Documents
One Example: Delicate Scored Viterbi Search
[Flowchart: training corpus → Term Selection Model, Term Ordering Model, Title Length Model; spoken document → ASR and Automatic Summarization → Summary → Viterbi Algorithm → Output Title]
[Witbrock & Mittal, SIGIR 99][Jin & Hauptmann, HLT 01][Chen & Lee, Interspeech 03][Wang & Lee, SLT 08]

  44. Latent Topic Analysis and Organization for Spoken Documents
Global Semantic Structuring
— offering a global picture of the semantic structure of the entire archive
Query-based Local Semantic Structuring
— offering a detailed semantic structure of the relevant documents retrieved by the query

  45. Global Semantic Structuring for the Entire Archive
Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or as a Multi-layered Map
— documents addressing similar topics grouped in the same cluster
— distance between clusters on the map reflects the relationships between the topics of the documents
— a cluster with many documents can be expanded into another map in the next layer
[Diagram: two-dimensional tree structure for organized topics]
[Li & Lee, Interspeech 05]

  46. Query-based Local Semantic Structuring for Retrieved Spoken Documents
User's Query Produces Many Retrieved Spoken Documents
- difficult to show on-screen
A Topic Hierarchy Constructed for the Retrieved Documents
- each node represents a cluster of retrieved documents labeled by a key term (or topic)
- user may select or delete the nodes directly
Better User-System Interaction
[Diagram: user ↔ multi-modal dialogue ↔ retrieval system over the spoken document archive; retrieved documents organized into a topic hierarchy]
[Pan & Lee, ASRU 05]

  47. Improved Interactive Retrieval of Spoken Documents by Ranking the Key Terms in the Topic Hierarchy
Query Term Suggestions, Very Helpful in Text-based Information Retrieval
User-System Interaction for Spoken Document Retrieval: Properly Ranking the Topics in the Topic Hierarchy
- suggesting important/relevant key terms at the top of the hierarchy
- automatically learned and performed by the dialogue manager
[Diagram: user ↔ multi-modal dialogue ↔ retrieval system over the spoken document archive; retrieved documents organized into a topic hierarchy]
[Pan & Lee, Interspeech 06, SLT 06]

  48. User-System Interaction in Spoken Dialogue Systems
• Spoken Dialogue Systems
[Diagram: input speech utterance U → ASR (words, lattices) → Spoken Language Understanding (Language Understanding, Dialogue Act Classification) → semantic frame, user act A_u → Dialogue Manager (dialogue state S, dialogue modeling) → System Action → Output Generator (speech, graphs, tables), backed by a well-organized database]
• Example Goals
  - higher task success rate (reliability)
  - smaller average number of turns for successful tasks (efficiency)

  49. Dialogue Systems for Voice-based Information Retrieval
• Voice-based Information Retrieval Example Goals
  - higher task success rate (success: user's information need satisfied)
  - smaller average number of dialogue turns (average number of query terms entered) for successful tasks
• Dialogues Equally Useful in Voice Search for Text Documents
[Pan & Lee, ASRU 07][Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08]

Concluding Remarks: Voice/Text-based Information Retrieval
