Survey of Approaches to Information Retrieval of Speech Messages
Kenney Ng
Spoken Language Systems Group
Laboratory for Computer Science
Massachusetts Institute of Technology
February 16, 1996
DRAFT
Presenter: 朱惠銘
Survey of Approaches to Information Retrieval of Speech Messages • Introduction • Information Retrieval • Text Retrieval • Differences between text and speech media • Information Retrieval of Speech Messages
1 Introduction • Process, organize, and analyze the data. • Present the data in human-usable form. • Find the "interesting" pieces of information efficiently. • An increasingly large portion of information is in spoken language: • recorded speech messages • radio and television broadcasts • This motivates the development of automatic methods.
2 Information Retrieval • 2.1 Definition • The representation, storage, organization, and access of information items. • Return the best matches to the "request" provided by the user. • There is no restriction on the type of documents. • Text Retrieval, Document Retrieval • Image Retrieval, Speech Retrieval • Multi-media Retrieval
2.3 Component Processes • Creating document representations (indexing) • Creating request representations (query formation) • Comparing representations (retrieval) • Evaluating retrieved documents (relevance feedback)
2.3 Component Processes (cont.): Performance • Recall • The fraction of all the relevant documents in the entire collection that are retrieved in response to a query. • Precision • The fraction of the retrieved documents that are relevant. • Average precision • The precision values obtained at each new relevant document in the ranked output for an individual query are averaged.
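The following is a minimal sketch of how these three measures can be computed from a ranked retrieval list; the function name and data layout are illustrative, not from the survey:

```python
def evaluate(ranked, relevant):
    """Recall, precision, and average precision for one query.

    ranked:   list of document ids in retrieval order
    relevant: set of ids of all relevant documents in the collection
    """
    hits_total = sum(1 for d in ranked if d in relevant)
    recall = hits_total / len(relevant)
    precision = hits_total / len(ranked)

    # Average precision: the precision values observed at each rank
    # where a new relevant document appears, averaged (per the
    # definition on the slide above).
    precisions, hits = [], 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    avg_precision = sum(precisions) / len(precisions) if precisions else 0.0
    return recall, precision, avg_precision

print(evaluate(["d3", "d1", "d7", "d2"], {"d1", "d2", "d5"}))
# (0.666..., 0.5, 0.5)
```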
3 Text Retrieval • 3.1 Indexing and Document Representation • 3.2 Query Formation • 3.3 Matching Query and Document Representation
3.1 Indexing and Document Representation • Terms and Keywords • A list of words extracted from the full-text document. • Construct a stop list to remove useless (non-content) words. • To handle synonyms • Construct a dictionary (thesaurus) structure that groups synonymous words into classes • Replace each word with its class • A tradeoff exists between normalization and discrimination in the indexing process
Index Term Weighting • Term frequency • The frequency of occurrence of each term in the document • For term $t_k$ in document $d_i$: $tf_{ik}$ = the number of times $t_k$ occurs in $d_i$
Index Term Weighting • Inverse document frequency • Weights each term inversely proportional to the number of documents in which the term occurs. • For term $t_k$: $idf_k = \log(N / n_{t_k})$ • $N$ is the total number of documents • $n_{t_k}$ is the number of documents containing term $t_k$
Index Term Weighting • Weights to terms • Terms that occur frequently in particular documents but rarely in the overall collection should receive a large weight.
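Combining the two preceding measures gives the weight actually assigned to each term; a standard tf·idf formulation consistent with the definitions above (the survey's own notation is not preserved in this copy) is:

$$w_{ik} = tf_{ik} \cdot \log\frac{N}{n_{t_k}}$$

A term scoring high on both counts, frequent in $d_i$ but rare in the collection, receives a large weight $w_{ik}$.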
3.2 Query Formation • Extracting from a user request a representation of its content. • The indexing methods are also applicable to query formation.
3.2 Query Formation (cont.) • Relevance Feedback • The IR system automatically modifies a query based on user feedback about documents retrieved in an initial run.
3.3 Matching Query and Document Representations • Boolean Model, Extended Boolean Model • Vector Space Model • Probabilistic Models
Boolean Model • Document representation • Binary-valued variables • True: the term is present in the document • False: the term is absent from the document • The document can be represented as a binary vector • Query • Boolean query: AND, OR, and NOT • Matching function • Standard rules of Boolean logic • If the document representation satisfies the query expression, then that document matches the query
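A minimal sketch of Boolean matching over the representations just described; the nested-tuple query encoding is an assumption of this example, not part of the survey:

```python
# Documents are sets of index terms; a query is either a bare term or
# a nested tuple such as ("AND", "speech", ("NOT", "video")).

def matches(query, doc_terms):
    if isinstance(query, str):        # a single index term
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(matches(a, doc_terms) for a in args)
    if op == "OR":
        return any(matches(a, doc_terms) for a in args)
    if op == "NOT":
        return not matches(args[0], doc_terms)
    raise ValueError("unknown operator: %s" % op)

doc = {"speech", "retrieval", "message"}
print(matches(("AND", "speech", ("NOT", "video")), doc))  # True
```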
Extended Boolean Model • The retrieval decision of the Boolean Model may be too harsh: a document fails an AND query if even one term is missing. • The extended Boolean model softens this with a p-norm similarity. For the AND query: $sim(q_{and}, d) = 1 - \left( \frac{1}{K} \sum_{k=1}^{K} (1 - w_k)^p \right)^{1/p}$ • This is maximal for a document containing all the terms and decreases as the number of matching terms decreases.
Extended Boolean Model • For the OR query: $sim(q_{or}, d) = \left( \frac{1}{K} \sum_{k=1}^{K} w_k^p \right)^{1/p}$ • This is minimal for a document that contains none of the terms and increases as the number of matching terms increases. • The variable $p$ is a constant in the range $1 \le p \le \infty$ that is determined empirically; it is typically in the range $2 \le p \le 5$.
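A sketch of the two p-norm similarity functions above, assuming unweighted query terms and document term weights $w_k$ in [0, 1]:

```python
# p-norm similarities of the extended Boolean model. `weights` holds
# the document's weights for the K query terms; p is the empirically
# chosen constant (typically 2 to 5).

def sim_and(weights, p=2.0):
    k = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / k) ** (1.0 / p)

def sim_or(weights, p=2.0):
    k = len(weights)
    return (sum(w ** p for w in weights) / k) ** (1.0 / p)

print(sim_and([1.0, 1.0]))  # 1.0: all terms fully present
print(sim_and([1.0, 0.0]))  # ~0.29: a missing term hurts the score
                            # but does not zero it, unlike pure
                            # Boolean AND
```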
Vector Space Model • Documents and queries are represented as vectors in a K-dimensional space • K is the number of indexing terms.
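The matching function typically used with this model (and the one the VCV-feature system in Section 5 relies on) is the cosine similarity between query vector $q$ and document vector $d$:

$$\mathrm{sim}(q, d) = \frac{\sum_{k=1}^{K} q_k d_k}{\sqrt{\sum_{k=1}^{K} q_k^2}\,\sqrt{\sum_{k=1}^{K} d_k^2}}$$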
Probabilistic Models • Bayes' Decision Rule • The probability that document $d$ is relevant to the query $q$ is denoted $P(R \mid d)$ • The probability that document $d$ is non-relevant to the query $q$ is denoted $P(\overline{R} \mid d)$ • $C_r$ is the cost of retrieving a non-relevant document • $C_n$ is the cost of not retrieving a relevant document • The expected cost of retrieving an extraneous (non-relevant) document is $C_r \, P(\overline{R} \mid d)$; a document is retrieved when this is outweighed by the cost of missing it, i.e. when $C_r \, P(\overline{R} \mid d) \le C_n \, P(R \mid d)$
Probabilistic Models (cont.) • How do we compute the posterior probabilities $P(R \mid d)$ and $P(\overline{R} \mid d)$? • Based on Bayes' Rule: $P(R \mid d) = \frac{P(d \mid R)\,P(R)}{P(d)}$, $P(\overline{R} \mid d) = \frac{P(d \mid \overline{R})\,P(\overline{R})}{P(d)}$ • $P(R)$, $P(\overline{R})$ are the prior probabilities of relevance and non-relevance of a document. • $P(d \mid R)$, $P(d \mid \overline{R})$ are the likelihoods or class-conditional probabilities.
Probabilistic Models (cont.) • Now we have to estimate $P(d \mid R)$ and $P(d \mid \overline{R})$.
Probabilistic Models (cont.) • Assumptions • The document vectors are binary, indicating the presence or absence of each indexing term. • Each term has a binomial distribution. • There are no interactions between the terms.
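Under these assumptions the likelihoods factor over the indexing terms. Writing $x_k \in \{0, 1\}$ for the presence of term $t_k$, $p_k = P(x_k = 1 \mid R)$, and $q_k = P(x_k = 1 \mid \overline{R})$:

$$P(d \mid R) = \prod_{k=1}^{K} p_k^{x_k} (1 - p_k)^{1 - x_k}, \qquad P(d \mid \overline{R}) = \prod_{k=1}^{K} q_k^{x_k} (1 - q_k)^{1 - x_k}$$

Taking the log of the ratio of these two likelihoods yields the per-term relevance weights used on the next slide.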
Probabilistic Models (cont.) • Under these assumptions the document score reduces to a sum of term weights: $w_k = \log \frac{p_k (1 - q_k)}{q_k (1 - p_k)}$ • $w_k$ is the relevance weight of the kth index term • Assume $p_k$ is a constant value: 0.5 • Estimate $q_k$ from the overall frequency: $n_k / N$
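A small sketch of these estimates in code; note that with $p_k$ fixed at 0.5 the weight reduces to $\log((1 - q_k)/q_k)$, an idf-like quantity:

```python
import math

def relevance_weight(n_k, N, p_k=0.5):
    """Relevance weight of an index term under the stated estimates:
    p_k held constant at 0.5 and q_k estimated as n_k / N."""
    q_k = n_k / N
    return math.log((p_k * (1.0 - q_k)) / (q_k * (1.0 - p_k)))

print(relevance_weight(n_k=10, N=1000))   # rare term -> large weight (~4.6)
print(relevance_weight(n_k=500, N=1000))  # common term -> weight 0.0
```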
4 Differences between text and speech media • Speech is a richer and more expressive medium than text (mood, tone). • Robustness of the retrieval models to noise or errors in transcription becomes a concern. • How can the contents of a speech message be accurately extracted and represented in a form that can be efficiently stored and searched?
5 Information Retrieval of Speech Messages • Speech Message Retrieval • Large Vocabulary Word Recognition Approach • Sub-Word Unit Approach • Word Spotting Approaches • Speech Message Classification and Sorting • Topic Identification • Topic Spotting • Topic Clustering
Large Vocabulary Word Recognition Approach • Suggested by CMU in the Informedia digital video library project. • A user can interact with the text retrieval system to obtain video clips stored in the library that are relevant to his request. • Pipeline (from the original block diagram): sound track of video → large-vocabulary speech recognizer → textual transcript → full-text information retrieval system, with natural language understanding support
Sub-Word Unit Approach • Syllabic Units • Phonetic Units
Syllabic Units • VCV-features • Sub-word units consisting of a maximal sequence of consonants enclosed between two maximal sequences of vowels. • e.g., INFORMATION has the VCV-features INFO, ORMA, ATIO • Take a subset of these features as the indexing terms.
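A short sketch of VCV-feature extraction as defined above; treating Y as a consonant and ignoring case are assumptions of this example:

```python
import re

VOWELS = "AEIOU"

def vcv_features(word):
    """Extract VCV features: a maximal consonant run enclosed between
    two maximal vowel runs (adjacent features share a vowel run)."""
    # Split the word into maximal runs of vowels and consonants.
    runs = re.findall(r"[AEIOU]+|[^AEIOU]+", word.upper())
    features = []
    for i in range(len(runs) - 2):
        v1, c, v2 = runs[i], runs[i + 1], runs[i + 2]
        if v1[0] in VOWELS and c[0] not in VOWELS and v2[0] in VOWELS:
            features.append(v1 + c + v2)
    return features

print(vcv_features("information"))  # ['INFO', 'ORMA', 'ATIO']
```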
Syllabic Units • Criteria • Occurs frequently enough for a reliable acoustic model to be trained for it. • Does not occur so frequently that its ability to discriminate between different messages is poor. • Retrieval process (from the original diagram): query → VCV-features → tf*idf weights → match against document representations with the cosine similarity function → return the document with the highest score
Syllabic Units • Major problem • The acoustic confusability of the VCV-features is not taken into account during the selection of indexing features.
Phonetic Units • Uses variable-length phone sequences as indexing features. • These features can be viewed as "pseudo-words" and were shown to be useful for detecting or spotting topics in recorded military radio broadcasts. • An automatic procedure based on "digital trees" is used to search the possible subsequences • A Hidden Markov Model (HMM) phone recognizer with 52 monophone models is used to process the speech • More domain-independent than a word-based system.
Word Spotting Approaches • Falls between the simple phonetic approach and complex large-vocabulary recognition. • Word spotting has been used in two different ways: • 1. A small, fixed number of keywords is selected a priori for both recognition and indexing. • 2. The speech messages in the collection are processed and stored in a form (e.g., phone lattice) that allows arbitrary keywords to be searched for after they are specified by the user.
Speech Message Classification and Sorting • Topic Identification (1) • K keywords • $n_k$ is a binary value indicating the presence or absence of keyword $w_k$. • Find the topic $T_i$ that maximizes the score $S_i = \sum_{k=1}^{K} n_k \log p(w_k \mid T_i)$
Speech Message Classification and Sorting • Topic Identification (1) • With 6 topics and the top-scoring 40 words per topic (240 keywords total), these keywords used on the text transcriptions of the speech messages achieved 82.4% classification accuracy. • A genetic algorithm was used to reduce the number of keywords to 126, with a small drop in classification performance to 78.2%.
Topic Identification (2) • The topic-dependent unigram language models: $S_i = \sum_{k=1}^{K} n_k \log p(w_k \mid T_i)$ • K is the number of keywords in the indexing vocabulary • $n_k$ is the number of times keyword $w_k$ occurs in the speech message • $p(w_k \mid T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.
Topic Identification (3) • The length-normalized topic score: $S_i = \frac{1}{N} \sum_{k=1}^{K} n_k \log p(w_k \mid T_i)$ • N is the total number of words in the speech message • K is the number of keywords in the indexing vocabulary • $n_k$ is the number of times keyword $w_k$ occurs in the speech message • $p(w_k \mid T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.
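A sketch of length-normalized unigram topic scoring and classification; the flooring of unseen-keyword probabilities is an assumption of this example, standing in for whatever smoothing the original systems used:

```python
import math

def topic_score(counts, topic_probs, n_words, floor=1e-6):
    """counts: {keyword: n_k}; topic_probs: {keyword: p(w_k | T_i)}.
    Keywords absent from the topic model get a small floor probability."""
    s = sum(n_k * math.log(topic_probs.get(w, floor))
            for w, n_k in counts.items())
    return s / n_words

def classify(counts, n_words, topics):
    return max(topics, key=lambda t: topic_score(counts, topics[t], n_words))

topics = {
    "weather": {"rain": 0.02, "storm": 0.01},
    "sports":  {"score": 0.03, "team": 0.02},
}
print(classify({"rain": 3, "storm": 1}, n_words=50, topics=topics))  # weather
```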
Topic Identification (3) • With 750 keywords, classification accuracy is 74.6%.
Topic Identification (4) • The topic model is extended to a mixture of multinomials: $p(d \mid T_i) = \sum_{m=1}^{M} \pi_m \prod_{k=1}^{K} p_m(w_k \mid T_i)^{n_k}$ • M is the number of multinomial model components • $\pi_m$ is the weight of the mth multinomial component • K is the number of keywords in the indexing vocabulary • $n_k$ is the number of times keyword $w_k$ occurs in the speech message • $p_m(w_k \mid T_i)$ is the occurrence probability of keyword $w_k$ under the mth component of the class $T_i$ model.
Topic Identification (4) • Experiments indicate that the more complex mixture models do not perform as well as the simple single-component model.
Topic Spotting (1) • The "usefulness" measure captures how discriminating a word is for the topic: $U(w) = p_T(w) \log \frac{p_T(w)}{p_{\overline{T}}(w)}$ • $p_T(w)$ and $p_{\overline{T}}(w)$ are the probabilities of detecting the keyword in topic and in unwanted (non-topic) messages • This measure selects words that occur often in the topic and have high discriminability.
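A tiny sketch of ranking candidate keywords by this usefulness measure; the probabilities are illustrative values only:

```python
import math

def usefulness(p_topic, p_other):
    """Favors words that are frequent in the topic AND discriminating
    against unwanted (non-topic) messages."""
    return p_topic * math.log(p_topic / p_other)

# (p_topic, p_other) per candidate word -- illustrative values only
candidates = {"touchdown": (0.04, 0.001), "the": (0.90, 0.90)}
ranked = sorted(candidates, key=lambda w: usefulness(*candidates[w]),
                reverse=True)
print(ranked)  # ['touchdown', 'the']: 'the' is frequent but not discriminating
```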
Topic Spotting (2) • Performed by accumulating, over a window of speech (typically 60 seconds), the log likelihood ratios of the detected keywords to produce a topic score for that region of the speech message.
Topic Spotting (2) • Log-linear models that try to capture dependencies between the keywords are also examined; such models have the general form $p(\mathbf{w} \mid T) \propto \exp\left( \sum_{\alpha} \lambda_\alpha f_\alpha(\mathbf{w}) \right)$ • $\mathbf{w}$ represents the vector of keywords • $\lambda_\alpha$ is the coefficient of model feature $f_\alpha$. • Their experiments show that using a carefully chosen log-linear model can give topic spotting performance that is better than using the basic model that assumes keyword independence.
Topic Clustering • Try to discover structure or relationships between messages in a collection. • The clustering process • Tokenization • Similarity computation • Clustering
Topic Clustering (cont.) • Tokenization: come up with a suitable representation of the speech message that can be used in the next two steps. • Similarity computation: every pair of messages must be compared; an N-gram model is used. • Clustering: use hierarchical tree clustering or nearest-neighbor classification. • Works well on true transcription texts: figure of merit (FOM) around 90%. • Performance with speech input is worse than with text: it drops to 70% FOM using recognition output, unigram language models, and tree-based clustering.
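A compact sketch of the three-step pipeline on text tokens, using unigram count vectors, cosine similarity, and threshold-based single-link grouping; the threshold, unigram tokenization, and greedy grouping are illustrative simplifications of the methods named above:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * \
           sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(messages, threshold=0.3):
    vecs = [Counter(m.split()) for m in messages]      # tokenization
    clusters = []
    for i, v in enumerate(vecs):
        for c in clusters:                             # single link
            if any(cosine(v, vecs[j]) >= threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

msgs = ["rain storm flood", "storm rain wind", "team score win"]
print(cluster(msgs))  # [[0, 1], [2]]
```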