Special Topics on Information Retrieval

Special Topics on Information Retrieval
Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/
mmontesg@inaoep.mx
University of Alabama at Birmingham, Fall 2010.

Spoken Document Retrieval

Content of the section
• Definition of the task
• Traditional architecture for SDR
• Automatic Speech Recognition
  • Basic ideas and common errors
• Approaches for SDR
  • Manipulating the ASR system
  • Recovering from recognition errors

Motivation
• Speech is the primary and most convenient means of communication between humans
• Multimedia documents are becoming increasingly popular, and finding relevant information in them is a challenging task
“The problem of audio/speech retrieval is familiar to anyone who has returned from vacation to find an answering machine full of messages. If you are not fortunate, you may have to listen to the entire tape to find the urgent one from your boss”

Spoken document retrieval (SDR)
• SDR refers to the task of finding segments of recorded speech that are relevant to a user’s information need.
• A large amount of the information produced today is in spoken form: TV and radio broadcasts, recordings of meetings, lectures and telephone conversations.
• There are currently no universal tools for speech retrieval.
Ideas for carrying out this task?

Ideal architecture
• An ideal system would simply concatenate an automatic speech recognition (ASR) system with a standard text indexing and retrieval system.
Problems with this architecture?
Diagram: audio collection → speech recognizer → recognized text → text-based IR system (which also receives the query) → results

Missing slide…about ASR
• General architecture
• Dimensions of the problem

Basic components of the ASR system
• Feature extraction
  • Transforms the input waveform into a sequence of acoustic feature vectors, each vector representing the information in a small time window of the signal
• Decoder
  • Combines information from the acoustic and language models and finds the word sequence with the highest probability given the observed speech features.
• Acoustic models: hold information on how phonemes in speech are realized as feature vectors
• Lexicon: list of words with a pronunciation for each word expressed as a phone sequence
• Language model: estimates probabilities of word sequences
More about this topic in: Spring 2011; CS 462/662/762 Natural Language Processing; Dr. Thamar Solorio.
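
The decoder's search can be written compactly. The following is a sketch in standard notation; the symbols O (the sequence of observed feature vectors) and W (a candidate word sequence) are introduced here for illustration and do not appear on the slide.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \; P(O \mid W)\, P(W)
```

Under this reading, P(O | W) is supplied by the acoustic models together with the lexicon, and P(W) by the language model; P(O) is dropped because it does not depend on W.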

Common errors in the recognition stage
• Recognition errors are deletions, insertions and substitutions of legitimate words.
• Errors at word level
  • “governors” → “governess”
  • “kilos” → “killers”
  • “Sinn Fein” → “shame fame”
• Errors on word boundaries
  • “Frattini” → “Freeh teeny”
  • “leap shark” → “Liebschard”
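
To make the three error types concrete, here is a minimal sketch that counts substitutions, deletions and insertions with a word-level edit-distance alignment and derives a word error rate from them. The function names and the short test sentences are assumptions made for illustration; when several alignments have the same total cost, the split between error types depends on the tie-breaking used here.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum-cost
    word alignment of a reference transcript and an ASR hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)            # exact match, no cost
            else:
                diagonal = (t + 1, s + 1, d, n)    # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare by total edits first, so the total is minimal.
            cost[i][j] = min(diagonal, deletion, insertion)
    return cost[-1][-1][1:]


def word_error_rate(reference, hypothesis):
    """(S + D + I) divided by the number of reference words."""
    subs, dels, ins = align_counts(reference, hypothesis)
    return (subs + dels + ins) / len(reference.split())
```

With the slide's examples embedded in made-up sentences, `align_counts("sinn fein leader spoke", "shame fame leader spoke")` counts two substitutions, while the boundary error in `align_counts("frattini said", "freeh teeny said")` shows up as one substitution plus one insertion.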

Transcription errors and IR
• The main problem with the traditional SDR architecture is the accuracy of the recognition output
  • Around 50% word accuracy on real-world tasks
• Example:
  • What will happen with a query about “Saddam Hussein”?
• How to handle the errors or incomplete output provided by ASR systems?

SDR in non-spontaneous speech
Conclusions from the TREC 2000 edition, using a corpus of 550 hours of Broadcast News
• Spoken news retrieval systems achieved almost the same performance as traditional IR systems.
• Even with error rates of around 40%, the effectiveness of an IR system falls by less than 10%.
• Long queries are better than short queries
• Deletions and substitutions are more important than insertions, especially for long queries.

New challenges
• Spoken questions of short duration
• Message-length documents
  • For example, voice-mail messages, announcements, and so on.
• Other types of spoken documents such as dialogues, meetings, lectures, classes, etc.
• Contexts where the word error rate is well over 50%.
• Applications such as:
  • Question answering
  • Summarizing speech

Main ideas for SDR
• Manipulating the ASR system (white box)
  • Retrieval at phonetic level
  • Adding alternative recognition results
    • N most likely paths in the lattice
    • Using the complete word lattice
• Recovering from recognition errors (black box)
  • Query and/or document expansion
  • Using multiple recognizers
  • Applying phonetic codification

Alternative recognition results
• Speech recognizers aim to produce a transcription with as few errors as possible.
• The correct word may appear among the candidates the recognizer considers and still be mistakenly pruned away.
• Retrieval performance can be improved by adding several candidates to the transcription
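
One simple way to picture this idea is to index each speech segment under every word that occurs in any of its N-best hypotheses, so a word that lost out in the top hypothesis can still be matched. The sketch below assumes the recognizer exposes an N-best list per segment; the function name and the sample hypotheses are invented for illustration.

```python
from collections import defaultdict


def index_nbest(nbest_by_segment):
    """Build an inverted index in which a segment is posted under every
    word found in any of its N-best recognition hypotheses."""
    index = defaultdict(set)
    for segment_id, hypotheses in nbest_by_segment.items():
        for hypothesis in hypotheses:
            for word in hypothesis.lower().split():
                index[word].add(segment_id)
    return index


# Hypothetical 2-best output for one segment and 1-best for another.
nbest = {
    "seg1": ["shame fame leader spoke", "sinn fein leader spoke"],
    "seg2": ["the weather today"],
}
index = index_nbest(nbest)
```

Here a query for "fein" reaches `seg1` only because the second hypothesis was kept. A real system would usually also weight postings by hypothesis rank or posterior probability, which this sketch leaves out.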

Query/document expansion (1)
• The effect of out-of-vocabulary query words and other recognition errors can be reduced by adding to the query extra terms that have a similar meaning or that are otherwise likely to appear in the same documents as the query terms.
  • From the top-ranked documents for the given query
  • Using associations extracted from the whole collection
• It is common to use a parallel written document set.
  • This collection must be thematically related
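
The first option, taking expansion terms from the top-ranked documents, can be sketched as a very plain form of pseudo-relevance feedback. The selection rule used here (raw term frequency over the top documents) is an assumption chosen for brevity, and the function name, parameters and sample documents are hypothetical; the top documents could equally come from a parallel written collection.

```python
from collections import Counter


def expand_query(query_terms, top_documents, n_terms=2, stop_words=frozenset()):
    """Append to the query the n_terms most frequent non-query,
    non-stop-word terms found in the top-ranked documents."""
    counts = Counter()
    for document in top_documents:
        counts.update(
            word
            for word in document.lower().split()
            if word not in stop_words and word not in query_terms
        )
    # Sort by descending frequency, then alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [word for word, _ in ranked[:n_terms]]


top_docs = [
    "taliban forces in kabul",
    "kabul afghanistan taliban",
    "afghanistan kabul news",
]
expanded = expand_query(["taliban"], top_docs, n_terms=2, stop_words={"in"})
```

If "taliban" is out of vocabulary for the recognizer, the added terms still give the query a chance to match transcripts about the same topic.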

Query/document expansion (2)
• Another approach to the OOV problem is to expand word queries into in-vocabulary phrases according to intrinsic acoustic confusability and language model scores.
  • For example, taliban may be expanded to tell a band.
• The aim is to mimic the mistakes the speech recognizer makes when transcribing the audio.
• It is dependent on the ASR system

Using multiple recognizers
• Different independently developed recognizers tend to make different kinds of errors, and combining the outputs might allow some errors to be recovered.
• Combination of scores can be done by any traditional information fusion method.
• Good results have been obtained with simple linear combinations.
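
As a sketch of the simple linear combination mentioned above, the function below fuses the retrieval scores obtained from the transcripts of several recognizers. It assumes each run maps document identifiers to scores already normalized to a comparable range; the function name, weights and scores are made up for illustration.

```python
def fuse_scores(runs, weights):
    """Weighted linear combination of per-recognizer retrieval scores.
    A document missing from a run contributes 0 for that run."""
    fused = {}
    for run, weight in zip(runs, weights):
        for doc_id, score in run.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    # Highest fused score first; ties broken by document id.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


run_a = {"d1": 0.9, "d2": 0.2}   # scores using recognizer A's transcripts
run_b = {"d2": 0.8, "d3": 0.5}   # scores using recognizer B's transcripts
ranking = fuse_scores([run_a, run_b], weights=[0.5, 0.5])
```

In this toy case `d2` moves to the top because both recognizers give it some support, even though neither run ranked it first by a wide margin.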

Using a phonetic codification
• Phonetic codifications make it possible to represent words with similar pronunciations by the same code.
• Example of a Soundex codification:
  • Unix Sun Workstation → (U52000 S30000 W62300)
  • Unique some workstation → (U52000 S30000 W62300)
• The idea of this approach is to build an enriched representation of transcriptions by combining words and phonetic codes.

The algorithm at a glance
• Compute the phonetic codification for each transcription using a given algorithm.
• Combine transcriptions and their phonetic codifications in order to form an enriched document representation.
• Remove unimportant tokens from the new document representation.
  • Stop words and the most frequent codes.
• Create a combined index using words and codes.
• Incoming queries need to be represented in the same way

Example of the representation
• Query: Actions of Raoul Wallenberg
• {actions, raoul, wallenberg, A23520, R40000, W45162}
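
The enriched representation in this example can be reproduced with a short sketch. It assumes the standard American Soundex letter groups, extended to a first letter plus five digits; the exact variant used in the slides is not specified, so codes for other words may differ from those shown there. The function names and the `stop_words` parameter are hypothetical, and the removal of the most frequent codes is left out.

```python
# Standard Soundex consonant groups (vowels, h, w and y carry no digit).
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word, digits=5):
    """Soundex-style code: first letter plus `digits` digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        if letter not in "hw":       # h and w do not separate equal codes
            previous = digit
    return (code + "0" * digits)[: digits + 1]


def enrich(text, stop_words=frozenset()):
    """Words (minus stop words) followed by their phonetic codes."""
    words = [w for w in text.lower().split() if w not in stop_words]
    return words + [soundex(w) for w in words]


representation = enrich("Actions of Raoul Wallenberg", stop_words={"of"})
```

Because queries and transcriptions go through the same `enrich` step, a misrecognized word can still match on its code even when the surface forms differ.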

Geographic Information Retrieval

Content of the section
• Definition of the task
• Need for GIR
• Kinds of geographical queries
• Main challenges of GIR
• Toponym identification and disambiguation
• Indexing for GIR
• Measuring document similarities
• Re-ranking of retrieval results

The need for GIR
Geographical information is recorded in a wide variety of media and document types
• Information technology for accessing geographical information has focused on the combination of digital maps and databases. → GIS
• Systems to retrieve geographically specific information from the relatively unstructured documents that compose the Web. → GIR

The size of the need
• It is estimated that one fifth of the queries submitted to search engines have geographic meaning.
• Among them, eighty percent can be associated with a geographic place
[Pie chart: geographic vs. other queries]

Definition of the task
• Geographical Information Retrieval (GIR) considers the search for documents based not only on conceptual keywords, but also on spatial information.
• A geographic query is defined by a tuple: <what, relation, where>
  Example: “Whisky making in the Scottish Islands” (what: Whisky making; relation: in; where: the Scottish Islands)
  • <what> represents the thematic part
  • <where> is used to specify the geographical areas of interest.
  • <relation> specifies the “spatial relation”, which connects the what and the where.

Different kinds of queries
• With concrete locations:
  • “ETA in France” (GC049)
• With locations and simple rules of relevant locations:
  • “Car bombings near Madrid” (GC030)
• With locations and complex rules of relevant locations:
  • “Automotive industry around the Sea of Japan” (GC036)
• With very general locations that are not necessarily in a gazetteer:
  • “Snowstorms in North America” (GC028)
• With quasi-locations (e.g. political) that are not found in a gazetteer:
  • “Malaria in the tropics” (GC034)
• Describing characteristics of the geographical location:
  • “Cities near active volcanoes” (GC040)

The problem
• In classic IR, retrieved documents are ranked by their similarity to the text of the query.
• In a search engine with geographic capabilities, the semantics of geographic terms should be considered as one of the ranking criteria.
The problem of weighting the geographic importance of a document can be reduced to computing the similarity between two geographic locations, one associated with the query and the other with the document.

Challenges of GIR
• Detecting geographical references in the form of place names within text documents and in users’ queries
• Disambiguating place names to determine which particular instance of a name is intended
• Geometric interpretation of the meaning of vague place names (‘Midlands’) and spatial relations (‘near’)
• Indexing documents with respect to their geographic context as well as their non-spatial thematic content
• Ranking the relevance of documents with respect to geography as well as theme
• Developing effective user interfaces that help users to find what they want

Detecting geographic references
• The process of geo-parsing is concerned with analyzing text to identify the presence of place names → extension of Named Entity Recognition
• The problem is that place names (or toponyms) can be used to refer to places on Earth, but they also occur frequently within the names of organizations and as part of people’s names.
  • Washington, president or place? (PER vs. LOC)
  • Mexico, country or football team? (LOC vs. ORG)

Two main approaches
• Knowledge-based
  • Using an existing gazetteer
  • List containing information on geographical references (e.g. name, name variations, coordinates, class, size, additional information).
• Data-driven or supervised
  • Using statistical or machine learning methods
  • Typical features are: capitalisation, numeric symbols, punctuation marks, position in the sentence and the words.
Advantages and disadvantages?

Disambiguating place names
• Once it has been established that a place name is being used in a geographic sense, the problem remains of determining uniquely the place to which the name refers → Toponym resolution
• Paris is a place name, but it may refer to the capital of France, or to one of the more than a dozen places named Paris in the US, Canada and Gambia
The ambiguity of a toponym depends on the world knowledge that a system has

Human Errors in TR (taken from a presentation by Davide Buscaldi, UPV, Spain)

Selected Toponym Resources (taken from a presentation by Davide Buscaldi, UPV, Spain)
• Gazetteers
  • Geonames: http://www.geonames.org
  • Wikipedia-World: http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Georeferenzierung/Wikipedia-World/en
• Structured Resources
  • Yahoo! GeoPlanet: http://developer.yahoo.com/geo/geoplanet/
  • Getty Thesaurus of Geographical Names: http://www.getty.edu/research/conducting_research/vocabularies/tgn/
  • (Geo)WordNet: http://users.dsic.upv.es/grupos/nle/resources/geo-wn/download.html

Methods for Toponym Resolution
• Three broad categories:
  • Map-based
    • They need geographical coordinates
  • Knowledge-based
    • They need hierarchical resources
  • Data-driven or supervised
    • They need a large enough set of labeled data
    • Many names occur only once (it is impossible to estimate their probabilities)

A Map-Based method: Smith 2001
• The right referent is the one with minimum average distance from the context locations
• Reported precision: 74% to 93%, depending on the test collection
“One hundred years ago there existed in England the Association for the Promotion of the Unity of Christendom. ... A Birmingham newspaper printed in a column for children an article entitled “The True Story of Guy Fawkes,” ... An Anglican clergyman in Oxford sadly but frankly acknowledged to me that this is true. ... A notable example of this was the discussion of Christian unity by the Catholic Archbishop of Liverpool, …”
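
The minimum-average-distance rule can be sketched as follows. This is a simplified illustration in the spirit of the map-based idea, not a reproduction of the published method: it assumes the context toponyms have already been mapped to single coordinates, uses great-circle (haversine) distance, and relies on approximate, hand-entered coordinates. All names in the code are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def resolve(candidates, context_points):
    """Pick the candidate referent with the smallest mean distance
    to the locations mentioned in the surrounding text."""
    def mean_distance(point):
        return sum(haversine_km(point, c) for c in context_points) / len(context_points)
    return min(candidates, key=lambda name: mean_distance(candidates[name]))


# Approximate coordinates, entered by hand for illustration only.
candidates = {
    "Birmingham, England": (52.48, -1.90),
    "Birmingham, Alabama": (33.52, -86.80),
}
context = [(51.75, -1.26), (53.41, -2.98)]   # Oxford and Liverpool in England
best = resolve(candidates, context)
```

For the quoted passage this picks the English Birmingham, since Oxford and Liverpool are both a short distance away from it. In practice the context toponyms are themselves ambiguous (there is also an Oxford in Mississippi), which this sketch does not address.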

Conceptual Density TR: Buscaldi 2008
• Adaptation of a WSD method based on Conceptual Density, computed over hierarchies of hypernyms, to hierarchies of holonyms
• Given an ambiguous place name, different subhierarchies are obtained from WordNet: the sense related to the most densely marked subhierarchy is selected
“One hundred years ago there existed in England the Association for the Promotion of the Unity of Christendom. ... A Birmingham newspaper printed in a column for children an article entitled “The True Story of Guy Fawkes,” ... An Anglican clergyman in Oxford sadly but frankly acknowledged to me that this is true. ... A notable example of this was the discussion of Christian unity by the Catholic Archbishop of Liverpool, …”
[Figure: holonym hierarchy rooted at World, with branches USA (Mississippi, Alabama) and UK (England), and leaf toponyms Oxford, Liverpool, Oxford, Birmingham(2) and Birmingham(1)]

Spatial and textual indexing
• Once the toponyms have been identified, it is necessary to take advantage of them for indexing.
• The main approach consists of a combination of:
  • A textual index, using all words except toponyms
  • A geographic index, considering only the toponyms
    • Toponym ambiguity may or may not be resolved → implications of this?
    • The geo-index may be enriched using synonyms and holonyms → how to do it? Other related words?
Other indexing alternatives?
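
The split into a textual index and a geographic index can be sketched in a few lines. The example assumes single-token toponyms that are recognized by lookup in a small gazetteer set, and it performs no disambiguation or enrichment with synonyms and holonyms; the function name and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_indexes(documents, gazetteer):
    """Return (text_index, geo_index): toponyms go only to the geographic
    index, every other token goes only to the textual index."""
    text_index, geo_index = defaultdict(set), defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            target = geo_index if token in gazetteer else text_index
            target[token].add(doc_id)
    return text_index, geo_index


documents = {"d1": "car bombing madrid", "d2": "whisky making scotland"}
text_index, geo_index = build_indexes(documents, gazetteer={"madrid", "scotland"})
```

Keeping the two indexes apart is what later allows a thematic score and a geographic score to be computed separately for the same document.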

Geographical relevance ranking
Retrieval of relevant documents requires matching the query specification to the characteristics of the indexed documents
• In Geo-IR there is a need to match the geographical component of the query with the geographical context of documents.
• Traditionally, two scores are used, one for thematic and one for geographic relevance, and they are combined into an overall relevance score.
How to evaluate these similarities? How to consider the <relation> information?
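
The combination formula from the slide is not available in this transcript. One common choice, shown here purely as an assumption, is a linear interpolation of the two scores controlled by a weight; the function name and the parameter `lam` are hypothetical, and both scores are assumed to lie in a comparable range such as [0, 1].

```python
def overall_relevance(thematic, geographic, lam=0.5):
    """Linear interpolation of thematic and geographic relevance.
    lam = 1.0 ignores geography, lam = 0.0 ignores the thematic part."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    return lam * thematic + (1.0 - lam) * geographic


score = overall_relevance(thematic=0.8, geographic=0.4, lam=0.5)
```

The weight would normally be tuned on held-out topics; other combinations (products, rank-based fusion) are equally possible.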

About the ranking function
• Consider the following two queries:
  • “Car bombings near Madrid”
  • “Automotive industry around the Sea of Japan”
• What happens with a document mentioning:
  • Barajas, or Leganés, or Toledo? Or talking about dynamite?
  • Toyota but not “Automotive industry”, or “Mikura Island”?
• How to differentiate between different relations (“in”, “near”, “at the north”, etc.)?

Measuring geographic similarities
• Main approach: query expansion using an external resource and traditional word-comparison of documents.
  • Add to the query some related place names
    • Those whose geographic or topological distance is less than a threshold.
    • Those that satisfy the query relation
  • Documents may also be expanded at indexing time (synonyms and holonyms)
• Alternative approach: evaluate a geographic distance between the locations from the query and the document.
  • Using geographic distances
    • Distance between points (latitude and longitude)
    • Intersection between Minimal Bounding Rectangles
  • Topology distance
    • Computed from a given geographic resource
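
As an illustration of the bounding-rectangle option, the sketch below scores two Minimal Bounding Rectangles by the ratio of their intersection area to their union area. Using this particular ratio as the similarity is an assumption; the computation treats latitude and longitude as planar coordinates in degrees and ignores rectangles that cross the antimeridian. The function name and rectangle layout are hypothetical.

```python
def mbr_overlap(a, b):
    """Intersection-over-union of two rectangles given as
    (min_lat, min_lon, max_lat, max_lon); 0.0 when they do not overlap."""
    lat_overlap = min(a[2], b[2]) - max(a[0], b[0])
    lon_overlap = min(a[3], b[3]) - max(a[1], b[1])
    if lat_overlap <= 0 or lon_overlap <= 0:
        return 0.0
    intersection = lat_overlap * lon_overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return intersection / (area_a + area_b - intersection)


similarity = mbr_overlap((0, 0, 2, 2), (1, 1, 3, 3))
```

A score of 1.0 means the query and document regions coincide, and 0.0 means they are disjoint; a point-to-point distance would be used instead when only coordinates of a single location are known.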

Relevance feedback in Geo-IR
• Traditional IR systems are able to retrieve the majority of the relevant documents for most queries, but they have severe difficulty generating a pertinent ranking of them.
• Idea: use relevance feedback to select some relevant documents and then re-rank the retrieval list using this information.

Our proposed solution
• Based on a Markov random field (MRF) that aims at classifying the ranked documents as relevant or irrelevant.
• The MRF takes into account:
  • Information provided by the base retrieval system
  • Similarities among documents in the list
  • Relevance feedback information
• We reduced the problem of document re-ranking to that of minimizing an energy function that represents a trade-off between document relevance and inter-document similarity.

Proposed architecture

Definition of the MRF
• Each node represents a document from the original retrieved list.
• Each f_i is a binary random variable
  • f_i = 1 indicates that the i-th document is relevant
  • f_i = 0 indicates that it is irrelevant.
• The task of the MRF is to find the most probable configuration F = {f_1, …, f_N}.
  • The configuration that minimizes a given energy function
  • It is necessary to use an optimization technique; we used ICM (Iterated Conditional Modes).

Energy function
• Combines the following information:
  • Inter-document similarity (interaction potential)
  • Query-document similarity and rank information (observation potential)
• These two similarities are computed in a traditional way, without special treatment for the geographic information.

Observation potential
Assumption: relevant documents are very similar to the query, and at the same time they are very likely to appear in the top positions
• Captures the affinity between the document associated with node f_i and the query q.
• Incorporates information from the initial retrieval system
  • Uses the position of documents in the original list

Interaction potential
Assumption: relevant documents are very similar to each other, and less similar to irrelevant documents
• Assesses how much support same-valued documents give to keeping the current value, and how much support opposite-valued documents give to changing to the contrary value.

Relevance feedback
• Used as the seed for building the initial configuration of the MRF
  • We set f_i = 1 for relevance feedback documents and f_j = 0 for the rest.
• The MRF starts the energy minimization process knowing which documents are potentially relevant to the query.
• The inference process consists of identifying further relevant documents in the list by propagating the user relevance feedback information through the MRF.
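
The inference loop described above can be sketched with ICM over binary labels. The exact potentials of the proposed method are not preserved in this transcript, so the ones below are illustrative assumptions only: the observation cost penalizes labelling a high-scoring document as irrelevant (and vice versa), and the interaction cost penalizes disagreeing with similar documents. The names `obs`, `sim`, `feedback` and `lam` are hypothetical; `obs[i]` is assumed to be a score in [0, 1] combining query similarity and original rank.

```python
def icm_rerank(obs, sim, feedback, lam=0.5, max_iters=20):
    """Label documents relevant (1) or irrelevant (0) by greedily
    minimizing an illustrative local energy, seeded with user feedback."""
    n = len(obs)
    # Seed: feedback documents start as relevant, all others as irrelevant.
    labels = [1 if i in feedback else 0 for i in range(n)]

    def local_energy(i, value):
        # Observation term: cheap to call a high-scoring document relevant.
        observation = (1.0 - obs[i]) if value == 1 else obs[i]
        # Interaction term: pay for similar documents holding the other label.
        others = [j for j in range(n) if j != i]
        disagreement = sum(sim[i][j] for j in others if labels[j] != value)
        interaction = disagreement / max(len(others), 1)
        return lam * interaction + (1.0 - lam) * observation

    for _ in range(max_iters):
        changed = False
        for i in range(n):
            best = min((0, 1), key=lambda value: local_energy(i, value))
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:          # converged to a local minimum
            break
    return labels


obs = [0.9, 0.6, 0.2, 0.1]
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
labels = icm_rerank(obs, sim, feedback={0})
```

In this toy list the feedback on document 0 propagates to document 1, which is highly similar to it, while documents 2 and 3 stay irrelevant. Documents labelled 1 would then be moved ahead of those labelled 0 in the final ranking. ICM only finds a local minimum, so the result depends on the seed.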

Evaluation
• We employed the GeoCLEF document collection, composed of news articles from the years 1994 and 1995.
  • 100 topics from GeoCLEF 2005 to 2008.
• We evaluated results using Mean Average Precision (MAP) and precision at N (P@N).
• Initial results were produced by the vector space model configured in Lemur using a TF-IDF weighting scheme.
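
For reference, the two measures can be computed as in the following sketch, which uses the usual definitions of precision at N and (non-interpolated) average precision. The function names and the toy ranking are made up for illustration and are unrelated to the reported experiments.

```python
def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n ranked documents that are relevant."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document;
    relevant documents that are never retrieved contribute 0."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranking, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranking, relevant)      # (1/1 + 2/3) / 2
p_at_2 = precision_at_n(ranking, relevant, 2)
```

MAP rewards placing relevant documents early across the whole list, whereas P@N only looks at the first N positions, which is why the two are usually reported together when evaluating re-ranking.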

Results
