Inf 722 Information Organisation
Class notes: Information Retrieval
Jagdish S. Gangolly
Inf 722 Information Organisation (Fall 2007) (Gangolly)
FOA (Finding Out About) Process
• Asking a question (query formulation)
• Constructing an answer (retrieval algorithms)
• Assessing the answer (feedback on relevance)
FOA Process
• Query language
  • Natural or artificial
  • Vocabulary
  • Syntax: operators, arguments
• Query expansion, specialization, disambiguation, relevance feedback
FOA Process
• Constructing the answer
  • Is the information need accurately translated into the query?
  • How can the answer be provided in a form suitable to the user?
  • Should background be provided so that the user can verbalise the information need better?
  • How can the query and the corpus be represented efficiently and effectively?
FOA Process
• Constructing the answer (contd.)
  • Generate a set of index terms that render the documents in the collection as different from one another as possible
  • Conflation algorithms
    • Removal of function/fluff/stop words (usually closed-class words)
    • Stripping suffixes (stemming or lemmatization)
    • Detection of equivalent/associated words
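The first two conflation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used in the course: the stop list, the suffix list, and the names `STOP_WORDS`, `SUFFIXES`, `strip_suffix`, and `conflate` are all assumptions made for the example, and the third step (detecting equivalent or associated words) is not attempted.

```python
# Hypothetical stop list and suffix list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(text):
    """Lowercase, split on whitespace, drop stop words, then strip suffixes."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    return [strip_suffix(t) for t in tokens if t and t not in STOP_WORDS]


terms = conflate("The indexing of documents and indexed terms")
print(terms)
```

A real system would more likely use an established stemmer (for example, Porter's) and a curated stop list; the crude suffix list here only shows where each step sits in the pipeline.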
FOA Process
• Facets of documents:
  • Structure (DTD)
  • Format (CSS, XSL)
  • Content (XSD)
• Unit of interest
• Tagging of corpora
  • Content tagging, grammatical tagging
FOA Process
• Step 1: Selection of corpora to build
  • Population from which the documents to be included are selected (domain, genre, ...)
• Step 2: Selection of tagging, if necessary
  • Grammatical or other tagging schemes
FOA Process
• Step 3: Indexing
  • Index: doc_i → {kw_j}
  • Index^-1 (inverted index): {kw_j} → doc_i
  • Extracting lexical features:
    • Step a: Selection of tokens and separators
    • Step b: Stemming decisions on number, gender (for some languages), hyphenation, phrases, idioms, morphological features, ...
    • Step c: Removal of stop words using a list
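The two mappings in Step 3 can be sketched as follows. This is a minimal sketch under simplifying assumptions: tokens are separated by whitespace only, no stemming is applied, and the function names `build_index` and `invert` are illustrative rather than taken from the notes.

```python
from collections import defaultdict


def build_index(docs, stop_words=frozenset()):
    """Forward index: map each document id to its set of keywords."""
    index = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # whitespace is the only separator here
        index[doc_id] = {t for t in tokens if t not in stop_words}
    return index


def invert(index):
    """Inverted index: map each keyword to the set of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, keywords in index.items():
        for kw in keywords:
            inverted[kw].add(doc_id)
    return dict(inverted)


docs = {"d1": "the cat sat", "d2": "the dog sat"}
index = build_index(docs, stop_words={"the"})
inverted = invert(index)
print(inverted["sat"])
```

The inverted form is the one consulted at query time, since it answers "which documents contain this keyword?" without scanning every document.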
FOA Process
• Use of Zipf’s Law in indexing
FOA Process
• Zipf’s Law
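The formula on this slide did not survive extraction. In its usual statement, Zipf's law says that a word's frequency is roughly inversely proportional to its rank in the frequency table, so rank × frequency stays roughly constant. The sketch below tabulates that product for a list of tokens; the function name `rank_frequency` and the toy data are assumptions for illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# Toy data constructed so that the product is exactly constant.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)
```

On a real corpus the product only hovers around a constant, and the fit is typically weakest at the very top and bottom ranks.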
FOA Process
• Explanations of Zipf’s Law
  • Zipf: Principle of Least Effort
  • Mandelbrot: a more general version of Zipf’s law, and its similarity to Cantor dust (fractals)
FOA Process
• Word occurrences as a Poisson process and the detection of stop words
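The slide gives no detail, so the following is one possible reading, stated as an assumption. If a word's occurrences were scattered over documents like a Poisson process with rate λ = (total occurrences) / (number of documents), the expected fraction of documents containing the word at least once would be 1 − e^(−λ). Function words tend to sit close to this prediction, while content words tend to cluster in fewer documents than predicted. The names `poisson_doc_fraction` and `poisson_gap` are hypothetical.

```python
import math


def poisson_doc_fraction(total_occurrences, n_docs):
    """Fraction of documents expected to contain the word under a Poisson model."""
    lam = total_occurrences / n_docs  # mean occurrences per document
    return 1.0 - math.exp(-lam)


def poisson_gap(docs, word):
    """Predicted minus observed fraction of documents containing `word`.

    `docs` is a list of token lists. A gap near zero is consistent with the
    Poisson model; a large positive gap means the word is clustered.
    """
    n_docs = len(docs)
    total = sum(doc.count(word) for doc in docs)
    observed = sum(1 for doc in docs if word in doc) / n_docs
    return poisson_doc_fraction(total, n_docs) - observed


docs = [["zipf", "zipf", "zipf", "zipf"], ["index"], ["query"], ["corpus"]]
print(poisson_gap(docs, "zipf"))
```

Any threshold on the gap for flagging a stop word would have to be tuned on the corpus at hand; none is suggested by the notes.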
FOA Process
• Resolving power of words in discriminating between documents
  • Relationship between word frequency and word significance (for non-function words), i.e., authors use words more frequently to signal their importance
  • To serve as index terms, words must help discriminate between documents
FOA Process
• Precision v. Recall
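Using the standard definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch, with the function name `precision_recall` and the document ids chosen only for illustration:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator is reported as 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Retrieving more documents can only keep recall the same or raise it, but usually lowers precision, which is the trade-off named in the slide title.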
FOA Process
• Specificity v. Exhaustivity
  • An index is specific if it reflects the information needs of the users
  • An index is exhaustive if it reflects all topics covered by the documents
  • There is tension between the two
FOA Process
• Word (term) frequency: the number of times a word is used in a document
• Inverse document frequency: a weight that decreases as the number of documents in the corpus in which a word is used increases
• Robertson-Sparck Jones weighting
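The two quantities above are commonly combined into a tf-idf weight. The sketch below uses one common variant, tf × log(N / df), where N is the number of documents and df is the number of documents containing the term; the notes do not say which variant the course used, so this choice and the name `tf_idf` are assumptions. The Robertson-Sparck Jones weight, which additionally uses relevance information, is not implemented here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict of doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


docs = {"d1": ["zipf", "law", "law"], "d2": ["law", "index"]}
weights = tf_idf(docs)
print(weights["d1"])
```

A term that occurs in every document, such as "law" here, gets weight zero under this variant, which matches the idea that index terms must help discriminate between documents.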
Vector Space Model
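The body of this slide did not survive extraction. In the vector space model as usually presented, documents and queries are vectors of term weights and are compared by the cosine of the angle between them. The following is a minimal sketch under that assumption; sparse vectors are stored as dicts, and the names `cosine` and `rank` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document ids ordered by decreasing cosine similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)


docs = {"d1": {"dog": 1.0}, "d2": {"cat": 1.0, "dog": 1.0}}
print(rank({"cat": 1.0}, docs))
```

The weights in each vector could be raw term frequencies or tf-idf values; cosine similarity normalises for document length in either case.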