Information Retrieval CSE 8337 (Part I) Spring 2011 • Some material for these slides obtained from: • Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ • Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book • Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze http://informationretrieval.org
CSE 8337 Outline • Introduction • Text Processing • Indexes • Boolean Queries • Web Searching/Crawling • Vector Space Model • Matching • Evaluation • Feedback/Expansion
Information Retrieval • Information Retrieval (IR): retrieving desired information from textual data. • Library Science • Digital Libraries • Web Search Engines • Traditionally keyword based • Sample query: Find all documents about “data mining”.
Motivation • IR: representation, storage, organization of, and access to information items • Focus is on the user information need • User information need (example): • Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament. • Emphasis is on the retrieval of information (not data)
DB vs IR • Records (tuples) vs. documents • Well-defined results vs. fuzzy results • DB grew out of files and traditional business systems • IR grew out of library science and the need to categorize/group/access books/articles
Unstructured data • Typically refers to free text • Allows • Keyword queries including operators • More sophisticated “concept” queries e.g., • find all web pages dealing with drug abuse • Classic model for searching text documents
Semi-structured data • In fact almost no data is “unstructured” • E.g., this slide has distinctly identified zones such as the Title and Bullets • Facilitates “semi-structured” search such as • Title contains data AND Bullets contain search … to say nothing of linguistic structure
DB vs IR (cont’d) • Data retrieval • which docs contain a set of keywords? • Well defined semantics • a single erroneous object implies failure! • Information retrieval • information about a subject or topic • semantics is frequently loose • small errors are tolerated • IR system: • interpret contents of information items • generate a ranking which reflects relevance • notion of relevance is most important
Motivation • IR software issues: • classification and categorization • systems and languages • user interfaces and visualization • Still, area was seen as of narrow interest • Advent of the Web changed this perception once and for all • universal repository of knowledge • free (low cost) universal access • no central editorial board • many problems though: IR seen as key to finding the solutions!
Basic Concepts • The User Task • Retrieval • information or data • purposeful • Browsing • glancing around • [Diagram: user tasks of retrieval and browsing over a database, with a feedback/response loop]
Basic Concepts • Logical view of the documents • [Diagram: text operations pipeline from full text to index terms — accents/spacing, stopwords, noun groups, stemming, and manual indexing applied to docs, with document structure preserved]
The Retrieval Process • [Diagram: the user need enters through the user interface; text operations produce a logical view of text and query; query operations build a query for the searching module, which consults the index (an inverted file built by the indexing/DB manager module) over the text database/WWW; retrieved docs are ranked, and user feedback refines the query]
Basic assumptions of Information Retrieval • Collection: Fixed set of documents • Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
Fuzzy Sets and Logic • Fuzzy Set: Set membership function is a real-valued function with output in the range [0,1]. • f(x): Degree to which x is in F. • 1-f(x): Degree to which x is not in F. • EX: • T = {x | x is a person and x is tall} • Let f(x) be the degree to which x is tall • Here f is the membership function
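To make the membership function concrete, here is a minimal Python sketch for T = tall people; the 1.6 m and 1.9 m cutoffs and the linear ramp are illustrative assumptions, not values from the course material.

def tall(height_m: float) -> float:
    # Fuzzy membership function for T = {x | x is a person and x is tall}.
    # Below 1.6 m -> 0.0, above 1.9 m -> 1.0, linear ramp in between
    # (both thresholds are assumptions for illustration).
    if height_m <= 1.6:
        return 0.0
    if height_m >= 1.9:
        return 1.0
    return (height_m - 1.6) / (1.9 - 1.6)

print(tall(1.5), tall(1.75), tall(2.0))  # -> 0.0, ~0.5, 1.0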
IR is Fuzzy • [Diagram: a crisp Relevant / Not Relevant partition (Simple) contrasted with a gradual boundary between Relevant and Not Relevant (Fuzzy)]
Information Retrieval Metrics • Similarity: measure of how close a query is to a document. • Documents which are “close enough” are retrieved. • Metrics: • Precision = |Relevant and Retrieved| / |Retrieved| • Recall = |Relevant and Retrieved| / |Relevant|
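Both metrics fall out of simple set operations; a minimal Python sketch (the docIDs below are made up for illustration):

def precision_recall(retrieved: set, relevant: set):
    hits = retrieved & relevant                      # relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical docIDs: 3 of the 4 retrieved docs are relevant,
# and 3 of the 5 relevant docs were found.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 9}))  # (0.75, 0.6)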
CSE 8337 Outline • Introduction • Text Processing (Background) • Indexes • Boolean Queries • Web Searching/Crawling • Vector Space Model • Matching • Evaluation • Feedback/Expansion
Text Processing TOC • Simple Text Storage • String Matching • Approximate (Fuzzy) Matching (Spell Checker) • Parsing • Tokenization • Stemming/n-grams • Stop words • Synonyms
Text Storage • EBCDIC/ASCII • Array of characters • Linked list of characters • Trees: B-tree, trie • Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol 10, No 7, July 1967, pp 420-424.
Pattern Matching (Recognition) • Pattern Matching: finds occurrences of a predefined pattern in the data. • Applications include speech recognition, information retrieval, time series analysis.
Similarity Measures • Determine similarity between two objects. • Similarity characteristics: • Alternatively, distance measures quantify how unlike or dissimilar objects are.
String Matching Problem • Input: • Pattern – length m • Text string – length n • Find one (next, all) occurrences of the pattern in the string • Ex: • String: 00110011011110010100100111 • Pattern: 011010
String Matching Algorithms • Brute Force • Knuth-Morris-Pratt • Boyer-Moore
Brute Force String Matching • Slide the pattern 011010 one position at a time across the string 00110011011110010100100111 • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html • Space O(m+n) • Time O(mn)
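The handbook link above gives the original C; a minimal Python sketch of the same O(mn) alignment-by-alignment scan:

def brute_force(text: str, pattern: str) -> int:
    # Return index of the first occurrence of pattern in text, or -1.
    # Worst-case time O(mn): the pattern is re-compared at every alignment.
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):        # slide the pattern one position at a time
        if text[i:i + m] == pattern:  # compare up to m characters
            return i
    return -1

# The slide's example pattern never occurs in its example string, so this prints -1.
print(brute_force("00110011011110010100100111", "011010"))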
Creating FSM • Construct the “correct” spine. • Add a default “failure bus” to state 0. • Add a default “initial bus” to state 1. • For each state, decide its attachments to failure bus, initial bus, or other failure links.
Knuth-Morris-Pratt • Apply FSM to string by processing characters one at a time. • Accepting state is reached when pattern is found. • Space O(m+n) • Time O(m+n) • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
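The FSM above is usually encoded as a failure table; a Python sketch of KMP under that encoding (the linked handbook has the original C):

def kmp_search(text: str, pattern: str) -> int:
    # Failure table: fail[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text one character at a time, never backing up in it.
    k = 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):         # accepting state reached
            return i - len(pattern) + 1
    return -1

print(kmp_search("00110011011110010100100111", "0110"))  # 1 (first match at index 1)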
Boyer-Moore • Scan pattern from right to left • Skip many positions when a mismatched text character does not occur in the pattern • Worst case O(mn) • Expected time better than KMP (typically sublinear) • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
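A simplified Python sketch using only the bad-character heuristic (full Boyer-Moore adds a good-suffix rule, omitted here):

def bm_search(text: str, pattern: str) -> int:
    # Bad-character heuristic only: worst case O(mn), typically sublinear.
    m = len(pattern)
    last = {c: i for i, c in enumerate(pattern)}  # rightmost index of each char
    i = 0
    while i <= len(text) - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:  # scan pattern right to left
            j -= 1
        if j < 0:
            return i                                  # full match at alignment i
        # Shift so the mismatched text character lines up with its rightmost
        # occurrence in the pattern (or jump past it entirely).
        i += max(1, j - last.get(text[i + j], -1))
    return -1

print(bm_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # 17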
Approximate String Matching • Find patterns “close to” the string • Fuzzy matching • Applications: • Spelling checkers • IR • Define similarity (distance) between string and pattern
String-to-String Correction • Levenshtein Distance • http://www.mendeley.com/research/binary-codes-capable-of-correcting-insertions-and-reversals/ • Measure of similarity between strings • Can be used to determine how to convert from one string to another • Cost to convert one to the other • Transformations • Match: Current characters in both strings are the same • Change: Replace current character in input string with current character in target string • Delete: Delete current character in input string • Insert: Insert current character of target string into input string
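A standard dynamic-programming sketch of Levenshtein distance in Python, using the transformations above at unit cost (an assumption; costs can be weighted):

def levenshtein(s: str, t: str) -> int:
    # Edit distance with unit-cost change, delete, and insert; a match is free.
    prev = list(range(len(t) + 1))           # distance from "" to t[:j]
    for i, sc in enumerate(s, 1):
        curr = [i]                           # distance from s[:i] to ""
        for j, tc in enumerate(t, 1):
            cost = 0 if sc == tc else 1      # match vs. change
            curr.append(min(prev[j] + 1,     # delete sc
                            curr[j - 1] + 1, # insert tc
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3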
Spell Checkers • Check or Replace or Expand or Suggest • Phonetic • Use phonetic spelling for word • Truespel www.foreignword.com/cgi-bin//transpel.cgi • Phoneme: smallest unit of sound • Jaro-Winkler distance measure • http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance • Autocomplete • www.amazon.com
Tokenization • Find individual words (tokens) in text string. • Look for spaces, commas, etc. • http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
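A naive tokenizer sketch in Python; real tokenizers (see the linked chapter) must also handle apostrophes, hyphens, numbers, and non-Latin scripts:

import re

def tokenize(text: str) -> list[str]:
    # Lowercase, then keep maximal runs of letters; punctuation and
    # whitespace act as delimiters.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Friends, Romans, countrymen."))
# ['friends', 'romans', 'countrymen']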
Stemming/n-grams • Reduce a token/word to a stem shared by its derivations • Remove suffixes (s, ed, ing, …) • Remove prefixes (pre, re, un, …) • n-gram: subsequence of length n
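A toy Python sketch of both ideas; the suffix list and length guard are illustrative assumptions, not the rules of a real stemmer such as Porter's:

def crude_stem(word: str) -> str:
    # Toy suffix stripping; real stemmers apply ordered rewrite rules.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def ngrams(word: str, n: int) -> list[str]:
    # All contiguous character subsequences of length n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(crude_stem("killed"), ngrams("caesar", 3))
# kill ['cae', 'aes', 'esa', 'sar']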
Stopwords • Common words with little retrieval value • “Bad” words to exclude from the index • Implementation: • Text file
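A minimal sketch of that implementation in Python; the stopword set here is a tiny assumed sample of what a real system would load from its text file:

# Tiny assumed stopword list; a real system would read one from a text file.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "noble", "brutus", "was", "ambitious"]))
# ['noble', 'brutus', 'ambitious']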
Synonyms • Exact/similar meaning • Hierarchy • One way • Bidirectional • Expand Query • Replace terms • Implementation: • Synonym File or dictionary
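A sketch of one-way query expansion in Python; the dictionary stands in for the synonym file, and its entries are invented for illustration:

# Invented synonym entries standing in for a synonym file or dictionary.
SYNONYMS = {"car": ["auto", "automobile"], "fast": ["quick", "rapid"]}

def expand_query(terms: list[str]) -> list[str]:
    # One-way expansion: each query term pulls in its listed synonyms.
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand_query(["fast", "car"]))
# ['fast', 'quick', 'rapid', 'car', 'auto', 'automobile']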
CSE 8337 Outline • Introduction • Text Processing • Indexes • Boolean Queries • Web Searching/Crawling • Vector Space Model • Matching • Evaluation • Feedback/Expansion
Index • Common access is by keyword • Fast access by keyword • Index organizations? • Hash • B-tree • Linked List • Process document and query to identify keywords
Term-document incidence • Matrix entry is 1 if the play contains the word, 0 otherwise • Columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth • Antony 1 1 0 0 0 1 • Brutus 1 1 0 1 0 0 • Caesar 1 1 0 1 1 1 • Calpurnia 0 1 0 0 0 0 • Cleopatra 1 0 0 0 0 0 • mercy 0 1 1 1 1 1 • worser 1 0 1 1 1 0 • Query: Brutus AND Caesar but NOT Calpurnia
Incidence vectors • So we have a 0/1 vector for each term. • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND. • 110100 AND 110111 AND 101111 = 100100. • http://www.rhymezone.com/shakespeare/
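The same computation in Python, treating each incidence vector as a machine word and answering the query with bitwise operators:

# The three incidence vectors from the slide, one bit per play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

answer = brutus & caesar & ~calpurnia & 0b111111  # mask back to 6 plays
print(format(answer, "06b"))  # 100100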
Inverted index • For each term T, we must store a list of all documents that contain T. • Do we use an array or a list for this? • Brutus → 2 4 8 16 32 64 128 • Calpurnia → 1 2 3 5 8 13 21 34 • Caesar → 13 16 • What happens if the word Caesar is added to document 14?
Inverted index • Linked lists generally preferred to arrays • Dynamic space allocation • Insertion of terms into documents easy • Space overhead of pointers • Dictionary of terms, each pointing to its postings list; each docID in a list is a posting • Brutus → 2 4 8 16 32 64 128 • Calpurnia → 1 2 3 5 8 13 21 34 • Caesar → 13 16 • Sorted by docID (more later on why).
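A minimal Python sketch of building such a dictionary-plus-postings structure; it assumes whitespace tokenization and skips the linguistic modules discussed next:

from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    # Term -> sorted list of docIDs containing the term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: "brutus killed caesar", 2: "brutus was noble"})
print(index["brutus"], index["caesar"])  # [1, 2] [1]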
Inverted index construction • Documents to be indexed: Friends, Romans, countrymen. • Tokenizer → token stream: Friends Romans Countrymen • Linguistic modules → modified tokens: friend roman countryman • Indexer → inverted index: friend → 2 4, roman → 1 2, countryman → 13 16
Indexer steps • Sequence of (Modified token, Document ID) pairs. • Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. • Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sort by terms. Core indexing step.
Multiple term entries in a single document are merged. • Frequency information is added. Why frequency? Will discuss later.
The result is split into a Dictionary file and a Postings file.
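The core steps (sort the pairs, merge duplicate entries, record frequencies, split into dictionary and postings) in a minimal Python sketch with invented toy pairs:

# Invented (modified token, docID) pairs for two toy documents.
pairs = [("caesar", 1), ("brutus", 1), ("caesar", 2),
         ("brutus", 2), ("caesar", 2)]

pairs.sort()                        # core indexing step: sort by term, then docID
postings: dict[str, list[list[int]]] = {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])  # dictionary entry for the term
    if plist and plist[-1][0] == doc_id:
        plist[-1][1] += 1           # merge duplicate entry, bump term frequency
    else:
        plist.append([doc_id, 1])   # new [docID, frequency] posting

print(postings)
# {'brutus': [[1, 1], [2, 1]], 'caesar': [[1, 1], [2, 2]]}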
Where do we pay in storage? • Terms (the dictionary) • Pointers (the postings)