Information Retrieval September 19, 2005 Dragomir R. Radev University of Michigan radev@umich.edu
About the instructor • Dragomir R. Radev • Associate Professor, University of Michigan • School of Information • Department of Electrical Engineering and Computer Science • Department of Linguistics • Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan • Treasurer, North American Chapter of the ACL • Ph.D., 1998, Computer Science, Columbia University • Email: radev@umich.edu • Home page: http://tangra.si.umich.edu/~radev
IR systems • Google • Vivísimo • AskJeeves • NSIR • Lemur • MG • Nutch
Examples of IR systems • Conventional (library catalog). Search by keyword, title, author, etc. • Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language. • Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, …). • Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language
Need for IR • Advent of WWW - more than 8 Billion documents indexed on Google • How much information? 200TB according to Lyman and Varian 2003. http://www.sims.berkeley.edu/research/projects/how-much-info/ • Search, routing, filtering • User’s information need
Some definitions of Information Retrieval (IR) • Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.” • Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”
Sample queries (from Excite) In what year did baseball become an offical sport? play station codes . com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running
Mappings and abstractions (from Korfhage’s book) • Reality → Data • Information need → Query
Typical IR system • (Crawling) • Indexing • Retrieval • User interface
Key Terms Used in IR • QUERY: a representation of what the user is looking for - can be a list of words or a phrase. • DOCUMENT: an information entity that the user wants to retrieve • COLLECTION: a set of documents • INDEX: a representation of information that makes querying easier • TERM: word or concept that appears in a document or a query
Documents • Not just printed paper • collections vs. documents • data structures: representations • Bag of words method • document surrogates: keywords, summaries • encoding: ASCII, Unicode, etc.
Document preprocessing • Formatting • Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc) • Casing (cat vs. CAT) • Stemming (computer, computation) • Soundex
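As a rough illustration of these steps, here is a minimal Python sketch (not from the slides) that tokenizes, lowercases, and applies a toy suffix-stripping stemmer; a real system would use a proper tokenizer, a stemmer such as Porter’s, and possibly Soundex for phonetic matching.

```python
import re

def tokenize(text):
    # Split on non-alphanumeric characters; a real tokenizer would also
    # handle cases like "Paul's", "Willow Dr.", and phone numbers.
    return [t for t in re.split(r"[^A-Za-z0-9']+", text) if t]

def normalize_case(tokens):
    # Fold everything to lower case ("cat" vs. "CAT").
    return [t.lower() for t in tokens]

def crude_stem(token):
    # Toy suffix stripping for illustration only; not the Porter stemmer.
    for suffix in ("ational", "ation", "ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

if __name__ == "__main__":
    text = "Computers and computation: Paul's ad hoc retrieval experiments"
    tokens = normalize_case(tokenize(text))
    print([crude_stem(t) for t in tokens])
```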
Document representations • Term-document matrix (m x n) • term-term matrix (m x m x n) • document-document matrix (n x n) • Example: 3,000,000 documents (n) with 50,000 terms (m) • sparse matrices • Boolean vs. integer matrices
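A hedged sketch of a Boolean term-document matrix for a toy collection (the four documents are reused from the exercise later in the deck); dense lists are used only for illustration, since a realistic 50,000 × 3,000,000 matrix would have to be stored sparsely.

```python
def term_document_matrix(docs):
    """Build a Boolean term-document incidence matrix.

    Returns (terms, matrix) where matrix[i][j] == 1 iff terms[i]
    occurs in docs[j].  Dense lists are for illustration only.
    """
    vocab = sorted({t for d in docs for t in d.split()})
    matrix = [[1 if term in d.split() else 0 for d in docs] for term in vocab]
    return vocab, matrix

docs = ["computer information retrieval",
        "computer retrieval",
        "information",
        "computer information"]
terms, M = term_document_matrix(docs)
for term, row in zip(terms, M):
    print(f"{term:12s} {row}")
```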
Document representations • Term-document matrix • Evaluating queries (e.g., (A AND B) OR C) • Storage issues • Inverted files • Storage issues • Evaluating queries • Advantages and disadvantages
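For comparison, a sketch of an inverted file and of evaluating a Boolean query such as (A AND B) OR C by intersecting and merging postings lists; the toy documents and the particular query are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the sorted list of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def postings(index, term):
    # Postings list as a set, so Boolean operators map to set operations.
    return set(index.get(term, []))

docs = ["computer information retrieval",
        "computer retrieval",
        "information",
        "computer information"]
idx = build_inverted_index(docs)

# Evaluate (information AND retrieval) OR computer
result = (postings(idx, "information") & postings(idx, "retrieval")) | postings(idx, "computer")
print(sorted(result))   # document ids matching the Boolean query
```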
Major IR models • Boolean • Vector • Probabilistic • Language modeling • Fuzzy retrieval • Latent semantic indexing
Major IR tasks • Ad-hoc • Filtering and routing • Question answering • Spoken document retrieval • Multimedia retrieval
Venn diagrams • (Figure: Venn diagram of documents D1 and D2 with regions w, x, y, z)
Boolean model • (Figure: Venn diagram of sets A and B)
Boolean queries restaurants AND (Mideastern OR vegetarian) AND inexpensive • What types of documents are returned? • Stemming • thesaurus expansion • inclusive vs. exclusive OR • confusing uses of AND and OR dinner AND sports AND symphony 4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
Boolean queries • Weighting (Beethoven AND sonatas) • precedence coffee AND croissant OR muffin raincoat AND umbrella OR sunglasses • Use of negation: potential problems • Conjunctive and Disjunctive normal forms • Full CNF and DNF
Boolean model • Partition • Partial relevance? • Operators: AND, NOT, OR, parentheses
Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • D3 = “information” • D4 = “computer information” • Q1 = “information retrieval” • Q2 = “information ¬computer”
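One possible way to check this exercise mechanically is to represent each document as a set of terms and test Boolean conditions; reading Q1 as an AND query is an assumption, since the slide leaves the operator implicit.

```python
# Represent each document as the set of its terms.
D = {
    1: {"computer", "information", "retrieval"},
    2: {"computer", "retrieval"},
    3: {"information"},
    4: {"computer", "information"},
}

def matches(doc_terms, must_have=(), must_not_have=()):
    # Boolean conjunctive matching: all required terms present,
    # no forbidden terms present.
    return (all(t in doc_terms for t in must_have)
            and not any(t in doc_terms for t in must_not_have))

# Q1 = "information retrieval", read here as information AND retrieval
q1 = [d for d, terms in D.items() if matches(terms, must_have=("information", "retrieval"))]
# Q2 = "information AND NOT computer"
q2 = [d for d, terms in D.items() if matches(terms, must_have=("information",), must_not_have=("computer",))]
print("Q1 matches:", q1)   # expected: [1]
print("Q2 matches:", q2)   # expected: [3]
```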
Exercise ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Stop lists • 250-300 most common words in English account for 50% or more of a given text. • Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. • Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). • Token/type ratio: 2256/859 = 2.63
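The counting behind such token/type statistics can be sketched as follows (the Moby Dick numbers above come from the slide, not from running this toy example):

```python
from collections import Counter
import re

def token_type_stats(text, top_k=65):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)                      # word occurrences (tokens)
    n_types = len(counts)                       # unique words (types)
    top_coverage = sum(c for _, c in counts.most_common(top_k))
    return {
        "tokens": n_tokens,
        "types": n_types,
        "token/type ratio": round(n_tokens / n_types, 2),
        f"tokens covered by top {top_k} types": top_coverage,
    }

sample = "the whale and the sea and the ship of the sea"
print(token_type_stats(sample, top_k=3))
```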
Vector models • (Figure: documents Doc 1, Doc 2, Doc 3 plotted in a vector space whose axes are Term 1, Term 2, Term 3)
Vector queries • Each document is represented as a vector • non-efficient representations (bit vectors) • dimensional compatibility
The matching process • Document space • Matching is done between a document and a query (or between two documents) • distance vs. similarity • Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
Miscellaneous similarity measures • The Cosine measure: cos(D,Q) = Σ (di · qi) / ( sqrt(Σ di²) · sqrt(Σ qi²) ) — the dot product of the two vectors divided by the product of their lengths |X| · |Y| • The Jaccard coefficient: J(D,Q) = |X ∩ Y| / |X ∪ Y|
Exercise • Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1> • Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
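A sketch that computes the measures named above for the exercise vectors; treating Jaccard as a set coefficient over the non-zero dimensions is one common reading, which the slide does not pin down.

```python
import math

def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))
    return dot / (math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q)))

def euclidean(d, q):
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

def manhattan(d, q):
    return sum(abs(di - qi) for di, qi in zip(d, q))

def jaccard(d, q):
    # Set-based Jaccard over the dimensions with non-zero weight.
    X = {i for i, v in enumerate(d) if v}
    Y = {i for i, v in enumerate(q) if v}
    return len(X & Y) / len(X | Y)

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
# cosine(D1, D2) comes out as 1.0 (same direction), cosine(D1, D3) as 0.6.
for name, other in [("D2", D2), ("D3", D3)]:
    print(name,
          "cosine:", round(cosine(D1, other), 3),
          "euclidean:", round(euclidean(D1, other), 3),
          "manhattan:", manhattan(D1, other),
          "jaccard:", jaccard(D1, other))
```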
Relevance • Difficult to judge: fuzzy, inconsistent across assessors • Methods for obtaining judgments: exhaustive, sampling, pooling, search-based
Contingency table

                   retrieved      not retrieved
  relevant             w                x            n1 = w + x
  not relevant         y                z
                  n2 = w + y                          N = w + x + y + z
Precision and Recall • Recall = w / (w + x) • Precision = w / (w + y)
Exercise Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google. Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.
Interpolated average precision (e.g., 11pt) Interpolation – what is precision at recall=0.5?
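Under the standard TREC convention, the interpolated precision at recall level r is the highest precision observed at any recall ≥ r, and the 11-point average takes these values at r = 0.0, 0.1, …, 1.0. A small sketch with made-up (recall, precision) observations:

```python
def interpolated_precision(recall_precision, r):
    # Standard interpolation: the best precision achieved at any
    # recall level >= r (so the interpolated curve never goes back up).
    candidates = [p for rec, p in recall_precision if rec >= r]
    return max(candidates) if candidates else 0.0

def eleven_point_average(recall_precision):
    points = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return sum(interpolated_precision(recall_precision, r) for r in points) / 11

# (recall, precision) pairs observed at each relevant document retrieved
observed = [(0.25, 1.0), (0.5, 0.67), (0.75, 0.5), (1.0, 0.44)]
print("P_interp at recall 0.5:", interpolated_precision(observed, 0.5))
print("11-point average precision:", round(eleven_point_average(observed), 3))
```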
Issues • Why not use accuracy A = (w + z) / N? • Average precision • Average P at given “document cutoff values” • Report P at the point where P = R (the break-even point) • F measure: F = (β² + 1)PR / (β²P + R) • F1 measure: F1 = 2 / (1/R + 1/P), the harmonic mean of P and R
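A sketch tying the contingency-table counts w, x, y, z to these measures; the counts are invented, but they illustrate why accuracy can look deceptively high when z (irrelevant and not retrieved) dominates.

```python
def precision_recall_f(w, x, y, z, beta=1.0):
    """Compute P, R, F, and accuracy from contingency-table counts.

    w: relevant and retrieved      x: relevant, not retrieved
    y: retrieved, not relevant     z: neither retrieved nor relevant
    """
    precision = w / (w + y)
    recall = w / (w + x)
    b2 = beta ** 2
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    accuracy = (w + z) / (w + x + y + z)   # dominated by z in large collections
    return precision, recall, f, accuracy

# Toy counts: 20 relevant retrieved, 30 relevant missed,
# 10 irrelevant retrieved, 940 irrelevant not retrieved.
print(precision_recall_f(w=20, x=30, y=10, z=940))
```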
Relevance collections • TREC ad hoc collections, 2-6 GB • TREC Web collections, 2-100GB
Sample TREC query <top> <num> Number: 305 <title> Most Dangerous Vehicles <desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles? <narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example. </top> LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208
<DOCNO> LA031689-0177 </DOCNO> <DOCID> 31701 </DOCID> <DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE> <SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION> <LENGTH><P>586 words </P></LENGTH> <HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE> <BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE> <TEXT> <P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P> <P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P> <P>Several Fatalities </P> <P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P> <P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P> ... </TEXT> <GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC> <SUBJECT> <P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P> </SUBJECT> </DOC>
TREC (cont’d) • http://trec.nist.gov/tracks.html • http://trec.nist.gov/presentations/presentations.html
Shakespeare • Romeo and Juliet: • And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; • … • A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1; http://www.mta75.org/curriculum/english/Shakes/indexx.html
The BNC (Adam Kilgarriff) • 1 6187267 the det • 2 4239632 be v • 3 3093444 of prep • 4 2687863 and conj • 5 2186369 a det • 6 1924315 in prep • 7 1620850 to infinitive-marker • 8 1375636 have v • 9 1090186 it pron • 10 1039323 to prep • 11 887877 for prep • 12 884599 i pron • 13 760399 that conj • 14 695498 you pron • 15 681255 he pron • 16 680739 on prep • 17 675027 with prep • 18 559596 do v • 19 534162 at prep • 20 517171 by prep Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2), 1997, pp. 135–155.