Introduction to Information Retrieval James Allan Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts, Amherst
Goals of this talk • Understand the IR problem • Understand IR vs. databases • Understand basic idea behind IR solutions • How does it work? • Why does it work? • Why don’t IR systems work perfectly? • Understand that you shouldn’t be surprised • Understand how research systems are evaluated
Overview • What is Information Retrieval • Some history • Why IR ≠ databases • How IR works • Evaluation
What is Information Retrieval? • Process of finding documents (text, mostly) that help someone satisfy an information need (query) • Includes related organizational tasks: • Classification - assign documents to known classes • Routing - direct documents to proper person • Filtering - select documents for a long-standing request • Clustering - unsupervised grouping of related documents
In case that’s not obvious… [Figure: a query submitted to a search engine and the ranked results it returns]
Sample Systems • IR systems • Verity, Fulcrum, Excalibur, Oracle • InQuery, Smart, Okapi • Web search and In-house systems • West, LEXIS/NEXIS, Dialog • Lycos, AltaVista, Excite, Yahoo, HotBot, Google • Database systems • Oracle, Access, Informix, mysql, mdbms
History of Information Retrieval • Foundations • Library of Alexandria (3rd century BC, 500K volumes) • First concordance of the Bible (13th century AD) • Printing press (15th century) • Johnson’s dictionary (1755) • Dewey Decimal classification (1876) • Early automation • Luhn’s statistical retrieval/abstracting (1959), Salton (60s) • MEDLINE (1964), Dialog (1967) • Recent developments • Relevance ranking available (late 80’s) • Large-scale probabilistic system (West, 1992) • Multimedia, Internet, Digital Libraries (late 90’s)
Goals of IR • Basic goal and original motivation • Find documents that help answer query • IR is not “question answering” • Technology is broadly applicable to related areas • Linking related documents • Summarizing documents or sets of documents • Entire collections • Information filtering • Multi- and cross-lingual • Multimedia (images and speech)
Issues of IR • Text (and other media) representation • What is a “good” representation and how is it generated? • Queries • What is an appropriate query language? how to formulate a query? • How to translate user’s need into query language? • Comparison of representations • What is a “good” model of retrieval? • How is uncertainty recognized? • Evaluation of methods • What is a good measure and a good testbed?
Overview • What is Information Retrieval • Some history • Why IR ≠ databases • How IR works • Evaluation
IR vs. Databases • Databases • Structured data (relations) • Fields with reasonably clear semantics • i.e., attributes • (age, SSN, name) • Strict query languages (relational algebra, SQL) • Information Retrieval • Unstructured data (generally text) • No semantics on “fields” • Free text (“natural language”) queries • Structured queries (e.g., Boolean) possible
IR vs. Database Systems (more) • IR has emphasis on effective, efficient retrieval of unstructured data • IR systems typically have very simple schemas • IR query languages emphasize free text although Boolean combinations of words also common • Matching is more complex than with structured data (semantics less obvious) • Easy to retrieve the wrong objects • Need to measure accuracy of retrieval • Less focus on concurrency control and recovery, although update is very important
Overview • What is Information Retrieval • Some history • Why IR ≠ databases • How IR works • Evaluation
Basic Approach • Most successful approaches are statistical • Direct, or effort to capture probabilities • Why not natural language understanding? • State of the art is brittle in unrestricted domains • Can be highly successful in predictable settings • e.g., information extraction on terrorism or takeovers (MUC) • Could use manually assigned headings • Human agreement is not good • Expensive • “Bag of words”
What is this about? (term frequencies from the story on the next slide)
6: parrot, Santos
4: Alvarez
3: Escol, investigate, police, suspect
2: asked, bird, burglar, buy, case, Fernando, headquarters, intruder, planned, scream, steal
1: … accompanied, admitted, after, alarm, all, approaches, asleep, birdseed, broke, called, charges, city, daily, decided, drop, during, early, exactly, forgiveness, friday, green, help, house, identified, kept, living, manila, master, mistake, mum, Nanding, national, neighbors, outburst, outside, paid, painstaking, panicked, pasay, peso, pet, philippine, pikoy, press, quoted, reward, room, rushed, saying, scaring, sell, speckled, squawks, star, stranger, surrendered, taught, thursday, training, tried, turned, unemployed, upstairs, weaverbird, woke, 22, 1800, $44
The original text Fernando Santos' painstaking training of his pet parrot paid off when a burglar broke into his living room. Doing exactly what it had been taught to do -- scream when a stranger approaches -- Pikoy the parrot screamed: “Intruder! Intruder!” The squawks woke up his master who was asleep upstairs early Thursday, while scaring off the suspect, investigator Nanding Escol said Friday. The suspect, identified as Fernando Alvarez, 22, panicked because of the parrot's outburst and soon surrendered to Santos and his neighbors who rushed to the house to help. Alvarez, who is unemployed, admitted to police that he tried to steal the bird and sell it, Escol said. During investigation at Pasay City Police Headquarters, just outside Manila, the suspect pleaded that Santos drop the case because he did not steal the parrot after all. Alvarez turned to the speckled green bird and asked for its forgiveness as well. But Alvarez called it a weaverbird by mistake, and Santos asked investigators to press charges. Santos was quoted by a national daily Philippine Star as saying that he had planned to buy a burglar alarm but has now decided to buy birdseed for his 1,800-peso ($44) parrot as a reward. The parrot, which accompanied Santos to police headquarters, kept mum on the case, Escol said. http://cnn.com/ASIANOW/southeast/9909/24/fringe/screaming.parrot.ap/index.html
Components of Approach (1 of 4) • Reduce every document to features • Words, phrases, names, … • Links, structure, metadata, … • Example: • Pikoy the parrot screamed: “Intruder! Intruder!” The squawks woke up his master... investigator Nanding Escol said Friday • pikoy parrot scream intrude intrude squawk wake master investigate nand escol said friday • “pikoy the parrot”, “nanding escol” • DATE = 1999-09-24, SOURCE = CNN_Headline_News
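A minimal sketch of this reduction step in Python (the stopword list and the suffix-stripping stemmer below are toy stand-ins for illustration, not the ones a real system such as InQuery uses):

```python
import re

# Toy stopword list, for illustration only.
STOPWORDS = {"the", "a", "i", "am", "in", "about", "and", "to", "of", "his", "it"}

def stem(word):
    # Crude suffix stripping; a real system would use a Porter-style stemmer.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def features(text):
    """Reduce a document to a bag of stemmed, stopword-free word features."""
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(features('Pikoy the parrot screamed: "Intruder! Intruder!"'))
# ['pikoy', 'parrot', 'scream', 'intrud', 'intrud'] with this toy stemmer
```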
Components of Approach (2 of 4) • Assign weights to selected features • Most systems combine tf, idf, and doc’s length • Example: • Frequency within a document (tf) • intrude occurs twice, so more important • Document’s length • two intrude’s in a passage → important • two intrude’s over 10 pages → less important • Frequency across documents (idf) • if every document contains intrude, it has little value • may be important part of a document’s meaning • but does nothing to differentiate documents
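A sketch of one common way to combine these three factors (a BM25-style weight; the constants and the exact form are illustrative assumptions, not necessarily the formula used by any particular system mentioned here):

```python
import math
from collections import Counter

def tfidf_weights(doc_terms, collection, k=1.5, b=0.75):
    """Weight each term of one document using tf, idf, and document length.

    doc_terms  -- the document being weighted, as a list of terms
    collection -- all documents (lists of terms), used for idf and average length
    """
    n_docs = len(collection)
    avg_len = sum(len(d) for d in collection) / n_docs
    weights = {}
    for term, tf in Counter(doc_terms).items():
        df = sum(1 for d in collection if term in d)        # document frequency
        idf = math.log((n_docs + 1) / (df + 0.5))           # rare terms weigh more
        norm_tf = tf / (tf + k * (1 - b + b * len(doc_terms) / avg_len))
        weights[term] = norm_tf * idf                       # length-normalized tf * idf
    return weights
```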
Components of Approach (3 of 4) • Reduce query to set of weighted features • Parallel of document reduction methods • Example: • I am interested in stories about parrots and the police • interested stories parrots police • parrots police • Optional: expand query to capture synonyms • parrot → bird+jungle+squawk+... • Problems: parrot → mimic, repeat (the verb sense)
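A sketch of the query side, reusing the document-reduction function from the earlier sketch; the synonym table is purely hypothetical:

```python
# Hypothetical, hand-built synonym table used only to illustrate expansion.
SYNONYMS = {"parrot": ["bird", "jungle", "squawk"]}

def query_features(query, expand=False):
    """Reduce a free-text query with the same pipeline used for documents,
    optionally expanding terms with synonyms (which risks the wrong sense)."""
    terms = features(query)                  # features() from the sketch above
    if expand:
        terms = [s for t in terms for s in [t] + SYNONYMS.get(t, [])]
    return terms

print(query_features("I am interested in stories about parrots and the police"))
# ['interest', 'storie', 'parrot', 'police'] with the toy stemmer above
```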
Components of Approach (4 of 4) • Compare query to documents • Fundamentally: • Looking for word (feature) overlap • More features in common between query and doc → more likely doc is relevant to query • However: • Highly weighted features more important • Might impose some feature presence criteria • e.g., at least two features must be present
Vector Space Model [Figure: documents and the query plotted as vectors in a space whose axes are terms, here “police” and “parrot”]
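A minimal sketch of the comparison step under the vector space view in the figure: query and documents become weighted term vectors and documents are ranked by cosine similarity. It reuses the toy tfidf_weights and query_features sketches above and is not InQuery’s actual scoring function:

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d[t] for t, w in q.items() if t in d)   # overlap of weighted features
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    d_norm = math.sqrt(sum(w * w for w in d.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

def rank(query, docs):
    """Return (score, doc_id) pairs, best match first."""
    q_vec = tfidf_weights(query_features(query), docs)
    scored = [(cosine(q_vec, tfidf_weights(d, docs)), i) for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)
```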
Inference Networks (InQuery) • Query represents a combination of evidence user believes will capture relevance • Assign each query feature a “belief” • Similar in motivation to a probability • Like your beliefs in an outcome given evidence • In fact, VS and IN use same weighting/belief • Lots of ways to combine query beliefs • #sum(a b), #parsum200(a b) • #and(a b), #band(a b), #not(a), #or(a b) • #wsum( 8 a 3 b )
InQuery Beliefs • Belief that document is about term i • Combine tf, idf, and length normalization
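The slide’s formula is not reproduced on this page; a commonly published form of the InQuery belief that document d is about term t (a default belief of 0.4, a length-normalized tf component, and a scaled idf component) is approximately:

```latex
\mathrm{bel}(t \mid d) \;=\; 0.4 \;+\; 0.6 \cdot
\frac{tf_{t,d}}{tf_{t,d} + 0.5 + 1.5 \cdot \frac{dl_d}{avgdl}} \cdot
\frac{\log\!\left(\frac{N + 0.5}{df_t}\right)}{\log\,(N + 1)}
```

where tf_{t,d} is the term frequency, dl_d the document length, avgdl the average document length, df_t the number of documents containing t, and N the collection size.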
Efficient Implementations • How to handle comparisons efficiently • Inverted lists for access to large collections • Several gigabytes of text now common • Millions of documents • TREC’s VLC: 20Gb in 1997, 100Gb in 1998 • 20.1Gb is about 7.5 million documents • The Web... • Indexing must be fast also • Hundreds of megabytes to a gigabyte per hour
Indexes: Inverted Lists • Inverted lists are most common indexing technique • Source file: collection, organized by document • Inverted file: collection organized by term • one record per term, listing locations where term occurs
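A minimal sketch of building such an index (one record per term, listing the documents and positions where the term occurs):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)   # record each occurrence
    return index

index = build_inverted_index({1: ["parrot", "police", "parrot"],
                              2: ["police", "report"]})
print(index["parrot"])   # {1: [0, 2]} -- the term's record lists where it occurs
```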
Variations on a Theme • Wide range of feature selection and weighting • Different models of “similarity” • Exact match • Title, author, etc. • Boolean • Greater sense of “control” but less effective • Probabilistic • P(query|document) and P(document|query) • More rigorous, better for research • Discoveries here can be co-opted by other approaches! • Topological • Deform space of documents based on user feedback
Summary of approach • Reduce documents to features • Weight features • Reduce query to set of weighted features • Compare query to all documents • Select closest ones first • Vector space model is most common • Usually augmented with Boolean constructs • Most other models are research-only systems
Overview • What is Information Retrieval • Some history • Why IR ≠ databases • How IR works • Evaluation
Evaluation of IR • Need to measure quality of IR systems • What does “50% accuracy” mean? • Typically measured with test collections • Set of known documents • Set of known queries • Relevance judgments for those queries • Run systems A and B, or system A and a modified version A′ • Measure difference in returned results • If sufficiently large, can rank systems • Usually requires many queries (at least 25 at a time) and many collections to believe it is predictive • TREC (NIST) is best known IR evaluation workshop
Precision and recall • Precision - proportion of retrieved set that is relevant • Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved) • Recall - proportion of all relevant documents in the collection included in the retrieved set • Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant) • Precision and recall are well-defined for sets • For ranked retrieval • Compute a P/R point for each relevant document, interpolate • Compute at fixed recall points (e.g., precision at 20% recall) • Compute at fixed rank cutoffs (e.g., precision at rank 20)
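A small sketch of these definitions, plus the per-rank precision/recall points used for ranked retrieval:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def pr_points(ranking, relevant):
    """Precision/recall after each rank cutoff of a ranked result list."""
    return [precision_recall(ranking[:k], relevant) for k in range(1, len(ranking) + 1)]
```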
Precision and recall example [Figure: a ranked list of ten documents; shaded boxes mark the relevant documents]

Ranking #1 (relevant documents at ranks 1, 3, 6, 9, 10)
Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5
Avg Prec = ( 1.0 + 0.67 + 0.5 + 0.44 + 0.5 ) / 5 = 0.62

Ranking #2 (relevant documents at ranks 2, 5, 6, 7, 8)
Recall:    0.0  0.2  0.2  0.2  0.4  0.6  0.8  1.0  1.0  1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.5  0.57 0.63 0.55 0.5
Avg Prec = ( 0.5 + 0.4 + 0.5 + 0.57 + 0.63 ) / 5 = 0.52
Precision and recall, second example

Ranking #1 (same relevant documents as before, at ranks 1, 3, 6, 9, 10)
Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5

Ranking #3 (a different query with three relevant documents, at ranks 2, 5, 7)
Recall:    0.0  0.33 0.33 0.33 0.67 0.67 1.0  1.0  1.0  1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.33 0.43 0.38 0.33 0.3
Interpolation and averaging • Hard to compare individual P/R graphs or tables • Two main types of averaging • microaverage - each relevant document is a point in the average • macroaverage - each query is a point in the average • Average precision at standard recall points • For given query, compute P/R point for every relevant doc • Interpolate precision at standard recall levels • Average over all queries to get average precision at each recall level • Average over all recall levels to get a single result • overall average is not very useful itself • still commonly used: strong correlation with other measures
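A sketch of the per-query measure behind these averages, assuming “average precision” means the usual uninterpolated mean of the precision values at each relevant document’s rank (the example reproduces Ranking #1 from the earlier slide):

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Ranking #1 above: relevant documents at ranks 1, 3, 6, 9, and 10.
ranking = [f"d{i}" for i in range(1, 11)]
print(round(average_precision(ranking, {"d1", "d3", "d6", "d9", "d10"}), 2))   # 0.62
```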
Improvements in IR over the years (data thanks to Chris Buckley, Sabir)
Ad-hoc task. Rows: TREC query set; columns: SMART system version; each cell: average precision, with % improvement over the TREC-1 version of SMART in parentheses.

Task     TREC-1   TREC-2         TREC-3         TREC-4         TREC-5          TREC-6          TREC-7
TREC-1   0.2442   0.3056 (25.1)  0.3400 (39.2)  0.3628 (48.6)  0.3759 (53.9)   0.3709 (51.9)   0.3778 (54.7)
TREC-2   0.2615   0.3344 (27.9)  0.3512 (34.3)  0.3718 (42.2)  0.3832 (46.6)   0.3780 (44.6)   0.3839 (46.8)
TREC-3   0.2099   0.2828 (34.8)  0.3219 (53.4)  0.3812 (81.6)  0.3992 (90.2)   0.4011 (91.1)   0.4003 (90.7)
TREC-4   0.1533   0.1728 (12.8)  0.2131 (39.0)  0.2819 (83.9)  0.3107 (102.7)  0.3044 (98.6)   0.3142 (105.0)
TREC-5   0.1048   0.1111 (6.0)   0.1287 (22.9)  0.1842 (75.8)  0.2046 (95.3)   0.2028 (93.6)   0.2116 (102.0)
TREC-6   0.0997   0.1125 (12.8)  0.1242 (24.6)  0.1807 (81.3)  0.1844 (85.0)   0.1768 (77.3)   0.1804 (80.9)
TREC-7   0.1137   0.1258 (10.6)  0.1679 (47.7)  0.2262 (99.0)  0.2547 (124.0)  0.2510 (120.8)  0.2543 (123.7)
Summary of talk • What is Information Retrieval • Some history • Why IR ≠ databases • How IR works • Evaluation
Overview • What is Information Retrieval • Some history • Why IR ≠ databases • How IR works • Evaluation • Collaboration • Other research • Interactive • Event detection • Cross-language IR • Timelines • Hierarchies