
Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside, CA 92521 eamonn@cs.ucr.edu


Information Retrieval • Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries. • This assumption underlies the field of Information Retrieval.

[Diagram: IR system overview with the components Information need, text input, Parse, Query, Rank, Evaluate, Collections, Pre-process, Index. Questions highlighted: How is the query constructed? How is the text processed?]

Terminology • Token: A natural language word, e.g. “Swim”, “Simpson”, “92513”, etc. • Document: Usually a web page, but more generally any file.

Some IR History • Roots in the scientific “Information Explosion” following WWII • Interest in computer-based IR from mid 1950’s • H.P. Luhn at IBM (1958) • Probabilistic models at Rand (Maron & Kuhns) (1960) • Boolean system development at Lockheed (‘60s) • Vector Space Model (Salton at Cornell 1965) • Statistical Weighting methods and theoretical advances (‘70s) • Refinements and Advances in application (‘80s) • User Interfaces, Large-scale testing and application (‘90s)

Relevance • In what ways can a document be relevant to a query? • Answer precise question precisely. • Who is Homer’s Boss? Montgomery Burns. • Partially answer question. • Where does Homer work? Power Plant. • Suggest a source for more information. • What is Bart’s middle name? Look in Issue 234 of Fanzine • Give background information. • Remind the user of other knowledge. • Others ...

[Diagram: IR system overview with the components Information need, text input, Parse, Query, Rank, Evaluate, Collections, Pre-process, Index. Questions highlighted: How is the query constructed? How is the text processed?] The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form).

Stemming and Morphological Analysis • Goal: “normalize” similar words • Morphology (“form” of words) • Inflectional Morphology • E.g., inflect verb endings and noun number • Never change grammatical class • dog, dogs • Bike, Biking • Swim, Swimmer, Swimming • What about… build, building?

Examples of Stemming (using Porter's algorithm) • Original Words • … consign, consigned, consigning, consignment, consist, consisted, consistency, consistent, consistently, consisting, consists … • Stemmed Words • … consign, consign, consign, consign, consist, consist, consist, consist, consist, consist, consist … Porter's algorithm is available in Java, C, Lisp, Perl, Python etc. from http://www.tartarus.org/~martin/PorterStemmer/

Errors Generated by Porter Stemmer (Krovetz 93) Homework!! Play with the following URL http://fusion.scs.carleton.ca/~dquesnel/java/stuff/PorterApplet.html

Statistical Properties of Text • Token occurrences in text are not uniformly distributed • They are also not normally distributed • They do exhibit a Zipf distribution

Government documents, 157734 tokens, 32259 unique
Most frequent tokens (count token): 8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by
Next most frequent: 969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO
Tokens occurring once (sample): 1 ABC, 1 ABFT, 1 ABOUT, 1 ACFT, 1 ACI, 1 ACQUI, 1 ACQUISITIONS, 1 ACSIS, 1 ADFT, 1 ADVISERS, 1 AE

Plotting Word Frequency by Rank • Main idea: count • How many times tokens occur in the text • Over all texts in the collection • Now order the tokens by how often they occur. A token's position in this ordering is called its rank.

The Corresponding Zipf Curve
Rank  Freq  Term
1     37    system
2     32    knowledg
3     24    base
4     20    problem
5     18    abstract
6     15    model
7     15    languag
8     15    implem
9     13    reason
10    13    inform
11    11    expert
12    11    analysi
13    10    rule
14    10    program
15    10    oper
16    10    evalu
17    10    comput
18    10    case
19    9     gener
20    9     form

Zipf Distribution • The Important Points: • a few elements occur very frequently • a medium number of elements have medium frequency • many elements occur very infrequently

Zipf Distribution • The product of the frequency of words (f) and their rank (r) is approximately constant • Rank = order of words’ frequency of occurrence • Another way to state this is with an approximately correct rule of thumb: • Say the most common term occurs C times • The second most common occurs C/2 times • The third most common occurs C/3 times • …
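The rule of thumb above can be checked on any token list. The following is a minimal sketch, not part of the original slides: the function names `rank_frequency` and `ideal_zipf_counts` are illustrative, and ties in frequency are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, token, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically (an arbitrary choice).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, tok, freq, rank * freq)
            for rank, (tok, freq) in enumerate(ranked, start=1)]

def ideal_zipf_counts(c, n):
    """Counts predicted by the rule of thumb: C, C/2, C/3, ... for ranks 1..n."""
    return [c / r for r in range(1, n + 1)]

if __name__ == "__main__":
    for row in rank_frequency("a a a b b c".split()):
        print(row)
    print(ideal_zipf_counts(12, 3))
```

If a collection follows Zipf's law, the last column (rank times frequency) should stay roughly constant across rows; on small samples it usually will not.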

Zipf Distribution (linear and log scale) Illustration by Jakob Nielsen

What Kinds of Data Exhibit a Zipf Distribution? • Words in a text collection • Virtually any language usage • Library book checkout patterns • Incoming Web Page Requests • Outgoing Web Page Requests • Document Size on Web • City Sizes • …

Consequences of Zipf • There are always a few very frequent tokens that are not good discriminators. • Called “stop words” in IR • English examples: to, from, on, and, the, ... • There are always a large number of tokens that occur once and can mess up algorithms. • Medium-frequency words are the most descriptive.
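One way to act on these consequences is to filter both ends of the frequency range before indexing. This is a sketch under assumptions: the stop list holds only the slide's five examples, and dropping tokens seen fewer than `min_count` times is an illustrative policy, not something the slides prescribe.

```python
from collections import Counter

# Only the examples from the slide; real stop lists are much longer.
STOP_WORDS = {"to", "from", "on", "and", "the"}

def candidate_index_terms(tokens, stop_words=STOP_WORDS, min_count=2):
    """Keep tokens that are not stop words and occur at least min_count times."""
    counts = Counter(t.lower() for t in tokens)
    return {t: c for t, c in counts.items()
            if t not in stop_words and c >= min_count}

if __name__ == "__main__":
    text = "the cat and the dog and the cat on a mat"
    print(candidate_index_terms(text.split()))
```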

Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.

Statistical Independence Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of their happening together: P(x) P(y) = P(x, y).

Lexical Associations • Subjects write first word that comes to mind • doctor/nurse; black/white (Palermo & Jenkins 64) • Text Corpora yield similar associations • One measure: Mutual Information (Church and Hanks 89): I(x, y) = log2( P(x, y) / (P(x) P(y)) ) • If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)

Statistical Independence • Compute for a window of words [Figure: word sequence a b c d e f g h i j k l m n o p, with windows marked w1, w11, w21]
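A minimal sketch of the windowed computation, assuming the Church and Hanks style measure log2(P(x, y) / (P(x) P(y))). The function name `pmi_scores`, the default window size, and the choice to estimate every probability as a count divided by the number of tokens are assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_scores(tokens, window=5):
    """Mutual information for ordered pairs (x, y) where y follows x
    within the next window - 1 tokens."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, x in enumerate(tokens):
        # Every token in the window after position i co-occurs with x.
        for y in tokens[i + 1 : i + window]:
            pair_counts[(x, y)] += 1
    scores = {}
    for (x, y), f_xy in pair_counts.items():
        p_xy = f_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        # Independent words give a ratio near 1, hence a score near 0.
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

if __name__ == "__main__":
    toks = ["doctor", "nurse", "x", "doctor", "nurse", "y"]
    print(pmi_scores(toks, window=2)[("doctor", "nurse")])
```

Scores well above 0 suggest the pair co-occurs more often than independence would predict; on a corpus this small the numbers are not meaningful.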

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89) These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

Associations Are Important Because… • We may be able to discover phrases that should be treated as a single word, e.g. “data mining”. • We may be able to automatically discover synonyms, e.g. “Bike” and “Bicycle”.

Content Analysis Summary • Content Analysis: transforming raw text into more computationally useful forms • Words in text collections exhibit interesting statistical properties • Word frequencies have a Zipf distribution • Word co-occurrences exhibit dependencies • Text documents are transformed to vectors • Pre-processing includes tokenization, stemming, collocations/phrases

[Diagram: IR system overview with the components Information need, text input, Parse, Query, Rank, Evaluate, Collections, Pre-process, Index. Question highlighted: How is the index constructed?] The section that follows is about Index Construction.

Inverted Index • This is the primary data structure for text indexes • Main Idea: • Invert documents into a big index • Basic steps: • Make a “dictionary” of all the tokens in the collection • For each token, list all the docs it occurs in. • Do a few things to reduce redundancy in the data structure

Inverted Indexes We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

How Inverted Files are Created • Documents are parsed to extract tokens. These are saved with the Document ID. • Doc 1: Now is the time for all good men to come to the aid of their country • Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

How Inverted Files are Created • After all documents have been parsed, the inverted file is sorted alphabetically.

How Inverted Files are Created • Multiple term entries for a single document are merged. • Within-document term frequency information is compiled.

How Inverted Files are Created • Then the file can be split into • A Dictionary file and • A Postings file

How Inverted Files are Created Dictionary Postings
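The four steps (parse, sort, merge, split) can be sketched on the two example documents. This is an illustration under assumptions: the tokenizer simply lowercases and strips punctuation, and the dictionary layout (number of documents plus an offset into the postings list) is one plausible choice rather than the layout shown on the original slides.

```python
from collections import Counter

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (a simplification)."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]

def build_inverted_file(docs):
    # Step 1: parse each document into (token, doc_id) pairs.
    pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in tokenize(text)]
    # Step 2: sort alphabetically by token, then by document ID.
    pairs.sort()
    # Step 3: merge repeated entries, keeping the within-document term frequency.
    merged = Counter(pairs)  # (token, doc_id) -> term frequency
    # Step 4: split into a dictionary and a postings list.
    dictionary = {}  # token -> [number of docs, offset of first posting]
    postings = []    # (doc_id, term frequency), grouped by token
    for (tok, doc_id), tf in sorted(merged.items()):
        if tok not in dictionary:
            dictionary[tok] = [0, len(postings)]
        dictionary[tok][0] += 1
        postings.append((doc_id, tf))
    return dictionary, postings

def postings_for(term, dictionary, postings):
    """Look a term up in the dictionary and slice out its postings."""
    n_docs, offset = dictionary.get(term, (0, 0))
    return postings[offset : offset + n_docs]

if __name__ == "__main__":
    dictionary, postings = build_inverted_file(DOCS)
    print(postings_for("country", dictionary, postings))
```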

Inverted Indexes • Permit fast search for individual terms • For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) • These lists can be used to solve Boolean queries: • country -> d1, d2 • manor -> d2 • country AND manor -> d2 • Also used for statistical ranking algorithms

How Inverted Files are Used Query on “time” AND “dark” • 2 docs with “time” in dictionary -> IDs 1 and 2 from postings file • 1 doc with “dark” in dictionary -> ID 2 from postings file • Therefore, only doc 2 satisfies the query.
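An AND query over two postings lists amounts to intersecting them. The sketch below uses a merge-style walk over sorted document-ID lists, which is one common way to do this and is not stated in the slides; the `POSTINGS` table is hand-written from the two example documents.

```python
# Sorted document-ID lists for a few terms from the two example documents.
POSTINGS = {
    "time": [1, 2],
    "dark": [2],
    "country": [1, 2],
    "manor": [2],
}

def intersect(p1, p2):
    """Return the doc IDs present in both sorted lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller current ID
        else:
            j += 1
    return result

if __name__ == "__main__":
    print(intersect(POSTINGS["time"], POSTINGS["dark"]))
```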

[Diagram: IR system overview with the components Information need, text input, Parse, Query, Rank, Evaluate, Collections, Pre-process, Index. Question highlighted: How is the index constructed?] The section that follows is about Querying (and ranking).

Simple query language: Boolean • Terms + Connectors (or operators) • terms • words • normalized (stemmed) words • phrases • connectors • AND • OR • NOT • NEAR (Pseudo Boolean) • Example document (x marks a word it contains): Cat x, Dog, Collar x, Leash

Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)
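Boolean connectors map directly onto set operations over the documents that contain each term. A small sketch, assuming a made-up index of five documents (the document IDs and the `docs` helper are illustrative, not from the slides):

```python
# Hypothetical index: term -> set of IDs of the documents containing it.
INDEX = {
    "cat": {1, 2},
    "dog": {2, 3},
    "collar": {3},
    "leash": {4},
}
ALL_DOCS = {1, 2, 3, 4, 5}

def docs(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return INDEX.get(term.lower(), set())

cat_or_dog = docs("Cat") | docs("Dog")   # OR  -> union
cat_and_dog = docs("Cat") & docs("Dog")  # AND -> intersection
not_cat = ALL_DOCS - docs("Cat")         # NOT -> complement within the collection
both_pairs = (docs("Cat") & docs("Dog")) | (docs("Collar") & docs("Leash"))
pet_and_gear = (docs("Cat") | docs("Dog")) & (docs("Collar") | docs("Leash"))

if __name__ == "__main__":
    print(cat_or_dog, cat_and_dog, not_cat, both_pairs, pet_and_gear)
```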

Boolean Searching “Measurement of the width of cracks in prestressed concrete beams” • Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P), where C = Cracks, B = Beams, W = Width measurement, P = Prestressed concrete
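The relaxed query is the OR of every three-term AND, i.e. "at least 3 of the 4 terms". A sketch that generalizes this to "at least k of n terms"; the function name `relaxed_and` and the document sets are made up for illustration.

```python
from itertools import combinations

def relaxed_and(term_docs, k):
    """Docs matching at least k of the terms: the OR of all k-term ANDs."""
    result = set()
    for combo in combinations(term_docs.values(), k):
        result |= set.intersection(*combo)
    return result

# Hypothetical postings: term -> set of IDs of the documents containing it.
TERM_DOCS = {
    "cracks": {1, 2, 3},
    "beams": {1, 2, 4},
    "width_measurement": {1, 2},
    "prestressed_concrete": {1, 5},
}

if __name__ == "__main__":
    print(relaxed_and(TERM_DOCS, 4))  # formal query: all four terms
    print(relaxed_and(TERM_DOCS, 3))  # relaxed query: any three terms
```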

Ordering of Retrieved Documents • Pure Boolean has no ordering • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term?

Boolean Model • Advantages • simple queries are easy to understand • relatively easy to implement • Disadvantages • difficult to specify what is wanted • too much returned, or too little • ordering not well determined • Dominant language in commercial Information Retrieval systems until the WWW. Since the Boolean model is limited, let's consider a generalization…

Vector Model • Documents are represented as “bags of words” • Represented as vectors when used computationally • A vector is like an array of floating-point numbers • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse • “Smithers secretly loves Monty Burns” • “Monty Burns secretly loves Smithers” • Both map to… • [Burns, loves, Monty, secretly, Smithers]
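A minimal sketch of the bag-of-words mapping, showing that word order is lost. Lowercasing, whitespace tokenization, and the function name `bag_of_words` are simplifying assumptions for illustration.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count vector with one position per vocabulary term; word order is ignored."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

SENTENCES = [
    "Smithers secretly loves Monty Burns",
    "Monty Burns secretly loves Smithers",
]
# The vocabulary holds a place for every term in the (tiny) collection.
VOCABULARY = sorted({w for s in SENTENCES for w in s.lower().split()})

if __name__ == "__main__":
    print(VOCABULARY)
    for s in SENTENCES:
        print(bag_of_words(s, VOCABULARY))
```

With a realistic vocabulary most positions are zero, which is why such vectors are usually stored in a sparse form.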

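Document Vectors: One location for each word [Table: rows are document ids A, B, C, D, E, F, G, H, I; columns are the terms nova, galaxy, heat, h’wood, film, role, diet, fur; each filled cell holds a term count. Cell values as extracted, positions not recoverable: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3]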

We Can Plot the Vectors [Plot: axes “Star” and “Diet”; points labelled “Doc about movie stars”, “Doc about astronomy”, “Doc about mammal behavior”]
