Boolean IR and Text Processing - Lecture Overview

Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2004 http://www.sims.berkeley.edu/academics/courses/is202/f04/ Lecture 4: Boolean IR and Text Processing SIMS 202: Information Organization and Retrieval

Advertisement • Not doing anything on Friday afternoon? • Please come to the Friday Afternoon Seminar – Open to ALL • This Week: • Clifford Lynch, director of the Coalition for Networked Information and Adjunct Professor of SIMS on “Research Questions in Digital Stewardship” • See • http://www.sims.berkeley.edu/academics/courses/is296a-1/f04/

Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure (revisited) • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

IR is an Iterative Process Repositories Goals Workspace

Berry-Picking Model • A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89) • Figure: the search path passes through a sequence of queries, Q0 through Q5

Restricted Form of the IR Problem • The system has available only pre-existing, “canned” text passages • Its response is limited to selecting from these passages and presenting them to the user • It must select, say, 10 or 20 passages out of millions or billions!

Information Retrieval • Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries • This set of assumptions underlies the field of Information Retrieval

Paradox • The “Fundamental paradox of Information Retrieval” as stated by Roland Hjerppe • The need to describe that which you do not know in order to find it

Structure of an IR System • Search Line: Interest profiles & Queries → Formulating query in terms of descriptors → Storage of profiles → Store1: Profiles/Search requests • Storage Line: Documents & data → Indexing (Descriptive and Subject) → Storage of Documents → Store2: Document representations • Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language) • Comparison/Matching of Store1 against Store2 → Potentially Relevant Documents • Adapted from Soergel, p. 19

Central Concepts in IR • Documents • Queries • Collections • Evaluation • Relevance

Documents • What do we mean by a document? • Full document? • Document surrogates? • Pages? • Buckland (JASIS, Sept. 1997) “What is a Document” • Are IR systems better called Document Retrieval systems? • A document is a representation of some aggregation of information, treated as a unit

Collection • A collection is some physical or logical aggregation of documents • A database • A Library • An index? • Others?

Queries • A query is some expression of a user’s information needs • Can take many forms • Natural language description of need • Formal query in a query language • Queries may not be accurate expressions of the information need • Differences between conversation with a person and formal query expression

Evaluation: Why Evaluate? • Determine if the system is desirable • Make comparative assessments • Others?

What To Evaluate? • How much of the information need was satisfied • How much was learned about a topic • Incidental learning • How much was learned about the collection • How much was learned about other topics • How inviting the system is…

What To Evaluate? • What can be measured that reflects users’ ability to use system? (Cleverdon 66) • Coverage of information • Form of presentation • Effort required/ease of use • Time and space efficiency • Effectiveness, measured by: • Recall: proportion of relevant material actually retrieved • Precision: proportion of retrieved material actually relevant
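The recall and precision definitions above can be sketched as set arithmetic. This is a minimal illustration, assuming that relevance judgments are available as a set of document IDs; the function name `precision_recall` and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 documents retrieved, 5 relevant in the collection.
retrieved = {"D1", "D2", "D3", "D4"}
relevant = {"D2", "D4", "D5", "D6", "D7"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four retrieved documents are relevant, and two of the five relevant documents were retrieved, so the two measures differ even though they share the same numerator.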

Query Languages • A way to express the question (information need) • Types: • Boolean • Natural Language • Stylized Natural Language • Form-Based (GUI)

Simple Query Language: Boolean • Terms + Operators • Terms • Words • Normalized (stemmed) words • Phrases • Thesaurus terms • Boolean Operators • AND • OR • NOT

Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)
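One way to read these queries is as set operations on the documents that contain each term: OR is union, AND is intersection. The sketch below assumes a tiny made-up collection and lowercased terms; `docs` and `postings` are illustrative names, not part of any particular system.

```python
# Toy collection (hypothetical): document ID -> set of index terms.
docs = {
    1: {"cat", "collar"},
    2: {"dog", "leash"},
    3: {"cat", "dog"},
    4: {"collar", "leash"},
    5: {"cat"},
}

def postings(term):
    """Set of document IDs whose index terms include the given term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

cat, dog = postings("cat"), postings("dog")
collar, leash = postings("collar"), postings("leash")

q1 = cat | dog                          # Cat OR Dog
q2 = cat & dog                          # Cat AND Dog
q3 = (cat & dog) | collar               # (Cat AND Dog) OR Collar
q4 = (cat & dog) | (collar & leash)     # (Cat AND Dog) OR (Collar AND Leash)
q5 = (cat | dog) & (collar | leash)     # (Cat OR Dog) AND (Collar OR Leash)

for name, result in [("q1", q1), ("q2", q2), ("q3", q3), ("q4", q4), ("q5", q5)]:
    print(name, sorted(result))
```

Note how the last two queries use the same four terms but return different documents, which is why the grouping matters.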

Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • Each of the following combinations works: Recall the card based systems? They mechanically implement Boolean AND

Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • None of the following combinations works:

Boolean Logic A B

Boolean Queries • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is a UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b) = NOT(a AND b) • NOT(NOT(a)) = a
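The three rules can be checked exhaustively over truth values, and they carry over to result sets if NOT is read as the complement within the collection. The sketch below assumes a made-up collection of eight document IDs; `complement` is an illustrative helper.

```python
from itertools import product

# Check the three identities for every truth assignment of a and b.
for a, b in product([False, True], repeat=2):
    assert ((not a) and (not b)) == (not (a or b))
    assert ((not a) or (not b)) == (not (a and b))
    assert (not (not a)) == a

# The same rules on result sets: NOT is the complement within the collection.
collection = set(range(1, 9))  # hypothetical document IDs 1..8
A, B = {1, 2, 3, 4}, {3, 4, 5, 6}

def complement(s):
    """Documents in the collection that are not in s."""
    return collection - s

demorgan_and = (complement(A) & complement(B)) == complement(A | B)
demorgan_or = (complement(A) | complement(B)) == complement(A & B)
double_negation = complement(complement(A)) == A
print(demorgan_and, demorgan_or, double_negation)
```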

Boolean Logic • Figure: Venn diagram of three terms t1, t2, t3, with documents D1 through D11 placed in the eight regions m1 through m8 • Each region m1 … m8 is one combination of t1, t2, and t3 in which each term is either present or negated

Boolean Searching • Information need: “Measurement of the width of cracks in prestressed concrete beams” • Formal Query: Cracks AND Beams AND Width_measurement AND Prestressed_concrete • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P) • where C = Cracks, B = Beams, W = Width measurement, P = Prestressed concrete
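The relaxed query is the OR of every three-term AND that can be formed from the four terms, so it accepts any document matching at least three of them. A minimal sketch, assuming a made-up set of indexed documents and the abbreviations C, B, W, P for the four terms:

```python
from itertools import combinations

# Hypothetical index terms per document: C=Cracks, B=Beams,
# W=Width_measurement, P=Prestressed_concrete.
docs = {
    "D1": {"C", "B", "W", "P"},
    "D2": {"C", "B", "P"},
    "D3": {"C", "W"},
    "D4": {"B", "W", "P"},
    "D5": {"P"},
}
query_terms = ["C", "B", "W", "P"]

def formal(terms):
    """Formal query: all four terms must be present."""
    return all(t in terms for t in query_terms)

def relaxed(terms):
    """Relaxed query: OR over every 3-term AND combination."""
    return any(all(t in terms for t in combo)
               for combo in combinations(query_terms, 3))

formal_hits = sorted(d for d, t in docs.items() if formal(t))
relaxed_hits = sorted(d for d, t in docs.items() if relaxed(t))
print(formal_hits, relaxed_hits)
```

The relaxed form trades precision for recall: documents missing exactly one of the four concepts are now retrieved.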

Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing! • Need a way to group combinations • Phrases: • “stray cat” AND “frayed collar” • +“stray cat” + “frayed collar”
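The exact meaning of this notation varied between search engines. The sketch below assumes one common reading: a term prefixed with "+" is required, and an unprefixed term is optional and only affects ordering. Under that assumption `+cat dog +collar leash` is not the same as `(cat OR dog) AND (collar OR leash)`. The functions `parse` and `search` are illustrative, and quoted phrases are not handled.

```python
def parse(query):
    """Split a '+term term' query into (required, optional) term sets."""
    required, optional = set(), set()
    for tok in query.lower().split():
        if tok.startswith("+"):
            required.add(tok[1:])
        else:
            optional.add(tok)
    return required, optional

def search(query, docs):
    """Return IDs of documents containing every required term, best match first."""
    required, optional = parse(query)
    hits = [(len(optional & terms), doc_id)
            for doc_id, terms in docs.items()
            if required <= terms]
    # More optional terms matched first; ties broken by document ID.
    return [doc_id for score, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]

# Hypothetical collection: document ID -> set of index terms.
docs = {
    1: {"cat", "collar"},
    2: {"cat", "dog", "collar", "leash"},
    3: {"dog", "leash"},
    4: {"cat", "collar", "leash"},
}
result = search("+cat dog +collar leash", docs)
print(result)
```

Document 3 matches `(cat OR dog) AND (collar OR leash)` but is excluded here because it lacks both required terms.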

Another View of IR Information Need Collections Pre-Process Text Input Query Index Parse Rank

Result Sets • Run a query, get a result set • Two choices • Reformulate query, run on entire collection • Reformulate query, run on result set • Example: Dialog query • (Redford AND Newman) • -> S1 1450 documents • (S1 AND Sundance) • -> S2 898 documents
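Refining a saved result set can be sketched as intersecting it with another postings set. This is only an illustration of the idea, not Dialog's actual command syntax; the postings, the `saved_sets` dictionary, and the `run` helper are assumptions made for the example.

```python
# Hypothetical postings: term -> set of document IDs.
index = {
    "redford": {1, 2, 3, 4, 5},
    "newman": {2, 3, 4, 5, 6},
    "sundance": {3, 5, 7},
}

saved_sets = {}

def run(name, result):
    """Save a result set under a name and report its size."""
    saved_sets[name] = result
    return len(result)

n1 = run("S1", index["redford"] & index["newman"])     # (Redford AND Newman)
n2 = run("S2", saved_sets["S1"] & index["sundance"])   # (S1 AND Sundance)
print(n1, n2)

# For a pure AND refinement, rerunning on the entire collection gives the same set.
full = index["redford"] & index["newman"] & index["sundance"]
```

Each refinement with AND can only shrink the set, which matches the drop from S1 to S2 in the example above.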

Feedback Queries Information Need Collections Pre-Process Text Input Query Index Parse Reformulated Query Rank Re-Rank

Ordering of Retrieved Documents • Pure Boolean has no ordering • In practice: • Order chronologically • Order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term? • Fancier methods have been investigated • p-norm is most famous • Usually impractical to implement • Usually hard for user to understand
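The question of one term with many hits versus one hit for each term can be made concrete with two simple orderings. The sketch below assumes two made-up documents that both satisfy `cat OR dog`; `total_hits` and `distinct_terms` are illustrative names.

```python
from collections import Counter

# Hypothetical tokenized documents that both satisfy the query "cat OR dog".
docs = {
    "D1": "cat cat cat cat bird".split(),
    "D2": "cat dog fish".split(),
}
query_terms = ["cat", "dog"]

def total_hits(tokens):
    """Total number of occurrences of any query term."""
    counts = Counter(tokens)
    return sum(counts[t] for t in query_terms)

def distinct_terms(tokens):
    """Number of different query terms that occur at least once."""
    return sum(1 for t in query_terms if t in tokens)

by_hits = sorted(docs, key=lambda d: -total_hits(docs[d]))
by_terms = sorted(docs, key=lambda d: -distinct_terms(docs[d]))
print(by_hits, by_terms)
```

The two orderings disagree: counting total hits favors D1, which repeats one term, while counting distinct terms favors D2, which contains both.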

Boolean • Advantages • Simple queries are easy to understand • Relatively easy to implement • Disadvantages • Difficult to specify what is wanted • Too much returned, or too little • Ordering not well determined • Dominant language in commercial IR systems until the WWW, and still the language of Database Management Systems

Faceted Boolean Query • Strategy: Break query into facets (polysemous with earlier meaning of facets) • Conjunction of disjunctions • (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) • Each facet expresses a topic • (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou) • Also known as Conjunctive Normal Form or CNF

Faceted Boolean Query • Query still fails if one facet missing • Alternative: Coordination level ranking • Order results in terms of how many facets (disjunctions) are satisfied • Also called Quorum ranking, Overlap ranking, and Best Match • Problem: Facets still undifferentiated • Alternative: Assign weights to facets
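Coordination level ranking can be sketched by counting, for each document, how many facets have at least one matching term. The facets below are the three from the rain forest example; the documents and the function name `coordination_level` are made up for illustration, and facets are left unweighted.

```python
# One set of alternative terms per facet.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical collection: document ID -> set of index terms.
docs = {
    "D1": {"jungle", "remedy", "zhou"},
    "D2": {"amazon", "cure"},
    "D3": {"smith"},
    "D4": {"river"},
}

def coordination_level(terms):
    """Number of facets with at least one matching term."""
    return sum(1 for facet in facets if facet & terms)

# Strict faceted query: every facet must be satisfied.
strict = [d for d, t in docs.items() if coordination_level(t) == len(facets)]

# Coordination level ranking: keep partial matches, most facets satisfied first.
ranked = sorted((d for d in docs if coordination_level(docs[d]) > 0),
                key=lambda d: (-coordination_level(docs[d]), d))
print(strict, ranked)
```

The strict query returns only D1, while the ranked list also surfaces D2 and D3, which each miss at least one facet.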

Proximity Searches • Proximity: Terms occur within K positions of one another • pen w/5 paper • A “Near” function can be more vague • near(pen, paper) • Sometimes order can be specified • Also, Phrases and Collocations • “United Nations” “Bill Clinton” • Phrase Variants • “retrieval of information” “information retrieval”
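A proximity test can be sketched over token positions. The sketch assumes that "within K positions" means the difference between the two token positions is at most K; real systems differ on details such as whether intervening words or positions are counted. The helper names `positions` and `within` are illustrative.

```python
def positions(tokens, term):
    """All positions at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, k, ordered=False):
    """True if a and b occur within k token positions of one another.

    With ordered=True, a must come before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            if ordered and j < i:
                continue
            if 0 < abs(j - i) <= k:
                return True
    return False

text = "she put the pen down next to a sheet of paper".split()
near_5 = within(text, "pen", "paper", 5)   # pen w/5 paper
near_7 = within(text, "pen", "paper", 7)

# A two-word phrase is the ordered case with k = 1.
title = "the united nations met today".split()
is_phrase = within(title, "united", "nations", 1, ordered=True)
print(near_5, near_7, is_phrase)
```

Here "pen" and "paper" are seven positions apart, so the w/5 test fails and a looser window succeeds.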

Filters • Filters: Reduce set of candidate docs • Often specified simultaneously with the query • Usually restrictions on metadata • Restrict by: • Date range • Internet domain (.edu .com .berkeley.edu) • Author • Size • Limit number of documents returned
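Filters of this kind can be sketched as predicates on document metadata applied to the candidate set. The metadata fields, the sample records, and the functions `in_domain` and `apply_filters` below are assumptions made for the example.

```python
from datetime import date

# Hypothetical candidate documents with metadata fields.
results = [
    {"id": "D1", "date": date(2003, 5, 1), "domain": "berkeley.edu", "size": 12_000},
    {"id": "D2", "date": date(2004, 9, 14), "domain": "example.com", "size": 3_000},
    {"id": "D3", "date": date(2004, 2, 20), "domain": "sims.berkeley.edu", "size": 45_000},
]

def in_domain(host, domain):
    """True if host equals domain or is a subdomain of it ('.edu' matches 'berkeley.edu')."""
    domain = domain.lstrip(".")
    return host == domain or host.endswith("." + domain)

def apply_filters(docs, start=None, end=None, domain=None, max_size=None, limit=None):
    """Keep documents that pass every filter that was specified."""
    kept = [
        d for d in docs
        if (start is None or d["date"] >= start)
        and (end is None or d["date"] <= end)
        and (domain is None or in_domain(d["domain"], domain))
        and (max_size is None or d["size"] <= max_size)
    ]
    return kept if limit is None else kept[:limit]

filtered = apply_filters(results, start=date(2004, 1, 1), domain=".berkeley.edu")
print([d["id"] for d in filtered])
```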

Boolean Systems • Most of the commercial database search systems that pre-date the WWW are based on Boolean search • Dialog, Lexis-Nexis, etc. • Most Online Library Catalogs are Boolean systems • E.g., MELVYL • Database systems use Boolean logic for searching • Many of the search engines sold for intranet search of web sites are Boolean

Why Boolean? • Easy to implement • Efficient searching across very large databases • Easy to explain results • “Has to have all of the words…” (AND) • “Has to have at least one of the words…” (OR)

Content Analysis • Automated Transformation of raw text into a form that represents some aspect(s) of its meaning • Including, but not limited to: • Automated Thesaurus Generation • Phrase Detection • Categorization • Clustering • Summarization

Techniques for Content Analysis • Statistical • Single Document • Full Collection • Linguistic • Syntactic • Semantic • Pragmatic • Knowledge-Based (Artificial Intelligence) • Hybrid (Combinations)

Text Processing • Standard Steps: • Recognize document structure • Titles, sections, paragraphs, etc. • Break into tokens • Usually space and punctuation delineated • Special issues with Asian languages • Stemming/morphological analysis • Store in inverted index (to be discussed later)
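The tokenizing and indexing steps can be sketched in a few lines. This sketch assumes space- and punctuation-delimited text, so it does not address languages written without spaces, and it skips structure recognition and stemming. The names `tokenize` and `build_index` are illustrative.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build an inverted index: token -> set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index

# Hypothetical two-document collection.
docs = {
    1: "Cracks in prestressed-concrete beams.",
    2: "Measuring the width of cracks",
}
index = build_index(docs)
print(tokenize(docs[1]))
print(sorted(index["cracks"]))
```

Lowercasing makes "Cracks" and "cracks" the same token, and splitting on the hyphen indexes "prestressed" and "concrete" separately.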

Content Analysis Areas Information Need Collections How is the query constructed? Pre-Process Text Input How is the text processed? Query Index Parse Rank

Document Processing Steps From “Modern IR” Textbook

Stemming and Morphological Analysis • Goal: “normalize” similar words • Morphology (“form” of words) • Inflectional Morphology • E.g., inflect verb endings and noun number • Never change grammatical class • dog, dogs • tengo, tienes, tiene, tenemos, tienen • Derivational Morphology • Derive one word from another • Often change grammatical class • build, building; health, healthy

Automated Methods • Powerful multilingual tools exist for morphological analysis • PCKimmo, Xerox Lexical technology • Require a grammar and dictionary • Use “two-level” automata • Stemmers: • Very dumb rules work well (for English) • Porter Stemmer: Iteratively remove suffixes • Improvement: Pass results through a lexicon
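The idea of simple suffix rules, and of passing the result through a lexicon, can be sketched as follows. This is a toy suffix stripper, not the Porter algorithm: the suffix list, the minimum stem length, and the tiny lexicon are all assumptions made for the example.

```python
SUFFIXES = ["ing", "ed", "es", "s"]  # hypothetical rule list, tried in this order

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def stem_with_lexicon(word, lexicon):
    """Accept the stripped form only if it is a known word; otherwise keep the original."""
    stem = toy_stem(word)
    return stem if stem in lexicon else word.lower()

lexicon = {"dog", "build", "this"}  # hypothetical word list

for w in ["dogs", "building", "builds", "this"]:
    print(w, toy_stem(w), stem_with_lexicon(w, lexicon))
```

The rules conflate "dogs" with "dog" and "building" with "build", but they also turn "this" into "thi"; the lexicon check rejects that non-word and keeps "this".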

Errors Generated by Porter Stemmer From Krovetz ‘93
