This review discusses the process of information retrieval, including the sources of information, the information acquisition process, and the assumptions and capabilities of an automated information retrieval system. It also explores the measures of performance for an IR system and the algorithms used in IR software. Additionally, the review covers the hierarchy of data, information, and knowledge, the growth of information resources, and the use of Big-O notation to evaluate algorithms. Lastly, it addresses the importance of queries and the matching criteria used in retrieving documents.
What is information retrieval • Gathering information from a source(s) based on a need • Major assumption - that information exists. • Broad definition of information • Sources of information • Other people • Archived information (libraries, maps, etc.) • Web • Radio, TV, etc.
Information retrieved • Impermanent information • Conversation • Documents • Text • Video • Files • Etc.
The information acquisition process • Know what you want and go get it • Ask questions to information sources as needed (queries) - SEARCH • Have information sent to you on a regular basis based on some predetermined information need • Push/pull models
What IR assumes • Information is stored (or available) • A user has an information need • An automated system exists from which information can be retrieved • Why an automated system? • The system works!!
What IR is usually not about • Usually just unstructured data • Retrieval from databases is usually not considered • Database querying assumes that the data is in a standardized format • Transforming all information, news articles, web sites into a database format is difficult for large data collections
What an IR system should do • Store/archive information • Provide access to that information • Answer queries with relevant information • Stay current • WISH list • Understand the user’s queries • Understand the user’s need • Acts as an assistant
How good is the IR system Measures of performance based on what the system returns: • Relevance • Coverage • Recency • Functionality (e.g. query syntax) • Speed • Availability • Usability • Time/ability to satisfy user requests
How do IR systems work Algorithms implemented in software • Gathering methods • Storage methods • Indexing • Retrieval • Interaction
A Typical Web Search Engine [Figure: block diagram with the components Web, Crawler, Indexer, Index, Query Engine, Interface, and Users]
Crawlers • Web crawlers (spiders) gather information (files, URLs, etc) from the web. • Primitive IR systems
Information Seeking Behavior • Two parts of the process: • search and retrieval • analysis and synthesis of search results
What is knowledge? • Data - Facts, observations, or perceptions. • Information - Subset of data, only including those data that possess context, relevance, and purpose. • Knowledge - A more simplistic view considers knowledge as being at the highest level in a hierarchy with data (at the lowest level) and information (at the middle level). • Data refers to bare facts void of context. • A telephone number. • Information is data in context. • A phone book. • Knowledge is information that facilitates action. • Recognizing that a phone number belongs to a good client, who needs to be called once per week to get his orders.
From Facts to Wisdom (Haeckel & Nolan, 1993): one example of the hierarchy
Size of information resources • Why important? • Scaling • Time • Space • Which is more important?
Trying to fill a terabyte in a year Moore’s Law and its impact!
Measuring the Growth of Work While it is possible to measure the work done by an algorithm for a given set of input, we need a way to: • Measure the rate of growth of an algorithm based upon the size of the input • Compare algorithms to determine which is better for the situation
Time vs. Space Very often, we can trade space for time. For example: maintain a collection of students keyed by SSN. • Use an array of a billion elements, one per possible SSN, and have immediate access (better time) • Use an array sized to the number of students and have to search (better space)
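The trade-off on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ID_SPACE` is a small hypothetical stand-in for the billion possible SSNs, and the class names `DirectTable` and `CompactList` are invented for this example.

```python
ID_SPACE = 1000  # hypothetical stand-in for the billion possible SSNs


class DirectTable:
    """One slot per possible ID: more space, immediate access."""

    def __init__(self):
        self.slots = [None] * ID_SPACE

    def add(self, student_id, name):
        self.slots[student_id] = name

    def find(self, student_id):
        return self.slots[student_id]  # a single index operation


class CompactList:
    """One entry per actual student: less space, but a linear search."""

    def __init__(self):
        self.entries = []

    def add(self, student_id, name):
        self.entries.append((student_id, name))

    def find(self, student_id):
        for sid, name in self.entries:
            if sid == student_id:
                return name
        return None


# Made-up sample records, loaded into both structures.
direct, compact = DirectTable(), CompactList()
for sid, name in [(17, "Ana"), (512, "Ben"), (903, "Chi")]:
    direct.add(sid, name)
    compact.add(sid, name)
```

Both structures answer the same lookups. The first reserves a slot for every possible ID whether or not it is used, while the second stores only the three records and scans them on each lookup.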
Introducing Big O Notation • Will allow us to evaluate algorithms. • Has precise mathematical definition • Used in a sense to put algorithms into families
Why Use Big-O Notation • Used when we only know the asymptotic upper bound. • What does asymptotic mean? • What does upper bound mean? • If you are not guaranteed certain input, then it is a valid upper bound that even the worst-case input will be below. • Why worst-case? • May often be determined by inspection of an algorithm.
Simplifying O( ) Answers We say the Big O complexity of $3n^2 + 2 = O(n^2)$ (drop constants!) because we can show that there is an $n_0$ and a $c$ such that: $0 \le 3n^2 + 2 \le c\,n^2$ for $n \ge n_0$, i.e. $c = 4$ and $n_0 = 2$ yields: $0 \le 3n^2 + 2 \le 4n^2$ for $n \ge 2$. What does this mean?
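A quick numerical check can make the choice of constants concrete. The sketch below tests the bound 0 ≤ 3n² + 2 ≤ 4n² over a finite range. It is a sanity check, not a proof: the algebraic reason is that 4n² − (3n² + 2) = n² − 2, which is nonnegative once n ≥ 2. The function names are chosen for this illustration.

```python
def f(n):
    """The running-time expression from the slide."""
    return 3 * n**2 + 2


def bounded(n, c=4):
    """True if 0 <= f(n) <= c * n^2 for this particular n."""
    return 0 <= f(n) <= c * n**2


# The bound fails below n0 = 2 and holds for every tested n from n0 onward.
fails_at_one = not bounded(1)
holds_from_n0 = all(bounded(n) for n in range(2, 10_000))
```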
Comparing Algorithms • Now that we know the formal definition of O( ) notation (and what it means)… • If we can determine the O( ) of algorithms… • This establishes the worst they perform. • Thus now we can compare them and see which has the “better” performance.
Comparing Factors [Figure: work done (vertical axis) versus size of input (horizontal axis), with curves for N^2, N, log N, and 1]
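The four growth rates named on this slide can be tabulated to show how quickly they diverge. A minimal sketch, assuming base-2 logarithms (the slide does not fix a base, and the base only changes a constant factor):

```python
import math


def growth_row(n):
    """Work done at input size n for each growth family."""
    return {"1": 1, "log N": math.log2(n), "N": n, "N^2": n * n}


for n in (16, 1024, 65536):
    row = growth_row(n)
    print(n, {name: round(value, 1) for name, value in row.items()})
```

At N = 1024 the logarithmic curve has reached only 10 while the quadratic curve is already above a million.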
Why the interest in Queries? • Queries are ways we interact with IR systems • Nonquery methods? • Types of queries?
Issues with Query Structures Matching Criteria • Given a query, what document is retrieved? • In what order?
Types of Query Structures Query Models (languages) – most common • Boolean Queries • Extended-Boolean Queries • Natural Language Queries • Vector queries • Others?
Simple query language: Boolean • Earliest query model • Terms + Connectors (or operators) • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT
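One common way to evaluate such queries is to map each term to the set of documents containing it and apply set operations for the connectors. The sketch below assumes that approach. The three documents are made up, terms are plain lowercase words with no stemming, and `build_index` and `postings` are names invented for this illustration.

```python
# Made-up sample collection: document id -> text.
docs = {
    1: "dinner and a symphony concert",
    2: "sports bar dinner specials",
    3: "symphony season tickets",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(docs)
all_docs = set(docs)


def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


dinner_and_symphony = postings("dinner") & postings("symphony")  # AND
dinner_or_sports = postings("dinner") | postings("sports")       # OR
not_dinner = all_docs - postings("dinner")                       # NOT
```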
Simple query language: Boolean • Geek-speak • Variations are still used in search engines!
Truth Tables – Boolean Logic Presence of P, P = 1 Absence of P, P = 0 True = 1 False = 0
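Using the slide's convention (presence = 1, absence = 0), the truth tables for the three connectors can be generated rather than memorized. A small sketch, where the function name `truth_table` is chosen for this example:

```python
from itertools import product


def truth_table():
    """Rows of (P, Q, P AND Q, P OR Q, NOT P) with 1 = true, 0 = false."""
    rows = []
    for p, q in product((0, 1), repeat=2):
        rows.append((p, q, p & q, p | q, 1 - p))
    return rows


print("P Q | P AND Q | P OR Q | NOT P")
for p, q, p_and_q, p_or_q, not_p in truth_table():
    print(f"{p} {q} |    {p_and_q}    |   {p_or_q}    |   {not_p}")
```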
Problems with Boolean Queries • Incorrect interpretation of Boolean connectives AND and OR • Example - Seeking Saturday entertainment Queries: • Dinner AND sports AND symphony • Dinner OR sports OR symphony • Dinner AND sports OR symphony
Order of precedence of operators Example of query. Is • A AND B • the same as • B AND A • Why?
Order of Precedence • Define order of precedence • EX: a OR b AND c • Infix notation • Parentheses are evaluated 1st, with operators of equal precedence applied left to right • Next NOT’s are applied • Then AND’s • Then OR’s • a OR b AND c becomes • a OR (b AND c)
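The ordering on this slide (parentheses, then NOT, then AND, then OR) can be sketched as a small recursive-descent evaluator with one function per precedence level. This is an illustrative sketch, not a production parser: it assumes upper-case operators, single-word terms, and well-formed input, and `evaluate` and `present` are names invented here.

```python
import re


def tokenize(query):
    """Split a query into parentheses and words."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def evaluate(query, present):
    """True if a document containing the terms in `present` matches."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence, applied last
        value = parse_and()
        while peek() == "OR":
            advance()
            right = parse_and()  # parse first so tokens are always consumed
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            right = parse_not()
            value = value and right
        return value

    def parse_not():  # unary prefix operator, binds tighter than AND
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():  # a parenthesized group or a single term
        tok = advance()
        if tok == "(":
            value = parse_or()
            advance()  # consume the closing ")"
            return value
        return tok in present

    return parse_or()
```

For a document that contains only the term a, `evaluate("a OR b AND c", {"a"})` is true because the query is read as a OR (b AND c), while `evaluate("(a OR b) AND c", {"a"})` is false.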
Infix Notation • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules - (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b)= NOT(a AND b) • NOT(NOT(a)) = a
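Because each variable has only two possible values, the three rules on this slide can be confirmed by checking every combination. A short sketch, with the function name chosen for this example:

```python
from itertools import product


def identities_hold():
    """Check the De Morgan and double-negation rules for all truth values."""
    for a, b in product((False, True), repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True
```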
Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing! • Need a way to group combinations. • Phrases: • “stray cat” AND “frayed collar” • +“stray cat” + “frayed collar”
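The slide does not spell out what the plus sign means. The sketch below assumes the convention used by many web search engines: a term prefixed with `+` must appear, and unmarked terms are optional and only add to the match count. Quoted phrases are not handled, and `parse_plus_query` and `score` are hypothetical names.

```python
def parse_plus_query(query):
    """Split a query into required (+term) and optional terms."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        else:
            optional.add(token)
    return required, optional


def score(doc_terms, required, optional):
    """None if a required term is missing, else the number of optional hits."""
    if not required <= doc_terms:
        return None  # document is not retrieved
    return len(optional & doc_terms)


required, optional = parse_plus_query("+cat dog +collar leash")
```

Under this reading, a document with cat, collar, and leash is retrieved with one optional hit, while a document with only cat and dog is not retrieved because collar is missing.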
Ordering (ranking) of Retrieved Documents • Pure Boolean has no ordering • Term is there or it’s not • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term?
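The "total number of hits" ordering, and the question it raises, can be seen in a small example. The sketch below counts term occurrences per document. The two documents are made up, and `rank_by_hits` is a name invented for this illustration.

```python
from collections import Counter


def rank_by_hits(docs, query_terms):
    """Return (doc_id, total_hits, distinct_terms_matched), most hits first."""
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        total = sum(counts[t] for t in query_terms)
        distinct = sum(1 for t in query_terms if counts[t] > 0)
        if total > 0:
            scored.append((doc_id, total, distinct))
    # Sort by total hits (descending), then by id for a stable order.
    scored.sort(key=lambda row: (-row[1], row[0]))
    return scored


docs = {
    "A": "pen pen pen pen",  # many hits on one term
    "B": "pen paper",        # one hit on each term
}
ranking = rank_by_hits(docs, ["pen", "paper"])
```

Document A ranks first with four hits even though it matches only one of the two query terms, which is exactly the concern raised on the slide.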
Boolean Query - Summary • Advantages • simple queries are easy to understand • relatively easy to implement • Disadvantages • difficult to specify what is wanted • too much returned, or too little • ordering not well determined • Dominant language in commercial systems until the WWW
Vector Space Model • Documents and queries are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents
Document Vectors • Documents are represented as “bags of words” • Represented as vectors when used computationally • A vector is like an array of floating point values • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse
Queries Vocabulary (dog, house, white) Queries: • dog (1,0,0) • house (0,1,0) • white (0,0,1) • house and dog (1,1,0) • dog and house (1,1,0) • Show 3-D space plot
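The mapping from query text to the binary vectors listed above can be sketched as follows. Words outside the vocabulary (such as "and") are simply ignored, which is an assumption of this sketch, and `to_vector` is a name chosen for this example.

```python
VOCABULARY = ("dog", "house", "white")


def to_vector(text):
    """Binary vector with a 1 for each vocabulary term present in the text."""
    words = set(text.lower().split())
    return tuple(1 if term in words else 0 for term in VOCABULARY)
```

Because only presence is recorded, `to_vector("house and dog")` and `to_vector("dog and house")` give the same vector: word order is lost.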
Documents (queries) in Vector Space [Figure: documents D1 through D11 plotted as points in a three-dimensional space with term axes t1, t2, and t3]
Vector Query Problems • Significance of queries • Can different values be placed on the different terms, e.g. 2 dog, 1 house • Scaling – size of vectors • Number of words in the dictionary? • 100,000
Proximity Searches • Proximity: terms occur within K positions of one another • pen w/5 paper • A “Near” function can be more vague • near(pen, paper) • Sometimes order can be specified • Also, Phrases and Collocations • “United Nations” “Bill Clinton” • Phrase Variants • “retrieval of information” “information retrieval”
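A proximity test such as "pen w/5 paper" can be sketched over a tokenized document. The sketch assumes that "within K positions" means the word positions differ by at most K. Real systems differ on the exact counting rule, and the function names are invented for this example.

```python
def positions(tokens, term):
    """Indexes at which the term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(tokens, first, second, k, ordered=False):
    """True if the two terms occur within k positions of one another.

    With ordered=True, `first` must come before `second`.
    """
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            distance = j - i
            if ordered and distance <= 0:
                continue
            if 0 < abs(distance) <= k:
                return True
    return False


# Made-up sample text: "pen" is at position 2 and "paper" at position 7.
tokens = "put the pen down next to the paper".split()
```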
Representation of documents and queries Why do this? • Want to compare documents • Want to compare documents with queries • Want to retrieve and rank documents with regards to a specific query A document representation permits this in a consistent way (type of conceptualization)
Measures of similarity • Retrieve the most similar documents to a query • Equate similarity to relevance • Most similar are the most relevant • This measure is one of “lexical similarity” • The matching of text or words
Document space • Documents are organized in some manner - exist as points in a document space • Documents treated as text, etc. • Match query with document • Query similar to document space • Query not similar to document space and becomes a characteristic function on the document space • Documents most similar are the ones we retrieve • Reduce this to a computable measure of similarity
Representation of Documents • Consider now only text documents • Words are tokens (primitives) • Why not letters? • Stop words? • How do we represent words? • Even for video, audio, etc documents, we often use words as part of the representation
Documents as Vectors • Documents are represented as “bags of words” • Example? • Represented as vectors when used computationally • A vector is like an array of floating point values • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse
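As one possible answer to the "Example?" prompt, the sketch below turns a short text into a bag of words and then into a dense vector over a small vocabulary. The six-term vocabulary and the sample sentence are made up, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def bag_of_words(text):
    """Term -> count; word order is discarded."""
    return Counter(text.lower().split())


def to_dense(bag, vocabulary):
    """One floating-point slot per vocabulary term, zero where absent."""
    return [float(bag[term]) for term in vocabulary]


vocabulary = ["cat", "collar", "dog", "house", "leash", "white"]
bag = bag_of_words("white dog white house")
dense = to_dense(bag, vocabulary)
```

The bag stores only the three terms that occur, while the dense vector needs a slot for every vocabulary term. With a realistic vocabulary almost all of those slots are zero, which is why such vectors are described as sparse.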
The Vector-Space Model • Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. Dimension = t = |vocabulary| • Each term i in a document or query j is given a real-valued weight, wij. • Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
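With documents and queries both expressed as t-dimensional weight vectors, ranking reduces to comparing vectors. The slides do not name a specific measure, so the sketch below assumes cosine similarity, a common choice. The weights are made up, and t = 3 is used only to keep the example readable.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Documents sorted from most to least similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights w_ij for three documents over three index terms.
documents = {
    "d1": (2.0, 0.0, 1.0),
    "d2": (0.0, 3.0, 0.0),
    "d3": (1.0, 1.0, 0.0),
}
query = (1.0, 0.0, 0.0)
ranking = rank(query, documents)
```

Cosine similarity depends only on direction, so a long document is not favored merely for repeating terms. Here d1 ranks first, d3 second, and d2 last because it shares no term with the query.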