Review for midterm

What is information retrieval • Gathering information from a source(s) based on a need • Major assumption - that information exists. • Broad definition of information • Sources of information • Other people • Archived information (libraries, maps, etc.) • Web • Radio, TV, etc.

Information retrieved • Impermanent information • Conversation • Documents • Text • Video • Files • Etc.

The information acquisition process • Know what you want and go get it • Ask questions to information sources as needed (queries) - SEARCH • Have information sent to you on a regular basis based on some predetermined information need • Push/pull models

What IR assumes • Information is stored (or available) • A user has an information need • An automated system exists from which information can be retrieved • Why an automated system? • The system works!!

What IR is usually not about • Usually just unstructured data • Retrieval from databases is usually not considered • Database querying assumes that the data is in a standardized format • Transforming all information, news articles, web sites into a database format is difficult for large data collections

What an IR system should do • Store/archive information • Provide access to that information • Answer queries with relevant information • Stay current • WISH list • Understand the user’s queries • Understand the user’s need • Act as an assistant

How good is the IR system Measures of performance based on what the system returns: • Relevance • Coverage • Recency • Functionality (e.g. query syntax) • Speed • Availability • Usability • Time/ability to satisfy user requests

How do IR systems work Algorithms implemented in software • Gathering methods • Storage methods • Indexing • Retrieval • Interaction
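As a rough illustration of the indexing and retrieval steps, here is a minimal sketch of an inverted index in Python. The inverted index itself, the function name `build_index`, and the three toy documents are assumptions made for this example; the slides do not prescribe a particular index structure.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical three-document collection.
docs = {
    1: "white dog in a white house",
    2: "dog house",
    3: "white cat",
}
index = build_index(docs)
print(sorted(index["dog"]))    # [1, 2]
print(sorted(index["white"]))  # [1, 3]
```

Retrieval for a single term is then a dictionary lookup rather than a scan over every document.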

[Figure: A Typical Web Search Engine. Components shown: Users, Interface, Query Engine, Index, Indexer, Crawler, Web]

Crawlers • Web crawlers (spiders) gather information (files, URLs, etc) from the web. • Primitive IR systems
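The gathering step of a crawler can be sketched as a breadth-first traversal of links. The example below runs over a small in-memory link graph instead of the real web, so it makes no network requests; the `links` dictionary and the `crawl` function are hypothetical names introduced only for illustration.

```python
from collections import deque

def crawl(seed, links):
    """Visit pages breadth-first from seed, following each link at most once."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl("a", links))  # ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.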

Information Seeking Behavior • Two parts of the process: • search and retrieval • analysis and synthesis of search results

What is knowledge? • Data - Facts, observations, or perceptions. • Information - Subset of data, only including those data that possess context, relevance, and purpose. • Knowledge - A more simplistic view considers knowledge as being at the highest level in a hierarchy, with data at the lowest level and information at the middle level. • Data refers to bare facts void of context. • A telephone number. • Information is data in context. • A phone book. • Knowledge is information that facilitates action. • Recognizing that a phone number belongs to a good client, who needs to be called once per week to get his orders.

From Facts to Wisdom (Haeckel & Nolan, 1993): one example of the hierarchy

Size of information resources • Why important? • Scaling • Time • Space • Which is more important?

Trying to fill a terabyte in a year Moore’s Law and its impact!

Measuring the Growth of Work While it is possible to measure the work done by an algorithm for a given set of input, we need a way to: • Measure the rate of growth of an algorithm based upon the size of the input • Compare algorithms to determine which is better for the situation

Time vs. Space Very often, we can trade space for time. For example: maintain a collection of student records keyed by SSN. • Use an array of a billion elements, indexed by SSN, and have immediate access (better time) • Use an array sized to the number of students and have to search (better space)
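The two options can be sketched side by side. To keep the example light, it assumes a hypothetical id space of 1,000 values rather than the billion possible nine-digit SSNs; the records and function names are also made up for illustration.

```python
# Assumption: ids range over 0..999 instead of a full nine-digit SSN space.
ID_SPACE = 1000
students = [(17, "Ada"), (402, "Grace"), (998, "Alan")]

# Option 1: one slot per possible id. Wastes space, but lookup is one index step.
table = [None] * ID_SPACE
for sid, name in students:
    table[sid] = name

def lookup_direct(sid):
    """Constant-time lookup in the direct-address table."""
    return table[sid]

def lookup_scan(sid):
    """Linear-time lookup in the compact list of records."""
    for s, name in students:
        if s == sid:
            return name
    return None

print(lookup_direct(402), lookup_scan(402))  # Grace Grace
print(len(table), len(students))             # 1000 3
```

Both lookups return the same answer; they differ only in how much memory is reserved and how many steps a lookup takes.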

Introducing Big O Notation • Will allow us to evaluate algorithms. • Has precise mathematical definition • Used in a sense to put algorithms into families

Why Use Big-O Notation • Used when we only know the asymptotic upper bound. • What does asymptotic mean? • What does upper bound mean? • If no particular input is guaranteed, Big-O gives a valid upper bound: even the worst-case input stays below it. • Why worst-case? • May often be determined by inspection of an algorithm.

Simplifying O( ) Answers We say the Big O complexity of $3n^2 + 2$ is $O(n^2)$ (drop constants!) because we can show that there is an $n_0$ and a $c$ such that: $0 \le 3n^2 + 2 \le c\,n^2$ for $n \ge n_0$, i.e. $c = 4$ and $n_0 = 2$ yields: $0 \le 3n^2 + 2 \le 4n^2$ for $n \ge 2$. What does this mean?
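The choice of c = 4 and n0 = 2 can be spot-checked numerically. The sketch below only tests a finite range, so it is a sanity check and not a proof; the algebraic argument is that 3n^2 + 2 <= 4n^2 exactly when n^2 >= 2.

```python
def f(n):
    """The function being bounded: 3n^2 + 2."""
    return 3 * n * n + 2

c, n0 = 4, 2

# The bound holds for every tested n at or above n0 ...
holds_from_n0 = all(0 <= f(n) <= c * n * n for n in range(n0, 10_000))
# ... but not below it: at n = 1, f(1) = 5 while c * 1^2 = 4.
holds_at_1 = f(1) <= c * 1 * 1

print(holds_from_n0, holds_at_1)  # True False
```

This is what n0 is for: the bound only has to hold once the input is large enough.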

Comparing Algorithms • Now that we know the formal definition of O( ) notation (and what it means)… • If we can determine the O( ) of algorithms… • This establishes the worst they perform. • Thus now we can compare them and see which has the “better” performance.

[Figure: Comparing Factors. Work done versus size of input for growth rates 1, log N, N, and N^2]
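To get a feel for how differently the growth rates 1, log N, N, and N^2 scale, the following sketch tabulates them for a few input sizes. The base-2 logarithm and the chosen sizes are assumptions made for the example.

```python
import math

def growth(n):
    """Return the work predicted by each growth rate for input size n."""
    return {"1": 1, "log N": math.log2(n), "N": n, "N^2": n * n}

for n in (2, 1024, 1_000_000):
    row = growth(n)
    print(n, row["1"], round(row["log N"], 1), row["N"], row["N^2"])
```

At N = 1024 the logarithmic rate predicts 10 units of work while the quadratic rate predicts more than a million.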

Why the interest in Queries? • Queries are ways we interact with IR systems • Nonquery methods? • Types of queries?

Issues with Query Structures Matching Criteria • Given a query, what document is retrieved? • In what order?

Types of Query Structures Query Models (languages) – most common • Boolean Queries • Extended-Boolean Queries • Natural Language Queries • Vector queries • Others?

Simple query language: Boolean • Earliest query model • Terms + Connectors (or operators) • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT

Simple query language: Boolean • Geek-speak • Variations are still used in search engines!

Truth Tables – Boolean Logic Presence of P, P = 1 Absence of P, P = 0 True = 1 False = 0
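Using the 1/0 convention above, the truth tables for AND, OR, and NOT can be generated mechanically. This is a small sketch; the column layout (P, Q, P AND Q, P OR Q, NOT P) is a choice made for this example.

```python
from itertools import product

def truth_table():
    """Rows of (P, Q, P AND Q, P OR Q, NOT P) using 1 for true and 0 for false."""
    rows = []
    for p, q in product((0, 1), repeat=2):
        rows.append((p, q, p & q, p | q, 1 - p))
    return rows

print("P Q AND OR NOT-P")
for row in truth_table():
    print(*row)
```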

Problems with Boolean Queries • Incorrect interpretation of Boolean connectives AND and OR • Example - Seeking Saturday entertainment Queries: • Dinner AND sports AND symphony • Dinner OR sports OR symphony • Dinner AND sports OR symphony
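The three queries behave very differently, which can be seen by treating AND as set intersection and OR as set union. The venues A to E and their tags are invented for this sketch. Note that Python's `&` binds tighter than `|`, so the unparenthesized third query groups as (dinner AND sports) OR symphony.

```python
# Hypothetical Saturday options and what each one offers.
events = {
    "A": {"dinner", "sports", "symphony"},
    "B": {"dinner"},
    "C": {"sports"},
    "D": {"symphony"},
    "E": {"dinner", "sports"},
}

def having(term):
    """Set of events tagged with the given term."""
    return {name for name, tags in events.items() if term in tags}

and_all = having("dinner") & having("sports") & having("symphony")
or_all = having("dinner") | having("sports") | having("symphony")
mixed = having("dinner") & having("sports") | having("symphony")
regrouped = having("dinner") & (having("sports") | having("symphony"))

print(sorted(and_all))    # ['A']
print(sorted(or_all))     # ['A', 'B', 'C', 'D', 'E']
print(sorted(mixed))      # ['A', 'D', 'E']
print(sorted(regrouped))  # ['A', 'E']
```

A user who says "and" in everyday speech often wants the union, which is one way the connectives get misread.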

Order of precedence of operators Example of query. Is • A AND B • the same as • B AND A • Why?

Order of Precedence • Define order of precedence • EX: a OR b AND c • Infix notation • Parenthesized expressions are evaluated first; operators of equal precedence are evaluated left to right • Next NOT’s are applied • Then AND’s • Then OR’s • a OR b AND c becomes • a OR (b AND c)
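Python's own `not`, `and`, and `or` happen to follow the same NOT, then AND, then OR ordering, so the grouping of a OR b AND c can be checked by brute force over all truth assignments. The function names below are made up for this sketch.

```python
from itertools import product

def default_grouping(a, b, c):
    return a or b and c        # no parentheses: precedence decides

def and_first(a, b, c):
    return a or (b and c)

def or_first(a, b, c):
    return (a or b) and c

cases = list(product((False, True), repeat=3))
same_as_and_first = all(default_grouping(*v) == and_first(*v) for v in cases)
differs = [v for v in cases if default_grouping(*v) != or_first(*v)]

print(same_as_and_first)  # True
print(differs)            # [(True, False, False), (True, True, False)]
```

The two assignments in `differs` are exactly the ones where a is true and c is false, which is where the alternative grouping gives a different answer.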

Infix Notation • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules - (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b)= NOT(a AND b) • NOT(NOT(a)) = a
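The De Morgan rules and the double-negation rule can be verified exhaustively, since two Boolean variables have only four combinations. A minimal check, with a hypothetical function name:

```python
from itertools import product

def de_morgan_holds():
    """Check both De Morgan rules and double negation on every truth assignment."""
    for a, b in product((False, True), repeat=2):
        if (not a and not b) != (not (a or b)):
            return False
        if (not a or not b) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True

print(de_morgan_holds())  # True
```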

Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing! • Need a way to group combinations. • Phrases: • “stray cat” AND “frayed collar” • +“stray cat” +“frayed collar”
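One common reading of the `+` notation is that `+` terms are required while unmarked terms are optional and only improve a document's rank. The sketch below implements that reading as an assumption; individual search engines may interpret the syntax differently, and phrase handling is left out. The documents and function names are invented.

```python
def parse(query):
    """Split a query into required (+term) and optional terms."""
    required, optional = set(), set()
    for token in query.split():
        if token.startswith("+"):
            required.add(token[1:])
        else:
            optional.add(token)
    return required, optional

def search(query, docs):
    """Return ids of docs containing all required terms, best optional match first."""
    required, optional = parse(query)
    results = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if required <= words:
            results.append((len(optional & words), doc_id))
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in results]

docs = {1: "cat collar", 2: "cat dog collar leash", 3: "dog leash", 4: "cat dog collar"}
print(search("+cat dog +collar leash", docs))  # [2, 4, 1]
```

Document 3 is excluded even though it matches two query words, because it lacks both required terms.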

Ordering (ranking) of Retrieved Documents • Pure Boolean has no ordering • Term is there or it’s not • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term?
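The closing question, one of each term versus many of one term, can be made concrete by scoring the same documents two ways. Both scoring functions and the two toy documents are assumptions for illustration, not methods stated in the slides.

```python
from collections import Counter

def total_hits(query_terms, text):
    """Total number of occurrences of any query term."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)

def distinct_hits(query_terms, text):
    """Number of different query terms that appear at least once."""
    words = set(text.lower().split())
    return sum(1 for t in query_terms if t in words)

docs = {"d1": "dog dog dog dog", "d2": "dog house"}
query = ["dog", "house"]

by_total = sorted(docs, key=lambda d: total_hits(query, docs[d]), reverse=True)
by_distinct = sorted(docs, key=lambda d: distinct_hits(query, docs[d]), reverse=True)
print(by_total)     # ['d1', 'd2']
print(by_distinct)  # ['d2', 'd1']
```

The two orderings disagree: counting total hits favors the document that repeats one term, while counting distinct terms favors the document that covers the whole query.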

Boolean Query - Summary • Advantages • simple queries are easy to understand • relatively easy to implement • Disadvantages • difficult to specify what is wanted • too much returned, or too little • ordering not well determined • Dominant language in commercial systems until the WWW

Vector Space Model • Documents and queries are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms in the simplest case (real-valued term weights in general) • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents

Document Vectors • Documents are represented as “bags of words” • Represented as vectors when used computationally • A vector is like an array of floating point values • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse

Queries Vocabulary (dog, house, white) Queries: • dog (1,0,0) • house (0,1,0) • white (0,0,1) • house and dog (1,1,0) • dog and house (1,1,0) • Show 3-D space plot
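The mapping from a query to a binary vector over the vocabulary (dog, house, white) can be sketched as follows. Words outside the vocabulary, such as "and", are simply ignored; the function name `to_vector` is an assumption.

```python
VOCAB = ("dog", "house", "white")

def to_vector(text):
    """Binary vector with a 1 for each vocabulary term present in the text."""
    words = set(text.lower().split())
    return tuple(1 if term in words else 0 for term in VOCAB)

print(to_vector("dog"))            # (1, 0, 0)
print(to_vector("house and dog"))  # (1, 1, 0)
print(to_vector("dog and house"))  # (1, 1, 0)
```

Word order is lost in this representation, which is why "house and dog" and "dog and house" produce the same vector.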

[Figure: Documents (queries) in Vector Space. Documents D1 to D11 plotted as points against term axes t1, t2, and t3]

Vector Query Problems • Significance of queries • Can different values be placed on the different terms, e.g. 2dog 1house • Scaling – size of vectors • Number of words in the dictionary? • 100,000

Proximity Searches • Proximity: terms occur within K positions of one another • pen w/5 paper • A “Near” function can be more vague • near(pen, paper) • Sometimes order can be specified • Also, Phrases and Collocations • “United Nations” “Bill Clinton” • Phrase Variants • “retrieval of information” “information retrieval”
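A proximity test such as pen w/5 paper needs word positions, not just word presence. The sketch below measures distance as the difference in token positions after whitespace splitting, which is an assumption; real systems may count positions differently. The `ordered` flag is a hypothetical parameter covering the case where order is specified.

```python
def within(text, a, b, k, ordered=False):
    """True if terms a and b occur within k positions of each other in text.

    With ordered=True, a must come before b.
    """
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if ordered:
        return any(0 < j - i <= k for i in pos_a for j in pos_b)
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

print(within("the pen lay on the paper", "pen", "paper", 5))                # True
print(within("pen one two three four five six paper", "pen", "paper", 5))   # False
print(within("paper and pen", "pen", "paper", 5, ordered=True))             # False
```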

Representation of documents and queries Why do this? • Want to compare documents • Want to compare documents with queries • Want to retrieve and rank documents with regards to a specific query A document representation permits this in a consistent way (type of conceptualization)

Measures of similarity • Retrieve the most similar documents to a query • Equate similarity to relevance • Most similar are the most relevant • This measure is one of “lexical similarity” • The matching of text or words

Document space • Documents are organized in some manner - exist as points in a document space • Documents treated as text, etc. • Match query with document • Query similar to document space • Query not similar to document space and becomes a characteristic function on the document space • Documents most similar are the ones we retrieve • Reduce this to a computable measure of similarity

Representation of Documents • Consider now only text documents • Words are tokens (primitives) • Why not letters? • Stop words? • How do we represent words? • Even for video, audio, etc documents, we often use words as part of the representation

Documents as Vectors • Documents are represented as “bags of words” • Example? • Represented as vectors when used computationally • A vector is like an array of floating point values • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse
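As one possible example of a bag of words, Python's `collections.Counter` stores only the terms that actually occur, which mirrors the sparse representation: terms that are absent take no space and read back as zero. The sentences are made up for the sketch.

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the other dog")
b = bag_of_words("the other dog chased the dog")

print(a == b)    # True: same words, different order
print(a["dog"])  # 2
print(a["cat"])  # 0: absent terms are not stored
print(len(a))    # 4 distinct terms stored
```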

The Vector-Space Model • Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. Dimension = t = |vocabulary| • Each term i in a document or query j is given a real-valued weight, wij. • Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
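With documents and queries as t-dimensional weight vectors, ranking needs a similarity measure. The slides do not name one here, so the sketch below assumes cosine similarity, a common choice that compares direction and ignores vector length. The weights and document names are invented, with t = 3.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

# Hypothetical weights w_ij over a three-term vocabulary.
docs = {"D1": (2.0, 0.0, 1.0), "D2": (0.0, 3.0, 0.0), "D3": (1.0, 1.0, 0.0)}
query = (1.0, 0.0, 0.0)
print(rank(query, docs))  # ['D1', 'D3', 'D2']
```

D2 shares no term with the query, so its similarity is zero and it ranks last regardless of how large its weights are.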
