WIRED Week 2 • Syllabus Update (at least another week) • Readings Overview • Many ways to explain the same things • Always think of the user • Skip the math (mostly) • Readings Review • Most complicated of the entire semester • Refer back to it often • Readings Review - More Models • Non-Linear • Projects and/or Papers Overview
Opening Credits • Material for these slides obtained from: • Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ • Berthier Ribeiro-Neto • Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983. • Ray Mooney CSE 8335 Spring 2003 • Joydeep Ghosh
Why IR? • IR was originally mostly for systems, not people • IR in the last 25 years: classification and categorization, systems and languages, user interfaces and visualization • A small world of concern • The Web changed everything: • Huge amount of accessible information • Varied information sources • Relatively easy to look for information • Improving IR means improving learning • Digital technology changes everything (again)
WIRED Focus • Information Retrieval: representation, storage, organization of, and access to information items • Focus is on the user information need • User information need: • Find all docs containing information on Austin which: • Are hosted by utexas.edu • Discuss restaurants • Emphasis is on the retrieval of information (not data, not just a keyword match)
So just what is Information then? • Oh no, not more about information.... • “The difference that makes a difference” Gregory Bateson • Element in the communications process • Information Theory • Data • Something that informs a user • Not just Data • “Orange” • “1,741,405.339” • Helps users learn • Helps users make decisions • Facts (in context)
Differences • Data retrieval: • keyword match for documents • well-defined semantics • a single erroneous object implies failure • Information retrieval: • information about a subject or topic • loose semantics (language issues) • small errors are tolerated • IR system: • interprets the contents of information items (documents) • generates a ranking which reflects relevance for the query • the notion of relevance is most important
Let’s Talk about Finding • How many ways are there to find something? • Find a specific thing? • Find a concept? • Learn about a topic?
Ways of Finding are IR models • Each way of finding is one type of Information Retrieval • Different ways to search for person, place or thing • Real life information retrieval combines several of the methods • All at once • In succession
User Interaction with (IR) System • Users can do 2 distinct tasks: Retrieval (“Database”) and Browsing • Which one is more important? • Can you do both at once? • Which is more difficult? • Leverage the strengths of each
Quick Overview of the IR Process • [Diagram: a user’s information need is expressed as a query; documents are processed into an index; the system matches the query against the index and produces a ranking of documents]
How do you get an index? • A document is a “bag of words” with logical parts that are processed in stages • The Web has (some) structure in markup languages • [Diagram, left to right, showing the document get condensed: Docs → structure recognition → accents, spacing, stopwords → noun groups → stemming → manual indexing → index; intermediate representations are document structure, full text, and index terms] • Structure + the right logical parts (+ things added) = Index
Our friend the Index • IR systems usually rely on an index or indices to process queries • Index term: • a keyword or group of selected words • “Golf” or “Golf Swing” • any word (not always actually in the document!) • “Waste of time” • Stemming might be used: • connect: connecting, connection, connections • An inverted file is built for the chosen index terms • A list of each keyword (or keyword group) and its location in the document(s)
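The inverted file described above can be sketched in a few lines. This is a minimal illustration, not a production indexer; the documents and terms are hypothetical, and real systems would add stemming, stopword removal, and positional information:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "golf swing tips",
    2: "golf course reviews",
    3: "swing dance lessons",
}
index = build_inverted_index(docs)
# index["golf"] -> {1, 2}; index["swing"] -> {1, 3}
```

A query for “golf” is then answered by a single dictionary lookup instead of a scan over every document.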
Indexes, huh? • Matching at the index term level is not always the best way to find • Users get confused and frustrated (without an understanding of the IR model): • search features are not easy • the number of search terms is small • results are not presented well • Web searching is actually not so easy either: • junk pages • ranking games • duplicate information • bad page or site Information Architecture
More about indexes • Indexes make sense of the order (to the system) • Relevance is measured for the user’s query among the indices • Ranking of results is how the index is exposed to the user.
Ranking • A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query • A ranking is based on fundamental premises regarding the notion of relevance, such as: • common sets of index terms • sharing of weighted terms • likelihood of relevance • Each set of premises leads to a distinct IR model
Relevance • Relevance means a document is the correct one for the user’s situation • Relevance feedback is repeating the search with changes to the search terms: • By the user • By the system • Called Query Reformulation • Relevance is judged relative to the user • A huge area for improvement in search interfaces
IR Models • Classic IR Models • Boolean Model • Vector Model • Probabilistic Model • Set Theoretic Models • Fuzzy Set Model • Extended Boolean Model • Generalized Vector Model • Latent Semantic Indexing • Neural Network Model
Types of IR Models • User task: Retrieval (ad hoc, filtering) or Browsing • Classic Models: Boolean, Vector, Probabilistic • Set Theoretic: Fuzzy, Extended Boolean • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks • Probabilistic: Inference Network, Belief Network • Structured Models: Non-Overlapping Lists, Proximal Nodes • Browsing: Flat, Structure Guided, Hypertext
Classic IR Models - Basics • Each document is represented by a set of representative keywords or index terms • An index term is a word from the document that describes the document • Classically, index terms are nouns because nouns have meaning by themselves • Most human indexers use nouns & verbs • Now, search engines assume that all words are index terms (full text representation) • A great index has words added to it for additional meaning and context
Classic IR Models – More Basics • Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents • The importance of the index terms is represented by weights associated with them • ki - an index term • dj - a document • F - the framework for document representations • R - a ranking function in relation to query & document • wij - a weight associated with the pair (ki,dj) • The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models – Basic Variables • t is the total number of index terms • K = {k1, k2, …, kt} is the set of all index terms • wij >= 0 is a weight associated with the pair (ki,dj) • wij = 0 indicates that term ki does not belong to doc dj • dj = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj • gi(dj) = wij is a function which returns the weight associated with the pair (ki,dj)
Wait a minute! • Why are these variables so poorly labeled and organized? • Tradition • Security through Obscurity? • Difference from other mathematical symbolism? • Sadly, not consistent even in the IR literature
Boolean Model • Simple model based on set theory • Queries specified as Boolean expressions: • precise semantics • neat formalism • q = ka ∧ (kb ∨ ¬kc) • Terms are either present or absent, thus wij ∈ {0,1} • Consider q = ka ∧ (kb ∨ ¬kc) • In disjunctive normal form, qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) • qcc = (1,1,0) is a conjunctive component
Boolean Model • q = ka ∧ (kb ∨ ¬kc) • sim(q,dj) = 1 if there exists a conjunctive component qcc of qdnf such that, for all ki, gi(dj) = gi(qcc); 0 otherwise • [Venn diagram over the sets Ka, Kb, Kc: the query covers the regions labeled (1,1,1), (1,1,0), and (1,0,0)]
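The Boolean matching rule above can be checked directly. A minimal sketch, with hypothetical documents represented as term sets, evaluating the example query q = ka ∧ (kb ∨ ¬kc):

```python
def boolean_match(doc_terms, a, b, c):
    """Evaluate q = a AND (b OR NOT c) against one document's term set."""
    return (a in doc_terms) and ((b in doc_terms) or (c not in doc_terms))

docs = {
    "d1": {"ka", "kb", "kc"},   # weight pattern (1,1,1) -> matches
    "d2": {"ka", "kb"},         # (1,1,0) -> matches
    "d3": {"ka"},               # (1,0,0) -> matches
    "d4": {"ka", "kc"},         # (1,0,1) -> no match
    "d5": {"kb", "kc"},         # (0,1,1) -> no match
}
hits = [d for d, terms in docs.items() if boolean_match(terms, "ka", "kb", "kc")]
# hits == ["d1", "d2", "d3"] -- exactly the three conjunctive components of qdnf
```

Note that every hit is equally “relevant” here: the model returns a set, not a ranking.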
Boolean Model Problems • Retrieval is based on a binary decision: no partial matching • No ranking of the documents is provided (absence of a grading scale) • Users aren’t good at Boolean queries: • too simplistic • often wrong • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Vector Model • Use of binary weights is too limiting • Non-binary weights provide consideration for partial matches • These term weights are used to compute a degree of similarity between a query and each document • A ranked set of documents provides for better matching
Vector Model • wij > 0 whenever ki appears in dj • wiq >= 0 is the weight associated with the pair (ki,q) • dj = (w1j, w2j, ..., wtj) • q = (w1q, w2q, ..., wtq) • To each term ki is associated a unit vector • The unit vectors are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) • The t unit vectors form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors
Vector Model • sim(q,dj) = cos(θ) = (dj · q) / (|dj| × |q|) = Σi (wij × wiq) / (|dj| × |q|) • Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1 • A document is retrieved even if it matches the query terms only partially • [Diagram: the angle θ between the vectors dj and q]
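The cosine formula is short enough to compute by hand. A minimal sketch with made-up three-term weight vectors (any real system would build these vectors from tf-idf weights):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # empty vector: define similarity as zero
    return dot / (norm_d * norm_q)

d1 = [1.0, 0.5, 0.0]   # document weights for terms k1, k2, k3
q  = [1.0, 0.0, 0.0]   # query mentions only k1
# cosine_sim(d1, q) is about 0.894: a partial match still scores > 0
```

This is exactly what the Boolean model cannot do: d1 matches only one of the query’s terms yet still receives a graded score.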
Vector Model • sim(q,dj) = Σi (wij × wiq) / (|dj| × |q|) • How to compute the weights wij and wiq? • A good weight must take into account two effects: • quantification of intra-document contents (similarity): the tf factor, the term frequency within a document • quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency • wij = tf(i,j) × idf(i)
Vector Model • Let N be the total number of docs in the collection • Let ni be the number of docs which contain ki • freq(i,j) is the raw frequency of ki within dj • A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms which occur within the document dj • The idf factor is computed as idf(i) = log(N/ni) • The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
Vector Model • The best term-weighting schemes take both into account: • wij = f(i,j) × log(N/ni) • This strategy is called a tf-idf weighting scheme • Term frequency – inverse document frequency • How rare is the word in this document in comparison with other documents?
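The tf-idf weight wij = f(i,j) × log(N/ni) combines the two definitions above. A minimal sketch with hypothetical frequencies and document counts, using base-10 logs:

```python
import math

def tfidf_weights(doc_freqs, doc_counts, N):
    """tf-idf weights for one document.
    doc_freqs: raw term frequencies freq(i,j) within the document.
    doc_counts: number of docs ni containing each term. N: collection size."""
    max_f = max(doc_freqs.values())  # normalizing term for f(i,j)
    return {
        t: (f / max_f) * math.log10(N / doc_counts[t])
        for t, f in doc_freqs.items()
    }

freqs  = {"golf": 4, "swing": 2, "the": 8}          # hypothetical document
counts = {"golf": 100, "swing": 50, "the": 1000}    # hypothetical collection
w = tfidf_weights(freqs, counts, N=1000)
# "the" appears in every document, so idf = log10(1000/1000) = 0
# and its weight vanishes even though it is the most frequent term
```

This shows the key property of the scheme: frequency within the document raises a weight, while ubiquity across the collection drives it back toward zero.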
Vector Model • For the query term weights, a suggestion is • wiq = (0.5 + 0.5 × freq(i,q) / max_l freq(l,q)) × log(N/ni) • The vector model with tf-idf weights is a good ranking strategy for general collections • The vector model is usually as good as any known ranking alternative. It is also simple and fast to compute.
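The suggested query weighting can be sketched the same way. The 0.5 terms smooth the tf factor so that every query term keeps at least half of its idf weight; the frequencies and counts below are hypothetical:

```python
import math

def query_weights(q_freqs, doc_counts, N):
    """Smoothed tf times idf for the query terms:
    wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log10(N / ni)."""
    max_f = max(q_freqs.values())
    return {
        t: (0.5 + 0.5 * f / max_f) * math.log10(N / doc_counts[t])
        for t, f in q_freqs.items()
    }

q = query_weights({"golf": 1, "swing": 1}, {"golf": 100, "swing": 50}, N=1000)
# with all query terms at frequency 1, the tf factor is 1.0
# and each weight reduces to the term's plain idf
```

Queries are short, so without this smoothing most query terms would share a single raw frequency of 1 anyway; the formula matters when a term is repeated.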
Weights wij and wiq ? • One approach is to examine the frequency of the occurrence of a word in a document: • Absolute frequency: • tf factor, the term frequency within a document • freq(i,j) - raw frequency of ki within dj • Both high-frequency and low-frequency terms may not actually be significant • Relative frequency: tf divided by the number of words in the document • Normalized frequency: f(i,j) = freq(i,j) / max_l freq(l,j)
Inverse Document Frequency • Importance of term may depend more on how it can distinguish between documents. • Quantification of inter-documents separation • Dissimilarity not similarity • idf factor, the inverse document frequency
IDF • N = the total number of docs in the collection • ni = the number of docs which contain ki • The idf factor is computed as • idf(i) = log(N/ni) • the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki. • For example (base-10 logs): • N=1000, n1=100, n2=500, n3=800 • idf1 = 3 - 2 = 1 • idf2 = 3 - 2.7 = 0.3 • idf3 = 3 - 2.9 = 0.1
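The example numbers above can be reproduced directly with base-10 logs:

```python
import math

N = 1000
for name, n in [("n1", 100), ("n2", 500), ("n3", 800)]:
    idf = math.log10(N / n)
    print(name, round(idf, 1))
# prints: n1 1.0, n2 0.3, n3 0.1
```

The pattern to notice: a term appearing in 80% of the collection (n3) carries almost no discriminating power, while a term in only 10% of it (n1) carries ten times the weight.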
Vector Model Considered Most Common • Advantages: • term weighting improves quality of the answer set • partial matching allows retrieval of docs that approximate the query conditions • the cosine ranking formula sorts documents according to degree of similarity to the query • computationally efficient • Disadvantages: • assumes independence of index terms - not clear that this is bad, though • not best if applied exclusively
Probabilistic Model • Uses a probabilistic framework to get a result • Given a user query, there is an ideal answer set • Querying as specification of the properties of this ideal answer set (clustering) • But, what are these properties? • Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set) • Improve the probability estimates by iteration & additional data
Probabilistic Model Iterated • An initial set of documents is retrieved • The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected) • The IR system uses this information to refine the description of the ideal answer set • By repeating this process, it is expected that the description of the ideal answer set will improve • Keep in mind the need to guess, at the very beginning, the description of the ideal answer set • The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). • The model assumes that this probability of relevance depends on the query and the document representations only. • The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant. • How to compute the probabilities? What is the sample space?
Probabilistic Ranking • Probabilistic ranking is computed as: • sim(q,dj) = P(R | dj) / P(¬R | dj) • This is the odds of the document dj being relevant • Taking the odds minimizes the probability of an erroneous judgement • Definitions: • wij ∈ {0,1} • P(R | dj): probability that the given doc is relevant • P(¬R | dj): probability that the doc is not relevant
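Since P(¬R | dj) = 1 − P(R | dj), the odds are a monotone function of the relevance probability, so ranking by odds and ranking by probability give the same order. A minimal sketch with hypothetical probabilities (a real model would estimate these from term statistics and feedback):

```python
def odds(p_rel):
    """Odds of relevance: P(R|dj) / P(not-R|dj)."""
    return p_rel / (1.0 - p_rel)

# Hypothetical relevance probabilities for three documents
p = {"d1": 0.8, "d2": 0.5, "d3": 0.1}
ranking = sorted(p, key=lambda d: odds(p[d]), reverse=True)
# ranking == ["d1", "d2", "d3"]: the same order as sorting by p itself
```

The practical point of using odds is not the ordering but the algebra: the ratio cancels terms when Bayes’ rule is applied, which simplifies the estimation formulas.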
Prob Ranking Pros & Cons Advantages: Docs ranked in decreasing order of probability of relevance Disadvantages: need to guess initial estimates for P(ki | R) method does not take into account tf and idf factors
Models Compared • The Boolean model does not provide for partial matches and is considered to be the weakest classic model • Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections • This also seems to be the view of the research community
Break! • Fuzzy Models • Extended Models • Generalized Models • Advanced Models
Set Theoretic Models • The Boolean model imposes a binary criterion for deciding relevance • How about something a little more complex? • Extend the Boolean model to use partial matching and a ranking is a goal • Two set theoretic models • Fuzzy Set Model • Extended Boolean Model
Fuzzy Set Model • Queries and docs are represented by sets of index terms: matching is approximate from the start • This vagueness can be modeled using a fuzzy framework, as follows: • with each term is associated a fuzzy set • each doc has a degree of membership in this fuzzy set • This interpretation provides the foundation for many models for IR based on fuzzy theory • Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Fuzzy Set Theory • A framework for representing classes whose boundaries are not well defined • The key idea is to introduce the notion of a degree of membership associated with the elements of a set • This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership • Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic
Fuzzy Information Retrieval • Fuzzy sets are modeled based on a thesaurus • This thesaurus is built as follows: • Let c be a term-term correlation matrix • Let c(i,l) be a normalized correlation factor for (ki,kl): • c(i,l) = n(i,l) / (ni + nl - n(i,l)) • ni: number of docs which contain ki • nl: number of docs which contain kl • n(i,l): number of docs which contain both ki and kl • This allows for proximity among index terms.
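The normalized correlation c(i,l) = n(i,l) / (ni + nl − n(i,l)) is just the overlap of the two terms’ posting sets divided by the size of their union. A minimal sketch with a hypothetical posting list:

```python
def correlation(postings, ki, kl):
    """Normalized term-term correlation c(i,l) = n_il / (n_i + n_l - n_il)."""
    di, dl = postings[ki], postings[kl]
    n_il = len(di & dl)                      # docs containing both terms
    return n_il / (len(di) + len(dl) - n_il)  # overlap / union

postings = {  # term -> set of doc ids containing it (hypothetical)
    "golf":  {1, 2, 3},
    "swing": {2, 3, 4},
    "dance": {4, 5},
}
c = correlation(postings, "golf", "swing")
# 2 shared docs out of 4 distinct -> c = 2 / (3 + 3 - 2) = 0.5
```

Terms that never co-occur get c = 0, and a term correlated with itself gets c = 1, so the matrix behaves like a graded thesaurus built purely from co-occurrence.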
Fuzzy Model Issues • The correlation factor c(i,l) can be used to define the fuzzy membership of a document dj in the set for term ki: μ(i,j) = 1 - Π over kl in dj of (1 - c(i,l)) • Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory • Experiments with standard test collections are not available • Difficult to compare at this time
Extended Boolean Model • Boolean model is simple and elegant. • But, no provision for a ranking • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership • Extend the Boolean model with the notions of partial matching and term weighting • Combine characteristics of the Vector model with properties of Boolean algebra