Information Retrieval and Recommendation Techniques 國立中山大學資管系 黃三益 (Department of Information Management, National Sun Yat-sen University)
Abstraction • Reality (the real world) cannot be known in its entirety. • Reality is represented by a collection of data abstracted from observations of the real world. • Information need drives the storage and retrieval of information. • Relationships among reality, information need, data, and query (see Figure 1.1).
Information Systems • Two portions: endosystem and ectosystem. • Ectosystem has three human components: • User • Funder • Server: the information professional who operates the system and provides service to the user. • Endosystem has four components: • Media • Devices • Algorithms • Data structures
Measures • Performance is dictated by the endosystem but judged by the ectosystem. • The user is mainly concerned with effectiveness. • The server is more aware of efficiency. • The funder is more concerned with the economy of the system. • This course concentrates primarily on effectiveness measures. • The so-called user satisfaction has many meanings, and different users may use different criteria. • A fixed set of criteria must be established for fair comparison.
From Signal to Wisdom • Five stepping stones: • Signal: bit stream, wave, etc. • Data: impersonal, available to any user. • Information: a set of data matched to a particular information need. • Knowledge: coherence of data, concepts, and rules. • Wisdom: a balanced judgment in the light of certain value criteria.
What is a document? • A paper or a book? A section or a chapter? • There is no strict definition of the scope and format of a document. • The document concept can be extended to include programs, files, email messages, images, voice, and video. • However, most commercial IR systems handle multimedia documents through their textual representations. • The focus of this course is on text retrieval.
Data Structures of Documents • Fully formatted documents: typically, entities stored in DBMSs. • Fully unformatted documents: typically, data collected via sensors, e.g., medical monitoring data, sound and image data, and raw text from a text editor. • Most textual documents, however, are semi-structured, including title, author, source, abstract, and other structural information.
Document Surrogates • A document surrogate is a limited representation of a full document. It is the main focus of storing and querying for many IR systems. • How to generate and evaluate document surrogates in response to users’ information needs is an important topic.
Ingredients of document surrogates • Document identifier: could be a meaningless identifier such as a record id, or a more elaborate one such as the Library of Congress classification scheme for books (e.g., T210 C37 1982). • Title • Names: author, corporate, publisher • Dates: for timeliness and appropriateness • Unit descriptors: Introduction, Conclusion, Bibliography.
Ingredients of document surrogates • Keywords • Abstract: a brief one- or two-paragraph description of the contents of a paper. • Extract: similar to an abstract but created by someone other than the authors. • Review: similar to an extract but meant to be critical. The review itself is a separate document that is worth retrieving.
Vocabulary Control • It specifies a finite set of terms to be used for specifying keywords. • Advantages: • Uniformity throughout the retrieval system • More efficient • Disadvantages: • Authors/users cannot give/retrieve more detailed information. • Most IR systems nowadays opt for an uncontrolled vocabulary and rely on a sound internal thesaurus for bringing together related terms.
Encoding Standards • ASCII: a standard for English text encoding. However, it does not cover characters of different fonts, mathematical symbols, etc. • Big-5: traditional Chinese character set with 2 bytes per character. • GB: simplified Chinese character set with XX bytes. • CCCII: a full traditional Chinese character set with at most 6 bytes per character. • Unicode: a unified encoding trying to cover characters from multiple nations.
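The byte lengths above can be checked directly in Python, whose standard codec registry includes Big-5 and GB2312 (a minimal sketch, not part of the slides; codec names are Python's, and the character 中 is just an illustrative example):

```python
# How many bytes one CJK character occupies under different encodings.
ch = "中"

for codec in ("big5", "gb2312", "utf-8", "utf-16-le"):
    encoded = ch.encode(codec)
    print(codec, len(encoded))
```

Note that legacy double-byte encodings (Big-5, GB2312) use 2 bytes per CJK character, while UTF-8 typically uses 3.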
Markup languages • Initially used by word processors (.doc, .tex) and printers (.ps, .pdf). • More recently used for representing a document with hypertext information on the WWW (HTML, SGML). • A document written in a markup language can be segmented into several portions that better represent that document for searching.
Query Structures • Two types of matches • Exact match (equality match and range match) • Approximate match
Boolean Queries • Based on Boolean algebra • Common connectives: AND, OR, NOT • E.g., A AND (B OR C) AND D • Each term could be expanded by stemming or by a list of related terms from a thesaurus. • E.g., inf -> information; vegetarian -> mideastern countries • A XOR B ≡ (A AND NOT B) OR (NOT A AND B) • By far the most popular retrieval approach.
Boolean Queries (Cont’d) • Additional operators • Proximity (e.g., icing within 3 words of chocolate) • K out of N terms (e.g., 3 OF (A, B, C)) • Problems: • No good way to weight terms • E.g., music by Beethoven, preferably sonatas: (Beethoven AND sonata) OR (Beethoven) • Easy to misuse (e.g., people who want dinner together with sports or symphony may wrongly specify “dinner AND sports AND symphony”, which requires all three).
Boolean Queries (Cont’d) • The order of precedence may not be natural to users (e.g., A OR B AND C). People tend to interpret requests depending on the semantics. • E.g., coffee AND croissant OR muffin • raincoat AND umbrella OR sunglasses • Users may construct highly complex queries. • There are techniques for simplifying a given query into disjunctive normal form (DNF) or conjunctive normal form (CNF). • It has been shown that every Boolean expression can be converted to an equivalent DNF or CNF.
Boolean Queries (Cont’d) • DNF: a disjunction of several conjuncts, each of which is a conjunction of (possibly negated) terms connected by AND. • E.g., (A AND B) OR (A AND NOT C) • (A AND B AND C) OR (A AND B AND NOT C) is equivalent to (A AND B). • CNF: a conjunction of several disjuncts, each of which is a disjunction of (possibly negated) terms connected by OR. • Normalization to DNF can be done by looking at the TRUE rows of the truth table, while normalization to CNF can be done by looking at the FALSE rows.
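The TRUE-row procedure above can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the function name `to_dnf` is made up for this example):

```python
from itertools import product

def to_dnf(expr, variables):
    """Build a DNF for a Boolean expression by enumerating the TRUE rows
    of its truth table. `expr` is a function taking a dict of variable
    assignments; the result is a list of conjunct strings."""
    conjuncts = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if expr(assignment):
            # Each TRUE row becomes one conjunct of (possibly negated) terms.
            conjuncts.append(" AND ".join(
                v if assignment[v] else f"NOT {v}" for v in variables))
    return conjuncts

# A XOR B has exactly two TRUE rows, giving the DNF from the earlier slide:
print(to_dnf(lambda a: a["A"] != a["B"], ["A", "B"]))
```

Enumerating FALSE rows and negating them analogously yields the CNF.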
Boolean Queries (Cont’d) • The size of the returned set could be explosively large. Solution: return only a limited number of records. • Though there are many problems with Boolean queries, they are still popular because people tend to use only two or three terms at a time.
Vector Queries • Each document is represented as a vector, i.e., a list of terms. • The similarity between a document and a query is based on the presence of terms in both the query and the document. • The simplest model is the 0-1 (binary) vector. A more general model is the weighted vector. • Assigning weights to a document or a query is a complex process. • It is reasonable to assume that more frequent terms are more important.
Vector Queries (Cont’d) • It is better to give users the freedom to assign weights. In this case, a conversion between user weights and system weights must be done. [Show the conversion equ.] • There are two types of vector queries (for similarity search): • top-N queries • threshold-based queries
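A top-N vector query can be sketched with weighted vectors and the cosine measure (a minimal sketch, not from the slides; the documents, weights, and function names are invented for illustration):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def top_n(query, docs, n):
    """Rank document ids by similarity to the query; keep the best n."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:n]

docs = {
    "d1": {"retrieval": 2.0, "boolean": 1.0},
    "d2": {"retrieval": 1.0, "vector": 3.0},
    "d3": {"recipes": 4.0},
}
query = {"retrieval": 1.0, "vector": 1.0}
print(top_n(query, docs, 2))
```

A threshold-based query would instead keep every document whose similarity exceeds a fixed cutoff.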
Extended Boolean Queries • This approach incorporates weights into Boolean queries. A general form is A_w1 &lt;connective&gt; B_w2 (e.g., A_0.2 AND B_0.6). • A OR B_0.2 retrieves all documents that contain A, plus those documents in B that are within the top 20% closest to the documents in A. • A OR B_1 ≡ A OR B • A OR B_0 ≡ A • See Figure 3.1 for a diagrammatic illustration.
Extended Boolean Queries (Cont’d) • A AND B_0.2 • A AND B_0 ≡ A • A AND B_1 ≡ A AND B • See Figure 3.2 for a graphical illustration. • A AND NOT B_0.2 • A AND NOT B_0 ≡ A • A AND NOT B_1 ≡ A AND NOT B • See Figure 3.3 for a graphical illustration. • A_0.2 OR B_0.6 returns the 20% of the documents in A−B that are closest to B and the 60% of the documents in B−A that are closest to A.
Extended Boolean Queries (Cont’d) • See Example 3.1. • One needs to define the distance between a document and a set of documents (those containing A). • The computation of an extended Boolean query could be time-consuming. • This model has not become popular.
Fuzzy Queries • Based on fuzzy set theory. • In a fuzzy set S, each element is associated with a membership grade. • Formally, S = {&lt;x, μS(x)&gt; | μS(x) &gt; 0}. • A ∩ B = {x : x ∈ A and x ∈ B, μ(x) = min(μA(x), μB(x))}. • A ∪ B = {x : x ∈ A or x ∈ B, μ(x) = max(μA(x), μB(x))}. • NOT A = {x : x ∈ A, μ(x) = 1 − μA(x)}.
Fuzzy Queries (Cont’d) • To use fuzzy queries, documents must be represented as fuzzy sets too. • The documents are returned to the user in decreasing order of their fuzzy membership grades with respect to the fuzzy query.
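The min/max/complement operations on membership grades can be sketched over dicts mapping document ids to grades (an illustrative sketch, not from the slides; the grades are invented):

```python
def fuzzy_and(a, b):
    """Fuzzy intersection: the min of the membership grades, element-wise."""
    return {x: min(a[x], b[x]) for x in a if x in b}

def fuzzy_or(a, b):
    """Fuzzy union: the max of the membership grades, element-wise."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_not(a):
    """Fuzzy complement: grade 1 - mu(x)."""
    return {x: 1.0 - g for x, g in a.items()}

# Membership grades of documents in two fuzzy query-term sets:
A = {"d1": 0.8, "d2": 0.3}
B = {"d1": 0.5, "d3": 0.9}
print(fuzzy_and(A, B))
print(fuzzy_or(A, B))
```

Ranking the result of `fuzzy_or` by decreasing grade gives the ordered answer the slide describes.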
Probabilistic Queries • Similar to fuzzy queries, but now the membership grades are probabilities. • The probability that a document is associated with a query (or term) can be calculated through probability theory (e.g., Bayes’ theorem) after some observation.
Natural Language Queries • Convenient. • But imprecise, inaccurate, and frequently ungrammatical. • The difficulty lies in obtaining an accurate interpretation of a longer text, which may rely on common sense. • A successful system must be restricted to a narrowly defined domain (e.g., not medicine in general but diagnosis of illness).
Information Retrieval and Database Systems • Should one use a database system to handle information retrieval requests? • DBMSs are a mature and successful technology for handling precise queries. • They are not appropriate for handling imprecise textual elements. • OODBs provide augmented functions for textual or image elements and are considered a good candidate.
Boolean-based matching • It divides the document space into two parts: documents satisfying the query and those that do not. • A finer grading of the set of retrieved documents can be defined based on the number of terms satisfied (e.g., A OR B OR C).
Vector-based matching • Measures • Based on the idea of distance • Minkowski metric (L_q): L_q = (|X_i1 − X_j1|^q + |X_i2 − X_j2|^q + … + |X_ip − X_jp|^q)^(1/q) • Special cases: Manhattan distance (q = 1), Euclidean distance (q = 2), and maximum direction distance (q = ∞). • See example on p.133. • Based on the idea of angle • Cosine function: (Q · D)/(|Q||D|).
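The Minkowski metric and its special cases can be sketched directly from the formula above (a minimal sketch, not from the slides; the vectors are invented):

```python
def minkowski(x, y, q):
    """L_q distance between two equal-length vectors:
    (sum of |x_k - y_k|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = [1.0, 3.0], [4.0, 7.0]
print(minkowski(x, y, 1))   # Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))   # Euclidean: sqrt(9 + 16) = 5.0
# As q -> infinity, L_q approaches the largest coordinate difference:
print(max(abs(a - b) for a, b in zip(x, y)))  # 4.0
```

The cosine measure, by contrast, compares directions rather than positions, as the next slides discuss.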
Mapping distance to similarity • It is better to map distance (or dissimilarity) into some range, e.g., [0, 1]. • A simple inversion function is s(u) = b − u. • A more general inversion function is s(u) = b − p(u), where p(u) is a monotone nondecreasing function such that p(0) = 0. • See Fig. 4.1 for a graphical illustration.
Distance or cosine? • Consider &lt;1, 3&gt;, &lt;100, 300&gt;, and &lt;3, 1&gt;: which pair is similar? Distance says &lt;1, 3&gt; and &lt;3, 1&gt; are close, while the cosine measure says &lt;1, 3&gt; and &lt;100, 300&gt; point in the same direction. • In practice, distance and angular measures seem to give results of similar quality, because the documents in a cluster all lie in roughly the same direction.
Missing terms and term relationships • The conventional value 0 may mean • truly missing, or • no information. • However, if 0 is regarded as undefined, it becomes impossible to measure the distance between two documents (e.g., &lt;3, −&gt; and &lt;−, 4&gt;). • Terms used to define the vector model are clearly not independent; e.g., “digital” and “computer” have a strong relationship. • However, the effect of dependent terms is hardly known.
Probability matching • For a given query, we can define the probability that a document is relevant as P(rel) = n/N. • The discriminant function of the selected set is dis(selected) = P(rel|selected)/P(rel). • The desirable discriminant function value of a set is at least 1. • Let a document be represented by terms t1, …, tn that are statistically independent; then P(selected|rel) = P(t1|rel)P(t2|rel)…P(tn|rel). • We can use Bayes’ theorem to calculate the probability that a document should be selected. • See Example 4.1.
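The Bayes computation with the term-independence assumption can be sketched as follows (an illustrative sketch, not Example 4.1; the function name and all probability estimates are invented for this example):

```python
def p_rel_given_doc(terms, p_rel, p_t_rel, p_t_nonrel):
    """P(rel | t1..tn) via Bayes' theorem with term independence:
    the posterior is proportional to P(rel) * product of P(ti | rel),
    normalized against the non-relevant alternative."""
    num = p_rel
    den = 1.0 - p_rel
    for t in terms:
        num *= p_t_rel[t]      # P(ti | rel)
        den *= p_t_nonrel[t]   # P(ti | nonrel)
    return num / (num + den)

# Hypothetical per-term estimates from observed relevance judgments:
p_t_rel = {"retrieval": 0.8, "boolean": 0.4}
p_t_nonrel = {"retrieval": 0.2, "boolean": 0.3}
print(round(p_rel_given_doc(["retrieval", "boolean"], 0.1,
                            p_t_rel, p_t_nonrel), 3))
```

Documents can then be ranked by this posterior probability of relevance.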
Fuzzy matching • The issue is how to define the fuzzy grade of documents w.r.t. a query. • One can define the fuzzy grade based on closeness to the query; for example, how close is an Akita to a German Shepherd or to a Pomeranian?
Proximity matching • The proximity criterion can be used independently of any other criteria. • A modification is to use phrases rather than single words, but this causes problems in some cases (e.g., “information retrieval” vs. “the retrieval of information”). • Another modification is to use the order of words (e.g., “junior college” vs. “college junior”). However, this still suffers from the same problem as before. • Many systems therefore introduce a measure of proximity.
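A basic "within k words" proximity test, in the spirit of the earlier "icing within 3 words of chocolate" operator, can be sketched like this (a minimal sketch, not from the slides; it ignores punctuation and phrase order):

```python
def within(text, term1, term2, k):
    """True if term1 and term2 occur within k words of each other."""
    words = text.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == term1]
    pos2 = [i for i, w in enumerate(words) if w == term2]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

text = "spread the icing over the chocolate cake"
print(within(text, "icing", "chocolate", 3))  # True
print(within(text, "icing", "cake", 3))       # False
```

A graded proximity measure would replace the boolean test with a score that decays as the word gap grows.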
Effects of weighting • Weights can be given to sets of words, rather than individual words. • E.g., (beef and broccoli): 5; (beef but not broccoli): 2; (broccoli but not beef): 2; noodles: 1; snow peas: 1; water chestnuts: 1.
Effects of scaling • As a collection grows larger, each additional retrieved document is less likely to be relevant. • Information filtering aims at producing a relatively small result set. • Another possibility is to use several models together, leading to so-called data fusion.
A user-centered view • Each user has an individual vocabulary that may not match that of the author, editor, or indexer. • Often the user does not know how to specify his/her information need: “I’ll know it when I see it.” Therefore, it is important to allow users direct access to the data (browsing).
Indexing • Indexing is the act of assigning index terms to a document. • Many nonfiction books have indexes created by their authors. • The indexing language may be controlled or uncontrolled. • For manual indexing, an uncontrolled indexing language is generally used, but it suffers from: • Lack of consistency (the agreement in index term assignment may be as little as 20%) • Difficulty in keeping up with fast-evolving fields.
Indexing (Cont’d) • Characteristics of an indexing language • Exhaustivity (the breadth) and specificity (the depth) • The ingredients of indexes • Links (terms that occur together) • Roles • Cross referencing • See: coal, see fuel • Related terms: microcomputer, see also personal computer • Broader term (BT): poodle, BT dog • Narrower term (NT): dog, NT poodle, cocker spaniel, pointer.
Indexing (Cont’d) • Automatic indexing will play an ever-increasing role. • Approaches to automatic indexing: • Word counting • Based on deeper linguistic knowledge • Based on semantics and concepts within a document collection • Often an inverted file is used to store the indexes of the documents in a collection.
Matrix Representations • Term-document matrix A: • A_ij indicates the occurrence or the count of term i in document j. • Term-term matrix T: • T_ij indicates the co-occurrence or the co-occurrence count of terms i and j. • Document-document matrix D: • D_ij indicates the degree of term overlap between documents i and j. • These matrices are usually sparse and are better stored as lists.
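A sparse term-document matrix, and the document-document overlap derived from it, can be sketched with plain dictionaries (an illustrative sketch, not from the slides; the tiny document collection is invented):

```python
from collections import Counter

docs = {
    "d1": "information retrieval and information filtering",
    "d2": "boolean retrieval",
}

# Sparse term-document matrix: one Counter of term counts per document,
# so only nonzero entries are stored.
A = {d: Counter(text.split()) for d, text in docs.items()}
print(A["d1"]["information"])       # count of a term in d1
print(A["d2"].get("filtering", 0))  # absent terms default to 0

# One entry of the document-document matrix D: the number of terms
# shared by d1 and d2.
overlap = len(set(A["d1"]) & set(A["d2"]))
print(overlap)
```

The term-term matrix can be built the same way by counting, for each pair of terms, the documents in which both appear.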
Term Extraction and Analysis • It has been observed that the frequencies of words in a document follow the so-called Zipf’s law (f = k·r^(−1)): frequencies proportional to 1, 1/2, 1/3, 1/4, … • Many similar observations have been made: • Half of a document is made up of 250 distinct words. • 20% of the text words account for 70% of term usage. • None of these observations is strictly implied by Zipf’s law. • High-frequency terms are not desirable as index terms because they are so common. • Rare words are not desirable because very few documents would be retrieved.
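The rank-frequency relationship can be examined by counting words and sorting by frequency (a minimal sketch on a toy text; real corpora are needed for the law to show clearly):

```python
from collections import Counter

text = ("to be or not to be that is the question "
        "whether tis nobler in the mind to suffer")
counts = Counter(text.split())
ranked = counts.most_common()

# Under Zipf's law f = k/r, the product rank * frequency is roughly
# constant across ranks.
for rank, (word, freq) in enumerate(ranked[:3], start=1):
    print(rank, word, freq, rank * freq)
```

On a text this small the constancy is crude, but the most frequent word already dominates as the law predicts.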
Term Association • Term association expands indexing with the concept of word proximity. • A proximity measure may depend on: • the number of intervening words • the number of words appearing in the same sentence • word order • punctuation • However, there are risks: in “The felon’s information assured the retrieval of the money”, the words “information” and “retrieval” occur near each other without denoting “information retrieval” or “the retrieval of information”.
Term significance • Frequent words in a document collection may not be significant (e.g., “digital computer” in a computer science collection). • Absolute term frequency ignores the size of a document. • Relative term frequency is therefore often used: • absolute term frequency / length of the document. • Term frequency over a document collection: • total frequency count of a term / total words in the documents of the collection, or • number of documents containing the term / total number of documents.
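The two collection-level measures above can be sketched side by side (an illustrative sketch, not from the slides; the collection and function names are invented):

```python
docs = {
    "d1": "digital computer design".split(),
    "d2": "computer networks".split(),
    "d3": "cooking recipes".split(),
}

def relative_tf(term, doc):
    """Absolute term frequency divided by the document's length."""
    return doc.count(term) / len(doc)

def doc_frequency(term, docs):
    """Fraction of documents in the collection containing the term."""
    return sum(term in d for d in docs.values()) / len(docs)

print(relative_tf("computer", docs["d1"]))  # 1 occurrence in 3 words
print(doc_frequency("computer", docs))      # appears in 2 of 3 documents
```

Combining a high relative term frequency with a low document frequency is the usual recipe for identifying significant terms.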