Information Retrieval and Recommendation Techniques 國立中山大學資管系 黃三益 (Department of Information Management, National Sun Yat-sen University)
Abstraction • Reality (the real world) cannot be known in its entirety. • Reality is represented by a collection of data abstracted from observations of the real world. • Information need drives the storage and retrieval of information. • Relationships among reality, information need, data, and query (see Figure 1.1).
Information Systems • Two portions: endosystem and ectosystem. • Ectosystem has three human components: • User • Funder • Server: the information professional who operates the system and provides service to the user. • Endosystem has four components: • Media • Devices • Algorithms • Data structures
Measures • Performance is dictated by the endosystem but judged by the ectosystem. • The user is mainly concerned with effectiveness. • The server is more aware of efficiency. • The funder is more concerned with the economy of the system. • This course concentrates primarily on effectiveness measures. • The so-called user satisfaction has many meanings, and different users may use different criteria. • A fixed set of criteria must be established for fair comparison.
From Signal to Wisdom • Five stepping stones: • Signal: bit stream, wave, etc. • Data: impersonal, available to any user. • Information: a set of data matched to a particular information need. • Knowledge: coherence of data, concepts, and rules. • Wisdom: a balanced judgment in the light of certain value criteria.
What is a document? • A paper or a book? A section or a chapter? • There is no strict definition of the scope and format of a document. • The document concept can be extended to include programs, files, email messages, images, voice, and video. • However, most commercial IR systems handle multimedia documents through their textual representations. • The focus of this course is on text retrieval.
Data Structures of Documents • Fully formatted documents: typically, entities stored in DBMSs. • Fully unformatted documents: typically, data collected via sensors, e.g., medical monitoring data, sound and image data, and raw text from a text editor. • Most textual documents, however, are semi-structured, including title, author, source, abstract, and other structural information.
Document Surrogates • A document surrogate is a limited representation of a full document. It is the main focus of storing and querying for many IR systems. • How to generate and evaluate document surrogates in response to users’ information needs is an important topic.
Ingredients of document surrogates • Document identifier: could be a meaningless identifier such as a record id, or a more elaborate one such as the Library of Congress classification scheme for books (e.g., T210 C37 1982). • Title • Names: author, corporate, publisher • Dates: for timeliness and appropriateness • Unit descriptors: Introduction, Conclusion, Bibliography.
Ingredients of document surrogates • Keywords • Abstract: a brief one- or two-paragraph description of the contents of a paper. • Extract: similar to an abstract but created by someone other than the authors. • Review: similar to an extract but meant to be critical. The review itself is a separate document that is worth retrieving.
Vocabulary Control • It specifies a finite set of terms to be used for specifying keywords. • Advantages: • Uniformity throughout the retrieval system • More efficient • Disadvantages: • Authors/users cannot give/retrieve more detailed information. • Most IR systems nowadays opt for an uncontrolled vocabulary and rely on a sound internal thesaurus for bringing together related terms.
Encoding Standards • ASCII: a standard for English text encoding. However, it does not cover characters of different fonts, mathematical symbols, etc. • Big-5: traditional Chinese character set with 2 bytes per character. • GB: simplified Chinese character set with XX bytes. • CCCII: a full traditional Chinese character set with at most 6 bytes per character. • Unicode: a unified encoding trying to cover characters from multiple nations.
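The byte lengths above can be checked directly in Python, whose standard codec registry includes Big-5 and GB2312 (a minimal sketch, not part of the slides; codec names are Python's, and the character 中 is just an illustrative example):

```python
# How many bytes one CJK character occupies under different encodings.
ch = "中"

for codec in ("big5", "gb2312", "utf-8", "utf-16-le"):
    encoded = ch.encode(codec)
    print(codec, len(encoded))
```

Note that legacy double-byte encodings (Big-5, GB2312) use 2 bytes per CJK character, while UTF-8 typically uses 3.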
Markup languages • Initially used by word processors (.doc, .tex) and printers (.ps, .pdf). • More recently used for representing a document with hypertext information on the WWW (HTML, SGML). • A document written in a markup language can be segmented into several portions that better represent that document for searching.
Query Structures • Two types of matches • Exact match (equality match and range match) • Approximate match
Boolean Queries • Based on Boolean algebra • Common connectives: AND, OR, NOT • E.g., A AND (B OR C) AND D • Each term could be expanded by stemming or by a list of related terms from a thesaurus. • E.g., inf -> information; vegetarian -> mideastern countries • A XOR B ≡ (A AND NOT B) OR (NOT A AND B) • By far the most popular retrieval approach.
Boolean Queries (Cont’d) • Additional operators • Proximity (e.g., icing within 3 words of chocolate) • K out of N terms (e.g., 3 OF (A, B, C)) • Problems: • No good way to weight terms • E.g., music by Beethoven, preferably sonatas: (Beethoven AND sonata) OR (Beethoven) • Easy to misuse (e.g., people who want dinner together with sports or symphony may wrongly specify “dinner AND sports AND symphony”, which requires all three).
Boolean Queries (Cont’d) • The order of precedence may not be natural to users (e.g., A OR B AND C). People tend to interpret requests depending on the semantics. • E.g., coffee AND croissant OR muffin • raincoat AND umbrella OR sunglasses • Users may construct highly complex queries. • There are techniques for simplifying a given query into disjunctive normal form (DNF) or conjunctive normal form (CNF). • It has been shown that every Boolean expression can be converted to an equivalent DNF or CNF.
Boolean Queries (Cont’d) • DNF: a disjunction of several conjuncts, each of which is a conjunction of (possibly negated) terms connected by AND. • E.g., (A AND B) OR (A AND NOT C) • (A AND B AND C) OR (A AND B AND NOT C) is equivalent to (A AND B). • CNF: a conjunction of several disjuncts, each of which is a disjunction of (possibly negated) terms connected by OR. • Normalization to DNF can be done by looking at the TRUE rows of the truth table, while normalization to CNF can be done by looking at the FALSE rows.
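The TRUE-row procedure above can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the function name `to_dnf` is made up for this example):

```python
from itertools import product

def to_dnf(expr, variables):
    """Build a DNF for a Boolean expression by enumerating the TRUE rows
    of its truth table. `expr` is a function taking a dict of variable
    assignments; the result is a list of conjunct strings."""
    conjuncts = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if expr(assignment):
            # Each TRUE row becomes one conjunct of (possibly negated) terms.
            conjuncts.append(" AND ".join(
                v if assignment[v] else f"NOT {v}" for v in variables))
    return conjuncts

# A XOR B has exactly two TRUE rows, giving the DNF from the earlier slide:
print(to_dnf(lambda a: a["A"] != a["B"], ["A", "B"]))
```

Enumerating FALSE rows and negating them analogously yields the CNF.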
Boolean Queries (Cont’d) • The size of the returned set could be explosively large. Solution: return only a limited number of records. • Though there are many problems with Boolean queries, they are still popular because people tend to use only two or three terms at a time.
Vector Queries • Each document is represented as a vector, i.e., a list of terms. • The similarity between a document and a query is based on the presence of terms in both the query and the document. • The simplest model is the 0-1 (binary) vector. A more general model is the weighted vector. • Assigning weights to a document or a query is a complex process. • It is reasonable to assume that more frequent terms are more important.
Vector Queries (Cont’d) • It is better to give users the freedom to assign weights. In this case, a conversion between user weights and system weights must be done. [Show the conversion equ.] • There are two types of vector queries (for similarity search): • top-N queries • threshold-based queries
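A top-N vector query can be sketched with weighted vectors and the cosine measure (a minimal sketch, not from the slides; the documents, weights, and function names are invented for illustration):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def top_n(query, docs, n):
    """Rank document ids by similarity to the query; keep the best n."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:n]

docs = {
    "d1": {"retrieval": 2.0, "boolean": 1.0},
    "d2": {"retrieval": 1.0, "vector": 3.0},
    "d3": {"recipes": 4.0},
}
query = {"retrieval": 1.0, "vector": 1.0}
print(top_n(query, docs, 2))
```

A threshold-based query would instead keep every document whose similarity exceeds a fixed cutoff.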
Extended Boolean Queries • This approach incorporates weights into Boolean queries. A general form is A_w1 &lt;connective&gt; B_w2 (e.g., A_0.2 AND B_0.6). • A OR B_0.2 retrieves all documents that contain A, plus those documents in B that are within the top 20% closest to the documents in A. • A OR B_1 ≡ A OR B • A OR B_0 ≡ A • See Figure 3.1 for a diagrammatic illustration.
Extended Boolean Queries (Cont’d) • A AND B_0.2 • A AND B_0 ≡ A • A AND B_1 ≡ A AND B • See Figure 3.2 for a graphical illustration. • A AND NOT B_0.2 • A AND NOT B_0 ≡ A • A AND NOT B_1 ≡ A AND NOT B • See Figure 3.3 for a graphical illustration. • A_0.2 OR B_0.6 returns the 20% of the documents in A−B that are closest to B and the 60% of the documents in B−A that are closest to A.
Extended Boolean Queries (Cont’d) • See Example 3.1. • One needs to define the distance between a document and a set of documents (those containing A). • The computation of an extended Boolean query could be time-consuming. • This model has not become popular.
Fuzzy Queries • Based on fuzzy set theory. • In a fuzzy set S, each element is associated with a membership grade. • Formally, S = {&lt;x, μS(x)&gt; | μS(x) &gt; 0}. • A ∩ B = {x : x ∈ A and x ∈ B, μ(x) = min(μA(x), μB(x))}. • A ∪ B = {x : x ∈ A or x ∈ B, μ(x) = max(μA(x), μB(x))}. • NOT A = {x : x ∈ A, μ(x) = 1 − μA(x)}.
Fuzzy Queries (Cont’d) • To use fuzzy queries, documents must be represented as fuzzy sets too. • The documents are returned to the user in decreasing order of their fuzzy membership grades with respect to the fuzzy query.
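The min/max/complement operations on membership grades can be sketched over dicts mapping document ids to grades (an illustrative sketch, not from the slides; the grades are invented):

```python
def fuzzy_and(a, b):
    """Fuzzy intersection: the min of the membership grades, element-wise."""
    return {x: min(a[x], b[x]) for x in a if x in b}

def fuzzy_or(a, b):
    """Fuzzy union: the max of the membership grades, element-wise."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_not(a):
    """Fuzzy complement: grade 1 - mu(x)."""
    return {x: 1.0 - g for x, g in a.items()}

# Membership grades of documents in two fuzzy query-term sets:
A = {"d1": 0.8, "d2": 0.3}
B = {"d1": 0.5, "d3": 0.9}
print(fuzzy_and(A, B))
print(fuzzy_or(A, B))
```

Ranking the result of `fuzzy_or` by decreasing grade gives the ordered answer the slide describes.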
Probabilistic Queries • Similar to fuzzy queries, but now the membership grades are probabilities. • The probability that a document is associated with a query (or term) can be calculated through probability theory (e.g., Bayes’ theorem) after some observation.
Natural Language Queries • Convenient. • But imprecise, inaccurate, and frequently ungrammatical. • The difficulty lies in obtaining an accurate interpretation of a longer text, which may rely on common sense. • A successful system must be restricted to a narrowly defined domain (e.g., not medicine in general but diagnosis of illness).
Information Retrieval and Database Systems • Should one use a database system to handle information retrieval requests? • DBMSs are a mature and successful technology for handling precise queries. • They are not appropriate for handling imprecise textual elements. • OODBs provide augmented functions for textual or image elements and are considered a good candidate.
Boolean-based matching • It divides the document space into two parts: documents satisfying the query and those that do not. • A finer grading of the set of retrieved documents can be defined based on the number of terms satisfied (e.g., A OR B OR C).
Vector-based matching • Measures • Based on the idea of distance • Minkowski metric (L_q): L_q = (|X_i1 − X_j1|^q + |X_i2 − X_j2|^q + … + |X_ip − X_jp|^q)^(1/q) • Special cases: Manhattan distance (q = 1), Euclidean distance (q = 2), and maximum direction distance (q = ∞). • See example on p.133. • Based on the idea of angle • Cosine function: (Q · D)/(|Q||D|).
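The Minkowski metric and its special cases can be sketched directly from the formula above (a minimal sketch, not from the slides; the vectors are invented):

```python
def minkowski(x, y, q):
    """L_q distance between two equal-length vectors:
    (sum of |x_k - y_k|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = [1.0, 3.0], [4.0, 7.0]
print(minkowski(x, y, 1))   # Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))   # Euclidean: sqrt(9 + 16) = 5.0
# As q -> infinity, L_q approaches the largest coordinate difference:
print(max(abs(a - b) for a, b in zip(x, y)))  # 4.0
```

The cosine measure, by contrast, compares directions rather than positions, as the next slides discuss.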
Mapping distance to similarity • It is better to map distance (or dissimilarity) into some range, e.g., [0, 1]. • A simple inversion function is s(u) = b − u. • A more general inversion function is s(u) = b − p(u), where p(u) is a monotone nondecreasing function such that p(0) = 0. • See Fig. 4.1 for a graphical illustration.
Distance or cosine? • Consider &lt;1, 3&gt;, &lt;100, 300&gt;, and &lt;3, 1&gt;: which pair is similar? Distance says &lt;1, 3&gt; and &lt;3, 1&gt; are close, while the cosine measure says &lt;1, 3&gt; and &lt;100, 300&gt; point in the same direction. • In practice, distance and angular measures seem to give results of similar quality, because the documents in a cluster all lie in roughly the same direction.
Missing terms and term relationships • The conventional value 0 may mean • truly missing, or • no information. • However, if 0 is regarded as undefined, it becomes impossible to measure the distance between two documents (e.g., &lt;3, −&gt; and &lt;−, 4&gt;). • Terms used to define the vector model are clearly not independent; e.g., “digital” and “computer” have a strong relationship. • However, the effect of dependent terms is hardly known.
Probability matching • For a given query, we can define the probability that a document is relevant as P(rel) = n/N. • The discriminant function of the selected set is dis(selected) = P(rel|selected)/P(rel). • The desirable discriminant function value of a set is at least 1. • Let a document be represented by terms t1, …, tn that are statistically independent; then P(selected|rel) = P(t1|rel)P(t2|rel)…P(tn|rel). • We can use Bayes’ theorem to calculate the probability that a document should be selected. • See Example 4.1.
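The Bayes computation with the term-independence assumption can be sketched as follows (an illustrative sketch, not Example 4.1; the function name and all probability estimates are invented for this example):

```python
def p_rel_given_doc(terms, p_rel, p_t_rel, p_t_nonrel):
    """P(rel | t1..tn) via Bayes' theorem with term independence:
    the posterior is proportional to P(rel) * product of P(ti | rel),
    normalized against the non-relevant alternative."""
    num = p_rel
    den = 1.0 - p_rel
    for t in terms:
        num *= p_t_rel[t]      # P(ti | rel)
        den *= p_t_nonrel[t]   # P(ti | nonrel)
    return num / (num + den)

# Hypothetical per-term estimates from observed relevance judgments:
p_t_rel = {"retrieval": 0.8, "boolean": 0.4}
p_t_nonrel = {"retrieval": 0.2, "boolean": 0.3}
print(round(p_rel_given_doc(["retrieval", "boolean"], 0.1,
                            p_t_rel, p_t_nonrel), 3))
```

Documents can then be ranked by this posterior probability of relevance.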
Fuzzy matching • The issue is how to define the fuzzy grade of documents w.r.t. a query. • One can define the fuzzy grade based on closeness to the query; for example, how close is an Akita to a German Shepherd or to a Pomeranian?
Proximity matching • The proximity criterion can be used independently of any other criteria. • A modification is to use phrases rather than single words, but this causes problems in some cases (e.g., “information retrieval” vs. “the retrieval of information”). • Another modification is to use the order of words (e.g., “junior college” vs. “college junior”). However, this still suffers from the same problem as before. • Many systems therefore introduce a measure of proximity.
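A basic "within k words" proximity test, in the spirit of the earlier "icing within 3 words of chocolate" operator, can be sketched like this (a minimal sketch, not from the slides; it ignores punctuation and phrase order):

```python
def within(text, term1, term2, k):
    """True if term1 and term2 occur within k words of each other."""
    words = text.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == term1]
    pos2 = [i for i, w in enumerate(words) if w == term2]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

text = "spread the icing over the chocolate cake"
print(within(text, "icing", "chocolate", 3))  # True
print(within(text, "icing", "cake", 3))       # False
```

A graded proximity measure would replace the boolean test with a score that decays as the word gap grows.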
Effects of weighting • Weights can be given to sets of words, rather than individual words. • E.g., (beef and broccoli): 5; (beef but not broccoli): 2; (broccoli but not beef): 2; noodles: 1; snow peas: 1; water chestnuts: 1.
Effects of scaling • As a collection grows larger, each additional retrieved document is less likely to be relevant. • Information filtering aims at producing a relatively small result set. • Another possibility is to use several models together, leading to so-called data fusion.
A user-centered view • Each user has an individual vocabulary that may not match that of the author, editor, or indexer. • Often the user does not know how to specify his/her information need: “I’ll know it when I see it.” Therefore, it is important to allow users direct access to the data (browsing).
Indexing • Indexing is the act of assigning index terms to a document. • Many nonfiction books have indexes created by their authors. • The indexing language may be controlled or uncontrolled. • For manual indexing, an uncontrolled indexing language is generally used, but it suffers from: • Lack of consistency (the agreement in index term assignment may be as little as 20%) • Difficulty in keeping up with fast-evolving fields.
Indexing (Cont’d) • Characteristics of an indexing language • Exhaustivity (the breadth) and specificity (the depth) • The ingredients of indexes • Links (terms that occur together) • Roles • Cross referencing • See: coal, see fuel • Related terms: microcomputer, see also personal computer • Broader term (BT): poodle, BT dog • Narrower term (NT): dog, NT poodle, cocker spaniel, pointer.
Indexing (Cont’d) • Automatic indexing will play an ever-increasing role. • Approaches to automatic indexing: • Word counting • Based on deeper linguistic knowledge • Based on semantics and concepts within a document collection • Often an inverted file is used to store the indexes of the documents in a collection.
Matrix Representations • Term-document matrix A: • A_ij indicates the occurrence or the count of term i in document j. • Term-term matrix T: • T_ij indicates the co-occurrence or the co-occurrence count of terms i and j. • Document-document matrix D: • D_ij indicates the degree of term overlap between documents i and j. • These matrices are usually sparse and are better stored as lists.
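A sparse term-document matrix, and the document-document overlap derived from it, can be sketched with plain dictionaries (an illustrative sketch, not from the slides; the tiny document collection is invented):

```python
from collections import Counter

docs = {
    "d1": "information retrieval and information filtering",
    "d2": "boolean retrieval",
}

# Sparse term-document matrix: one Counter of term counts per document,
# so only nonzero entries are stored.
A = {d: Counter(text.split()) for d, text in docs.items()}
print(A["d1"]["information"])       # count of a term in d1
print(A["d2"].get("filtering", 0))  # absent terms default to 0

# One entry of the document-document matrix D: the number of terms
# shared by d1 and d2.
overlap = len(set(A["d1"]) & set(A["d2"]))
print(overlap)
```

The term-term matrix can be built the same way by counting, for each pair of terms, the documents in which both appear.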
Term Extraction and Analysis • It has been observed that the frequencies of words in a document follow the so-called Zipf’s law (f = k·r^(−1)): frequencies proportional to 1, 1/2, 1/3, 1/4, … • Many similar observations have been made: • Half of a document is made up of 250 distinct words. • 20% of the text words account for 70% of term usage. • None of these observations is strictly implied by Zipf’s law. • High-frequency terms are not desirable as index terms because they are so common. • Rare words are not desirable because very few documents would be retrieved.
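The rank-frequency relationship can be examined by counting words and sorting by frequency (a minimal sketch on a toy text; real corpora are needed for the law to show clearly):

```python
from collections import Counter

text = ("to be or not to be that is the question "
        "whether tis nobler in the mind to suffer")
counts = Counter(text.split())
ranked = counts.most_common()

# Under Zipf's law f = k/r, the product rank * frequency is roughly
# constant across ranks.
for rank, (word, freq) in enumerate(ranked[:3], start=1):
    print(rank, word, freq, rank * freq)
```

On a text this small the constancy is crude, but the most frequent word already dominates as the law predicts.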
Term Association • Term association expands indexing with the concept of word proximity. • A proximity measure may depend on: • the number of intervening words • the number of words appearing in the same sentence • word order • punctuation • However, there are risks: in “The felon’s information assured the retrieval of the money”, the words “information” and “retrieval” occur near each other without denoting “information retrieval” or “the retrieval of information”.
Term significance • Frequent words in a document collection may not be significant (e.g., “digital computer” in a computer science collection). • Absolute term frequency ignores the size of a document. • Relative term frequency is therefore often used: • absolute term frequency / length of the document. • Term frequency over a document collection: • total frequency count of a term / total words in the documents of the collection, or • number of documents containing the term / total number of documents.
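The two collection-level measures above can be sketched side by side (an illustrative sketch, not from the slides; the collection and function names are invented):

```python
docs = {
    "d1": "digital computer design".split(),
    "d2": "computer networks".split(),
    "d3": "cooking recipes".split(),
}

def relative_tf(term, doc):
    """Absolute term frequency divided by the document's length."""
    return doc.count(term) / len(doc)

def doc_frequency(term, docs):
    """Fraction of documents in the collection containing the term."""
    return sum(term in d for d in docs.values()) / len(docs)

print(relative_tf("computer", docs["d1"]))  # 1 occurrence in 3 words
print(doc_frequency("computer", docs))      # appears in 2 of 3 documents
```

Combining a high relative term frequency with a low document frequency is the usual recipe for identifying significant terms.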