This text is an introduction to the models and techniques used in information retrieval, covering the logical (Boolean), vector processing, probabilistic, and cognitive models. It also discusses the drawbacks of these models, the evaluation of retrieval results, and clustering methods.
Digital Days
Information Retrieval
Bassiou Nikoletta
Artificial Intelligence and Information Analysis Lab
Aristotle University of Thessaloniki, Informatics Department
Information Retrieval • Introduction • Models / Techniques • Evaluation of Results • Clustering • References
Introduction • Research in developing algorithms and models for retrieving information from document repositories (document / text retrieval) • Main activities: • Indexing: representation of documents • Searching: the way documents are examined to determine whether they are relevant
Models / Techniques • Logical • Vector Processing • Probabilistic • Cognitive
Models / Techniques (cont.) • Logical (Boolean) Model • Documents: represented by index terms or keywords • Requests: logical combinations (AND, OR, NOT) of these terms • A document is retrieved when it satisfies the logical expression of the request • Example: D1={A, B}, D2={B, C}, D3={A, B, C}; the query Q = A AND B AND NOT C retrieves {D1}
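The example above can be sketched in a few lines of Python (an illustration only; the document and term names follow the slide):

```python
# Each document is represented by its set of index terms, as on the slide.
docs = {
    "D1": {"A", "B"},
    "D2": {"B", "C"},
    "D3": {"A", "B", "C"},
}

def boolean_match(terms, required, forbidden):
    """True when every AND-ed term is present and no NOT-ed term is."""
    return required <= terms and not (forbidden & terms)

# Q = A AND B AND NOT C
answer = {name for name, terms in docs.items()
          if boolean_match(terms, required={"A", "B"}, forbidden={"C"})}
print(answer)  # {'D1'}
```

Note the all-or-nothing behaviour: D3 contains both A and B but is rejected outright because of C, with no partial credit or ranking.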
Models / Techniques (cont.) • Logical (Boolean) Model (cont.) • Drawbacks: • Formulating the query is difficult; trained intermediaries often have to search on behalf of the user • Results: the database is partitioned into two discrete subsets, with no mechanism for ranking documents by decreasing probability of relevance • All query terms are considered equal: each is either present or absent • Closed World Assumption: the absence of an index term is taken to mean the term does not apply to that document, which may be false • These drawbacks motivated the development of fuzzy set models
Models / Techniques (cont.) • Vector Processing Model • Documents/queries are represented in a high-dimensional space • Each dimension corresponds to a word in the document collection • Most relevant documents for a query: documents represented by the vectors closest to the query
Models / Techniques (cont.) • Vector Processing Model (cont.) • Document: t-dimensional vector Di = (di1, di2, …, dit), where dij is the weight of the j-th term and dij = 0 when the j-th term is absent from document Di • Indexing of documents: the number of term occurrences in a document, the number of documents in which each term is present, or other measures • Query: Qj = (qj1, qj2, …, qjt)
Models / Techniques (cont.) • Vector Processing Model (cont.) • Similarity Computation: • Inner Product: sim(Di, Qj) = Σk dik · qjk • Cosine: the inner product normalized by the vector lengths, cos(Di, Qj) = Σk dik · qjk / (‖Di‖ · ‖Qj‖) • When applied to normalized vectors, the cosine measure yields the same ranking of similarities as the Euclidean distance
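A minimal sketch of the two similarity measures, reusing the documents of the Boolean example as hypothetical binary vectors over the terms (A, B, C):

```python
import math

def inner_product(d, q):
    # sim(D_i, Q) = sum_k d_ik * q_k
    return sum(dk * qk for dk, qk in zip(d, q))

def cosine(d, q):
    # inner product divided by the product of the vector lengths
    denom = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return inner_product(d, q) / denom if denom else 0.0

D1 = [1, 1, 0]   # binary term weights over (A, B, C)
D3 = [1, 1, 1]
Q = [1, 1, 0]
print(inner_product(D3, Q))      # 2
print(round(cosine(D1, Q), 3))   # 1.0
print(round(cosine(D3, Q), 3))   # 0.816
```

Unlike the Boolean model, D3 now receives a graded score rather than being rejected outright, so results can be ranked.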
Models / Techniques (cont.) • Vector Processing Model (cont.) • Drawbacks • Using indexing terms to define the dimensions of the space involves the incorrect assumption that the terms are orthogonal • Practical limitations: a discriminating ranking requires several query terms, whereas two or three ANDed terms are enough in Boolean models • Difficulty of explicitly specifying synonymic and phrasal relationships
Models / Techniques (cont.) • Probabilistic Model • Probability Ranking Principle: rank documents in order of decreasing probability of relevance to a user’s information need • Term-weight specification: what makes a good term is its selectivity, i.e. whether it can pick the few relevant documents out of the many non-relevant ones
Models / Techniques (cont.) • Probabilistic Model (cont.) • Collection Frequency: terms occurring in few documents are more valuable, CFW(i) = log(N/n) n: the number of documents term t(i) occurs in N: the number of documents in the collection • Term Frequency: terms occurring more often in a document are more likely to be important for that document
Models / Techniques (cont.) • Probabilistic Model (cont.) • Term Frequency: TF(i, j), the number of occurrences of term t(i) in document d(j) • Document Length: DL(j), the total number of term occurrences in document d(j)
Models / Techniques (cont.) • Probabilistic Model (cont.) • Normalized Document Length: NDL(j) = DL(j) / (average DL over the collection), used in evaluating Term Frequency • Combined Weight: the above weight measures combined for score calculation, CW(i, j) = CFW(i) · TF(i, j) · (k1+1) / (k1 · ((1−b) + b · NDL(j)) + TF(i, j)) k1(=2): affects the extent of Term Frequency’s influence b(=0.75): affects the extent of Document Length’s influence
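The combined weight can be sketched as follows; the formula matches the Robertson-style weighting described above (and in the referenced "Simple, proven approaches to text retrieval"), while the document and collection statistics in the example are hypothetical:

```python
import math

def combined_weight(tf, n, N, dl, avg_dl, k1=2.0, b=0.75):
    """Score contribution of one query term in one document.
    tf: term frequency in the document; n: documents containing the term;
    N: collection size; dl / avg_dl: document length and its average."""
    cfw = math.log(N / n)   # collection frequency weight: rarer is better
    ndl = dl / avg_dl       # normalised document length
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)

# Hypothetical term: occurs 3 times in an average-length document and
# appears in 100 of the 10,000 documents in the collection.
w = combined_weight(tf=3, n=100, N=10_000, dl=250, avg_dl=250)
print(round(w, 3))  # 8.289
```

The TF component saturates: doubling tf far above k1 adds little, so a document cannot win on sheer repetition of one term, while b penalises long documents.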
Models / Techniques (cont.) • Probabilistic Model (cont.) • Term-Weighting Components
Models / Techniques (cont.) • Probabilistic Model (cont.) • Typical term-weighting formulas
Models / Techniques (cont.) • Probabilistic Model (cont.) • Iterative searching: Term Reweighting / Query Expansion • Relevance weighting: relates a search term’s distribution in the relevant and non-relevant documents, RW(i) = log [ (r+0.5)(N−n−R+r+0.5) / ((n−r+0.5)(R−r+0.5)) ] r: the number of known relevant documents term t(i) appears in R: the number of known relevant documents for a request
Models / Techniques (cont.) • Probabilistic Model (cont.) • Iterative Combination: the combined weight recomputed with the relevance weight RW(i) in place of CFW(i) • Query Expansion: adding to a query new search terms taken from documents assessed as relevant
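The relevance weight with its usual 0.5 smoothing can be sketched directly from the definitions above; the counts in the example are hypothetical:

```python
import math

def relevance_weight(r, R, n, N):
    """Relevance weight with 0.5 smoothing, as defined on the slides.
    r: known relevant docs containing the term; R: known relevant docs;
    n: docs containing the term; N: collection size."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term seen in 8 of 10 known relevant documents but in only 100 of
# 10,000 documents overall is strongly reweighted upward.
print(round(relevance_weight(r=8, R=10, n=100, N=10_000), 2))
```

The smoothing keeps the weight finite in the edge cases r = 0 and r = R, which occur constantly in the first feedback iterations when R is small.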
Models / Techniques (cont.) • Cognitive Model • Focus on: • the user’s information-seeking behaviour • the ways in which IR systems are used in operational environments • Experiments on the way a user’s information needs may change during interaction with the IR system have led to more flexible interfaces
Evaluation of Results • Precision: proportion of retrieved documents that are relevant • Recall: proportion of relevant documents that are retrieved • Fallout: proportion of non-relevant documents that are retrieved • Generality: proportion of relevant documents within the entire collection
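All four measures follow directly from set counts; a minimal sketch with a hypothetical 10-document collection:

```python
def evaluate(retrieved, relevant, collection_size):
    """Precision, recall, fallout and generality from document sets."""
    hits = len(retrieved & relevant)            # relevant AND retrieved
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    non_relevant = collection_size - len(relevant)
    fallout = (len(retrieved) - hits) / non_relevant
    generality = len(relevant) / collection_size
    return precision, recall, fallout, generality

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
p, r, f, g = evaluate(retrieved, relevant, collection_size=10)
print(p, round(r, 2), round(f, 2), g)  # 0.5 0.67 0.29 0.3
```

Precision and recall pull in opposite directions (retrieving everything gives recall 1 but poor precision), which is why they are usually reported together.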
Evaluation of Results (cont.) • Example:
Evaluation of Results (cont.) • Precision-Recall graph
Evaluation of Results (cont.) • Three-point average precision: the precision averaged at three recall levels (typically 0.25, 0.50, 0.75) • Eleven-point average precision: the precision averaged at eleven recall levels (0.0, 0.1, …, 1.0)
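Both averages are usually computed with interpolated precision (the best precision at any recall at or above the given level); a sketch over hypothetical (recall, precision) points, with the three-point levels 0.25/0.50/0.75 assumed as above:

```python
def interp_precision(points, level):
    """Interpolated precision: max precision at recall >= level."""
    return max(p for r, p in points if r >= level)

# Hypothetical (recall, precision) points from one ranked result list.
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.38)]

three_pt = sum(interp_precision(points, l) for l in (0.25, 0.50, 0.75)) / 3
eleven_pt = sum(interp_precision(points, l / 10) for l in range(11)) / 11
print(round(three_pt, 3), round(eleven_pt, 3))  # 0.537 0.635
```

Interpolation makes the precision-recall curve monotonically non-increasing, so the averages are insensitive to local jitter in the ranking.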
Clustering • Non-Exclusive (Overlapping): e.g. Fuzzy Clustering, where each object has a degree of belongingness to each cluster • Exclusive • Extrinsic (Supervised) • Intrinsic (Unsupervised): Agglomerative / Divisive • Hierarchical: nested sequence of partitions • Partitional: single partition
Clustering (cont.) • Hierarchical: transformation of the proximity matrix (similarity/dissimilarity indices) into a sequence of nested partitions • Threshold graph G(v) for each dissimilarity level v: insert an edge (i, j) between nodes i and j if objects i and j are less dissimilar than v, i.e. (i, j) ∈ G(v) if and only if d(i, j) ≤ v
Clustering (cont.) • Single-Link Clustering Algorithm • Every object is placed in its own cluster (G(0)). Set k = 1. • G(k) formation: if the number of components (maximally connected subgraphs) in G(k) is less than the number of clusters in the current clustering, redefine the current clustering by naming each component of G(k) as a cluster. • If G(k) consists of a single connected graph, stop. Else, set k = k + 1 and go to the previous step.
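The procedure above can be sketched with a union-find structure over the edges taken in order of increasing dissimilarity, which is equivalent to growing the threshold graph G(v); the dissimilarity values in the example are hypothetical:

```python
def single_link(dissim, n):
    """Single-link merges for n objects.
    dissim: dict mapping object pairs (i, j) to dissimilarity values.
    Returns the merge levels, one per drop in the number of components."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    merges = []
    # Scan edges in increasing dissimilarity, i.e. G(v) for growing v.
    for (i, j), d in sorted(dissim.items(), key=lambda kv: kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge joins two components
            parent[ri] = rj
            merges.append((d, i, j))
        if len(merges) == n - 1:            # one connected graph: stop
            break
    return merges

d = {(0, 1): 1.0, (0, 2): 4.0, (1, 2): 2.5,
     (2, 3): 1.5, (1, 3): 3.0, (0, 3): 5.0}
print(single_link(d, 4))  # [(1.0, 0, 1), (1.5, 2, 3), (2.5, 1, 2)]
```

Note that the final merge happens at level 2.5, the smallest edge between the two components, even though other cross-component pairs are much farther apart; this is the chaining behaviour of single-link.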
Clustering (cont.) • Complete-Link Clustering Algorithm • Every object is placed in its own cluster (G(0)). Set k = 1. • G(k) formation: if two of the current clusters form a clique (maximally complete subgraph) in G(k), redefine the current clustering by merging these two clusters into a single cluster. • If k = n(n−1)/2, so that G(k) is the complete graph on the n nodes, stop. Else, set k = k + 1 and go to the previous step.
Clustering (cont.) • Example
Clustering (cont.) • Other Algorithms: • Hubert’s Algorithm for Single-Link and Complete-Link • Graph Theory Algorithm for Single-Link
Clustering (cont.) • Matrix Updating Algorithms for Single-Link and Complete-Link • Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0. • Find the least dissimilar pair of clusters in the current clustering, {(r), (s)}, according to d[(r), (s)] = min {d[(i), (j)]} • Set m = m + 1. Merge clusters (r) and (s). Set the level to L(m) = d[(r), (s)] • Update the proximity matrix by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster.
Clustering (cont.) • Matrix Updating Algorithms for Single-Link and Complete-Link (cont.) • The proximity between the new cluster (r, s) and an old cluster (k) is defined as follows: d[(k), (r, s)] = min {d[(k), (r)], d[(k), (s)]} (single-link) d[(k), (r, s)] = max {d[(k), (r)], d[(k), (s)]} (complete-link) • Generalized Formula: d[(k), (r, s)] = αr d[(k), (r)] + αs d[(k), (s)] + β d[(r), (s)] + γ |d[(k), (r)] − d[(k), (s)]|
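A sketch of the generalized update, showing how the coefficient choices αr = αs = 1/2, β = 0, γ = ∓1/2 reduce it to the single-link (min) and complete-link (max) rules; the distance values are hypothetical:

```python
def lance_williams(d_kr, d_ks, d_rs, a_r, a_s, beta, gamma):
    """Generalized update: distance from old cluster k to merged (r, s)."""
    return (a_r * d_kr + a_s * d_ks + beta * d_rs
            + gamma * abs(d_kr - d_ks))

# d[(k),(r)] = 3.0, d[(k),(s)] = 5.0, d[(r),(s)] = 2.0
# Single-link:   a_r = a_s = 1/2, beta = 0, gamma = -1/2  ->  min
# Complete-link: a_r = a_s = 1/2, beta = 0, gamma = +1/2  ->  max
print(lance_williams(3.0, 5.0, 2.0, 0.5, 0.5, 0.0, -0.5))  # 3.0 (= min)
print(lance_williams(3.0, 5.0, 2.0, 0.5, 0.5, 0.0, 0.5))   # 5.0 (= max)
```

The identity works because (x + y)/2 − |x − y|/2 = min(x, y) and (x + y)/2 + |x − y|/2 = max(x, y); other coefficient choices yield UPGMA, WPGMA, and the centroid methods listed on the next slide.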
Clustering (cont.) • Coefficient Values for Matrix Updating Algorithms
Clustering (cont.) • Main characteristics of the methods • Single-link: merges on the closest pair of objects, producing clusters with little homogeneity (chaining) • Complete-link (more conservative): merges on the most distant pair, producing compact clusters that may not be well separated • UPGMA: weights the contribution of each object equally, taking into account the sizes of the clusters • WPGMA: weights objects in small clusters more heavily than objects in large clusters • UPGMC / WPGMC: • the proximity measure is the Euclidean distance • geometric interpretation: the distance between cluster centroids
Clustering (cont.) • Example
References • Manning C.D. and Schütze H., Foundations of Statistical Natural Language Processing, MIT Press, 1999. • Sparck Jones K. and Willett P., Readings in Information Retrieval, Morgan Kaufmann Publishers, San Francisco, California, 1997. • Salton G., Wong A., and Yang C.S., “A vector space model for automatic indexing,” Communications of the ACM, vol. 18, pp. 613–620, 1975. • Salton G. and Buckley C., “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
References (cont.) • Salton G., “The SMART environment for retrieval system evaluation: advantages and problem areas,” in K. Sparck Jones (Ed.), Information Retrieval Experiment, pp. 316–329, 1981. • Robertson S.E., “The probability ranking principle in IR,” Journal of Documentation, vol. 33, pp. 126–148, 1977. • Robertson S.E. and Sparck Jones K., “Simple, proven approaches to text retrieval,” TR 356, Cambridge University Computer Laboratory, 1997. • Jain A.K. and Dubes R.C., Algorithms for Clustering Data, Prentice-Hall, 1988.