An Interface for navigating clustered document sets returned by queries Robert Allen, Pascal Obry & Michael Littman presented by: Anna Hunecke
Overview • Information Retrieval • The Interface • The procedure • Results • Discussion
Information Retrieval (IR) • find the information relevant to a user query in a database • the results can be documents, text, images, ... • for IR, a mathematical model of the database is created • example: web search engines
Problems of IR • Synonymy: one concept can be described by several words • Polysemy: one word has more than one meaning • -> Latent Semantic Indexing (LSI) is meant to address both problems
Idea of the Interface • create an interface that answers user queries by returning the most relevant articles from a corpus • results are represented by a dendrogram rather than by a linear list of similar articles • the dendrogram is created via hierarchical clustering
Dendrogram • a tree in which the similarity between leaves is shown by the height of the connection that joins them: the lower two leaves are joined, the more similar they are • [example dendrogram shown on the slide]
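To make the "height of the connection" idea concrete, here is a minimal sketch using SciPy's hierarchical-clustering routines. This is purely illustrative, not the paper's implementation; the toy points are an assumption chosen so that two tight pairs merge low in the tree and join each other only at a large height.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data (an assumption): two tight pairs of points, far apart from each other.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Ward linkage matrix: each row Z[i] = [cluster_a, cluster_b, merge_height, new_size].
Z = linkage(points, method="ward")

# Similar leaves join at a low height; dissimilar clusters join high in the tree.
tree = dendrogram(Z, no_plot=True)
```

The last merge height in `Z` is much larger than the first, which is exactly what the dendrogram visualises: apparent similarity decreases as you move up the tree.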
The Interface • components: user query field • dendrogram • subtree view • document lists • lists of proximal documents • text window
The Corpus • 25,629 articles from the Academic American Encyclopaedia • articles were preprocessed using latent semantic indexing (LSI) • cross-reference links were not used
How does the retrieval work? • the user query is processed using LSI • the results are gathered in a return set of 400 articles (the top 1.6% of the corpus) • these are clustered using Ward’s algorithm
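The return-set step can be sketched as scoring every article vector against the query vector and keeping the 400 best matches. This is a sketch under assumptions: the random vectors and the dimensionality `k = 100` are placeholders for the real LSI vectors, which the paper does not specify in detail.

```python
import numpy as np

# Placeholder LSI space (assumption): 25,629 article vectors of dimension 100.
rng = np.random.default_rng(1)
n_articles, k = 25629, 100
article_vecs = rng.normal(size=(n_articles, k))
query_vec = rng.normal(size=k)

# Cosine similarity of every article to the query.
norms = np.linalg.norm(article_vecs, axis=1) * np.linalg.norm(query_vec)
sims = article_vecs @ query_vec / norms

# Indices of the 400 highest-scoring articles (~top 1.6% of the corpus).
return_set = np.argsort(sims)[::-1][:400]
```

Only these 400 vectors are then passed on to the clustering step, which keeps the quadratic-cost clustering tractable.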
Latent Semantic Analysis (LSA) • Idea: determine the similarity of meaning of words and passages in large corpora • applications: • model human conceptual knowledge: • synonym tests • calculating text coherence • ... • information retrieval (-> LSI)
LSI • documents are converted into a term-by-document matrix A, whose entries count how often each word occurs in each document • the raw counts are transformed so that they express each word’s importance and the information it carries • this is done using an entropy-based weighting
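The entropy-based weighting commonly used in LSI is log-entropy: a local weight log(1 + count) scaled by a global weight 1 + H_i / log(n), where H_i is the entropy term of word i over documents. The slide does not spell out the exact formula, so this sketch assumes the standard variant; the toy count matrix is also an assumption.

```python
import numpy as np

# Hypothetical toy term-by-document count matrix (3 terms x 4 documents);
# the paper's actual matrix covers 25,629 encyclopaedia articles.
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 3, 0, 2],
], dtype=float)

def log_entropy_weight(counts):
    """Standard log-entropy weighting (assumed variant):
    local weight log(1 + c_ij), global weight 1 + sum_j p_ij log p_ij / log n,
    where p_ij = c_ij / sum_j c_ij and n is the number of documents."""
    n_docs = counts.shape[1]
    row_sums = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Treat 0 * log 0 as 0 when computing the entropy term.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]

A = log_entropy_weight(counts)
```

Note how the second term, which occurs uniformly in every document, gets a global weight of zero: a word that appears everywhere carries no information for distinguishing documents, which is exactly what "express the word's importance" means here.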
LSI • next, the weighted matrix is factored using singular value decomposition (SVD) and truncated to its largest singular values • this compresses the large matrix into (comparatively) small document vectors
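The SVD compression step can be sketched as follows. The matrix sizes and the choice of k = 5 are assumptions for illustration; LSI systems of this era typically kept on the order of 100-300 dimensions.

```python
import numpy as np

# Placeholder weighted term-by-document matrix (assumption: 50 terms x 20 docs).
rng = np.random.default_rng(0)
A = rng.random((50, 20))
k = 5  # reduced dimensionality (toy value)

# Full SVD, then truncation to the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document becomes a k-dimensional vector: the columns of diag(s_k) @ Vt_k.
doc_vectors = (s_k[:, None] * Vt_k).T  # shape: (20 documents, k dimensions)
```

The truncated factorisation is the best rank-k approximation of A in the least-squares sense, so the small vectors preserve as much of the matrix as k dimensions allow.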
LSI • the documents are now modelled as vectors in the reduced space • the query is also modelled as a vector in the same space • similarity between vectors reflects which documents are related • the similarity measure is the cosine of the angle between the vectors
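Cosine-based retrieval in the reduced space can be sketched like this. The three-dimensional toy vectors and document names are assumptions purely for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy document vectors in a reduced LSI-like space (illustrative values only).
docs = {
    "doc_a": np.array([0.9, 0.1, 0.0]),
    "doc_b": np.array([0.2, 0.8, 0.1]),
    "doc_c": np.array([0.85, 0.2, 0.05]),
}
query = np.array([1.0, 0.0, 0.0])

# Rank documents by cosine similarity to the query vector.
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Because cosine ignores vector length, a long document and a short one pointing in the same direction score alike, which is the property that makes it a natural similarity measure here.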
Ward’s algorithm (Ward 1963) • start: 400 sets, each containing one document vector • repeat: • find the two most similar sets according to the similarity measure and unite them • the similarity measure is the cosine of the vectors • for sets with more than one element, it is the cosine of the average vectors • stop when all documents are in one set
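The loop above can be sketched directly. Note a caveat: the classic Ward criterion merges the pair that minimises the increase in within-cluster variance, whereas the slide describes merging by the cosine of the clusters' average vectors; this sketch follows the slide's description. The toy vectors are an assumption.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def agglomerate(vectors):
    """Agglomerative clustering as described on the slide: repeatedly merge
    the two clusters whose average vectors have the highest cosine.
    Returns the merge history as a list of (members_a, members_b) tuples."""
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best, best_sim = None, -2.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = np.mean([vectors[k] for k in clusters[i]], axis=0)
                cj = np.mean([vectors[k] for k in clusters[j]], axis=0)
                sim = cosine(ci, cj)
                if sim > best_sim:
                    best, best_sim = (i, j), sim
        i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Toy document vectors; in the paper there are 400 LSI vectors per query.
vecs = [np.array([1.0, 0.0]), np.array([0.95, 0.1]), np.array([0.0, 1.0])]
history = agglomerate(vecs)
```

The merge history is exactly the information a dendrogram draws: earlier merges sit lower in the tree. Recomputing all pairwise centroid cosines each round makes this O(n^3), which is consistent with the "computational costs are very high" criticism later in the talk.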
Does the clustering improve the result from LSI? • informal results: • sometimes clustering does not improve retrieval: • LSI does not return relevant articles to cluster • LSI already returns good results on its own • clustering improves results if the query consists of several parts or includes ambiguous terms
Points of criticism • users need about 20 minutes to get used to the interface • the computational cost of the clustering is very high • the paper does not make a clear statement about the clustering features • the paper does not give a proper evaluation
Limitations • for effective search, the corpus must have entries with clear titles or descriptions • clusters are described only by a list of the documents they contain • the usefulness of clustering depends strongly on the corpus and the query
Discussion • alternatives to hierarchical clustering? • alternatives to the average-vector similarity? • analysing the query in order to decide whether clustering should be conducted at all? • other applications?