An Interface for navigating clustered document sets returned by queries Robert Allen, Pascal Obry & Michael Littman presented by: Anna Hunecke
Overview • Information Retrieval • The Interface • The procedure • Results • Discussion
Information Retrieval (IR) • find the information relevant to a user query in a database • the results can be documents, text, images, ... • for IR, a mathematical model of the database is created • example: web search engines
Problems of IR • Synonymy: one concept can be described by several words • Polysemy: one word has more than one meaning • -> Latent Semantic Indexing (LSI) is meant to address both problems
Idea of the Interface • create an interface that answers user queries by returning the most relevant articles from a corpus • results are represented by a dendrogram rather than by a linear list of similar articles • the dendrogram is created via hierarchical clustering
Dendrogram • a tree in which the similarity between leaves is shown by the height of the connection that joins them: the lower two leaves are joined, the more similar they are • [example dendrogram shown on the slide]
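To make the "height of the connection" idea concrete, here is a minimal sketch using SciPy's hierarchical-clustering routines. This is purely illustrative, not the paper's implementation; the toy points are an assumption chosen so that two tight pairs merge low in the tree and join each other only at a large height.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data (an assumption): two tight pairs of points, far apart from each other.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Ward linkage matrix: each row Z[i] = [cluster_a, cluster_b, merge_height, new_size].
Z = linkage(points, method="ward")

# Similar leaves join at a low height; dissimilar clusters join high in the tree.
tree = dendrogram(Z, no_plot=True)
```

The last merge height in `Z` is much larger than the first, which is exactly what the dendrogram visualises: apparent similarity decreases as you move up the tree.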
The Interface • components: user query field • dendrogram • subtree view • document lists • lists of proximal documents • text window
The Corpus • 25,629 articles from the Academic American Encyclopaedia • articles were preprocessed using latent semantic indexing (LSI) • cross-reference links were not used
How does the retrieval work? • the user query is processed using LSI • the results are gathered in a return set of 400 articles (the top 1.6% of the corpus) • these are clustered using Ward’s algorithm
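The return-set step can be sketched as scoring every article vector against the query vector and keeping the 400 best matches. This is a sketch under assumptions: the random vectors and the dimensionality `k = 100` are placeholders for the real LSI vectors, which the paper does not specify in detail.

```python
import numpy as np

# Placeholder LSI space (assumption): 25,629 article vectors of dimension 100.
rng = np.random.default_rng(1)
n_articles, k = 25629, 100
article_vecs = rng.normal(size=(n_articles, k))
query_vec = rng.normal(size=k)

# Cosine similarity of every article to the query.
norms = np.linalg.norm(article_vecs, axis=1) * np.linalg.norm(query_vec)
sims = article_vecs @ query_vec / norms

# Indices of the 400 highest-scoring articles (~top 1.6% of the corpus).
return_set = np.argsort(sims)[::-1][:400]
```

Only these 400 vectors are then passed on to the clustering step, which keeps the quadratic-cost clustering tractable.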
Latent Semantic Analysis (LSA) • Idea: determine the similarity of meaning of words and passages in large corpora • applications: • model human conceptual knowledge: • synonym tests • calculating text coherence • ... • information retrieval (-> LSI)
LSI • documents are converted into a term-by-document matrix A, whose entries count how often each word occurs in each document • the raw counts are transformed so that they express each word’s importance and the information it carries • this is done using an entropy-based weighting
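The entropy-based weighting commonly used in LSI is log-entropy: a local weight log(1 + count) scaled by a global weight 1 + H_i / log(n), where H_i is the entropy term of word i over documents. The slide does not spell out the exact formula, so this sketch assumes the standard variant; the toy count matrix is also an assumption.

```python
import numpy as np

# Hypothetical toy term-by-document count matrix (3 terms x 4 documents);
# the paper's actual matrix covers 25,629 encyclopaedia articles.
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 3, 0, 2],
], dtype=float)

def log_entropy_weight(counts):
    """Standard log-entropy weighting (assumed variant):
    local weight log(1 + c_ij), global weight 1 + sum_j p_ij log p_ij / log n,
    where p_ij = c_ij / sum_j c_ij and n is the number of documents."""
    n_docs = counts.shape[1]
    row_sums = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Treat 0 * log 0 as 0 when computing the entropy term.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]

A = log_entropy_weight(counts)
```

Note how the second term, which occurs uniformly in every document, gets a global weight of zero: a word that appears everywhere carries no information for distinguishing documents, which is exactly what "express the word's importance" means here.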
LSI • next, the weighted matrix is factored using singular value decomposition (SVD) and truncated to its largest singular values • this compresses the large matrix into (comparatively) small document vectors
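The SVD compression step can be sketched as follows. The matrix sizes and the choice of k = 5 are assumptions for illustration; LSI systems of this era typically kept on the order of 100-300 dimensions.

```python
import numpy as np

# Placeholder weighted term-by-document matrix (assumption: 50 terms x 20 docs).
rng = np.random.default_rng(0)
A = rng.random((50, 20))
k = 5  # reduced dimensionality (toy value)

# Full SVD, then truncation to the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document becomes a k-dimensional vector: the columns of diag(s_k) @ Vt_k.
doc_vectors = (s_k[:, None] * Vt_k).T  # shape: (20 documents, k dimensions)
```

The truncated factorisation is the best rank-k approximation of A in the least-squares sense, so the small vectors preserve as much of the matrix as k dimensions allow.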
LSI • the documents are now modelled as vectors in the reduced space • the query is also modelled as a vector in the same space • similarity between vectors reflects which documents are related • the similarity measure is the cosine of the angle between the vectors
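Cosine-based retrieval in the reduced space can be sketched like this. The three-dimensional toy vectors and document names are assumptions purely for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy document vectors in a reduced LSI-like space (illustrative values only).
docs = {
    "doc_a": np.array([0.9, 0.1, 0.0]),
    "doc_b": np.array([0.2, 0.8, 0.1]),
    "doc_c": np.array([0.85, 0.2, 0.05]),
}
query = np.array([1.0, 0.0, 0.0])

# Rank documents by cosine similarity to the query vector.
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Because cosine ignores vector length, a long document and a short one pointing in the same direction score alike, which is the property that makes it a natural similarity measure here.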
Ward’s algorithm (Ward 1963) • start: 400 sets, each containing one document vector • repeat: • find the two most similar sets according to the similarity measure and unite them • the similarity measure is the cosine of the vectors • for sets with more than one element, it is the cosine of the average vectors • stop when all documents are in one set
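The loop above can be sketched directly. Note a caveat: the classic Ward criterion merges the pair that minimises the increase in within-cluster variance, whereas the slide describes merging by the cosine of the clusters' average vectors; this sketch follows the slide's description. The toy vectors are an assumption.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def agglomerate(vectors):
    """Agglomerative clustering as described on the slide: repeatedly merge
    the two clusters whose average vectors have the highest cosine.
    Returns the merge history as a list of (members_a, members_b) tuples."""
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best, best_sim = None, -2.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = np.mean([vectors[k] for k in clusters[i]], axis=0)
                cj = np.mean([vectors[k] for k in clusters[j]], axis=0)
                sim = cosine(ci, cj)
                if sim > best_sim:
                    best, best_sim = (i, j), sim
        i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Toy document vectors; in the paper there are 400 LSI vectors per query.
vecs = [np.array([1.0, 0.0]), np.array([0.95, 0.1]), np.array([0.0, 1.0])]
history = agglomerate(vecs)
```

The merge history is exactly the information a dendrogram draws: earlier merges sit lower in the tree. Recomputing all pairwise centroid cosines each round makes this O(n^3), which is consistent with the "computational costs are very high" criticism later in the talk.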
Does the clustering improve the result from LSI? • informal results: • sometimes clustering does not improve retrieval: • LSI does not return relevant articles to cluster • LSI already returns good results on its own • clustering improves results if the query consists of several parts or includes ambiguous terms
Points of criticism • users need about 20 minutes to get used to the interface • the computational cost of the clustering is very high • the paper does not make a clear statement about the clustering features • the paper does not give a proper evaluation
Limitations • for effective search, the corpus must have entries with clear titles or descriptions • clusters are described only by a list of the documents they contain • the usefulness of clustering depends strongly on the corpus and the query
Discussion • alternatives to hierarchical clustering? • alternatives to the average-vector similarity? • analysing the query in order to decide whether clustering should be conducted at all? • other applications?