Document Maps

Document Maps Slawomir Wierzchon , Mieczyslaw Klopotek Michal Draminski Krzysztof Ciesielski Mariusz Kujawiak Institute of Computer Science, Polish Academy of Sciences Warsaw Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"

Agenda • Motivation • What is a document map • Map creation • Clustering • Experimental results • Future directions

Motivation • The Web as well as intranets become increasingly content-rich: simple ranked lists or even hierarchies of results seem not to be adequate anymore • A good way of presenting massive document sets in an understandable form will be crucial in the near future

Document map • Many attempts have been made to visualize sets of dicuments not just like a list, but rather in two dimensions • A document map is a mapping of a set of documents to 2-D representing their inter-relationships

Linear relationship presentation(Internet Cartographer)

A relationship • A link between hypertext documents • Citation in the bibliography • Content similarity

A tree of relations with central subject (Inxight – Tree Studio)

Selforganizing map (WebSOM)dissimilarity of grouops of documents

Document frequency in clusters

A meta search engine map

Our approach – multiple representations (BEATCA)

Map visualizations in 3D (BEATCA)

Future research – hypergeometric representation (Fish-Eye eEffect)

The preparation of documents is done by an indexer, which turns a document into a vector-space model representation • Indexer also identifies frequent phrases in document set for clustering and labelling purposes • Subsequently, dictionary optimization is performed - extreme entropy and extremely frequent terms excluded • The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation • ‘The best’ (wrt some similarity measure) map is used bythe query processor in response to the user’s query

How are the maps created • A modified WebSOM method is used: • compact reference vectors representation • broad-topic initialization method • joint winner search method • multi-level (hierarchical) maps • multi-phase document clustering: • initial grouping to identify major topics • Initial document grouping • WEBSOM on document groups • fuzzy cell clusters extraction and labelling

Document model in search engines My dog likes this food dog • In the so-called vector model a document is considered as a vector in space spanned by the words it contains. food When walking, I take some food walk

Document model in search engines dog • The relevance of a document to a query or to another document is measured as cosine of angle between the query and the document. food Query: walk walk

Reference vector representation • Vectors are sparse by nature • During learning process they become even sparser • Represented as a balanced red-black trees • Tolerance threshold imposed • Terms (dimensions) below threshold are removed • Significant complexity reduction without negative quality impact

Topic-sensitive initialization • Inter-topic similarities important both for map learning and visualization/cluster extraction • Simple approach: • Use LSI to select K main broad topics • Select K map cells (evenly spread over the map) as the fixpoints for individual topics • Initialize selected fixpoints with broad topics • Initialize remaining cells with „in-between values”

Clustering document vectors r x m Mocna zmiana położenia (gruba strzałka) Document space 2D map Important difference to general clustering: not only clusters with similar documents, but also neighboring clusters similar

Joint winner search • Global winner search: accurate but slow • Local winner search: faster but can be inaccurate during rapid changes • Start with single phase of global search • Document movements become more smooth during learning process: usually local search is enough • Use global search when occassional sudden moves occur (eg. outliers, neighbourhood width decrease)

Hierarchical maps • Bottom-up approach • Feasible (with joint winner search method) • Start with most detailed map • Compute weighted centroids of map areas • Use them as seeds for coarser map • Top-down approach is possible but requires fixpoints

Clustering document groups • Numerous methods exists but none of them directly applicable: • Extremely fuzzy structure of topical groups in SOM cells • Neccesity of taking into account similiarity measures both in original document space and in the map space • Outlier-handling problem during cluster formation • No a priori estimation of the number of topical groups • Fuzzy C-MEANS on lattice of map cells applied • Graph theoretical approach (density- and distance- based MST) combined with fuzzy clustering • Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy

Experiments with map convergence • We examined the convergence of the maps to a stable state depending on: • type of alpha function (search radius reduction) • type of winner search method • type of initialization method

Convergence – alpha functions (linear versus reciprocal)

Convergence – winner search (joint versus local)

Experiments with execution time • The impact of the following factors on the speed of map creation was investigated: • Map size (total number of cells) • Optimization methods: • dictionary optimization • reference vector representation • Map quality assessment: • Compare with ‘ideal’ map (e.g. without optimizations) • Identical initialization and learning parameters • Compute sum of squared distances of location of each document on both maps

Execution time - map size

Execution time - optimizations

Future research • Maps for joint term-citation model, taking into account between-group link flow direction • Fully distributed map creation • Adaptive document retrieval and clustering: • Bayesian network based relevance measure • Survival models for document update rate estimation • Dead link propagation methods for page freshness estimation • We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects

Future research • Bayesian networks will be applied in particular to: • measure relevance and classify documents • accelerate document clustering processes • construct a thesaurus supporting query enrichment • keyword extraction • between-topic dependencies estimation

Thank you! Any questions?

Document Maps

Document Maps

Presentation Transcript

Karnaugh Maps (K-Maps)

Maps

MAPs

Karnaugh Maps (K maps)

Maps

MAPS

Maps

MAPS

Maps

Maps

MAPS

Multilingual document mining and navigation using self-organizing maps

Maps

MAPS

Story Maps Story Maps Story Maps

Maps

Maps, Maps, Maps!

Maps, Maps, Maps

MAPS

Maps

Document Organization using Self – Organizing Feature Maps (WEBSOFM)

Maps