Mapping document collections in non-standard geometries

Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw

Agenda
• Motivation
• Our approach
• Architecture
• User interface
• Visualization
• Map creation
• Clustering
• Experimental results
• Future directions
Mining Document Maps

Motivation
• The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate
• A good way of presenting massive document sets in an understandable form will be crucial in the near future
• The BEATCA project aims at creating a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)

Our approach
• The presentation method is based on WebSOM's map idea and is enriched with novel methods of document analysis, clustering and visualization
• A special architecture has been developed to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms
• BEATCA: Bayesian Evolutionary Approach to Text Connectivity Analysis

BEATCA architecture
• Documents are prepared by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation
• The indexer also identifies frequent phrases in the document set for clustering and labelling purposes
• Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded
• The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation
• The "best" map (with respect to some similarity measure) is used by the query processor in response to the user's query
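
The dictionary optimization step can be illustrated with a minimal sketch. The slides do not say how entropy is computed or which cut-offs are used, so the normalized-entropy definition, the function names `term_entropy` and `optimize_dictionary`, and the thresholds `low`, `high` and `max_df` are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_entropy(counts):
    """Normalized entropy (0..1) of a term's distribution over documents."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def optimize_dictionary(docs, low=0.05, high=0.95, max_df=0.9):
    """docs: list of token lists. Returns the set of retained terms.

    Thresholds are hypothetical; a term is dropped if its entropy is
    extreme (too low or too high) or if it occurs in too many documents.
    """
    n = len(docs)
    tfs = [Counter(d) for d in docs]
    vocab = set().union(*docs) if docs else set()
    kept = set()
    for term in vocab:
        counts = [tf[term] for tf in tfs]
        df = sum(1 for c in counts if c > 0) / n
        if low <= term_entropy(counts) <= high and df <= max_df:
            kept.add(term)
    return kept
```

With these assumed thresholds, a term occurring evenly in every document and a term occurring in only one document are both removed, while terms concentrated in a few documents are kept.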

BEATCA architecture (figure)

User interface
• Search results are presented on a document map
• Compact (fuzzy) topical areas are extracted
• Query-related summaries are generated on-line
• Maps can have one of the following topologies:
  • the traditional flat map (square or hexagonal cells)
  • rotating 3D map (torus, sphere, cylinder)
  • hyperbolic map (Poincaré or Klein projections)
  • growing map (Growing Neural Gas)
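
One place where the choice of topology could matter is the distance between map cells. The sketch below contrasts a flat square grid with a torus, where the grid wraps around at the edges. It is an illustration only: the slides do not describe how BEATCA computes cell distances, and `flat_distance` and `torus_distance` are hypothetical names.

```python
import math


def flat_distance(a, b):
    """Euclidean distance between cells a and b on a flat square grid."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def torus_distance(a, b, rows, cols):
    """Distance on a torus: each axis wraps, so take the shorter way round."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return math.hypot(min(dr, rows - dr), min(dc, cols - dc))
```

On a 10 x 10 map, cells (0, 0) and (0, 9) are far apart on the flat map but direct neighbours on the torus.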

User interface (figure)

Map visualizations in 3D (figure)

Kohonen learning overview
• Unsupervised-learning neural network model
• Each neuron is represented by a reference vector in document space
• Each vector element (term dimension) holds a TF×IDF weight
• Iterative regression of reference vectors onto the document vector space
• Similarity is computed as the cosine of the angle between the corresponding vectors
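
A minimal sketch of one Kohonen learning step over sparse vectors follows. The TF×IDF weighting and cosine similarity come from the slide; the Gaussian neighbourhood function, the logarithmic IDF form, and the names `tfidf`, `find_winner` and `update` are assumptions, not details taken from BEATCA.

```python
import math


def tfidf(doc_tokens, doc_freq, n_docs):
    """TF x IDF weights for one document as a sparse dict {term: weight}."""
    tf = {}
    for t in doc_tokens:
        tf[t] = tf.get(t, 0) + 1
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def find_winner(doc_vec, cells):
    """cells: {(row, col): reference vector}. Returns the best-matching cell."""
    return max(cells, key=lambda pos: cosine(doc_vec, cells[pos]))


def update(cells, doc_vec, winner, alpha, radius):
    """Move reference vectors near the winner toward the document vector."""
    for (r, c), ref in cells.items():
        d2 = (r - winner[0]) ** 2 + (c - winner[1]) ** 2
        h = math.exp(-d2 / (2.0 * radius ** 2))  # Gaussian neighbourhood (assumed)
        for t in set(ref) | set(doc_vec):
            old = ref.get(t, 0.0)
            ref[t] = old + alpha * h * (doc_vec.get(t, 0.0) - old)
```

Repeating `find_winner` and `update` over all documents, while shrinking `alpha` and `radius`, gives the iterative regression named on the slide.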

How the maps are created
• A modified WebSOM method is used:
  • compact reference vector representation
  • broad-topic initialization method
  • joint winner search method
  • multi-level (hierarchical) maps
  • three-phase document clustering:
    • initial grouping via PLSA/PHITS
    • WebSOM on document groups
    • fuzzy cell cluster extraction and labelling

Reference vector representation
• Vectors are sparse by nature
• During the learning process they become even sparser
• Represented as balanced red-black trees
• A tolerance threshold is imposed
• Terms (dimensions) below the threshold are removed
• Significant complexity reduction without negative impact on quality
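
The thresholding idea can be sketched in a few lines. Note the substitution: the slide stores reference vectors in balanced red-black trees, whereas this sketch uses a plain Python dict purely for brevity. The function name `prune` and the use of absolute weight as the pruning criterion are assumptions.

```python
def prune(ref, tolerance):
    """Drop dimensions whose absolute weight falls below the tolerance."""
    return {t: w for t, w in ref.items() if abs(w) >= tolerance}
```

Applied after each learning step, this keeps the number of non-zero dimensions per cell small, which is where the complexity reduction would come from.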

Topic-sensitive initialization
• Inter-topic similarities are important both for map learning and for visualization and cluster extraction
• Simple approach:
  • Use LSI to select K main broad topics
  • Select K map cells (evenly spread over the map) as the fixpoints for individual topics
  • Initialize the selected fixpoints with the broad topics
  • Initialize the remaining cells with "in-between" values
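
The last two steps of this approach can be sketched as follows. The slide does not define the "in-between" values, so the inverse-squared-distance interpolation used here is an assumption, as is the name `init_map`. The LSI step is not shown; the topic vectors are taken as given.

```python
def init_map(rows, cols, fixpoints):
    """fixpoints: {(row, col): topic vector as {term: weight}}.

    Fixpoint cells copy their topic vector; every other cell gets an
    inverse-squared-distance weighted mix of all topic vectors (assumed).
    """
    cells = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in fixpoints:
                cells[(r, c)] = dict(fixpoints[(r, c)])
                continue
            weights = {p: 1.0 / ((r - p[0]) ** 2 + (c - p[1]) ** 2)
                       for p in fixpoints}
            total = sum(weights.values())
            vec = {}
            for p, w in weights.items():
                for t, x in fixpoints[p].items():
                    vec[t] = vec.get(t, 0.0) + (w / total) * x
            cells[(r, c)] = vec
    return cells
```

A cell halfway between two fixpoints then starts as an equal blend of the two topics, so neighbouring map areas begin with related content.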

Joint winner search
• Global winner search: accurate but slow
• Local winner search: faster, but can be inaccurate during rapid changes
• Start with a single phase of global search
• Document movements become smoother during the learning process: local search is usually enough
• Use global search when occasional sudden moves occur (e.g. outliers, a decrease in neighbourhood width)
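
A minimal sketch of combining the two searches is given below. The slide does not state how a "sudden move" is detected; falling back to global search when the best local similarity drops below `min_sim` is an assumed trigger, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def global_search(doc, cells):
    """Scan every cell: accurate but slow."""
    return max(cells, key=lambda p: cosine(doc, cells[p]))


def local_search(doc, cells, last, radius=1):
    """Scan only cells within `radius` of the document's previous winner."""
    near = [p for p in cells
            if abs(p[0] - last[0]) <= radius and abs(p[1] - last[1]) <= radius]
    return max(near, key=lambda p: cosine(doc, cells[p]))


def joint_search(doc, cells, last=None, min_sim=0.3):
    """Global search on the first pass, local search afterwards, with a
    global fallback when the local match is poor (assumed criterion)."""
    if last is None:
        return global_search(doc, cells)
    best = local_search(doc, cells, last)
    if cosine(doc, cells[best]) < min_sim:
        return global_search(doc, cells)
    return best
```

Here `last` is the cell the document was assigned to in the previous iteration; passing `None` corresponds to the initial global phase.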

Hierarchical maps
• Bottom-up approach
• Feasible (with the joint winner search method)
• Start with the most detailed map
• Compute weighted centroids of map areas: [formula]
• Use them as seeds for the coarser map
• A top-down approach is possible but requires fixpoints
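
The bottom-up step can be sketched as below. The centroid formula is not preserved in the slides, so weighting each cell by the number of documents assigned to it is purely an assumption, as are the fixed square areas and the names `weighted_centroid` and `coarsen`.

```python
def weighted_centroid(refs, weights):
    """Weighted mean of sparse reference vectors (dicts)."""
    total = sum(weights)
    out = {}
    if total == 0:
        return out
    for ref, w in zip(refs, weights):
        for t, x in ref.items():
            out[t] = out.get(t, 0.0) + w * x / total
    return out


def coarsen(cells, doc_counts, block=2):
    """Merge block x block areas of a detailed map into seed vectors
    for a coarser map. doc_counts: {(row, col): documents in that cell}."""
    groups = {}
    for (r, c) in cells:
        groups.setdefault((r // block, c // block), []).append((r, c))
    return {
        g: weighted_centroid([cells[p] for p in ps],
                             [doc_counts.get(p, 0) for p in ps])
        for g, ps in groups.items()
    }
```

The returned seeds would then initialize the reference vectors of the next, coarser level.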

Clustering document groups
• Numerous methods exist, but none of them is directly applicable:
  • Extremely fuzzy structure of topical groups in SOM cells
  • Necessity of taking into account similarity measures both in the original document space and in the map space
  • Outlier-handling problem during cluster formation
  • No a priori estimate of the number of topical groups
• Fuzzy C-Means applied on the lattice of map cells
• Graph-theoretical approach (density- and distance-based MST) combined with fuzzy clustering
• Clustered documents are labelled by weighted centroids of cell reference vectors, scaled by between-group entropy
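
For reference, the membership update of standard Fuzzy C-Means can be sketched as follows. This is the textbook rule only: how BEATCA combines document-space and map-space distances, and how it couples the result with the MST-based approach, is not described in the slides. The name `memberships` and the default fuzzifier `m=2.0` are assumptions.

```python
def memberships(distances, m=2.0):
    """Fuzzy C-Means memberships of one cell, given its distances to each
    cluster centre. m > 1 is the fuzzifier; memberships sum to 1."""
    if any(d == 0 for d in distances):
        # The cell coincides with one or more centres: split membership among them.
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    inv = [d ** (-2.0 / (m - 1.0)) for d in distances]
    s = sum(inv)
    return [x / s for x in inv]
```

A cell equidistant from two centres belongs to each with degree 0.5, which is the kind of soft assignment that suits fuzzy topical areas.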

Experiments with map convergence
• We examined the convergence of the maps to a stable state depending on:
  • type of alpha function (search radius reduction)
  • type of winner search method
  • type of initialization method

Convergence – alpha functions (figure)

Convergence – winner search (figure)

Experiments with execution time
• The impact of the following factors on the speed of map creation was investigated:
  • Map size (total number of cells)
  • Optimization methods:
    • dictionary optimization
    • reference vector representation
• Map quality assessment:
  • Compare with an "ideal" map (e.g. one created without optimizations)
  • Identical initialization and learning parameters
  • Compute the sum of squared distances between the locations of each document on the two maps
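
The quality measure described in the last bullet can be written down directly. This sketch assumes each map is given as a dict from document id to its (row, col) cell and that distances are Euclidean on a flat grid; the name `map_distance` is hypothetical.

```python
def map_distance(placement_a, placement_b):
    """Sum over documents of the squared distance between the cell a
    document occupies on map A and the cell it occupies on map B."""
    return sum(
        (placement_a[d][0] - placement_b[d][0]) ** 2
        + (placement_a[d][1] - placement_b[d][1]) ** 2
        for d in placement_a
    )
```

A value of 0 means the optimized map places every document exactly where the "ideal" map does; larger values indicate a greater loss of quality.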

Execution time - map size (figure)

Execution time - optimizations (figure)

Future research
• Maps for a joint term-citation model, taking into account the direction of between-group link flow
• Fully distributed map creation
• Adaptive document retrieval and clustering:
  • Bayesian-network-based relevance measure
  • Survival models for document update rate estimation
  • Dead-link propagation methods for page freshness estimation
• We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects

Future research
• Bayesian networks will be applied in particular to:
  • measure relevance and classify documents
  • accelerate document clustering processes
  • construct a thesaurus supporting query enrichment
  • extract keywords
  • estimate between-topic dependencies

Thank you! Any questions?
