Mapping document collections in non-standard geometries. Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak. Institute of Computer Science, Polish Academy of Sciences, Warsaw
Agenda • Motivation • Our approach • Architecture • User interface • Visualization • Map creation • Clustering • Experimental results • Future directions Mining Document Maps
Motivation • The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate • A good way of presenting massive document sets in an understandable form will be crucial in the near future • The BEATCA project aims to create a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)
Our approach • The presentation method is based on WebSOM's map idea and is enriched with novel methods of document analysis, clustering and visualization • A special architecture has been developed to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms • BEATCA = Bayesian Evolutionary Approach to Text Connectivity Analysis
BEATCA architecture • Documents are prepared by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation • The indexer also identifies frequent phrases in the document set for clustering and labelling purposes • Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded • The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation • The best map (with respect to some similarity measure) is used by the query processor in response to the user's query
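The indexing and dictionary-optimization steps on the slide above can be sketched roughly as follows. This is a minimal illustration, not BEATCA's indexer: the function name `tfidf_vectors`, the `max_df` cut-off and the plain `tf * log(N / df)` weighting are assumptions, and the entropy-based filter mentioned on the slide is not modelled.

```python
import math
from collections import Counter

def tfidf_vectors(docs, max_df=0.9):
    """Turn tokenized documents into sparse {term: tf*idf} vectors.

    docs   -- list of token lists (tokenization is assumed to be done already)
    max_df -- hypothetical cut-off: terms occurring in more than this
              fraction of the documents are dropped from the dictionary
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: count * math.log(n / df[term])
            for term, count in tf.items()
            if df[term] / n <= max_df   # skip extremely frequent terms
        })
    return vectors

docs = [["the", "map", "map"], ["the", "bayes"], ["the", "map", "som"]]
vecs = tfidf_vectors(docs)
```

Here "the" occurs in every document and is therefore excluded, while "map" keeps a weight proportional to its in-document count.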
[Figure: BEATCA architecture]
User interface • Search results are presented on a document map • Compact (fuzzy) topical areas are extracted • Query-related summaries are generated on-line • Maps can have one of the following topologies: • the traditional flat map (square or hexagonal cells) • rotating 3D map (torus, sphere, cylinder) • hyperbolic map (Poincaré or Klein projections) • growing map (Growing Neural Gas)
[Figure: User interface]
[Figure: Map visualizations in 3D]
Kohonen learning overview • Unsupervised learning neural network model • Each neuron is represented by a reference vector in the document space • Each vector element (term dimension) equals TFxIDF • Iterative regression of reference vectors onto the document vector space • Similarity is computed as the cosine of the angle between the corresponding vectors
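A single learning step of this kind can be sketched as below. The sparse `{term: weight}` dictionaries, the function names and the fixed `alpha` are illustrative assumptions; for brevity only the winning cell is updated, whereas a full Kohonen map also moves the winner's neighbours, scaled by a neighbourhood function.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def find_winner(doc, cells):
    """Index of the cell whose reference vector is most similar to doc."""
    return max(range(len(cells)), key=lambda i: cosine(doc, cells[i]))

def update(cell, doc, alpha):
    """Move a reference vector a fraction alpha of the way towards doc (in place)."""
    for term in set(cell) | set(doc):
        old = cell.get(term, 0.0)
        cell[term] = old + alpha * (doc.get(term, 0.0) - old)

cells = [{"map": 1.0, "som": 0.5}, {"bayes": 1.0, "network": 0.8}]
doc = {"bayes": 0.9, "network": 0.4, "map": 0.1}
winner = find_winner(doc, cells)
update(cells[winner], doc, alpha=0.5)
```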
How the maps are created • A modified WebSOM method is used: • compact reference vector representation • broad-topic initialization method • joint winner search method • multi-level (hierarchical) maps • three-phase document clustering: • initial grouping via PLSA/PHITS • WebSOM on document groups • fuzzy cell cluster extraction and labelling
Reference vector representation • Vectors are sparse by nature • During the learning process they become even sparser • Represented as balanced red-black trees • A tolerance threshold is imposed • Terms (dimensions) below the threshold are removed • Significant complexity reduction without negative impact on quality
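The thresholding idea can be illustrated with a few lines of code. Note the substitution: a plain Python dictionary stands in for the red-black tree named on the slide, and the function name `prune` and the threshold value are hypothetical.

```python
def prune(vector, threshold):
    """Return a copy of a sparse {term: weight} vector without the entries
    whose absolute weight is below the tolerance threshold."""
    return {t: w for t, w in vector.items() if abs(w) >= threshold}

ref = {"neural": 0.62, "map": 0.40, "the": 0.004, "page": 0.0009}
pruned = prune(ref, threshold=0.01)
```

Only the two dominant terms survive, so later similarity computations touch two dimensions instead of four.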
Topic-sensitive initialization • Inter-topic similarities are important both for map learning and for visualization/cluster extraction • Simple approach: • Use LSI to select K main broad topics • Select K map cells (evenly spread over the map) as the fixpoints for individual topics • Initialize the selected fixpoints with the broad topics • Initialize the remaining cells with "in-between" values
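The slide does not say how the "in-between" values are computed. One plausible reading, shown purely as an assumption, is to blend the fixpoint vectors with weights that fall off with squared grid distance. The topic vectors are taken as given (the LSI step is not shown), and `init_map` is a hypothetical name.

```python
def init_map(rows, cols, fixpoints):
    """Initialize every cell of a rows x cols map.

    fixpoints -- {(row, col): {term: weight}} broad-topic vectors
    Cells that are not fixpoints get an inverse-squared-distance weighted
    blend of all fixpoint vectors (an assumed interpolation scheme).
    """
    grid = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in fixpoints:
                grid[(r, c)] = dict(fixpoints[(r, c)])
                continue
            weights = {p: 1.0 / ((r - p[0]) ** 2 + (c - p[1]) ** 2)
                       for p in fixpoints}
            total = sum(weights.values())
            vec = {}
            for p, w in weights.items():
                for term, value in fixpoints[p].items():
                    vec[term] = vec.get(term, 0.0) + (w / total) * value
            grid[(r, c)] = vec
    return grid

fix = {(0, 0): {"sport": 1.0}, (0, 2): {"politics": 1.0}}
grid = init_map(1, 3, fix)
```

The middle cell of this one-row map ends up halfway between the two topics.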
Joint winner search • Global winner search: accurate but slow • Local winner search: faster, but can be inaccurate during rapid changes • Start with a single phase of global search • Document movements become smoother during the learning process: usually local search is enough • Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
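A rough sketch of combining the two searches follows. The slide does not state how a "sudden move" is detected, so the `min_sim` fallback test is an assumption, as are the function names and the dot-product similarity used in the example.

```python
def joint_winner(doc, cells, neighbours, last_winner, similarity, min_sim=0.2):
    """Pick the winning cell for doc.

    cells       -- list of reference vectors
    neighbours  -- {cell index: [indices of adjacent cells]}
    last_winner -- the document's previous winner, or None on the first pass
    min_sim     -- assumed trigger: fall back to global search when the best
                   local similarity is below this value
    """
    if last_winner is not None:
        # Local search: previous winner and its direct neighbours only.
        candidates = [last_winner] + neighbours[last_winner]
        best = max(candidates, key=lambda i: similarity(doc, cells[i]))
        if similarity(doc, cells[best]) >= min_sim:
            return best
    # Global search: first pass, or the local result looked unreliable.
    return max(range(len(cells)), key=lambda i: similarity(doc, cells[i]))

def dot(u, v):
    """Dot product of two sparse {term: weight} vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

cells = [{"a": 1.0}, {"a": 0.7, "b": 0.3}, {"b": 1.0}, {"c": 1.0}]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
local = joint_winner({"a": 1.0}, cells, neighbours, last_winner=1, similarity=dot)
jump = joint_winner({"c": 1.0}, cells, neighbours, last_winner=0, similarity=dot)
```

The first call is resolved locally; the second document matches nothing near its old position, so the global search takes over.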
Hierarchical maps • Bottom-up approach • Feasible (with the joint winner search method) • Start with the most detailed map • Compute weighted centroids of map areas [formula missing] • Use them as seeds for the coarser map • A top-down approach is possible but requires fixpoints
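The centroid formula itself did not survive in the slide text, so the following is only a generic weighted centroid. It assumes that each cell of an area is weighted by the number of documents assigned to it; the actual weights used by the authors are unknown.

```python
def weighted_centroid(vectors, weights):
    """Weighted mean of sparse {term: weight} reference vectors.

    vectors -- reference vectors of the cells in one map area
    weights -- one non-negative weight per cell (assumed here to be the
               number of documents assigned to the cell)
    """
    total = sum(weights)
    centroid = {}
    for vec, w in zip(vectors, weights):
        for term, value in vec.items():
            centroid[term] = centroid.get(term, 0.0) + w * value / total
    return centroid

area = [{"x": 1.0}, {"x": 0.5, "y": 1.0}]
doc_counts = [3, 1]
seed = weighted_centroid(area, doc_counts)
```

The resulting vector would then serve as the initial reference vector of one cell in the coarser map.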
Clustering document groups • Numerous methods exist, but none of them is directly applicable: • Extremely fuzzy structure of topical groups in SOM cells • Necessity of taking into account similarity measures both in the original document space and in the map space • Outlier-handling problem during cluster formation • No a priori estimate of the number of topical groups • Fuzzy C-Means applied on the lattice of map cells • Graph-theoretical approach (density- and distance-based MST) combined with fuzzy clustering • Clustered documents are labelled with weighted centroids of cell reference vectors, scaled by between-group entropy
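For orientation, the membership update of standard fuzzy c-means is sketched below. It uses Euclidean distance between lattice coordinates only, which is a simplification: the slide calls for similarity in both document space and map space, and the MST-based part is not shown. The function name and the fuzzifier `m=2.0` are assumptions.

```python
import math

def memberships(points, centres, m=2.0):
    """Standard fuzzy c-means membership update.

    u[i][j] = 1 / sum_k (d(i, j) / d(i, k)) ** (2 / (m - 1)),
    where d(i, j) is the distance from point i to centre j.
    """
    exponent = 2.0 / (m - 1.0)
    result = []
    for p in points:
        dists = [math.dist(p, c) for c in centres]
        if 0.0 in dists:
            # The point coincides with a centre: full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            result.append([h / sum(hits) for h in hits])
            continue
        result.append([1.0 / sum((dj / dk) ** exponent for dk in dists)
                       for dj in dists])
    return result

cells = [(0, 0), (1, 0), (4, 0)]     # lattice coordinates of map cells
centres = [(0, 0), (4, 0)]           # current cluster centres
u = memberships(cells, centres)
```

The cell at (1, 0) belongs mostly, but not exclusively, to the first cluster, which is the kind of soft boundary the slide refers to.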
Experiments with map convergence • We examined the convergence of the maps to a stable state depending on: • type of alpha function (search radius reduction) • type of winner search method • type of initialization method
[Figure: Convergence – alpha functions]
[Figure: Convergence – winner search]
Experiments with execution time • The impact of the following factors on the speed of map creation was investigated: • Map size (total number of cells) • Optimization methods: • dictionary optimization • reference vector representation • Map quality assessment: • Compare with an 'ideal' map (e.g. one built without optimizations) • Identical initialization and learning parameters • Compute the sum of squared distances between the locations of each document on the two maps
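The quality measure in the last bullet can be written down directly. The sketch assumes document locations are stored as `(row, col)` cell coordinates keyed by a document id; the function name and the sample data are illustrative.

```python
def map_distance(positions_a, positions_b):
    """Sum of squared grid distances between each document's location
    on two maps. Both arguments map a document id to (row, col)."""
    return sum(
        (row - positions_b[doc][0]) ** 2 + (col - positions_b[doc][1]) ** 2
        for doc, (row, col) in positions_a.items()
    )

ideal = {"d1": (0, 0), "d2": (3, 4), "d3": (2, 2)}
optimized = {"d1": (0, 1), "d2": (3, 4), "d3": (4, 2)}
score = map_distance(ideal, optimized)
```

A score of zero would mean the optimized map places every document in exactly the same cell as the reference map.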
[Figure: Execution time - map size]
[Figure: Execution time - optimizations]
Future research • Maps for joint term-citation model, taking into account between-group link flow direction • Fully distributed map creation • Adaptive document retrieval and clustering: • Bayesian network based relevance measure • Survival models for document update rate estimation • Dead link propagation methods for page freshness estimation • We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects
Future research • Bayesian networks will be applied in particular to: • measure relevance and classify documents • accelerate document clustering processes • construct a thesaurus supporting query enrichment • keyword extraction • between-topic dependencies estimation
Thank you! Any questions?