Architecture for graphical maps of Web contents

This presentation discusses the architecture and techniques for creating graphical maps of Web contents, focusing on document analysis, clustering, and visualization. The work aims to develop a user-friendly search engine capable of presenting replies to on-line queries in graphical form on a document map.

ynorwood

Presentation Transcript


  1. Architecture for graphical maps of Web contents • Krzysztof Ciesielski, Michal Draminski, Mieczyslaw Klopotek, Mariusz Kujawiak, Slawomir Wierzchon • Institute of Computer Science, PAS, Warsaw; University of Podlasie, Siedlce; Białystok University of Technology

  2. Agenda • Motivation • Architecture • Map interface • Map creation • Map clustering • Execution time of map creation • Convergence of map creation • Future direction

  3. Motivation • The Web and intranets are becoming increasingly content-rich. • A good way of presenting massive document sets in an understandable form will be crucial in the near future. • The BEATCA project envisages a user-friendly presentation of the contents of moderate-size document collections (with millions of documents).

  4. Our approach • The presentation method is based on the WebSOM map idea and is enriched with novel methods of document analysis, clustering and visualization. • A special architecture has been developed to enable experiments with different variants of the map creation algorithm. • Our research aims to create a full-fledged search engine (working name Beatca) for small collections of documents, capable of presenting on-line replies to queries in graphical form on a document map.

  5. Architecture • We follow the general architecture for search engines. • The preparation of documents for retrieval is done by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation. • The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation. • Maps are used by the query processor responding to users' queries.
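The indexer step above can be illustrated with a minimal sketch. It assumes plain-text input and a standard TF-IDF weighting; the slides do not say which term weighting the BEATCA indexer uses, and the names `tokenize` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters; a deliberately naive tokenizer."""
    word, words = [], []
    for ch in text.lower():
        if ch.isalpha():
            word.append(ch)
        elif word:
            words.append("".join(word))
            word = []
    if word:
        words.append("".join(word))
    return words

def tfidf_vectors(docs):
    """Return (vocabulary, list of TF-IDF vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: raw term count times log inverse document frequency.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Usage: three toy "documents" become three vectors over a shared vocabulary.
vocab, vectors = tfidf_vectors(
    ["maps of web documents", "web search engine", "document maps"]
)
```

Each document becomes one row over a shared vocabulary, which is the kind of input a map creator can consume.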

  6. Architecture [Diagram; labeled components: Base Registry, Search Engine, Robot, Indexer, Optimizer, Mapper, HT Base, Vector Base, Map]

  7. User interface • Search results are presented on a document map • The map can have one of two forms: • The traditional flat map • The rotating torus
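One way to see what distinguishes the torus from the flat map: on a torus the map edges wrap around, so cells on opposite borders are neighbors. The sketch below shows a wrap-around grid distance under that assumption; the slides do not describe how distances are computed in the actual system, and `torus_distance` is a hypothetical name.

```python
def torus_distance(a, b, width, height):
    """Euclidean distance between grid cells a and b on a wrapped (toroidal) grid.

    a and b are (x, y) cell coordinates; width and height are the grid size.
    """
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # Going "around the back" of the torus may be shorter than the direct path.
    dx = min(dx, width - dx)
    dy = min(dy, height - dy)
    return (dx * dx + dy * dy) ** 0.5

# Cells in the first and last column of a 10 x 10 map are adjacent on the torus.
d = torus_distance((0, 0), (9, 0), 10, 10)
```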

  8. Rotating torus representation of the map

  9. How the maps are created • A modified WebSOM method is used • Based on our observation of a radical reduction of document vector variation • Multi-level maps
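For orientation, the sketch below shows one training step of a plain self-organizing map, the technique WebSOM builds on. It is not the authors' modified method, which the slides do not specify; the Gaussian neighborhood and the names `best_matching_unit` and `som_step` are assumptions made for illustration.

```python
import math
import random

def best_matching_unit(grid, x):
    """Return (row, col) of the cell whose reference vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(grid):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def som_step(grid, x, alpha, radius):
    """Move the winner and its grid neighbors toward x (in place)."""
    br, bc = best_matching_unit(grid, x)
    for r, row in enumerate(grid):
        for c, w in enumerate(row):
            grid_d2 = (r - br) ** 2 + (c - bc) ** 2
            # Assumed Gaussian neighborhood: influence fades with grid distance.
            h = math.exp(-grid_d2 / (2 * radius ** 2))
            for i in range(len(w)):
                w[i] += alpha * h * (x[i] - w[i])
    return br, bc

# Usage: a 4 x 4 map of 3-dimensional reference vectors, one training step.
random.seed(0)
grid = [[[random.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
winner = som_step(grid, [1.0, 0.0, 0.0], alpha=0.5, radius=1.0)
```

Repeating this step over all document vectors while shrinking `alpha` and `radius` is what drives the map toward a stable state.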

  10. A map for 20 newsgroups

  11. A detailed map for Syskill&Webert 4 document groups

  12. A high level map for Syskill&Webert 4 document groups

  13. Clustering groups documents • A fuzzy ISODATA method is used • Entropy based • Initialisation with a minimum-weight spanning tree • Clustered documents are labeled by weighted centroids of cell reference vectors modified with entropy
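Fuzzy ISODATA is better known today as fuzzy c-means. The sketch below shows its two alternating updates in their textbook form, assuming Euclidean distance and fuzzifier `m = 2`; it omits the entropy-based modification mentioned on the slide, whose details are not given, and the function names are hypothetical.

```python
def memberships(points, centers, m=2.0):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    exp = 2.0 / (m - 1.0)
    u = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 for c in centers]
        if 0.0 in d:
            # Point coincides with a center: give it full membership there.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            total = sum(hits)
            u.append([h / total for h in hits])
        else:
            u.append([1.0 / sum((dj / dk) ** exp for dk in d) for dj in d])
    return u

def update_centers(points, u, m=2.0):
    """Recompute each center as the membership-weighted mean of the points."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append(
            [sum(w * p[i] for w, p in zip(weights, points)) / total
             for i in range(dim)]
        )
    return centers

# Usage: one iteration on two well-separated groups of 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
u = memberships(points, [[0.0, 0.5], [5.0, 5.5]])
centers = update_centers(points, u)
```

Unlike hard clustering, every document keeps a degree of membership in every cluster, which suits documents that straddle topics.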

  14. Approximate clustering using minimal spanning tree for 5 newsgroups
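A common way to obtain an approximate clustering from a minimum spanning tree is to build the tree and cut its longest edges. The sketch below does this with Prim's algorithm on Euclidean points; whether the authors cut edges in exactly this way is not stated on the slides, and `mst_edges` and `mst_clusters` are hypothetical names.

```python
def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (dist, i, j) edges."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        best = min(
            (dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and return a cluster label per point."""
    kept = sorted(mst_edges(points))[:len(points) - k]
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for _, i, j in kept:
        parent[find(i)] = find(j)
    roots = {}
    # Number clusters in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Usage: two visibly separate groups of points, split into k = 2 clusters.
labels = mst_clusters([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], 2)
```

The resulting groups can serve as starting clusters for an iterative method such as the fuzzy clustering on the previous slide.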

  15. Label candidates for clusters (5 newsgroups)

  16. Experiments with execution time • The impact of the following factors on the speed of map creation was investigated: • Map size • Optimization method • Dictionary optimization (extreme entropy and extreme frequency) • Reference vector optimization
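Dictionary optimization by extreme frequency can be read as dropping terms that occur in too few or too many documents, which shortens every document vector. The sketch below illustrates that reading; the thresholds `min_df` and `max_df_ratio` and the name `prune_dictionary` are assumptions, and the entropy-based criterion is not shown because the slides do not define it.

```python
from collections import Counter

def prune_dictionary(tokenized_docs, min_df=2, max_df_ratio=0.8):
    """Keep terms whose document frequency is neither too low nor too high."""
    n = len(tokenized_docs)
    # Count each term once per document it appears in.
    df = Counter(t for toks in tokenized_docs for t in set(toks))
    return sorted(
        t for t, k in df.items() if k >= min_df and k / n <= max_df_ratio
    )

# Usage: "the" is in every document and rare terms appear once; both are dropped.
docs = [
    ["the", "map", "som"],
    ["the", "map", "web"],
    ["the", "torus", "web"],
    ["the", "query"],
    ["the", "map"],
]
kept = prune_dictionary(docs)
```

A smaller dictionary means fewer dimensions per reference vector, which is one plausible reason this factor affects map creation time.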

  17. Convergence We checked the convergence of the maps to a stable state depending on • Type of alpha function (search radius reduction) • Type of winner search method
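To make "type of alpha function" concrete, the sketch below compares two common decay schedules for a training parameter such as the learning rate or the neighborhood radius. The slides do not list which functions were actually compared, so both schedules and their names are illustrative assumptions.

```python
def linear_decay(start, end, t, t_max):
    """Go from start to end in equal steps over t_max iterations."""
    return start + (end - start) * t / t_max

def exponential_decay(start, end, t, t_max):
    """Go from start to end by a constant factor per iteration (start, end > 0)."""
    return start * (end / start) ** (t / t_max)

# Usage: radius shrinking from 10 cells to 1 cell over 100 iterations,
# sampled at the start, the midpoint and the end.
schedule = [
    (t, linear_decay(10.0, 1.0, t, 100), exponential_decay(10.0, 1.0, t, 100))
    for t in (0, 50, 100)
]
```

The exponential schedule shrinks faster early on, so the map spends more iterations on fine local adjustment.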

  18. Future research • We intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects. • Bayesian networks will be applied in particular to classify documents, to accelerate document clustering processes, to construct a thesaurus supporting query enrichment, and to extract keywords. • Immuno-genetic systems will be used for adaptive document clustering by referring to the mechanism of so-called metadynamics, for extraction of compact characteristics of document groups by exploiting the mechanism of construction of universal and specialized antibodies, and for visualisation and adjustment of the resolution of document maps.

  19. Thank you • Any questions?
