Web clustering engines are an emerging trend in the field of information retrieval. They organize search results by topic, thus providing a complementary view to the flat ranked list returned by standard search engines.
Search Engine? • Search engines are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query. • E.g., Google, Yahoo, etc.
Flat Ranked VS Clustered • Google (Flat Ranked Search Engine)
Why Web Clustering Engines? • Conventional engines do not handle ‘ambiguous’ queries well. • The results a conventional search engine returns for such a query are mixed together in a single list, so irrelevant items appear among the relevant ones.
These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories). Web clustering engines: 1. Northern Light - predefined set of clusters 2. Credo Reference 3. Kartoo 4. Eyeplorer
Main advantages of the cluster hierarchy • It provides shortcuts to the items that relate to the same meaning. • It allows a better understanding of the topic.
Issues in Implementation Of clusters • Short input data description. • Meaningful labels. • Selection of similarity measure. • Grouping of objects into clusters. • Computational efficiency. • Unknown number of clusters.
1. Search Results Acquisition • Provides the input for the rest of the system. • Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL. • The source of search results can be any public search engine, such as Google or Yahoo. • Results can also be fetched from other search engines.
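A minimal sketch of the record this step hands to the rest of the pipeline. The names `SearchResult`, `acquire`, and `fetch` are hypothetical and not from the slides; `fetch` stands in for whatever search engine API is actually used and is assumed to yield (title, snippet, URL) tuples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchResult:
    title: str
    snippet: str
    url: str

def acquire(query, fetch, limit=500):
    """Collect up to `limit` well-formed results from a hypothetical fetch(query)."""
    results = []
    for title, snippet, url in fetch(query):
        # Skip entries that lack the fields later stages rely on.
        if title and url:
            results.append(SearchResult(title, snippet, url))
        if len(results) == limit:
            break
    return results
```

The upper bound of 500 follows the range given on the slide; enforcing the lower bound of 50 is left to the caller, since a rare query may simply have fewer hits.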
2. Preprocessing of Search Results • The primary aim is to convert the search results into ‘features’. Steps: i. Language identification ii. Tokenization iii. Stemming iv. Feature selection
ii. Tokenization: The text of each search result is split into a sequence of basic independent units called tokens, each of which represents a word, a number, or a symbol.
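A minimal sketch of tokenization using a regular expression. The function name `tokenize` and the choice to lowercase the text are assumptions for illustration; real engines often use language-specific tokenizers.

```python
import re

def tokenize(text):
    # A token is either a run of word characters (words, numbers)
    # or a single non-space symbol such as ":" or "!".
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Web clustering engines: 50 to 500 results!")
# tokens == ['web', 'clustering', 'engines', ':', '50', 'to', '500', 'results', '!']
```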
iii. Stemming: Remove the inflectional prefixes and suffixes of each word to reduce the different grammatical forms of the word to a common base form called a ‘stem’. E.g.: connected, connecting, interconnection → ‘connect’
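The example above can be reproduced with a toy affix stripper. This is a sketch only: the affix lists and the name `toy_stem` are assumptions chosen to cover the three example words, and it is not the Porter stemmer or any other published algorithm, which apply far more careful rules.

```python
PREFIXES = ("inter",)           # assumed, just enough for the example
SUFFIXES = ("ing", "ion", "ed")  # assumed, just enough for the example

def toy_stem(word, min_len=3):
    word = word.lower()
    # Strip at most one prefix, keeping at least min_len characters.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    # Strip at most one suffix, with the same length guard.
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

stems = [toy_stem(w) for w in ["connected", "connecting", "interconnection"]]
# stems == ['connect', 'connect', 'connect']
```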
iv. Feature selection: • Extract features for each search result present in the input. • Features are atomic entities by which we can describe an object and represent its most important characteristics to an algorithm. • Features vary from single words to tuples of words.
How can we represent a feature/text? • Vector Space Model (VSM) • A document d is represented in the VSM as a vector [w_t0, w_t1, ..., w_tn], where t0, t1, ..., tn is a set of words/features and w_ti is the weight/importance of feature ti. E.g.: d → “Polly had a dog and the dog had Polly” (VSM representation)
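A minimal sketch of the VSM representation for the example sentence. It assumes raw term counts as the weights, which the slide does not specify; TF-IDF or other weighting schemes are common alternatives. The name `tf_vector` is hypothetical.

```python
from collections import Counter

def tf_vector(text, vocabulary=None):
    """Return (vocabulary, weights) using raw term counts as weights."""
    counts = Counter(text.lower().split())
    if vocabulary is None:
        vocabulary = sorted(counts)  # fixed, alphabetical feature order
    return vocabulary, [counts[t] for t in vocabulary]

vocab, vec = tf_vector("Polly had a dog and the dog had Polly")
# vocab == ['a', 'and', 'dog', 'had', 'polly', 'the']
# vec   == [1, 1, 2, 2, 2, 1]
```

Passing the same `vocabulary` for every document keeps the vectors aligned, so they can be compared with a similarity measure.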
3. Cluster Construction & Labelling • The set of search results, along with their features, is the input to the clustering algorithm, which builds and labels the clusters. Three types of algorithms: 1. Data-centric 2. Description-aware 3. Description-centric
Data-Centric Clustering Algorithm • It starts with an initial clustering of a collection of documents into a set of k clusters (scatter). • At query time, the user selects the clusters of interest (gather) and the system re-clusters those documents. • The process repeats until a small cluster of relevant documents is found.
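The scatter and gather steps above can be sketched as a loop. This is an illustration under stated assumptions, not the algorithm from the slides: the clustering step is swapped for a very simple one (the first k documents act as seeds and each document joins the seed with the highest Jaccard word overlap), and `choose` is a hypothetical callback standing in for the user's selection.

```python
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def scatter(docs, k):
    """Split docs into at most k clusters, seeded by the first k documents."""
    seeds = docs[:k]
    clusters = [[] for _ in seeds]
    for doc in docs:
        best = max(range(len(seeds)), key=lambda i: jaccard(doc, seeds[i]))
        clusters[best].append(doc)
    return [c for c in clusters if c]

def gather(clusters, chosen):
    """Merge the clusters the user selected into one working collection."""
    return [doc for i in chosen for doc in clusters[i]]

def scatter_gather(docs, k, choose, small_enough=3):
    # Repeat scatter -> user choice -> gather until the set is small.
    while len(docs) > small_enough:
        clusters = scatter(docs, k)
        selected = gather(clusters, choose(clusters))
        if len(selected) == len(docs):
            break  # the selection did not narrow anything; stop
        docs = selected
    return docs

docs = ["jaguar car price", "jaguar cat habitat", "jaguar car dealer",
        "jaguar cat diet", "jaguar car engine", "jaguar cat zoo"]
# Simulated user: keep clusters whose first document mentions "cat".
pick_cats = lambda cs: [i for i, c in enumerate(cs) if "cat" in c[0].split()]
found = scatter_gather(docs, k=2, choose=pick_cats)
```

With these inputs the loop narrows the six documents to the three about the animal after one round.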
Difficulties in Data-Centric Algorithms • These algorithms are not incremental in nature: they cannot take each document as it arrives from the web, “clean” it, and add it to the existing model. • They lack meaningful labels.
4. Visualization of Clustered Results • One prominent approach is based on hierarchical folders. • Clusty, CREDO, Lingo3G - hierarchical folder visualization approach • Grokker - nesting and zooming approach • KartOO - graph-based interfaces