A text mining approach on automatic generation of web directories and hierarchies

Advisor: Dr. Hsu • Reporter: Chun Kai Chen • Authors: Hsin-Chang Yang and Chung-Hong Lee • 2004. Expert Systems with Applications 645-663

Outline • Motivation • Objective • The text mining process • Automatic generation of web directories • Experimental Results • Summary

Motivation • The classification of web pages into proper directories and the organization of directory hierarchies are generally performed by human experts.

Objective • In this work, we provide a corpus-based method that applies a text mining technique to a corpus of web pages to automatically create web directories and organize them into hierarchies.

[Figure: overview of the method. The text mining process: web pages, extraction of document text, SOM (DCM), SOM (WCM). Automatic generation of web directories: generation of directory hierarchies (two-level hierarchy) and generation of directories (web directories). Diagram labels: S1, S2, S3, Si, Si+1.]

Automatic generation of web directories • Generation of directory hierarchies • The super-cluster generation process algorithm • Generation of directories • identify cluster themes by examining the WCM • select the word that is the most important to a super-cluster • [Diagram labels: stop criteria, DCM, WCM]

Experimental Results • The experiments show that our method can produce comprehensible and reasonable web directories and hierarchies.

Introduction(1/3) • Finding information on the web is a serious problem, since most users find it hard to obtain the information they need with current information retrieval strategies. • Two kinds of strategies are now adopted by the web communities, namely searching and browsing.

Introduction(2/3) • Since the link structures may be considered static during browsing • the selection of starting pages plays the most important role when a user tries to find his goal in minimum time • Therefore, many commercial or academic web sites actively collect web pages and sort them into web directories • to provide users the starting points in the browsing process

Introduction(3/3) • Most existing web directories were created manually by human specialists. • e.g. Yahoo! • Manual construction is limited mainly by the gigantic number of web pages already produced and still being produced

Related work • category hierarchy • predefined category hierarchy (Yahoo!) • automatically developing category hierarchy • topic identification • mutually related text excerpts • Self-organizing map algorithm

The text mining process(1/2) • The method is based on the self-organizing map (SOM) learning algorithm and requires no human intervention during the construction of web directories and hierarchies. • [Figure: the text mining process: web pages, extraction of document text, SOM (WCM), SOM (DCM)]

The text mining process(2/2) • labeling process • each document is associated with a neuron in the map. We record these associations to form the DCM. • In the DCM, each neuron is labeled by a list of documents which • are considered similar and • are in the same cluster. • In the same manner, we assign each word to a neuron in the map to form the WCM.
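
The labeling step can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes an already trained SOM whose weight matrix has one row per neuron and one column per vocabulary word, assigns each document to its closest neuron, and (as an assumption, since the slide does not give the rule) assigns each word to the neuron with the largest weight component for that word. The function name `label_maps` and the toy data are hypothetical.

```python
import numpy as np
from collections import defaultdict

def label_maps(doc_vectors, weights):
    """Build toy DCM and WCM dictionaries from a trained SOM.

    doc_vectors: (n_docs, n_words) document vectors.
    weights:     (n_neurons, n_words) trained SOM weight vectors.
    """
    # DCM: each document goes to its best-matching neuron (smallest distance).
    dcm = defaultdict(list)
    for d, x in enumerate(doc_vectors):
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        dcm[winner].append(d)

    # WCM (assumed rule): each word goes to the neuron whose weight vector
    # has the largest component for that word.
    wcm = defaultdict(list)
    for k in range(weights.shape[1]):
        wcm[int(np.argmax(weights[:, k]))].append(k)

    return dict(dcm), dict(wcm)

# Toy example: 2 neurons, 4 words, 3 documents.
weights = np.array([[0.9, 0.8, 0.1, 0.0],
                    [0.1, 0.0, 0.9, 0.7]])
docs = np.array([[1, 1, 0, 0],
                 [0, 0, 1, 1],
                 [1, 0, 0, 0]])
dcm, wcm = label_maps(docs, weights)
print(dcm)  # neuron index -> document indices
print(wcm)  # neuron index -> word indices
```

Documents that share a neuron in the DCM form one cluster; the words attached to the same neuron in the WCM are the candidates for describing that cluster.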

Generation of directory hierarchies(1/3) • The two-level hierarchy generation process • the parent node is the constructed super-cluster • the child nodes are the clusters that compose the super-cluster • can be further applied to every super-cluster to establish the next level of the hierarchy • The overall hierarchy • is built by applying this top-down approach iteratively • until a stop criterion is satisfied

Generation of directory hierarchies(2/3) • To form a super-cluster • the distance between two clusters (distance between their coordinates on the 2-D map) • the dissimilarity between two clusters (based on the similarity of their neuron vectors) • the supporting cluster similarity • we can determine the significance of a cluster by examining the overall similarity that is contributed by its neighboring clusters. • doc(i): the number of documents associated with neuron i • Bi: the indices of the neurons neighboring neuron i • F: a monotonically increasing function • The dominating clusters • have locally maximal supporting cluster similarity • are the centroids of super-clusters, each of which contains several child clusters
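
The slide lists the ingredients of the supporting cluster similarity but does not reproduce the formula itself. The sketch below is therefore only one plausible combination of those ingredients, stated as an assumption: for neuron i it sums doc(j) over the neighbors j in Bi, each term divided by F applied to the product of the map distance and the weight-vector dissimilarity, with F(x) = 1 + x chosen arbitrarily as a monotonically increasing function. The 8-neighborhood used for Bi and all function names are hypothetical.

```python
import numpy as np

def neighbors(i, rows, cols, radius=1):
    """Indices of grid neighbors of neuron i (an assumed definition of Bi)."""
    r, c = divmod(i, cols)
    out = []
    for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
        for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
            if (rr, cc) != (r, c):
                out.append(rr * cols + cc)
    return out

def supporting_similarity(weights, doc_counts, rows, cols, F=lambda x: 1.0 + x):
    """Assumed form: S_i = sum over j in Bi of doc(j) / F(D_ij * ||w_i - w_j||)."""
    S = np.zeros(rows * cols)
    for i in range(rows * cols):
        ri, ci = divmod(i, cols)
        for j in neighbors(i, rows, cols):
            rj, cj = divmod(j, cols)
            grid_dist = np.hypot(ri - rj, ci - cj)               # distance on the 2-D map
            dissim = np.linalg.norm(weights[i] - weights[j])      # weight-vector dissimilarity
            S[i] += doc_counts[j] / F(grid_dist * dissim)
    return S

def dominating_clusters(S, rows, cols):
    """Neurons whose S is strictly larger than that of every neighbor."""
    return [i for i in range(rows * cols)
            if all(S[i] > S[j] for j in neighbors(i, rows, cols))]

# Toy 1 x 3 map with identical weight vectors, so only doc counts matter.
weights = np.zeros((3, 2))
doc_counts = [2, 1, 3]
S = supporting_similarity(weights, doc_counts, rows=1, cols=3)
dominating = dominating_clusters(S, rows=1, cols=3)
print(S, dominating)
```

In this toy map the middle neuron is supported by both of its neighbors, so it is the only local maximum and would serve as the centroid of a super-cluster.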

Generation of directory hierarchies(3/3) • In Step 3 of the super-cluster generation process algorithm we set three stop criteria. • The first criterion stops finding super-clusters • if there is no neuron left for selection. • The second criterion limits the number of dominating clusters to constrain the breadth of hierarchies. • The third criterion constrains the depth of a hierarchy.
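
How the three stop criteria might bound the process can be sketched as below. This is an illustration under assumptions, not the algorithm from the paper: dominating clusters are picked greedily by score, each pick removes itself and its neighbors from the candidate set, and the recursion takes a caller-supplied `split` function. The names `pick_dominating`, `build_levels`, `max_breadth`, and `max_depth` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

def pick_dominating(scores: Sequence[float],
                    neighbors_of: Callable[[int], List[int]],
                    max_breadth: int) -> List[int]:
    """Greedily pick dominating clusters under criteria 1 and 2."""
    remaining = set(range(len(scores)))
    chosen: List[int] = []
    # Criterion 1: stop when no neuron is left for selection.
    # Criterion 2: stop when the breadth limit is reached.
    while remaining and len(chosen) < max_breadth:
        best = max(remaining, key=lambda i: (scores[i], -i))
        chosen.append(best)
        remaining.discard(best)
        remaining.difference_update(neighbors_of(best))  # assumed removal rule
    return chosen

def build_levels(group: List[int],
                 split: Callable[[List[int]], List[List[int]]],
                 max_depth: int, depth: int = 1) -> Dict:
    """Apply a two-level split top-down until the depth limit is hit."""
    node = {"members": group, "children": []}
    # Criterion 3: stop when the depth limit is reached.
    if depth >= max_depth or len(group) <= 1:
        return node
    for sub in split(group):
        if len(sub) < len(group):  # skip splits that do not shrink the group
            node["children"].append(build_levels(sub, split, max_depth, depth + 1))
    return node

# Toy usage on a 1-D chain of five neurons.
scores = [1, 5, 1, 4, 0]
chain = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < len(scores)]
picked = pick_dominating(scores, chain, max_breadth=3)
halves = lambda g: [g[:len(g) // 2], g[len(g) // 2:]]
tree = build_levels([0, 1, 2, 3], halves, max_depth=2)
print(picked, tree)
```

Lowering `max_breadth` yields fewer, broader directories at each level, while `max_depth` caps how many levels the hierarchy can have.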


Generation of directories • In this work, we try to identify cluster themes, i.e. directory labels, by examining the WCM. • The method selects the word that is the most important to a super-cluster
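
A rough sketch of label selection is shown below. The slide does not say how word importance is measured, so the measure here is an assumption: candidate words are those the WCM attaches to any neuron in the super-cluster, and a word's importance is the sum of its weight components over the member neurons. The function name `directory_label` and the toy vocabulary are hypothetical.

```python
import numpy as np

def directory_label(weights, vocab, wcm, members):
    """Pick one word as the label of a super-cluster.

    weights: (n_neurons, n_words) trained SOM weight vectors.
    vocab:   list of words, one per column of `weights`.
    wcm:     dict mapping neuron index -> list of word indices.
    members: neuron indices that make up the super-cluster.
    """
    # Candidate words: those the WCM associates with any member neuron.
    candidates = sorted({k for n in members for k in wcm.get(n, [])})
    if not candidates:  # fall back to the whole vocabulary
        candidates = list(range(len(vocab)))
    # Assumed importance: sum of the word's weight components over members.
    importance = weights[np.asarray(members)][:, candidates].sum(axis=0)
    return vocab[candidates[int(np.argmax(importance))]]

# Toy example: 3 neurons, 4 words.
vocab = ["music", "guitar", "stock", "market"]
weights = np.array([[0.9, 0.6, 0.1, 0.0],
                    [0.7, 0.2, 0.0, 0.1],
                    [0.0, 0.1, 0.8, 0.9]])
wcm = {0: [0, 1], 1: [0], 2: [2, 3]}
label_a = directory_label(weights, vocab, wcm, members=[0, 1])
label_b = directory_label(weights, vocab, wcm, members=[2])
print(label_a, label_b)
```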

Summary • In this paper, we present a method to automatically generate • web directory hierarchies and identify directory labels. • Experiments show that our method could • successfully cluster the documents into directories, • reveal the hierarchical structure among these directories, • and assign a label to each directory. • However, a fully automatic process may not provide the best solutions for tasks that depend so heavily on human judgment. • Thus, in our opinion, a semi-automatic process that uses the proposed method as a preprocessing stage should be a plausible way to meet the general requirements.

Personal Opinion • Application • such as text categorization, thesaurus construction, ontology learning, multilingual information retrieval • Advantage • a fully automatic process, which can create web directory hierarchies without human intervention • Disadvantage • may not provide the best solutions
