A text mining approach for automatic construction of hypertexts. Advisor: Dr. Hsu. Graduate: Kuo-min Wang. Authors: Hsin-Chang Yang, Chung-Hong Lee. Expert Systems with Applications, 2005.
Outline • Motivation • Objective • Introduction • System Overview • Related Work • Text mining by self-organizing maps • Automatic hypertext construction • Experimental results • Conclusions • Personal Opinion
Motivation • Traditionally, hyperlinks are constructed by the creators of web pages, with or without the help of authoring tools. • An automatic hypertext construction method is therefore necessary for content providers to produce adequate information efficiently.
Objective • Propose a new automatic hypertext construction method based on a text mining approach
Introduction • The use of hypertext provides a feasible mechanism to retrieve related documents. • A hypertext contains a number of navigational hyperlinks that point to related hypertexts or to locations within the same hypertext.
Introduction (cont.) • To transform a flat text into a hypertext, we need to decide where to insert hyperlinks in the text. • [Figure: a hyperlink in a text]
Introduction (cont.) • Structural links • Made available to navigate an information space by its structure; not necessarily related to the content of the page they are found on. • Referential links and associative links • The critical points in creating associative hyperlinks are twofold: 1. Find the source text to be linked. 2. Find the documents that are semantically relevant to these sources.
Introduction (cont.) • Our method applies the self-organizing map algorithm to cluster flat text documents in a training corpus and generate two maps. • These maps are then used to identify the sources and destinations of important hyperlinks within the training documents.
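System overview • [Figure: four-step system diagram; one step transforms texts into index terms, with example terms 大陸 (mainland), 天津 (Tianjin), and 外交部 (Ministry of Foreign Affairs)]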
Related Work • Hypertext construction methodologies and systems • Salton et al. & Allan (1992) • Compute the similarity between fragments of documents in order to identify content links. • Agosti and Crestani (1993) • Establish a concept model that consists of three levels (concept level, index term level, and document level) and devise a five-step process to construct the links. • McClellan (1995) • Combines formal grammar and document indexing techniques to convert semi-structured text to hypertext.
Related Work (cont.) • Shin, Nam, and Kim (1997) • Combine statistical similarity and semantic similarity to create good hypertext. • Green (1997, 2000) • Analyzes lexical chains in a text based on WordNet. • Hyperlink analysis • Mehler (1999) • Uses the concept of lexical cohesion for text linkage. • Two texts that comprise semantically similar words are valid candidates for text linkage.
Text mining by self-organizing maps • Document encoding and preprocessing 1. Keep only the nouns. 2. Ignore terms that appear extremely few times. 3. Use a manually constructed stop list to filter out less important words. 4. Use a binary vector scheme to encode the documents and ignore any kind of term weighting scheme.
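The encoding steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents are assumed to be already tokenized and reduced to nouns (that step needs a part-of-speech tagger and is not shown), and the names `build_vocabulary`, `encode_binary`, and `min_count`, as well as the toy English documents, are hypothetical and not taken from the paper.

```python
def build_vocabulary(docs, stop_list, min_count=2):
    """Collect index terms, dropping stop words and terms that occur too rarely.

    min_count is a hypothetical cut-off; the slides only say that terms
    appearing "extremely few times" are ignored.
    """
    counts = {}
    for tokens in docs:
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
    return sorted(t for t, c in counts.items()
                  if c >= min_count and t not in stop_list)


def encode_binary(tokens, vocabulary):
    """Binary vector: 1 if the index term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


# Toy corpus (assumed to contain nouns only, apart from the stop word "the").
docs = [
    ["music", "oboe", "concert"],
    ["music", "concert", "the"],
    ["oboe", "reed", "the"],
]
stop_list = {"the"}

vocabulary = build_vocabulary(docs, stop_list)
vectors = [encode_binary(d, vocabulary) for d in docs]
```

Because no term weighting is applied, a term that occurs ten times in a document is encoded the same way as a term that occurs once.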
Text mining by self-organizing maps • Document clustering using SOM • SOM algorithm 1. Randomly select a training vector xi. 2. Find the best matching unit (BMU). 3. Update the neuron weights. 4. Repeat steps 1-3 until all training vectors have been selected. 5. Increase the epoch t, and decrease the learning rate and the neighborhood size. • [Figure: input data and winner node mj]
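A minimal pure-Python sketch of the five SOM steps listed above is given below. It is not the authors' implementation: the Gaussian neighborhood function, the linear decay schedule, the squared Euclidean distance, and all names and default values (`train_som`, `best_matching_unit`, `lr0`, `epochs`) are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda j: dist2(weights[j]))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; neuron j sits at grid position divmod(j, cols)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for t in range(epochs):
        # Step 5: learning rate and neighborhood radius shrink as t grows
        # (linear decay is an assumption).
        frac = 1.0 - t / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        # Steps 1 and 4: visit every training vector once per epoch,
        # in random order.
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            bmu = best_matching_unit(weights, x)  # Step 2
            br, bc = divmod(bmu, cols)
            # Step 3: pull the BMU and its grid neighbors toward x.
            for j, w in enumerate(weights):
                r, c = divmod(j, cols)
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                for n in range(dim):
                    w[n] += lr * h * (x[n] - w[n])
    return weights


# Toy binary document vectors (hypothetical data).
data = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
weights = train_som(data, rows=2, cols=2)
clusters = [best_matching_unit(weights, x) for x in data]
```

After training, `clusters[i]` is the neuron to which document `i` is assigned, which is how a document cluster map could be read off the trained weights.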
Text mining by self-organizing maps • Document clustering using SOM • [Figure: the word cluster map (WCM) and the document cluster map (DCM)]
Text mining by self-organizing maps • Mining document and word associations • Word labeling process • For the weight vector wj of the jth neuron, if its nth component exceeds a predetermined threshold, the word corresponding to that component is labeled to this neuron. • Discover the ideas of a set of related documents • Link the DCM and the WCM according to the neuron locations.
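The word labeling rule can be illustrated with a short sketch. The threshold value of 0.5, the function name `label_words`, and the toy weights are assumptions; the slides only state that the threshold is predetermined.

```python
def label_words(weights, vocabulary, threshold=0.5):
    """Map each neuron index j to the words whose component of wj exceeds the threshold."""
    return {
        j: [vocabulary[n] for n, value in enumerate(w) if value > threshold]
        for j, w in enumerate(weights)
    }


# Hypothetical trained weight vectors for a two-neuron map.
vocabulary = ["concert", "music", "oboe"]
weights = [
    [0.9, 0.8, 0.1],  # neuron 0
    [0.2, 0.4, 0.7],  # neuron 1
]
wcm = label_words(weights, vocabulary)
```

Here neuron 0 is labeled with "concert" and "music" and neuron 1 with "oboe". Since documents are assigned to the same neurons, the words labeled to a neuron can be read as themes of the documents clustered there.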
Automatic hypertext construction • Finding sources • Two kinds of words • Themes of other documents but not of this document (inter-cluster hyperlinks) • For instance, a user reading a document about music may see the word "oboe" and want to learn more about it. • Themes of this document (intra-cluster hyperlinks) • Link documents that are related to this document for referential purposes. • These documents share a common theme and provide good references for the users.
Automatic hypertext construction • Obtain the sources of these two kinds of hyperlinks • Introduce a spanning factor to limit the number of hyperlinks • Generate the ranking by averaging the total weights of all word clusters • Top-ranked (inter-cluster) • Top words (intra-cluster) • [Figure: Dj, Wc, and the WCM]
Automatic hypertext construction • Finding destinations (1) Dj and Dki belong to different document clusters. (2) c is the neuron index of the word cluster that contains Dj. (3) The distance between Dki and Dj is minimum. • [Figure: Dj, Wc, and the WCM]
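One possible reading of the destination conditions is sketched below: among candidate documents, skip those in the same document cluster as the source document and pick the one closest to it. The slide does not preserve the distance formula, so the squared Euclidean distance between document vectors used here is an assumption, as are the function name `find_destination`, the candidate tuple layout, and the toy data.

```python
def find_destination(source_vector, source_cluster, candidates):
    """Pick the closest candidate document that lies in a different cluster.

    candidates: list of (doc_id, cluster_index, vector) tuples.
    Returns the chosen doc_id, or None if no candidate qualifies.
    """
    best_id, best_d2 = None, None
    for doc_id, cluster, vector in candidates:
        if cluster == source_cluster:
            continue  # condition (1): must belong to a different document cluster
        # Condition (3): keep the candidate at minimum distance
        # (squared Euclidean distance is an assumption).
        d2 = sum((a - b) ** 2 for a, b in zip(source_vector, vector))
        if best_d2 is None or d2 < best_d2:
            best_id, best_d2 = doc_id, d2
    return best_id


# Hypothetical source document in cluster 0 and three candidates.
source_vector = [1, 1, 0]
candidates = [
    ("A", 0, [1, 1, 0]),  # same cluster as the source, skipped
    ("B", 1, [1, 0, 0]),
    ("C", 2, [0, 0, 1]),
]
destination = find_destination(source_vector, 0, candidates)
```

In this toy case document "B" is selected because it differs from the source in only one component, while "A" is excluded for sharing the source's cluster.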
Experimental results • The test corpus contains 3268 news articles posted by the CNA. • To reduce the size of the vocabulary: • Discard words that occur only once in a document. • Discard words in a manually constructed stop list, which contains 5259 Chinese words.
Conclusions • We devised a novel method for automatic hypertext construction. • Experiments show that the text mining approach successfully reveals the co-occurrence patterns, and that the devised hypertext construction process effectively constructs semantic hyperlinks.
Personal opinions • The approach combines many methods. • ……