An On-line Document Clustering Method Based on Forgetting Factors • Yoshiharu Ishikawa, Yibing Chen, Hiroyuki Kitagawa • University of Tsukuba, Japan • ECDL2001
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
Background • The Internet has enabled on-line document delivery services • newsfeed services over the network • periodically issued on-line journals • Important technologies (and applications) for on-line documents • information filtering • document summarization, information extraction • topic detection and tracking (TDT) • Clustering works as a core technique for these applications
Our Objectives (1) • Development of an on-line clustering method which considers the novelty of each document • Presents a snapshot of clusters in an up-to-date manner • Example: articles from a sports news feed (timeline figure: clusters for Formula 1 & M. Schumacher, U.S. Open Tennis, Soccer World Cup, and other articles appearing over time)
Our Objectives (2) • Development of a novelty-based clustering method for on-line documents • Features: • It weights newer documents more heavily than older ones and forgets obsolete ones • introduction of a new document similarity measure that considers the novelty and obsolescence of documents • Incremental clustering • low processing cost to generate a new clustering result • Automatic maintenance of target documents • obsolete documents are automatically deleted from the clustering target
Incremental Clustering Process (1) • when t = 0 (initial state): 1. arrival of new documents 2. store new documents in the repository 3. calculate and store statistics 4. cluster documents and present the result (figure: new documents flow into the clustering module, which produces Cluster 1 ... Cluster k)
Incremental Clustering Process (2) • when t = 1: 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. cluster documents and present the result
Incremental Clustering Process (3) • when t = τ + 1: 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. delete old documents 5. cluster documents and present the result
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • C2ICM Clustering Method • F2ICM Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
C2ICM Clustering Method • Cover-Coefficient-based Incremental Clustering Methodology • Proposed by F. Can (ACM TOIS, 1993) [3] • Incremental Clustering Method with Low Update Cost • Seed-based Clustering Method • Based on the concept of seed powers • Seed powers are defined probabilistically • Documents with highest seed powers are selected as cluster seeds
Decoupling/Coupling Coefficients • Two important notions in the C2ICM method • used to calculate seed powers • Decoupling coefficient δi of document di: • the probability of selecting document di again, given document di itself • an index to measure the independence of di • Coupling coefficient ψi of document di: • ψi = 1 − δi • an index to measure the dependence of di
Seed Power • Seed power spi for document di measures the appropriateness (moderate dependence) of di as a cluster seed: spi = δi · ψi · Σj freq(di, tj) · δ′j · ψ′j • freq(di, tj): the occurrence frequency of term tj within document di • δ′j: decoupling coefficient for term tj • ψ′j: coupling coefficient for term tj
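The seed-power computation above can be sketched from a plain document-term frequency matrix. This is a hedged sketch: the exact definitions of the document-side coefficients δi, ψi and their term-side duals δ′j, ψ′j are given in Can (ACM TOIS, 1993) [3]; the code below follows the common cover-coefficient formulation and the quantities named on the slide.

```python
# Hedged sketch of C2ICM-style seed powers from a doc-term frequency matrix.
# delta_i (decoupling) is the diagonal of the cover-coefficient matrix,
# psi_i = 1 - delta_i (coupling); primed variants are the term-side duals.

def seed_powers(freq):
    """freq: list of rows, freq[i][j] = occurrences of term j in doc i."""
    m, n = len(freq), len(freq[0])
    row = [sum(freq[i]) for i in range(m)]                     # document lengths
    col = [sum(freq[i][j] for i in range(m)) for j in range(n)]  # term totals
    # document-side decoupling: delta_i = c_ii of the cover-coefficient matrix
    delta = [sum((freq[i][j] / row[i]) * (freq[i][j] / col[j])
                 for j in range(n) if freq[i][j]) for i in range(m)]
    # term-side decoupling: delta'_j (dual definition over documents)
    delta_t = [sum((freq[i][j] / col[j]) * (freq[i][j] / row[i])
                   for i in range(m) if freq[i][j]) for j in range(n)]
    sp = []
    for i in range(m):
        psi = 1.0 - delta[i]                                   # coupling psi_i
        sp.append(delta[i] * psi * sum(
            freq[i][j] * delta_t[j] * (1.0 - delta_t[j]) for j in range(n)))
    return sp
```

Documents with the highest seed powers are then chosen as cluster seeds; the remaining documents attach to the most similar seed.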
C2ICM Clustering Algorithm (1) • Initial phase • Select seeds based on the seed powers • Other documents are assigned to the cluster with the most similar seed (figure: clusters in red for F1 & Schumacher, green for U.S. Open Tennis)
C2ICM Clustering Algorithm (2) • Incremental update phase • Select new seeds based on the seed powers • Other documents are assigned to the cluster with the most similar seed (figure: clusters in red for F1 & Schumacher, green for U.S. Open Tennis, orange for Soccer World Cup)
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • C2ICM Clustering Method • F2ICM Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
F2ICM Clustering Method • Extension of the C2ICM method • Main differences • Introduction of a new document similarity measure based on the notion of the forgetting factor: it weights newer documents more heavily when generating clusters • Incremental maintenance of statistics • Automatic deletion of obsolete old documents
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Document forgetting model • Derivation of document similarity measure • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
Document Similarity Based on Forgetting Factor • New document similarity measure based on the document forgetting model • Assumption: each delivered document gradually loses its value (weight) as time passes • Derivation of a document similarity measure based on this assumption • put high weights on new documents and low weights on old ones, so old documents have little effect on clustering • Using the derived document similarity measure, we can achieve novelty-based clustering
Document Forgetting Model (1) • Ti: acquisition time of document di • The information value (weight) of di at current time t is defined as dwi = λ^(t − Ti) • The document weight exponentially decreases as time passes • λ (0 < λ < 1) determines the forgetting speed (figure: dwi starts at 1 at acquisition time Ti and decays toward 0 as t grows)
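The forgetting model above is a one-line computation; a minimal sketch, assuming discrete time steps and the exponential decay dwi = λ^(t − Ti):

```python
# Sketch of the document forgetting model: a document acquired at time T_i
# has weight dw_i = lam ** (t - T_i) at current time t, with 0 < lam < 1.

def doc_weight(t, T_i, lam):
    """Weight of a document acquired at time T_i, evaluated at time t."""
    assert 0 < lam < 1 and t >= T_i
    return lam ** (t - T_i)

# A just-arrived document has weight 1; older documents decay exponentially.
w_new = doc_weight(t=0, T_i=0, lam=0.5)   # 1.0
w_old = doc_weight(t=3, T_i=0, lam=0.5)   # 0.125
```

Smaller λ means faster forgetting; λ close to 1 keeps old documents relevant longer.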
Document Forgetting Model (2) • Why do we use the exponential forgetting model? • It inherits ideas from the behavioral law of human memory • The Power Law of Forgetting [1]: human memory exponentially decreases as time passes • Relationship with citation analysis: • Obsolescence (aging) of citations can be measured via citation rates • Some simple obsolescence models take exponential forms • Efficiency: based on the model, we can obtain an efficient statistics maintenance procedure • Simplicity: we can control the forgetting speed using the single parameter λ
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Document forgetting model • Derivation of document similarity measure • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
Our Approach for Document Similarity Derivation • Probabilistic derivation based on the document forgetting model • Let Pr(di, dj) be the probability of selecting the document pair (di, dj) from the document repository • We regard the co-occurrence probability Pr(di, dj) as their similarity sim(di, dj) (figure: selecting documents di and dj from the repository)
Derivation of Similarity Formula (1) • tdw: total weight of all the m documents • simple summation of all document weights: tdw = Σi dwi • Pr(di): subjective probability of selecting document di from the repository: Pr(di) = dwi / tdw • Since old documents have small document weights, their selection probabilities are small
Derivation of Similarity Formula (2) • Pr(tk|di): selection probability of term tk from document di: Pr(tk|di) = freq(di, tk) / docleni, where docleni = Σl freq(di, tl) • freq(di, tk): the number of occurrences of tk in di • this probability corresponds to term frequency (TF)
Derivation of Similarity Formula (3) • Pr(tk): occurrence probability of term tk • this probability corresponds to the document frequency of term tk • the reciprocal of df(tk) represents the IDF (inverse document frequency)
Derivation of Similarity Formula (4) • Using Bayes' theorem, Pr(di|tk) = Pr(tk|di) Pr(di) / Pr(tk) • Then we get the co-occurrence probability Pr(di, dj) = Σk Pr(di|tk) Pr(dj|tk) Pr(tk)
Derivation of Similarity Formula (5) • Therefore, the co-occurrence probability of di, dj is: Pr(di, dj) = Pr(di) Pr(dj) · Σk Pr(tk|di) Pr(tk|dj) / Pr(tk) • the summation is an inner product of document vectors based on TF-IDF weighting • The older a document di becomes, the smaller its similarity scores with other documents are • because old documents have low Pr(di) values, they get low similarity scores
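The derivation above can be sketched end-to-end. One assumption in this sketch: Pr(tk) is estimated as the marginal Σi Pr(tk|di) Pr(di), whereas the slides relate it to a document-frequency statistic; the qualitative behavior (TF-IDF-style inner product, scaled down for old documents) is the same.

```python
# Hedged sketch of the forgetting-factor similarity:
#   sim(di, dj) = Pr(di, dj)
#               = Pr(di) Pr(dj) * sum_k Pr(tk|di) Pr(tk|dj) / Pr(tk)
# i.e. a TF-IDF-style inner product scaled by the decaying document priors.

def similarity(freq, dw, i, j):
    """freq[a][k]: occurrences of term k in doc a; dw[a]: document weight."""
    m, n = len(freq), len(freq[0])
    tdw = sum(dw)                                   # total document weight
    pr_d = [w / tdw for w in dw]                    # Pr(d) = dw / tdw
    doclen = [sum(row) for row in freq]
    pr_t_d = [[freq[a][k] / doclen[a] for k in range(n)] for a in range(m)]
    # Assumption: marginal estimate of the term probability Pr(tk)
    pr_t = [sum(pr_t_d[a][k] * pr_d[a] for a in range(m)) for k in range(n)]
    return pr_d[i] * pr_d[j] * sum(
        pr_t_d[i][k] * pr_t_d[j][k] / pr_t[k]
        for k in range(n) if pr_t[k] > 0)

# Older documents have small dw, hence small Pr(d) and low similarity.
```

The measure is symmetric in i and j, and shrinks as either document's weight decays.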
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Document forgetting model • Derivation of document similarity measure • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
Updating Statistics and Probabilities • when t = τ + 1: 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. delete old documents 5. present new clustering result
Approach to Update Processing (1) • In every incremental clustering step, we have to calculate document similarities • To compute similarities, we need to calculate document statistics and probabilities beforehand • It is inefficient to compute the statistics from scratch every time • Store the calculated statistics and probabilities and reuse them for later computation → incremental update processing
Approach to Update Processing (2) • Formulation • d1, ..., dm: document set consisting of m documents • t1, ..., tn: index terms that appear in d1, ..., dm • t = τ: the latest update time of the document set • Assumption • when t = τ + 1, new documents dm+1, ..., dm+m′ are appended to the document set • the new documents introduce additional terms tn+1, ..., tn+n′ • m ≫ m′ and n ≫ n′ are satisfied
Update Processing Method (1) • Update of document weight dwi • Since unit time has passed since the previous update time t = τ, the weight of each document decreases according to dwi|τ+1 = λ · dwi|τ • For each new document, assign the initial value 1
Update Processing Method (2) • Example of incremental update processing: updating from tdw|τ to tdw|τ+1 • Naive approach: compute tdw|τ+1 = Σi dwi|τ+1 from scratch • time consuming!
Update Processing Method (3) • Smart approach: compute tdw|τ+1 incrementally as tdw|τ+1 = λ · tdw|τ + m′ • exponential weighting enables efficient incremental computation
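The naive and incremental updates above can be checked against each other. A minimal sketch, assuming each of the m′ new documents arrives with weight 1:

```python
# Incremental total-weight update: with exponential decay, every stored
# weight is multiplied by lam per time step and each new document starts
# at weight 1, so
#   tdw|_{tau+1} = lam * tdw|_tau + m_new
# agrees with recomputing the sum over all documents from scratch.

def tdw_naive(weights_prev, lam, m_new):
    decayed = [w * lam for w in weights_prev]       # decay every old document
    return sum(decayed) + 1.0 * m_new               # new documents start at 1

def tdw_incremental(tdw_prev, lam, m_new):
    return lam * tdw_prev + m_new                   # O(1) instead of O(m)

weights = [1.0, 0.5, 0.25]
lam = 0.5
assert abs(tdw_naive(weights, lam, 2)
           - tdw_incremental(sum(weights), lam, 2)) < 1e-12
```

The same trick (decay is a uniform scalar multiply) underlies the other incremental update formulas for df(tk) and the decoupling coefficients.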
Update Processing Method (4) • The occurrence probability of each document Pr(di) can then be easily recalculated • We need to calculate term frequencies tf(di, tk) only for the new documents dm+1, ..., dm+m′
Update Processing Method (5) • Update formulas for the document frequency of each term df(tk) • we expand the formula of df(tk) and store each partial term permanently • df(tk) can then be updated incrementally using the formula
Update Processing Method (6) • Calculation of the new decoupling coefficient δi is easy • Update formulas for the decoupling coefficients δ′k of terms • incremental update is also possible • details are shown in the paper
Summary of Update Processing • The following statistics are maintained persistently (m: no. of documents, n: no. of terms) • dwi: weight of document di (1 ≤ i ≤ m) • tdw: total weight of documents • freq(di, tk): term occurrence frequency (1 ≤ i ≤ m, 1 ≤ k ≤ n) • docleni: document length (1 ≤ i ≤ m) • statistics to compute df(tk) (1 ≤ k ≤ n) • statistics to compute δi (1 ≤ i ≤ m) • Incremental statistics update cost • O(m + m′n′), roughly O(m + n), with storage cost O(m + n): linear cost • cf. the naive (non-incremental) method costs O(mn)
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
Expiration of Old Documents (1) • when t = τ + 1: 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. delete old documents 5. present new clustering result
Expiration of Old Documents (2) • Two reasons to delete old documents: • reduction of storage area • old documents have only tiny effects on the resulting clustering structure • Our approach: • If dwi < ε (ε is a small constant parameter), delete document di • When we delete di, related statistics values are deleted • e.g., freq(di, tk) • details are in the proceedings and [6]
Parameter Setting Methods • F2ICM uses two parameters in its algorithms: • forgetting factor λ (0 < λ < 1): specifies the forgetting speed • expiration parameter ε (0 < ε < 1): threshold value for document deletion • We use the following metaphors: • β: half-life span of the value of a document • λ^β = 1/2 is satisfied, namely λ = (1/2)^(1/β) • γ: life span of a document • ε is determined by ε = λ^γ
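The two metaphors above fix both parameters. A minimal sketch (the symbol names β and γ follow the slide's half-life / life-span description):

```python
# Parameter setting via the half-life / life-span metaphors:
#   lam ** beta  == 1/2   ->  lam = 0.5 ** (1 / beta)
#   eps = lam ** gamma    ->  a document is expired after gamma steps.

def forgetting_factor(beta):
    """lam such that a document loses half its value after beta steps."""
    return 0.5 ** (1.0 / beta)

def expiration_threshold(lam, gamma):
    """eps: the weight a document has decayed to after gamma steps."""
    return lam ** gamma

def expire(docs, weights, eps):
    """Keep only documents whose weight is still at least eps."""
    return [(d, w) for d, w in zip(docs, weights) if w >= eps]

lam = forgetting_factor(beta=7)             # half-life of one week
eps = expiration_threshold(lam, gamma=30)   # expire after 30 days
```

With β = 7 and γ = 30 (the settings used in the experiments), λ ≈ 0.906 and ε ≈ 0.05.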
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
Dataset and Parameter Settings • Dataset: Mainichi Daily Newspaper articles • Each article consists of the following information: • issue date • subject area (e.g., economy, sports) • keyword list (50–150 words in Japanese) • The articles used in the experiment: • issue date: January 1994 to February 1994 • subject area: international affairs • Parameter settings • nc (no. of clusters) = 10 • β (half-life span) = 7: the value of an article reduces to 1/2 in one week • γ (life span) = 30: every document is deleted after 30 days
Computational Cost for Clustering Sequences • Plot of CPU time and response time for the clustering performed each day • Costs increase linearly until the 30th day, then become almost constant
Overview of Clustering Result (1) • Summarization of the 10 clusters after 30 days (as of January 31, 1994)
Overview of Clustering Result (2) • Summarization of the 10 clusters after 57 days (as of March 1, 1994)
Summary of the Experiment • Brief observations • F2ICM groups similar articles into a cluster as long as an appropriate seed is selected • But a cluster obtained in the experiment usually contains multiple topics, and different clusters contain similar topics: clusters are not well separated • Reasons for the observed phenomena: • the selected seeds are not well separated in topics: a more sophisticated seed selection method is required • the number of keywords per article is rather small (50–150 words)
Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work
Conclusions and Future Work • Conclusions • Development of an on-line clustering method which considers the novelty of documents • Introduction of the document forgetting model • F2ICM: Forgetting Factor-based Incremental Clustering Method • Incremental statistics update method (linear update cost) • Automatic document expiration and parameter setting methods • Preliminary report of the experiments • Current and Future Work • Revision of the clustering algorithms based on the Scatter/Gather approach [4] • More detailed experiments and their evaluation • Development of automatic parameter tuning methods