
An On-line Document Clustering Method Based on Forgetting Factors



  1. An On-line Document Clustering Method Based on Forgetting Factors Yoshiharu Ishikawa, Yibing Chen Hiroyuki Kitagawa University of Tsukuba, Japan ECDL2001

  2. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  3. Background • The Internet has enabled on-line document delivery services • newsfeed services over the network • periodically issued on-line journals • Important technologies (and applications) for on-line documents • information filtering • document summarization, information extraction • topic detection and tracking (TDT) • Clustering works as a core technique for these applications

  4. Our Objectives (1) • Development of an on-line clustering method which considers the novelty of each document • Presents a snapshot of clusters in an up-to-date manner • Example: articles from a sports news feed [timeline figure: Formula 1 & M. Schumacher, U.S. Open Tennis, Soccer World Cup, other articles]

  5. Our Objectives (2) • Development of a novelty-based clustering method for on-line documents • Features: • It gives higher weight to newer documents than to older ones and forgets obsolete ones • introduction of a new document similarity measure that considers the novelty and obsolescence of documents • Incremental clustering processing • low processing cost to generate a new clustering result • Automatic maintenance of target documents • obsolete documents are automatically deleted from the clustering target

  6. Incremental Clustering Process (1) • when t = 0 (initial state) • [figure] 1. arrival of new documents 2. store new documents in the repository 3. calculate and store statistics 4. cluster documents (Cluster 1 ... Cluster k) and present the result

  7. Incremental Clustering Process (2) • when t = 1 • [figure] 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. cluster documents (Cluster 1 ... Cluster k) and present the result

  8. Incremental Clustering Process (3) • when t = τ + 1 • [figure: t = 0, ..., t = τ, t = τ + 1] 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. delete old documents 5. cluster documents (Cluster 1 ... Cluster k) and present the result

  9. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • C2ICM Clustering Method • F2ICM Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  10. C2ICM Clustering Method • Cover-Coefficient-based Incremental Clustering Methodology • Proposed by F. Can (ACM TOIS, 1993) [3] • Incremental Clustering Method with Low Update Cost • Seed-based Clustering Method • Based on the concept of seed powers • Seed powers are defined probabilistically • Documents with highest seed powers are selected as cluster seeds

  11. Decoupling/Coupling Coefficients • Two important notions in the C2ICM method • used to calculate seed powers • Decoupling coefficient of document di: • the probability that document di is selected when document di itself is given • an index to measure the independence of di • Coupling coefficient of document di: • an index to measure the dependence of di

  12. Seed Power • Seed power spi for document di measures the appropriateness (moderate dependence) of di as a cluster seed • freq(di, tj): the occurrence frequency of term tj within document di • the decoupling coefficient for term tj • the coupling coefficient for term tj
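As a hedged illustration of the slide above, seed powers can be sketched from a term-frequency matrix using cover-coefficient-style quantities. The names `delta`/`phi` (document side) and `delta_t`/`phi_t` (term side) are our own, and the normalization follows Can's cover-coefficient construction; the paper's exact formulas are not reproduced on the slide, so treat this as a sketch, not the authors' implementation:

```python
import numpy as np

def seed_powers(freq):
    """Seed powers in the spirit of Can's cover-coefficient construction.

    freq: m x n matrix, freq[i, k] = occurrences of term t_k in document d_i.
    Returns sp_i = delta_i * phi_i * sum_k freq[i, k] * delta_t[k] * phi_t[k].
    """
    freq = np.asarray(freq, dtype=float)
    row_sum = freq.sum(axis=1)   # document lengths
    col_sum = freq.sum(axis=0)   # term totals
    # document decoupling delta_i: self-cover probability of d_i
    delta = np.einsum('ik,k,ik->i', freq, 1.0 / col_sum, freq) / row_sum
    phi = 1.0 - delta            # document coupling
    # term-side decoupling/coupling, computed symmetrically on the transpose
    delta_t = np.einsum('ik,i,ik->k', freq, 1.0 / row_sum, freq) / col_sum
    phi_t = 1.0 - delta_t
    # moderate dependence scores: documents with the highest values become seeds
    return delta * phi * (freq @ (delta_t * phi_t))
```

Documents whose seed power is largest are then chosen as cluster seeds.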

  13. C2ICM Clustering Algorithm (1) • Initial phase • Select new seeds based on the seed powers • Other documents are assigned to the cluster with the most similar seed • [figure: red = F1 & Schumacher, green = U.S. Open Tennis]

  14. C2ICM Clustering Algorithm (2) • Incremental update phase • Select new seeds based on the seed powers • Other documents are assigned to the cluster with the most similar seed • [figure: red = F1 & Schumacher, green = U.S. Open Tennis, orange = Soccer World Cup]

  15. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • C2ICM Clustering Method • F2ICM Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  16. F2ICM Clustering Method • Extension of the C2ICM method • Main differences • Introduction of a new document similarity measure based on the notion of the forgetting factor: it gives newer documents higher weight when generating clusters • Incremental maintenance of statistics • Automatic deletion of obsolete old documents

  17. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Document forgetting model • Derivation of document similarity measure • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  18. Document Similarity Based on Forgetting Factor • New document similarity measure based on the document forgetting model • Assumption: each delivered document gradually loses its value (weight) as time passes • Derivation of a document similarity measure based on this assumption • put high weights on new documents and low weights on old ones → old documents have little effect on clustering • Using the derived document similarity measure, we can achieve novelty-based clustering

  19. Document Forgetting Model (1) • Ti: acquisition time of document di • The information value (weight) of di at current time t is defined as dwi = λ^(t − Ti) • The document weight decreases exponentially as time passes • λ (0 < λ < 1) determines the forgetting speed • [figure: dwi = 1 at t = Ti, decaying toward 0 as t grows]
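The forgetting model reduces to a one-line function. A minimal sketch, assuming (as the slide states) that a document's weight starts at 1 at its acquisition time and decays by the factor λ per unit time:

```python
def document_weight(t, T_i, lam):
    """Weight of a document acquired at time T_i, observed at current time t.

    lam (0 < lam < 1) is the forgetting factor: the weight is 1 when
    t == T_i and decays exponentially as time passes.
    """
    assert 0.0 < lam < 1.0 and t >= T_i
    return lam ** (t - T_i)
```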

  20. Document Forgetting Model (2) • Why do we use the exponential forgetting model? • It inherits ideas from the behavioral laws of human memory • The Power Law of Forgetting [1]: human memory decays exponentially as time passes • Relationship with citation analysis: • obsolescence (aging) of citations can be measured via citation rates • some simple obsolescence models take exponential forms • Efficiency: based on this model, we can obtain an efficient statistics maintenance procedure • Simplicity: we can control the forgetting speed using the single parameter λ

  21. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Document forgetting model • Derivation of document similarity measure • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  22. Our Approach for Document Similarity Derivation • Probabilistic derivation based on the document forgetting model • Let Pr(di, dj) be the probability of selecting the document pair (di, dj) from the document repository • We regard the cooccurrence probability Pr(di, dj) as their similarity sim(di, dj) • [figure: doc di, doc dj, Pr(di, dj)]

  23. Derivation of Similarity Formula (1) • tdw: total weight of all the m documents • simple summation of all document weights: tdw = Σi dwi • Pr(di): subjective probability of selecting document di from the repository: Pr(di) = dwi / tdw • Since old documents have small document weights, their selection probabilities are small

  24. Derivation of Similarity Formula (2) • Pr(tk|di): selection probability of term tk from document di: Pr(tk|di) = freq(di, tk) / Σl freq(di, tl) • freq(di, tk): the number of occurrences of tk in di • the probability corresponds to term frequency

  25. Derivation of Similarity Formula (3) • Pr(tk): occurrence probability of term tk • this probability corresponds to the document frequency df(tk) of term tk • the reciprocal of df(tk) represents the IDF (inverse document frequency)

  26. Derivation of Similarity Formula (4) • Using Bayes' theorem, Pr(di | tk) = Pr(tk | di) Pr(di) / Pr(tk) • Then we get Pr(di, dj) = Σk Pr(di | tk) Pr(dj | tk) Pr(tk)

  27. Derivation of Similarity Formula (5) • Therefore, the cooccurrence probability of di, dj is: sim(di, dj) = Pr(di, dj) = Pr(di) Pr(dj) Σk Pr(tk | di) Pr(tk | dj) / Pr(tk) • an inner product of document vectors based on TF-IDF weighting • The older a document di becomes, the smaller its similarity scores with other documents are • because old documents have low Pr(di) values → old documents have low similarity scores
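The derivation chain of slides 23 to 27 can be sketched end-to-end. This is an illustrative implementation rather than the paper's code; in particular, Pr(tk) is obtained here by marginalizing over documents, whereas the slides describe it via document frequency, so the normalization differs from the paper's exact formula:

```python
import numpy as np

def similarity_matrix(freq, dw):
    """sim(d_i, d_j) = Pr(d_i) Pr(d_j) * sum_k Pr(t_k|d_i) Pr(t_k|d_j) / Pr(t_k).

    freq: m x n term-frequency matrix; dw: document weights dw_i from the
    forgetting model. Old documents (small dw_i) get small Pr(d_i), hence
    small similarity scores with every other document.
    """
    freq = np.asarray(freq, dtype=float)
    dw = np.asarray(dw, dtype=float)
    p_d = dw / dw.sum()                              # Pr(d_i) = dw_i / tdw
    p_t_d = freq / freq.sum(axis=1, keepdims=True)   # Pr(t_k | d_i): term frequency
    p_t = p_d @ p_t_d                                # Pr(t_k), marginalized (assumption)
    w = p_t_d / np.sqrt(p_t)                         # split the 1 / Pr(t_k) (IDF-like) factor
    v = p_d[:, None] * w                             # TF-IDF-style weighted document vectors
    return v @ v.T                                   # pairwise inner products
```

Document 2 below has the same content as document 0 but a smaller (older) weight, so its similarity scores come out lower, matching the slide's claim.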

  28. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Document forgetting model • Derivation of document similarity measure • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  29. Updating Statistics and Probabilities • when t = τ + 1 • [figure: t = 0, ..., t = τ, t = τ + 1] 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. delete old documents 5. present the new clustering result

  30. Approach to Update Processing (1) • In every incremental clustering step, we have to calculate document similarities • To compute similarities, we need to calculate document statistics and probabilities beforehand • It is inefficient to compute the statistics from scratch every time • Store the calculated statistics and probabilities and utilize them in later computations → incremental update processing

  31. Approach to Update Processing (2) • Formulation • d1, ..., dm: document set consisting of m documents • t1, ..., tn: index terms that appear in d1, ..., dm • t = τ: the latest update time of the document set • Assumption • when t = τ + 1, new documents dm+1, ..., dm+m′ are appended to the document set • the new documents dm+1, ..., dm+m′ introduce additional terms tn+1, ..., tn+n′ • m >> m′ and n >> n′ are satisfied

  32. Update Processing Method (1) • Update of document weight dwi • Since one unit time has passed since the previous update time t = τ, the weight of each document decreases by the factor λ: dwi|τ+1 = λ · dwi|τ • For each new document, assign the initial value 1

  33. Update Processing Method (2) • Example of incremental update processing: updating from tdw|τ to tdw|τ+1 • Naive approach: compute tdw|τ+1 from scratch as Σi dwi|τ+1 • time consuming!

  34. Update Processing Method (3) • Smart approach: compute tdw|τ+1 incrementally: tdw|τ+1 = λ · tdw|τ + m′ • exponential weighting enables efficient incremental computation
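A minimal sketch of the smart approach: because every stored weight shrinks by the same factor λ per unit time, the total weight can be carried forward without touching individual documents (assuming, as in the slides, that each of the m′ new documents enters with weight 1):

```python
def update_total_weight(tdw_prev, lam, n_new):
    """Incremental update of the total document weight tdw.

    Every existing weight decays by lam over one unit time, so their sum
    decays by lam as well; each of the n_new documents arrives with
    initial weight 1.
    """
    return lam * tdw_prev + n_new
```

The incremental result agrees with recomputing the sum from scratch, which is the point of the exponential weighting.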

  35. Update Processing Method (4) • The occurrence probability of each document, Pr(di) = dwi / tdw, can easily be recalculated • We need to calculate term frequencies freq(di, tk) only for the new documents dm+1, ..., dm+m′

  36. Update Processing Method (5) • Update formulas for the document frequency of each term, df(tk) • we expand the formula of df(tk) and store each intermediate statistic persistently • df(tk) can then be incrementally updated using the formula

  37. Update Processing Method (6) • Calculation of the new decoupling coefficient δi is easy • Update formulas for the decoupling coefficients of terms • incremental update is also possible • details are shown in the paper

  38. Summary of Update Processing • The following statistics are maintained persistently (m: no. of documents, n: no. of terms) • dwi: weight of document di (1 ≤ i ≤ m) • tdw: total weight of documents • freq(di, tk): term occurrence frequency (1 ≤ i ≤ m, 1 ≤ k ≤ n) • docleni: document length (1 ≤ i ≤ m) • statistics to compute df(tk) (1 ≤ k ≤ n) • statistics to compute the decoupling coefficients (1 ≤ i ≤ m) • Incremental statistics update cost • O(m + m′n) ≈ O(m + n) with storage cost O(m + n): linear cost • cf. the naive (non-incremental) method costs O(mn)

  39. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  40. Expiration of Old Documents (1) • when t = τ + 1 • [figure: t = 0, ..., t = τ, t = τ + 1] 1. arrival of new documents 2. store new documents in the repository 3. update statistics 4. delete old documents 5. present the new clustering result

  41. Expiration of Old Documents (2) • Two reasons to delete old documents: • reduction of storage space • old documents have only a tiny effect on the resulting clustering structure • Our approach: • if dwi < ε (ε is a small constant parameter), delete document di • when we delete di, the related statistics values are deleted as well • e.g., freq(di, tk) • details are in the proceedings and [6]
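The expiration rule is a simple threshold test. A sketch (the dict-of-weights representation is our own choice, and the accompanying deletion of per-document statistics such as freq(di, tk) is only noted in a comment):

```python
def expire_documents(dw, eps):
    """Keep only documents whose weight is still at least eps.

    dw: dict mapping document id -> current weight dw_i.
    Documents with dw_i < eps are dropped; in the full method their
    per-document statistics (e.g. freq(d_i, t_k)) are deleted as well.
    """
    return {doc_id: w for doc_id, w in dw.items() if w >= eps}
```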

  42. Parameter Setting Methods • F2ICM uses two parameters in its algorithms: • forgetting factor λ (0 < λ < 1): specifies the forgetting speed • expiration parameter ε (0 < ε < 1): threshold value for document deletion • We use the following metaphors: • β: half-life span of the value of a document • λ^β = 1/2 is satisfied, namely λ = 2^(−1/β) • γ: life span of a document • ε is determined by ε = λ^γ
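The two metaphors pin down λ and ε directly. A sketch, assuming β and γ are measured in the same time units as the update interval:

```python
def forgetting_parameters(beta, gamma):
    """Derive (lam, eps) from the half-life beta and the life span gamma.

    lam solves lam ** beta == 1/2, i.e. lam = 2 ** (-1 / beta);
    eps = lam ** gamma, so a document is expired once its weight falls
    below what it would be after gamma time units.
    """
    lam = 2.0 ** (-1.0 / beta)
    eps = lam ** gamma
    return lam, eps
```

With the experiment's settings (β = 7, γ = 30), this yields a weight that halves every week and documents that expire after 30 days.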

  43. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  44. Dataset and Parameter Settings • Dataset: Mainichi Daily Newspaper articles • Each article consists of the following information: • issue date • subject area (e.g., economy, sports) • keyword list (50–150 words in Japanese) • The articles used in the experiment: • issue date: January 1994 to February 1994 • subject area: international affairs • Parameter settings • nc (no. of clusters) = 10 • β (half-life span) = 7: the value of an article reduces to 1/2 in one week • γ (life span) = 30: every document is deleted after 30 days

  45. Computational Cost for Clustering Sequences • Plot of CPU time and response time for the clustering performed every day • Costs increase linearly until the 30th day, then become almost constant

  46. Overview of Clustering Result (1) • Summarization of 10 clusters after 30 days (at January 31, 1994)

  47. Overview of Clustering Result (2) • Summarization of 10 clusters after 57 days (at March 1, 1994)

  48. Summary of the Experiment • Brief observations • F2ICM groups similar articles into a cluster as long as an appropriate seed is selected • But a cluster obtained in the experiment usually contains multiple topics, and different clusters contain similar topics: clusters are not well separated • Reasons for the observed phenomena: • selected seeds are not well separated in topics: a more sophisticated seed selection method is required • the number of keywords for an article is rather small (50–150 words)

  49. Outline • Background and Objectives • F2ICM Incremental Document Clustering Method • Document Similarity Based on Forgetting Factor • Updating Statistics and Probabilities • Document Expiration and Parameter Setting • Experimental Results • Conclusions and Future Work

  50. Conclusions and Future Work • Conclusions • Development of an on-line clustering method which considers the novelty of documents • Introduction of the document forgetting model • F2ICM: Forgetting Factor-based Incremental Clustering Method • Incremental statistics update method (linear update cost) • Automatic document expiration and parameter setting methods • Preliminary report of the experiments • Current and Future Work • Revision of the clustering algorithms based on the Scatter/Gather approach [4] • More detailed experiments and their evaluation • Development of automatic parameter tuning methods
