
Bayesian Networks in Document Clustering

Explore the application of Bayesian networks in document clustering for efficient information retrieval. Learn about latent variables, PLSA, PHITS, and the EM algorithm for document analysis.



Presentation Transcript


  1. Bayesian Networks in Document Clustering • Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak • Institute of Computer Science, Polish Academy of Sciences, Warsaw • Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"

  2. A search engine with SOM-based document set representation

  3. Map visualizations in 3D (BEATCA)

  4. The preparation of documents is done by an indexer, which turns each document into a vector-space model representation • The indexer also identifies frequent phrases in the document set for clustering and labelling purposes • Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded • The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation • ‘The best’ map (with respect to some similarity measure) is used by the query processor in response to the user’s query

  5. Document model in search engines • In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains. [Figure: axes dog, food, walk; example documents "My dog likes this food" and "When walking, I take some food"]
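
The vector model on this slide can be illustrated with a minimal sketch. It assumes a three-word vocabulary taken from the figure axes (dog, food, walk) and uses a crude, purely illustrative stemmer (`toy_stem`, a hypothetical helper) so that "walking" maps to "walk". The cosine measure is only one possible similarity measure and is an assumption here, not something the slides specify.

```python
import math
from collections import Counter

def toy_stem(word):
    # Crude illustrative stemmer ("walking" -> "walk"); not a real stemming algorithm.
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

def to_vector(text, vocabulary):
    # Lower-case, strip simple punctuation, stem, then count vocabulary terms.
    tokens = [toy_stem(w.strip(",.").lower()) for w in text.split()]
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    # Cosine of the angle between two term-count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["dog", "food", "walk"]          # axes shown in the slide figure
vec_a = to_vector("My dog likes this food", vocabulary)
vec_b = to_vector("When walking, I take some food", vocabulary)
similarity = cosine(vec_a, vec_b)
print(vec_a, vec_b, similarity)
```

The two documents share only the term "food", so they point in different but not orthogonal directions of the three-dimensional term space.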

  6. Clustering document vectors • Document space mapped onto a 2D map [Figure labels: r, x, m; a thick arrow marks a strong change of position] • Important difference from general clustering: not only should each cluster contain similar documents, but neighboring clusters should also be similar

  7. Our problem • Instability • Pre-defined major themes needed • Our approach • Find a coarse clustering into a few themes

  8. Bayesian Networks in Document Clustering • Search engines based on SOM document maps require an initial document clustering in order to present results in a meaningful way. • Methods based on Latent Semantic Indexing appear to be promising for this purpose. • One of them, PLSA, has been empirically investigated. • A modification of the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.

  9. A Bayesian Network • Represents a joint probability distribution as a product of the conditional probabilities of children given their parents in a directed acyclic graph • High compression, simplification of reasoning [Figure: example network with nodes owner, walk, Chappi®, dog, food, meat]
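
The factorization described on this slide can be sketched in a few lines. The network structure below (dog → food, dog → walk) and all probability values are invented for illustration; the slide's figure does not survive in the transcript, so this is not the authors' network.

```python
from itertools import product

# Hypothetical structure: dog -> food, dog -> walk (all variables are True/False).
parents = {"dog": [], "food": ["dog"], "walk": ["dog"]}

# Made-up conditional probability tables: P(variable = True | parent values).
p_true = {
    "dog": {(): 0.3},
    "food": {(True,): 0.8, (False,): 0.2},
    "walk": {(True,): 0.6, (False,): 0.1},
}

def joint(assignment):
    # Joint probability = product over nodes of P(node | its parents).
    prob = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[p] for p in pars)
        p = p_true[var][key]
        prob *= p if assignment[var] else 1.0 - p
    return prob

# Summing the joint over all 2^3 assignments should give 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
print(joint({"dog": True, "food": True, "walk": True}), total)
```

In this toy case the network needs 5 numbers (1 + 2 + 2), whereas an unrestricted joint table over three binary variables needs 7, which is the "compression" the slide refers to; the saving grows quickly with the number of variables.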

  10. BN application in text processing • Document classification • Document Clustering • Query Expansion

  11. Hidden variable approaches • PLSA (Probabilistic Latent Semantic Analysis) • PHITS (Probabilistic Hyperlink Analysis) • Combined PLSA/PHITS • Assumption of a hidden variable expressing the topic of the document • The topic probabilistically influences the appearance of the document (its links in PHITS, its terms in PLSA)

  12. PLSA - concept • Let $N$ be the term-document matrix of word counts, i.e., $N_{ij}$ denotes how often a term (single word or phrase) $t_i$ occurs in document $d_j$ • Probabilistic decomposition into factors $z_k$ ($1 \le k \le K$): $P(t_i \mid d_j) = \sum_k P(t_i \mid z_k)\,P(z_k \mid d_j)$, with non-negative probabilities and two sets of normalization constraints: $\sum_i P(t_i \mid z_k) = 1$ for all $k$ and $\sum_k P(z_k \mid d_j) = 1$ for all $j$ [Figure: network with nodes D, Z (hidden variable), T1, T2, ..., Tn]
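
As a small numerical check of the decomposition on this slide, the sketch below draws random factor distributions that satisfy the two normalization constraints and confirms that the resulting P(t_i | d_j) sums to 1 over terms for every document. The sizes and the random values are arbitrary assumptions chosen only for illustration.

```python
import random

def normalize(values):
    # Scale non-negative values so that they sum to 1.
    total = sum(values)
    return [v / total for v in values]

random.seed(0)
n_terms, n_docs, n_factors = 6, 4, 2   # illustrative sizes

# p_t_given_z[k][i] = P(t_i | z_k); each row sums to 1 over terms i.
p_t_given_z = [normalize([random.random() for _ in range(n_terms)])
               for _ in range(n_factors)]

# p_z_given_d[j][k] = P(z_k | d_j); each row sums to 1 over factors k.
p_z_given_d = [normalize([random.random() for _ in range(n_factors)])
               for _ in range(n_docs)]

def p_term_given_doc(i, j):
    # P(t_i | d_j) = sum_k P(t_i | z_k) P(z_k | d_j)
    return sum(p_t_given_z[k][i] * p_z_given_d[j][k] for k in range(n_factors))

# For each document, the mixture must again be a distribution over terms.
column_sums = [sum(p_term_given_doc(i, j) for i in range(n_terms))
               for j in range(n_docs)]
print(column_sums)
```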

  13. PLSA - concept • PLSA aims at maximizing $L := \sum_{i,j} N_{ij} \log \sum_k P(t_i \mid z_k)\,P(z_k \mid d_j)$ • Factors $z_k$ can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence) • The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of $L$ • Different factors usually capture distinct "topics" of a document collection • By clustering documents according to their dominant factors, useful topic-specific document clusters often emerge [Figure: network with nodes D, Z (hidden variable), T1, T2, ..., Tn]
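
A minimal sketch of how EM can climb the log-likelihood L is given below. It uses the textbook PLSA updates (posterior P(z_k | t_i, d_j) in the E-step, re-normalized expected counts in the M-step), which is an assumption about the procedure rather than the authors' implementation. The count matrix `N` is invented toy data, and the function names are hypothetical.

```python
import math
import random

def normalize(values):
    # Scale to sum 1; fall back to uniform if everything is zero.
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def log_likelihood(N, p_tz, p_zd):
    # L = sum_{i,j} N_ij * log sum_k P(t_i|z_k) P(z_k|d_j)
    K = len(p_tz)
    return sum(
        n * math.log(sum(p_tz[k][i] * p_zd[j][k] for k in range(K)))
        for i, row in enumerate(N) for j, n in enumerate(row) if n
    )

def plsa_em(N, K, iterations=50, seed=0):
    rng = random.Random(seed)
    I, J = len(N), len(N[0])
    p_tz = [normalize([rng.random() + 0.1 for _ in range(I)]) for _ in range(K)]
    p_zd = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(J)]
    history = [log_likelihood(N, p_tz, p_zd)]
    for _ in range(iterations):
        new_tz = [[0.0] * I for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(J)]
        for i in range(I):
            for j in range(J):
                if N[i][j] == 0:
                    continue
                # E-step: posterior P(z_k | t_i, d_j) for this cell.
                post = normalize([p_tz[k][i] * p_zd[j][k] for k in range(K)])
                # Accumulate expected counts for the M-step.
                for k in range(K):
                    new_tz[k][i] += N[i][j] * post[k]
                    new_zd[j][k] += N[i][j] * post[k]
        # M-step: re-normalize expected counts into probabilities.
        p_tz = [normalize(row) for row in new_tz]
        p_zd = [normalize(row) for row in new_zd]
        history.append(log_likelihood(N, p_tz, p_zd))
    return p_tz, p_zd, history

# Toy term-document counts (rows: terms, columns: documents); invented data.
N = [[4, 3, 0, 0],
     [2, 5, 0, 1],
     [0, 0, 3, 4],
     [0, 1, 5, 2]]
p_tz, p_zd, history = plsa_em(N, K=2)
# Cluster each document by its dominant factor.
clusters = [max(range(2), key=lambda k: p_zd[j][k]) for j in range(len(N[0]))]
print(history[0], history[-1], clusters)
```

The recorded `history` should never decrease, which reflects the fact that EM only finds a local maximum: a different `seed` may end at a different value of L and a different clustering.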

  14. EM algorithm – step 0: Z randomly initialized

      Data before initialization (Z unknown):

      D   Z   T1  T2  ...  Tn
      1   ?   1   0   ...  1
      2   ?   0   0   ...  1
      3   ?   1   1   ...  1
      4   ?   0   1   ...  1
      5   ?   1   0   ...  0
      ...

      Data after random initialization of Z:

      D   Z   T1  T2  ...  Tn
      1   1   1   0   ...  1
      2   2   0   0   ...  1
      3   1   1   1   ...  1
      4   1   0   1   ...  1
      5   2   1   0   ...  0
      ...

  15. EM algorithm – step 1: BN trained

      Data:

      D   Z   T1  T2  ...  Tn
      1   1   1   0   ...  1
      2   2   0   0   ...  1
      3   1   1   1   ...  1
      4   1   0   1   ...  1
      5   2   1   0   ...  0
      ...

      [Figure: network with nodes D, Z (hidden variable), T1, T2, ..., Tn]

  16. EM algorithm – step 2: Z sampled from the BN • Z is sampled for each record according to the probability distribution P(Z=1|D=d,T1=t1,...,Tn=tn), P(Z=2|D=d,T1=t1,...,Tn=tn), ... • Go to step 1 and repeat until convergence (the Z assignment is "stable")

      Data after resampling Z:

      D   Z   T1  T2  ...  Tn
      1   2   1   0   ...  1
      2   2   0   0   ...  1
      3   1   1   1   ...  1
      4   2   0   1   ...  1
      5   1   1   0   ...  0
      ...

      [Figure: network with nodes D, Z (hidden variable), T1, T2, ..., Tn]
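
Steps 0 to 2 can be sketched as a small sampling loop. This is a simplified illustration under stated assumptions, not the authors' code: the D node is dropped, Z is the only parent of each binary term indicator, probabilities are Laplace-smoothed (`alpha`), and classes are numbered from 0 rather than 1. The records are the three visible columns (T1, T2, Tn) of the slide's data table.

```python
import math
import random

def train(records, z, K, alpha=1.0):
    # Step 1: estimate P(Z) and P(T_i = 1 | Z) from the completed data.
    n_terms = len(records[0])
    counts = [0] * K
    ones = [[0] * n_terms for _ in range(K)]
    for rec, k in zip(records, z):
        counts[k] += 1
        for i, t in enumerate(rec):
            ones[k][i] += t
    prior = [(counts[k] + alpha) / (len(records) + K * alpha) for k in range(K)]
    p_term = [[(ones[k][i] + alpha) / (counts[k] + 2 * alpha)
               for i in range(n_terms)] for k in range(K)]
    return prior, p_term

def posterior(rec, prior, p_term):
    # P(Z = k | T_1 = t_1, ..., T_n = t_n), computed in log space for stability.
    logs = []
    for k, pk in enumerate(prior):
        lp = math.log(pk)
        for i, t in enumerate(rec):
            lp += math.log(p_term[k][i] if t else 1.0 - p_term[k][i])
        logs.append(lp)
    top = max(logs)
    weights = [math.exp(lp - top) for lp in logs]
    total = sum(weights)
    return [w / total for w in weights]

def stochastic_em(records, K, max_iter=100, seed=0):
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in records]            # step 0: random Z
    for _ in range(max_iter):
        prior, p_term = train(records, z, K)           # step 1: train the model
        new_z = [rng.choices(range(K), weights=posterior(r, prior, p_term))[0]
                 for r in records]                     # step 2: resample Z
        if new_z == z:                                 # assignment is "stable"
            break
        z = new_z
    return z

# Visible columns T1, T2, Tn of the five records shown on the slides.
records = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 0]]
z = stochastic_em(records, K=2)
print(z)
```

Because Z is sampled rather than averaged over, the assignment can keep fluctuating; the `max_iter` cap is a practical safeguard added for this sketch.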

  17. The problem • Too many adjustable variables • Pre-defined clusters not identified • Long computation times • Instability

  18. Solution • Our suggestion • Use the "sharp version" of "Naive Bayes": each document is assigned to the "most probable class" • We were successful • Up to five classes well clustered • High speed (with 20,000 documents)
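
One way to read the "sharp version" is as a hard-assignment loop: train Naive Bayes on the current classes, then move every document to its single most probable class. The sketch below follows that reading under stated assumptions (binary term indicators, Laplace smoothing `alpha`, random initial classes, invented toy records); the function name is hypothetical and the authors' actual procedure may differ.

```python
import math
import random

def sharp_naive_bayes_clustering(records, K, max_iter=100, alpha=1.0, seed=0):
    rng = random.Random(seed)
    n, n_terms = len(records), len(records[0])
    z = [rng.randrange(K) for _ in records]        # random initial classes
    for _ in range(max_iter):
        # Train Naive Bayes on the current hard assignment.
        size = [z.count(k) for k in range(K)]
        p_term = [[(sum(r[i] for r, c in zip(records, z) if c == k) + alpha)
                   / (size[k] + 2 * alpha)
                   for i in range(n_terms)] for k in range(K)]

        def score(rec, k):
            # log P(Z = k) + sum_i log P(T_i = t_i | Z = k)
            s = math.log((size[k] + alpha) / (n + K * alpha))
            for i, t in enumerate(rec):
                s += math.log(p_term[k][i] if t else 1.0 - p_term[k][i])
            return s

        # "Sharp" step: each document goes to its most probable class.
        new_z = [max(range(K), key=lambda k: score(rec, k)) for rec in records]
        if new_z == z:
            break
        z = new_z
    return z

# Invented toy data: rows are documents, columns are binary term indicators.
records = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
           [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
z = sharp_naive_bayes_clustering(records, K=2)
print(z)
```

Replacing sampling with an arg-max makes each pass deterministic once the initial classes are fixed, which is one plausible reason for the speed and stability the slide reports.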

  19. Next step • Naive Bayes assumes document and term independence • What if they are in fact dependent? • Our solution: • TAN (tree-augmented Naive Bayes) approach • First we create a BN of terms/documents • Then we assume there is a hidden variable • Promising results, but a deeper study is needed
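
The slides do not say how the BN of terms is built. A common choice for tree-shaped dependencies is a Chow-Liu style maximum spanning tree over pairwise mutual information, and the sketch below assumes that choice purely for illustration. The records are invented binary term indicators, and the function names are hypothetical.

```python
import math
from itertools import combinations

def mutual_information(records, a, b):
    # Empirical mutual information between binary term indicators a and b.
    n = len(records)
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = sum(1 for r in records if r[a] == va and r[b] == vb) / n
            p_a = sum(1 for r in records if r[a] == va) / n
            p_b = sum(1 for r in records if r[b] == vb) / n
            if p_ab > 0:
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(records):
    # Kruskal's algorithm on edges sorted by decreasing mutual information.
    n_terms = len(records[0])
    weighted = sorted(
        ((mutual_information(records, a, b), a, b)
         for a, b in combinations(range(n_terms), 2)),
        reverse=True,
    )
    root = list(range(n_terms))        # union-find forest

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]    # path compression
            x = root[x]
        return x

    edges = []
    for _, a, b in weighted:
        ra, rb = find(a), find(b)
        if ra != rb:                   # adding the edge creates no cycle
            root[ra] = rb
            edges.append((a, b))
    return edges

# Invented toy data: terms 0 and 1 always co-occur, terms 2 and 3 vary freely.
records = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1],
           [0, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
edges = chow_liu_edges(records)
print(edges)
```

In a TAN-like model the resulting undirected tree would then be oriented from a chosen root, and the hidden variable Z would be added as an extra parent of every term; that second stage is not shown here.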

  20. PLSA – a model with term TAN [Figure: hidden variable Z; document nodes D1, D2, ..., Dk; term nodes T1, T2, T3, T4, T5, T6]

  21. PLSA – a model with document TAN [Figure: hidden variable Z; term nodes T1, T2, ..., Ti; document nodes D1, D2, D3, D4, D5, D6]
