Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"
• The preparation of documents is done by an indexer, which turns each document into a vector-space model representation.
• The indexer also identifies frequent phrases in the document set for clustering and labelling purposes.
• Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded (a minimal sketch follows below).
• The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation.
• 'The best' map (w.r.t. some similarity measure) is used by the query processor in response to the user's query.
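A minimal sketch of the dictionary-optimization step, assuming token lists as input. The thresholds and the exact entropy criterion are our own assumptions; the slides only state that extreme-entropy and extremely frequent terms are dropped.

```python
import math
from collections import Counter

def optimize_dictionary(docs, max_df_ratio=0.5, max_entropy_ratio=0.95):
    """docs: list of token lists. Returns the retained vocabulary.

    Terms occurring in a very large fraction of documents, or spread
    almost uniformly over the documents they occur in (extreme entropy,
    hence low discriminative power), are dropped. Thresholds are
    illustrative assumptions only.
    """
    n_docs = len(docs)
    df = Counter()   # document frequency of each term
    tf = {}          # term -> list of per-document counts
    for tokens in docs:
        for term, c in Counter(tokens).items():
            df[term] += 1
            tf.setdefault(term, []).append(c)

    vocab = []
    for term, d in df.items():
        if d > max_df_ratio * n_docs:        # extremely frequent term
            continue
        total = sum(tf[term])
        probs = [c / total for c in tf[term]]
        entropy = -sum(p * math.log(p) for p in probs)
        # normalize by the maximum possible entropy, log(d)
        if d > 1 and entropy / math.log(d) > max_entropy_ratio:
            continue                          # near-uniform spread: extreme entropy
        vocab.append(term)
    return sorted(vocab)
```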
Document model in search engines
• In the so-called vector-space model, a document is considered as a vector in the space spanned by the terms it contains (illustrated in the sketch below).
• [Figure: the example documents "My dog likes this food" and "When walking, I take some food" plotted along the axes dog, food, walk.]
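A tiny illustration of that representation using the two example sentences from the slide; the crude prefix matching standing in for stemming is an assumption made only to keep the snippet short.

```python
# The two example documents become points in the 3-D space spanned by
# the terms dog / food / walk.
docs = ["My dog likes this food", "When walking, I take some food"]
vocabulary = ["dog", "food", "walk"]

def to_vector(text, vocab=vocabulary):
    tokens = text.lower().split()
    # prefix matching stands in for real stemming ("walking" -> "walk")
    return [sum(tok.startswith(term) for tok in tokens) for term in vocab]

vectors = [to_vector(d) for d in docs]
# [[1, 1, 0], [0, 1, 1]] -- each document is a vector in span{dog, food, walk}
```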
Clustering document vectors
• [Figure: documents in the r × m document space projected onto a 2-D map; the thick arrow marks a strong change of position.]
• Important difference from general clustering: not only must documents within a cluster be similar, but neighbouring clusters on the map must also be similar (a minimal sketch follows below).
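Since the map is SOM-based (see the abstract slide below), a generic self-organizing-map sketch shows where that property comes from: the neighbourhood update pulls adjacent grid cells toward the same documents, so nearby cells end up similar. Grid size and learning schedule are assumed values, not the authors' settings.

```python
import numpy as np

def train_som(doc_vectors, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """doc_vectors: array of shape (n_docs, dim). Returns the grid cell weights."""
    rng = np.random.default_rng(seed)
    n_docs, dim = doc_vectors.shape
    weights = rng.random((grid[0], grid[1], dim))
    coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                  indexing="ij"), axis=-1)
    for t in range(iters):
        x = doc_vectors[rng.integers(n_docs)]
        # best-matching unit: the cell whose weight vector is closest to x
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), grid)
        lr = lr0 * (1.0 - t / iters)
        sigma = sigma0 * (1.0 - t / iters) + 1e-3
        # neighbouring cells are pulled toward x as well, which is what
        # makes adjacent clusters on the map similar to each other
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights
```

Mapping each document to its best-matching cell then gives its 2-D placement on the map.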
Our problem
• Instability
• Pre-defined major themes needed
Our approach
• Find a coarse clustering into a few themes
Bayesian Networks in Document Clustering
• SOM document-map based search engines require initial document clustering in order to present results in a meaningful way.
• Latent Semantic Indexing based methods appear promising for this purpose.
• One of them, PLSA, has been investigated empirically.
• A modification of the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.
A Bayesian Network
• [Figure: example network over the variables owner, walk, chappi, dog, food, meat.]
• Represents a joint probability distribution as a product of the conditional probabilities of children given their parents in a directed acyclic graph (sketched below).
• High compression, simplification of reasoning.
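A minimal sketch of that factorization over an assumed DAG loosely based on the node names in the figure; the edges and all probability values are invented for illustration. The full joint over six binary variables would need 63 free parameters, while the network below stores only 11 conditional probabilities.

```python
# One conditional distribution per node; the joint is their product.
parents = {"owner": [], "walk": ["owner"], "dog": ["owner"],
           "food": ["dog"], "meat": ["food"], "chappi": ["food"]}

# P(node = True | parent values); keys are tuples of parent values (invented numbers)
cpt = {
    "owner":  {(): 0.7},
    "walk":   {(True,): 0.8, (False,): 0.1},
    "dog":    {(True,): 0.6, (False,): 0.1},
    "food":   {(True,): 0.9, (False,): 0.3},
    "meat":   {(True,): 0.5, (False,): 0.2},
    "chappi": {(True,): 0.7, (False,): 0.05},
}

def joint(assignment):
    """P(assignment) = product over nodes of P(node | its parents)."""
    p = 1.0
    for node, pars in parents.items():
        p_true = cpt[node][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

print(joint({"owner": True, "walk": True, "dog": True,
             "food": True, "meat": False, "chappi": True}))
```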
BN applications in text processing
• Document classification
• Document clustering
• Query expansion
Hidden variable approaches
• PLSA (Probabilistic Latent Semantic Analysis)
• PHITS (Probabilistic Hyperlink Analysis)
• Combined PLSA/PHITS
• Assumption of a hidden variable expressing the topic of the document.
• The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).
PLSA – concept
Let N be the term-document matrix of word counts, i.e., N_ij denotes how often a term t_i (single word or phrase) occurs in document d_j. PLSA assumes a probabilistic decomposition into factors z_k (1 ≤ k ≤ K):
P(t_i | d_j) = Σ_k P(t_i | z_k) P(z_k | d_j),
with non-negative probabilities and two sets of normalization constraints: Σ_i P(t_i | z_k) = 1 for all k, and Σ_k P(z_k | d_j) = 1 for all j.
[Figure: hidden variable Z connected to document D and terms T1 ... Tn.]
PLSA – concept (cont.)
PLSA aims at maximizing
L := Σ_{i,j} N_ij log Σ_k P(t_i | z_k) P(z_k | d_j).
• Factors z_k can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence).
• The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L (a minimal sketch of the updates follows below).
• Different factors usually capture distinct "topics" of a document collection.
• By clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
[Figure: hidden variable Z connected to document D and terms T1 ... Tn.]
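A compact numpy sketch of the standard PLSA EM updates for this objective; this is our own illustration of the textbook algorithm, not the authors' implementation, and the stopping rule (fixed iteration count) is an assumption.

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    """N: term-document count matrix (terms x documents); K: number of factors.
    Returns P(t|z) of shape (terms, K) and P(z|d) of shape (K, documents)."""
    rng = np.random.default_rng(seed)
    n_t, n_d = N.shape
    p_t_z = rng.random((n_t, K)); p_t_z /= p_t_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((K, n_d)); p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(iters):
        # E-step: P(z | t, d) proportional to P(t|z) P(z|d), for every (t, z, d)
        post = p_t_z[:, :, None] * p_z_d[None, :, :]          # shape (t, z, d)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both conditional tables from expected counts
        weighted = N[:, None, :] * post                        # shape (t, z, d)
        p_t_z = weighted.sum(axis=2)
        p_t_z /= p_t_z.sum(axis=0, keepdims=True) + 1e-12      # sum_i P(t_i|z_k) = 1
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12      # sum_k P(z_k|d_j) = 1
    return p_t_z, p_z_d
```

`p_z_d.argmax(axis=0)` then gives each document's dominant factor, i.e. the topic-based clustering mentioned above.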
EM algorithm – step 0: Z randomly initialized

Data before initialization:
D  Z  T1 T2 ... Tn
1  ?  1  0  ... 1
2  ?  0  0  ... 1
3  ?  1  1  ... 1
4  ?  0  1  ... 1
5  ?  1  0  ... 0
...

Data after random initialization of Z:
D  Z  T1 T2 ... Tn
1  1  1  0  ... 1
2  2  0  0  ... 1
3  1  1  1  ... 1
4  1  0  1  ... 1
5  2  1  0  ... 0
...
EM algorithm – step 1: BN trained on the completed data
[Figure: network with hidden variable Z, document variable D and terms T1 ... Tn.]

Data:
D  Z  T1 T2 ... Tn
1  1  1  0  ... 1
2  2  0  0  ... 1
3  1  1  1  ... 1
4  1  0  1  ... 1
5  2  1  0  ... 0
...
EM algorithm – step 2: Z re-sampled
• For each record, Z is sampled from the BN according to the probability distribution
P(Z=1 | D=d, T1=t1, ..., Tn=tn), P(Z=2 | D=d, T1=t1, ..., Tn=tn), ...
• GOTO step 1 until convergence (the Z assignment is "stable"); the full loop is sketched below.

Data after re-sampling:
D  Z  T1 T2 ... Tn
1  2  1  0  ... 1
2  2  0  0  ... 1
3  1  1  1  ... 1
4  2  0  1  ... 1
5  1  1  0  ... 0
...
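Read together, the three steps above amount to the following loop: random Z, fit a naive-Bayes-style network over the binary term variables, re-draw Z, repeat until stable. This is a sketch under our own assumptions about smoothing and the stopping test; `hard=True` corresponds to the "sharp version" discussed on the next slides.

```python
import numpy as np

def cluster_em(X, K, iters=50, hard=False, seed=0):
    """X: binary document-term matrix (documents x terms); K: number of classes."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    z = rng.integers(K, size=n_docs)                         # step 0: random Z
    for _ in range(iters):
        # step 1: train the network, i.e. estimate P(Z) and P(T_i = 1 | Z)
        prior = np.array([(z == k).mean() for k in range(K)]) + 1e-9
        p_t = np.array([(X[z == k].sum(axis=0) + 1) / ((z == k).sum() + 2)
                        for k in range(K)])                   # Laplace smoothing (assumed)
        # step 2: P(Z = k | t1..tn) proportional to P(k) * prod_i P(t_i | k)
        log_post = (np.log(prior)[None, :]
                    + X @ np.log(p_t).T
                    + (1 - X) @ np.log(1 - p_t).T)
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        new_z = (post.argmax(axis=1) if hard                  # "sharp" assignment
                 else np.array([rng.choice(K, p=p) for p in post]))
        if np.array_equal(new_z, z):                          # assignment stable
            break
        z = new_z
    return z
```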
The problem
• Too many adjustable parameters
• Pre-defined clusters not identified
• Long computation times
• Instability
Solution
• Our suggestion: use the "sharp version" of Naive Bayes – each document is assigned to the most probable class (the hard-assignment variant in the sketch above).
• We were successful:
• Up to five classes well clustered
• High speed (with 20,000 documents)
Next step
• Naive Bayes assumes document and term independence.
• What if they are in fact dependent?
• Our solution: the TAN approach
• First we create a BN of terms/documents (see the Chow-Liu-style sketch below)
• Then we assume there is a hidden variable
• Promising results; a deeper study is needed
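The slides do not spell out how the tree over terms is built; a common choice, which we assume here purely for illustration, is a Chow-Liu-style maximum-spanning tree on pairwise mutual information. Each term then receives the hidden variable Z plus its tree parent as parents, as in the TAN figures on the next two slides.

```python
import numpy as np

def chow_liu_tree(X):
    """X: binary document-term matrix (documents x terms).
    Returns (parent, child) edges of a maximum-MI spanning tree over the terms."""
    n_docs, n_terms = X.shape
    mi = np.zeros((n_terms, n_terms))
    for a in range(n_terms):
        for b in range(a + 1, n_terms):
            m = 0.0
            for va in (0, 1):
                for vb in (0, 1):
                    p_ab = ((X[:, a] == va) & (X[:, b] == vb)).mean() + 1e-12
                    p_a = (X[:, a] == va).mean() + 1e-12
                    p_b = (X[:, b] == vb).mean() + 1e-12
                    m += p_ab * np.log(p_ab / (p_a * p_b))
            mi[a, b] = mi[b, a] = m
    # Prim's algorithm for a maximum-weight spanning tree over the MI graph
    in_tree, edges = {0}, []
    while len(in_tree) < n_terms:
        best = max(((i, j) for i in in_tree for j in range(n_terms)
                    if j not in in_tree), key=lambda e: mi[e])
        edges.append(best)          # best[0] becomes the tree parent of best[1]
        in_tree.add(best[1])
    return edges
```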
PLSA – a model with term TAN
[Figure: hidden variable Z connected to documents D1, D2, ..., Dk and to a tree-structured network over terms T1 ... T6.]
PLSA – a model with document TAN
[Figure: hidden variable Z connected to terms T1, T2, ..., Ti and to a tree-structured network over documents D1 ... D6.]