Explore the application of Bayesian networks in document clustering for efficient information retrieval. Learn about latent variables, PLSA, PHITS, and the EM algorithm for document analysis.
Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"
The preparation of documents is done by an indexer, which turns each document into a vector-space model representation • The indexer also identifies frequent phrases in the document set for clustering and labelling purposes • Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded • The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation • The best map (with respect to some similarity measure) is used by the query processor in response to the user's query
Document model in search engines • In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains.
[Figure: the example documents "My dog likes this food" and "When walking, I take some food" shown against the axes dog, food, and walk]
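The vector model can be illustrated with a minimal sketch using the two example sentences from the slide. The tokenizer and the function names `tokenize` and `to_vectors` are assumptions made for illustration; the slide's figure uses the stem "walk", whereas this sketch does no stemming and keeps "walking" as-is.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def to_vectors(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[t] for t in vocab] for c in counts]
    return vocab, vectors


docs = ["My dog likes this food", "When walking, I take some food"]
vocab, vectors = to_vectors(docs)

# Each document becomes a point in the space spanned by the vocabulary.
for doc, vec in zip(docs, vectors):
    print(doc, "->", dict(zip(vocab, vec)))
```

Both documents share the "food" coordinate, while only the first has a non-zero "dog" coordinate.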
Clustering document vectors • Important difference from general clustering: not only should each cluster contain similar documents, but neighboring clusters should also be similar.
[Figure: document space mapped onto a 2D map, with labels r, x, m; a thick arrow marks a strong change of position]
Our problem • Instability • Pre-defined major themes needed • Our approach • Find a coarse clustering into a few themes
Bayesian Networks in Document Clustering • SOM document-map based search engines require initial document clustering in order to present results in a meaningful way. • Methods based on Latent Semantic Indexing appear to be promising for this purpose. • One of them, PLSA, has been empirically investigated. • A modification of the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.
A Bayesian Network • Represents a joint probability distribution as a product of conditional probabilities of children given their parents in a directed acyclic graph • High compression, simplification of reasoning
[Figure: example network over the nodes owner, walk, Chappi®, dog, food, meat]
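The factorization and the "high compression" claim can be written out as follows. The symbols $X_i$, $\mathrm{pa}(X_i)$, $n$ and $m$ are notation introduced here for illustration, not taken from the slides.

```latex
% Joint distribution of a Bayesian network over variables X_1, ..., X_n,
% where pa(X_i) denotes the set of parents of X_i in the directed acyclic graph:
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)

% Parameter count for binary variables:
%   full joint table:                        2^n - 1 free parameters
%   network with at most m parents per node: at most n \cdot 2^m free parameters
```

For example, with $n = 6$ binary variables (as many as in the figure) a full joint table needs 63 free parameters, while a network in which no node has more than two parents needs at most 24.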
BN application in text processing • Document classification • Document Clustering • Query Expansion
Hidden variable approaches • PLSA (Probabilistic Latent Semantic Analysis) • PHITS (Probabilistic Hyperlink Analysis) • Combined PLSA/PHITS • Assumption of a hidden variable expressing the topic of the document. • The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA)
PLSA - concept • Let $N$ be the term-document matrix of word counts, i.e., $N_{ij}$ denotes how often a term (single word or phrase) $t_i$ occurs in document $d_j$. • Probabilistic decomposition into factors $z_k$ ($1 \le k \le K$): $P(t_i \mid d_j) = \sum_k P(t_i \mid z_k)\,P(z_k \mid d_j)$, with non-negative probabilities and two sets of normalization constraints: $\sum_i P(t_i \mid z_k) = 1$ for all $k$ and $\sum_k P(z_k \mid d_j) = 1$ for all $j$.
[Figure: network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn]
PLSA - concept • PLSA aims at maximizing $L := \sum_{i,j} N_{ij} \log \sum_k P(t_i \mid z_k)\,P(z_k \mid d_j)$. • Factors $z_k$ can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence). • The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of $L$. • Different factors usually capture distinct "topics" of a document collection. • By clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
[Figure: network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn]
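The maximization of L by EM can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it uses the standard PLSA updates (E-step: P(z_k | t_i, d_j) proportional to P(t_i|z_k) P(z_k|d_j); M-step: re-estimate both factors from the count-weighted posteriors). The function names, the toy matrix `N`, the random initialization and the tiny smoothing constant `1e-12` are all assumptions added here.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def log_likelihood(N, p_t_z, p_z_d):
    """L = sum_ij N_ij * log sum_k P(t_i|z_k) P(z_k|d_j)."""
    K = len(p_t_z)
    L = 0.0
    for i, row in enumerate(N):
        for j, n_ij in enumerate(row):
            if n_ij:
                mix = sum(p_t_z[k][i] * p_z_d[j][k] for k in range(K))
                L += n_ij * math.log(mix)
    return L


def plsa(N, K, iterations=50, seed=0):
    """EM for PLSA on a terms x documents count matrix N (list of lists)."""
    rng = random.Random(seed)
    n_terms, n_docs = len(N), len(N[0])
    # p_t_z[k][i] = P(t_i | z_k); p_z_d[j][k] = P(z_k | d_j)
    p_t_z = [normalize([rng.random() + 0.1 for _ in range(n_terms)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(n_docs)]
    history = []
    for _ in range(iterations):
        # Accumulators start at a tiny constant (assumption) to avoid zero rows.
        new_t_z = [[1e-12] * n_terms for _ in range(K)]
        new_z_d = [[1e-12] * K for _ in range(n_docs)]
        for i in range(n_terms):
            for j in range(n_docs):
                if N[i][j] == 0:
                    continue
                # E-step: posterior over factors for the pair (t_i, d_j)
                post = normalize([p_t_z[k][i] * p_z_d[j][k] for k in range(K)])
                # M-step accumulation: expected counts weighted by N_ij
                for k in range(K):
                    new_t_z[k][i] += N[i][j] * post[k]
                    new_z_d[j][k] += N[i][j] * post[k]
        p_t_z = [normalize(row) for row in new_t_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_likelihood(N, p_t_z, p_z_d))
    return p_t_z, p_z_d, history


# Toy matrix (hypothetical data): rows are terms, columns are documents.
N = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 3, 2],
     [0, 0, 2, 3]]
p_t_z, p_z_d, history = plsa(N, K=2, iterations=30)
dominant = [max(range(2), key=row.__getitem__) for row in p_z_d]
print("dominant factor per document:", dominant)
```

Clustering by dominant factor, as the slide describes, corresponds to taking the arg max over k of P(z_k | d_j) for each document. Because EM only finds a local maximum of L, the result depends on the random initialization.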
EM algorithm – step 0: Z randomly initialized

Data before initialization:
D   Z   T1  T2  ...  Tn
1   ?   1   0   ...  1
2   ?   0   0   ...  1
3   ?   1   1   ...  1
4   ?   0   1   ...  1
5   ?   1   0   ...  0
...

Data after random initialization of Z:
D   Z   T1  T2  ...  Tn
1   1   1   0   ...  1
2   2   0   0   ...  1
3   1   1   1   ...  1
4   1   0   1   ...  1
5   2   1   0   ...  0
...
EM algorithm – step 1: BN trained

Data:
D   Z   T1  T2  ...  Tn
1   1   1   0   ...  1
2   2   0   0   ...  1
3   1   1   1   ...  1
4   1   0   1   ...  1
5   2   1   0   ...  0
...
[Figure: network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn]
EM algorithm – step 2: Z sampled from BN • Z is sampled for each record according to the probability distribution P(Z=1|D=d,T1=t1,...,Tn=tn), P(Z=2|D=d,T1=t1,...,Tn=tn), ... • Go to step 1 until convergence (Z assignment "stable")

Data:
D   Z   T1  T2  ...  Tn
1   2   1   0   ...  1
2   2   0   0   ...  1
3   1   1   1   ...  1
4   2   0   1   ...  1
5   1   1   0   ...  0
...
[Figure: network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn]
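Steps 0 to 2 can be sketched as a loop. This is a simplified illustration under stated assumptions, not the authors' code: the network is reduced to a naive Bayes structure in which Z is the only parent of each binary term indicator, the document node D is left out, classes are numbered from 0 rather than 1, and Laplace smoothing is added so that no probability is exactly 0 or 1. The function names are hypothetical.

```python
import random


def train(records, z, K):
    """Step 1: estimate P(Z) and P(T_i = 1 | Z) from the current assignment."""
    n_terms = len(records[0])
    prior = [(z.count(k) + 1) / (len(z) + K) for k in range(K)]
    p_term = []
    for k in range(K):
        members = [r for r, zk in zip(records, z) if zk == k]
        p_term.append([(sum(r[i] for r in members) + 1) / (len(members) + 2)
                       for i in range(n_terms)])
    return prior, p_term


def posterior(record, prior, p_term):
    """P(Z = k | T_1..T_n) under the naive Bayes factorization.

    Plain products are fine for a handful of terms; a real vocabulary
    would need log-probabilities to avoid underflow.
    """
    scores = []
    for pk, probs in zip(prior, p_term):
        s = pk
        for t, p in zip(record, probs):
            s *= p if t else 1.0 - p
        scores.append(s)
    total = sum(scores)
    return [s / total for s in scores]


def sampled_em(records, K, max_rounds=100, seed=0):
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in records]            # step 0: random Z
    for _ in range(max_rounds):
        prior, p_term = train(records, z, K)            # step 1: train
        new_z = [rng.choices(range(K), weights=posterior(r, prior, p_term))[0]
                 for r in records]                      # step 2: sample Z
        if new_z == z:                                  # assignment "stable"
            break
        z = new_z
    return z


# The five records shown on the slides, restricted to the visible columns T1, T2, Tn.
records = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 0]]
z = sampled_em(records, K=2)
print(z)
```

Because Z is sampled rather than chosen deterministically, the assignment may keep fluctuating; the `max_rounds` cap is a practical safeguard added in this sketch.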
The problem • Too many adjustable variables • Pre-defined clusters not identified • Long computation times • Instability
Solution • Our suggestion • Use the "sharp version" of "Naive Bayes": each document is assigned to its "most probable class" • We were successful • Up to five classes well clustered • High speed (with 20,000 documents)
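The "sharp" assignment can be sketched as follows: instead of sampling the class of a document, the most probable class is taken. This is an illustrative sketch under assumptions (binary term indicators, Laplace smoothing, log-probabilities, classes numbered from 0, a hypothetical function name `hard_cluster`), not the authors' implementation.

```python
import math
import random


def hard_cluster(records, K, max_rounds=100, seed=0):
    """Alternate naive Bayes training and arg-max class assignment."""
    rng = random.Random(seed)
    n_terms = len(records[0])
    z = [rng.randrange(K) for _ in records]  # random initial classes
    for _ in range(max_rounds):
        # Train: smoothed log P(Z=k), log P(T_i=1|Z=k), log P(T_i=0|Z=k)
        log_prior, log_on, log_off = [], [], []
        for k in range(K):
            members = [r for r, zk in zip(records, z) if zk == k]
            log_prior.append(math.log((len(members) + 1) / (len(records) + K)))
            p = [(sum(r[i] for r in members) + 1) / (len(members) + 2)
                 for i in range(n_terms)]
            log_on.append([math.log(x) for x in p])
            log_off.append([math.log(1.0 - x) for x in p])
        # Assign: each document goes to its most probable class
        new_z = []
        for r in records:
            scores = [log_prior[k] + sum(log_on[k][i] if t else log_off[k][i]
                                         for i, t in enumerate(r))
                      for k in range(K)]
            new_z.append(max(range(K), key=scores.__getitem__))
        if new_z == z:  # stable assignment reached
            break
        z = new_z
    return z


# Hypothetical toy data: two groups of documents with disjoint terms.
records = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
labels = hard_cluster(records, K=2)
print(labels)
```

The deterministic assignment removes the sampling noise of the previous scheme, which is one plausible reading of why the slides report better stability and speed; like any such procedure it can still end in a poor fixed point (for example, an empty class) depending on the initialization.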
Next step • Naive Bayes assumes document and term independence • What if they are in fact dependent? • Our solution: • TAN approach • First we create a BN of terms/documents • Then assume there is a hidden variable • Promising results, need a deeper study
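The difference between the two assumptions can be written out for the term case. The formulas below use the standard tree-augmented naive Bayes (TAN) form; since the slides only say "TAN-like", the exact structure used by the authors is an assumption here, and $\pi(i)$ is notation introduced for illustration.

```latex
% Naive Bayes: terms are independent given the hidden variable Z
P(Z, T_1, \dots, T_n) \;=\; P(Z) \prod_{i=1}^{n} P(T_i \mid Z)

% TAN: each term T_i may additionally depend on one other term T_{\pi(i)},
% where the extra edges form a tree over the terms (the root has no term parent)
P(Z, T_1, \dots, T_n) \;=\; P(Z) \prod_{i=1}^{n} P\bigl(T_i \mid Z,\, T_{\pi(i)}\bigr)
```

The document variant is analogous, with the tree built over the document nodes instead of the term nodes.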
PLSA – a model with term TAN
[Figure: hidden variable Z with document nodes D1, D2, ..., Dk and term nodes T1 to T6]
PLSA – a model with document TAN
[Figure: hidden variable Z with document nodes D1 to D6 and term nodes T1, T2, ..., Ti]