Text Classification using Latent Dirichlet Allocation: an intro graphical model Lei Li leili@cs
Outline • Introduction • Unigram model and mixture • Text classification using LDA • Experiments • Conclusion
Text Classification What class can you tell, given a doc? • "… the New York Stock Exchange … America's Nasdaq … buy …" → finance • "… bank debt loan interest billion buy …" → finance • "… Iraq war weapon army AK-47 bomb …" → military
Why should db guys care? • LDA could be adapted to model other discrete random variables • disk failures • user access patterns • social networks, tags • blogs
Document • "bag of words": word order is discarded • d = (w1, w2, …, wN) • each wi takes one value in 1…V (1-of-V scheme) • V: vocabulary size
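The bag-of-words representation above can be sketched in a few lines of Python. The vocabulary and document here are hypothetical, chosen only to illustrate the 1-of-V mapping:

```python
from collections import Counter

# A hypothetical vocabulary; V is its size.
vocab = ["bank", "debt", "loan", "interest", "war", "army", "weapon", "the"]
word_to_id = {w: i for i, w in enumerate(vocab)}  # 1-of-V: each word maps to one id

def bag_of_words(doc_tokens):
    """Represent a document as word counts; token order is discarded."""
    counts = Counter(t for t in doc_tokens if t in word_to_id)
    return {word_to_id[w]: c for w, c in counts.items()}

d = bag_of_words(["the", "bank", "raised", "the", "interest", "rate"])
# "raised" and "rate" are out of vocabulary and dropped; "the" appears twice.
```

Note that any two documents with the same word counts get identical representations, which is exactly the exchangeability assumption the unigram and LDA models rely on.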
Modeling a Document • Unigram: a single multinomial distribution • Mixture of unigrams • LDA • Others: PLSA, bigram models
Unigram Model for Classification • Y is the class label, • d = {w1, w2, …, wN} • Use Bayes rule: P(Y|d) ∝ P(Y) P(d|Y) • How to model the document given the class? • P(d|Y) ~ multinomial distribution, estimated from word frequencies • (plate diagram: class Y, plate over N words w)
Unigram: example • d = bank × 100, debt × 110, interest × 130, war × 1, army × 0, weapon × 0 • P(finance|d) = ? P(military|d) = ?
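A minimal sketch of the unigram classifier applied to the example above. The per-class word probabilities and uniform priors are made-up illustration values, not estimates from real data; the Bayes-rule score is computed in log space to avoid underflow:

```python
import math

# Hypothetical per-class multinomials (would be estimated as word frequencies
# from training data); values here are invented for illustration.
p_word = {
    "finance":  {"bank": 0.30, "debt": 0.25, "interest": 0.25,
                 "war": 0.05, "army": 0.05, "weapon": 0.10},
    "military": {"bank": 0.05, "debt": 0.05, "interest": 0.05,
                 "war": 0.30, "army": 0.25, "weapon": 0.30},
}
prior = {"finance": 0.5, "military": 0.5}

# The document from the slide, as word counts.
d = {"bank": 100, "debt": 110, "interest": 130, "war": 1, "army": 0, "weapon": 0}

def log_posterior(cls):
    # log P(Y) + sum_w count(w) * log P(w|Y); the constant P(d) is dropped.
    return math.log(prior[cls]) + sum(n * math.log(p_word[cls][w])
                                      for w, n in d.items() if n > 0)

scores = {c: log_posterior(c) for c in prior}
best = max(scores, key=scores.get)
```

With these illustrative parameters the finance topic's word probabilities dominate, so `best` comes out as `"finance"`, matching the intuition behind the slide's question.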
Mixture of unigrams for classification • For each class, assume k topics • Each topic represents a multinomial distribution • Given the topic, each word is drawn from that topic's multinomial • (plate diagram: class Y, latent topic z, plate over N words w)
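The generative story of the mixture of unigrams can be sketched as: draw one topic z for the whole document, then draw every word from that topic's multinomial. The two topics and their weights below are hypothetical:

```python
import random

random.seed(0)

# Hypothetical: k = 2 topics within one class, each a multinomial over words.
topic_weights = [0.6, 0.4]                    # P(z) within the class
topics = [
    {"bank": 0.5, "debt": 0.3, "loan": 0.2},  # topic 0
    {"war": 0.5, "army": 0.3, "weapon": 0.2}, # topic 1
]

def sample_doc(n_words):
    """Mixture of unigrams: ONE topic z per document, then all words from it."""
    z = random.choices(range(len(topics)), weights=topic_weights)[0]
    words = list(topics[z])
    probs = [topics[z][w] for w in words]
    return [random.choices(words, weights=probs)[0] for _ in range(n_words)]

doc = sample_doc(5)
```

Because z is drawn once per document, every generated document is "pure": all its words come from a single topic. LDA relaxes exactly this restriction.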
Bayesian Network • A directed acyclic graph (DAG) • Nodes are random variables or parameters • Arrows are conditional probability dependencies • Given probabilities or observed values on some nodes, there are algorithms to infer the values of the other nodes
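The inference idea can be shown on the smallest possible network, a single arrow Y → w: observe w, infer the posterior over Y by enumeration. The conditional probability tables here are hypothetical:

```python
# Tiny two-node network Y -> w; both CPTs are made-up illustration values.
p_y = {"finance": 0.7, "military": 0.3}
p_w_given_y = {
    "finance":  {"bank": 0.8, "war": 0.2},
    "military": {"bank": 0.1, "war": 0.9},
}

def posterior_y(observed_w):
    """Infer P(Y | w) by enumeration: joint P(Y, w), normalized over Y."""
    joint = {y: p_y[y] * p_w_given_y[y][observed_w] for y in p_y}
    z = sum(joint.values())
    return {y: p / z for y, p in joint.items()}

post = posterior_y("war")  # observing "war" shifts belief toward "military"
```

Enumeration is exponential in general, which is why larger networks need the specialized algorithms the slide alludes to (variable elimination, belief propagation, or approximations).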
Latent Dirichlet Allocation • Model θ as a Dirichlet distribution with parameter α • For the n-th term wn: • model the n-th latent variable zn as a multinomial distribution according to θ • model wn as a multinomial distribution according to zn and β
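The LDA generative process above can be sketched directly: draw θ ~ Dirichlet(α) once per document (here via the standard normalized-Gamma construction, since the stdlib has no Dirichlet sampler), then for each word draw zn ~ Mult(θ) and wn ~ Mult(β_zn). The two topics and α values are hypothetical:

```python
import random

random.seed(1)

def sample_dirichlet(alpha):
    """Dirichlet sample via normalized Gamma draws (standard construction)."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

# Hypothetical model: 2 topics, each a multinomial beta over a small vocabulary.
alpha = [0.5, 0.5]
beta = [
    {"bank": 0.6, "debt": 0.4},   # topic 0
    {"war": 0.7, "army": 0.3},    # topic 1
]

def generate_doc(n_words):
    theta = sample_dirichlet(alpha)  # per-document topic mixture
    doc = []
    for _ in range(n_words):
        z = random.choices(range(len(beta)), weights=theta)[0]      # zn ~ Mult(theta)
        words = list(beta[z])
        probs = [beta[z][w] for w in words]
        doc.append(random.choices(words, weights=probs)[0])         # wn ~ Mult(beta_z)
    return doc

doc = generate_doc(6)
```

Unlike the mixture of unigrams, zn is redrawn per word, so a single document can mix words from several topics according to its own θ.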
Variational Inference for LDA • Direct inference in LDA is HARD (intractable exactly) • Approximate with a variational distribution • use a factorized distribution over variational parameters γ and φ to approximate the posterior distribution of the latent variables θ and z
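A minimal sketch of the mean-field updates for one document, following the coordinate-ascent equations of Blei et al.: φ_{n,i} ∝ β_{i,wn}·exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_{n,i}. The model parameters and document are hypothetical, and the digamma function Ψ is approximated numerically from `math.lgamma` since the stdlib lacks it:

```python
import math

def digamma(x, h=1e-5):
    # Central-difference derivative of log Gamma; a stdlib stand-in
    # for scipy.special.digamma, adequate for this sketch.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

# Hypothetical model: 2 topics over a 3-word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1],   # topic 0: P(word | z = 0)
        [0.1, 0.2, 0.7]]   # topic 1
doc = [0, 0, 2]            # word ids of one document

k = len(alpha)
gamma = [a + len(doc) / k for a in alpha]   # standard initialization
for _ in range(50):                         # iterate updates to convergence
    phi = []
    for w in doc:
        # phi_{n,i} proportional to beta_{i,w} * exp(digamma(gamma_i))
        row = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(k)]
        s = sum(row)
        phi.append([p / s for p in row])
    # gamma_i = alpha_i + sum_n phi_{n,i}
    gamma = [alpha[i] + sum(phi[n][i] for n in range(len(doc))) for i in range(k)]
```

After convergence γ summarizes the document's topic mixture; for this document, with two occurrences of word 0, the mass concentrates on topic 0. Note γ always sums to Σα + N, since each φ row is normalized.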
Experiment • Data set: Reuters-21578, 8681 training documents, 2966 test documents • Classification task: "EARN" vs. "Non-EARN" • For each document, learn LDA features and classify with them (discriminatively)
Result • most frequent words in each topic
Take-Away Message • LDA with few topics and little training data can produce relatively better results • Bayesian networks are useful for modeling multiple random variables, with good inference algorithms • Potential uses of LDA: • disk failures • database access patterns • user preferences (collaborative filtering) • social networks (tags)
Reference • Blei, D., Ng, A., Jordan, M.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003)