Topic modeling
Mark Steyvers
Department of Cognitive Sciences, University of California, Irvine
Some topics we can discuss
• Introduction to LDA: basic topic model
• Preliminary work on therapy transcripts
• Extensions to LDA
  • Conditional topic models (for predicting behavioral codes)
  • Various topic models for word order
  • Topic models incorporating parse trees
  • Topic models for dialogue
  • Topic models incorporating speech information
Automatic and unsupervised extraction of semantic themes from large text collections. Example collections:
• Pennsylvania Gazette (1728-1800): 80,000 articles
• Enron: 250,000 emails
• NYT: 330,000 articles
• NSF/NIH: 100,000 grants
• AOL queries: 20,000,000 queries from 650,000 users
• Medline: 16 million articles
Model Input
• Matrix of counts: number of times words (rows) occur in documents (columns)

            Doc1   Doc2   Doc3   …
  PIZZA       34      0      3   …
  PASTA       12      0      2   …
  ITALIAN      0     19      6   …
  FOOD         0     16      1   …
  …            …      …      …   …

• Note:
  • word order is lost: "bag of words" approach
  • some function words are deleted: "the", "a", "in"
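A minimal sketch of building such a word-document count matrix, assuming scikit-learn is available; the toy documents and the particular stop-word list are illustrative, not part of the original example.

```python
# Hedged sketch: build a bag-of-words count matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pizza pasta italian food pizza",
        "italian food the menu",
        "pizza pasta italian food in the oven"]

# stop_words="english" drops function words such as "the" and "in";
# word order is discarded, leaving only counts (bag of words).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # documents x words (sparse counts)
print(vectorizer.get_feature_names_out())
print(X.toarray().T)                        # transposed to words x documents, as in the table above
```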
Basic Assumptions
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word in a document originates from a single topic
Document = mixture of topics
Example topics (top words):
• auto, car, parts, cars, used, ford, honda, truck, toyota
• party, store, wedding, birthday, jewelry, ideas, cards, cake, gifts
• webmd, cymbalta, xanax, gout, vicodin, effexor, prednisone, lexapro, ambien
• hannah, montana, zac, efron, disney, high, school, musical, miley, cyrus, hilary, duff
One document might mix two topics (e.g., 80% of one and 20% of another); another document might use a single topic (100%).
Generative Process
• For each document d, choose a mixture of topics: θ(d) ~ Dirichlet(α)
• For each word in the document:
  • sample a topic z ∈ {1..T} from the mixture: z ~ Multinomial(θ(d))
  • sample a word from that topic: w ~ Multinomial(φ(z)), where each topic φ ~ Dirichlet(β)
(Plate notation: Nd word tokens per document, D documents, T topics.)
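A minimal sketch of this generative process, assuming illustrative settings (vocabulary size, numbers of documents and topics, and hyperparameter values are all made up for the example):

```python
# Hedged sketch: sample documents from the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
V, T, D, Nd = 8, 2, 4, 20          # vocab size, topics, documents, words per document
alpha, beta = 0.5, 0.1

phi = rng.dirichlet(np.full(V, beta), size=T)       # each topic: a distribution over words
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))        # this document's mixture of topics
    z = rng.choice(T, p=theta, size=Nd)             # a topic for each word token
    words = np.array([rng.choice(V, p=phi[t]) for t in z])
    docs.append(words)
print(docs[0])                                      # word ids of the first document
```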
Prior Distributions
• Dirichlet priors encourage sparsity on topic mixtures and topics:
  • θ ~ Dirichlet(α) over topics (Topic 1, Topic 2, Topic 3 simplex)
  • φ ~ Dirichlet(β) over words (Word 1, Word 2, Word 3 simplex)
[Figure: samples on the probability simplex; darker colors indicate lower probability]
Statistical Inference
• Three sets of latent variables:
  • document-topic distributions θ
  • topic-word distributions φ
  • topic assignments z
• Estimate the posterior distribution over topic assignments P( z | w )
  • we "collapse" over (integrate out) the topic mixtures and word mixtures
  • we can later infer θ and φ from the topic assignments
• Use approximate methods: Markov chain Monte Carlo (MCMC) with Gibbs sampling
Toy Example: Artificial Dataset
• Two topics, 16 documents
• Can we recover the original topics and topic mixtures from this data?
Initialization: assign word tokens randomly to topics: (●=topic 1; ○=topic 2 )
Gibbs Sampling
Probability that word token i is assigned to topic t:

P(z_i = t | z_{-i}, w) ∝ (C^{WT}_{w_i,t} + β) / (Σ_w C^{WT}_{w,t} + Wβ) × (C^{DT}_{d_i,t} + α) / (Σ_{t'} C^{DT}_{d_i,t'} + Tα)

where C^{WT}_{w,t} is the count of word w assigned to topic t, C^{DT}_{d,t} is the count of topic t assigned to document d (both excluding the current token i), and W is the vocabulary size.
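A hedged sketch of this sampling equation: given the two count matrices (with the current token already decremented), compute the probability of each topic for word token i. The counts and hyperparameter values are illustrative.

```python
# Hedged sketch: the collapsed Gibbs sampling probabilities for one word token.
import numpy as np

def topic_probs(w, d, C_wt, C_dt, alpha, beta):
    """P(z_i = t | z_-i, w) for t = 1..T, normalized over topics."""
    V = C_wt.shape[0]
    word_term = (C_wt[w, :] + beta) / (C_wt.sum(axis=0) + V * beta)                 # how much each topic likes word w
    doc_term = (C_dt[d, :] + alpha) / (C_dt[d, :].sum() + C_dt.shape[1] * alpha)    # how much document d uses each topic
    p = word_term * doc_term
    return p / p.sum()

# toy counts: 5 words x 2 topics, 3 documents x 2 topics
C_wt = np.array([[3, 0], [2, 1], [0, 4], [1, 1], [0, 2]], dtype=float)
C_dt = np.array([[4, 1], [1, 3], [1, 4]], dtype=float)
print(topic_probs(w=1, d=0, C_wt=C_wt, C_dt=C_dt, alpha=0.5, beta=0.01))
```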
After 1 iteration • Apply sampling equation to each word token: (●=topic 1; ○=topic 2 )
After 4 iterations (●=topic 1; ○=topic 2 )
After 8 iterations (●=topic 1; ○=topic 2 )
After 32 iterations (●=topic 1; ○=topic 2 )
Summary of Algorithm
INPUT: word-document counts (word order is irrelevant)
OUTPUT:
• topic assignments to each word: P( zi )
• likely words in each topic: P( w | z )
• likely topics in each document ("gist"): P( z | d )
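A hedged sketch of obtaining these outputs with the gensim library (assumed available). Note that gensim's LdaModel estimates the model with variational Bayes by default rather than the Gibbs sampler described above, but the outputs are the same kinds of distributions; the toy documents are illustrative.

```python
# Hedged sketch: fit a topic model and read off P(w|z) and P(z|d) with gensim.
from gensim import corpora, models

docs = [["pizza", "pasta", "italian", "food"],
        ["italian", "food", "pizza"],
        ["wedding", "party", "cake", "gifts"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]          # word-document counts

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      random_state=0, passes=50)

# Likely words in each topic: P(w | z)
for t in range(2):
    print(lda.show_topic(t, topn=5))

# Likely topics in each document: P(z | d); per_word_topics=True also returns
# per-word topic information for the document.
print(lda.get_document_topics(corpus[0], per_word_topics=True))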
Example topics from TASA, an educational corpus
• 37K documents, 26K word vocabulary
• 300 topics, for example:
[Figure: example topics with their most probable words]
Three documents with the word "play" (numbers & colors indicate topic assignments)
LSA vs. the Topic Model
• LSA: C = U D V^T, where C is the word × document co-occurrence matrix, U is words × dims, D is dims × dims, and V^T is dims × documents
• Topic model: C = Φ Θ, where C is the normalized co-occurrence matrix, Φ (words × topics) contains the mixture components (topics), and Θ (topics × documents) contains the mixture weights
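A hedged sketch contrasting the two factorizations on a small random count matrix, assuming scikit-learn is available (documents are rows here, as scikit-learn expects; scikit-learn's LDA uses variational inference, and all sizes are illustrative):

```python
# Hedged sketch: SVD-style factorization (LSA) vs. probabilistic factorization (topic model).
import numpy as np
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(16, 40))       # 16 documents, 40-word vocabulary

# LSA: truncated SVD of the count matrix, C ~ U D V^T (components may be negative)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_dims = svd.fit_transform(X)             # documents x dims
print(svd.components_.shape)                # dims x words

# Topic model: nonnegative probabilistic factorization, C ~ Theta Phi
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                # documents x topics, rows sum to 1
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topics x words, rows sum to 1
print(theta.shape, phi.shape)
```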
Documents as Topic Mixtures: a Geometric Interpretation
[Figure: the word simplex with axes P(word1), P(word2), P(word3), each between 0 and 1 and P(word1) + P(word2) + P(word3) = 1; topic 1 and topic 2 are points on the simplex, and each observed document lies between them as a mixture of the two topics]
Defining documents
• Can define "document" in multiple ways:
  • all words within a therapy session
  • all words from a particular speaker within a session
• Clearly we need to extend the topic model to dialogue….
Positive/Negative Topic Usage by Changes in Satisfaction
This graph shows that couples whose satisfaction decreases over the course of therapy use relatively negative language, while those who leave therapy with increased satisfaction exhibit more positive language.
Topics used by Satisfied/Unsatisfied Couples
Topic 38: talk, divorce, problem, house, along, separate, separation, talking, agree, example
Dissatisfied couples talk relatively more often about separation and divorce.
Affect Dynamics
• Analyze the short-term dynamics of affect usage: do unhappy couples follow up negative language with negative language more often than happy couples? In other words, are unhappy couples involved in a negative feedback loop?
• Calculated:
  • P( z2 = + | z1 = + )
  • P( z2 = + | z1 = - )
  • P( z2 = - | z1 = + )
  • P( z2 = - | z1 = - )
• E.g., P( z2 = - | z1 = + ) is the probability that, after a positive word, the next non-neutral word will be a negative word.
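A hedged sketch of this calculation: given a sequence of non-neutral affect assignments ('+' or '-') from a session, estimate the four conditional probabilities. The example sequence is made up.

```python
# Hedged sketch: estimate P(z2 | z1) from consecutive non-neutral affect assignments.
from collections import Counter

affect = ['+', '-', '-', '+', '-', '-', '-', '+', '+', '-', '-', '+', '-', '-']

pairs = Counter(zip(affect, affect[1:]))             # counts of consecutive (z1, z2) pairs
for z1 in ['+', '-']:
    total = sum(pairs[(z1, z2)] for z2 in ['+', '-'])
    for z2 in ['+', '-']:
        print(f"P(z2={z2} | z1={z1}) = {pairs[(z1, z2)] / total:.2f}")
```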
Markov Chain Illustration
[Figure: base rates and transition probabilities between positive (+) and negative (-) affect states, shown as Markov chains for four groups: Normal Controls, Positive Change, Little Change, and Negative Change]
Extensions
• Multi-label document classification
  • conditional topic models
• Topic models and word order
  • n-grams/collocations
  • hidden Markov models
• Some potential model developments:
  • topic models incorporating parse trees
  • topic models for dialogue
  • topic models incorporating speech information
Conditional Topic Models
• Assume there is a topic associated with each label/behavioral code.
• The model is only allowed to assign words to labels that are associated with the document.
• The model can therefore learn the distribution of words associated with each label/behavioral code.
Topics associated with Behavioral Codes
[Figure: documents tagged with behavioral codes (e.g., Vulnerability = yes/no, Hard Expression = yes/no); each word token in a document must be assigned to one of the topics for the codes active in that document, and topic weights are estimated per document]
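A hedged sketch of the key constraint, in the spirit of such conditional (labeled) topic models rather than the exact model in the talk: one topic per behavioral code, and each word token may only be assigned to the codes that are active for its document. The data and hyperparameters are illustrative.

```python
# Hedged sketch: collapsed Gibbs sampling where each document's words may only
# be assigned to the topics (codes) listed for that document.
import numpy as np

rng = np.random.default_rng(0)
V, T = 6, 3                                    # vocabulary size, topics (= behavioral codes)
alpha, beta = 0.5, 0.01

docs = [[0, 1, 1, 2], [3, 4, 4, 5], [0, 3, 2, 5]]       # word ids per document
doc_labels = [[0], [1], [0, 1, 2]]                       # codes active in each document

# count matrices and a random, label-respecting initialization
C_wt = np.zeros((V, T))
C_dt = np.zeros((len(docs), T))
z = [[rng.choice(doc_labels[d]) for _ in doc] for d, doc in enumerate(docs)]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        C_wt[w, z[d][i]] += 1
        C_dt[d, z[d][i]] += 1

for _ in range(200):                           # collapsed Gibbs sweeps
    for d, doc in enumerate(docs):
        allowed = doc_labels[d]                # the conditional constraint
        for i, w in enumerate(doc):
            t_old = z[d][i]
            C_wt[w, t_old] -= 1
            C_dt[d, t_old] -= 1
            p = ((C_wt[w, allowed] + beta) / (C_wt[:, allowed].sum(axis=0) + V * beta)
                 * (C_dt[d, allowed] + alpha))
            t_new = allowed[rng.choice(len(allowed), p=p / p.sum())]
            z[d][i] = t_new
            C_wt[w, t_new] += 1
            C_dt[d, t_new] += 1

print(np.round(C_wt / C_wt.sum(axis=0).clip(min=1), 2))   # learned word distribution per code
```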
Hidden Markov Topics Model (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
• Syntactic dependencies: short-range dependencies
• Semantic dependencies: long-range dependencies
• A chain of hidden syntactic states s1, s2, s3, s4, … generates the words w1, w2, w3, w4, … from an HMM; one designated semantic state instead generates words from the topic model, with topic assignments z1, z2, z3, z4, … drawn from the document's topic mixture θ.
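A minimal generative sketch of this idea under simplified assumptions: a small HMM over syntactic classes, where one designated "semantic" state emits words from the document's topic mixture instead of a class-specific distribution. Vocabulary, sizes, and parameters are illustrative.

```python
# Hedged sketch: HMM over syntactic classes with one semantic state that uses the topic model.
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["the", "of", "is", "network", "neuron", "image", "learning"])
V, T, S = len(vocab), 2, 3                        # vocab size, topics, HMM states (state 0 = semantic)

phi = rng.dirichlet(np.full(V, 0.1), size=T)      # topic-word distributions
hmm_emit = rng.dirichlet(np.full(V, 0.1), size=S) # word distributions for syntactic states
trans = rng.dirichlet(np.full(S, 1.0), size=S)    # HMM state transitions

theta = rng.dirichlet(np.full(T, 0.5))            # topic mixture for one document
state, words = 0, []
for _ in range(10):
    state = rng.choice(S, p=trans[state])
    if state == 0:                                # semantic state: use the topic model
        z = rng.choice(T, p=theta)
        words.append(vocab[rng.choice(V, p=phi[z])])
    else:                                         # syntactic state: use the HMM emission
        words.append(vocab[rng.choice(V, p=hmm_emit[state])])
print(" ".join(words))
```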
NIPS Semantics (example topics):
• KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
• NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
• IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
• EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
• MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
• DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
• STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
NIPS Syntax (example syntactic classes):
• IN WITH FOR ON FROM AT USING INTO OVER WITHIN
• # * I X T N - C F P
• IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
• SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
• HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
• MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
• USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
Random sentence generation (LANGUAGE topic):
• [S] RESEARCHERS GIVE THE SPEECH
• [S] THE SOUND FEEL NO LISTENERS
• [S] WHICH WAS TO BE MEANING
• [S] HER VOCABULARIES STOPPED WORDS
• [S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Collocation Topic Model (example topics)
• Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
• Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
• Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
• Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
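The collocation topic model learns multi-word terms jointly with the topics. A simpler, commonly used approximation (not the model shown above) is to merge frequent collocations in preprocessing and then run ordinary LDA; a hedged sketch with gensim's Phrases, assuming that library is available and using illustrative sentences and thresholds:

```python
# Hedged sketch: merge collocations (e.g. "wall_street") before running plain LDA.
from gensim.models.phrases import Phrases, Phraser
from gensim import corpora, models

sentences = [["wall", "street", "analysts", "upgraded", "the", "stock"],
             ["the", "new", "york", "region", "reacted", "to", "the", "attacks"],
             ["wall", "street", "firms", "filed", "for", "bankruptcy", "protection"]]

bigrams = Phraser(Phrases(sentences, min_count=1, threshold=1))  # low thresholds only for this toy data
merged = [bigrams[s] for s in sentences]                         # tokens such as "wall_street"

dictionary = corpora.Dictionary(merged)
corpus = [dictionary.doc2bow(d) for d in merged]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.show_topic(0, topn=5))
```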
Using parse trees / POS taggers?
[Figure: parse trees (S → NP VP) for "You complete me" and "I complete you"]
Topic Segmentation Model
• Purver, Kording, Griffiths, & Tenenbaum (2006). Unsupervised topic modeling for multi-party spoken discourse. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
• Automatically segments multi-party discourse into topically coherent segments
• Outperforms standard HMMs
• Model does not incorporate speaker information or speaker turns; the goal is simply to segment a long stream of words into segments
At each utterance, there is a probability of changing θ, the topic mixture. If no change is indicated, words are drawn from the same mixture of topics. If there is a change, the topic mixture is resampled from the Dirichlet prior.
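A minimal generative sketch of this segmentation idea under simplified assumptions: before each utterance, with some switching probability the topic mixture θ is resampled from its Dirichlet prior; otherwise the previous θ is reused. Vocabulary, sizes, and parameters are illustrative.

```python
# Hedged sketch: utterances drawn from a topic mixture that occasionally switches.
import numpy as np

rng = np.random.default_rng(0)
V, T = 8, 3                                   # vocabulary size, number of topics
phi = rng.dirichlet(np.full(V, 0.1), size=T)  # topic-word distributions
alpha, p_switch = np.full(T, 0.5), 0.2

theta = rng.dirichlet(alpha)
utterances, boundaries = [], []
for u in range(12):
    if u > 0 and rng.random() < p_switch:     # topic shift: start a new segment
        theta = rng.dirichlet(alpha)
        boundaries.append(u)
    z = rng.choice(T, p=theta, size=6)        # a topic for each word in the utterance
    words = [rng.choice(V, p=phi[t]) for t in z]
    utterances.append(words)
print("segment boundaries before utterances:", boundaries)
```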
Latent Dialogue Structure Model (Ding et al., NIPS workshop, 2009)
• Designed for modeling sequences of messages on discussion forums
• Models the relationships between messages within documents: a message might relate to any previous message within a dialogue
• Does not incorporate speaker-specific variables
Learning User Intentions in Spoken Dialogue Systems (Chinaei et al., ICAART, 2009)
• Applies the HTMM model (Gruber et al., 2007) to dialogue
• Assumes that within each talk-turn, words are drawn from the same topic z (not a mixture!). At the start of a new talk-turn, there is some probability ψ of sampling a new topic z from the mixture θ.
Other ideas
• Can we enhance topic models with non-verbal speech information?
• Each topic would be a distribution over words as well as over voicing information (f0, timing, etc.)
[Plate diagram: D documents, Nd word tokens per document, T topics, each token paired with a non-verbal feature]
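A hedged sketch of this idea under illustrative assumptions (the feature means, spreads, and sizes below are made up): each topic has both a word distribution and a Gaussian over a non-verbal feature (here f0), so every token generates a (word, f0) pair from its topic.

```python
# Hedged sketch: topics that emit a word and a non-verbal feature per token.
import numpy as np

rng = np.random.default_rng(0)
V, T = 10, 3
phi = rng.dirichlet(np.full(V, 0.1), size=T)         # per-topic word distributions
f0_mean = np.array([110.0, 180.0, 240.0])            # per-topic f0 means (Hz), illustrative
f0_sd = np.array([15.0, 20.0, 25.0])

theta = rng.dirichlet(np.full(T, 0.5))               # topic mixture for one document
tokens = []
for _ in range(8):
    z = rng.choice(T, p=theta)
    word = rng.choice(V, p=phi[z])
    f0 = rng.normal(f0_mean[z], f0_sd[z])            # non-verbal feature for this token
    tokens.append((z, word, round(f0, 1)))
print(tokens)
```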