Probabilistic Topic Models and Associative Memory
Overview
I Associative memory
II The topic model
III Applications to associative memory
IV Extensions of the model
V Applications in machine learning/text mining
Example of associative memory: word association
CUE: PLAY
RESPONSES: FUN, BALL, GAME, WORK, GROUND, MATE, CHILD, ENJOY, WIN, ACTOR
Example of associative memory: free recall
STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
RECALL WORDS .....
FALSE RECALL: “Sleep” 61%
A theory for semantic association
• Semantic association as probabilistic inference
• Representation of semantic structure
Latent Semantic Structure
(figure: a latent structure defines a distribution over words; the latent structure is inferred from observed words and used for prediction)
II The topic model
Probabilistic Topic Models
• Probabilistic Latent Semantic Indexing (pLSI)
  • Hofmann (1999)
• Latent Dirichlet Allocation (LDA)
  • Blei, Ng, & Jordan (2003)
In this talk: topic models as a theory of human semantic association
Topic Model
• Unsupervised learning of topics (“gist”) of documents:
  • articles/chapters
  • conversations
  • emails
  • .... any verbal context
• Topics are useful latent structures to explain semantic association
Probabilistic Generative Model
• Each document is a probability distribution over topics
• Each topic is a probability distribution over words
GENERATIVE PROCESS
TOPIC 1 (mixture component): money, loan, bank
TOPIC 2 (mixture component): river, stream, bank
Mixture weights: DOCUMENT 1 = .8 TOPIC 1 + .2 TOPIC 2; DOCUMENT 2 = .3 TOPIC 1 + .7 TOPIC 2
DOCUMENT 1: money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2
DOCUMENT 2: river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1
(the number after each word is the topic that generated it)
Bayesian approach: use priors
• Mixture weights ~ Dirichlet(α)
• Mixture components ~ Dirichlet(β)
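The generative process on this slide can be sketched in a few lines of Python. This is only an illustration: the uniform word probabilities within each topic, the value of `alpha`, and the helper names (`sample_dirichlet`, `sample_index`, `generate_document`) are assumptions, not values from the talk.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) of dimension k via gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Sample an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(phi, vocab, n_words, alpha, rng):
    """Generate one document: draw mixture weights, then a topic and a word per token."""
    theta = sample_dirichlet(alpha, len(phi), rng)   # mixture weights for this document
    doc = []
    for _ in range(n_words):
        z = sample_index(theta, rng)                 # sample a topic
        w = sample_index(phi[z], rng)                # sample a word from that topic
        doc.append((vocab[w], z + 1))                # keep the topic label (1-based)
    return theta, doc

vocab = ["money", "loan", "bank", "river", "stream"]
# Assumed mixture components: each topic is uniform over its three words.
phi = [
    [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0],   # topic 1: money, loan, bank
    [0.0, 0.0, 1 / 3, 1 / 3, 1 / 3],   # topic 2: bank, river, stream
]
rng = random.Random(0)
theta, doc = generate_document(phi, vocab, n_words=16, alpha=1.0, rng=rng)
print([round(t, 2) for t in theta])
print(" ".join(f"{word}{topic}" for word, topic in doc))
```

The printed document has the same form as the examples above: "bank" can carry either topic label, while "money" and "loan" only come from topic 1 and "river" and "stream" only from topic 2.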
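The probability of choosing a word:
$P(w) = \sum_{j=1}^{T} P(w \mid z = j)\, P(z = j)$
where $P(w \mid z = j)$ is the word probability in topic j and $P(z = j)$ is the probability of topic j in the document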
Graphical Model
• α → θ: sample a distribution over topics
• θ → z: sample a topic
• β → φ; z, φ → w: sample a word from that topic
• Plates: N_d words per document, D documents, T topics
INVERTING THE GENERATIVE PROCESS
DOCUMENT 1: A Play is written to be performed on a stage before a live audience or before motion picture or television cameras (for later viewing by large audiences). A Play is written because playwrights have something ...
DOCUMENT 2: He was listening to music coming from a passing riverboat. The music had already captured his heart as well as his ear. It was jazz. Bix Beiderbecke had already had music lessons. He wanted to play the cornet. And he wanted to play jazz.......
TOPIC 1: ?
TOPIC 2: ?
We estimate the assignments of topics to words
INVERTING THE GENERATIVE PROCESS
DOCUMENT 1: A Play[082] is written[082] to be performed[082] on a stage[082] before a live[093] audience[082] or before motion[270] picture[004] or television[004] cameras[004] (for later[054] viewing[004] by large[202] audiences[082]). A Play[082] is written[082] because playwrights[082] have something ...
DOCUMENT 2: He was listening[077] to music[077] coming[009] from a passing[043] riverboat. The music[077] had already captured[006] his heart[157] as well as his ear[119]. It was jazz[077]. Bix Beiderbecke had already had music[077] lessons[077]. He wanted[268] to play[077] the cornet. And he wanted[268] to play[077] jazz[077].......
(the number in brackets is the topic assigned to the preceding word)
TOPIC 1: ?
TOPIC 2: ?
We estimate the assignments of topics to words
Statistical Inference
• Fix number of topics T
• We estimate the posterior over topic assignments
• Markov Chain Monte Carlo (MCMC) with Gibbs sampling
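A minimal sketch of what Gibbs sampling over topic assignments can look like, using the standard collapsed update in which each token's topic is resampled from (word-topic count + β) / (topic total + Wβ) × (document-topic count + α). The hyperparameter values, the toy corpus, and the function name `gibbs_lda` are assumptions for illustration, not details from the talk.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, beta=0.1, n_iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]   # word-topic counts
    n_dt = [[0] * n_topics for _ in range(len(docs))]    # document-topic counts
    n_t = [0] * n_topics                                  # total tokens per topic

    def update(d, w, t, delta):
        n_wt[w][t] += delta
        n_dt[d][t] += delta
        n_t[t] += delta

    # Random initial topic assignment for every token.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            update(d, w, t, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)   # remove the token's current assignment
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(n_topics)
                ]
                z[d][i] = rng.choices(range(n_topics), weights=weights, k=1)[0]
                update(d, w, z[d][i], +1)   # record the newly sampled assignment
    return z, n_wt, n_dt

# Toy corpus (hypothetical): word ids index into vocab.
vocab = ["money", "loan", "bank", "river", "stream"]
docs = [[0, 2, 2, 1, 0, 2, 1], [3, 4, 2, 4, 3, 2, 3], [0, 2, 3, 4, 2, 1, 3]]
z, n_wt, n_dt = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
for doc, z_d in zip(docs, z):
    print(" ".join(f"{vocab[w]}{t + 1}" for w, t in zip(doc, z_d)))
```

The returned counts can be normalized (with the same α and β smoothing) to estimate the likely words in each topic and the likely topics for each document.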
Choosing number of topics
• Subjective interpretability
• Bayesian model selection
  • Griffiths & Steyvers (2004)
• Generalization test
• Non-parametric Bayesian statistics
  • Infinite models; models that grow with size of data
  • Teh, Jordan, Beal, & Blei (2004)
  • Blei, Griffiths, Jordan, Tenenbaum (2004)
Procedure
INPUT: word-document counts
OUTPUT:
• topic assignments to each word
• likely words in each topic
• likely topics for a document (“gist”)
Example: topics from an educational corpus (TASA)
• 37K docs, 26K words
• 1700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Polysemy
(same six TASA topics as on the previous slide; a word with several senses appears in several topics, e.g. PLAY in the theater and sports topics, COURT in the sports and law topics, CHARACTERS in the printing and theater topics, TEST in the science and homework topics)
III Applications to associative memory
Example associative structure
(figure: network of associates BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER; association norms by Doug Nelson et al., 1998)
Explaining structure with topics
(figure: the same associates BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER, linked through topic 1 and topic 2)
TASA corpus
• Need a suitable corpus to model human associations
• TASA
  • an educational corpus of text
  • 37K documents
  • 26K words
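Modeling Word Association
• Word association modeled as prediction
• Given that a single word is observed, which other words are likely to occur?
• Under a single topic assumption:
$P(w_2 \mid w_1) = \sum_{j=1}^{T} P(w_2 \mid z = j)\, P(z = j \mid w_1)$
where $w_1$ is the cue and $w_2$ is the response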
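As a rough sketch of how such a prediction could be computed: the response distribution for a cue is P(response | cue) = Σ_j P(response | z = j) P(z = j | cue), with P(z = j | cue) obtained by Bayes' rule. The two toy topics, their probabilities, the uniform topic prior, and the function name `association` are all assumptions for illustration.

```python
def association(phi, cue, prior=None):
    """Return P(w | cue) for every word w under a single-topic assumption.

    phi[j][w] is P(w | z = j); prior[j] is P(z = j), uniform if omitted.
    """
    n_topics = len(phi)
    if prior is None:
        prior = [1.0 / n_topics] * n_topics
    # Bayes' rule: P(z = j | cue) is proportional to P(cue | z = j) P(z = j).
    joint = [phi[j][cue] * prior[j] for j in range(n_topics)]
    norm = sum(joint)
    posterior = [x / norm for x in joint]
    # Marginalize over topics to predict the response word.
    n_words = len(phi[0])
    return [sum(phi[j][w] * posterior[j] for j in range(n_topics))
            for w in range(n_words)]

vocab = ["play", "stage", "theater", "ball", "game"]
# Hypothetical topics: a theater topic and a sports topic that share "play".
phi = [
    [0.4, 0.3, 0.3, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.4, 0.4],
]
pred = association(phi, cue=vocab.index("play"))
ranked = sorted(zip(vocab, pred), key=lambda pair: -pair[1])
print([(word, round(p, 3)) for word, p in ranked if word != "play"])
```

Because the cue "play" has substantial probability under both toy topics, the predicted responses include words from both senses, which is the behavior the word association example relies on.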
Model predictions
(figure; annotation: RANK 9)
Median rank of first associate
(figure; axis: median rank)
Latent Semantic Analysis (Landauer & Dumais, 1997)
• Each word is a single point in semantic space
• Similarity measured by cosine of angle between word vectors
(figure: word-document counts are mapped by singular value decomposition into a high-dimensional space; example words STREAM, RIVER, BANK, MONEY)
Median rank of first associate
(figure; axis: median rank)
Triangle Inequality in Spatial Representations
(figure: three word vectors w1, w2, w3; example words THEATER, SOCCER, PLAY)
Cosine similarity: cos(w1,w3) ≥ cos(w1,w2)cos(w2,w3) - sin(w1,w2)sin(w2,w3)
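The bound can be checked numerically. The sketch below computes the cosine between word vectors and the lower bound cos(w1,w2)cos(w2,w3) - sin(w1,w2)sin(w2,w3) on cos(w1,w3); the two-dimensional vectors assigned to THEATER, PLAY, and SOCCER are made up for illustration, as are the function names.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_lower_bound(cos12, cos23):
    """Lower bound on cos(w1, w3) implied by cos(w1, w2) and cos(w2, w3)."""
    # The angle between two vectors lies in [0, pi], so its sine is non-negative.
    sin12 = math.sqrt(max(0.0, 1.0 - cos12 * cos12))
    sin23 = math.sqrt(max(0.0, 1.0 - cos23 * cos23))
    return cos12 * cos23 - sin12 * sin23

# Hypothetical vectors: PLAY is close to both THEATER and SOCCER.
theater, play, soccer = [1.0, 0.2], [1.0, 1.0], [0.2, 1.0]
bound = cosine_lower_bound(cosine(theater, play), cosine(play, soccer))
print(round(cosine(theater, soccer), 3), ">=", round(bound, 3))

# The bound holds for arbitrary vectors, here checked on random triples.
rng = random.Random(1)
for _ in range(1000):
    w1, w2, w3 = ([rng.uniform(-1, 1) for _ in range(5)] for _ in range(3))
    assert cosine(w1, w3) >= cosine_lower_bound(cosine(w1, w2), cosine(w2, w3)) - 1e-9
```

In a spatial representation, two words that are both highly similar to a third word are therefore forced to be fairly similar to each other.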
Testing violation of triangle inequality
• Look for triplets of associates w1 w2 w3 such that P( w2 | w1 ) > t and P( w3 | w2 ) > t, and measure P( w3 | w1 )
• Vary threshold t
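A small sketch of this triplet search over a table of association probabilities. The probabilities below are invented for illustration and are not taken from the association norms; `find_triplets` is a hypothetical helper name.

```python
def find_triplets(assoc, t):
    """Find (w1, w2, w3, P(w3 | w1)) with P(w2 | w1) > t and P(w3 | w2) > t.

    assoc[cue][response] holds P(response | cue); missing entries count as 0.
    """
    triplets = []
    for w1, responses in assoc.items():
        for w2, p12 in responses.items():
            if p12 <= t:
                continue
            for w3, p23 in assoc.get(w2, {}).items():
                if p23 > t and w3 != w1:
                    triplets.append((w1, w2, w3, responses.get(w3, 0.0)))
    return triplets

# Hypothetical association probabilities P(response | cue).
assoc = {
    "SOCCER": {"PLAY": 0.40, "THEATER": 0.01},
    "PLAY": {"THEATER": 0.35, "SOCCER": 0.25},
    "THEATER": {"PLAY": 0.50, "SOCCER": 0.02},
}
for threshold in (0.2, 0.3):
    print(threshold, find_triplets(assoc, threshold))
```

With these made-up numbers, SOCCER and THEATER are each strongly linked to PLAY while P(THEATER | SOCCER) stays very small, which is the pattern the test looks for.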
Recall: example study list
STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
FALSE RECALL: “Sleep” 61%
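Recall as a reconstructive process
• Reconstruct study list based on the stored “gist”
• The gist can be represented by a distribution over topics
• Under a single topic assumption:
$P(w_{n+1} \mid \mathbf{w}) = \sum_{j=1}^{T} P(w_{n+1} \mid z = j)\, P(z = j \mid \mathbf{w})$
where $w_{n+1}$ is the retrieved word and $\mathbf{w}$ is the study list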
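A minimal sketch of gist-based reconstruction under the single topic assumption: the study list determines a posterior over topics, and retrieval probabilities are obtained by averaging the topics' word distributions under that posterior. The two toy topics, their probabilities, the uniform topic prior, and the function name `recall_probabilities` are assumptions for illustration.

```python
import math

def recall_probabilities(phi, study_list, prior=None):
    """Return P(w | study list) for every word w, assuming one topic generated the list."""
    n_topics = len(phi)
    if prior is None:
        prior = [1.0 / n_topics] * n_topics
    # Gist: P(z = j | list) is proportional to P(z = j) times the product of P(w_i | z = j).
    joint = [prior[j] * math.prod(phi[j][w] for w in study_list)
             for j in range(n_topics)]
    norm = sum(joint)
    gist = [x / norm for x in joint]
    n_words = len(phi[0])
    return [sum(phi[j][w] * gist[j] for j in range(n_topics))
            for w in range(n_words)]

vocab = ["bed", "rest", "dream", "sleep", "ball", "game"]
# Hypothetical topics: a sleep-related topic and a sports topic.
phi = [
    [0.25, 0.25, 0.20, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],
]
study = [vocab.index(w) for w in ("bed", "rest", "dream")]
pred = recall_probabilities(phi, study)
extra_list = [(w, round(p, 3)) for w, p in zip(vocab, pred) if w not in ("bed", "rest", "dream")]
print(sorted(extra_list, key=lambda pair: -pair[1]))
```

In this toy setup the unstudied word "sleep" receives the highest probability among the extra-list words, mirroring the false recall of “Sleep” in the study list example.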
Predictions for the “Sleep” list
(figure; panels: STUDY LIST, EXTRA LIST (top 8))
Correlation between intrusion rates and predictions
(figure; values shown: .69, .53, .37)
Latent Semantic Analysis vs. Topics
• Quantitative differences
• Qualitative differences
  • probabilistic generative models can work with more structured representations
• Extensions of topic models:
  • hierarchies
  • syntax-semantics
IV Extensions of the model
Integrating Topics and Syntax (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
• Syntactic dependencies: short-range
• Semantic dependencies: long-range
(graphical model: θ; topic assignments z1 z2 z3 z4; words w1 w2 w3 w4; syntactic states s1 s2 s3 s4)
• Semantic state: generate words from topic model
• Syntactic states: generate words from HMM
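One way to picture the combined model is as a generator that walks through HMM states and, whenever it enters the semantic state, emits a word from the topic model instead of from a syntactic class. The sketch below follows that idea only; the state names, transition probabilities, word lists, and function names are hypothetical and do not come from the cited paper.

```python
import random

def pick(dist, rng):
    """Sample a key from a dict that maps outcomes to probabilities."""
    items = list(dist)
    return rng.choices(items, weights=[dist[i] for i in items], k=1)[0]

def generate_sentence(theta, topics, classes, transitions, n_words, rng):
    """Generate words from an HMM whose SEMANTIC state defers to the topic model."""
    state = "START"
    words = []
    for _ in range(n_words):
        state = pick(transitions[state], rng)        # short-range syntactic dependency
        if state == "SEMANTIC":
            z = pick(theta, rng)                     # long-range: document's topic weights
            words.append(pick(topics[z], rng))
        else:
            words.append(pick(classes[state], rng))  # word from a syntactic class
    return words

# All parameters below are made up for illustration.
theta = {"memory": 0.7, "attention": 0.3}
topics = {
    "memory": {"MEMORY": 0.5, "RETRIEVAL": 0.5},
    "attention": {"ATTENTION": 0.5, "SEARCH": 0.5},
}
classes = {"DET": {"THE": 0.6, "A": 0.4}, "PREP": {"IN": 0.5, "OF": 0.5}}
transitions = {
    "START": {"DET": 1.0},
    "DET": {"SEMANTIC": 1.0},
    "SEMANTIC": {"PREP": 0.6, "SEMANTIC": 0.4},
    "PREP": {"DET": 0.5, "SEMANTIC": 0.5},
}
rng = random.Random(0)
print(" ".join(generate_sentence(theta, topics, classes, transitions, 6, rng)))
```

Content words depend on the document's topic weights, while function words depend only on the previous syntactic state.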
Semantic topics:
ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION
...
Syntactic classes:
IN BY WITH ON AS FROM TO FOR
IS ARE BE HAS HAVE WAS WERE AS
THE A AN THIS THEIR ITS EACH ONE
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION
Example word sequences:
(S) THE SEARCH IN LONG TERM MEMORY ……
(S) A MODEL OF VISUAL ATTENTION ……
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Topic Hierarchies
• In the regular topic model, there are no relations between topics
• Alternative: hierarchical topic organization
(figure: tree of topics, topic 1 through topic 7)
• Nested Chinese Restaurant Process
  • Blei, Griffiths, Jordan, Tenenbaum (2004)
• Learn hierarchical structure, as well as topics within structure
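To give a feel for the nested Chinese Restaurant Process, the sketch below draws a root-to-leaf path for each document: at every level the document follows an existing branch with probability proportional to the number of earlier documents on it, or opens a new branch with probability proportional to a concentration parameter. The parameter name `gamma`, its value, the tree depth, and the class and function names are assumptions for illustration; fitting topics along the paths is not shown.

```python
import random

class Node:
    """A node in the topic tree; count is the number of documents passing through it."""
    def __init__(self):
        self.count = 0
        self.children = []

def sample_path(root, depth, gamma, rng):
    """Sample one root-to-leaf path of the given depth and update the tree counts."""
    root.count += 1
    path = [root]
    node = root
    for _ in range(depth - 1):
        # Existing branches are weighted by popularity, a new branch by gamma.
        weights = [child.count for child in node.children] + [gamma]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(node.children):
            node.children.append(Node())
        node = node.children[k]
        node.count += 1
        path.append(node)
    return path

rng = random.Random(0)
root = Node()
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(50)]
print("documents:", root.count)
print("branches under the root:", [child.count for child in root.children])
```

Popular branches attract more documents, so the tree grows with the size of the data instead of being fixed in advance.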
Example: Psych Review Abstracts
(figure: topic hierarchy; the word lists shown in its nodes)
THE OF AND TO IN A IS
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
Generative Process
(figure: the same Psych Review topic hierarchy as on the previous slide)
V Applications in machine learning/text mining
Applications in Machine Learning
• Automatically learn topics from large text collections
  • NSF/NIH grant proposals
  • 18th century newspapers
  • Enron email
• Topics provide quick overview of content
Enron email data
• 500,000 emails
• 5000 authors
• 1999-2002
Enron topics
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE: May 22, 2000: Start of California energy crisis