Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities. Xuerui Wang, Computer Science Department, University of Massachusetts Amherst. Joint work with Andrew McCallum, Andres Corrada-Emmanuel, Chris Pal, Xing Wei and Natasha Mohanty.
Probabilistic topic models
• Main assumptions:
  • Documents are mixtures of topics
  • Topics are distributions over words that capture co-occurrence
• Objectives:
  • Understand text using learned topics
  • Represent documents in topic space
Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative process (with example):
• For each document:
  • Sample a distribution over topics, θ (example: 70% finance, 30% environment)
  • For each word in the document:
    • Sample a topic, z (example: finance or environment)
    • Sample a word from the topic, w (example: “bank”)
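The generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the six-word vocabulary, the two topic-word distributions, and the symmetric Dirichlet hyperparameter are made up for the example, and the function name `generate_document` is hypothetical.

```python
import numpy as np

def generate_document(n_words, topic_word, alpha, rng):
    """Draw one document from the LDA generative process.

    topic_word: (T, V) array; row t is topic t's distribution over the vocabulary.
    alpha: length-T Dirichlet hyperparameter for the per-document topic mixture.
    """
    n_topics, vocab_size = topic_word.shape
    theta = rng.dirichlet(alpha)                           # distribution over topics
    topics = rng.choice(n_topics, size=n_words, p=theta)   # one topic z per word
    words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topics])
    return theta, topics, words

# Hypothetical toy vocabulary and topics (not taken from the slides).
vocab = ["bank", "loan", "money", "river", "forest", "water"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "finance"-like topic
    [0.2, 0.0, 0.0, 0.3, 0.2, 0.3],   # an "environment"-like topic
])

rng = np.random.default_rng(0)
theta, topics, words = generate_document(
    10, topic_word, alpha=np.array([1.0, 1.0]), rng=rng
)
print(theta, [vocab[w] for w in words])
```

Note that “bank” has non-zero probability under both toy topics, which mirrors the slide's example of an ambiguous word whose topic must be inferred from context.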
Example topics induced from a large collection of text [Tenenbaum et al]
• JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
• SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
• BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
• FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
• STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
• MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
• DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
• WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER
Documents are not just text!
• Multiple modalities:
  • Research papers (author, venue, words, etc.)
  • Email messages (sender, recipients, time, words, etc.)
  • Legislative resolutions (voting record, words, etc.)
  • And many more
• Most previous work: one modality at a time
  • Learn topics from words
  • Discover groups from relations
  • Etc.
Outline • Introduction • Role and Topic Discovery in Social Networks • Group and Topic Discovery from Voting Records • Topics over Time • Topical Phrase with Markov Assumption • Conclusions
All possible “topic models” with one latent topic, two observed modalities and two conditional dependencies
Outline (next section): Role and Topic Discovery in Social Networks
Inference and Estimation
• Gibbs Sampling:
  • Easy to implement
  • Reasonably fast
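To make the "easy to implement" claim concrete, here is a minimal sketch of a collapsed Gibbs sampler. It is written for plain LDA rather than for the richer models in this talk, and everything about it is an assumption for illustration: the function name `lda_gibbs`, the symmetric hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA; docs are lists of word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))    # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))   # topic-word counts
    nk = np.zeros(n_topics)                  # total tokens assigned to each topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]  # random init

    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # Conditional p(z = k | all other assignments), up to a constant.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return z, ndk, nkw

# Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 4, 3, 3]]
z, ndk, nkw = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each sweep only decrements a few counts, evaluates one vector expression, and increments the counts again, which is why samplers of this kind are short to write and reasonably fast in practice.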
Enron email corpus
• 250k email messages
• 147 people
Sample message:
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere, Enron North America Corp., Legal Department, 1400 Smith Street, EB 3885, Houston, Texas 77002, dperlin@enron.com
Topics, and prominent senders / receivers discovered by ART (the Author-Recipient-Topic model). Topic names assigned by hand.
Topics, and prominent senders / receivers discovered by ART
• Beck = “Chief Operations Officer”
• Dasovich = “Government Relations Executive”
• Shapiro = “Vice President of Regulatory Affairs”
• Steffes = “Vice President of Government Affairs”
Comparing role discovery
connection strength (A, B) is computed by comparing, for persons A and B:
• Traditional SNA: distribution over recipients
• ART: distribution over authored topics
• Author-Topic: distribution over authored topics
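All three methods above reduce to comparing two probability distributions. The slide does not say which measure is used, so the sketch below assumes one common choice, the Jensen-Shannon divergence (base 2, so values lie in [0, 1]); the function name and the example distributions are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic distributions for three people.
person_a = [0.7, 0.2, 0.1]
person_b = [0.6, 0.3, 0.1]
person_c = [0.0, 0.1, 0.9]
print(js_divergence(person_a, person_b))   # small: similar roles
print(js_divergence(person_a, person_c))   # large: different roles
```

A low divergence would be read as a strong connection (similar roles), a high divergence as a weak one.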
Comparing role discovery: Tracy Geaconne, Dan McCarty Traditional SNA ART Author-Topic Different roles Different roles Similar roles Geaconne = “Secretary” McCarty = “Vice President”
Comparing role discovery: Lynn Blair, Kimberly Watson Traditional SNA ART Author-Topic Very similar Very different Different roles Blair = “Gas pipeline logistics” Watson = “Pipeline facilities planning”
McCallum Email Corpus 2004
• January - October 2004
• 23k email messages
• 825 people
Sample message:
From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt.
Thanks, Kate
Outline (next section): Group and Topic Discovery from Voting Records
Discovering groups from an observed set of relations
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
Admiration relations among six high school students.
Adjacency matrix representing relations
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
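The relation list above can be turned into an adjacency matrix directly. The sketch below assumes that Acad(X, Y) is a directed relation from X (row) to Y (column); the variable names are illustrative only.

```python
import numpy as np

students = ["A", "B", "C", "D", "E", "F"]   # Adams ... Frederking
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

index = {name: i for i, name in enumerate(students)}
adj = np.zeros((len(students), len(students)), dtype=int)
for src, dst in acad:
    adj[index[src], index[dst]] = 1          # row = source, column = target

print(adj)
```

Students with identical rows and columns relate to everyone else in the same way: here A and C, B and D, and E and F form three such pairs, which is the kind of block structure a group model is meant to recover.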
Group Model: partitioning entities into groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
[Graphical model; distributions shown: Beta, Multinomial, Dirichlet, Binomial. S: number of entities. G: number of groups.]
Enhanced with an arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
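One way to read the four distributions named on this slide is as a generative story: group proportions drawn from a Dirichlet, each entity's group from a multinomial, a relation probability per ordered group pair from a Beta, and each observed relation from a binomial with one trial. The sketch below follows that reading; the function name, the hyperparameter values, and the exclusion of self-relations are assumptions, not details taken from the slides.

```python
import numpy as np

def sample_blockmodel(n_entities, n_groups, rng,
                      dirichlet_alpha=1.0, beta_a=1.0, beta_b=1.0):
    """Sample group assignments and a binary relation matrix from a blockmodel."""
    # Group proportions (Dirichlet) and per-entity group assignment (multinomial).
    pi = rng.dirichlet(np.full(n_groups, dirichlet_alpha))
    groups = rng.choice(n_groups, size=n_entities, p=pi)
    # Probability of a relation from group g to group h (Beta).
    gamma = rng.beta(beta_a, beta_b, size=(n_groups, n_groups))
    # Each relation is a single binomial trial governed by its group pair.
    probs = gamma[groups[:, None], groups[None, :]]
    relations = rng.binomial(1, probs)
    np.fill_diagonal(relations, 0)   # assumption: no self-relations
    return groups, gamma, relations

rng = np.random.default_rng(1)
groups, gamma, relations = sample_blockmodel(n_entities=6, n_groups=3, rng=rng)
print(groups)
print(relations)
```

Inference runs this story in reverse: given only `relations`, recover `groups`.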
Two relations with different attributes
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
Social admiration: Soci(A, B), Soci(A, D), Soci(A, F), Soci(B, A), Soci(B, C), Soci(B, E), Soci(C, B), Soci(C, D), Soci(C, F), Soci(D, A), Soci(D, C), Soci(D, E), Soci(E, B), Soci(E, D), Soci(E, F), Soci(F, A), Soci(F, C), Soci(F, E)
Example textual attributes: “budget, funding, annual, cash” and “document, corrections, review, annual”.
Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.
The Group-Topic model: discovering groups and topics simultaneously
[Graphical model; distributions shown: Beta, Uniform, Multinomial, Dirichlet, Dirichlet, Binomial, Multinomial]
U.S. Senate data set
• 16 years of voting records in the US Senate (1989 – 2005)
• A Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
Sample resolution:
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ……
Topics discovered (U.S. Senate): Mixture of Unigrams vs. Group-Topic Model
Groups discovered (US Senate) Groups from topic Education + Domestic
Senators who change coalition the most, dependent on topic
e.g. Senator Shelby (D-AL) votes
• with the Republicans on Economic
• with the Democrats on Education + Domestic
• with a small group of maverick Republicans on Social Security + Medicare
Do we get better groups with the GT model?
Baseline model:
• Cluster bills into topics using mixture of unigrams;
• Apply group model on topic-specific subsets of bills.
GT model:
• Jointly cluster topics and groups at the same time using the GT model.
Agreement Index (AI) measures group cohesion. Higher is better.
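The slide does not define the Agreement Index. The sketch below assumes the definition commonly attributed to Hix, Noury and Roland for roll-call votes, where a group's votes on one resolution are split into yea, nay and abstain counts; the function name is hypothetical, and the index actually used in the talk may differ in detail.

```python
def agreement_index(yea, nay, abstain=0):
    """Cohesion of one group on one vote: 1.0 if unanimous, 0.0 if split evenly three ways."""
    total = yea + nay + abstain
    if total == 0:
        raise ValueError("group cast no votes")
    top = max(yea, nay, abstain)
    return (top - 0.5 * (total - top)) / total

print(agreement_index(10, 0))      # unanimous group
print(agreement_index(7, 3))       # mostly cohesive group
print(agreement_index(5, 5, 5))    # evenly split three ways
```

Averaging this value over all groups and resolutions would give a single cohesion score with which the baseline and the GT model could be compared.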
Outline (next section): Topics over Time
Want to model trends over time • Is prevalence of topic growing or waning? • Pattern appears only briefly • Capture its statistics in focused way • Don’t confuse it with patterns elsewhere in time • How do roles, groups, influence shift over time?
Topics Over Time (TOT)
[Graphical model, drawn in two alternative views. Components shown:
• Dirichlet prior on the multinomial over topics
• topic index, drawn from the multinomial over topics
• word, drawn from a multinomial over words (with Dirichlet prior)
• timestamp, drawn from a Beta over time]
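Read as a generative story, the components above extend LDA by letting each topic also emit a timestamp from its own Beta distribution. The sketch below follows that reading under stated assumptions: timestamps are normalised to the interval (0, 1) because a Beta distribution is only defined there, and the toy topics, Beta parameters, and function name are made up for illustration.

```python
import numpy as np

def generate_tot_document(n_words, topic_word, topic_beta, alpha, rng):
    """Draw topics, words and per-word timestamps for one document.

    topic_word: (T, V) array of per-topic word distributions.
    topic_beta: (T, 2) array of per-topic Beta parameters (a, b) over time.
    """
    n_topics, vocab_size = topic_word.shape
    theta = rng.dirichlet(alpha)                            # multinomial over topics
    topics = rng.choice(n_topics, size=n_words, p=theta)    # topic index per word
    words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topics])
    # Each token's timestamp comes from the Beta of its topic.
    stamps = rng.beta(topic_beta[topics, 0], topic_beta[topics, 1])
    return topics, words, stamps

topic_word = np.array([[0.5, 0.5, 0.0, 0.0],
                       [0.0, 0.0, 0.5, 0.5]])
topic_beta = np.array([[2.0, 8.0],     # topic 0 concentrated early in the period
                       [8.0, 2.0]])    # topic 1 concentrated late in the period
rng = np.random.default_rng(2)
topics, words, stamps = generate_tot_document(
    20, topic_word, topic_beta, alpha=np.array([1.0, 1.0]), rng=rng
)
```

Because each topic carries its own distribution over time, a topic that appears only briefly can be modelled with a sharply peaked Beta instead of being blurred together with patterns elsewhere in time.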
State of the Union addresses
208 addresses delivered between January 8, 1790 and January 29, 2002.
• To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop words were removed.
• 17,156 ‘documents’
• 21,534 words
• 669,425 tokens
Sample ‘document’:
Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.
1910
Topic Distributions Conditioned on Time in NIPS conference papers
[Figure: topic mass (vertical height) plotted against time]
TOT improves ability to predict time
Task: predicting the year of a State-of-the-Union address. L1 = distance between predicted year and actual year.
Outline (next section): Topical Phrase with Markov Assumption
Topic Interpretability
• LDA: algorithms, algorithm, genetic, problems, efficient
• Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
Topics modeling phrases
• Topics based only on unigrams are often difficult to interpret.
• Topic discovery itself is confused because important meaning / distinctions are carried by phrases.
• Significant opportunity to provide improved language models to ASR, MT, IR, etc.
Topical N-Gram model
[Graphical model: three chains of per-position variables, z1 z2 z3 z4 ..., y1 y2 y3 y4 ..., and w1 w2 w3 w4 ..., inside a plate D, with additional plates over W and T]
Features of Topical N-Grams model
• Easily trained by Gibbs sampling
  • Can run efficiently on millions of words
• Topic-specific phrase discovery
  • “white house” has special meaning as a phrase in the politics topic,
  • ... but not in the real estate topic.
NIPS research papers
• Full text of NIPS papers from 1987 to 1999.
• 1,740 research papers in total.
• 13,649 unique words and 2,301,375 word tokens.
• Stop words removed; no stemming.
“Reinforcement Learning”
• LDA: state, learning, policy, action, reinforcement, states, time, optimal, actions, function, algorithm, reward, step, dynamic, control, sutton, rl, decision, algorithms, agent
• Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning RL, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
• Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
“Support Vector Machines”
• LDA: kernel, linear, vector, support, set, nonlinear, data, algorithm, space, pca, function, problem, margin, vectors, solution, training, svm, kernels, matrix, machines
• Topical N-grams (2+): support vectors, test error, support vector machines, training error, feature space, training examples, decision function, cost functions, test inputs, kkt conditions, leave-one-out procedure, soft margin, bayesian transduction, training patterns, training points, maximum margin, strictly convex, regularization operators, base classifiers, convex optimization
• Topical N-grams (1): kernel, training, support, margin, svm, solution, kernels, regularization, adaboost, test, data, generalization, examples, cost, convex, algorithm, working, feature, sv, functions