Statistical Modeling of Large Text Collections
Padhraic Smyth, Department of Computer Science, University of California, Irvine
MURI Project Kick-off Meeting, November 18th 2008
The Text Revolution Widespread availability of text in digital form is driving many new applications based on automated text analysis • Categorization/classification • Automated summarization • Machine translation • Information extraction • And so on…. • Most of this work is happening in computing, but many of the underlying techniques are statistical
Motivation • Pennsylvania Gazette: 80,000 articles, 1728-1800 • MEDLINE: 16 million articles • New York Times: 1.5 million articles
Problems of Interest • What topics do these documents “span”? • Which documents are about a particular topic? • How have topics changed over time? • What does author X write about? • and so on….. Key Ideas: • Learn a probabilistic model over words and docs • Treat query-answering as computation of appropriate conditional probabilities
Topic Models for Documents P(word | document) = Σ_topics P(word | topic) P(topic | document) • A topic = probability distribution over words • P(topic | document) = mixing coefficients for each document • Both are learned automatically from the text corpus
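Written out with indices (the symbols φ for the topic-word distributions, θ for the document-topic proportions, and T for the number of topics are the notation assumed on the later slides):

P(w \mid d) \;=\; \sum_{k=1}^{T} P(w \mid z = k)\, P(z = k \mid d) \;=\; \sum_{k=1}^{T} \phi_{kw}\, \theta_{dk}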
Basic Concepts • Topics = distributions over words • Unknown a priori, learned from data • Documents represented as mixtures of topics • Learning algorithm • Gibbs sampling (stochastic search) • Linear time per iteration • Provides a full probabilistic model over words, documents, and topics • Query answering = computation of conditional probabilities
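As a concrete illustration of the last point, here is a minimal sketch of query answering from learned parameters; the array names theta (D x T, rows = P(topic | document)) and phi (T x V, rows = P(word | topic)) and both helper functions are assumptions for illustration, not the original system's code:

import numpy as np

def p_word_given_doc(theta, phi, d):
    # P(word | document d) = sum_k P(word | topic k) P(topic k | document d)
    return theta[d] @ phi                      # length-V probability vector

def p_topic_given_doc_word(theta, phi, d, w):
    # Posterior over topics for word w occurring in document d
    joint = theta[d] * phi[:, w]               # proportional to P(topic, word | document d)
    return joint / joint.sum()

Queries such as "which documents are about topic k" then reduce to sorting documents by theta[:, k].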
Enron email data: 250,000 emails, 28,000 individuals, 1999-2002
Examples of Topics from New York Times
• Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS
• Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS
• Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX
• Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP
Topic trends from New York Times (330,000 articles, 2000-2002)
• Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
• Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
• Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING
What does an author write about? • Author = Jerry Friedman, Stanford: • Topic 1: regression, estimate, variance, data, series,… • Topic 2: classification, training, accuracy, decision, data,… • Topic 3: distance, metric, similarity, measure, nearest,… • Author = Rakesh Agrawal, IBM: • Topic 1: index, data, update, join, efficient…. • Topic 2: query, database, relational, optimization, answer…. • Topic 3: data, mining, association, discovery, attributes,…
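A sketch of how such an author query could be answered once an author-topic model (mentioned under Extensions below) has been fit; the matrices author_topic (A x T) and phi (T x V), the vocab list, and the function name are all illustrative assumptions:

import numpy as np

def author_profile(author_topic, phi, vocab, a, n_topics=3, n_words=5):
    # Top topics for author a, each summarized by its highest-probability words
    top_topics = np.argsort(author_topic[a])[::-1][:n_topics]
    return [[vocab[w] for w in np.argsort(phi[k])[::-1][:n_words]]
            for k in top_topics]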
Examples of Data Sets Modeled • 1,200 Bible chapters (KJV) • 4,000 Blog entries • 20,000 PNAS abstracts • 80,000 Pennsylvania Gazette articles • 250,000 Enron emails • 300,000 North Carolina vehicle accident police reports • 500,000 New York Times articles • 650,000 CiteSeer abstracts • 8 million MEDLINE abstracts • Books by Austen, Dickens, and Melville • ….. • Exactly the same algorithm was used in all cases, and in every case it produced interpretable topics automatically
Related Work • Statistical origins • Latent class models in statistics (late 60’s) • Admixture models in genetics • LDA Model: Blei, Ng, and Jordan (2003) • Variational EM • Topic Model: Griffiths and Steyvers (2004) • Collapsed Gibbs sampler • Alternative approaches • Latent semantic indexing (LSI/LSA) • less interpretable, not appropriate for count data • Document clustering: • simpler but less powerful
Clusters v. Topics [figure]: the same document viewed under a cluster model (assigned to one cluster) versus a topic model (a mixture of multiple topics)
Extensions • Author-topic models • Authors = mixtures over topics (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004) • Special-words model • Documents = mixtures of topics + idiosyncratic words (Chemudugunta, Smyth, Steyvers, 2006) • Entity-topic models • Topic models that can reason about entities (Newman, Chemudugunta, Smyth, Steyvers, 2006) • See also work by McCallum, Blei, Buntine, Welling, Fienberg, Xing, etc • Probabilistic basis allows for a wide range of generalizations
Technical Approach and Challenges • Develop flexible probabilistic network models that can incorporate textual information • e.g., ERGMs with text as node or edge covariates • e.g., latent space models with text-based covariates • e.g., dynamic relational models with text as edge covariates • Research challenges • Computational scalability • ERGMs not directly applicable to large text data sets • What text representation to use (see the sketch below): • High-dimensional "bag of words"? • Low-dimensional latent topics? • Utility of text • Does the incorporation of textual information produce more accurate models or predictions? How can this be quantified?
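A small illustrative sketch of the representation question above; scikit-learn's variational LDA is used here only as a stand-in for the Gibbs-sampled topic model, and all names and numbers are assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["troops attack city", "market shares fell", "court filed bankruptcy case"]

bow = CountVectorizer().fit_transform(docs)                # high-dimensional bag of words
topic_props = LatentDirichletAllocation(
    n_components=2, random_state=0).fit_transform(bow)     # low-dimensional topic mixtures

Either matrix could be attached to network nodes (or aggregated over edges) as covariates for an ERGM or latent space model; the topic representation has far fewer columns, which matters for scalability.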
Graphical Model [plate diagram]: a latent group variable z generates the observed words w_1, …, w_n; the word node w sits on a plate over the n words in a document, with an outer plate over the D documents
Mixture Model for Documents [plate diagram]: group probabilities α, a group variable z for each document, group-word distributions φ, and observed words w; plates over the n words and the D documents
Clustering with a Mixture Model [plate diagram]: the same structure relabeled, with cluster probabilities α, a cluster variable z for each document, and cluster-word distributions φ
Graphical Model for Topics [plate diagram]: document-topic distributions θ, a topic variable z for each word occurrence, topic-word distributions φ, and observed words w; plates over the n words and the D documents
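The generative process this diagram encodes, as a minimal sketch (theta_d, phi, and the function name are illustrative; theta_d is the topic mixture for one document and phi the topic-word matrix):

import numpy as np

rng = np.random.default_rng(0)

def generate_document(theta_d, phi, n_words):
    # For each word position: sample a topic from theta_d, then a word from that topic
    words = []
    for _ in range(n_words):
        k = rng.choice(len(theta_d), p=theta_d)   # topic z ~ P(topic | document)
        w = rng.choice(phi.shape[1], p=phi[k])    # word w ~ P(word | topic z)
        words.append(w)
    return words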
Learning via Gibbs sampling [same diagram]: a Gibbs sampler estimates the topic assignment z for each word occurrence, marginalizing over the other parameters (θ and φ)
More Details on Learning • Gibbs sampling for word-topic assignments (z) • 1 iteration = full pass through all words in all documents • Typically run a few hundred Gibbs iterations • Estimating θ and φ • use z samples to get point estimates • non-informative Dirichlet priors for θ and φ • Computational Efficiency • Learning is linear in the number of word tokens • Can still take on the order of a day for 100k or more documents
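A minimal sketch of the collapsed Gibbs sampler described above, assuming documents arrive as lists of integer word ids; variable names, defaults, and priors are illustrative, not the authors' implementation:

import numpy as np

def gibbs_topic_model(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V)
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, T))                     # document-topic counts
    nkw = np.zeros((T, V))                     # topic-word counts
    nk = np.zeros(T)                           # total words assigned to each topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):                     # one iteration = full pass over all word tokens
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional distribution of z for this token given all other assignments
                p = (nkw[:, w] + beta) / (nk + V * beta) * (ndk[d] + alpha)
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k                    # add the new assignment back
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # point estimates
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi, z

The two inner loops make each iteration linear in the total number of word tokens, which is why learning scales to the corpora listed earlier but can still take on the order of a day for very large collections.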