Statistical Modeling of Large Text Collections
Padhraic Smyth, Department of Computer Science, University of California, Irvine
MURI Project Kick-off Meeting, November 18th 2008
The Text Revolution Widespread availability of text in digital form is driving many new applications based on automated text analysis • Categorization/classification • Automated summarization • Machine translation • Information extraction • And so on…. • Most of this work is happening in computing, but many of the underlying techniques are statistical
Motivation • Pennsylvania Gazette: 80,000 articles, 1728-1800 • MEDLINE: 16 million articles • New York Times: 1.5 million articles
Problems of Interest • What topics do these documents “span”? • Which documents are about a particular topic? • How have topics changed over time? • What does author X write about? • and so on….. Key Ideas: • Learn a probabilistic model over words and docs • Treat query-answering as computation of appropriate conditional probabilities
Topic Models for Documents P(word | document) = Σ_topics P(word | topic) P(topic | document) • A topic = probability distribution over words • P(topic | document) = mixing coefficients for each document • Both are learned automatically from the text corpus
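Written out with indices (the symbols φ for the topic-word distributions, θ for the document-topic proportions, and T for the number of topics are the notation assumed on the later slides):

P(w \mid d) \;=\; \sum_{k=1}^{T} P(w \mid z = k)\, P(z = k \mid d) \;=\; \sum_{k=1}^{T} \phi_{kw}\, \theta_{dk}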
Basic Concepts • Topics = distributions over words • Unknown a priori, learned from data • Documents represented as mixtures of topics • Learning algorithm • Gibbs sampling (stochastic search) • Linear time per iteration • Provides a full probabilistic model over words, documents, and topics • Query answering = computation of conditional probabilities
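As a concrete illustration of the last point, here is a minimal sketch of query answering from learned parameters; the array names theta (D x T, rows = P(topic | document)) and phi (T x V, rows = P(word | topic)) and both helper functions are assumptions for illustration, not the original system's code:

import numpy as np

def p_word_given_doc(theta, phi, d):
    # P(word | document d) = sum_k P(word | topic k) P(topic k | document d)
    return theta[d] @ phi                      # length-V probability vector

def p_topic_given_doc_word(theta, phi, d, w):
    # Posterior over topics for word w occurring in document d
    joint = theta[d] * phi[:, w]               # proportional to P(topic, word | document d)
    return joint / joint.sum()

Queries such as "which documents are about topic k" then reduce to sorting documents by theta[:, k].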
Enron email data: 250,000 emails, 28,000 individuals, 1999-2002
Examples of Topics from New York Times
• Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS
• Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS
• Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX
• Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP
Topic trends from New York Times (330,000 articles, 2000-2002)
• Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
• Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
• Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING
What does an author write about? • Author = Jerry Friedman, Stanford: • Topic 1: regression, estimate, variance, data, series,… • Topic 2: classification, training, accuracy, decision, data,… • Topic 3: distance, metric, similarity, measure, nearest,… • Author = Rakesh Agrawal, IBM: • Topic 1: index, data, update, join, efficient…. • Topic 2: query, database, relational, optimization, answer…. • Topic 3: data, mining, association, discovery, attributes,…
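A sketch of how such an author query could be answered once an author-topic model (mentioned under Extensions below) has been fit; the matrices author_topic (A x T) and phi (T x V), the vocab list, and the function name are all illustrative assumptions:

import numpy as np

def author_profile(author_topic, phi, vocab, a, n_topics=3, n_words=5):
    # Top topics for author a, each summarized by its highest-probability words
    top_topics = np.argsort(author_topic[a])[::-1][:n_topics]
    return [[vocab[w] for w in np.argsort(phi[k])[::-1][:n_words]]
            for k in top_topics]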
Examples of Data Sets Modeled • 1,200 Bible chapters (KJV) • 4,000 Blog entries • 20,000 PNAS abstracts • 80,000 Pennsylvania Gazette articles • 250,000 Enron emails • 300,000 North Carolina vehicle accident police reports • 500,000 New York Times articles • 650,000 CiteSeer abstracts • 8 million MEDLINE abstracts • Books by Austen, Dickens, and Melville • ….. • Exactly the same algorithm was used in all cases, and in every case it produced interpretable topics automatically
Related Work • Statistical origins • Latent class models in statistics (late 60’s) • Admixture models in genetics • LDA Model: Blei, Ng, and Jordan (2003) • Variational EM • Topic Model: Griffiths and Steyvers (2004) • Collapsed Gibbs sampler • Alternative approaches • Latent semantic indexing (LSI/LSA) • less interpretable, not appropriate for count data • Document clustering: • simpler but less powerful
Clusters v. Topics [figure]: the same document viewed under a cluster model (assigned to one cluster) versus a topic model (a mixture of multiple topics)
Extensions • Author-topic models • Authors = mixtures over topics (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004) • Special-words model • Documents = mixtures of topics + idiosyncratic words (Chemudugunta, Smyth, Steyvers, 2006) • Entity-topic models • Topic models that can reason about entities (Newman, Chemudugunta, Smyth, Steyvers, 2006) • See also work by McCallum, Blei, Buntine, Welling, Fienberg, Xing, etc • Probabilistic basis allows for a wide range of generalizations
Technical Approach and Challenges • Develop flexible probabilistic network models that can incorporate textual information • e.g., ERGMs with text as node or edge covariates • e.g., latent space models with text-based covariates • e.g., dynamic relational models with text as edge covariates • Research challenges • Computational scalability • ERGMs not directly applicable to large text data sets • What text representation to use (see the sketch below): • High-dimensional "bag of words"? • Low-dimensional latent topics? • Utility of text • Does the incorporation of textual information produce more accurate models or predictions? How can this be quantified?
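A small illustrative sketch of the representation question above; scikit-learn's variational LDA is used here only as a stand-in for the Gibbs-sampled topic model, and all names and numbers are assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["troops attack city", "market shares fell", "court filed bankruptcy case"]

bow = CountVectorizer().fit_transform(docs)                # high-dimensional bag of words
topic_props = LatentDirichletAllocation(
    n_components=2, random_state=0).fit_transform(bow)     # low-dimensional topic mixtures

Either matrix could be attached to network nodes (or aggregated over edges) as covariates for an ERGM or latent space model; the topic representation has far fewer columns, which matters for scalability.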
Graphical Model [plate diagram]: a latent group variable z generates the observed words w_1, …, w_n; the word node w sits on a plate over the n words in a document, with an outer plate over the D documents
Mixture Model for Documents [plate diagram]: group probabilities α, a group variable z for each document, group-word distributions φ, and observed words w; plates over the n words and the D documents
Clustering with a Mixture Model [plate diagram]: the same structure relabeled, with cluster probabilities α, a cluster variable z for each document, and cluster-word distributions φ
Graphical Model for Topics [plate diagram]: document-topic distributions θ, a topic variable z for each word occurrence, topic-word distributions φ, and observed words w; plates over the n words and the D documents
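The generative process this diagram encodes, as a minimal sketch (theta_d, phi, and the function name are illustrative; theta_d is the topic mixture for one document and phi the topic-word matrix):

import numpy as np

rng = np.random.default_rng(0)

def generate_document(theta_d, phi, n_words):
    # For each word position: sample a topic from theta_d, then a word from that topic
    words = []
    for _ in range(n_words):
        k = rng.choice(len(theta_d), p=theta_d)   # topic z ~ P(topic | document)
        w = rng.choice(phi.shape[1], p=phi[k])    # word w ~ P(word | topic z)
        words.append(w)
    return words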
Learning via Gibbs sampling [same diagram]: a Gibbs sampler estimates the topic assignment z for each word occurrence, marginalizing over the other parameters (θ and φ)
More Details on Learning • Gibbs sampling for word-topic assignments (z) • 1 iteration = full pass through all words in all documents • Typically run a few hundred Gibbs iterations • Estimating θ and φ • use z samples to get point estimates • non-informative Dirichlet priors for θ and φ • Computational Efficiency • Learning is linear in the number of word tokens • Can still take on the order of a day for 100k or more documents
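A minimal sketch of the collapsed Gibbs sampler described above, assuming documents arrive as lists of integer word ids; variable names, defaults, and priors are illustrative, not the authors' implementation:

import numpy as np

def gibbs_topic_model(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V)
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, T))                     # document-topic counts
    nkw = np.zeros((T, V))                     # topic-word counts
    nk = np.zeros(T)                           # total words assigned to each topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):                     # one iteration = full pass over all word tokens
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional distribution of z for this token given all other assignments
                p = (nkw[:, w] + beta) / (nk + V * beta) * (ndk[d] + alpha)
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k                    # add the new assignment back
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # point estimates
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi, z

The two inner loops make each iteration linear in the total number of word tokens, which is why learning scales to the corpora listed earlier but can still take on the order of a day for very large collections.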