Bringing Structure to Text Jiawei Han, Chi Wang and Ahmed El-Kishky Computer Science, University of Illinois at Urbana-Champaign August 24, 2014
Outline • Introduction to bringing structure to text • Mining phrase-based and entity-enriched topical hierarchies • Heterogeneous information network construction and mining • Trends and research problems
Motivation of Bringing Structure to Text • The prevalence of unstructured data • Structures are useful for knowledge discovery • Too expensive to be structured by humans: needs automated & scalable methods "Up to 85% of all information is unstructured" -- estimated by industry analysts "Vast majority of the CEOs expressed frustration over their organization's inability to glean insights from available data" -- IBM study with 1,500+ CEOs
Information Overload: A Critical Problem in the Big Data Era • By 2020, information will double every 73 days -- G. Starkweather (Microsoft), 1992 • Unstructured or loosely structured data are prevalent (Figure: information growth over time)
Example: Research Publications • Every year, hundreds of thousands of papers are published • Unstructured data: paper text • Loosely structured entities: authors, venues (Figure: network linking papers, authors, and venues)
Example: News Articles • Every day, >90,000 news articles are produced • Unstructured data: news content • Extracted entities: persons, locations, organizations, … (Figure: network linking news articles to persons, locations, and organizations)
Example: Social Media • Every second, >150K tweets are sent out • Unstructured data: tweet content • Loosely structured entities: Twitter users, hashtags, URLs, … (Figure: tweets linked to hashtags such as #maythefourthbewithyou, Twitter users such as DarthVader, and URLs such as The White House)
Text-Attached Information Network for Unstructured and Loosely-Structured Data (Figure: text documents – papers, news, tweets – attached to entities, given or extracted: authors, venues, persons, locations, organizations, Twitter users, hashtags, URLs)
What Power Can We Gain if More Structures Can Be Discovered? • Structured database queries • Information network analysis, …
Structures Facilitate Multi-Dimensional Analysis: An EventCube Experiment
Distribution along Multiple Dimensions • Query 'health care bill' in news data
Entity Analysis and Profiling • Topic distribution for "Stanford University"
Structures Facilitate Heterogeneous Information Network Analysis • Real-world data: multiple object types and/or multiple link types • Examples: the Facebook network; the DBLP bibliographic network (papers, authors, venues); the IMDB movie network (movies, directors, actors, studios)
What Can Be Mined in Structured Information Networks • Example: DBLP: A Computer Science bibliographic database
Useful Structure from Text: Phrases, Topics, Entities • Top 10 active politicians and phrases regarding healthcare issues? • Top 10 researchers and phrases in data mining, and their specializations? (Figure: text linked to phrases, entities, and hierarchical topics)
Outline • Introduction to bringing structure to text • Mining phrase-based and entity-enriched topical hierarchies • Heterogeneous information network construction and mining • Trends and research problems
Topic Hierarchy: Summarize the Data at Multiple Granularities • Top 10 researchers in data mining? • And their specializations? • Important research areas in the SIGIR conference? (Figure: network of papers, authors, and venues)
Bag-of-Words Topic Modeling • Widely studied technique for text analysis • Summarize themes/aspects • Facilitate navigation/browsing • Retrieve documents • Segment documents • Many other text mining tasks • Represent each document as a bag of words: all the words within a document are exchangeable • Probabilistic approach
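As a quick illustration of the bag-of-words assumption, the sketch below (plain Python; the sentence is an invented example) reduces a document to exchangeable word counts:

```python
from collections import Counter

# Bag-of-words: a document is reduced to word counts; word order is discarded.
doc = "criticism of government response to the hurricane and the storm"
bow = Counter(doc.lower().split())
print(bow)  # Counter({'the': 2, 'criticism': 1, 'of': 1, ...})
```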
Topic: Multinomial Distribution over Words • A document is modeled as a sample of mixed topics • Topic 1: government 0.3, response 0.2, … • Topic 2: city 0.2, new 0.1, orleans 0.05, … • Topic 3: donate 0.1, relief 0.05, help 0.02, … • Example document (segments bracketed by topic): [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. • How can we discover these topic word distributions from a corpus? (Example from ChengXiang Zhai's lecture notes)
Routine of Generative Models • Model design: assume the documents are generated by a certain process • Model inference: fit the model to the observed documents to recover the unknown parameters (Figure: a generative process with unknown parameters produces the observed corpus) • Two representative models: pLSA and LDA
Probabilistic Latent Semantic Analysis (pLSA) [Hofmann 99] • Topics: multinomial distributions over words, e.g., Topic 1: government 0.3, response 0.2, …; Topic 2: donate 0.1, relief 0.05, … • Documents: multinomial distributions over topics, e.g., Doc 1: (.3, .3, .4); Doc 2: (.2, .5, .3) • Generative process: generate each token in each document according to these distributions
pLSA – Model Design • Topics: multinomial distributions over words φ_z • Documents: multinomial distributions over topics θ_d • To generate a token in document d: 1. sample a topic label z according to θ_d (e.g., z = 1); 2. sample a word w according to φ_z (e.g., w = government); a sketch of this two-step process follows below
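A minimal runnable sketch of the two-step pLSA sampling process (numpy; the vocabulary and probability values are toy numbers, not taken from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (illustrative only):
# theta_d = topic mixture of one document, phi[z] = word distribution of topic z
vocab = ["government", "response", "city", "donate", "relief"]
phi = np.array([[0.50, 0.40, 0.05, 0.03, 0.02],   # topic 0
                [0.05, 0.05, 0.10, 0.50, 0.30]])  # topic 1
theta_d = np.array([0.3, 0.7])

# Step 1: sample a topic label z ~ theta_d; Step 2: sample a word w ~ phi[z]
z = rng.choice(2, p=theta_d)
w = vocab[rng.choice(5, p=phi[z])]
print(z, w)
```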
pLSA – Model Inference • Question: what parameters (θ_d for every document d, φ_z for every topic z) are most likely to generate the observed corpus? • The generative process is the same as above, but the topic word probabilities and document topic mixtures are now unknown and must be estimated from the corpus
pLSA – Model Inference using Expectation-Maximization (EM) • Exact maximum likelihood is hard => approximate optimization with EM • E-step: fix θ and φ, estimate topic labels for every token in every document • M-step: use the estimated topic labels to re-estimate θ and φ • Guaranteed to converge to a stationary point, but not guaranteed optimal
How the EM Algorithm Works • E-step: for each token w in each document d (e.g., response, criticism, …, government, hurricane in d1 … dD), apply Bayes' rule to compute the posterior p(z | d, w) ∝ θ_{d,z} · φ_{z,w} • M-step: sum the resulting fractional counts to re-estimate θ (e.g., Doc: .3 .3 .4) and φ (e.g., government 0.3, response 0.2, …) • A compact sketch of both steps follows below
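A minimal dense-matrix EM for pLSA (numpy; function and variable names are mine, and a real implementation would use sparse counts and a convergence check):

```python
import numpy as np

def plsa_em(N, K, iters=100, seed=0):
    """Minimal pLSA EM. N is a D x V document-word count matrix."""
    rng = np.random.default_rng(seed)
    D, V = N.shape
    theta = rng.dirichlet(np.ones(K), size=D)  # p(z|d), D x K
    phi = rng.dirichlet(np.ones(V), size=K)    # p(w|z), K x V
    for _ in range(iters):
        # E-step: Bayes rule gives p(z|d,w) for every (d, w) pair, D x V x K
        post = theta[:, None, :] * phi.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: sum fractional counts n(d,w) * p(z|d,w) to re-estimate
        frac = N[:, :, None] * post
        phi = frac.sum(axis=0).T
        phi /= phi.sum(axis=1, keepdims=True)
        theta = frac.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, phi
```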
Analysis of pLSA • Pros: simple, only one hyperparameter k; easy to incorporate priors in the EM algorithm • Cons: high model complexity -> prone to overfitting; the EM solution is neither optimal nor unique
Latent Dirichlet Allocation (LDA) [Blei et al. 02] • Impose Dirichlet priors on the model parameters -> a Bayesian version of pLSA, to mitigate overfitting • Generative process: first generate θ_d (and φ_z) from the Dirichlet priors, then generate each token in each document exactly as in pLSA (see the sketch below)
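A sketch of the LDA generative story for one toy document (numpy; the sizes and hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_tokens = 3, 5, 8      # topics, vocabulary size, document length (toy)
alpha, beta = 0.5, 0.5        # symmetric Dirichlet hyperparameters (toy)

phi = rng.dirichlet(np.full(V, beta), size=K)  # K topic-word distributions
theta = rng.dirichlet(np.full(K, alpha))       # topic mixture for one document
# Per-token sampling is the same as in pLSA, but theta and phi themselves
# were drawn from Dirichlet priors.
doc = [rng.choice(V, p=phi[rng.choice(K, p=theta)]) for _ in range(n_tokens)]
print(doc)  # a list of word ids
```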
LDA – Model Inference • Maximum likelihood: aims to find parameters that maximize the likelihood; exact inference is intractable; approximate inference: variational EM [Blei et al. 03], Markov chain Monte Carlo (MCMC) – collapsed Gibbs sampler [Griffiths & Steyvers 04] • Method of moments: aims to find parameters that fit the moments (expectations of patterns); exact inference is tractable; tensor orthogonal decomposition [Anandkumar et al. 12], scalable tensor orthogonal decomposition [Wang et al. 14a]
MCMC – Collapsed Gibbs Sampler [Griffiths & Steyvers 04] • Sample each topic label z_i conditioned on all other labels z_{-i}, sweeping repeatedly over every token in every document (Iter 1, Iter 2, …, Iter 1000) • θ and φ are then estimated from the sampled topic assignments (a sketch follows below)
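A compact collapsed Gibbs sampler for LDA (numpy; this is the standard count-based update, with my own function and variable names):

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    z = [[int(rng.integers(K)) for _ in d] for d in docs]  # random init
    ndk = np.zeros((len(docs), K))  # doc-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this token's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # sample z_i conditioned on all other assignments z_{-i}
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
    return z, phi
```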
Method of Moments [Anandkumar et al. 12, Wang et al. 14a] • Instead of asking which parameters are most likely to generate the observed corpus, ask: which parameters fit the empirical moments? • Moments: expectations of patterns – length 1 (single word), length 2 (word pair), length 3 (word triple); see the sketch below
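A sketch of how the length-1 and length-2 empirical moments could be tallied from word-id documents (numpy; dense V x V storage for clarity only; real code keeps these sparse, and the LDA moment formulas further combine such raw counts with hyperparameter-dependent correction terms):

```python
import numpy as np

def empirical_moments(docs, V):
    """Raw length-1 and length-2 pattern frequencies (toy dense version).
    M1[w]: frequency of word w; M2[u, v]: frequency of the ordered pair
    (u, v) at two distinct positions of the same document."""
    M1, M2 = np.zeros(V), np.zeros((V, V))
    n1 = n2 = 0
    for doc in docs:
        c = np.bincount(doc, minlength=V).astype(float)
        n = len(doc)
        M1 += c
        n1 += n
        # all ordered position pairs = outer(c, c), minus same-position pairs
        M2 += np.outer(c, c) - np.diag(c)
        n2 += n * (n - 1)
    return M1 / n1, M2 / n2
```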
Guaranteed Topic Recovery • Theorem. The patterns up to length 3 are sufficient for topic recovery • The moments form a length-V vector (length 1), a V × V matrix (length-2 pairs), and a V × V × V tensor (length-3 triples); V: vocabulary size, k: topic number
Tensor Orthogonal Decomposition for LDA [Anandkumar et al. 12] • Pipeline: normalized pattern counts from the input corpus (V × V matrix, V × V × V tensor) -> eigen-decomposition -> a small k × k × k tensor -> orthogonal decomposition recovers the topic word distributions (e.g., government 0.3, response 0.2, …; donate 0.1, relief 0.05, …) • V: vocabulary size; k: topic number • The core tensor power iteration is sketched below
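The heart of the method is the power iteration on the small (whitened) k × k × k tensor. A sketch under the assumption that T is already symmetric and orthogonally decomposable (numpy; in the full algorithm each recovered eigenpair is deflated out and mapped back through the whitening transform to a topic word distribution):

```python
import numpy as np

def tensor_power_method(T, n_restarts=10, n_iters=100, seed=0):
    """Find one robust eigenpair of a symmetric k x k x k tensor T via the
    iteration v <- T(I, v, v) / ||T(I, v, v)||, with random restarts."""
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    best_v, best_lam = None, -np.inf
    for _ in range(n_restarts):
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = np.einsum('ijk,j,k->i', T, v, v)  # apply T along two modes
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue estimate
        if lam > best_lam:
            best_v, best_lam = v, lam
    return best_lam, best_v
```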
Tensor Orthogonal Decomposition for LDA – Not Scalable • The V × V × V normalized pattern count tensor is prohibitive to compute: the full tensor alone has V³ entries • Time and space costs grow with the vocabulary size V, the topic number k, the number of tokens L, and the average document length l
Scalable Tensor Orthogonal Decomposition [Wang et al. 14a] • Two scans of the input corpus: the 1st scan builds the normalized pattern counts, which are sparse & low-rank, hence decomposable; the 2nd scan constructs the small k × k × k tensor directly, never materializing the V × V × V tensor • Time and space now scale with the number of nonzero pattern counts rather than with V³
Speedup 1: Eigen-Decomposition • 1. Eigen-decomposition of the sparse V × V part of the pattern counts, yielding V × k eigenvectors • 2. Eigen-decomposition of a small k × k matrix in the reduced space • Together these replace a dense V × V eigen-decomposition
Speedup 2: Construction of the Small Tensor • The dense k × k × k tensor is assembled directly from the sparse pattern counts and the k-dimensional projections, avoiding the dense V × V × V tensor entirely (a sketch of the sparse eigen-decomposition idea follows below)
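To illustrate why Speedup 1 pays off, the sketch below computes only the top-k eigenpairs of a large sparse symmetric matrix with scipy, never forming a dense V × V array (the random matrix here merely stands in for the sparse part of the pattern counts; the actual algorithm also handles the low-rank correction terms):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

V, k = 10000, 20
# Stand-in for the sparse part of the V x V pattern-count matrix.
A = sparse_random(V, V, density=1e-4, random_state=0)
A = (A + A.T) / 2                  # symmetrize
vals, vecs = eigsh(A, k=k)         # top-k eigenpairs, sparse-aware
print(vals.shape, vecs.shape)      # (20,) (10000, 20)
```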
Efficiency Comparison • STOD – scalable tensor orthogonal decomposition; TOD – tensor orthogonal decomposition; Gibbs sampling – collapsed Gibbs sampling • STOD is 20–3000 times faster: two scans vs. thousands of scans (Figure: running times on synthetic data and on real data with L = 19M and L = 39M tokens)
Effectiveness: STOD = TOD > Gibbs Sampling • Recovery error on synthetic data: error is low when the sample is large enough, and variance is almost 0 • Coherence on real data (CS and News corpora): coherence is high
Summary of LDA Model Inference • Maximum likelihood: approximate inference; slow, scans the data thousands of times; large variance, no theoretic guarantee; numerous follow-up works – further approximation [Porteous et al. 08, Yao et al. 09, Hoffman et al. 12], parallelization [Newman et al. 09], online learning [Hoffman et al. 13], etc. • Method of moments: STOD [Wang et al. 14a]; fast, scans the data twice; robust recovery with theoretic guarantee; new and promising!
Flat Topics -> Hierarchical Topics • In pLSA and LDA, a topic is selected from a flat pool of topics • In hierarchical topic models, a topic is selected from a hierarchy, e.g., a root o (CS) with children o/1, o/2 (such as information technology & systems) and grandchildren o/1/1, o/1/2, o/2/1, o/2/2 (such as IR and DB) • To generate a token in document d: sample a topic label z from the hierarchy according to θ_d, then sample a word w according to φ_z (a sketch follows below)
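A toy illustration of sampling a token when the topic label is a node in a hierarchy rather than a flat index (numpy; the tree, vocabulary, and probabilities are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data", "mining", "retrieval", "query"]
# Toy topic tree: each node carries its own word distribution.
tree = {
    "o":   [0.25, 0.25, 0.25, 0.25],   # root: generic CS words
    "o/1": [0.10, 0.10, 0.50, 0.30],   # e.g., IR-flavored
    "o/2": [0.50, 0.40, 0.05, 0.05],   # e.g., DB/DM-flavored
}
theta_d = {"o": 0.2, "o/1": 0.3, "o/2": 0.5}  # document's mix over tree nodes

# Step 1: pick a tree node; Step 2: pick a word from that node's distribution
nodes = list(theta_d)
node = nodes[rng.choice(len(nodes), p=np.array(list(theta_d.values())))]
word = vocab[rng.choice(len(vocab), p=np.array(tree[node]))]
print(node, word)
```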
Hierarchical Topic Models • Topics form a tree structure (nodes o; o/1, o/2; o/1/1, o/1/2, o/2/1, o/2/2): nested Chinese Restaurant Process [Griffiths et al. 04], recursive Chinese Restaurant Process [Kim et al. 12a], LDA with Topic Tree [Wang et al. 14b] • Topics form a DAG structure: Pachinko Allocation [Li & McCallum 06], hierarchical Pachinko Allocation [Mimno et al. 07], nested Chinese Restaurant Franchise [Ahmed et al. 13] • DAG: directed acyclic graph
Hierarchical Topic Model Inference • Maximum likelihood (most popular): exact inference is intractable; approximate inference via variational inference or MCMC; non-recursive – all the topics are inferred at once • Method of moments: Scalable Tensor Recursive Orthogonal Decomposition (STROD) [Wang et al. 14b]; fast and robust recovery with theoretic guarantee; recursive method – only for the LDA with Topic Tree model
LDA with Topic Tree [Wang et al. 14b] • Each tree node (o; o/1, o/2; o/1/1, o/1/2, o/2/1, o/2/2) carries a word distribution; each document carries topic distributions over the tree, with Dirichlet priors (Figure: plate diagram, replicated over #docs and #words in d)
Recursive Inference for LDA with Topic Tree [Wang et al. 14b] • A large tree subsumes a smaller tree with shared model parameters, so topics can be inferred recursively, following the tree's order, instead of all at once • Easy to revise the tree structure • Flexible to decide when to terminate (a simplified top-down sketch follows below)
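A deliberately simplified top-down recursion, only to show the control flow (sklearn's flat LDA as the per-node learner, hard-partitioning documents by dominant topic; STROD itself reuses moment statistics across the tree and does not partition documents, so treat this as a structural analogy, not the paper's algorithm):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def recursive_topics(X, k, depth, min_docs=50):
    """Fit k flat topics at this node, then recurse on the documents
    dominated by each topic. X is a doc-term count matrix."""
    if depth == 0 or X.shape[0] < min_docs:
        return None
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    dominant = lda.transform(X).argmax(axis=1)  # hard assignment (simplistic)
    children = [recursive_topics(X[dominant == t], k, depth - 1, min_docs)
                for t in range(k)]
    return {"model": lda, "children": children}
```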
Scalable Tensor Recursive Orthogonal Decomposition [Wang et al. 14b] • Theorem. STROD ensures robust recovery and revision • For each tree node t, the normalized pattern counts for t are reduced to a small k × k × k tensor, whose decomposition recovers the word distributions of t's child topics (e.g., government 0.3, response 0.2, …; donate 0.1, relief 0.05, …)