Bringing Structure to Text Jiawei Han, Chi Wang and Ahmed El-Kishky Computer Science, University of Illinois at Urbana-Champaign August 24, 2014
Outline • Introduction to bringing structure to text • Mining phrase-based and entity-enriched topical hierarchies • Heterogeneous information network construction and mining • Trends and research problems
Motivation of Bringing Structure to Text • The prevalence of unstructured data • Structures are useful for knowledge discovery • Too expensive to be structured by humans: needs automated & scalable methods "Up to 85% of all information is unstructured" -- estimated by industry analysts "Vast majority of the CEOs expressed frustration over their organization's inability to glean insights from available data" -- IBM study with 1,500+ CEOs
Information Overload: A Critical Problem in the Big Data Era • By 2020, information will double every 73 days -- G. Starkweather (Microsoft), 1992 • Unstructured or loosely structured data are prevalent (Figure: information growth over time)
Example: Research Publications • Every year, hundreds of thousands of papers are published • Unstructured data: paper text • Loosely structured entities: authors, venues (Figure: network linking papers, authors, and venues)
Example: News Articles • Every day, >90,000 news articles are produced • Unstructured data: news content • Extracted entities: persons, locations, organizations, … (Figure: network linking news articles to persons, locations, and organizations)
Example: Social Media • Every second, >150K tweets are sent out • Unstructured data: tweet content • Loosely structured entities: Twitter users, hashtags, URLs, … (Figure: tweets linked to hashtags such as #maythefourthbewithyou, Twitter users such as DarthVader, and URLs such as The White House)
Text-Attached Information Network for Unstructured and Loosely-Structured Data (Figure: text documents – papers, news, tweets – attached to entities, given or extracted: authors, venues, persons, locations, organizations, Twitter users, hashtags, URLs)
What Power Can We Gain if More Structures Can Be Discovered? • Structured database queries • Information network analysis, …
Structures Facilitate Multi-Dimensional Analysis: An EventCube Experiment
Distribution along Multiple Dimensions • Query 'health care bill' in news data
Entity Analysis and Profiling • Topic distribution for "Stanford University"
Structures Facilitate Heterogeneous Information Network Analysis • Real-world data: multiple object types and/or multiple link types • Examples: the Facebook network; the DBLP bibliographic network (papers, authors, venues); the IMDB movie network (movies, directors, actors, studios)
What Can Be Mined in Structured Information Networks • Example: DBLP: A Computer Science bibliographic database
Useful Structure from Text: Phrases, Topics, Entities • Top 10 active politicians and phrases regarding healthcare issues? • Top 10 researchers and phrases in data mining, and their specializations? (Figure: text linked to phrases, entities, and hierarchical topics)
Outline • Introduction to bringing structure to text • Mining phrase-based and entity-enriched topical hierarchies • Heterogeneous information network construction and mining • Trends and research problems
Topic Hierarchy: Summarize the Data at Multiple Granularities • Top 10 researchers in data mining? • And their specializations? • Important research areas in the SIGIR conference? (Figure: network of papers, authors, and venues)
Bag-of-Words Topic Modeling • Widely studied technique for text analysis • Summarize themes/aspects • Facilitate navigation/browsing • Retrieve documents • Segment documents • Many other text mining tasks • Represent each document as a bag of words: all the words within a document are exchangeable • Probabilistic approach
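As a quick illustration of the bag-of-words assumption, the sketch below (plain Python; the sentence is an invented example) reduces a document to exchangeable word counts:

```python
from collections import Counter

# Bag-of-words: a document is reduced to word counts; word order is discarded.
doc = "criticism of government response to the hurricane and the storm"
bow = Counter(doc.lower().split())
print(bow)  # Counter({'the': 2, 'criticism': 1, 'of': 1, ...})
```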
Topic: Multinomial Distribution over Words • A document is modeled as a sample of mixed topics • Topic 1: government 0.3, response 0.2, … • Topic 2: city 0.2, new 0.1, orleans 0.05, … • Topic 3: donate 0.1, relief 0.05, help 0.02, … • Example document (segments bracketed by topic): [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. • How can we discover these topic word distributions from a corpus? (Example from ChengXiang Zhai's lecture notes)
Routine of Generative Models • Model design: assume the documents are generated by a certain process • Model inference: fit the model to the observed documents to recover the unknown parameters (Figure: a generative process with unknown parameters produces the observed corpus) • Two representative models: pLSA and LDA
Probabilistic Latent Semantic Analysis (pLSA) [Hofmann 99] • Topics: multinomial distributions over words, e.g., Topic 1: government 0.3, response 0.2, …; Topic 2: donate 0.1, relief 0.05, … • Documents: multinomial distributions over topics, e.g., Doc 1: (.3, .3, .4); Doc 2: (.2, .5, .3) • Generative process: generate each token in each document according to these distributions
pLSA – Model Design • Topics: multinomial distributions over words φ_z • Documents: multinomial distributions over topics θ_d • To generate a token in document d: 1. sample a topic label z according to θ_d (e.g., z = 1); 2. sample a word w according to φ_z (e.g., w = government); a sketch of this two-step process follows below
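A minimal runnable sketch of the two-step pLSA sampling process (numpy; the vocabulary and probability values are toy numbers, not taken from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (illustrative only):
# theta_d = topic mixture of one document, phi[z] = word distribution of topic z
vocab = ["government", "response", "city", "donate", "relief"]
phi = np.array([[0.50, 0.40, 0.05, 0.03, 0.02],   # topic 0
                [0.05, 0.05, 0.10, 0.50, 0.30]])  # topic 1
theta_d = np.array([0.3, 0.7])

# Step 1: sample a topic label z ~ theta_d; Step 2: sample a word w ~ phi[z]
z = rng.choice(2, p=theta_d)
w = vocab[rng.choice(5, p=phi[z])]
print(z, w)
```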
pLSA – Model Inference • Question: what parameters (θ_d for every document d, φ_z for every topic z) are most likely to generate the observed corpus? • The generative process is the same as above, but the topic word probabilities and document topic mixtures are now unknown and must be estimated from the corpus
pLSA – Model Inference using Expectation-Maximization (EM) • Exact maximum likelihood is hard => approximate optimization with EM • E-step: fix θ and φ, estimate topic labels for every token in every document • M-step: use the estimated topic labels to re-estimate θ and φ • Guaranteed to converge to a stationary point, but not guaranteed optimal
How the EM Algorithm Works • E-step: for each token w in each document d (e.g., response, criticism, …, government, hurricane in d1 … dD), apply Bayes' rule to compute the posterior p(z | d, w) ∝ θ_{d,z} · φ_{z,w} • M-step: sum the resulting fractional counts to re-estimate θ (e.g., Doc: .3 .3 .4) and φ (e.g., government 0.3, response 0.2, …) • A compact sketch of both steps follows below
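A minimal dense-matrix EM for pLSA (numpy; function and variable names are mine, and a real implementation would use sparse counts and a convergence check):

```python
import numpy as np

def plsa_em(N, K, iters=100, seed=0):
    """Minimal pLSA EM. N is a D x V document-word count matrix."""
    rng = np.random.default_rng(seed)
    D, V = N.shape
    theta = rng.dirichlet(np.ones(K), size=D)  # p(z|d), D x K
    phi = rng.dirichlet(np.ones(V), size=K)    # p(w|z), K x V
    for _ in range(iters):
        # E-step: Bayes rule gives p(z|d,w) for every (d, w) pair, D x V x K
        post = theta[:, None, :] * phi.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: sum fractional counts n(d,w) * p(z|d,w) to re-estimate
        frac = N[:, :, None] * post
        phi = frac.sum(axis=0).T
        phi /= phi.sum(axis=1, keepdims=True)
        theta = frac.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, phi
```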
Analysis of pLSA • Pros: simple, only one hyperparameter k; easy to incorporate priors in the EM algorithm • Cons: high model complexity -> prone to overfitting; the EM solution is neither optimal nor unique
Latent Dirichlet Allocation (LDA) [Blei et al. 02] • Impose Dirichlet priors on the model parameters -> a Bayesian version of pLSA, to mitigate overfitting • Generative process: first generate θ_d (and φ_z) from the Dirichlet priors, then generate each token in each document exactly as in pLSA (see the sketch below)
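A sketch of the LDA generative story for one toy document (numpy; the sizes and hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_tokens = 3, 5, 8      # topics, vocabulary size, document length (toy)
alpha, beta = 0.5, 0.5        # symmetric Dirichlet hyperparameters (toy)

phi = rng.dirichlet(np.full(V, beta), size=K)  # K topic-word distributions
theta = rng.dirichlet(np.full(K, alpha))       # topic mixture for one document
# Per-token sampling is the same as in pLSA, but theta and phi themselves
# were drawn from Dirichlet priors.
doc = [rng.choice(V, p=phi[rng.choice(K, p=theta)]) for _ in range(n_tokens)]
print(doc)  # a list of word ids
```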
LDA – Model Inference • Maximum likelihood: aims to find parameters that maximize the likelihood; exact inference is intractable; approximate inference: variational EM [Blei et al. 03], Markov chain Monte Carlo (MCMC) – collapsed Gibbs sampler [Griffiths & Steyvers 04] • Method of moments: aims to find parameters that fit the moments (expectations of patterns); exact inference is tractable; tensor orthogonal decomposition [Anandkumar et al. 12], scalable tensor orthogonal decomposition [Wang et al. 14a]
MCMC – Collapsed Gibbs Sampler [Griffiths & Steyvers 04] • Sample each topic label z_i conditioned on all other labels z_{-i}, sweeping repeatedly over every token in every document (Iter 1, Iter 2, …, Iter 1000) • θ and φ are then estimated from the sampled topic assignments (a sketch follows below)
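A compact collapsed Gibbs sampler for LDA (numpy; this is the standard count-based update, with my own function and variable names):

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    z = [[int(rng.integers(K)) for _ in d] for d in docs]  # random init
    ndk = np.zeros((len(docs), K))  # doc-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this token's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # sample z_i conditioned on all other assignments z_{-i}
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
    return z, phi
```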
Method of Moments [Anandkumar et al. 12, Wang et al. 14a] • Instead of asking which parameters are most likely to generate the observed corpus, ask: which parameters fit the empirical moments? • Moments: expectations of patterns – length 1 (single word), length 2 (word pair), length 3 (word triple); see the sketch below
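A sketch of how the length-1 and length-2 empirical moments could be tallied from word-id documents (numpy; dense V x V storage for clarity only; real code keeps these sparse, and the LDA moment formulas further combine such raw counts with hyperparameter-dependent correction terms):

```python
import numpy as np

def empirical_moments(docs, V):
    """Raw length-1 and length-2 pattern frequencies (toy dense version).
    M1[w]: frequency of word w; M2[u, v]: frequency of the ordered pair
    (u, v) at two distinct positions of the same document."""
    M1, M2 = np.zeros(V), np.zeros((V, V))
    n1 = n2 = 0
    for doc in docs:
        c = np.bincount(doc, minlength=V).astype(float)
        n = len(doc)
        M1 += c
        n1 += n
        # all ordered position pairs = outer(c, c), minus same-position pairs
        M2 += np.outer(c, c) - np.diag(c)
        n2 += n * (n - 1)
    return M1 / n1, M2 / n2
```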
Guaranteed Topic Recovery • Theorem. The patterns up to length 3 are sufficient for topic recovery • The moments form a length-V vector (length 1), a V × V matrix (length-2 pairs), and a V × V × V tensor (length-3 triples); V: vocabulary size, k: topic number
Tensor Orthogonal Decomposition for LDA [Anandkumar et al. 12] • Pipeline: normalized pattern counts from the input corpus (V × V matrix, V × V × V tensor) -> eigen-decomposition -> a small k × k × k tensor -> orthogonal decomposition recovers the topic word distributions (e.g., government 0.3, response 0.2, …; donate 0.1, relief 0.05, …) • V: vocabulary size; k: topic number • The core tensor power iteration is sketched below
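The heart of the method is the power iteration on the small (whitened) k × k × k tensor. A sketch under the assumption that T is already symmetric and orthogonally decomposable (numpy; in the full algorithm each recovered eigenpair is deflated out and mapped back through the whitening transform to a topic word distribution):

```python
import numpy as np

def tensor_power_method(T, n_restarts=10, n_iters=100, seed=0):
    """Find one robust eigenpair of a symmetric k x k x k tensor T via the
    iteration v <- T(I, v, v) / ||T(I, v, v)||, with random restarts."""
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    best_v, best_lam = None, -np.inf
    for _ in range(n_restarts):
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = np.einsum('ijk,j,k->i', T, v, v)  # apply T along two modes
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue estimate
        if lam > best_lam:
            best_v, best_lam = v, lam
    return best_lam, best_v
```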
Tensor Orthogonal Decomposition for LDA – Not Scalable • The V × V × V normalized pattern count tensor is prohibitive to compute: the full tensor alone has V³ entries • Time and space costs grow with the vocabulary size V, the topic number k, the number of tokens L, and the average document length l
Scalable Tensor Orthogonal Decomposition [Wang et al. 14a] • Two scans of the input corpus: the 1st scan builds the normalized pattern counts, which are sparse & low-rank, hence decomposable; the 2nd scan constructs the small k × k × k tensor directly, never materializing the V × V × V tensor • Time and space now scale with the number of nonzero pattern counts rather than with V³
Speedup 1: Eigen-Decomposition • 1. Eigen-decomposition of the sparse V × V part of the pattern counts, yielding V × k eigenvectors • 2. Eigen-decomposition of a small k × k matrix in the reduced space • Together these replace a dense V × V eigen-decomposition
Speedup 2: Construction of the Small Tensor • The dense k × k × k tensor is assembled directly from the sparse pattern counts and the k-dimensional projections, avoiding the dense V × V × V tensor entirely (a sketch of the sparse eigen-decomposition idea follows below)
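To illustrate why Speedup 1 pays off, the sketch below computes only the top-k eigenpairs of a large sparse symmetric matrix with scipy, never forming a dense V × V array (the random matrix here merely stands in for the sparse part of the pattern counts; the actual algorithm also handles the low-rank correction terms):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

V, k = 10000, 20
# Stand-in for the sparse part of the V x V pattern-count matrix.
A = sparse_random(V, V, density=1e-4, random_state=0)
A = (A + A.T) / 2                  # symmetrize
vals, vecs = eigsh(A, k=k)         # top-k eigenpairs, sparse-aware
print(vals.shape, vecs.shape)      # (20,) (10000, 20)
```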
Efficiency Comparison • STOD – scalable tensor orthogonal decomposition; TOD – tensor orthogonal decomposition; Gibbs sampling – collapsed Gibbs sampling • STOD is 20–3000 times faster: two scans vs. thousands of scans (Figure: running times on synthetic data and on real data with L = 19M and L = 39M tokens)
Effectiveness: STOD = TOD > Gibbs Sampling • Recovery error on synthetic data: error is low when the sample is large enough, and variance is almost 0 • Coherence on real data (CS and News corpora): coherence is high
Summary of LDA Model Inference • Maximum likelihood: approximate inference; slow, scans the data thousands of times; large variance, no theoretic guarantee; numerous follow-up works – further approximation [Porteous et al. 08, Yao et al. 09, Hoffman et al. 12], parallelization [Newman et al. 09], online learning [Hoffman et al. 13], etc. • Method of moments: STOD [Wang et al. 14a]; fast, scans the data twice; robust recovery with theoretic guarantee; new and promising!
Flat Topics -> Hierarchical Topics • In pLSA and LDA, a topic is selected from a flat pool of topics • In hierarchical topic models, a topic is selected from a hierarchy, e.g., a root o (CS) with children o/1, o/2 (such as information technology & systems) and grandchildren o/1/1, o/1/2, o/2/1, o/2/2 (such as IR and DB) • To generate a token in document d: sample a topic label z from the hierarchy according to θ_d, then sample a word w according to φ_z (a sketch follows below)
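A toy illustration of sampling a token when the topic label is a node in a hierarchy rather than a flat index (numpy; the tree, vocabulary, and probabilities are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data", "mining", "retrieval", "query"]
# Toy topic tree: each node carries its own word distribution.
tree = {
    "o":   [0.25, 0.25, 0.25, 0.25],   # root: generic CS words
    "o/1": [0.10, 0.10, 0.50, 0.30],   # e.g., IR-flavored
    "o/2": [0.50, 0.40, 0.05, 0.05],   # e.g., DB/DM-flavored
}
theta_d = {"o": 0.2, "o/1": 0.3, "o/2": 0.5}  # document's mix over tree nodes

# Step 1: pick a tree node; Step 2: pick a word from that node's distribution
nodes = list(theta_d)
node = nodes[rng.choice(len(nodes), p=np.array(list(theta_d.values())))]
word = vocab[rng.choice(len(vocab), p=np.array(tree[node]))]
print(node, word)
```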
Hierarchical Topic Models • Topics form a tree structure (nodes o; o/1, o/2; o/1/1, o/1/2, o/2/1, o/2/2): nested Chinese Restaurant Process [Griffiths et al. 04], recursive Chinese Restaurant Process [Kim et al. 12a], LDA with Topic Tree [Wang et al. 14b] • Topics form a DAG structure: Pachinko Allocation [Li & McCallum 06], hierarchical Pachinko Allocation [Mimno et al. 07], nested Chinese Restaurant Franchise [Ahmed et al. 13] • DAG: directed acyclic graph
Hierarchical Topic Model Inference • Maximum likelihood (most popular): exact inference is intractable; approximate inference via variational inference or MCMC; non-recursive – all the topics are inferred at once • Method of moments: Scalable Tensor Recursive Orthogonal Decomposition (STROD) [Wang et al. 14b]; fast and robust recovery with theoretic guarantee; recursive method – only for the LDA with Topic Tree model
LDA with Topic Tree [Wang et al. 14b] • Each tree node (o; o/1, o/2; o/1/1, o/1/2, o/2/1, o/2/2) carries a word distribution; each document carries topic distributions over the tree, with Dirichlet priors (Figure: plate diagram, replicated over #docs and #words in d)
Recursive Inference for LDA with Topic Tree [Wang et al. 14b] • A large tree subsumes a smaller tree with shared model parameters, so topics can be inferred recursively, following the tree's order, instead of all at once • Easy to revise the tree structure • Flexible to decide when to terminate (a simplified top-down sketch follows below)
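A deliberately simplified top-down recursion, only to show the control flow (sklearn's flat LDA as the per-node learner, hard-partitioning documents by dominant topic; STROD itself reuses moment statistics across the tree and does not partition documents, so treat this as a structural analogy, not the paper's algorithm):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def recursive_topics(X, k, depth, min_docs=50):
    """Fit k flat topics at this node, then recurse on the documents
    dominated by each topic. X is a doc-term count matrix."""
    if depth == 0 or X.shape[0] < min_docs:
        return None
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    dominant = lda.transform(X).argmax(axis=1)  # hard assignment (simplistic)
    children = [recursive_topics(X[dominant == t], k, depth - 1, min_docs)
                for t in range(k)]
    return {"model": lda, "children": children}
```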
Scalable Tensor Recursive Orthogonal Decomposition [Wang et al. 14b] • Theorem. STROD ensures robust recovery and revision • For each tree node t, the normalized pattern counts for t are reduced to a small k × k × k tensor, whose decomposition recovers the word distributions of t's child topics (e.g., government 0.3, response 0.2, …; donate 0.1, relief 0.05, …)