Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms • Amr Ahmed • Thesis Defense
The Infosphere • News Sources • Social Media • Research Publications • Example content: "President Obama had an accident while playing a basketball match"
The Infosphere • Example content: "President Obama had an accident while playing a basketball match" • Online inference • Soccer • Car deals • Fashion
Thesis question: How to model users and content? • News Sources • Social Media • Research Publications • Example content: "President Obama had an accident while playing a basketball match" • Online inference • Soccer • Car deals • Fashion
Questions • What do we mean by content? • What characterizes users and content?
Research Publications: arXiv, PubMed Central, conference proceedings, journal transactions • Social Media: blogs, RedState, Daily Kos • News: Yahoo! News, Google News, CNN, BBC
Multi-faceted nature • Temporal dynamics [Figure: research areas (Phy, Bio, CS) over time; news over time: drill explosion, BP: "We will make this right.", "BP wasn't prepared for an oil spill at such depths"; opposing perspectives: "Choice is a fundamental, constitutional right" vs. "Ban abortion with Constitutional amendment"]
What Characterizes Users? • Long-term interests • Baseball • Graphical models • Music • Short-term interests • Buying a car • Getting a new camera • Spurious interests • What is the buzz about the oil spill?
Thesis Question • How to build a structured representation of users and content? • Temporal Dynamics • How ideas/events evolve over time • How user interests change over time • Structural Correspondence • How ideas are addressed across modalities and communities • How to learn user interests from multimodal sources
Thesis Approach • Models • Probabilistic graphical models • Topic models and Non-parametric Bayes • Principled, expressive and modular • Algorithms • Distributed • To deal with large-scale datasets • Online • To update the representation with new data
Outline • Background • Mixed-membership Models • Recurrent Chinese Restaurant Process • Modeling Temporal Dynamics • News • Research publications • User intents • Modeling multi-faceted Content • Ideological Perspective
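What is a Good Model for Documents? • Clustering • Mixture of unigram model • How to specify a model? • Generative process • Assume some hidden variables • Use them to generate documents • Inference • Invert the process • Given documents, infer the hidden variables [Plate diagram: mixing proportions $\pi$, cluster indicator $c_i$, document $w_i$, plate over $N$ documents; component distributions $\phi$, plate over $K$]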
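Mixture of Unigram [Plate diagram: mixing proportions $\pi = (\pi_1, \dots, \pi_K)$, components $\phi_1, \dots, \phi_K$, cluster indicator $c_i$, document $w_i$, plate over $N$ documents] Generative Process • For document $w_i$ • Sample $c_i \sim \mathrm{Multi}(\pi)$ • Sample $w_i \sim \mathrm{Multi}(\phi_{c_i})$ Is this a good model for documents? • When is this a good model for documents? • When documents are single-topic • Not true in our settings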
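The generative process on this slide can be sketched in a few lines of Python. This is only an illustration: the function name, the toy vocabulary, and the parameter values are hypothetical and not taken from the thesis. It draws one cluster indicator per document and then draws every word from that single component, which is why the model only fits single-topic documents.

```python
import random

def sample_mixture_of_unigrams(pi, phi, doc_length, rng):
    """Generate one document: pick one cluster, then draw all words from it."""
    # c_i ~ Multi(pi): a single cluster indicator for the whole document
    c = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    vocab = list(phi[c].keys())
    probs = list(phi[c].values())
    # every word comes from the same component phi_c
    words = rng.choices(vocab, weights=probs, k=doc_length)
    return c, words

# Hypothetical toy parameters (not from the slides)
pi = [0.5, 0.5]
phi = [
    {"goal": 0.7, "team": 0.3},    # component 0
    {"tax": 0.6, "budget": 0.4},   # component 1
]
rng = random.Random(0)
cluster, doc = sample_mixture_of_unigrams(pi, phi, doc_length=5, rng=rng)
```

Because the two toy components have disjoint vocabularies, every generated document contains words from exactly one of them.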
What Do We Need to Model? • Example document: "A Hierarchical Phrase-Based Model for Statistical Machine Translation" • Abstract: "We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system." • Q: What is it about? • A: Mainly MT, with syntax, some learning • Mixing proportion: MT 0.6, Syntax 0.3, Learning 0.1 • Topics (each a unigram distribution over the vocabulary) • MT: Source, Target, SMT, Alignment, Score, BLEU • Syntax: Parse, Tree, Noun, Phrase, Grammar, CFG • Learning: likelihood, EM, Hidden, Parameters, Estimation, argMax • Topic Models
Mixed-Membership Models [Plate diagram: prior over $\theta$; mixing vector $\theta = (\theta_1, \dots, \theta_K)$; topics $\phi_1, \dots, \phi_K$; topic indicator $z$; word $w$; plates over $N$ words, $D$ documents, $K$ topics] Generative Process • For each document $d$ • Sample $\theta_d \sim \mathrm{Prior}$ • For each word $w$ in $d$ • Sample $z \sim \mathrm{Multi}(\theta_d)$ • Sample $w \sim \mathrm{Multi}(\phi_z)$ • Example document: "A Hierarchical Phrase-Based Model for Statistical Machine Translation" (abstract shown on the previous slide)
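A minimal sketch of this generative process follows. The slide leaves the prior unspecified; the sketch assumes a Dirichlet prior (as in LDA), and all names and toy values are hypothetical. Unlike the mixture of unigrams, each word gets its own topic indicator.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_mixed_membership_doc(alpha, phi, doc_length, rng):
    # theta_d ~ Prior (ASSUMPTION: Dirichlet; the slide only says "Prior")
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(doc_length):
        # z ~ Multi(theta_d): one topic indicator per word
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # w ~ Multi(phi_z)
        vocab = list(phi[z].keys())
        w = rng.choices(vocab, weights=list(phi[z].values()), k=1)[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical toy topics loosely echoing the slide's MT / Syntax / Learning example
phi = [
    {"alignment": 0.5, "BLEU": 0.5},
    {"parse": 0.5, "grammar": 0.5},
    {"likelihood": 0.5, "EM": 0.5},
]
rng = random.Random(0)
theta, topics, words = sample_mixed_membership_doc([1.0, 1.0, 1.0], phi, 8, rng)
```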
Outline • Background • Mixed-membership Models • Recurrent Chinese Restaurant Process • Modeling Temporal Dynamics • Research publications • News • User intents • Modeling multi-faceted Content • Ideological Perspective
Chinese Restaurant Process (CRP) • Allows the number of mixtures to grow with the data • Also called non-parametric • Means the number of effective parameters grows with the data • Still has hyper-parameters that control the rate of growth • $\alpha$: how fast a new cluster/mixture is born • $G_0$: prior over mixture component parameters
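The Chinese Restaurant Process [Figure: tables with parameters $\phi_1, \phi_2, \phi_3$] Generative Process • For data point $x_i$ • Choose table $j$ with probability $\propto N_j$ and sample $x_i \sim f(\phi_j)$ • Choose a new table $K+1$ with probability $\propto \alpha$ • Sample $\phi_{K+1} \sim G_0$ and sample $x_i \sim f(\phi_{K+1})$ • The rich-get-richer effect • CANNOT handle sequential data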
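The seating rule can be sketched as follows. This is an illustrative sketch with hypothetical names: it samples table assignments only and omits drawing the table parameters from $G_0$ and emitting the data points.

```python
import random

def crp_assignments(n_points, alpha, rng):
    """Seat n_points customers; returns (table index per customer, table counts)."""
    counts = []        # counts[j] = N_j, number of customers at table j
    assignments = []
    for _ in range(n_points):
        # existing table j has weight N_j, a new table has weight alpha
        weights = counts + [alpha]
        j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if j == len(counts):
            counts.append(1)   # a new table K+1 is opened
        else:
            counts[j] += 1     # rich-get-richer: popular tables attract more
        assignments.append(j)
    return assignments, counts

rng = random.Random(0)
assignments, counts = crp_assignments(50, alpha=1.0, rng=rng)
```

A larger `alpha` opens new tables more often, so the number of clusters grows faster with the data.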
Recurrent CRP (RCRP) [Ahmed and Xing 2008] • Adapts the number of mixture components over time • Mixture components can die out • New mixture components can be born at any time • Parameters of retained mixture components evolve according to Markovian dynamics
Recurrent CRP (RCRP) • Three equivalent constructions (see [Ahmed & Xing 2008]) • Infinite limit of a fixed-dimensional dynamic model • Recurrent Chinese restaurant • Time-dependent random measures
The Recurrent Chinese Restaurant Process • The restaurant operates in epochs • The restaurant is closed at the end of each epoch • The state of the restaurant at time epoch $t$ depends on that at time epoch $t-1$ • Can be extended to higher-order dependencies
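The Recurrent Chinese Restaurant Process, T=1 [Figure: tables $\phi_{1,1}, \phi_{2,1}, \phi_{3,1}$; $\phi_{3,1}$ is the dish eaten at table 3 at time epoch 1, i.e. the parameters of cluster 3 at time epoch 1] Generative Process • Customers at time T=1 are seated as before: • Choose table $j$ with probability $\propto N_{j,1}$ and sample $x_i \sim f(\phi_{j,1})$ • Choose a new table $K+1$ with probability $\propto \alpha$ • Sample $\phi_{K+1,1} \sim G_0$ and sample $x_i \sim f(\phi_{K+1,1})$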
The Recurrent Chinese Restaurant Process [Animation, T=1 to T=3] • End of epoch T=1: tables $\phi_{1,1}, \phi_{2,1}, \phi_{3,1}$ with counts $N_{1,1}=2$, $N_{2,1}=3$, $N_{3,1}=1$ • Epoch T=2 starts from the T=1 tables and their counts • Table 1 is reused at T=2: sample $\phi_{1,2} \sim P(\cdot \mid \phi_{1,1})$ • And so on • At the end of epoch 2: table 3 is a died-out cluster, table 4 ($\phi_{4,2}$) is a newly born cluster • Counts carried into T=3: $N_{1,2}=2$, $N_{2,2}=2$, $N_{4,2}=1$
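The epoch-by-epoch seating shown in the animation can be sketched as below for the first-order case. This is a hedged illustration with hypothetical function names: a table's weight is its count from the previous epoch plus its count so far in the current epoch, a new table has weight $\alpha$, and tables nobody chooses in an epoch die out. Parameter evolution $\phi_{k,t} \sim P(\cdot \mid \phi_{k,t-1})$ is omitted.

```python
import random

def rcrp_epoch(prev_counts, n_customers, alpha, next_id, rng):
    """Seat one epoch of customers.

    prev_counts: dict table_id -> N_{k,t-1} (counts from the previous epoch)
    Returns (dict table_id -> N_{k,t}, next unused table id).
    Tables from the previous epoch that attract no customer are absent
    from the result, i.e. they have died out.
    """
    counts = {}
    for _ in range(n_customers):
        table_ids = sorted(set(prev_counts) | set(counts))
        # weight = previous-epoch count + count so far in this epoch
        weights = [prev_counts.get(k, 0) + counts.get(k, 0) for k in table_ids]
        choice = rng.choices(table_ids + [next_id],
                             weights=weights + [alpha], k=1)[0]
        counts[choice] = counts.get(choice, 0) + 1
        if choice == next_id:
            next_id += 1          # a new cluster was born
    return counts, next_id

def rcrp(customers_per_epoch, alpha, rng):
    """Run the first-order process over several epochs."""
    history, prev, next_id = [], {}, 0
    for n in customers_per_epoch:
        prev, next_id = rcrp_epoch(prev, n, alpha, next_id, rng)
        history.append(prev)
    return history

rng = random.Random(1)
history = rcrp([6, 6, 6], alpha=1.0, rng=rng)
# Counts from the slide's T=1 state, used as the prior for one more epoch
epoch2, _ = rcrp_epoch({1: 2, 2: 3, 3: 1}, 5, 1.0, 4, rng)
```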
RCRP • Can be extended to model higher-order dependencies • Can decay dependencies over time • Pseudo-count for table $k$ at time $t$ is $\sum_{w=1}^{W} e^{-w/\lambda} N_{k,t-w}$ • $W$: history size • $N_{k,t-w}$: number of customers sitting at table $k$ at time epoch $t-w$ • $\lambda$: decay factor
[Figure: epochs T=1 to T=3; T=1 counts $N_{1,1}=2$, $N_{2,1}=3$, $N_{3,1}=1$; tables $\phi_{1,2}, \phi_{2,2}, \phi_{4,2}$ active at T=2] • Pseudo-count for table 2 at T=3: $N_{2,3} = \sum_{w=1}^{W} e^{-w/\lambda} N_{2,3-w}$
RCRP • Can be extended to model higher-order dependencies • Can decay dependencies over time • Pseudo-count for table $k$ at time $t$ is $\sum_{w=1}^{W} e^{-w/\lambda} N_{k,t-w}$ • $(W, \lambda, \alpha)$ can generate interesting clustering configurations
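The decayed pseudo-count $\sum_{w=1}^{W} e^{-w/\lambda} N_{k,t-w}$ can be computed directly. The sketch below uses a hypothetical `pseudo_count` helper and a history dictionary keyed by epoch. The counts are those shown in the animation ($N_{1,1}=2$, $N_{2,1}=3$, $N_{3,1}=1$, $N_{1,2}=2$, $N_{2,2}=2$, $N_{4,2}=1$).

```python
import math

def pseudo_count(history, k, t, W, lam):
    """Sum over w = 1..W of exp(-w / lam) * N_{k, t-w}.

    history: dict epoch -> dict table_id -> count; missing entries count as 0.
    lam must be positive; float('inf') means no decay.
    """
    total = 0.0
    for w in range(1, W + 1):
        n = history.get(t - w, {}).get(k, 0)
        total += math.exp(-w / lam) * n
    return total

# Counts taken from the slides' example
history = {
    1: {1: 2, 2: 3, 3: 1},
    2: {1: 2, 2: 2, 4: 1},
}
no_decay = pseudo_count(history, k=2, t=3, W=2, lam=float("inf"))  # 2 + 3
no_history = pseudo_count(history, k=2, t=3, W=0, lam=1.0)         # empty sum
decayed = pseudo_count(history, k=2, t=3, W=2, lam=0.4)
dead_table = pseudo_count(history, k=3, t=3, W=1, lam=0.4)         # table 3 unused at T=2
```

With $W=0$ the history is ignored, so each epoch is clustered independently. With no decay and a window covering every past epoch, all earlier counts contribute fully.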
TDPM Generative Power • DPM: $W = T$, $\lambda = \infty$ • TDPM: $W = 4$, $\lambda = 0.4$ • Independent DPMs: $W = 0$, $\lambda$ = any • [Plot: power-law curve]
Outline • Background • Mixed-membership Models • Recurrent Chinese Restaurant Process • Modeling Temporal Dynamics • News • User intents • Research publications • Modeling multi-faceted Content • Ideological Perspective
Modeling Temporal Dynamics • RCRP • Infinite storylines from streaming text • Evolution of research ideas • Online scalable inference • Dynamic user interests • Online distributed inference
Understanding the News • Clustering • Group similar articles together • Classification • High-level topics like sports and politics • Analysis • How a story develops over time • Who are the main entities • Challenges • Large scale and online • Almost one document per second
A Unified Model • Jointly solves the three main tasks • Clustering • Classification • Analysis • Building blocks • A topic model • High-level concepts (unsupervised classification) • Dynamic clustering (RCRP) • Discovers tightly-focused concepts • Named entities • Story developments
Dynamic Clustering • Recurrent Chinese restaurant process (RCRP) • Discovers time-sensitive stories Generative Process • For each document $w_d$ at time $t$ • Sample a story $s$ from the RCRP prior at time $t$ (stories' trend + prior at time $t$) • Sample $w_d \sim \mathrm{Multinomial}(\beta_s)$
Infinite Dynamic Cluster-Topic Hybrid • Topics • Politics: Government, Minister, Authorities, Opposition, Officials, Leaders, group • Accidents: Police, Attack, run, man, group, arrested, move • Sports: games, Won, Team, Final, Season, League, held • Stories • UEFA-soccer: Champions, Goal, Coach, Striker, Midfield, penalty; entities: Juventus, AC Milan, Lazio, Ronaldo, Lyon • Tax-Bill: Tax, Billion, Cut, Plan, Budget, Economy; entities: Bush, Senate, Fleischer, White House, Republican • Border-Tension (added in the next build of the slide): Nuclear, Border, Dialogue, Diplomatic, militant, Insurgency, missile; entities: Pakistan, India, Kashmir, New Delhi, Islamabad, Musharraf, Vajpayee
The Graphical Model • Tightly-focused concepts • High-level concepts
The Graphical Model • Each story has: • Distribution over words • Distribution over topics • Distribution over named entities
The Graphical Model • Document’s mixing vector is sampled from its story prior • Words inside a document can come either from global topics or from the story-specific topic