Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Outline • Background: statistical topic models • Labeling a topic model • Criteria and challenge • Our approach: a probabilistic framework • Experiments • Summary
Statistical Topic Models for Text Mining
Probabilistic topic modeling turns a text collection into a set of multinomial word distributions (topics), e.g.:
• term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
• web 0.21, search 0.10, link 0.08, graph 0.05, …
Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], Topic over time [Wang et al. 06], …
Applications: subtopic discovery, topical pattern analysis, summarization, opinion comparison, …
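To make the object being labeled concrete, here is a minimal sketch of fitting a topic model and reading off its word distributions. It assumes the gensim library; `texts` is a toy stand-in for a real collection, and the model choice (LDA) is just one of the options listed above.

```python
# Minimal sketch: obtaining multinomial topic-word distributions with LDA.
# Assumes gensim is installed; `texts` is a toy stand-in for a real collection.
from gensim import corpora, models

texts = [["term", "relevance", "feedback"],
         ["web", "search", "link", "graph"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each topic is a multinomial p(w | theta): the object we want to label.
for topic_id in range(lda.num_topics):
    print(lda.show_topic(topic_id, topn=5))  # [(word, probability), ...]
```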
Topic Models: Hard to Interpret
A discovered topic is just a ranked word list, e.g.: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, … — a human expert might recognize this as "retrieval models", but the model itself offers no such label.
• Using top words: automatic, but hard to make sense of (try it on: insulin, foraging, foragers, collected, grains, loads, collection, nectar, …)
• Human-generated labels: make sense, but cannot scale up
Question: Can we automatically generate understandable labels for topics?
What is a Good Label?
Example: a topic from SIGIR [Mei and Zhai 06]: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Candidate labels illustrate the criteria: "retrieval models" is good; "iPod Nano" is irrelevant; "じょうほうけんさく" (Japanese for "information retrieval") is relevant but not understandable; "pseudo-feedback" covers only part of the topic; "information retrieval" is too broad to discriminate.
A good label should be:
• Semantically close to the topic (relevance)
• Understandable — phrases?
• Of high coverage inside the topic
• Discriminative across topics
Our Method
Input: a collection (e.g., SIGIR abstracts) and its topics, e.g. {term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …}, {filtering 0.21, collaborative 0.15, …}, {trec 0.18, evaluation 0.10, …}
1. Build a candidate label pool with an NLP chunker or ngram statistics: information retrieval, retrieval model, index structure, relevance feedback, … (a sketch of this step follows below)
2. Rank candidates by relevance to the topic: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …
3. Penalize candidates that are not discriminative across topics: information retrieval 0.26 → 0.01; retrieval models 0.20; IR models 0.18; pseudo feedback 0.09; …
4. Re-rank for coverage: retrieval models 0.20; IR models 0.18 → 0.02; pseudo feedback 0.09; information retrieval 0.01; …
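As a concrete illustration of step 1, the sketch below extracts significant bigrams as candidate labels. It assumes NLTK; `tokens` is a toy stand-in for the tokenized collection, and the significance measure (log-likelihood ratio) and frequency cutoff are assumptions, not necessarily the paper's exact choices.

```python
# Sketch of step 1: build a candidate label pool from significant bigrams.
# Assumes NLTK is installed; `tokens` is a toy stand-in for a real collection.
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokens = ("information retrieval models use relevance feedback "
          "information retrieval evaluation").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times (assumed cutoff)

# Rank the remaining bigrams by a significance test (log-likelihood ratio here).
candidates = finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)
print(candidates)  # [('information', 'retrieval')] on this toy input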
Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the topic's top words well.
Given a latent topic θ with, e.g., p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, …, p("shape"|θ) = 0.01, p("body"|θ) = 0.001:
• Good label (l1): "clustering algorithm" — its words have high probability under p(w|θ)
• Bad label (l2): "body shape" — its words barely occur in the topic
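A minimal sketch of the zero-order score, assuming the label is scored by its log-likelihood under the topic, log p(l|θ) = Σ_{w∈l} log p(w|θ); the length normalization and the probability floor for unseen words are added assumptions.

```python
# Minimal sketch of the zero-order relevance score: treat the label as a tiny
# document and score it by the topic's log-likelihood. The division by label
# length and the floor for unseen words are assumptions.
import math

def zero_order_score(label, topic, floor=1e-6):
    """label: a phrase; topic: dict mapping word -> p(w | theta)."""
    words = label.split()
    return sum(math.log(topic.get(w, floor)) for w in words) / len(words)

topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.05,
         "shape": 0.01, "body": 0.001}

print(zero_order_score("clustering algorithm", topic))  # high: good label
print(zero_order_score("body shape", topic))            # low: bad label
```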
Relevance (Task 2): the First-Order Score
Intuition: prefer phrases whose context distribution is similar to the topic itself.
Compare the topic's p(w|θ) — clustering, dimension, partition, algorithm, key, hash, … — with the context distribution of each candidate label: p(w | "clustering algorithm") concentrates on clustering, dimension, partition, algorithm, …, whereas p(w | "hash join") concentrates on hash, join, key, table, code, search, map, …
• Good label (l1): "clustering algorithm"
• Worse label (l2): "hash join"
Score(l, θ) = -D(θ ‖ l): the smaller the KL divergence between the topic and the label's context distribution, the better the label.
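A minimal sketch of the first-order score as negative KL divergence, matching the formula above; the label context distributions and the smoothing floor below are toy assumptions.

```python
# Sketch of the first-order relevance score: negative KL divergence between
# the topic p(w | theta) and a label's context distribution p(w | l).
# The toy distributions and the smoothing floor are assumptions.
import math

def first_order_score(topic, label_context, floor=1e-6):
    """topic, label_context: dicts mapping word -> probability.
    Returns -D(theta || l); higher means the label fits the topic better."""
    return -sum(p * math.log(p / label_context.get(w, floor))
                for w, p in topic.items() if p > 0)

topic = {"clustering": 0.4, "dimension": 0.3, "partition": 0.2, "hash": 0.1}

p_clustering_algorithm = {"clustering": 0.35, "dimension": 0.3,
                          "partition": 0.2, "algorithm": 0.1, "hash": 0.05}
p_hash_join = {"hash": 0.4, "join": 0.3, "key": 0.2, "table": 0.1}

print(first_order_score(topic, p_clustering_algorithm))  # near 0: good label
print(first_order_score(topic, p_hash_join))             # very negative: bad
```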
Discrimination and Coverage (Tasks 3 & 4)
• Discriminative across topics: reward high relevance to the target topic, penalize high relevance to the other topics
• High coverage inside the topic: select a diverse label set with the MMR (Maximal Marginal Relevance) strategy — see the sketch below
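A minimal sketch of MMR-style selection for coverage, assuming precomputed relevance scores and pairwise label similarities; the function name, inputs, and the trade-off weight are illustrative.

```python
# Sketch of MMR label selection: greedily pick the next label that balances
# relevance to the topic against redundancy with already-chosen labels.
def mmr_select(candidates, relevance, similarity, k=3, lam=0.7):
    """candidates: list of labels; relevance: dict label -> score;
    similarity: dict (label, label) -> similarity; lam trades off the two."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(l):
            redundancy = max((similarity.get((l, s), 0.0) for s in selected),
                             default=0.0)
            return lam * relevance[l] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

labels = ["retrieval models", "IR models", "pseudo feedback"]
rel = {"retrieval models": 0.20, "IR models": 0.18, "pseudo feedback": 0.09}
sim = {("IR models", "retrieval models"): 0.9,
       ("retrieval models", "IR models"): 0.9}
print(mmr_select(labels, rel, sim, k=2))
# -> ['retrieval models', 'pseudo feedback']: "IR models" is demoted as redundant
```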
Variations and Applications
• Labeling document clusters: a document cluster induces a unigram language model, so the same machinery applies — indeed, to any task that produces a unigram language model (a sketch follows below)
• Context-sensitive labels: the label of a topic depends on the context — e.g. {tree, prune, root, branch} might mean "tree algorithms" in CS, but something quite different in horticulture or in marketing
• This offers an alternative way to approach contextual text mining
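A minimal sketch of the first variation, assuming maximum-likelihood estimation of the cluster's unigram language model; the function and variable names are illustrative.

```python
# Sketch: turn a document cluster into a unigram language model so the same
# labeling machinery applies. MLE without smoothing is an assumed choice.
from collections import Counter

def cluster_language_model(docs):
    """docs: list of token lists from one cluster -> dict word -> p(w)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

cluster = [["tree", "prune", "branch"], ["tree", "root", "branch"]]
theta = cluster_language_model(cluster)
# theta can now be scored against candidate labels with the zero-order /
# first-order sketches above.
print(theta)
```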
Experiments
• Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
• Candidate labels: significant bigrams; NLP chunks
• Topic models: PLSA, LDA
• Evaluation: human annotators compare labels generated by anonymized systems; the order of the systems is randomly perturbed, and scores are averaged over all sample topics
Result Summary
• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigrams > NLP chunks: bigrams work better on scientific literature, NLP chunks work better on news
• System labels << human labels; scientific literature is the easier task
Results: Sample Topic Labels
• News topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, … → label: "iran contra"
• Database topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005, … (the raw top words — the, of, a, and, to, data — each exceed 0.02) → labels: "clustering algorithm", "clustering structure"; weaker bigram candidates: "large data", "data quality", "high data", "data application", …
• Database topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01, … → labels: "r tree", "b tree", "indexing methods"
Results: Context-Sensitive Labeling
• Explores the different meanings a topic takes on in different contexts (content switch)
• An alternative approach to contextual text mining
Example topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
• In a database context (SIGMOD proceedings): "selectivity estimation", "random sampling", "approximate answers"
• In an IR context (SIGIR proceedings): "distributed retrieval", "parameter estimation", "mixture models"
Summary
• Labeling: a post-processing step for any multinomial topic model
• A probabilistic approach to generating good labels: understandable, relevant, high-coverage, discriminative
• Broadly applicable to mining tasks involving multinomial word distributions; supports context-sensitive labeling
• Future work: labeling hierarchical topic models; incorporating priors