1 / 1

Automatic Labeling of Multinomial Topic Models

Clustering. Clustering. Clustering. Clustering. Good Label ( l 1 ): “ clustering algorithm”. Context: SIGMOD Proceedings. dimensional. dimension. dimension. dimension. Statistical topic models.

joshua
Download Presentation

Automatic Labeling of Multinomial Topic Models

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Clustering Clustering Clustering Clustering Good Label (l1): “clustering algorithm” Context: SIGMOD Proceedings dimensional dimension dimension dimension Statistical topic models term 0.1599relevance 0.0752weight 0.0660 feedback 0.0372independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … term 0.1599relevance 0.0752weight 0.0660 feedback 0.0372independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … wordprobdata 0.0358 university 0.0132 new 0.0119 results 0.0119 end 0.0116 high 0.0098 research 0.0096 figure 0.0089 analysis 0.0076 number 0.0073 institute 0.0072 … … term 0.1599relevance 0.0752weight 0.0660 feedback 0.0372independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … Good Label (l1):“clustering algorithm” Bad Label (l2):“hash join” Topic  algorithm … partition partition Latent Topic  Text Collection … algorithm birch Multinomial topic models algorithm algorithm Bad Label (l2):“body shape” join … … Collection (Context) shape Coverage; Discrimination Relevance Score Re-ranking … hash hash hash p(w|) body clustering algorithm; distance measure; … P(w|l2) P(w|) P(w|l1) D(|l1) < D(|l2) database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure … = NLP Chunker Ngram stat. Topic Ranked Listof Labels = Unigram language model Candidate label pool Multinomial distribution of Terms DAIS The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign Large Scale Information Management Automatic Labeling ofMultinomial Topic Models Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai Our Method Multinomial Topic Models: Hard to Interpret (Label) Topics term 0.1599relevance 0.0752weight 0.0660 feedback 0.0372independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … term 0.1599relevance 0.0752weight 0.0660 feedback 0.0372independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … insulin foraging foragers collected grains loads collection nectar … Topic Model … term 0.1599relevance 0.0752weight 0.0660 feedback 0.0372independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … (Multinomial mixture, PLSA, LDA, & lots of extensions) ? 1 2 Applications: topic extraction, IR, contextual text mining, opinion analysis… Blei et al. http://www.cs.cmu.edu/~lemur/science/topics.html Good labels = Understandable + Relevant+High Coverage + Discriminative Question: Can we automatically generate meaningful labels for topics? 1 1 2 Scoring and Re-ranking Labels Modeling the Relevance • Zero-order relevance: prefer phrases well covering top words Bias of using C to estimate l Pointwise mutual information based on C, pre-computed High Coverage inside topic (MMR): Discriminative across topic: • First-order relevance: prefer phrases with similar context (distribution) 3 4 Results: Sample Topic Labels Context-sensitive Labeling Indexing methods sampling 0.06 estimation 0.04 approximate 0.04 histograms 0.03 selectivity 0.03 histogram 0.02 answers 0.02 accurate 0.02 selectivity estimation tree 0.09 trees 0.08 spatial 0.08 b 0.05 r 0.04 disk 0.02 array 0.01 cache 0.01 sampling estimation approximation histogram selectivity histograms … Context: Database(SIGMOD Proceedings) Context: IR(SIGIR Proceedings) Clustering algorithms selectivity estimation; random sampling; approximate answers; distributed retrieval; parameter estimation; mixture models; the, of, a, and,to, data, > 0.02 … clustering 0.02 time 0.01 clusters 0.01 databases 0.01 large 0.01 performance 0.01 quality 0.005 r treeb tree large data dependencies functional cube multivalued iceberg buc … north 0.02 case 0.01 trial 0.01 iran 0.01 documents 0.01 walsh 0.009 reagan 0.009 charges 0.007 plane 0.02 air 0.02 flight 0.02 pilot 0.01 crew 0.01 force 0.01 accident 0.01 crash 0.005 multivalue dependency functional dependency Iceberg cube term dependency; independence assumption; air plane crashes air force iran contra • - Applicable to any task with unigram language model, such as labeling document clusters iran contra trial 5 6

More Related