
Automatic Web Tagging and Person Tagging Using Language Models

Qiaozhu Mei†, Yi Zhang‡. Presented by Jessica Gronski‡. † University of Illinois at Urbana-Champaign; ‡ University of California at Santa Cruz.

Presentation Transcript


  1. Automatic Web Tagging and Person Tagging Using Language Models
     Qiaozhu Mei†, Yi Zhang‡; presented by Jessica Gronski‡
     † University of Illinois at Urbana-Champaign; ‡ University of California at Santa Cruz

  2. Tagging a Web Document
     • The dual problem of search/retrieval [Mei et al. 2007]:
       • Retrieval: short description (query) → relevant documents
       • Tagging: document → short description (tag)
     • To summarize the content of documents
     • To access the document in future retrieval
     [Diagram: text document → (tagging) → query/tag]

  3. Social Bookmarking of Web Documents
     [Figure: web documents and their social bookmarks (tags)]

  4. Existing Work on Social Bookmarking
     • Social bookmarking systems
       • Del.icio.us, Digg, CiteULike, etc.
     • Enhancing social bookmarking systems
       • Anti-spam [Koutrika et al. 2007]
       • Search & ranking tags [Hotho et al. 2006]
     • Utilizing social bookmarks
       • Visualization [Dubinko et al. 2006]
       • Summarization [Boydell et al. 2007]
       • Use tags to help web search [Heymann et al. 2008]; [Zhou et al. 2008]

  5. Research Questions
     • Can we automatically generate tags for web documents?
       • Meaningful, compact, relevant
     • Can we generate tags for other web objects, such as web users?

  6. Applications of Automatic Tagging
     • Summarizing documents / web objects
     • Suggesting social bookmarks
     • Refining queries for web search
       • Finding good queries for a document
     • Suggesting good keywords for online advertising

  7. Rest of the Talk
     • A probabilistic approach to tag generation
       • Candidate tag selection
       • Web document representation
       • Tag ranking
     • Experiments
       • Web document tagging
       • Web user tagging
     • Summary

  8. Our Method
     [Diagram: pipeline from corpus and documents to ranked tags]
     • User-generated corpus (e.g., Del.icio.us, Wikipedia) → candidate tag pool: ipod nano, data mining, presidential campaign, index structure, statistics tutorial, computer science, …
     • Web documents → representation as a multinomial word distribution: data 0.1599, statistics 0.0752, tutorial 0.0660, analysis 0.0372, software 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, algorithm 0.0173, …
     • Ranking candidate tags: data mining 0.26, statistics tutorial 0.19, computer science 0.17, index structure 0.01, …, ipod nano 0.00001, presidential campaign 0.0…

  9. Candidate Tag Selection
     • Meaningful, compact, user-oriented
     • From social bookmarking data
       • E.g., Del.icio.us
       • Single tags → tags that other people used
       • "Phrases" → statistically significant bigrams
     • From other user-generated web content
       • E.g., Wikipedia
       • Titles of Wikipedia entries
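The slide does not say which significance test selects the bigram "phrases". As a minimal sketch, assuming each bookmark post is an ordered list of tags and assuming a Student's t-test on adjacent tag pairs (a common collocation test, not necessarily the one the authors used), candidate bigrams could be collected as follows. The function name `significant_bigrams` and the threshold `min_t` are hypothetical.

```python
import math
from collections import Counter

def significant_bigrams(posts, min_t=2.0):
    """Score adjacent tag pairs with a t-test (assumed test) and keep those >= min_t."""
    unigrams = Counter()
    bigrams = Counter()
    for tags in posts:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))  # adjacent pairs within one post
    n = sum(unigrams.values())  # token count, used as the sample size
    scored = []
    for (a, b), count in bigrams.items():
        observed = count / n                             # sample mean of the bigram
        expected = (unigrams[a] / n) * (unigrams[b] / n)  # mean under independence
        t = (observed - expected) / math.sqrt(observed / n)
        if t >= min_t:
            scored.append(((a, b), t))
    return sorted(scored, key=lambda item: -item[1])
```

A pair that co-occurs far more often than its individual frequencies predict gets a high score; pairs seen only once or twice stay near zero and are filtered out.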

  10. Representation of Web Documents
     • Multinomial distribution of words (unigram language models)
       • Commonly used in retrieval and text mining
     • Can be estimated from the content of the document, or from social bookmarks (our approach)
       • What other people used to tag that document
     • Example distribution: text 0.16, mining 0.08, data 0.07, probabilistic 0.04, independence 0.03, model 0.03, …
     • Baseline: use the top words in that distribution to tag a document
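To make the representation concrete, here is a minimal sketch of a maximum-likelihood unigram model estimated from the tags other people assigned to one document, plus the top-words baseline. The input shape (a list of tag lists, one per bookmark) and the names `tag_language_model` and `baseline_tags` are assumptions for illustration; the slides do not specify smoothing, so none is applied here.

```python
from collections import Counter

def tag_language_model(bookmarks):
    """Maximum-likelihood p(w | d) from all tags assigned to one document."""
    counts = Counter(tag for tags in bookmarks for tag in tags)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def baseline_tags(model, k=3):
    """Baseline: the k most probable words, ties broken alphabetically."""
    ranked = sorted(model.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

# Three users bookmarked the same (hypothetical) page with these tags.
model = tag_language_model([["text", "mining"], ["text", "data"], ["text"]])
top = baseline_tags(model, k=2)
```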

  11. Tag Ranking: A Probabilistic Approach
     • Web document d → a language model
     • A candidate tag t → a language model from its co-occurring tags
     • Score and rank t by the KL-divergence of these two language models
     [Diagram label: social bookmark collection]
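The ranking step can be sketched directly from these three bullets. The version below is an illustration under stated assumptions, not the authors' implementation: candidates are single tags, a tag's model is built from the tags that share a post with it, and additive smoothing with a hypothetical constant `mu` keeps the divergence finite. All function names are invented for this sketch.

```python
import math
from collections import Counter

def cooccurrence_counts(tag, posts):
    """Counts of tags appearing in the same post as `tag` (the tag itself excluded)."""
    counts = Counter()
    for tags in posts:
        if tag in tags:
            counts.update(w for w in tags if w != tag)
    return counts

def neg_kl(doc_model, tag_counts, vocab_size, mu=0.01):
    """Return -D(d || t); the additive smoothing of p(w | t) is an assumption."""
    total = sum(tag_counts.values())
    score = 0.0
    for w, p_d in doc_model.items():
        if p_d > 0:
            p_t = (tag_counts.get(w, 0) + mu) / (total + mu * vocab_size)
            score -= p_d * math.log(p_d / p_t)
    return score

def rank_tags(doc_model, candidates, posts):
    """Rank candidate tags by -D(d || t), best (closest to zero) first."""
    vocab_size = len({w for tags in posts for w in tags})
    scored = [(t, neg_kl(doc_model, cooccurrence_counts(t, posts), vocab_size))
              for t in candidates]
    return sorted(scored, key=lambda kv: -kv[1])

posts = [["data", "mining", "statistics"], ["data", "mining", "algorithm"],
         ["ipod", "nano", "apple"], ["ipod", "music"]]
doc_model = {"data": 0.5, "statistics": 0.3, "algorithm": 0.2}
ranked = rank_tags(doc_model, ["mining", "ipod"], posts)
```

In this toy collection "mining" ranks above "ipod" because the tags that co-occur with it closely match the document's word distribution.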

  12. Rewriting and Efficient Computation
     [Equation not preserved in the transcript; its annotations follow]
     • Bias of using C to represent candidate tag t
     • Bias of using C (e.g., del.icio.us) to represent document d
     • PMI(w,t|C): 1. can be pre-computed from the corpus; 2. only store those with PMI(w,t|C) > 0
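The equation on this slide did not survive in the transcript. One rewriting consistent with its annotations is the algebraic identity below, which inserts the collection-conditioned probabilities p(w|C) and p(w|t,C) into the KL-divergence score. The symbol names and the sign convention chosen for Bias(t, C) are assumptions; the slide's own notation may differ.

```latex
\begin{aligned}
\mathrm{Score}(t, d) &= -D(d \,\|\, t)
  = -\sum_{w} p(w \mid d)\,\log\frac{p(w \mid d)}{p(w \mid t)} \\
&= \sum_{w} p(w \mid d)\,\log\frac{p(w \mid t, C)}{p(w \mid C)}
  \;-\; \sum_{w} p(w \mid d)\,\log\frac{p(w \mid d)}{p(w \mid C)}
  \;-\; \sum_{w} p(w \mid d)\,\log\frac{p(w \mid t, C)}{p(w \mid t)} \\
&= \sum_{w} p(w \mid d)\,\mathrm{PMI}(w, t \mid C)
  \;-\; D(d \,\|\, C) \;+\; \mathrm{Bias}(t, C),
\end{aligned}
\qquad
\begin{aligned}
\mathrm{PMI}(w, t \mid C) &= \log\frac{p(w, t \mid C)}{p(w \mid C)\,p(t \mid C)}, \\
\mathrm{Bias}(t, C) &= -\sum_{w} p(w \mid d)\,\log\frac{p(w \mid t, C)}{p(w \mid t)}.
\end{aligned}
```

Under this reading, D(d || C) is the same for every candidate tag and can be dropped when ranking, and the PMI table depends only on the collection, which is why it can be computed once and stored sparsely.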

  13. Tagging Web Users
     • Summarize the interests and bias of a user
     • Web user → a pseudo document
     • Estimate a language model from all tags that the user has used
     • The rest is similar to web document tagging
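The pseudo-document idea amounts to pooling a user's tags before estimating the model. A minimal sketch follows, assuming tagging records arrive as (user, url, tags) tuples; the record layout and the name `user_language_model` are hypothetical.

```python
from collections import Counter

def user_language_model(records, user):
    """p(w | user): pool every tag the user assigned, across all bookmarked URLs."""
    counts = Counter()
    for who, _url, tags in records:
        if who == user:
            counts.update(tags)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

records = [
    ("alice", "http://example.org/a", ["python", "tutorial"]),
    ("alice", "http://example.org/b", ["python", "data"]),
    ("bob", "http://example.org/a", ["music"]),
]
alice = user_language_model(records, "alice")
```

The resulting distribution can then be scored against candidate tags in the same way as a document model.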

  14. Experiments
     • Dataset:
       • Two-week tagging records from Del.icio.us
     • Candidate tags:
       • Top 15,000 significant 2-grams from del.icio.us
       • Titles of all Wikipedia entries (5,836,166 entries, of which around 48,000 appeared in del.icio.us)

  15. Tagging Web Documents
     [Results table not preserved; callouts on the table:]
     • Meaningful, relevant
     • Relevant, precise
     • But partially covers good tags
     • Too general, sometimes not relevant
     • Overfit data, not real phrases
     • Meaningful, relevant, real
     • But sometimes not meaningful

  16. Tagging Web Documents (Cont.)
     [Results table not preserved; callouts on the table:]
     • Relevant, precise
     • Meaningful, relevant, real
     • Meaningful, relevant
     • Too general, sometimes not relevant
     • Overfit data, not real phrases
     • But sometimes not meaningful

  17. Tagging Web Users
     [Results table not preserved; callouts on the table:]
     • Overfit data, not real phrases
     • Meaningful, relevant, real
     • Partially covers the interest

  18. Tagging Web Users (Cont.)
     [Results table not preserved; callout on the table:]
     • Missed many good tags

  19. Discussions
     • Using top tags: too general, sometimes not relevant
     • Ranking tags by labeling language models:
       • Candidate = social bookmarking words
         • Pros: relevant, compact
         • Cons: ambiguous, not so meaningful
       • Candidate = social bookmarking bigrams
         • Pros: more meaningful, relevant
         • Cons: overfitting the data, sometimes not real phrases
       • Candidate = Wikipedia titles
         • Pros: meaningful, relevant, real phrases
         • Cons: biased, misses potentially good tags (Bias(t, C))

  20. Summary
     • Automatic tagging of web documents and web users
     • A probabilistic approach based on labeling language models
     • Effective when the candidate tags are of high quality
     • Future work:
       • A robust way of generating candidate tags
       • Large-scale evaluation

  21. Thanks!
