Empirical Study of Topic Modeling in Twitter Liangjie Hong and Brian D. Davison Computer Science and Engineering Lehigh University Bethlehem, PA USA
Why do we care about text modeling in Twitter? • Understanding users’ interests • Understanding the social network • Identifying emerging topics SOMA 2010
Problems • Tweets are very short (140 characters) • Hashtags • Abbreviations • Multiple languages
Question • How can we train an “effective” standard topic model?
We found • Topics learned by different aggregation strategies are substantially different • Training the model at the user level is faster • Learned topics can help classification tasks
A quick review of topic models • LDA • Author-Topic
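For reference, the two models named on this slide are usually written as the generative processes below. The notation is an assumption taken from the standard formulations, not from the slides: $\alpha$ and $\beta$ are Dirichlet hyperparameters, $d$ indexes documents, $n$ word positions, $k$ topics, and $A_d$ is the set of authors of document $d$.

```latex
% LDA: each document has its own topic mixture.
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the } n\text{-th word} \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{observed word}
\end{align*}

% Author-Topic: each author has a topic mixture; every word first picks an author.
\begin{align*}
\theta_a &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of author } a \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
x_{d,n} &\sim \mathrm{Uniform}(A_d) && \text{author responsible for the word} \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_{x_{d,n}}) && \text{topic of the } n\text{-th word} \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{observed word}
\end{align*}
```

In LDA the mixture $\theta_d$ belongs to a document, while in the Author-Topic model the mixture $\theta_a$ belongs to an author, so what counts as a "document" matters when the documents are single tweets.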
Our goal • Obtain topic mixtures for both tweets and users
Training Schemes • Train on tweets; infer users + tweets • Train on tweets aggregated by user; infer tweets • Train on tweets aggregated by term; infer users + tweets • Author-Topic model; infer tweets
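The aggregation schemes differ only in what counts as one training document. A minimal sketch in plain Python follows, under stated assumptions: the toy tweets, the whitespace tokenizer, and the function names are hypothetical, and "aggregated by terms" is read here as one document per term that concatenates every tweet containing that term, which the slides do not spell out.

```python
from collections import defaultdict

# Hypothetical toy data: (user, tweet text) pairs.
tweets = [
    ("alice", "topic models for twitter"),
    ("bob", "twitter hashtags and topic drift"),
    ("alice", "short tweets are hard"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def aggregate_by_user(tweets):
    """One training document per user: all of that user's tweets concatenated."""
    docs = defaultdict(list)
    for user, text in tweets:
        docs[user].extend(tokenize(text))
    return dict(docs)


def aggregate_by_term(tweets):
    """One training document per term: every tweet that contains the term."""
    docs = defaultdict(list)
    for _, text in tweets:
        tokens = tokenize(text)
        for term in set(tokens):  # add each tweet once per distinct term
            docs[term].extend(tokens)
    return dict(docs)


user_docs = aggregate_by_user(tweets)
term_docs = aggregate_by_term(tweets)
```

Either dictionary can then be handed to a topic-model trainer in place of the raw per-tweet documents; the user-level version yields far fewer and much longer documents than training on individual tweets.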
Datasets • 1,992,758 tweets + 514,130 users • 3,697,498 terms • 274 verified users from Twitter Suggestion • 16 categories • 50,447 tweets (150 tweets per user)
Tasks • Topic modeling • Retweet prediction • User & tweet topical classification • Logistic regression
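To illustrate how topic mixtures can feed a classifier, here is a minimal sketch of logistic regression trained on topic-mixture features. Everything in it is an assumption made for illustration: the mixtures and labels are invented toy values, the function names are hypothetical, and plain stochastic gradient descent stands in for whatever solver was actually used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias with per-example gradient steps on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log loss w.r.t. the linear score
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical 3-topic mixtures; label 1 marks documents dominated by topic 0.
topic_mixtures = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(topic_mixtures, labels)
predictions = [predict(weights, bias, x) for x in topic_mixtures]
```

The same pattern applies whether the feature vector describes a tweet or a user; only the source of the topic mixture changes.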
Topic Modeling
Retweet Prediction • @Jon Hello World (2009-11-01 13:15pm) • @Kim @Jon Hello World (2009-11-01 13:23pm) • @Frank @Kim @Jon Hello World (2009-11-01 17:49pm) • Hello World (2009-11-01 12:00pm) • Positive examples • Negative examples
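One way to read the example above is that a message counts as a positive example when a later message repeats its text behind one or more leading @handles. The sketch below follows that reading; it is an assumption, since the slide does not state the labeling rule, and the sample messages, regular expression, and function names are all hypothetical.

```python
import re

# Hypothetical messages modeled on the slide's example, plus one unrelated tweet.
messages = [
    "Hello World",
    "@Jon Hello World",
    "@Kim @Jon Hello World",
    "Good morning",
]

# One or more "@handle " prefixes at the start of a message.
LEADING_MENTIONS = re.compile(r"^(?:@\w+\s+)+")


def split_mentions(message):
    """Return (leading handles, remaining text) for a message."""
    match = LEADING_MENTIONS.match(message)
    if match is None:
        return [], message
    return match.group(0).split(), message[match.end():]


def label_originals(messages):
    """Label each handle-free message 1 if some message repeats it, else 0."""
    retweeted = set()
    originals = []
    for message in messages:
        handles, body = split_mentions(message)
        if handles:
            retweeted.add(body)
        else:
            originals.append(body)
    return {body: int(body in retweeted) for body in originals}


example_labels = label_originals(messages)
```

Matching on exact text is the simplest possible rule; real retweets often edit or truncate the original, so a practical labeler would need a looser match.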
Retweet Prediction
Tweet Classification
User Classification
Conclusion • User-level aggregation is helpful: fast, with good results • The Author-Topic model does not directly apply • Topic modeling can help other tasks, such as tweet classification
Thank you, and thanks to the IBM Travel Grant! • Contact info: Liangjie Hong, hongliangjie@lehigh.edu • WUME Laboratory, Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015 USA