
Empirical Study of Topic Modeling in Twitter
Liangjie Hong and Brian D. Davison
Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA

Why do we care about text modeling in Twitter?
• Understanding users’ interests
• Understanding the social network
• Identifying emerging topics
SOMA 2010

Problems
• Tweets are too short (140 characters)
• Hashtags
• Abbreviations
• Multiple languages

Question
How can we train an “effective” standard topic model?

We found
• Topics learned by different aggregation strategies are substantially different
• Training the model at the user level is faster
• Learned topics can help classification tasks

A quick review of topic models
• LDA
• Author-Topic
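As a reminder of what the two models assume, their standard generative processes can be written as below. The notation (α, β, θ, φ, z, w, x, a_d) follows the common textbook convention and is an assumption here; it is not taken from the slides.

```latex
% LDA: every document d has its own topic mixture \theta_d
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d), & w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{align*}

% Author-Topic: every author a has a topic mixture \theta_a;
% document d has author set a_d, and each word picks one author x_{d,n}
\begin{align*}
\theta_a &\sim \mathrm{Dirichlet}(\alpha), & \phi_k &\sim \mathrm{Dirichlet}(\beta) \\
x_{d,n} &\sim \mathrm{Uniform}(a_d), & z_{d,n} &\sim \mathrm{Multinomial}(\theta_{x_{d,n}}) \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{align*}
```

In LDA the mixture belongs to the document, while in the Author-Topic model it belongs to the author, so only the former yields a per-document mixture directly.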

Our goal
• Obtain topic mixtures for both tweets and users

Training Schemes
• Train on tweets
  • Infer users + tweets
• Train on aggregated tweets (by users)
  • Infer tweets
• Train on aggregated tweets (by terms)
  • Infer users + tweets
• Author-Topic model
  • Infer tweets
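The two aggregation schemes can be sketched as a preprocessing step that builds the training documents. This is a minimal illustration, not the authors' code: the function names, the whitespace tokenizer, and the sample tweets are all hypothetical, and reading "by terms" as "pool every tweet that contains the term" is an assumption.

```python
from collections import defaultdict

def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption; a real pipeline would do more).
    return text.lower().split()

def aggregate_by_user(tweets):
    """tweets: iterable of (user, text). Returns {user: tokens of all their tweets}."""
    docs = defaultdict(list)
    for user, text in tweets:
        docs[user].extend(tokenize(text))
    return dict(docs)

def aggregate_by_term(tweets):
    """Returns {term: tokens of every tweet that contains that term}."""
    docs = defaultdict(list)
    for _user, text in tweets:
        tokens = tokenize(text)
        # set() so a tweet is added once per term even if the term repeats in it.
        for term in set(tokens):
            docs[term].extend(tokens)
    return dict(docs)

# Made-up sample data, for illustration only.
tweets = [
    ("alice", "topic models rock"),
    ("bob", "twitter is short"),
    ("alice", "twitter topic"),
]
user_docs = aggregate_by_user(tweets)
term_docs = aggregate_by_term(tweets)
```

Each value in `user_docs` or `term_docs` is then treated as one training document, which is far longer than a single tweet.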

Datasets
• 1,992,758 tweets + 514,130 users
  • 3,697,498 terms
• 274 verified users from Twitter Suggestion
  • 16 categories
  • 50,447 tweets (150 tweets per user)

Tasks
• Topic modeling
• Retweet Prediction
• User & Tweets Topical Classification
• Logistic Regression
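To make the logistic regression step concrete, here is a minimal sketch that fits a binary classifier on topic-mixture features with plain batch gradient descent. Everything in it is an assumption for illustration: the helper names, the learning rate and epoch count, and the made-up 3-topic mixtures; the slides do not say which features or solver were used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on log-loss. X: list of feature lists, y: 0/1 labels."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(y=1 | x) - y.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical topic mixtures (each row sums to 1) and binary labels.
X = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
y = [1, 1, 0, 0]
w, b = train_logistic_regression(X, y)
predictions = [predict(w, b, x) for x in X]
```

The same pattern applies whether the rows of `X` are tweet-level or user-level mixtures; only the source of the feature vectors changes.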

Topic Modeling

Retweet Prediction
[Figure: a retweet chain, with messages marked as positive and negative examples]
• Hello World (2009-11-01 12:00pm)
• @Jon Hello World (2009-11-01 13:15pm)
• @Kim @Jon Hello World (2009-11-01 13:23pm)
• @Frank @Kim @Jon Hello World (2009-11-01 17:49pm)

Retweet Prediction

Tweets Classification

User Classification

Conclusion
• User-level aggregation is helpful
  • Fast, with good results
• The Author-Topic model does not directly apply
• Topic modeling can help other tasks
  • Tweet classification

Thank you, and thanks to the IBM Travel Grant!
• Contact Info:
  • Liangjie Hong
  • hongliangjie@lehigh.edu
  • WUME Laboratory
  • Computer Science and Engineering
  • Lehigh University
  • Bethlehem, PA 18015 USA
