Twist: User Timeline Tweets Classifier Team: Priya Iyer, Vaidy Venkat, Sonali Sharma Mentor: Andy Schlaikjer
Goal • Auto classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology • Input: user timeline tweets • Output: list of auto classified tweets
Rationale • Twitter allows users to create custom Friend Lists based on user handles.
Rationale (contd.) • Our application is a twist on this Twitter feature: we auto classify tweets on the user's timeline based solely on the terms that occur in each tweet.
Approach • Step 1: Data Collection • Step 2: Text mining • Step 3: Creation of the training file for the library • Step 4: Evaluation of several classifiers • Step 5: Selecting the best classifier • Step 6: Validating the classification • Step 7: Tuning the parameters • Step 8: Repeat until the classification is satisfactory
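As a rough sketch only (with hypothetical helper names, not functions from the original project), the steps above can be tied together in a small Python driver; clean_tweet and to_liblinear_line correspond to the text-mining and training-file steps sketched on later slides.

```python
# Hypothetical driver tying the pipeline steps together.
# clean_tweet (Step 2) and to_liblinear_line (Step 3) are sketched on
# later slides; Steps 4-8 are the LIBLINEAR training/tuning loop.
def build_training_file(raw_tweets, labels, path="train.txt"):
    with open(path, "w") as f:
        for text, label in zip(raw_tweets, labels):
            tokens = clean_tweet(text)                        # Step 2
            f.write(to_liblinear_line(label, tokens) + "\n")  # Step 3
```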
Text Mining Process • Remove special characters • Tokenize • Remove redundant letters in words • Spell Check • Stemming • Language Identification • Remove Stop Words • Generate bigrams and change to lower case
Example: "Go SF Giants! Such an amaazzzing feelin'!!!! \m/ :D" → Stopwords: "SF Giants! amaazzzing feelin'!!!! \m/ :D" → Special chars: "SF Giants amaazzzing feelin" → Spell check: "SF Giants amazing feeling" → Stemming: "SF Giants amazing feel me" → Stopwords: "SF Giants amazing feel"
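A minimal Python sketch of these cleaning steps, assuming NLTK is used for stopwords and Porter stemming (spell checking and language identification are left out here, since the tools used for them are not shown on the slides):

```python
import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords') once
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_tweet(text):
    text = text.lower()
    # Remove special characters, keeping only letters, digits and spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenize on whitespace, drop stopwords, then stem what is left
    return [STEMMER.stem(t) for t in text.split() if t not in STOPWORDS]

def bigrams(tokens):
    # Word bigrams generated from the cleaned, lower-cased unigrams
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(clean_tweet("Go SF Giants! Such an amaazzzing feelin'!!!!"))
```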
Choice of ML technique • Logistic Regression Classifier • Reasons: • Most popular linear classification technique for text classification • Ability to handle multiple categories with ease • Gave the best cross-validation accuracy and precision-recall score • Library: LIBLINEAR for Python
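A sketch of training with LIBLINEAR's Python interface (liblinearutil); the exact import path can differ by installation, and the input file is assumed to already be in the sparse format shown on the next slide:

```python
from liblinearutil import svm_read_problem, train

y, x = svm_read_problem("train.txt")   # labels 1-4, sparse feature dicts

# '-s 0' selects L2-regularized logistic regression, '-c 1' is the cost
model = train(y, x, "-s 0 -c 1")

# '-v 5' runs 5-fold cross-validation and returns the CV accuracy instead
cv_accuracy = train(y, x, "-s 0 -c 1 -v 5")
print("5-fold CV accuracy:", cv_accuracy)
```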
Creation of LIBLINEAR training input • Tokens: SF Giants amazing feel • Indexing: SF → 1, Giants → 2, amazing → 3, feel → 4 • Boolean features: SF (1), Giants (1), amazing (1), feel (1) • Training input line for the SVM: 1 1:1 2:1 3:1 4:1
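A small sketch of how cleaned tokens could be turned into one training line in this format, assuming boolean (presence/absence) features and the category codes above:

```python
vocabulary = {}  # term -> 1-based feature index

def to_liblinear_line(label, tokens):
    for term in tokens:
        if term not in vocabulary:
            vocabulary[term] = len(vocabulary) + 1
    # LIBLINEAR expects feature indices in ascending order
    indices = sorted({vocabulary[t] for t in tokens})
    return str(label) + " " + " ".join(f"{i}:1" for i in indices)

print(to_liblinear_line(1, ["SF", "Giants", "amazing", "feel"]))
# -> "1 1:1 2:1 3:1 4:1"
```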
THANK YOU Andy, Marti & The Twitter Team
Data Collection Challenges – Backup Slides • Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business” • Tweets were not purely “Sports” or “Business” related • Personal messages were prominent • Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
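A rough sketch of the weighting idea, with illustrative (not original) term lists and threshold:

```python
SPORTS_TERMS = {"game", "giants", "score", "season", "coach"}       # example terms
BUSINESS_TERMS = {"market", "stock", "earnings", "revenue", "ipo"}  # example terms

def category_weight(tokens, term_set):
    # Fraction of the tweet's tokens found in the reference corpus
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() in term_set) / len(tokens)

def label_training_tweet(tokens, threshold=0.2):
    # Keep a tweet for a category only if it is sufficiently on-topic;
    # otherwise treat it as a personal message and drop it
    sports = category_weight(tokens, SPORTS_TERMS)
    business = category_weight(tokens, BUSINESS_TERMS)
    if max(sports, business) < threshold:
        return None
    return "sports" if sports >= business else "business"
```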
Text Mining Challenges • Noise in the data: tweets come in inconsistent formats, with many meaningless words, misspellings, and highly individual expression. For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo. Solution: regular expressions and the NLP toolkit • Different words, same root: playing, plays, playful → play. Solution: stemming
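One concrete regular-expression fix of this kind (a sketch, not the project's exact rule) is collapsing runs of repeated letters so that a spell checker has a chance of recognizing the word:

```python
import re

def squeeze_repeats(word):
    # Collapse runs of 3+ identical characters down to 2
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(squeeze_repeats("BAAAAAAAAAAAASSKEttt"))  # -> "BAASSKEtt"
```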
LIBLINEAR output for a test file of 20 tweets • Mixed bag of sports (=1), finance (=2), entertainment (=3), and technology (=4) tweets • Output is a comma-separated list of the category assigned to each tweet • Accuracy here is 94%. Precision: 0.89 Recall: 0.89 • Experiment with different kernels for better accuracy
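A sketch of scoring a held-out test file, assuming "test.txt" is in the same sparse format and `model` comes from the training sketch above; precision and recall are computed here as per-category (macro) averages, which is one way to arrive at numbers like the 0.89/0.89 reported:

```python
from liblinearutil import svm_read_problem, predict

y_true, x_test = svm_read_problem("test.txt")
p_labels, p_acc, _ = predict(y_true, x_test, model)
print("Accuracy:", p_acc[0])

def macro_precision_recall(true, pred):
    labels = set(true)
    precisions, recalls = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

print("Macro precision/recall:", macro_precision_recall(y_true, p_labels))
```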
Summary: Data Source/Software/Tools • Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests • Coding done in Python • Database – sqlite3 • ML library – LIBLINEAR (LIBSVM-format input) • Stemming – Porter stemmer • NLP toolkit