Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter, KDD '11

Utku Şirin, 1560838

User Classification in Twitter: Authors
• Marco Pennacchiotti
  • Research Scientist at Yahoo! Labs
  • PhD from the University of Rome
  • Studied at Saarland University
  • Large-scale text mining, information extraction, and natural language processing
• Ana-Maria Popescu
  • Research Scientist at Yahoo! Labs
  • Graduated from the University of Washington
  • Social media research and analytics, user modelling, and sentiment analysis

Social Media
• A rapidly growing phenomenon
• Everyone is there, conservatives and revolutionaries alike!
• As data miners, what interests us is the very large amount of available data about social media users
• A basic and important task: classification of users
  • Authoritative user extraction
  • Post reranking in web search (KDD Cup '12, Track #2)
  • User recommendation
• How do we do the classification?

Classification Task
• The starting point is to fill in missing user attributes by classifying each user with respect to the missing attribute
• Most users do not explicitly state their political view, for example
• There are various methods for solving the user classification problem
• What do we have in the social media domain?
  • Users have many attributes, such as age, gender, etc.
    • Based on these attributes, a classifier can be trained
  • Social network
    • Users have friends that they follow
• How do we define the classification task so that we can combine these two types of information, user attributes and social network structure?

Machine Learning Model
• A novel architecture combining user-centric information and social network information
  • User-centric information is the attributes of the users, which we call features hereafter
  • Social network information is the information about the users' friends
  • This is the main contribution of the paper
• Use the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm
  • Train the GBDT on the given labeled input data
  • Label the users with the trained classifier
  • Then apply the same classifier to the friends of the users and label the friends as well
  • Lastly, update each user's label with respect to her friends' labels using an update formula
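The training and labeling steps above can be illustrated with a toy gradient-boosting loop. This is only a minimal sketch under stated assumptions, not the authors' system: it boosts one-split decision stumps on a squared loss, and the two features, the toy data, `n_rounds`, and `learning_rate` are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# A stump is (feature index, threshold, value if <= threshold, value if > threshold).
Stump = Tuple[int, float, float, float]

def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Pick the single split that best fits the residuals in the least-squares sense."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    if best is None:  # no feature has two distinct values: fall back to a constant
        mean = sum(residuals) / len(residuals)
        best = (0, float("inf"), mean, mean)
    return best

def stump_predict(stump: Stump, row: List[float]) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv

def fit_gbdt(X, y, n_rounds: int = 20, learning_rate: float = 0.5) -> List[Stump]:
    """Each round fits a stump to the residuals (negative gradient of squared loss)."""
    stumps, scores = [], [0.0] * len(X)
    for _ in range(n_rounds):
        residuals = [yi - si for yi, si in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return stumps

def gbdt_score(stumps: List[Stump], row: List[float], learning_rate: float = 0.5) -> float:
    """Score clipped to [-1, +1]; the sign is the predicted class."""
    raw = sum(learning_rate * stump_predict(s, row) for s in stumps)
    return max(-1.0, min(1.0, raw))

# Hypothetical features: [tweets per day, share of tweets using class-typical words].
X = [[1.0, 0.9], [2.0, 0.8], [1.5, 0.7], [3.0, 0.2], [2.5, 0.1], [1.0, 0.3]]
y = [1, 1, 1, -1, -1, -1]  # +1 and -1 stand for the two classes of a binary task
stumps = fit_gbdt(X, y)
print(gbdt_score(stumps, [2.0, 0.85]), gbdt_score(stumps, [2.0, 0.15]))
```

The same `gbdt_score` function would be applied first to the users and then to their friends, which is what makes the later label update possible.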

User-Centric Information
• User-centric information is represented as features. The feature set is very large and mainly comprises four parts
• Profile features (PROF)
  • User name, use of an avatar picture, date of account creation, etc.
• Tweeting behavior features (BEHAV)
  • Average number of tweets per day, number of replies, etc.
• Linguistic content features
  • The richest feature set, comprising five sub-feature sets
  • Uses Latent Dirichlet Allocation (LDA) as the language model
  • Prototypical words (LING-WORD):
    • Proto-words: words that are emblematic of the users in a class
    • Found probabilistically from the data
    • First partition the users into n classes, then find the most frequent words for each class and take the k most used words per class
  • Prototypical hashtags (LING-HASH):
    • Hashtags (#) denote topics
    • Same technique as for proto-words
  • Generic LDA (LING-GLDA):
    • LDA is the language model they used; topics are extracted with the LDA model and each user is represented as a distribution over topics
    • LDA is trained on all sets of users
  • Domain-specific LDA (LING-DLDA):
    • Same as generic LDA, but trained on a task-specific training set, such as users who are only Democrats or Republicans
  • Sentiment words (LING-SENT):
    • A small, manually collected set of terms, e.g. Ronald Reagan: good or bad?
    • The OpinionFinder tool gives the sentiment as positive, negative, or neutral
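The proto-word extraction described above (partition users into classes, count words per class, keep the top k) can be sketched as follows. The ranking rule is an assumption added for illustration: words are ordered by the share of their occurrences that fall in the class, so that words common to every class do not dominate. The `min_count` cutoff, the whitespace tokenization, and the sample tweets are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List

def prototypical_words(tweets_by_class: Dict[str, List[str]],
                       k: int = 2, min_count: int = 2) -> Dict[str, List[str]]:
    """Return up to k proto-words per class from tweets grouped by class label."""
    # Word counts per class, using naive lowercase whitespace tokenization.
    counts = {label: Counter(w for tweet in tweets for w in tweet.lower().split())
              for label, tweets in tweets_by_class.items()}
    # Total count of each word across all classes.
    totals = Counter()
    for class_counts in counts.values():
        totals.update(class_counts)
    result = {}
    for label, class_counts in counts.items():
        candidates = [w for w, n in class_counts.items() if n >= min_count]
        # Rank by class share, then by raw count, then alphabetically for stability.
        candidates.sort(key=lambda w: (-class_counts[w] / totals[w], -class_counts[w], w))
        result[label] = candidates[:k]
    return result

# Hypothetical sample data for a two-class task.
sample = {
    "democrat": ["healthcare reform now", "support healthcare for all",
                 "the reform vote is today"],
    "republican": ["cut taxes now", "lower taxes and less spending",
                   "the spending vote is today"],
}
proto = prototypical_words(sample)
print(proto)
```

The same counting scheme would apply to prototypical hashtags by keeping only tokens that start with `#`.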

User-Centric Information
• Social network features
  • A combination of two different features
  • Friend accounts (SOC-FRIE):
    • Captures whether users with different labels, such as Democrats and Republicans, share the same friends
  • Prototypical replied (SOC-REP) and retweeted (SOC-RET) users:
    • Find the most frequently mentioned (@) and retweeted (RT) users for each group of labeled users
• That is all for the user-centric information. A very large feature set, indeed.
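Finding the most frequently mentioned and retweeted accounts for one class of users can be sketched with two regular expressions. This is an illustrative sketch only: the patterns assume the plain-text `@user` and `RT @user` conventions, and the account names and tweets are made up.

```python
import re
from collections import Counter
from typing import List, Tuple

RETWEET = re.compile(r"\bRT @(\w+)")  # "RT @user" marks a retweet
MENTION = re.compile(r"@(\w+)")       # any other "@user" is a mention or reply

def top_accounts(tweets: List[str], k: int = 1) -> Tuple[List[str], List[str]]:
    """Return the k most mentioned and the k most retweeted accounts."""
    mentioned, retweeted = Counter(), Counter()
    for tweet in tweets:
        retweeted.update(RETWEET.findall(tweet))
        # Remove the retweet markers first so they are not counted as mentions.
        mentioned.update(MENTION.findall(RETWEET.sub("", tweet)))
    return ([a for a, _ in mentioned.most_common(k)],
            [a for a, _ in retweeted.most_common(k)])

# Hypothetical tweets from users who share one class label.
class_tweets = [
    "RT @alice great speech",
    "@bob thanks for the reply",
    "RT @alice agreed",
    "@bob see you there @carol",
]
top_mentioned, top_retweeted = top_accounts(class_tweets)
print(top_mentioned, top_retweeted)
```

Running this once per class gives per-class lists of prominent accounts, which can then be turned into features for each user.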

Label Update Using Social Network
• Each user in the test set has now been labeled by the classifier trained with the features just described
• The label update step revises the labels with respect to the labels of the users' friends, as follows:
  • Label each user and all of her friends using the trained classifier. Labels are numbers in [-1, +1]; higher values indicate a higher confidence level
  • Then update the labels of the users according to the following formula for user u_i: [formula not preserved in this transcript]
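The update formula itself did not survive in this transcript, so the sketch below is not the paper's formula. It only illustrates the general idea under one assumed form: the updated score is a weighted average of the user's own classifier score and the mean score of her friends, with a hypothetical weight `alpha`.

```python
from typing import List

def update_label(user_score: float, friend_scores: List[float], alpha: float = 0.5) -> float:
    """Blend a user's score with the mean of her friends' scores.

    Assumed form, for illustration only:
        updated = alpha * user_score + (1 - alpha) * mean(friend_scores)
    All scores are classifier outputs in [-1, +1], so the result stays in [-1, +1].
    """
    if not friend_scores:
        return user_score  # no friends: nothing to update with
    friend_mean = sum(friend_scores) / len(friend_scores)
    return alpha * user_score + (1 - alpha) * friend_mean

# A weakly positive user whose friends lean strongly the other way flips sign.
flipped = update_label(0.1, [-0.9, -0.7])
# A weakly positive user with strongly positive friends gains confidence.
reinforced = update_label(0.2, [0.8, 0.6, 1.0])
print(flipped, reinforced)
```

Under this assumed form, the update helps only when friends tend to share the user's class, which matches the later observation that it works for political affiliation but not for ethnicity.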

Experimental Evaluation
• Three binary classification tasks:
  • Detecting political affiliation
    • Democrat or Republican
    • 5169 Democrats and 5169 Republicans
    • 1.2 million friends
  • Ethnicity
    • African American or not
    • 3000 African Americans and 3000 non-African Americans
    • 508K friends
  • Following a business
    • Following Starbucks or not
    • 5000 Starbucks followers and 5000 non-followers
    • 981K friends

Experimental Results, Political Affiliation Task
• The combined HYBRID model achieves its best result among the three tasks here; however, the increase over the single ML model is not significant
• Social network features are very successful. This is because users with a particular political view are friends with users holding similar views. Supporting this, the graph-based label update is also very successful on its own

Experimental Results, Starbucks Fans Task
• The social graph update is not as successful as in the political affiliation task, since following Starbucks does not build friendships
• Profile features are very successful on their own
• Linguistic features are also successful
• The HYBRID method still does not significantly improve on the ML system alone

Experimental Results, Ethnicity Task
• The HYBRID method fails: it performs worse than the ML model alone
• Social network features perform very badly!
• As in the Starbucks task, ethnicity does not form a community. Hence, social network features and the graph-based update give very low results
• The best single-feature results come from linguistic features. Linguistic features always have a point!

Overall Comments
• #1 The ML method is mostly good enough, and the update part of the architecture does not bring significant improvement. If the task allows users to form a community, the update function works; otherwise, it may even hurt the ML system alone, as in the ethnicity case
• #2 Linguistic features are always reliable

Review #1
• The novelty of combining the two types of information is attractive; however, there are serious points that should be criticized
• First of all, the classifier does only binary classification, and nothing is said about multi-dimensional classification. Doing multi-dimensional classification with a binary classifier is time-consuming and weakens the claim about scalability.
• As said, the novel architecture idea is attractive; however, the results show that the label update does not work well. Why? The authors did not give any appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied.
• There are too many features, but the reasons why these features were selected are not given.
• Moreover, applying the same ML model to the users and their friends replicates the information. Obviously, connected users will have some common and some different attributes, so what is the point?
• The social graph should be used more effectively. I think it should not be used to update the labels but as a heavily weighted feature in the ML model. This is because we should superpose different information types instead of using one to compensate for the other. You can see the difference by thinking in terms of a vector space: updating means spanning the same vector again, while superposing means using both vectors concurrently. For example, proto-words could have been extracted using the network, somehow.

Review #2
• The authors mention Gradient Boosted Decision Trees (GBDT) but say nothing about this classification algorithm; an explanation of GBDT is expected, at least in principle. The same holds for the Latent Dirichlet Allocation (LDA) language model. It is the first time I have heard of this language model, and they said nothing about LDA. It is only said that LDA is used as the language model and is associated with topics. But what is LDA, and how is it associated with topics?
• There is no data analysis, a very crucial shortcoming of the paper; everything is data! They only gave the number of users used in training, but what about the test set? The development set? Any other statistics about the data? Moreover, they used a different number of samples for each task. The success of the label update is much lower for the ethnicity task than for the political affiliation task; however, there are 1.2M friends for the political affiliation task but less than half of that for the ethnicity task, 508K. Hence the cross-task conclusions are not reliable.
• The system they built has a strong constraint, indeed. It is language dependent: English. For example, the features based on frequencies of proto-words will not work for Turkish due to its agglutinative nature, which yields many inflected forms of the same word: masayı, masada, masanın, masalardakilerin, etc. (A stemmer will most probably be needed.)
• The experiments are not done in a structured way. The authors have just run the experiments and shown the results. There is no useful commentary. Besides, they did not explain why they chose these experiments. For example, I would want to see the performance of some feature subsets: since single features alone mostly give very good results, some subset may increase the overall HYBRID result.

Any Comments or Questions?
