Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter, KDD '11. Utku Şirin 1560838. User Classification in Twitter, Authors. Marco Pennacchiotti: Research Scientist at Yahoo! Labs, PhD from the University of Rome, studied at Saarland University
Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter, KDD '11 Utku Şirin 1560838
User Classification in Twitter, Authors • Marco Pennacchiotti • Research Scientist at Yahoo! Labs • PhD from the University of Rome • Studied at Saarland University • Large-Scale Text Mining, Information Extraction, and Natural Language Processing • Ana-Maria Popescu • Research Scientist at Yahoo! Labs • Graduated from the University of Washington • Social Media Research and Analytics, User Modelling, and Sentiment Analysis
Social Media • A rapidly growing phenomenon • Everyone is there, the conservatives and the revolutionaries! • As data miners, what interests us is the very large amount of available data about social media users • A basic and important task: classification of the users • Authoritative user extraction • Post reranking in web search (KDD Cup '12, Track #2) • User recommendation • How to do the classification?
Classification task • The starting point is to fill in incomplete user attributes by classifying each user with respect to the missing attribute • Most users do not explicitly state their political view, for example • There are various methods for solving the user classification problem • What do we have in the social media domain? • Users have many attributes, such as age, gender, etc. • Based on the attributes, a classifier may be trained/constructed • Social Network • Users have friends that they follow • How to define the classification task so that we can combine these two types of information structure, user attributes and social network?
Machine learning model • A novel architecture combining user-centric information and social network information • User-centric information consists of the attributes of the users, which we call features hereafter • Social network information is the information about the friends of the users • Main contribution of the paper • Use the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm • Train the GBDT with the given labeled input data • Label the users with the trained classifier • Then apply the same classifier model to the friends of the users and label the friends as well • Lastly, update each user's label with respect to her friends' labels using an update formula
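The training and labeling steps above can be illustrated with a deliberately tiny gradient-boosting sketch. It is not the paper's GBDT setup: it assumes one-split trees (stumps), squared loss, and a fixed learning rate, and every name in it (`fit_gbdt`, `score`, `LEARNING_RATE`) is hypothetical. It only shows the idea of fitting each new tree to the residual errors of the trees before it and reading the summed output as a score in [-1, +1].

```python
from typing import List, Tuple

# (feature index, threshold, value if x <= threshold, value if x > threshold)
Stump = Tuple[int, float, float, float]
LEARNING_RATE = 0.5  # assumed value, chosen only for illustration


def predict_stump(stump: Stump, row: List[float]) -> float:
    f, t, left_value, right_value = stump
    return left_value if row[f] <= t else right_value


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Pick the one-feature split that minimizes squared error on the residuals."""
    best, best_err = (0, 0.0, 0.0, 0.0), float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            lv = sum(left) / len(left) if left else 0.0
            rv = sum(right) / len(right) if right else 0.0
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (f, t, lv, rv), err
    return best


def fit_gbdt(X: List[List[float]], y: List[float], n_rounds: int = 20) -> List[Stump]:
    """Boosting loop: each stump is fit to the residuals of the current ensemble."""
    stumps: List[Stump] = []
    scores = [0.0] * len(X)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [yi - si for yi, si in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + LEARNING_RATE * predict_stump(stump, row)
                  for s, row in zip(scores, X)]
    return stumps


def score(stumps: List[Stump], row: List[float]) -> float:
    """Ensemble output clipped to [-1, +1]; the sign is the predicted class."""
    total = sum(LEARNING_RATE * predict_stump(s, row) for s in stumps)
    return max(-1.0, min(1.0, total))


# Toy data: two made-up features per user, labels -1 / +1.
X_train = [[0.1, 5.0], [0.2, 7.0], [0.9, 6.0], [0.8, 4.0]]
y_train = [-1.0, -1.0, 1.0, 1.0]
model = fit_gbdt(X_train, y_train)
```

The same `score` function would be applied unchanged to a user's friends, which is what makes the later label-update step possible.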
User-Centric Information • User-centric information is represented as features. The feature set is very large and comprises four main parts • Profile features (PROF) • User name, use of an avatar picture, date of account creation, etc. • Tweeting behavior features (BEHAV) • Average number of tweets per day, number of replies, etc. • Linguistic content features • Richest feature set, comprising five sub-feature sets • Uses Latent Dirichlet Allocation (LDA) as the language model • Prototypical words (LING-WORD): • Proto words: words that are emblematic of a class of users • Found probabilistically from the data • First partition the users into n classes, then find the most frequent words for each class and take the k most used words for each class • Prototypical hashtags (LING-HASH): • Hashtags (#) denote topics • Same technique as for proto words • Generic LDA (LING-GLDA): • LDA is the language model they used; topics are extracted with the LDA model and users are represented as distributions over topics • LDA is trained on the full set of users • Domain-specific LDA (LING-DLDA): • Same as Generic LDA, but trained on a task-specific training set, such as users who are only Democrats and Republicans • Sentiment words (LING-SENT): • Manually collected small set of terms, e.g. Ronald Reagan: good or bad? • The OpinionFinder tool gives the sentiment as positive, negative, or neutral
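The proto-word recipe on this slide (partition users into classes, count words per class, keep the top k) can be sketched as follows. This is a simplified reading of the slide, not the paper's exact scoring: it assumes whitespace tokenization and raw frequency, so in practice a stop-word filter would be needed. The function names and the toy tweets are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def proto_words(tweets_by_class: Dict[str, List[str]], k: int) -> Dict[str, List[str]]:
    """For each class, return the k most frequent words in that class's tweets."""
    protos = {}
    for label, tweets in tweets_by_class.items():
        counts = Counter(word for tweet in tweets for word in tweet.lower().split())
        protos[label] = [word for word, _ in counts.most_common(k)]
    return protos


def proto_counts(user_tweets: List[str], protos: Dict[str, List[str]]) -> Dict[str, int]:
    """Feature vector for one user: how many of her words are proto words of each class."""
    words = [w for tweet in user_tweets for w in tweet.lower().split()]
    return {label: sum(w in set(ws) for w in words) for label, ws in protos.items()}


# Made-up tweets for two classes, only to exercise the functions.
toy_tweets = {
    "dem": ["healthcare reform now", "healthcare for all", "reform healthcare"],
    "rep": ["tax cuts now", "lower tax", "tax tax cuts"],
}
toy_protos = proto_words(toy_tweets, k=2)
toy_features = proto_counts(["tax reform now"], toy_protos)
```

The same routine would serve for prototypical hashtags by keeping only tokens that start with `#`.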
User-Centric Information • Social Network Features • Combination of two different features • Friend accounts (SOC-FRIE): • Captures whether users with different labels, such as Democrats and Republicans, share the same friends • Prototypical replied (SOC-REP) and retweeted (SOC-RET) users: • Find the most frequently mentioned (@) and retweeted (RT) users for each class of labeled users • That's all for the user-centric information. A very large feature set, indeed…
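A minimal sketch of how the prototypical mentioned and retweeted accounts might be collected for one class of users. The regular expressions, the function name `top_accounts`, and the handles in the toy tweets are assumptions made for illustration; note that the mention pattern also matches the handle inside an `RT @user` prefix.

```python
import re
from collections import Counter
from typing import List, Pattern

MENTION = re.compile(r"@(\w+)")        # any @handle
RETWEET = re.compile(r"\bRT @(\w+)")   # handle that follows an "RT" marker


def top_accounts(tweets: List[str], pattern: Pattern[str], k: int) -> List[str]:
    """Return the k handles matched most often by `pattern` across the tweets."""
    counts = Counter(h.lower() for tweet in tweets for h in pattern.findall(tweet))
    return [handle for handle, _ in counts.most_common(k)]


# Made-up tweets written by users of a single class.
class_tweets = [
    "RT @alice: big news today",
    "@alice thanks for sharing",
    "good morning @bob",
    "RT @alice one more time",
]
most_mentioned = top_accounts(class_tweets, MENTION, k=1)
most_retweeted = top_accounts(class_tweets, RETWEET, k=1)
```

Running this once per class gives one list of prototypical accounts per label, which can then be turned into per-user counts in the same way as proto words.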
Label Update Using Social Network • Now each user in the test set is labeled by the classifier trained with the features just mentioned • The label update step updates the labels with respect to the labels of the users' friends, as follows: • Label each user and all of her friends using the trained classifier. Labels are numbers in [-1, +1]; higher values indicate a higher confidence level • Then update the labels of the users according to the following formula for the user ui:
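The update formula itself is not reproduced in this text, so the sketch below does not claim to be the paper's formula. It assumes one plausible form: a weighted combination of the user's own classifier score and the mean score of her friends, with a hypothetical mixing weight `lam`. All names are illustrative.

```python
from typing import Dict, List


def update_label(own_score: float, friend_scores: List[float], lam: float = 0.5) -> float:
    """ASSUMED update rule: lam * own score + (1 - lam) * mean friend score."""
    if not friend_scores:
        return own_score  # no friends scored: keep the classifier's label
    friend_mean = sum(friend_scores) / len(friend_scores)
    return lam * own_score + (1.0 - lam) * friend_mean


def update_all(scores: Dict[str, float],
               friends: Dict[str, List[str]],
               lam: float = 0.5) -> Dict[str, float]:
    """Apply the update to every user, always reading the original scores."""
    return {
        user: update_label(s, [scores[f] for f in friends.get(user, []) if f in scores], lam)
        for user, s in scores.items()
    }


# Classifier scores in [-1, +1] for one user and her two friends (made-up values).
raw_scores = {"u1": 0.2, "f1": 1.0, "f2": 0.6}
friend_lists = {"u1": ["f1", "f2"]}
updated = update_all(raw_scores, friend_lists)
```

Under this assumed rule, a weakly labeled user whose friends are confidently labeled is pulled toward their label, which matches the intuition on the slide for tasks where users form communities.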
Experimental Evaluation • Three binary classification tasks: • Detecting political affiliation • Democrat or Republican • 5169 Democrats and 5169 Republicans • 1.2 million friends • Ethnicity • African American or not • 3000 African Americans and 3000 non-African Americans • 508K friends • Following a business • Following Starbucks or not • 5000 Starbucks followers and 5000 non-followers • 981K friends
Experimental Results, Political Affiliation Task • The combined HYBRID model achieves its best result among the three tasks here; however, the increase over the single ML model is not significant • Social network features are very successful. This is because users with a particular political view are friends with users who hold similar views. Supporting this, the graph-based label update is also very successful on its own
Experimental Results, Starbucks Fans Task • The social graph update is not as successful as in the political affiliation task, since following Starbucks does not build friendships • Profile features are very successful alone • Linguistic features are also successful • The HYBRID method still does not improve significantly on the ML system alone
Experimental Results, Ethnicity Task • The HYBRID method fails: it performs worse than the ML model alone • Social network features perform very poorly! • As in the Starbucks task, ethnicity does not form a community. Hence, social network features and the graph-based update have very low results • The best single-feature results come from linguistic features. Linguistic features always have a point!
Overall Comments • #1 The ML method is mostly good enough, and the update part of the architecture does not bring significant improvement. If the task allows users to form a community, the update function works; otherwise, it may even hurt the ML system alone, as in the ethnicity case • #2 Linguistic features are always reliable
Review #1 • The novelty of combining the two types of information is attractive; however, there are serious points that should be criticized • First of all, the classifier does only binary classification, and nothing is said about multi-dimensional classification. Doing multi-dimensional classification using binary classifiers is time-consuming and weakens the claim about scalability. • As said, the novel architecture idea is attractive; however, the results show that the label update does not work well. Why? They did not give any appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied. • There are too many features, but the reasons why these features were selected are not given. • Moreover, applying the same ML model to the users and their friends replicates the information. Obviously, connected users will have some common and some different attributes; what is the point? • The social graph should be used more effectively. I think it should not be used to update the labels but as a heavily weighted feature in the ML model. This is because we should superpose different information types instead of using one to compensate for the other. You can see the difference by thinking in terms of a vector space: updating means spanning the same vector again, whereas superposing means using both vectors concurrently. For example, proto words could have been extracted using the network, somehow.
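The scalability concern in the first point can be made concrete with a one-vs-rest sketch, assuming that "multi-dimensional" here means choosing among more than two labels. Each label needs its own binary scorer, so training and scoring cost grow linearly with the number of labels. The scorers below are stand-in lambdas rather than trained models, and the third label is invented for the example.

```python
from typing import Callable, Dict, List

# A binary scorer maps a feature vector to a confidence in [-1, +1] for "is this label".
BinaryScorer = Callable[[List[float]], float]


def one_vs_rest(scorers: Dict[str, BinaryScorer], features: List[float]) -> str:
    """Run every binary scorer and return the label with the highest score."""
    return max(scorers, key=lambda label: scorers[label](features))


# Stand-in scorers over a single made-up feature; real ones would be trained classifiers.
toy_scorers: Dict[str, BinaryScorer] = {
    "democrat": lambda x: x[0],
    "republican": lambda x: -x[0],
    "independent": lambda x: 0.1 - abs(x[0]),  # hypothetical third class
}
predicted = one_vs_rest(toy_scorers, [0.8])
```

With K labels this calls K separate models per user, which is the overhead the review points to.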
Review #2 • They mention Gradient Boosted Decision Trees (GBDT) but give no details about this classification algorithm; an explanation of GBDT is expected, at least in principle. The same holds for the Latent Dirichlet Allocation (LDA) language model. It is the first time I have heard of this language model, and they said nothing about LDA. It is only said that LDA is used as the language model and is associated with topics. But what is LDA, and how is it associated with topics? • There is no data analysis, a crucial shortcoming of the paper: everything is data! They only gave the number of users used in training, but what about the test set? The development set? Any other statistics about the data? Moreover, they used a different number of samples for each task. The success of the label update is much lower for the ethnicity task than for the political affiliation task; however, there are 1.2M friends for the political affiliation task but almost half as many for the ethnicity task, 508K. Hence the cross-task comments are not reliable. • The system they built has a strong constraint: it is language dependent (English). For example, the features based on frequencies of proto words will not work for Turkish due to its agglutinative nature, which yields many inflected forms of the same word: masayı, masada, masanın, masalardakilerin, etc. (A stemmer will most probably be needed.) • The experiments are not done in a structured way. They have just run the experiments and shown the results, without any useful commentary. Besides, they did not explain why they chose these experiments. For example, I would want to see results for some feature subsets: since single features mostly give very good results on their own, some subset may increase the overall HYBRID result.