

  1. Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter, KDD '11 Utku Şirin 1560838

  2. User Classification in Twitter, Authors • Marco Pennacchiotti • Research Scientist at Yahoo! Labs • PhD from the University of Rome • Studied at Saarland University • Large-scale text mining, information extraction, and natural language processing • Ana-Maria Popescu • Research Scientist at Yahoo! Labs • Graduated from the University of Washington • Social media research and analytics, user modelling, and sentiment analysis

  3. Social Media • A rapidly growing phenomenon • Everyone is there, the conservatives and the revolutionaries! • As data miners, what we are interested in is the very large amount of available data about social users • A basic and important task: classification of users • Authoritative user extraction • Post re-ranking in web search (KDD Cup '12, Track #2) • User recommendation • How do we do the classification?

  4. Classification task • The starting point is to fill in incomplete user attributes by classifying the user with respect to the missing attribute • Most users do not explicitly state their political view, for example • There are various methods for solving the user classification problem • What do we have in the social media domain? • User attributes: users have many attributes, such as age, gender, etc., and a classifier can be trained on these attributes • Social network: users have friends whom they follow • How do we define the classification task so that we can combine these two types of information, user attributes and the social network?

  5. Machine learning model • A novel architecture combining user-centric information and social network information • User-centric information means the attributes of the users, which we call features hereafter • Social network information is the information about the friends of the users • Main contribution of the paper: • Use the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm • Train the GBDT with the given labeled input data • Label the users with the built classifier • Then apply the same classifier model to the friends of the users and label the friends as well • Lastly, update each user's label with respect to her friends' labels using an update formula (a sketch of this pipeline follows this slide)
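To make the two-stage architecture concrete, here is a minimal sketch of the pipeline, assuming feature matrices have already been extracted. scikit-learn's GradientBoostingClassifier stands in for the paper's internal GBDT implementation, and `friend_lists` and the blending weight `lam` are illustrative names and values, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_and_score(X_train, y_train, X_test, X_friends, friend_lists, lam=0.5):
    """Train a GBDT, score test users and their friends, then blend the scores.

    friend_lists[i] holds row indices into X_friends for user i's friends;
    lam is an illustrative mixing weight, not a value reported in the paper.
    """
    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train)

    # Signed confidence scores in [-1, +1]: rescaled P(class = 1).
    user_scores = 2 * clf.predict_proba(X_test)[:, 1] - 1
    friend_scores = 2 * clf.predict_proba(X_friends)[:, 1] - 1

    updated = user_scores.copy()
    for i, friends in enumerate(friend_lists):
        if len(friends) > 0:  # only update users whose friends were scored
            updated[i] = lam * user_scores[i] + (1 - lam) * friend_scores[friends].mean()
    return updated
```

On a task like political affiliation, the sign of `updated` would play the role of the two class labels, with the magnitude acting as confidence.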

  6. User-Centric Information • User-centric information is represented as features. The feature set is very large, mainly comprised of four parts • Profile features (PROF) • User name, use of an avatar picture, date of account creation, etc. • Tweeting behavior features (BEHAV) • Average number of tweets per day, number of replies, etc. • Linguistic content features • The richest feature set, comprised of five sub-feature sets, and uses Latent Dirichlet Allocation (LDA) as the language model • Prototypical words (LING-WORD): • Protowords are words that are iconic for a class of users • They are found probabilistically from the data: first partition the users into n classes, then find the most frequent words for each class and take the top k words per class (a sketch follows this slide) • Prototypical hashtags (LING-HASH): • Hashtags (#) denote topics; the same technique as for protowords is applied • Generic LDA (LING-GLDA): • Topics are extracted with the LDA model and each user is represented as a distribution over topics; the LDA model is trained on all sets of users • Domain-specific LDA (LING-DLDA): • Same as generic LDA, but trained on a task-specific set, e.g., only Democrats and Republicans • Sentiment words (LING-SENT): • A small, manually collected set of terms; is "Ronald Reagan" good or bad? • The OpinionFinder tool gives the sentiment as positive, negative, or neutral
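A hedged sketch of the LING-WORD idea as described above: take the k most frequent words per class and count how often a user's tweets hit each class's list. The whitespace tokenizer and the value of k are illustrative choices, not the paper's exact setup.

```python
from collections import Counter

def prototypical_words(tweets_by_class, k=100):
    """tweets_by_class: dict mapping class label -> list of tweet strings."""
    protowords = {}
    for label, tweets in tweets_by_class.items():
        counts = Counter(word for t in tweets for word in t.lower().split())
        protowords[label] = [w for w, _ in counts.most_common(k)]
    return protowords

def protoword_features(user_tweets, protowords):
    """Count how often a user's tweets contain each class's protowords."""
    words = Counter(w for t in user_tweets for w in t.lower().split())
    return {label: sum(words[w] for w in plist) for label, plist in protowords.items()}
```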

  7. User-Centric Information • Social network features • A combination of two different feature types • Friend accounts (SOC-FRIE): • Captures whether differently labeled users, such as Democrats and Republicans, share the same friends (a sketch follows this slide) • Prototypical replied (SOC-REP) and retweeted (SOC-RET) users: • Find the most frequently mentioned (@) and retweeted (RT) users for each class of labeled users • That is all for the user-centric information; a very large set, indeed…
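An illustrative sketch of the SOC-FRIE idea: the accounts most commonly followed within each class become indicator features for a user. Selecting them by raw frequency and the cutoff k are assumptions here; the paper's exact selection criterion may differ.

```python
from collections import Counter

def prototypical_friends(friends_by_class, k=50):
    """friends_by_class: dict mapping class label -> list of friend-id lists."""
    proto = {}
    for label, friend_lists in friends_by_class.items():
        counts = Counter(f for friends in friend_lists for f in friends)
        proto[label] = [f for f, _ in counts.most_common(k)]
    return proto

def friend_features(user_friends, proto):
    """Binary indicators: does the user follow each prototypical account?"""
    user_set = set(user_friends)
    return {(label, f): int(f in user_set)
            for label, flist in proto.items() for f in flist}
```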

  8. Label Update Using Social Network • Now each user in the test set is labeled by the classifier trained on the features just mentioned • The label update step then revises the labels with respect to the labels of the friends of each user; this is done as follows: • Label each user and all of her friends using the built classifier. Labels are scores in [-1, +1], where higher values indicate a higher confidence level • Then update the label of each user u_i with respect to the following formula:
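The formula itself was an image on the slide and did not survive the transcript. As a hedged reconstruction only, a friend-based update of this kind typically blends the user's own score with the scores of her friends F(u_i); the mixing weight lambda here is illustrative and this is not necessarily the paper's exact formulation:

```latex
% Hedged reconstruction, not necessarily the paper's exact formula:
% blend the user's own score with the mean score of her friends.
L'(u_i) = \lambda \, L(u_i) + (1 - \lambda) \, \frac{1}{|F(u_i)|} \sum_{u_j \in F(u_i)} L(u_j)
```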

  9. Experimental Evaluation • Three binary classification tasks: • Detecting political affiliation • Democrat or Republican • 5,169 Democrats and 5,169 Republicans • 1.2 million friends • Ethnicity • African American or not • 3,000 African Americans and 3,000 non-African Americans • 508K friends • Following a business • Following Starbucks or not • 5,000 Starbucks followers and 5,000 non-followers • 981K friends

  10. Experimental Results, Political Affiliation Task • This task gives the best result of the combined HYBRID model among the three tasks; however, the increase over the single ML model is not significant • Social network features are very successful. This is because users of a particular political view tend to be friends with users holding similar views. Supporting this, the graph-based label update on its own is also very successful

  11. Experimental Results, Starbucks Fans Task • The social graph update is not as successful as in the political affiliation task, since following Starbucks does not really build friendships • Profile features are very successful on their own • Linguistic features are also successful • The HYBRID method still does not improve significantly on the standalone ML system

  12. Experimental Results, Ethnicity Task • The HYBRID method fails and degrades the standalone ML model • Social network features perform very badly • As in the Starbucks task, ethnicity does not form a community. Hence, the social network features and the graph-based update give very low results • The best standalone feature results come from the linguistic features. Linguistic features always have a point!

  13. Overall Comments • #1 The ML method alone is mostly good enough, and the update part of the architecture does not bring significant improvement. If the task allows users to form a community, the update function works; otherwise it may even hurt the standalone ML system, as in the ethnicity case • #2 Linguistic features are always reliable

  14. Review #1 • The novelty of combining the two types of information is attractive; however, there are serious points that should be criticized • First of all, the classifier does only binary classification, and nothing is said about multi-dimensional classification. Doing multi-dimensional classification with a binary classifier is time-consuming and weakens the claim about scalability • As said, the novel architecture idea is attractive; however, the results show that the label update does not work well. Why? They do not give any appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied • There are very many features, but the reasons why these features were selected are not given • Moreover, applying the same ML model to the users and to their friends replicates information. Obviously, connected users will have some common and some different attributes; what is the point? • The social graph should be used more effectively. I think it should not be used to update the labels but rather as a heavily weighted feature in the ML model, because we should superpose the different information types instead of using one to compensate for the other. In vector-space terms, updating means spanning the same vector again, while superposing means using both vectors concurrently (a sketch of this alternative follows this slide). For example, the protowords could have been extracted using the network, somehow
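A minimal sketch of this "superposition" suggestion, assuming graph-derived signals (for instance, friend-overlap counts like those sketched earlier) are available as a second feature matrix; the particular extra columns are illustrative, not something the reviewer or the paper specifies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_superposed(X_user, X_graph, y):
    """Concatenate user-centric and graph-derived features into one matrix,
    so the GBDT weighs both sources jointly rather than applying a
    post-hoc label update."""
    X = np.hstack([X_user, X_graph])
    clf = GradientBoostingClassifier()
    clf.fit(X, y)
    return clf
```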

  15. Review #2 • They talk about Gradient Boosted Decision Trees (GBDT) but say nothing about this classification algorithm; an explanation of GBDT is expected, at least in principle. The same is true for the Latent Dirichlet Allocation (LDA) language model. It is the first time I hear of this language model, and they say nothing about it. It is only said that LDA is used as the language model and is associated with topics. But what is LDA, and how is it associated with topics? (a short LDA sketch follows this slide) • There is no data analysis, a very crucial shortcoming of the paper; everything is data! They only give the number of users used in training, but what about the test set? A development set? Any other statistics about the data? Moreover, they use a different number of samples for each task. The success of the label update is much lower for the ethnicity task than for the political affiliation task; however, there are 1.2M friends for the political affiliation task but almost half that for the ethnicity task, 508K. Hence, the cross-task comparisons are not reliable • The system they built has a strong constraint, indeed: it is language dependent (English). For example, the features based on protoword frequencies will not work for Turkish due to its agglutinative nature, with many inflected forms of the same word: masayı, masada, masanın, masalardakiler, etc. (a stemmer will most probably be needed) • The experiments are not done in a structured way. They just run the experiments and show the results, without useful commentary. Besides, they do not explain why they chose these experiments. For example, I would want to see the performance of subsets of features, since the individual feature sets mostly achieve very good results on their own; some subset may improve the overall HYBRID result
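To answer the "what does LDA actually give you" question concretely: a quick, hedged sketch using scikit-learn to turn each user's tweets into a distribution over topics. The vectorizer settings and the number of topics are illustrative, not the paper's configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_topic_features(user_docs, n_topics=20):
    """user_docs: one concatenated string of tweets per user.
    Returns an (n_users, n_topics) matrix of topic proportions."""
    counts = CountVectorizer(stop_words="english").fit_transform(user_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)  # each row is (roughly) a distribution over topics
```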

  16. Any Comments or Questions?
