Social Media Analytics : Digital Footprints Funded by: Sandhya Krishnan Dr. Anupam Joshi CHMPR IAB 2012
Introduction
• Social media has greatly changed the way we communicate. With approximately 3,000 tweets/sec (13K/sec around the Super Bowl) and 2.5 billion updates a day, it is an effective way to disseminate information to users across the world.
• However, the same tool can spread misinformation quickly and efficiently. This can have a harmful impact in scenarios such as national security or business/marketing, and so it needs to be kept in check.
• Our approach is to create a social footprint of users that can be used to distinguish real accounts from imposter/compromised accounts on social media.
Introduction
• Social media is a great way to disseminate information to users across the world.
  • 200 million active users and 340 million tweets/day (December 2012)
  • 1.11 billion users as of May 2013
• But what about disinformation (intentionally false or inaccurate information spread deliberately)?
Motivation (February 2013 and March 2013)
• @flydeltaassist and @deltaassist both claim to be the same entity. One is the real, Twitter-verified account; the fake account promised free tickets to its first several thousand followers.
• Two accounts both claim to be Pope Francis. The fake account was tweeting against the Church’s anti-gay policy and was banned by Twitter.
Motivation
• Which one is real? @theUSpresident, @BarakObama, @BarackObama
Motivation (August 2012)
• @pmoindia, @pm0india and @dryumyumsingh each claim to be the same account (6 fake PMO India profiles)
• Tweeting content which was:
  • Misrepresenting violence against Muslims in Burma
  • Instigating riots in the north-eastern region of India
Motivation- News/Business Scenarios April 2013 February 2013 Hacked Accounts
Objective
• Which one is real? @BarackObama, @BarakObama, @BarakObama__, @theUSpresident, @Obamanews, @BarackObama44, @ThePresObama
• Is this account compromised? @BarackObama
Success Criteria
• Build a prototype system that performs a joint content and network structure analysis, demonstrating the feasibility of distinguishing real and fake profiles.
• Achieve high accuracy in identifying the real accounts of “famous people”.
• Evaluate further by filtering down the social media network to check the validity of accounts belonging to a layman.
Solution Overview: What is a digital footprint?
Digital footprint of @barackobama:
• Content: words in tweets, hashtags, URLs
• Network structure: mentions, following, retweets, followers, replies
• Metadata: name, verified account?, location, created_at
Solution Overview: Create Digital Footprint (System: Content Module)
• Use the Twitter user_timeline API to extract the tweets (content) of a handle such as @barackobama
• Clean the text and create a bag-of-words model
• For each word, compute its TF-IDF score
• Compute two groups of words: frequently occurring and rarely occurring
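The content-module steps above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the stopword list, the smoothed IDF formula, the choice to treat each tweet as one document, and the median cut-off used to split "frequent" from "rare" terms are all assumptions, and the function names `clean`, `tfidf_scores` and `split_frequent_rare` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for", "on"}


def clean(tweet):
    """Lower-case a tweet, drop URLs, and return its non-stopword tokens."""
    tweet = re.sub(r"https?://\S+", " ", tweet.lower())
    tokens = re.findall(r"[a-z0-9_#@']+", tweet)
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_scores(tweets):
    """Return one TF-IDF score per term, treating each tweet as a document."""
    docs = [clean(t) for t in tweets]
    n_docs = len(docs)
    term_counts = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    total = sum(term_counts.values())
    scores = {}
    for term, count in term_counts.items():
        tf = count / total  # corpus-relative term frequency
        # Assumption: smoothed IDF, so a term present in every tweet keeps a non-zero score.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = tf * idf
    return scores


def split_frequent_rare(scores):
    """Split terms into (frequent, rare) around the median score (assumed cut-off)."""
    if not scores:
        return set(), set()
    ordered = sorted(scores.values())
    median = ordered[len(ordered) // 2]
    frequent = {t for t, s in scores.items() if s > median}
    rare = set(scores) - frequent
    return frequent, rare
```

Any other threshold (for example a fixed top-k) would fit the same structure; the slides do not say which cut-off was used.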
Solution Overview: Create Digital Footprint (System: Network Module)
• Use the Twitter user_timeline API to extract the users in the ‘retweets’ and ‘replies’ of a handle such as @barackobama
• Extract the users who ‘mention’ the current user
• Together these users form the close social network
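A minimal sketch of how the close social network could be assembled is shown here. It assumes each tweet is a dictionary shaped like a Twitter REST API v1.1 tweet object (`in_reply_to_screen_name`, `retweeted_status`, `entities.user_mentions`), and that tweets mentioning the user have already been collected separately; the slides do not say how those were gathered. The function name `close_social_network` is hypothetical.

```python
def close_social_network(handle, timeline, mentioning_tweets):
    """Build the close network of `handle` as outgoing and incoming neighbours.

    timeline: tweets posted by `handle` (its replies and retweets give out-edges).
    mentioning_tweets: tweets by other users (assumed to be collected separately);
    any that mention `handle` give in-edges.
    """
    handle = handle.lower()
    out_neighbors, in_neighbors = set(), set()

    for tweet in timeline:
        reply_to = tweet.get("in_reply_to_screen_name")
        if reply_to:
            out_neighbors.add(reply_to.lower())
        retweeted = tweet.get("retweeted_status")
        if retweeted:
            out_neighbors.add(retweeted["user"]["screen_name"].lower())

    for tweet in mentioning_tweets:
        mentions = tweet.get("entities", {}).get("user_mentions", [])
        mentioned = {m["screen_name"].lower() for m in mentions}
        if handle in mentioned:
            in_neighbors.add(tweet["user"]["screen_name"].lower())

    # A user replying to or mentioning themselves is not a neighbour.
    out_neighbors.discard(handle)
    in_neighbors.discard(handle)
    return {"out": out_neighbors, "in": in_neighbors}
```

Keeping outgoing and incoming neighbours separate makes it straightforward to derive out-degree and in-degree later.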
Solution Overview Digital Footprint System- Content Module Digital Signature/ Footprint @barackobama System- Network Module
Solution Overview: Authenticate Digital Footprint
• What content is similar?
  • Percentage of terms common between tweets and news articles
• How similar are they?
  • Average difference between the TF-IDF scores of such terms
• Both metrics are computed for rare and frequent terms in both contexts, tweets and news articles (rare and frequent terms are indicated by TF-IDF)
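The two content metrics above could be computed roughly as follows. This is a sketch under stated assumptions: both inputs are dictionaries mapping a term to its TF-IDF score, the percentage is taken relative to the number of tweet terms, and "difference" is read as absolute difference. The slides do not specify the denominator or the sign convention, and the name `compare_footprints` is hypothetical.

```python
def compare_footprints(tweet_scores, news_scores):
    """Return (percent_common_terms, average_abs_tfidf_difference).

    tweet_scores / news_scores: dicts mapping term -> TF-IDF score.
    The average difference is None when no term is shared.
    """
    if not tweet_scores:
        return 0.0, None
    common = set(tweet_scores) & set(news_scores)
    # Assumption: percentage is measured against the tweet vocabulary.
    pct_common = 100.0 * len(common) / len(tweet_scores)
    if not common:
        return pct_common, None
    total_diff = sum(abs(tweet_scores[t] - news_scores[t]) for t in common)
    return pct_common, total_diff / len(common)
```

To follow the slide, the function would be called twice per account: once restricted to the rare terms and once to the frequent terms.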
Solution Overview: Authenticate Digital Footprint (System: Network Module)
Network characteristics of the close social network
• To understand trust propagation in social networks, we record:
  • Number of Twitter ‘verified’ users in the current user’s network
• In some scenarios we also use:
  • Network intersection with a trusted user
  • Number of hops required to reach the current user from the trusted user in the network
  • Number of nodes in the network
  • Out-degree: from the user’s replies and retweets
  • In-degree: the user’s @mentions, in addition to @replies directed to the user and retweets (@RT) of the user’s tweets
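The network characteristics listed above can be illustrated with a small sketch. It assumes the close social network is held as a directed adjacency dictionary (`graph[u]` is the set of users that `u` replies to, retweets or mentions), that hop count means shortest path length found by breadth-first search, and that "network intersection" means shared direct neighbours. These representations and the names `hops` and `network_features` are assumptions, not details taken from the slides.

```python
from collections import deque


def hops(graph, source, target):
    """Shortest number of directed hops from source to target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def neighbors(graph, user):
    """All users directly connected to `user`, in either direction."""
    incoming = {n for n, nbrs in graph.items() if user in nbrs}
    return set(graph.get(user, ())) | incoming


def network_features(graph, user, trusted, verified):
    """Compute the per-user network features for one candidate account."""
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    user_net = neighbors(graph, user)
    return {
        "verified_in_network": len(user_net & set(verified)),
        "intersection_with_trusted": len(user_net & neighbors(graph, trusted)),
        "hops_from_trusted": hops(graph, trusted, user),
        "num_nodes": len(nodes),
        "out_degree": len(graph.get(user, ())),
        "in_degree": sum(1 for nbrs in graph.values() if user in nbrs),
    }
```

How these features are combined into a real/fake decision is not described in the slides, so the sketch stops at feature extraction.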
Results
• Ground truth
  • Twitter ‘verified’ real accounts
  • If that tagging is absent, manual observation of the account
• Analysis done for a specific time period or the 3,500 most recent tweets, whichever is relevant
• Analyses performed:
  • Identifying real and fake profiles of “famous people”
  • Identifying “less famous people”
  • Corporate accounts
  • Hacked/compromised accounts
Results I: “Famous people”
• Digital signature/footprint built from System: Content Module and System: Network Module
• “Famous people on Twitter”: people about whom enough information from reliable web sources is available on a day-to-day basis
Results I: President Obama [1st May 2013], System: Content Module [Graph 1, Graph 2]
Results I: President Obama, System: Network Module. System: @barackobama is real. Ground truth: @barackobama is the Twitter ‘verified’ real account.
Results I: Conclusion, “Famous people” [Table: Predicted vs. Actual]
• Total Twitter handles: 31
• Number of real handles: 18
• Number of fake handles: 13
Results II: “Less famous people”
• Digital signature/footprint built from System: Content Module and System: Network Module
• “Less famous people on Twitter”:
  • People about whom enough information from reliable web sources is not available on a regular day-to-day basis
  • Information may be available on some days or in spurts (when such users are in the news for a particular event, development, etc.)
  • Continuous availability of web content about such users is not reliable, hence we look at the social network structure of such users
Results II: A good mix of highly sought users in music, acting, fashion, journalism, media and business
• US Senators
• Celebrities popular in the USA
• Members of Parliament, India
• Celebrities popular in India
Results II: Senators, USA. Digital Signature/Footprint, System: Network Module. Trusted user: @barackobama
Results II: Senators, USA. System: @chuckgrassley is real. Ground truth: @chuckgrassley is the Twitter ‘verified’ real account.
Results II: Celebrities, USA. Digital Signature/Footprint, System: Network Module. Trusted users: @youtube, @justinbieber, @shakira, @kimkardashian and @cnnbrk
Results II: Celebrities, USA. (Close) Social Network Analysis [Graph 1, Graph 2]
Results II: Celebrities, USA. System: @lindsaylohan is real. Ground truth: @lindsaylohan is the Twitter ‘verified’ real account.
Results II: Conclusion, “Less famous people” [Table: Predicted vs. Actual]
• Total Twitter handles: 350
• Number of real handles: 278
• Number of fake handles: 72
Results III: “Corporate Accounts”
• Digital signature/footprint built from System: Content Module and System: Network Module
• Handles: @bostonmarathon, @bostonmarathons, @_bostonmarathon
Results IV: Detect hacked/compromised accounts on Twitter
• Phase I of evaluation
  • Digital signature/footprint of the Twitter handle (System: Content Module, System: Network Module)
• Phase II of evaluation
  • Content comparison is also done between tweets of the compromised account and content from:
    • Other similar Twitter accounts
    • Previous content posted by the account over a significant period of time
‘@AP’ hacked: Phase I Results (System: Content Module). Tweet: “Breaking: Two Explosions in the White House and Barack Obama is injured”. The terms that are absent from news articles but present in the tweets of @AP: [term table not recovered]
‘@AP’ hacked: Phase I Results (System: Content Module). Tweet: “Breaking: Two Explosions in the White House and Barack Obama is injured”. • The terms that are common between tweets and news but have a high difference in TF-IDF scores (average difference is 0.6): [term table not recovered]
‘@AP’ hacked: Phase II
On a regular day, how similar is @AP to @breakingnews, @cnn, @foxnews, @washingtonpost and @Nationnow?
• Solution approach
  • Take the 3,500 most recent tweets of each handle
  • Run the Content Analysis Module over this data set
  • Compute:
    • Percentage of common terms between @AP and the other account handles
    • Average difference in TF-IDF scores between such terms
• Results
  • 40–45% of the topics covered by these news channel accounts coincide
  • These topics showed very high similarity, i.e. a lower difference in TF-IDF scores
  • Uncommon topics were observed to be specific stories followed by the individual channels
‘@AP’ hacked: Phase II Results
• Are the terms in this tweet mentioned by the majority of news channel accounts?
• Tweet: “Breaking: Two Explosions in the White House and Barack Obama is injured”
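The Phase II question above, whether a tweet's terms are echoed by most peer news accounts, could be checked along these lines. This is only a sketch: it assumes each peer account has already been reduced to a set of terms, treats "majority" as strictly more than half of the peers, and uses the hypothetical name `terms_missing_from_majority`.

```python
def terms_missing_from_majority(tweet_terms, peer_terms):
    """Return the tweet terms that a majority of peer accounts do NOT mention.

    tweet_terms: iterable of terms from the suspicious tweet.
    peer_terms: dict mapping peer handle -> set of terms from its recent tweets.
    """
    majority = len(peer_terms) / 2  # a term needs strictly more support than this
    missing = set()
    for term in tweet_terms:
        support = sum(1 for terms in peer_terms.values() if term in terms)
        if support <= majority:
            missing.add(term)
    return missing
```

A large share of missing terms would then be one signal that the tweet is out of line with what similar accounts are reporting.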
Other ‘Hacking’ episodes: Successfully Caught. The @48hours and @60minutes hacks were caught accurately using the same Phase I and Phase II analysis as for @AP.
Other ‘Hacking’ episodes: Successfully Caught
• Compare the tweets from the day of the attack with the handle’s tweets from the past 10 days
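One simple way to picture the comparison above is to measure how much of the attack-day vocabulary never appeared in the preceding 10 days. The sketch below assumes tweets are available as `(date, text)` pairs and uses plain whitespace tokenisation; the actual system compares TF-IDF footprints, so this is a simplified stand-in, and `novel_term_share` is a hypothetical name.

```python
from datetime import date, timedelta


def novel_term_share(tweets, attack_day, window_days=10):
    """Fraction of attack-day terms that are absent from the preceding window.

    tweets: iterable of (datetime.date, text) pairs for a single handle.
    Returns 0.0 when the handle posted nothing on the attack day.
    """
    start = attack_day - timedelta(days=window_days)
    baseline, attack_terms = set(), set()
    for day, text in tweets:
        terms = set(text.lower().split())
        if day == attack_day:
            attack_terms |= terms
        elif start <= day < attack_day:
            baseline |= terms
    if not attack_terms:
        return 0.0
    return len(attack_terms - baseline) / len(attack_terms)
```

A value close to 1.0 means the day's content has almost nothing in common with the handle's recent history.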
Conclusion: System: Content Module and System: Network Module together produce the Digital Signature/Footprint. Authenticate this footprint to flag the account as real or fake/compromised.
Conclusion
• Applicability of the system demonstrated in three flavors:
  • Authenticating ‘famous’ Twitter users
    • Content and network analysis modules are both extremely useful
  • Authenticating ‘less famous’ Twitter users
    • The network analysis module is more relevant
  • Detecting if an existing account is hacked/compromised
    • Only content analysis is relevant in this context
• Content comparison, in the case of compromised accounts, is done between tweets of the compromised account and content from:
  • Reliable web sources
  • Other similar Twitter accounts
  • Content posted by the account over a significant period of time
Future Work
For the three flavors in which our system is usable, some immediate tasks planned are:
• Authenticating ‘famous’ Twitter users
  • Implement a sentiment analysis module in addition to the text analysis module
• Authenticating ‘less famous’ Twitter users
  • Incorporate context to understand who is the “famous”, and hence “trusted”, user in the context of the current user
• Detecting if an existing account is hacked/compromised
  • Build an online system which will:
    • Constantly monitor accounts tweeting similar content
    • Flag if one such account tweets content very different from the others
Future Work
• Gather larger data sets and perform evaluations in each of the above categories
• Extend the system so that it is more applicable to differentiating a layman’s account as real or fake/compromised
Questions?