Crime Scene Investigation: SMS Spam Data Analysis

Crime Scene Investigation: SMS Spam Data Analysis. Roger Piqueras Jover, AT&T Security Research Center, New York, NY, roger.jover@att.com. Ilona Murynets, AT&T Security Research Center, New York, NY, ilona@att.com. IMC’12, November 14–16, 2012, Boston, Massachusetts, USA.

Presentation Transcript


  1. Crime Scene Investigation: SMS Spam Data Analysis. Roger Piqueras Jover, AT&T Security Research Center, New York, NY, roger.jover@att.com. Ilona Murynets, AT&T Security Research Center, New York, NY, ilona@att.com. IMC’12, November 14–16, 2012, Boston, Massachusetts, USA.

  2. Spam is the commonly adopted name for unwanted messages sent in bulk to a large number of recipients. E-mail spam • 90% of the daily e-mail sent over the Internet is spam • multiple solutions detect and block it • only a small amount of spam reaches inboxes. What about SMS spam?

  3. SMS-spam • spammers connect aircards & cell phones to a PC • yearly growth larger than 500% • effective anti-abuse messaging filters are in place • content-based algorithms (designed for e-mail) work less efficiently on SMS. Why? • SMS text is full of acronyms, pruned spellings and emoticons • when an account is shut down, the spammer swaps the SIM

  4. SMS-spam • consumes network resources that would otherwise be available for legitimate services • the user pays on a per-received-message basis • exposes smartphone users to viruses • enables fraudulent messaging activities such as phishing, identity theft and fraud. This paper: • the analysis is used for an SMS spam detection engine

  5. Outline • three data sets for analysis • Data analysis • Account information • Messaging Abuse • Response ratio • Message timing and time series • The Scene of the Crime • Location & targets • Mobility • Hardware choice • Voice and IP traffic

  6. Three data sets: SMS spammers, cell, M2M • from a tier-1 cellular operator • Call Detail Records (CDR) of 9000 SMS spammers & 17000 legitimate accounts (cell & M2M) • Mobile Originated (MO): transmitting party • Mobile Terminated (MT): receiver • Spammers were identified & disconnected from the network. • SMS spammers: prepaid • cell: postpaid • M2M: TAC (Type Allocation Code)

  7. three data sets for analysis

  9. Notes • In all the figures throughout the paper, legitimate cellphone users, M2M systems and SMS spammers are represented in green, blue and red, respectively.

  10. Account information • almost all spammers (99.64%) use pre-paid accounts with unlimited messaging plans • SIM cards are constantly switched to circumvent detection schemes • once an account is canceled, the spammer discards the SIM and works with a new one • the average account age is 7 to 11 days (for legitimate users it is several months to a couple of years)

  12. Messaging Abuse

  13. Messaging Abuse • Spammers generate a large load of messages • Spammers not only send but also receive more messages than legitimate customers do • replies come from recipients who opt out or who are tricked into answering

  14. Messaging Abuse Actual spam messages often attempt to trick the recipient into replying to the message. Although only a small percentage of users reply, the large number of accounts targeted in a spam campaign results in many responses.

  15. Messaging Abuse

  16. Messaging Abuse • legitimate accounts have a small set of recipients (7 on average) • spammers hit a couple of thousand victims • legitimate users send multiple messages to a small set of destinations • spammers send one message to each victim

  18. Response ratio

  19. Response ratio • For legitimate users, messages are sent sequentially in response to a previous message, so the response ratio is close to 1. • For spammers, the number of MT SMSs is very small in proportion to the number of transmitted messages, so the response ratio is close to 0.
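
The slide does not give a formula for the response ratio. A minimal sketch, assuming it is the number of received (MT) messages divided by the number of sent (MO) messages for one account; the function name and the example counts are hypothetical and are not taken from the data sets:

```python
def response_ratio(sent: int, received: int) -> float:
    """Received (MT) messages divided by sent (MO) messages for one account.

    Assumed definition: the slides only say the ratio is close to 1 for
    legitimate users and close to 0 for spammers.
    """
    if sent == 0:
        return 0.0  # no outgoing traffic: ratio is undefined, reported as 0 here
    return received / sent


# Hypothetical per-account counts, for illustration only.
legitimate = response_ratio(sent=40, received=38)   # conversational traffic
spammer = response_ratio(sent=5000, received=60)    # many sent, few replies
```

Under this assumed definition a spammer can receive more messages than a legitimate user in absolute terms and still have a ratio near 0, because the number of sent messages is so much larger.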

  21. Message timing and time series

  22. Message timing and time series

  23. Message timing and time series • Inter-SMS intervals for spammers are short and less random (low entropy). • Intervals between legitimate messages are longer and more random (higher entropy). • Messaging activities of certain M2M devices are prescheduled.
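
One way to put a number on "less random" is the Shannon entropy of the gaps between consecutive messages. The sketch below is an illustration only: the 10-second bin width, the function name and the example timestamps are assumptions, and the slides do not say how the entropy was computed in the paper.

```python
import math
from collections import Counter


def interval_entropy(timestamps, bin_seconds=10):
    """Shannon entropy (bits) of binned gaps between consecutive send times.

    timestamps: send times in seconds. bin_seconds is an assumed bin width.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    # sum of p * log2(1/p) over the occupied bins
    return sum((n / total) * math.log2(total / n) for n in bins.values())


# Hypothetical send times in seconds, for illustration only.
machine_like = [0, 5, 10, 15, 20, 25]        # one message every 5 seconds
human_like = [0, 42, 50, 380, 1900, 1915]    # irregular gaps
```

A sender with a fixed sending rate puts every gap in the same bin and gets an entropy of 0; irregular gaps spread over several bins and push the value up.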

  24. Message timing and time series

  26. Location & targets

  27. Location & targets • California: Sacramento, Orange County and Los Angeles • New York/New Jersey/Long Island • Miami Beach • Illinois, Michigan • North Carolina and Texas.

  28. Location & targets

  29. Location & targets • Legitimate recipients are concentrated in the subscriber's local area (i.e. the area around the subscriber’s home or areas where the subscriber works, used to live or where friends and relatives reside). • Spam recipients are distributed uniformly over the US population.

  30. Location & targets

  31. Location & targets • Spammers message a large number of area codes, always more than cell-phone users and M2M systems do.

  32. Location & targets

  33. Location & targets • Low entropy (legitimate cell users): the same area codes are contacted repeatedly. • High entropy (SMS spammers): messages are sent to a more random set of area codes. • Network-enabled appliances (M2M) message a predefined set of cell-phones, so their entropy is the lowest.

  34. Location & targets

  35. Location & targets • SMS spammers follow a linear relation. • Both M2M systems and cell-phone users cluster around the bottom-left area of the graph. • M2M devices send up to 20000 messages to a single destination?

  36. Location & targets

  37. Location & targets • Cellphone users show a low destinations-to-messages ratio and a small set of area codes. • A great majority of spammers exhibit the opposite behavior. • Spammers in the bottom-right corner target very specific geographical regions: their ratio is one destination per message, but the number of targeted area codes is limited.
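
The two quantities compared on this slide can be computed from a list of destination numbers, one entry per sent message. This is a sketch under stated assumptions: destinations are 10-digit US numbers whose first three digits are the area code, and the function name and example numbers are hypothetical.

```python
def destination_profile(destinations):
    """Return (destinations-to-messages ratio, number of distinct area codes).

    destinations: one 10-digit number (as a string) per sent message.
    Assumes the first three digits are the area code.
    """
    if not destinations:
        return 0.0, 0
    ratio = len(set(destinations)) / len(destinations)
    area_codes = {number[:3] for number in destinations}
    return ratio, len(area_codes)


# Hypothetical traffic, for illustration only.
cell_user = ["2125550101"] * 6 + ["2125550102"] * 4           # few contacts, many messages each
regional_spammer = [f"305555{i:04d}" for i in range(1000)]    # one message per victim, one area code
```

Here `destination_profile(cell_user)` gives a ratio of 0.2 with one area code, while `destination_profile(regional_spammer)` gives a ratio of 1.0 with one area code, which corresponds to the bottom-right group described on the slide.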

  39. Mobility

  40. Mobility

  42. Hardware choice • 1. USB Modem/Aircard A1 • 2. Feature mobile-phone M1 • 3. Feature mobile-phone M2 • 4. USB Modem/Aircard A2 • 5. USB Modem/Aircard A3

  44. Voice call

  45. Voice call

  46. IP traffic

  47. Voice call

  48. IP traffic

  49. STOPPING THE CRIME • An advanced SMS spam detection algorithm is proposed based on an ensemble of decision trees • Over 40 specific features are extracted from messaging patterns and processed through a combination of decision trees
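
The slides do not describe the detection algorithm beyond "an ensemble of decision trees", so the following is only a toy illustration of the idea: three hand-written depth-1 trees (stumps) that each test one messaging feature, combined by majority vote. The feature names, thresholds and function names are all hypothetical, and nothing here is learned from data.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]


# Each "tree" is a depth-1 stump with a hand-picked, hypothetical threshold.
def few_replies(f: Features) -> bool:
    return f["response_ratio"] < 0.1


def one_message_per_destination(f: Features) -> bool:
    return f["dest_to_msg_ratio"] > 0.9


def young_account(f: Features) -> bool:
    return f["account_age_days"] < 14


TREES: List[Callable[[Features], bool]] = [
    few_replies,
    one_message_per_destination,
    young_account,
]


def is_spammer(features: Features) -> bool:
    """Flag the account when more than half of the trees vote 'spammer'."""
    votes = sum(tree(features) for tree in TREES)
    return votes > len(TREES) / 2


# Hypothetical accounts, for illustration only.
suspect = {"response_ratio": 0.01, "dest_to_msg_ratio": 1.0, "account_age_days": 9}
regular = {"response_ratio": 0.95, "dest_to_msg_ratio": 0.2, "account_age_days": 400}
```

A real ensemble would learn many deeper trees from labeled accounts and use far more than three features; the vote is the only part this sketch has in common with it.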

  50. CONCLUSIONS • spammers use pre-paid accounts with an average age between 7 and 11 days • a large number of messages is sent to a wide set of targets (spammers also receive a large number of messages) • five different models of hardware • a large number of phone calls of very short duration • main geographical sources in the US: Sacramento, Los Angeles-Orange County and Miami Beach • certain networked appliances have messaging behavior close to that of a spammer.
