Crime Scene Investigation: SMS Spam Data Analysis. Roger Piqueras Jover AT&T Security Research Center New York, NY roger.jover@att.com. Ilona Murynets AT&T Security Research Center New York, NY ilona@att.com. IMC’12, November 14–16, 2012, Boston, Massachusetts, USA.
Spam is the commonly adopted name for unwanted messages sent in bulk to a large number of recipients. E-mail spam • 90% of the daily e-mail sent over the Internet is spam • Multiple solutions detect and block it • Only a small amount of spam reaches inboxes SMS spam?
SMS spam • Spammers connect aircards and cellphones to a PC • Yearly growth larger than 500% • Effective anti-abuse messaging filters have been deployed • Content-based algorithms (designed for e-mail) work less efficiently on SMS. Why? • SMS text is full of acronyms, pruned spellings and emoticons • Spammers swap SIM cards once an account is shut down
SMS spam • Consumes network resources that would otherwise be available for legitimate services • The user pays on a per-received-message basis • Exposes smartphone users to viruses • Enables fraudulent messaging activities such as phishing, identity theft and fraud This paper: • The analysis is used for an SMS spam detection engine
Outline • three data sets for analysis • Data analysis • Account information • Messaging Abuse • Response ratio • Message timing and time series • The Scene of the Crime • Location & targets • Mobility • Hardware choice • Voice and IP traffic
Three data sets: SMS spammers (SMS), legitimate cellphone users (cell) and M2M systems (M2M) • Data from a tier-1 cellular operator • Call Detail Records (CDRs) of 9000 SMS spammers and 17000 legitimate accounts (cell and M2M) • Mobile Originated (MO): transmitting party • Mobile Terminated (MT): receiver • Spammers were identified and disconnected from the network • SMS: prepaid; cell: postpaid • M2M: TAC (Type Allocation Code)
Notes • In all the figures throughout the paper, legitimate cellphone users, M2M systems and spammers (SMS) are represented in green, blue and red, respectively.
Account information • Nearly all spammers (99.64%) use pre-paid accounts with unlimited messaging plans • SIM cards are constantly switched to circumvent detection schemes • Spammers discard a SIM card once its account is canceled and work with a new one • The average account age is 7 to 11 days (a legitimate account is several months to a couple of years old)
Messaging Abuse • Spammers generate a large load of messages • Spammers not only send but also receive more messages than legitimate customers do • Some received messages are opt-out requests • Others come from recipients tricked into replying
Messaging Abuse Actual spam messages often attempt to trick the recipient into replying. Although only a small percentage of users reply, the large number of accounts targeted in a spam campaign results in many responses.
Messaging Abuse • Legitimate accounts have a small set of recipients (7 on average) • Spammers hit a couple of thousand victims • Legitimate users send multiple messages to a small set of destinations • Spammers send one message to each victim
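The contrast between the two sending patterns can be sketched from per-account records. This is a minimal illustration, assuming each sent message is reduced to its destination number; the function name `recipient_stats` and the sample data are hypothetical and do not come from the paper.

```python
from collections import Counter

def recipient_stats(destinations):
    """Summarize how one account spreads its sent (MO) messages.

    `destinations` holds one destination number per sent message
    (a hypothetical simplification of a Call Detail Record).
    """
    counts = Counter(destinations)
    n_messages = len(destinations)
    n_recipients = len(counts)
    return {
        "messages": n_messages,
        "recipients": n_recipients,
        # Close to 1 when every message goes to a new destination.
        "destinations_per_message": n_recipients / n_messages if n_messages else 0.0,
    }

# Made-up accounts: a user chatting with three contacts, and a sender
# that contacts 2000 different numbers exactly once each.
legit = recipient_stats(["A", "B", "A", "A", "C", "B", "A", "C"])
spam = recipient_stats([f"victim-{i}" for i in range(2000)])
```

With this toy data the legitimate account reaches 3 recipients with 8 messages, while the spam-like account has one destination per message.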
Response ratio • For legitimate users, messages are sent sequentially in response to a previous message, so the response ratio is close to 1 • For spammers, the number of MT SMSs is very small in proportion to the number of transmitted messages, so the response ratio is close to 0
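The slide does not give a formula, so the sketch below assumes the response ratio is simply received (MT) messages divided by sent (MO) messages. The function name and the counts are hypothetical.

```python
def response_ratio(mo_count, mt_count):
    """Received (MT) messages per sent (MO) message.

    Assumed definition; the paper may compute the ratio differently.
    """
    if mo_count == 0:
        return 0.0  # nothing sent, so there is nothing to respond to
    return mt_count / mo_count

# Made-up counts for one conversational user and one bulk sender.
legit_ratio = response_ratio(mo_count=120, mt_count=115)
spam_ratio = response_ratio(mo_count=5000, mt_count=40)
```

Under this assumed definition the conversational account lands near 1 and the bulk sender near 0, matching the behavior described above.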
Message timing and time series • Inter-SMS intervals for spammers are short and less random (low entropy) • Legitimate messages are sent less frequently and at more random intervals (higher entropy) • Messaging activities of certain M2M devices are prescheduled
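One way to quantify how random the timing is: compute the Shannon entropy of the gaps between consecutive messages. This is a rough sketch, not the paper's procedure; the 10-second bin width, the function name and the timestamps are all assumptions.

```python
import math
from collections import Counter

def interval_entropy(timestamps, bin_seconds=10):
    """Shannon entropy (bits) of binned gaps between consecutive messages.

    `bin_seconds` is an arbitrary illustrative bin width.
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    # sum of p * log2(1/p) over the occupied bins
    return sum((c / total) * math.log2(total / c) for c in bins.values())

machine_like = [i * 5 for i in range(50)]              # one message every 5 s
human_like = [0, 40, 65, 300, 330, 1200, 1215, 4000]   # irregular gaps
```

A sender with perfectly regular gaps puts every interval in one bin and gets zero entropy; irregular gaps spread over many bins and raise the value.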
Location & targets • Spam sources: • California: Sacramento, Orange County and Los Angeles • New York/New Jersey/Long Island • Miami Beach • Illinois and Michigan • North Carolina and Texas
Location & targets • Recipients of legitimate messages are concentrated in a local area (i.e., the area around the subscriber’s home or areas where the subscriber works, used to live or where friends and relatives reside) • Spam recipients are distributed uniformly over the US population
Location & targets • Spammers message a large number of area codes, always more than cellphone users and M2M systems do.
Location & targets • Low entropy (legitimate cell): the user repeatedly contacts the same area codes • High entropy (SMS): the spammer sends messages to a more random set of area codes • Network-enabled appliances (M2M): they contact a predefined set of cellphones, so their entropy is the lowest
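The area-code entropy can be illustrated in the same spirit. This is a sketch under stated assumptions: destinations are 10-digit US numbers whose first three digits are the area code, and the function name and phone numbers are made up.

```python
import math
from collections import Counter

def area_code_entropy(destinations):
    """Shannon entropy (bits) of the area codes of the destination numbers.

    Assumes 10-digit US numbers; the first three digits are the area code.
    """
    codes = Counter(number[:3] for number in destinations)
    total = sum(codes.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in codes.values())

# Made-up destination lists, one entry per sent message.
m2m = ["2125550100"] * 8                            # one fixed destination
cell = ["2125550100"] * 6 + ["9175550101"] * 2      # two nearby area codes
spammer = ["2125550100", "3055550101", "9165550102", "3125550103",
           "7135550104", "9195550105", "3135550106", "7145550107"]
```

On this toy data the ordering matches the slide: the M2M device scores 0 bits, the cellphone user below 1 bit, and the spam-like sender 3 bits (eight equally likely area codes).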
Location & targets • SMS spammers show a linear relation between the number of messages and the number of destinations • Both M2M systems and cellphone users cluster around the bottom-left area of the graph • Some M2M systems send up to 20000 messages to a single destination?
Location & targets • Cellphone users have a low destinations-to-messages ratio and a small set of area codes • A great majority of spammers exhibit the opposite behavior • Spammers in the bottom-right corner target very specific geographical regions: their ratio is about one destination per message, but the number of targeted area codes is limited
Hardware choice • 1. USB Modem/Aircard A1 • 2. Feature mobile-phone M1 • 3. Feature mobile-phone M2 • 4. USB Modem/Aircard A2 • 5. USB Modem/Aircard A3
STOPPING THE CRIME • An advanced SMS spam detection algorithm is proposed based on an ensemble of decision trees • Over 40 specific features are extracted from messaging patterns and processed through a combination of decision trees
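The slides do not describe the trees themselves, so the following is only a toy illustration of majority voting over several small decision trees. It is not the authors' algorithm: the three hand-written trees, the feature names and every threshold are hypothetical, whereas the proposed engine uses over 40 features and its trees would be learned from labeled data.

```python
def tree_account(f):
    # Young pre-paid accounts look suspicious (illustrative threshold).
    if f["prepaid"]:
        return f["account_age_days"] < 30
    return False

def tree_volume(f):
    # Many distinct destinations with about one message each.
    if f["recipients"] > 500:
        return f["destinations_per_message"] > 0.8
    return False

def tree_response(f):
    # Heavy senders that get few replies relative to sent messages.
    if f["messages_sent"] > 100:
        return f["response_ratio"] < 0.1
    return False

TREES = [tree_account, tree_volume, tree_response]

def is_spammer(features):
    """Flag the account when a majority of the toy trees vote 'spam'."""
    votes = sum(tree(features) for tree in TREES)
    return votes > len(TREES) / 2

# Made-up feature vectors for three kinds of accounts.
spam_like = {"prepaid": True, "account_age_days": 9, "recipients": 2000,
             "destinations_per_message": 1.0, "messages_sent": 2000,
             "response_ratio": 0.01}
legit_like = {"prepaid": False, "account_age_days": 700, "recipients": 7,
              "destinations_per_message": 0.05, "messages_sent": 140,
              "response_ratio": 0.95}
m2m_like = {"prepaid": False, "account_age_days": 400, "recipients": 1,
            "destinations_per_message": 0.00005, "messages_sent": 20000,
            "response_ratio": 0.0}
```

The M2M-like account triggers only the response tree, so the vote keeps it from being flagged even though part of its behavior resembles a spammer's. Combining several weak signals is the point of the ensemble.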
CONCLUSIONS • Spammers use pre-paid accounts with an average age of 7 to 11 days • They send a large number of messages to a wide set of targets (and also receive a large number) • They use five different models of hardware • They make a large number of phone calls of very short duration • Main geographical sources in the US: Sacramento, Los Angeles-Orange County and Miami Beach • Certain networked appliances have messaging behavior close to that of a spammer