Measurement and Classification of Humans and Bots in Internet Chat

Jhih-sin Jheng, 2009/09/01, Machine Learning and Bioinformatics Laboratory

Reference • Steven Gianvecchio, Mengjun Xie, Zhenyu Wu, and Haining Wang, "Measurement and Classification of Humans and Bots in Internet Chat," Department of Computer Science, The College of William and Mary, USENIX Security, 2008

Outline • Background • Measurement • Classification System • Experimental Evaluation • Conclusion

Chat Bots vs. BotNets • BotNets – networks of compromised machines • some use chat systems (IRC) for C&C, others use P2P, HTTP, etc. • abuse various systems • Chat Bots – automated chat programs • some are helpful, e.g., chat loggers • can abuse chat systems and their users • send spam, spread malicious software, mount phishing attacks • Our focus is on the Yahoo! Chat system.

Measurement • August-November 2007 – we collect data • August 2007 – Yahoo! adds CAPTCHA • very few chat bots • October 2007 – bots are back

Measurement • August and November 2007 • many chat bots • 1,440 hours of chat logs • 147 chat logs • 21 chat rooms

Measurement • To create our dataset, we read and label the chat users as • human, bot, or ambiguous • In total, we recognized 14 different types of chat bots • different triggering mechanisms • different text generation techniques

Types of Chat Bots • Periodic Bots – send messages based on periodic timers • Random Bots – send messages based on random timers • Responder Bots – respond to messages of other users • Replay Bots – replay messages of other users

Humans • inter-message delay – evidence of heavy tail • message size – well fit by Exponential (λ=0.034)
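The exponential fit quoted for human message sizes can be illustrated with a short sketch. It assumes the rate is estimated by maximum likelihood (one over the sample mean); the slide does not say how the fit was obtained. The function name and the synthetic data are hypothetical; only λ=0.034 comes from the slide.

```python
import random

def fit_exponential_rate(samples):
    """Maximum-likelihood estimate of an exponential rate: 1 / sample mean."""
    if not samples:
        raise ValueError("need at least one sample")
    mean = sum(samples) / len(samples)
    if mean <= 0:
        raise ValueError("sample mean must be positive")
    return 1.0 / mean

# Synthetic message sizes drawn with the rate quoted on the slide (0.034).
# This is illustrative data, not the measured chat logs.
rng = random.Random(42)
synthetic_sizes = [rng.expovariate(0.034) for _ in range(20000)]
estimated_rate = fit_exponential_rate(synthetic_sizes)
```

With λ=0.034 the mean message size is 1/0.034, roughly 29 size units.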

Periodic Bots • inter-message delay – several clusters with high probabilities • message size – messages built from templates approximate a normal distribution

Random Bots • inter-message delay – Equilikely distribution at 40, 64, and 88; Uniform distribution 45-125 • message size – messages selected from a small database
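To make the two random-timer behaviours concrete, the sketch below draws inter-message delays from an equilikely distribution over 40, 64, and 88 and from a uniform distribution over 45-125, the values on this slide. The function names and seeds are hypothetical, and the slide does not state the time unit.

```python
import random

def equilikely_delays(n, values=(40, 64, 88), seed=0):
    """Draw n delays where each listed value is equally likely."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(n)]

def uniform_delays(n, low=45, high=125, seed=0):
    """Draw n delays uniformly between low and high."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]

equilikely_sample = equilikely_delays(1000)
uniform_sample = uniform_delays(1000)
```

The equilikely sample takes only three distinct values, while the uniform sample spreads evenly over the range. Either way the delays lack the heavy tail seen for humans.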

Responder Bots • inter-message delay – human-like timing • message size – multiple templates of different lengths

Replay Bots • inter-message delay – cluster with high probabilities (replay bots are periodic) • message size – human-like size, well fit by Exponential (λ=0.028)

Classification System • Entropy Classifier • detects abnormal behavior • based on message sizes and inter-message delays • accurate but slow • Machine Learning Classifier • detects “learned” patterns • based on message content • fast but must be trained

Entropy Classifier • Observation – chat bots are less complex than humans, and thus lower in entropy • exploits the low entropy of chat bots • Entropy Test (EN) – estimates first-order entropy • Corrected Conditional Entropy Test (CCE) – estimates higher-order entropy
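A minimal sketch of the two entropy estimates follows. The slide gives no formulas, so this assumes the commonly cited formulation: the conditional entropy of length-m patterns, plus the fraction of patterns seen only once multiplied by the first-order entropy, minimised over m. The equal-width binning, the bin count, and `max_order=5` are hypothetical choices, not values from the slides.

```python
import math
import random
from collections import Counter

def equal_width_bins(values, num_bins):
    """Map numeric values to bin indices 0..num_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

def shannon_entropy(items):
    """First-order (EN-style) entropy in bits of a list of hashable items."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def corrected_conditional_entropy(symbols, max_order=5):
    """Minimum over pattern length m of CE(m) + perc(m) * EN(first order)."""
    first_order = shannon_entropy(symbols)
    best = first_order
    for m in range(2, max_order + 1):
        if len(symbols) < m:
            break
        pats = [tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1)]
        prefixes = [p[:-1] for p in pats]
        # Conditional entropy: joint entropy minus entropy of the prefixes.
        cond = shannon_entropy(pats) - shannon_entropy(prefixes)
        # Fraction of length-m patterns that occur exactly once.
        singles = sum(1 for c in Counter(pats).values() if c == 1)
        best = min(best, cond + (singles / len(pats)) * first_order)
    return best

periodic_symbols = [0, 1] * 50                      # highly regular sequence
rng = random.Random(1)
random_symbols = [rng.randrange(5) for _ in range(1000)]
cce_periodic = corrected_conditional_entropy(periodic_symbols)
cce_random = corrected_conditional_entropy(random_symbols)
```

The alternating sequence has a first-order entropy of 1 bit but a corrected conditional entropy near zero, which shows why a higher-order test can catch regular behaviour that a first-order test misses.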

Machine Learning Classifier • Observation – chat spam, like email spam, is a text classification problem • exploits message content of chat bots • CRM114 • a powerful text classification system
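The idea of treating chat spam as text classification can be sketched with a small word-count classifier. This is a plain naive Bayes stand-in, not CRM114 and not its API; the class name, the training messages, and the Laplace smoothing are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Toy multinomial naive Bayes over whitespace tokens (stand-in, not CRM114)."""

    LABELS = ("bot", "human")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        vocab = set(self.word_counts["bot"]) | set(self.word_counts["human"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed class prior plus smoothed per-token likelihoods.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        denom = total_words + len(vocab) + 1
        for token in tokens:
            score += math.log((self.word_counts[label][token] + 1) / denom)
        return score

    def classify(self, text):
        tokens = text.lower().split()
        return max(self.LABELS, key=lambda label: self._log_score(tokens, label))

# Hypothetical training messages, invented for this example.
clf = NaiveBayesTextClassifier()
clf.train("visit my webcam at example dot com", "bot")
clf.train("free pics click here now", "bot")
clf.train("hey how is everyone tonight", "human")
clf.train("lol that is so funny", "human")
```

For instance, `clf.classify("click here for free pics")` leans toward the bot class because most of its words appear only in the bot training messages.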

Hybrid Classification System • entropy classifier builds and maintains the bot corpus • machine learning classifier uses the bot and human corpora • [Figure: block diagram with Input, Entropy Classifier, Machine Learning Classifier, Bot Corpus, Human Corpus, and the outputs Classify as Chat Bot and Classify as Human]
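One way the two classifiers could be wired together is sketched below. The control flow is an assumption: the slides only say that the entropy classifier feeds the bot corpus and that the machine learning classifier uses both corpora. The class, its parameters, and the two toy decision functions are hypothetical. The window sizes of 25 and 100 messages are the ones quoted in the evaluation slides.

```python
class HybridClassifier:
    """Sketch: fast ML check first, slower entropy check feeds the bot corpus."""

    def __init__(self, entropy_says_bot, ml_says_bot, human_corpus,
                 ml_window=25, entropy_window=100):
        self.entropy_says_bot = entropy_says_bot  # callable(messages) -> bool
        self.ml_says_bot = ml_says_bot            # callable(messages, bots, humans) -> bool
        self.bot_corpus = []
        self.human_corpus = list(human_corpus)    # assumed to be supplied up front
        self.ml_window = ml_window
        self.entropy_window = entropy_window

    def classify(self, messages):
        # Fast path: known bots are recognised from a short window.
        if len(messages) >= self.ml_window and self.bot_corpus:
            window = messages[:self.ml_window]
            if self.ml_says_bot(window, self.bot_corpus, self.human_corpus):
                return "bot"
        # Slow path: the entropy check needs more messages but no training.
        if len(messages) >= self.entropy_window:
            window = messages[:self.entropy_window]
            if self.entropy_says_bot(window):
                self.bot_corpus.extend(window)    # newly found bot text joins the corpus
                return "bot"
            return "human"
        return "undecided"

# Toy stand-ins for the real classifiers (illustrative only).
def toy_entropy_says_bot(messages):
    return len(set(messages)) <= 3               # very little variety

def toy_ml_says_bot(messages, bot_corpus, human_corpus):
    known = set(bot_corpus)
    return sum(m in known for m in messages) > len(messages) / 2

hybrid = HybridClassifier(toy_entropy_says_bot, toy_ml_says_bot, human_corpus=[])
first_verdict = hybrid.classify(["buy now"] * 100)    # found by the entropy path
second_verdict = hybrid.classify(["buy now"] * 25)    # found by the fast ML path
human_verdict = hybrid.classify([f"hello {i}" for i in range(100)])
short_verdict = hybrid.classify(["hi"] * 10)
```

In this sketch an unknown bot is caught only after 100 messages, but once its text is in the bot corpus the same bot is caught after 25.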

Experimental Evaluation • Types of Chat Bots • Periodic Bots • Random Bots • Responder Bots • Replay Bots • Classifiers • entropy classifier – 100 messages • machine learning classifier – 25 messages

Experimental Evaluation • Classification Tests • Ent – entropy classifier • SupML – fully-supervised ML classifier, trained on AUG BOTS • SupMLre – fully-supervised ML classifier, retrained on NOV BOTS • EntML – entropy-trained ML on AUG BOTS

Entropy Classifier • EN – entropy • CCE – corrected conditional entropy • (imd) – inter-message delay • (ms) – message size

EN(imd) and CCE(imd) • problems against responder bots • detect most other chat bots

EN(ms) and CCE(ms) • problems against random and replay bots • detect most other chat bots

OVERALL • detects all chat bots • false positive rate is ~0.01 • 100 messages

Entropy and Machine Learning Classifiers • Ent – entropy classifier (from last slide) • SupML – fully-supervised ML classifier, trained on AUG BOTS • SupMLre – fully-supervised ML classifier, retrained on NOV BOTS • EntML – entropy-trained ML on AUG BOTS

Ent • OVERALL results from previous slide

SupML • has problems against November bots • needs to be retrained for new bots • SupMLre • detects all bots

EntML • false positive rate is ~0.0005 • (Ent is ~0.01) • 25 messages
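For readers who want to reproduce this kind of figure, the sketch below computes true and false positive rates from labelled predictions, with "bot" as the positive class. The function name and the ten-user example are hypothetical; only the two rates in the final line come from the slides.

```python
def detection_rates(true_labels, predicted_labels):
    """Return (true_positive_rate, false_positive_rate); "bot" is positive."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == "bot":
            if guess == "bot":
                tp += 1
            else:
                fn += 1
        else:
            if guess == "bot":
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Hypothetical labels for ten users: four bots, six humans.
truth = ["bot"] * 4 + ["human"] * 6
guess = ["bot", "bot", "bot", "human",
         "bot", "human", "human", "human", "human", "human"]
tpr, fpr = detection_rates(truth, guess)

# Ratio of the two false positive rates quoted on the slides.
improvement = 0.01 / 0.0005
```

Going from ~0.01 to ~0.0005 is roughly a twentyfold reduction in humans misclassified as bots.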

Conclusion • Measurements • overall, chat bots are less complex than humans • some chat bots more human-like • Classification System • exploits benefits of both classifiers • quickly classifies known chat bots • accurately classifies unknown chat bots

Thank you!
