5th Symposium on Information Systems Assurance
Data Mining of E-Mails to Support Periodic & Continuous Assurance
Glen L. Gray, California State University at Northridge
Roger Debreceny, University of Hawai`i at Mānoa
Toronto: October 2007
In this Presentation • Continuous monitoring of emails – why? • Technologies • Social Network Analysis • Text analysis • Challenges • Opportunities
Continuous Monitoring of Emails – Why? • Increased focus on forensic approaches to auditing • Increased interest in continuous assurance and monitoring of business processes • Emails = Organization’s DNA • Evidential matter on: • Employee & management fraud (overrides) • Compliance (e.g., HIPAA) • Loss of intellectual property • Corporate policies
Enron Email Archive • Released by Federal Energy Regulatory Commission • 500K emails • 151 Enron employees • Cleaned version at Carnegie Mellon: www.cs.cmu.edu/~enron/ • Relational DB version at USC: www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf
Key Word Queries • Yes, people do say self-incriminating things in their emails • Fraud • Corporate dysfunction • Overwhelming false positives • Need “smart” compound queries • Good continuous auditing (CA) candidate • Already scanning for spam, porn, etc.
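A compound query can be sketched as requiring terms from two different groups to appear in the same message, which is one way to cut the false positives of single-keyword scans. The term lists and the `flag_email` name below are illustrative assumptions, not taken from the presentation.

```python
import re

# Hypothetical term groups: both must match in the same message.
# These word lists are made up for illustration only.
CONCEALMENT = re.compile(r"\b(delete|shred|off the books|keep this quiet)\b", re.I)
FINANCE = re.compile(r"\b(invoice|reserve|write-?off|earnings|audit)\b", re.I)

def flag_email(body: str) -> bool:
    """Return True only when a concealment term AND a finance term co-occur."""
    return bool(CONCEALMENT.search(body)) and bool(FINANCE.search(body))

emails = [
    "Please delete the old lunch menu from the shared drive.",
    "Keep this quiet until the audit is over.",
    "The audit starts Monday.",
]
# A single-keyword scan would hit all three messages; the compound rule keeps one.
flagged = [e for e in emails if flag_email(e)]
```

In practice the groups would likely need tuning per organization, and a proximity window (terms within N words of each other) may work better than whole-message co-occurrence.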
Sender Deception -- Content • Deceptive emails include: • Fewer first-person pronouns to dissociate themselves from their own words • Fewer exclusive words, such as but and except, to indicate a less complex story • More negative emotion words because of the sender’s underlying feeling of guilt • More action verbs to, again, indicate a less complex story
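The content cues above could be measured as word-category rates per message. The sketch below is a minimal illustration using tiny made-up lexicons (assumptions, not the dictionaries used in the underlying research); it omits action verbs, which would need a part-of-speech tagger or a verb lexicon.

```python
import re

# Tiny illustrative lexicons; real studies use much larger dictionaries.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without", "unless"}
NEGATIVE = {"hate", "worthless", "angry", "afraid", "sorry"}

def cue_rates(text: str) -> dict:
    """Return the rate of each cue category per 100 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty messages

    def rate(lexicon):
        return 100 * sum(w in lexicon for w in words) / total

    return {
        "first_person": rate(FIRST_PERSON),
        "exclusive": rate(EXCLUSIVE),
        "negative": rate(NEGATIVE),
    }

rates = cue_rates("I was afraid, but I told my manager.")
```

Rates like these are only meaningful relative to a baseline, for example the same sender's earlier messages.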
Sender Deception -- Identification • Writeprint features • Lexical -- characters & words • Function words • Root words • Syntactic -- sentences • Structural -- paragraphs • Content-specific
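As a rough illustration of the feature families listed above, the sketch below extracts a handful of lexical, function-word, syntactic, and structural measures from one message. The specific features, the `writeprint` function name, and the six-word function-word list are assumptions chosen for brevity; root-word and content-specific features are left out.

```python
import re

FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that")  # illustrative subset

def writeprint(text: str) -> dict:
    """Extract a small stylometric feature vector from one message."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n = len(words) or 1
    features = {
        "avg_word_len": sum(map(len, words)) / n,                  # lexical
        "words_per_sentence": len(words) / (len(sentences) or 1),  # syntactic
        "paragraphs": float(len(paragraphs)),                      # structural
    }
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = words.count(fw) / n                 # function words
    return features

sample = "The cat sat. The dog ran to the park.\n\nThat is all."
features = writeprint(sample)
```

A profile could then be the average of these vectors over a sender's known messages, with new messages compared by some distance measure.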
Sender Deception -- Identification • Number of potential features unlimited • Optimum number can vary by context and language • Developing user profiles and comparing new emails to profiles would be challenging for real-time CA
Volume & Velocity • Volume = number of emails a person sends and/or receives over a period of time. • Velocity = how quickly the volume changes. • Many external factors (e.g., vacations, seasonal activities, etc.) impact these numbers • Need “rolling histogram”
Volume & Velocity • Key issue -- determining the optimum time intervals to sample the data • Continuous monitoring cannot be continuous in terms of sampling in real time • Comparing hourly, daily, and even weekly volumes and velocities will result in many false positives • Optimum time interval could vary by job title
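One way to read the "rolling histogram" idea is a trailing baseline: compare each period's volume with the mean and spread of the preceding periods rather than with a fixed limit. The sketch below assumes volumes have already been counted per period; `window`, `k`, and the minimum spread of 1.0 are hypothetical tuning parameters, not values from the presentation.

```python
from statistics import mean, pstdev

def velocity(counts):
    """Period-over-period change in volume."""
    return [b - a for a, b in zip(counts, counts[1:])]

def unusual_periods(counts, window=5, k=3.0):
    """Indexes whose volume exceeds the trailing mean by k standard deviations."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        # Floor sigma at 1.0 so a perfectly flat baseline does not flag tiny changes.
        if counts[i] > mu + k * max(sigma, 1.0):
            flagged.append(i)
    return flagged

daily_sent = [10, 12, 11, 9, 10, 11, 40, 10]  # made-up counts for one employee
spikes = unusual_periods(daily_sent)
changes = velocity(daily_sent[:3])
```

The period length (hour, day, week) and the window size would be the parameters to vary by job title.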
Social Network Analysis • Social relationships as an undirected graph • Importance of understanding relationships within the flow of email exchanges
Social Network Analysis in Emails • Emails semi-structured data • sender • primary recipient(s) • copied recipient(s) • date • subject line • Social groups and cliques • CA = who doesn’t belong?
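The header fields above are enough to build a simple undirected graph of who has corresponded with whom. The sketch below then asks a crude version of "who doesn't belong?": which participants in a new message have no prior tie to anyone else on it. The message dictionaries, function names, and the rule itself are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def participants(msg):
    """Everyone named in the sender, to, and cc fields of one message."""
    return {msg["sender"], *msg["to"], *msg["cc"]}

def build_graph(messages):
    """Undirected graph: an edge links any two people on the same message."""
    graph = defaultdict(set)
    for msg in messages:
        for a, b in combinations(participants(msg), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def outsiders(graph, msg):
    """Participants with no existing edge to any other participant on msg."""
    people = participants(msg)
    return sorted(p for p in people if not graph.get(p, set()) & (people - {p}))

history = [
    {"sender": "ann", "to": ["bob"], "cc": ["cat"]},
    {"sender": "bob", "to": ["cat"], "cc": []},
]
new_msg = {"sender": "ann", "to": ["bob", "zed"], "cc": []}
unknown = outsiders(build_graph(history), new_msg)
```

A fuller treatment would weight edges by message count and date, and would use proper clique or community detection instead of this single-message check.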
Thread Analysis – This? [Figure: email thread diagram with nodes labeled S, R, and C plotted along a time axis]
Thread Analysis – Or this? [Figure: alternative email thread diagram with nodes labeled S, R, and C plotted along a time axis]
Challenges of Email Mining • Textual • Inconsistent use of abbreviations • Misspelled words • Smileys etc. etc. • Replies, replies, and more replies… • Inability to identify: • Identities of email participants • anon@anon.mail.sender.net • Roles and responsibilities
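The "replies, replies, and more replies" problem can be partly handled by stripping quoted text before analysis, so that earlier messages are not counted again each time they are quoted. The sketch below assumes two common conventions: quoted lines prefixed with ">" and an "-----Original Message-----" separator. Mail clients vary, so this is a starting point and not a complete cleaner.

```python
import re

# Separator line that many mail clients insert above the quoted message.
REPLY_MARKER = re.compile(r"^\s*-+\s*Original Message\s*-+\s*$", re.I)

def strip_quoted(body: str) -> str:
    """Keep only the text the sender wrote in this message."""
    kept = []
    for line in body.splitlines():
        if REPLY_MARKER.match(line):
            break  # everything below the separator is the earlier message
        if line.lstrip().startswith(">"):
            continue  # skip inline quoted lines
        kept.append(line)
    return "\n".join(kept).strip()

raw = "Sounds good.\n> earlier text\n\n-----Original Message-----\nFrom: x\nOld body"
clean = strip_quoted(raw)
```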
What Do the Enron Emails Show? • People do say the darndest things • What did he know and when did he know it? • Verified numerous bodies of email data mining research • Content analysis • Social network analysis
Tools • Content monitoring • eSoft Corporation’s ThreatWall • Symantec’s Mail Security 8x00 Series • Vericept Corporation’s Vericept Content 360º • Reconnex Corporation’s iGuard Appliance • InBoxer, Inc. Anti-Risk Appliance • Social networks • Microsoft SNARF • Heer Vizster
Research Questions • Role of email monitoring in overall CA environment? • Join SNA with examination of textual patterns. • Link SNA with control environment • Frauds/control overrides footprint? • What email cleaning is required for CA purposes? • Privacy and policy issues? • Lessons from existing commercial products?
Your Questions • Thank You • glen.gray@csun.edu • rogersd@hawaii.edu