Spambase data

Spambase data: Represent an email by a vector

Spambase dataset • Data Set characteristics: Multivariate • Number of Instances: 4601 • Number of spam emails: 1813 • Number of ham emails: 2788 • Zero-rule accuracy: 60.6% • Attribute Characteristics: Integer, Real • Number of Attributes: 57 • Associated Tasks: Classification • Missing Values? Yes
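The zero-rule accuracy quoted above can be checked from the class counts. The short sketch below assumes that "zero-rule" means always predicting the majority class (ham); the variable names are illustrative only.

```python
# Class counts as listed in the dataset summary.
n_spam = 1813
n_ham = 2788
n_total = n_spam + n_ham

# Zero-rule baseline (assumed meaning): always predict the majority class.
zero_rule_accuracy = max(n_spam, n_ham) / n_total

print(f"{n_total} emails, zero-rule accuracy = {zero_rule_accuracy:.1%}")
# prints: 4601 emails, zero-rule accuracy = 60.6%
```

Any classifier trained on this data should beat this baseline to be considered useful.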

Text to vector • The original emails are text files of different lengths. • To eliminate the problems associated with learning to classify objects of different lengths, each email is transformed into a vector of 57+1 dimensions. • The last dimension represents the class label: 1 for spam and 0 for ham.

Key words • A set of 48 words was chosen from among the contents of the training set. • These words were deemed relevant for distinguishing between spam and ham. • They are as follows: make, address, all, 3d, our, over, remove, internet, order, mail, receive, will, people, report, addresses, free, business, email, you, credit, your, font, 000, money, hp, hpl, george, 650, lab, labs, telnet, 857, data, 415, 85, technology, 1999, parts, pm, direct, cs, meeting, original, project, re, edu, table, conference.

Frequency variables for words • Given an email text and a particular WORD, we calculate its frequency, i.e., the percentage of words in the email that match WORD: word_freq_WORD = 100*r/t, where • r is the number of times WORD appears in the email • t is the total number of words in the email. • For example, the word ‘make’ never occurs in email 1, so word_freq_make = 0, while 0.64% of the words in email 1 are the word ‘address’, so word_freq_address = 0.64 for email 1. • There are 48 such frequency variables. Each corresponds to a dimension (an axis) in the Euclidean training space.
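The formula word_freq_WORD = 100*r/t can be sketched in a few lines of Python. The tokenization here is an assumption, not something the slides specify: a "word" is taken to be a maximal run of letters and digits, and matching is case-insensitive. The function name word_freq is hypothetical.

```python
import re

def word_freq(text: str, word: str) -> float:
    """Return 100 * r / t, the percentage of words in `text` equal to `word`."""
    # Assumed tokenization: runs of alphanumeric characters, lower-cased.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero for an empty email
    r = tokens.count(word.lower())  # occurrences of WORD
    t = len(tokens)                 # total number of words
    return 100.0 * r / t

email = "Free money, free offer"
print(word_freq(email, "free"))  # 2 of 4 words -> 50.0
print(word_freq(email, "make"))  # word absent  -> 0.0
```

Calling this function once per key word would give the 48 word-frequency dimensions for one email.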

word_freq_george histogram • 80 emails have a "george" frequency in the (19.9, 20.6] % range. • Red: ham; Blue: spam

Frequency variables for special characters • Six characters are also chosen as special. They are ; ( [ ! $ # • Frequency variables are also created for these. • The training space is now 48 + 6 = 54 dimensional.
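A minimal sketch of the character-frequency variables follows. The slides do not give the formula, so this assumes, by analogy with the word frequencies, that each value is 100 times the number of occurrences of the character divided by the total number of characters in the email. The names SPECIAL_CHARS and char_freqs are hypothetical.

```python
# The six special characters listed on the slide.
SPECIAL_CHARS = ";([!$#"

def char_freqs(text: str) -> dict:
    """Map each special character to its percentage of all characters in `text`."""
    total = len(text)  # assumption: every character counts, including spaces
    return {
        c: (100.0 * text.count(c) / total if total else 0.0)
        for c in SPECIAL_CHARS
    }

freqs = char_freqs("Buy now!!$")  # 10 characters in total
print(freqs["!"])  # 2 of 10 -> 20.0
print(freqs["$"])  # 1 of 10 -> 10.0
```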

Last 3 variables • Given an email text, three more variables are created. • Average length of uninterrupted sequences of capital letters • Longest uninterrupted sequence length of capital letters • Total number of capital letters in the email. • Altogether, there are 57 attributes (variables) to describe an email, • plus 1 attribute for the class label. • This is how text emails are transformed into 58-dimensional vectors. • For this spam email problem, 4601 emails were transformed and stored in the UCI Repository.
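The three capital-letter variables can be sketched as below. This is an illustration under assumptions: a "sequence" is taken to be a maximal run of the ASCII letters A to Z, and an email with no capitals is given zeros. The function name capital_run_features is hypothetical.

```python
import re

def capital_run_features(text: str) -> tuple:
    """Return (average run length, longest run length, total capitals)."""
    # Lengths of all uninterrupted runs of capital letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0  # assumed convention when there are no capitals
    return sum(runs) / len(runs), max(runs), sum(runs)

print(capital_run_features("AB cd EFGH"))  # runs of 2 and 4 -> (3.0, 4, 6)
```

Appending these three values to the 54 frequency variables, followed by the class label, would give the 58-dimensional vector described above.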

Lexical representation • "Han Solo loves Leia" • "Leia loves Han Solo" • The bag-of-words (lexical) representation is simple and works well. However, it ignores the syntactic structure of sentences as well as their semantics and pragmatics, so it cannot distinguish between these two sentences. • To achieve human performance, it is not enough for the machine to classify documents by using just word tokens (lexicons).
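The limitation is easy to demonstrate. The sketch below builds a bag of words for each sentence with Python's collections.Counter, assuming simple whitespace tokenization and lower-casing; bag_of_words is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Count word tokens, discarding their order."""
    return Counter(sentence.lower().split())

a = bag_of_words("Han Solo loves Leia")
b = bag_of_words("Leia loves Han Solo")

print(a == b)  # True: the two sentences have identical bags of words
```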

Syntactic representation • "Han Solo loves Leia" • "Leia loves Han Solo" • In the first sentence, the two-word noun "Han Solo" is the subject; in the second, it is the object. • Syntactic information such as word position, neighboring words, multi-word phrase, noun, verb, etc. is important for clarification.
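One very simple way to capture neighboring-word information is to use word bigrams (pairs of adjacent words). This is only an illustrative stand-in for syntactic analysis, not a parser, and the helper name bigrams is hypothetical.

```python
def bigrams(sentence: str) -> list:
    """Return the list of adjacent word pairs, preserving order."""
    tokens = sentence.lower().split()
    return list(zip(tokens, tokens[1:]))

s1 = bigrams("Han Solo loves Leia")
s2 = bigrams("Leia loves Han Solo")

print(s1)        # [('han', 'solo'), ('solo', 'loves'), ('loves', 'leia')]
print(s2)        # [('leia', 'loves'), ('loves', 'han'), ('han', 'solo')]
print(s1 == s2)  # False: word order now distinguishes the sentences
```

The shared pair ('han', 'solo') also hints at the multi-word phrase "Han Solo".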

Semantic representation • "Han Solo loves Leia" • "Leia loves Han Solo" • Both sentences are about love. • Semantically, they can be analyzed to indicate a love relationship between a man and a woman. • A simple semantic net representation helps to discover this love concept.

Simple semantic nets • Nodes are labeled with names (nouns). • Arcs are labeled with relationships. • Special link label "isa" means "is a". • Show membership or subset relationships
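A semantic net of this kind can be sketched as a list of (node, arc label, node) triples. The particular nodes and arcs below are illustrative assumptions built around the running example, and isa_closure is a hypothetical helper that follows "isa" arcs upward.

```python
# Each triple is (source node, arc label, target node). Illustrative content only.
triples = [
    ("Han Solo", "isa", "man"),
    ("Leia", "isa", "woman"),
    ("man", "isa", "person"),
    ("woman", "isa", "person"),
    ("Han Solo", "loves", "Leia"),
]

def isa_closure(node: str, net: list) -> list:
    """Collect every category reachable from `node` by following 'isa' arcs."""
    found = []
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for subj, rel, obj in net:
            if subj == current and rel == "isa" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found

print(isa_closure("Han Solo", triples))  # ['man', 'person']
```

Following the "isa" arcs is what lets the net express membership and subset relationships: Han Solo is a man, and therefore also a person.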

Pragmatic representation • "Han Solo loves Leia" • "Leia loves Han Solo" • Both sentences are about love. • Pragmatically, it is fairly common knowledge that Han Solo and Leia are characters from the Star Wars series. • This pragmatic information can contribute toward class distinction in some cases.

Summary • These linguistic representations can be acquired from the text automatically by AI parsing and translation techniques using some quick and simple heuristics. • The results are lexical, syntactic, semantic, and pragmatic representations of the text files. • We want them to be as simple as possible. • The point is to represent the text to the extent that the machine can sort the documents for the user, not to develop a deep understanding of the content of the document, which is a far tougher problem and requires more sophisticated semantic nets.
