
Spambase data



Presentation Transcript


  1. Spambase data • Represent an email by a vector

  2. Spambase dataset • Data Set characteristics:  Multivariate • Number of Instances: 4601 • Number of spam emails: 1813 • Number of ham emails: 2788 • Zero-rule accuracy: 60.6% • Attribute Characteristics: Integer, Real • Number of Attributes: 57 • Associated Tasks: Classification • Missing Values? Yes
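
The zero-rule accuracy above is the score of a classifier that always predicts the majority class (ham). A minimal Python sketch of that arithmetic, using only the counts on this slide (the variable names are illustrative):

```python
# Class counts as stated on the slide.
n_spam = 1813
n_ham = 2788
n_total = n_spam + n_ham  # 4601 instances

# Zero-rule: always predict the majority class, so accuracy is its share.
zero_rule_accuracy = max(n_spam, n_ham) / n_total

print(f"{n_total} emails, zero-rule accuracy = {zero_rule_accuracy:.1%}")
# -> 4601 emails, zero-rule accuracy = 60.6%
```

Any learned classifier should beat this baseline to be considered useful.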

  3. Text to vector • The original emails are text files of different lengths. • To eliminate the problems associated with learning to classify objects of different lengths, each email is transformed into a vector of 57+1 dimensions. • The last dimension represents the class label: 1 for spam and 0 for ham.
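
As a rough illustration of the 57+1 layout described above, the sketch below splits one 58-value row into its feature vector and its class label. The function name `split_row` and the all-zero example row are hypothetical, made up for this example; they are not taken from the dataset.

```python
from typing import List, Tuple

N_FEATURES = 57  # attributes that describe the email


def split_row(row: List[float]) -> Tuple[List[float], int]:
    """Split a 58-value row into (57 features, class label)."""
    if len(row) != N_FEATURES + 1:
        raise ValueError(f"expected {N_FEATURES + 1} values, got {len(row)}")
    features = row[:N_FEATURES]
    label = int(row[N_FEATURES])  # last dimension: 1 = spam, 0 = ham
    return features, label


# Hypothetical row (not real data): all-zero features, labeled spam.
example_row = [0.0] * N_FEATURES + [1.0]
features, label = split_row(example_row)
```

Because every email ends up with the same number of dimensions, emails of any length can be compared in the same space.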

  4. Key words • A set of 48 words was chosen from the contents of the training set. • These words were deemed relevant for distinguishing spam from ham. • They are as follows: make, address, all, 3d, our, over, remove, internet, order, mail, receive, will, people, report, addresses, free, business, email, you, credit, your, font, 000, money, hp, hpl, george, 650, lab, labs, telnet, 857, data, 415, 85, technology, 1999, parts, pm, direct, cs, meeting, original, project, re, edu, table, conference.

  5. Frequency variables for words • Given an email text and a particular WORD, we calculate its frequency, i.e., the percentage of words in the email that match WORD: word_freq_WORD = 100*r/t, where • r is the number of times WORD appears in the email • t is the total number of words in the email. • For example, the word ‘make’ never occurs in email 1, so word_freq_make = 0, while 0.64% of the words in email 1 are the word ‘address’, so word_freq_address = 0.64 for email 1. • There are 48 such frequency variables. Each corresponds to a dimension (an axis) in the Euclidean training space.
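
The formula word_freq_WORD = 100*r/t can be sketched in Python as follows. The slide does not say how an email is split into words, so the tokenization here (lowercased runs of letters and digits) is an assumption, as are the function name `word_freq` and the sample text.

```python
import re


def word_freq(text: str, word: str) -> float:
    """Percentage of words in `text` that match `word` (100 * r / t)."""
    # Assumption: a word is a maximal run of letters/digits, case-insensitive.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    t = len(tokens)  # total number of words in the email
    if t == 0:
        return 0.0  # avoid division by zero on an empty email
    r = sum(1 for tok in tokens if tok == word.lower())  # matches of WORD
    return 100.0 * r / t


# Made-up sample text: 7 words, 2 of them "free".
sample = "Free money! Claim your free prize now"
freq_free = word_freq(sample, "free")  # 100 * 2 / 7
freq_make = word_freq(sample, "make")  # word never occurs
```

Evaluating this function once per key word yields the 48 word-frequency coordinates of an email.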

  6. word_freq_george histogram • 80 emails have a "george" frequency in the (19.9, 20.6] % range. • Red: ham • Blue: spam

  7. Frequency variables for special characters • Six characters are also chosen as special. They are ; ( [ ! $ # • Frequency variables are also created for these. • Now the training space is 48+6=54 dimensional.
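
A minimal sketch of the six character-frequency variables is given below. The slide does not state the formula, so this assumes the same percentage form as for words, computed over characters: 100 times the number of occurrences of the character divided by the total number of characters in the email. The names `SPECIAL_CHARS` and `char_freqs` are illustrative.

```python
from typing import Dict

# The six special characters listed on the slide.
SPECIAL_CHARS = [";", "(", "[", "!", "$", "#"]


def char_freqs(text: str) -> Dict[str, float]:
    """Percentage of characters in `text` equal to each special character."""
    total = len(text)  # assumption: every character, including spaces, counts
    if total == 0:
        return {c: 0.0 for c in SPECIAL_CHARS}
    return {c: 100.0 * text.count(c) / total for c in SPECIAL_CHARS}


# Made-up sample text: 12 characters, three '$' and one '!'.
freqs = char_freqs("Win $$$ now!")
```

Together with the 48 word frequencies, these six values give the 54 dimensions mentioned above.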

  8. Last 3 variables • Given an email text, three more variables are created. • Average length of uninterrupted sequences of capital letters • Length of the longest uninterrupted sequence of capital letters • Total number of capital letters in the email. • Altogether, there are 57 attributes (variables) to describe an email, plus 1 attribute for the class label. • This is how text emails are transformed into 58-dimensional vectors. • For this spam email problem, 4601 emails were transformed and stored in the UCI Repository.
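
The three capital-letter variables can be sketched as follows. The function name `capital_run_features` and the sample text are illustrative, and returning zeros for an email without capitals is an assumption, since the slide does not cover that case.

```python
import re
from typing import Tuple


def capital_run_features(text: str) -> Tuple[float, int, int]:
    """Return (average run length, longest run length, total capitals)."""
    # Each match is one uninterrupted sequence of capital letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0  # assumption: no capitals means all three are zero
    total = sum(runs)  # every capital letter belongs to exactly one run
    return total / len(runs), max(runs), total


# Made-up sample text with runs "FREE" (4), "CALL" (4) and "NOW" (3).
average, longest, total = capital_run_features("FREE offer, CALL NOW")
```

Appending these three values to the 54 frequency variables gives the 57 attributes of an email.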

  9. Lexical representation • "Han Solo loves Leia" • "Leia loves Han Solo" • The bag-of-words (lexical) representation is simple and works well. However, it ignores the syntactic structure of sentences as well as their semantics and pragmatics, so it cannot distinguish between the two sentences above. • To achieve human performance, it is not enough for the machine to classify documents by using just word tokens (lexicons).
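
The limitation can be checked directly with a small Python sketch: under a bag-of-words representation the two sentences map to the same word counts. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count word occurrences, discarding word order."""
    return Counter(sentence.lower().split())


bag_a = bag_of_words("Han Solo loves Leia")
bag_b = bag_of_words("Leia loves Han Solo")

# Word order is lost, so the two representations are identical.
same_representation = bag_a == bag_b
```

Any classifier that sees only these counts must treat the two sentences identically.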

  10. Syntactic representation • "Han Solo loves Leia" • "Leia loves Han Solo" • In the first sentence, the two-word noun "Han Solo" is the subject; in the second, it is the object. • Syntactic information such as word position, neighboring words, multi-word phrases, nouns, verbs, etc. is important for classification.
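
One of the simplest forms of the "neighboring words" information mentioned above is the word bigram (a pair of adjacent words). The sketch below is only an illustration of that single idea, not a full syntactic parser; the helper name `bigrams` is hypothetical.

```python
from typing import List, Tuple


def bigrams(sentence: str) -> List[Tuple[str, str]]:
    """Return the pairs of adjacent words, in order."""
    tokens = sentence.lower().split()
    return list(zip(tokens, tokens[1:]))


first = bigrams("Han Solo loves Leia")
second = bigrams("Leia loves Han Solo")

# Both sentences keep the phrase ("han", "solo"), but the pairs around
# "loves" differ, so the two sentences are no longer indistinguishable.
distinguishable = set(first) != set(second)
```

Unlike single word counts, these pairs record which word follows which, so the two example sentences receive different representations.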

  11. Semantic representation • "Han Solo loves Leia" • "Leia loves Han Solo" • The two sentences are talking about love. • Semantically, they can be analyzed to indicate the love relationship between a man and a woman. • A simple semantic net representation is helpful to discover this love concept.

  12. Simple semantic nets • Nodes are labeled with names (nouns). • Arcs are labeled with relationships. • The special link label "isa" means "is a" and shows membership or subset relationships.
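
A simple semantic net of this kind can be sketched as a set of (node, arc label, node) triples. The particular nodes and arcs below are an illustrative net built from the running example, not one given in the slides, and the helper `is_a` is hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (source node, arc label, target node)

# Illustrative net: nodes are names, arcs are relationships.
NET: Set[Triple] = {
    ("Han Solo", "isa", "man"),
    ("Leia", "isa", "woman"),
    ("man", "isa", "person"),
    ("woman", "isa", "person"),
    ("Han Solo", "loves", "Leia"),
}


def is_a(net: Set[Triple], node: str, category: str) -> bool:
    """Follow "isa" arcs upward to test membership or subset relations."""
    frontier = [node]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current == category:
            return True
        if current in seen:
            continue  # guard against cycles in the net
        seen.add(current)
        frontier.extend(t for (s, rel, t) in net if s == current and rel == "isa")
    return False


han_is_person = is_a(NET, "Han Solo", "person")  # via man -> person
leia_is_man = is_a(NET, "Leia", "man")
```

Chaining "isa" arcs is what lets the net generalize a specific fact (Han Solo loves Leia) to a concept (a man loves a woman).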

  13. Pragmatic representation • "Han Solo loves Leia" • "Leia loves Han Solo" • The two sentences are talking about love. • Pragmatically, it is fairly common knowledge that Han Solo and Leia are characters from the Star Wars series. • This pragmatic information can contribute toward class distinction in some cases.

  14. Summary • These linguistic representations can be acquired from the text automatically by AI parsing and translation techniques using some quick and simple heuristics. • The result is a set of lexical, syntactic, semantic, and pragmatic representations of the text files. • We want them to be as simple as possible. • The point is to represent the text well enough that the machine can sort the documents for the user, not to develop a deep understanding of the content of each document, which is a far tougher problem and requires more sophisticated semantic nets.
