CL and Social Media LING 575 Fei Xia Week 2: 01/11/2011
Outline • A few announcements • Personal vs. Business email • Email zone classification • Deception detection • Hw2 • Hw1: quick update from the students
Databases on Patas • Three MySQL databases on patas/capuchin • enron: the ISI database • Seems to have many more senders • The tables are slightly different from the paper • berkeley_enron: the database from Berkeley • zonerelease: email zone annotation • Query the database: • userid: enronmail • password: askwhy
Databases on Patas (cont) • mysql -u enronmail -p -h capuchin • enter your password (“askwhy”) • use database_name; • show tables; • select * from table_name limit 5; • mysql API for Perl and other languages
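The interactive session above can also be scripted. The following is a minimal sketch, not part of the original slides, that builds the equivalent non-interactive `mysql` client call from Python. The helper names `mysql_command` and `run_query` are hypothetical; the host, user, and password are the ones given on the slide. Actually running a query requires the `mysql` client and access to capuchin.

```python
import subprocess

HOST = "capuchin"      # from the slide
USER = "enronmail"     # from the slide
PASSWORD = "askwhy"    # from the slide


def mysql_command(database, query):
    """Build the argument list for a one-off query via the mysql client."""
    return [
        "mysql",
        "-u", USER,
        "-h", HOST,
        f"--password={PASSWORD}",
        "--batch",      # tab-separated output, one row per line
        "-e", query,    # execute the statement and exit
        database,
    ]


def run_query(database, query):
    """Run the query and return rows as lists of strings (header row first)."""
    result = subprocess.run(
        mysql_command(database, query),
        capture_output=True, text=True, check=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines()]

# Example (needs access to capuchin):
# run_query("zonerelease", "show tables;")
```

Going through the command-line client keeps the sketch free of third-party packages; a database driver for your language of choice would be the more usual route for larger jobs.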
Recent workshops on social media • NAACL 2010 workshop: http://www.aclweb.org/anthology-new/W/W10/W10-05.pdf • ACL 2011 workshop: (due date is 4/1) http://research.microsoft.com/en-us/events/lsm2011/default.aspx • International conference on Weblogs and Social Media: in conjunction with IJCAI-2011 (due date is 1/31) http://www.icwsm.org/2011/cfp.php
Task • Determine whether an email is personal or business • (Jabbari et al., 2006) • Manual annotation • Inter-annotator agreement • Automatic classification
Annotated data • Available at http://staffwww.dcs.shef.ac.uk/people/L.Guthrie/nlp/research.htm • Stored on patas under $data_dir/personal_vs_business/ • Size: • 12,500 emails • 83% business, 17% personal • Mismatch between the paper and the data
Class labels • Business: • core business, routine admin, inter-employee relations, soliciting, image, keeping_current • Personal: • close personal, personal maintenance, personal circulation
Inter-annotator agreement • 2,200 emails are double annotated: • 6% disagreement • 82% are labeled as “business” by both • 12% are labeled as “personal” by both • disagreements: about 130 emails • 25% for subscription • 18% for travel arrangement • 13% for colleague meetings • 8% for service provided to Enron employees • Questions: • What do annotators see? The email only or the thread? Do they only look at the email body, or do they look at the “To” field as well?
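To make the agreement figures concrete, here is a small sketch that turns the three percentages above into an observed agreement rate, a disagreement count, and a chance-corrected score (Cohen's kappa). Kappa is not reported on the slide; it is added here only as an illustration, and it rests on an assumption that the 6% disagreements split evenly between the two directions, since the slide does not give each annotator's marginals. The function name is hypothetical.

```python
def agreement_stats(both_business, both_personal, disagree, n_items):
    """Summarize two-annotator agreement from cell fractions that sum to 1."""
    observed = both_business + both_personal
    # ASSUMPTION (not from the slide): disagreements are split evenly, so
    # both annotators have the same marginal distribution over the labels.
    p_business = both_business + disagree / 2
    p_personal = both_personal + disagree / 2
    expected = p_business ** 2 + p_personal ** 2   # chance agreement
    kappa = (observed - expected) / (1 - expected)
    return {
        "observed": observed,
        "expected": expected,
        "kappa": kappa,
        "n_disagreements": round(disagree * n_items),
    }


stats = agreement_stats(0.82, 0.12, 0.06, 2200)
print(stats)
```

With the slide's numbers this gives 132 disagreements, in line with "about 130 emails", and a kappa of roughly 0.76 under the even-split assumption.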
Automatic classification • Classification algorithm: (Guthrie and Walker, 1994) • Data: • 4,000 messages on “core business” • 1,000 messages on “close personal” • Results: 0.93 (system accuracy) vs. 0.94 (inter-annotator agreement)
(Guthrie and Walker, 1994): Algorithm for text classification • Let T1, T2, …, Tk be class labels. • Assumption: a test document with class label Ti has a “word” distribution similar to that of the union of the training documents labeled Ti. • Training: • partition the set of words into W1, W2, …, Wm • for each Ti, • “merge” the documents in the training data whose class label is Ti • calculate pij, the probability of a word from Wj in the merged documents, for each Wj • Ex: |T|=2, |W|=3, pij is (0.1, 0.05, 0.85) for T1, and (0.01, 0.2, 0.79) for T2 • Testing: • let nj be the frequency of the words in the test document that belong to Wj • Ex: the frequencies are (10, 200, 8900) • choose the Ti that maximizes the likelihood of these frequencies, i.e., the product over j of pij^nj
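A minimal sketch of the training and testing steps above, assuming the quantity being maximized is the multinomial likelihood of the test counts (the product over j of pij^nj), computed in log space. The multinomial coefficient is the same for every class, so it is dropped. The function names and the add-one smoothing in `estimate_probs` are assumptions for illustration, not details taken from the paper.

```python
import math


def set_index(token, word_sets):
    """Return j such that token is in W_j; the last index is the 'rest' set."""
    for j, words in enumerate(word_sets):
        if token in words:
            return j
    return len(word_sets)


def estimate_probs(merged_tokens, word_sets):
    """Estimate p_ij for one class from its merged training tokens.

    ASSUMPTION: add-one smoothing so that no p_ij is exactly zero.
    """
    counts = [1] * (len(word_sets) + 1)
    for token in merged_tokens:
        counts[set_index(token, word_sets)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def log_likelihood(freqs, probs):
    """Sum over j of n_j * log(p_ij)."""
    return sum(n * math.log(p) for n, p in zip(freqs, probs))


def classify(freqs, class_probs):
    """Return the index of the best class and the per-class scores."""
    scores = [log_likelihood(freqs, probs) for probs in class_probs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Numbers from the slide: two classes, three word sets.
class_probs = [(0.1, 0.05, 0.85), (0.01, 0.2, 0.79)]
freqs = (10, 200, 8900)
best, scores = classify(freqs, class_probs)
print("chosen class index:", best)
```

On the slide's example the sketch selects index 0, that is, T1: the test document's heavy use of W3 words is better explained by T1's higher p for W3.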
(Guthrie and Walker, 1994): Experiments • Two class labels: T1 and T2 • Three word sets: W1, W2, and W3 • W1 includes the top 300 most frequent words in Docs(T1) that are not among the top 500 most frequent words in Docs(T2). • W2 includes the top 300 most frequent words in Docs(T2) that are not among the top 500 most frequent words in Docs(T1). • W3 includes the rest of the words • Accuracy: 100%
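The word-set construction above could be sketched as follows. One reading is assumed here: take the 300 most frequent words of a class first, then drop any that also appear in the other class's top 500, so W1 and W2 may end up with fewer than 300 words. The function names are hypothetical, and documents are assumed to be pre-tokenized lists of strings.

```python
from collections import Counter


def top_words(counts, k):
    """The k most frequent words in a Counter, most frequent first."""
    return [word for word, _ in counts.most_common(k)]


def distinctive_words(counts_own, counts_other, keep=300, exclude=500):
    """Top-`keep` words of one class that are not in the other's top-`exclude`."""
    blocked = set(top_words(counts_other, exclude))
    return {w for w in top_words(counts_own, keep) if w not in blocked}


def build_word_sets(docs_t1, docs_t2, keep=300, exclude=500):
    """Return (W1, W2); W3 is implicit: every word in neither set."""
    counts_t1 = Counter(tok for doc in docs_t1 for tok in doc)
    counts_t2 = Counter(tok for doc in docs_t2 for tok in doc)
    w1 = distinctive_words(counts_t1, counts_t2, keep, exclude)
    w2 = distinctive_words(counts_t2, counts_t1, keep, exclude)
    return w1, w2


# Toy usage with small cutoffs so the effect is visible.
docs_t1 = [["stock", "stock", "the", "the", "the", "deal"]]
docs_t2 = [["the", "the", "the", "party", "party", "mom"]]
w1, w2 = build_word_sets(docs_t1, docs_t2, keep=2, exclude=1)
print(w1, w2)
```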
Issues • Using word features: the words in a business email could vary a lot depending on what the business is. • Other important cues: • the relation between the sender and the recipient • Do they work in the same company? • What is the path between them in the company report chain? • Are they friends? • other emails in the same thread • the nature of the sender/recipient/company’s work and the words in the emails (e.g., “stock”, “parent meeting”) • … • Other ideas?
Email zone classification • Task: given a message, break it down into zones (e.g., header, greeting, body, disclaimer, etc.) • Today’s paper: Andrew Lampert, Robert Dale, and Cecile Paris, 2009. Segmenting Email Message Text into Zones. In Proc. of EMNLP-2009 • Data: • Available at http://zebra.thoughtlets.org/ • Stored on patas under $data_dir/email_zoning_dataset/EmailZoneData/ • Stored on capuchin as a mysql database called “zonerelease”
Email zones in (Estival et al., 2007) • Five categories: • Author text • Signature • Advertisement (automatically appended ones) • Quoted text • Reply lines
Email zones in (Lampert et al., 2009) • Sender zones • Author: new content from the current email sender, excluding any text that has been included from previous messages. • Greetings: e.g., “Hi, Mike” • Signoff: e.g., “thanks. AJ” • Quoted conversation zones • Reply: content quoted from a previous message • Forward: Content from an email message outside the current conversation thread that has been forwarded by the current email sender
Email zones (cont) • Boilerplate zones: content that is reused without modification across multiple email messages • Signature • Advertising • Disclaimer • Attachment: automatically generated text
Manual annotation • Annotated data: • almost 400 email messages • 11881 lines (7922 non-blank lines) • use the Berkeley database (“berkeley_enron”) • one annotator • Use 10-fold cross validation
Automatic classification • Classifier: SVM • Two approaches: • two stages: (zone fragment classification) • segment a message into zone fragments • classify those fragments • one stage: • classify each line
Detecting zone boundaries • Different kinds of boundaries: • Blank boundaries: line 12 • Separator boundaries: lines 17-20 • Adjoining boundaries: lines 10 and 11 • Use a heuristic approach: • treat every blank line, and every line beginning with 4+ repeated punctuation marks, as a candidate boundary • cannot handle adjoining boundaries • high recall, low precision
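The boundary heuristic above can be sketched in a few lines. This is an illustration under two assumptions: "repeated" is read as the same punctuation character occurring four or more times in a row at the start of the line, and boundary lines themselves are dropped from the resulting fragments. The function names are hypothetical.

```python
import re
import string

# A punctuation character followed by at least three more copies of itself.
_PUNCT = re.escape(string.punctuation)
SEPARATOR_RE = re.compile(rf"^\s*([{_PUNCT}])\1{{3,}}")


def is_candidate_boundary(line):
    """True for blank lines and for lines starting with 4+ repeated marks."""
    return line.strip() == "" or SEPARATOR_RE.match(line) is not None


def split_into_fragments(lines):
    """Split a message into fragments at every candidate boundary."""
    fragments, current = [], []
    for line in lines:
        if is_candidate_boundary(line):
            if current:
                fragments.append(current)
                current = []
        else:
            current.append(line)
    if current:
        fragments.append(current)
    return fragments


message = [
    "Hi Mike,",
    "",
    "See below.",
    "-----Original Message-----",
    "> old text",
    "> more",
]
fragments = split_into_fragments(message)
print(fragments)
```

Because two zones that touch without a blank or separator line never trigger the test, adjoining boundaries are missed, which is the limitation the slide points out.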
Classifying zone fragments • Features: • Graphic features: layout of text in the email • Orthographic features: the use of distinctive chars and char sequences including punctuation, capital letters and numbers • Lexical features: information about the words used in the email text
Graphic features • the number of words in the text fragment • the number of characters in the text fragment • the start position of the text fragment • the end position of the text fragment • the average line length (in chars) within the text fragment • the length of the text fragment relative to the previous fragment • the number of blank lines preceding the text fragment • …
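A rough sketch of how the graphic features listed above might be computed for one fragment. The paper's exact definitions are not given on the slide, so two choices here are assumptions: positions are 0-based line indices within the message, and relative length is a ratio of character counts. The function and feature names are hypothetical.

```python
def graphic_features(fragment, start_line, prev_fragment=None, blank_lines_before=0):
    """Layout features for a fragment given as a list of text lines."""
    n_lines = len(fragment)
    n_chars = sum(len(line) for line in fragment)
    prev_chars = sum(len(line) for line in prev_fragment) if prev_fragment else 0
    return {
        "n_words": sum(len(line.split()) for line in fragment),
        "n_chars": n_chars,
        "start_line": start_line,                    # ASSUMPTION: line index
        "end_line": start_line + n_lines - 1,
        "avg_line_length": n_chars / n_lines if n_lines else 0.0,
        # ASSUMPTION: ratio of character counts; 0.0 when there is no
        # previous fragment to compare against.
        "length_vs_prev": n_chars / prev_chars if prev_chars else 0.0,
        "blank_lines_before": blank_lines_before,
    }


feats = graphic_features(
    ["Hi Mike,", "thanks"],
    start_line=3,
    prev_fragment=["abcdefg"],
    blank_lines_before=1,
)
print(feats)
```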
Orthographic features • whether all lines start with the same character (e.g., ‘>’); • whether a prior text fragment in the message contains a quoted header; • whether a prior text fragment in the message contains repeated punctuation characters; • whether the text fragment contains a URL; • whether the text fragment contains an email address; • whether the text fragment contains a sequence of four or more digits; • the number of capitalised words in the text fragment; • the percentage of capitalised words in the text fragment; • …
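Several of the orthographic features above reduce to simple pattern checks. The sketch below covers the ones that depend only on the fragment itself and leaves out the two that look at prior fragments. The regular expressions are deliberately simple stand-ins, and "capitalised" is assumed to mean "starts with an uppercase letter"; neither is taken from the paper.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGITS_RE = re.compile(r"\d{4,}")   # four or more digits in a row


def orthographic_features(fragment):
    """Character-level features for a fragment given as a list of lines."""
    text = "\n".join(fragment)
    words = text.split()
    # ASSUMPTION: a capitalised word is one whose first character is uppercase.
    n_capitalised = sum(1 for w in words if w[0].isupper())
    same_start = (
        bool(fragment)
        and all(fragment)
        and len({line[0] for line in fragment}) == 1
    )
    return {
        "same_start_char": same_start,
        "has_url": URL_RE.search(text) is not None,
        "has_email": EMAIL_RE.search(text) is not None,
        "has_4plus_digits": DIGITS_RE.search(text) is not None,
        "n_capitalised": n_capitalised,
        "pct_capitalised": n_capitalised / len(words) if words else 0.0,
    }


quoted = ["> Call me at 5551234", "> see http://example.com"]
ortho = orthographic_features(quoted)
print(ortho)
```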
Lexical features • word unigram • word bigram • whether the text fragment contains the sender’s name; • whether a prior text fragment in the message contains the sender’s name; • whether the text fragment contains the sender’s initials; and • whether the text fragment contains a recipient’s name.
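The lexical features above could be extracted along these lines. This is a sketch with hypothetical names: names are matched by any of their parts as whole tokens, and initials are the first letters of the name parts. The "prior fragment contains the sender's name" feature is omitted because it only needs the same check applied to earlier fragments.

```python
import re
from collections import Counter


def lexical_features(fragment, sender_name, recipient_names=()):
    """Word n-gram counts and name cues for a fragment (list of lines)."""
    tokens = re.findall(r"[a-z0-9']+", " ".join(fragment).lower())
    token_set = set(tokens)
    sender_parts = sender_name.lower().split()
    # ASSUMPTION: initials are the first letters of each name part, e.g. "aj".
    initials = "".join(part[0] for part in sender_parts)
    recipient_parts = [p for name in recipient_names for p in name.lower().split()]
    return {
        "unigrams": Counter(tokens),
        "bigrams": Counter(zip(tokens, tokens[1:])),
        "has_sender_name": any(p in token_set for p in sender_parts),
        "has_sender_initials": bool(initials) and initials in token_set,
        "has_recipient_name": any(p in token_set for p in recipient_parts),
    }


signoff = lexical_features(["thanks.", "AJ"], "Andrew Jones", ["Mike Smith"])
greeting = lexical_features(["Hi, Mike"], "Andrew Jones", ["Mike Smith"])
print(signoff["has_sender_initials"], greeting["has_recipient_name"])
```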
Issues • Sequence labeling problem: • add features that look at the labels of preceding segments • Is the 9-zone label set sufficient? • How to take advantage of emails in the bigger context? • emails in the same discussion thread • emails by the same sender • general email structure: e.g., greeting, body, signoff, etc.
Papers for today • [11] M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. “Lying words: Predicting deception from linguistic style”. Personality and Social Psychology Bulletin, 29:665–675, 2003. • [13] L. Zhou, J.K. Burgoon, J.F. Nunamaker Jr., and D. Twitchell. “Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication”. Group Decision and Negotiation, 13:81–106, 2004.
(Newman et al., 2003) • Assumptions: Deceptive communications should be characterized by • fewer first-person singular pronouns (e.g., “I”, “me”, and “my”): liars dissociate themselves from their statements • more words reflecting negative emotion: liars may feel guilt about lying or about the topic they are discussing • fewer “exclusive” words (e.g., “except”, “but”, “without”) and more action words (e.g., “walk”): due to reduced cognitive resources
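The four word categories above are typically measured as rates: the share of a text's tokens that fall in each category. The sketch below shows the idea with tiny stand-in lexicons. Three of the lists contain only the examples on the slide; the negative-emotion entries are hypothetical placeholders, since the slide gives no examples. A real study would use a full dictionary-based tool (the paper used LIWC) rather than lists this small.

```python
import re

CATEGORIES = {
    "first_person_singular": {"i", "me", "my"},     # examples from the slide
    "negative_emotion": {"hate", "sad"},            # HYPOTHETICAL placeholders
    "exclusive": {"except", "but", "without"},      # examples from the slide
    "action": {"walk"},                             # example from the slide
}


def category_rates(text):
    """Fraction of tokens in each category (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in CATEGORIES}
    return {
        name: sum(1 for t in tokens if t in words) / len(tokens)
        for name, words in CATEGORIES.items()
    }


rates = category_rates("I went without my keys, but I did walk.")
print(rates)
```

Exact-match lookup misses contractions such as "I'm" and inflected forms such as "walked"; handling those is left out to keep the sketch short.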
Experiments: Five studies • videotaped abortion attitudes • typed abortion attitudes • handwritten abortion attitudes • feelings about friends • mock crime
Experiments • Trained on four studies and used the "classifier" on the remaining study • Accuracy: about 61% • The weights learned for these four types of words were consistent with their assumptions.
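The train-on-four, test-on-one setup above is a leave-one-group-out split with the study as the group. A minimal sketch of the splitting logic follows; the function name and the short study labels are hypothetical, and the classifier itself is out of scope.

```python
def leave_one_study_out(study_labels):
    """Yield (held_out_study, train_indices, test_indices) for each study."""
    for held_out in sorted(set(study_labels)):
        train_idx = [i for i, s in enumerate(study_labels) if s != held_out]
        test_idx = [i for i, s in enumerate(study_labels) if s == held_out]
        yield held_out, train_idx, test_idx


# One label per sample, naming the study it came from (toy data).
labels = ["video", "typed", "video", "crime", "typed"]
splits = list(leave_one_study_out(labels))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Holding out a whole study, rather than random samples, tests whether the cues carry over to a setting the classifier has not seen.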
(Zhou et al., 2004) • Experiments: • students are asked to exchange emails about a desert survival task • students are asked to tell the truth or lies • features: 27 linguistic cues
Hypothesis • Deceptive senders display • higher (a) quantity, (b) expressivity, (c) positive affect, (d) informality, (e) uncertainty, and (f) nonimmediacy, and • less (g) complexity, (h) diversity, and (i) specificity of language in their messages than truthful senders and than their respective receivers
Linguistic cues • quantity: • # of words • # of verbs • # of NPs • # of sentences • expressivity: • # of adj/adv divided by # of nouns and verbs
Linguistic cues (cont) • positive affect: expression of positive emotion • informality: # of misspelled words / # of words • uncertainty: • # of modifiers (adj/adv) • # of modal verbs • # of uncertainty words • # of third person pronouns
Linguistic cues (cont) • nonimmediacy: • passive voice • generalizing terms • (fewer) self references • group references: first person plural pronouns
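Some of the uncertainty and nonimmediacy cues are counts of closed-class words and need no parser. The sketch below counts modal verbs, third person pronouns, self references, and group references by exact token match. The word lists are standard English closed classes rather than the paper's own lists, which is an assumption; without part-of-speech tagging, ambiguous forms such as "will" or "may" are counted even when used as nouns.

```python
import re

MODAL_VERBS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}
THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them",
                "his", "hers", "its", "their", "theirs"}
SELF_REFERENCES = {"i", "me", "my", "mine", "myself"}
GROUP_REFERENCES = {"we", "us", "our", "ours", "ourselves"}  # first person plural


def closed_class_counts(text):
    """Count tokens in each closed-class word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "modal_verbs": sum(t in MODAL_VERBS for t in tokens),
        "third_person_pronouns": sum(t in THIRD_PERSON for t in tokens),
        "self_references": sum(t in SELF_REFERENCES for t in tokens),
        "group_references": sum(t in GROUP_REFERENCES for t in tokens),
    }


counts = closed_class_counts("We might go, but I think they would help us.")
print(counts)
```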
Linguistic cues • Complexity: • Ave # of clauses per sent • Ave sentence length • Ave word length • … • Diversity: • lexical diversity • content word diversity • redundancy • …
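A few of the complexity and diversity cues above can be approximated from raw text alone. The sketch below assumes sentences end at `.`, `!`, or `?`, and takes lexical diversity to be the ratio of distinct words to total words; both are simplifying assumptions rather than the paper's definitions. Clause counts, content word diversity, and redundancy need a tagger or parser and are not attempted.

```python
import re


def complexity_diversity(text):
    """Surface approximations of sentence length, word length, and diversity."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    n_tokens = len(tokens)
    return {
        "n_words": n_tokens,
        "n_sentences": len(sentences),
        "avg_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
        # ASSUMPTION: type/token ratio as the lexical diversity measure.
        "lexical_diversity": len(set(tokens)) / n_tokens if n_tokens else 0.0,
    }


cues = complexity_diversity("We need water. We need a map!")
print(cues)
```

Type/token ratios fall as messages get longer, so comparisons across messages of very different lengths should be read with care.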
Issues • Different settings for deception could affect the cues (e.g., the length of the messages): • interviews • emails • blogs • lying spontaneously vs. being asked to lie
Hw2 • Your presentation • Reading assignments • Suggestions for others’ projects