CL and Social Media LING 575 Fei Xia Week 2: 01/11/2011
Outline • A few announcements • Personal vs. Business email • Email zone classification • Deception detection • Hw2 • Hw1: quick update from the students
Databases on Patas • Three MySQL databases on patas/capuchin • enron: the ISI database • Seems to have many more senders • The tables are slightly different from those described in the paper • berkeley_enron: the database from Berkeley • zonerelease: email zone annotation • To query the databases: • userid: enronmail • password: askwhy
Databases on Patas (cont) • mysql -u enronmail -p -h capuchin • enter your password (“askwhy”) • use database_name; • show tables; • select * from table_name limit 5; • MySQL APIs are also available for Perl and other languages
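For scripted access (rather than the interactive client above), a minimal sketch along these lines works, assuming the mysql-connector-python package; any MySQL client library (PyMySQL, Perl's DBD::mysql, etc.) follows the same pattern:

```python
# Minimal sketch: querying the enron database from Python.
# Assumes the mysql-connector-python package; the table name below is a
# placeholder, not a real table from the database.
import mysql.connector

conn = mysql.connector.connect(
    host="capuchin",
    user="enronmail",
    password="askwhy",
    database="enron",      # or berkeley_enron / zonerelease
)
cur = conn.cursor()

cur.execute("SHOW TABLES")
print(cur.fetchall())

cur.execute("SELECT * FROM table_name LIMIT 5")   # placeholder table name
for row in cur.fetchall():
    print(row)

conn.close()
```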
Recent workshops on social media • NAACL 2010 workshop: http://www.aclweb.org/anthology-new/W/W10/W10-05.pdf • ACL 2011 workshop: (due date is 4/1) http://research.microsoft.com/en-us/events/lsm2011/default.aspx • International conference on Weblogs and Social Media: in conjunction with IJCAI-2011 (due date is 1/31) http://www.icwsm.org/2011/cfp.php
Task • Determine whether an email is personal or business • (Jabbari et al., 2006) • Manual annotation • Inter-annotator agreement • Automatic classification
Annotated data • Available at http://staffwww.dcs.shef.ac.uk/people/L.Guthrie/nlp/research.htm • Stored on patas under $data_dir/personal_vs_business/ • Size: • 12,500 emails • 83% business, 17% personal • Mismatch between the paper and the data
Class labels • Business: • core business, routine admin, inter-employee relations, soliciting, image, keeping_current • Personal: • close personal, personal maintenance, personal circulation
Inter-annotator agreement • 2,200 emails were double annotated: • 6% disagreement • 82% were labeled “business” by both annotators • 12% were labeled “personal” by both annotators • disagreements: about 130 emails • 25% involve subscriptions • 18% involve travel arrangements • 13% involve colleague meetings • 8% involve services provided to Enron employees • Questions: • What do annotators see? The email only, or the whole thread? Do they look only at the email body, or at the “To” field as well?
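As a rough sanity check (not something reported in the paper), Cohen's kappa can be estimated from the proportions above; the sketch assumes the 6% disagreement splits evenly between the two off-diagonal cells, which the slide does not specify:

```python
# Rough Cohen's kappa from the reported agreement figures.
# Assumption (not stated above): the 6% disagreement splits evenly
# between the business/personal and personal/business cells.
p_bb, p_pp = 0.82, 0.12            # both "business", both "personal"
p_bp = p_pb = 0.06 / 2             # assumed even split of disagreements

p_o = p_bb + p_pp                              # observed agreement = 0.94
p_b1, p_b2 = p_bb + p_bp, p_bb + p_pb          # "business" marginals
p_e = p_b1 * p_b2 + (1 - p_b1) * (1 - p_b2)    # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))                         # about 0.76 under this assumption
```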
Automatic classification • Classification algorithm: (Guthrie and Walker, 1994) • Data: • 4,000 messages on “core business” • 1,000 messages on “close personal” • Results: 0.93 (system accuracy) vs. 0.94 (inter-annotator agreement)
(Guthrie and Walker, 1994) Algorithm for text classification • Let T1, T2, …, Tk be the class labels. • Assumption: a test document with class label Ti has a word distribution similar to that of the union of the training documents labeled Ti. • Training: • partition the set of words into W1, W2, …, Wm • for each Ti: • “merge” the training documents whose class label is Ti • calculate pij (the probability that a word from the merged Ti text falls into Wj) for each Wj • Ex: |T|=2, |W|=3, pij is (0.1, 0.05, 0.85) for T1 and (0.01, 0.2, 0.79) for T2 • Testing: • let nj be the number of words in the test document that belong to Wj • Ex: the frequencies are (10, 200, 8900) • choose the Ti that maximizes the likelihood of those frequencies, i.e., ∏j pij^nj (equivalently, Σj nj log pij)
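A minimal sketch of this training/testing loop, assuming the word partition W1..Wm is given as a list of sets and that the quantity being maximized is the multinomial likelihood above; smoothing here is simple add-one, not necessarily what the original paper used:

```python
# Sketch of a (Guthrie and Walker, 1994)-style classifier as described above.
# word_sets: list of sets [W1, ..., Wm]; docs are lists of word tokens.
import math

def train(docs_by_label, word_sets):
    """p[label][j] = probability that a word from this label's merged
    training text falls into word set Wj (add-one smoothed)."""
    p = {}
    for label, docs in docs_by_label.items():
        words = [w for doc in docs for w in doc]
        counts = [sum(1 for w in words if w in ws) for ws in word_sets]
        total = sum(counts) + len(word_sets)       # add-one smoothing
        p[label] = [(c + 1) / total for c in counts]
    return p

def classify(doc, p, word_sets):
    """Pick the label Ti maximizing sum_j nj * log pij."""
    n = [sum(1 for w in doc if w in ws) for ws in word_sets]
    return max(p, key=lambda label: sum(nj * math.log(pij)
                                        for nj, pij in zip(n, p[label])))
```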
(Guthrie and Walker, 1994):Experiments • Two class labels: T1 and T2 • Three word sets: W1, W2, and W3 • W1 includes the top 300 most frequent words in Docs(T1) that are not among the top 500 most frequent words in Docs(T2). • W2 includes the top 300 most frequent words in Docs(T2) that are not among the top 500 most frequent words in Docs(T1). • W3 includes the rest of the words • Accuracy: 100%
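A sketch of how the three word sets could be constructed as described above, assuming Docs(T1) and Docs(T2) are lists of tokenized documents; the simple reading here filters each class's top-300 list by the other class's top-500 list, which may leave slightly fewer than 300 words:

```python
# Sketch of the word-set construction: W1 = top 300 words of Docs(T1)
# not among the top 500 of Docs(T2), symmetrically for W2; W3 = the rest.
from collections import Counter

def top_words(docs, k):
    counts = Counter(w for doc in docs for w in doc)
    return [w for w, _ in counts.most_common(k)]

def build_word_sets(docs_t1, docs_t2):
    vocab = {w for doc in docs_t1 + docs_t2 for w in doc}
    top500_t1 = set(top_words(docs_t1, 500))
    top500_t2 = set(top_words(docs_t2, 500))
    w1 = set(top_words(docs_t1, 300)) - top500_t2
    w2 = set(top_words(docs_t2, 300)) - top500_t1
    w3 = vocab - w1 - w2
    return [w1, w2, w3]
```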
Issues • Using word features: the words in a business email could vary a lot depending on what the business is. • Other important cues: • the relation between the sender and the recipient • Do they work in the same company? • What is the path between them in the company report chain? • Are they friends? • other emails in the same thread • the nature of the sender/recipient/company’s work and the words in the emails (e.g., “stock”, “parent meeting”) • … • Other ideas?
Email zone classification • Task: given a message, break it into zones (e.g., header, greeting, body, disclaimer, etc.) • Today’s paper: Andrew Lampert, Robert Dale, and Cecile Paris, 2009. Segmenting Email Message Text into Zones. In Proc. of EMNLP-2009 • Data: • Available at http://zebra.thoughtlets.org/ • Stored on patas under $data_dir/email_zoning_dataset/EmailZoneData/ • Stored on capuchin as a mysql database called “zonerelease”
Email zones in (Estival et al., 2007) • Five categories: • Author text • Signature • Advertisement (automatically appended ones) • Quoted text • Reply lines
Email zones in (Lampert et al., 2009) • Sender zones • Author: new content from the current email sender, excluding any text that has been included from previous messages. • Greetings: e.g., “Hi, Mike” • Signoff: e.g., “thanks. AJ” • Quoted conversation zones • Reply: content quoted from a previous message • Forward: Content from an email message outside the current conversation thread that has been forwarded by the current email sender
Email zones (cont) • Boilerplate zones: content that is reused without modification across multiple email messages • Signature • Advertising • Disclaimer • Attachment: automatically generated text
Manual annotation • Annotated data: • almost 400 email messages • 11881 lines (7922 non-blank lines) • use the Berkeley database (“berkeley_enron”) • one annotator • Use 10-fold cross validation
Automatic classification • Classifier: SVM • Two approaches: • two-stage (zone fragment classification): • segment a message into zone fragments • classify those fragments • one-stage: • classify each line
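A minimal sketch of the one-stage variant (classify each line directly), assuming scikit-learn and plain bag-of-words features; the paper's actual feature set is the hand-crafted one described on the next slides, and the lines / zone_labels arguments are hypothetical parallel lists of line texts and gold zone labels:

```python
# Sketch of the one-stage approach: each line is an instance for an SVM.
# Assumes scikit-learn; bag-of-words features stand in for the paper's
# hand-crafted graphic/orthographic/lexical features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate_line_classifier(lines, zone_labels, folds=10):
    """Mean 10-fold cross-validation accuracy for a per-line zone classifier."""
    pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
    return cross_val_score(pipeline, lines, zone_labels, cv=folds).mean()
```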
Detecting zone boundaries • Different kinds of boundaries (line numbers refer to the example message in the paper): • Blank boundaries: line 12 • Separate boundaries: lines 17-20 • Adjoining boundaries: lines 10 and 11 • Heuristic approach: • consider every blank line or line beginning with 4+ repeated punctuation marks • cannot handle adjoining boundaries • high recall, low precision
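A sketch of that heuristic: a candidate boundary at every blank line or line beginning with four or more repeats of the same punctuation character (the exact punctuation set is an assumption, not the paper's definition):

```python
# Sketch of the boundary-candidate heuristic described above.
# The punctuation set in the pattern is illustrative, not the paper's exact one.
import re

CANDIDATE = re.compile(r"^\s*$|^\s*([-_=*~.#>])\1{3,}")

def candidate_boundaries(message_lines):
    """Indices of lines that look like zone-fragment boundaries."""
    return [i for i, line in enumerate(message_lines)
            if CANDIDATE.match(line)]
```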
Classifying zone fragments • Features: • Graphic features: layout of text in the email • Orthographic features: the use of distinctive chars and char sequences including punctuation, capital letters and numbers • Lexical features: information about the words used in the email text
Graphic features • the number of words in the text fragment • the number of characters in the text fragment • the start position of the text fragment • the end position of the text fragment • the average line length (in chars) within the text fragment • the length of the text fragment relative to the previous fragment • the number of blank lines preceding the text fragment • …
Orthographic features • whether all lines start with the same character (e.g., ‘>’); • whether a prior text fragment in the message contains a quoted header; • whether a prior text fragment in the message contains repeated punctuation characters; • whether the text fragment contains a URL; • whether the text fragment contains an email address; • whether the text fragment contains a sequence of four or more digits; • the number of capitalised words in the text fragment; • the percentage of capitalised words in the text fragment; • …
Lexical features • word unigram • word bigram • whether the text fragment contains the sender’s name; • whether a prior text fragment in the message contains the sender’s name; • whether the text fragment contains the sender’s initials; and • whether the text fragment contains a recipient’s name.
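A sketch computing a few features from each of the three categories above for one text fragment; the feature names and the sender_name argument are illustrative, not the paper's exact feature set:

```python
# Sketch: a handful of graphic, orthographic, and lexical features for a
# zone fragment (a list of lines). Illustrative, not the paper's full set.
import re

def fragment_features(fragment_lines, sender_name):
    text = "\n".join(fragment_lines)
    words = text.split()
    return {
        # graphic
        "num_words": len(words),
        "num_chars": len(text),
        "avg_line_len": sum(len(l) for l in fragment_lines) / max(len(fragment_lines), 1),
        # orthographic
        "all_lines_start_same_char": len({l[:1] for l in fragment_lines if l}) == 1,
        "contains_url": bool(re.search(r"https?://\S+", text)),
        "pct_capitalised_words": sum(w[:1].isupper() for w in words) / max(len(words), 1),
        # lexical
        "contains_sender_name": sender_name.lower() in text.lower(),
    }
```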
Issues • Sequence labeling problem: • add features that look at the labels of preceding segments • Is the 9-zone label set sufficient? • How to take advantage of emails in the bigger context? • emails in the same discussion thread • emails by the same sender • general email structure: e.g., greeting, body, signoff, etc.
Papers for today • [11] M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. “Lying words: Predicting deception from linguistic style”. Personality and Social Psychology Bulletin, 29:665–675, 2003. • [13] L. Zhou, J.K. Burgoon, J.F. Nunamaker Jr., and D. Twitchell. “Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication”. Group Decision and Negotiation, 13:81–106, 2004.
(Newman et al., 2003) • Assumptions: deceptive communications should be characterized by • fewer first-person singular pronouns (e.g., “I”, “me”, and “my”): liars dissociate themselves from their statements • more words reflecting negative emotion: liars may feel guilty about lying or about the topic they are discussing • fewer “exclusive” words (e.g., “except”, “but”, “without”) and more action words (e.g., “walk”): lying consumes cognitive resources
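A sketch of how such word-category rates could be computed for a document; the tiny word lists are illustrative placeholders for the LIWC categories the paper actually uses:

```python
# Sketch: per-document rates for word categories like those in Newman et al.
# The word lists are tiny illustrative stand-ins for the LIWC categories.
FIRST_PERSON_SG = {"i", "me", "my", "mine", "myself"}
NEG_EMOTION = {"hate", "guilty", "sad", "angry", "worthless"}
EXCLUSIVE = {"but", "except", "without"}

def category_rates(text):
    words = text.lower().split()
    total = max(len(words), 1)
    return {
        "first_person_sg": sum(w in FIRST_PERSON_SG for w in words) / total,
        "neg_emotion": sum(w in NEG_EMOTION for w in words) / total,
        "exclusive": sum(w in EXCLUSIVE for w in words) / total,
    }
```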
Experiments: Five studies • videotaped abortion attitudes • typed abortion attitudes • handwritten abortion attitudes • feelings about friends • mock crime
Experiments • Trained on four of the studies and applied the resulting classifier to the remaining study • Accuracy: about 61% • The weights learned for these four types of words were consistent with their assumptions
(Zhou et al., 2004) • Experiments: • students were asked to exchange emails about a desert survival task • some students were asked to tell the truth, others to lie • Features: 27 linguistic cues
Hypothesis • Deceptive senders display • higher (a) quantity, (b) expressivity, (c) positive affect, (d) informality, (e) uncertainty, and (f) nonimmediacy, and • less (g) complexity, (h) diversity, and (i) specificity of language in their messages than truthful senders and than their respective receivers
Linguistic cues • quantity: • # of words • # of verbs • # of NPs • # of sentences • expressivity: • # of adj/adv divided by # of nouns and verbs
Linguistic cues (cont) • positive affect: expression of positive emotion • informality: # of misspelled words / # of words • uncertainty: • # of modifiers (adj/adv) • # of modal verbs • # of uncertainty words • # of third person pronouns
Linguistic cues (cont) • nonimmediacy: • passive voice • generalizing terms • (fewer) self references • group references: first person plural pronouns
Linguistic cues (cont) • Complexity: • avg. # of clauses per sentence • avg. sentence length • avg. word length • … • Diversity: • lexical diversity • content word diversity • redundancy • …
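A sketch of a few of these surface cues, using deliberately naive sentence and word splitting; the paper's operational definitions are more precise:

```python
# Sketch: a few complexity/diversity cues from the lists above, with naive
# sentence/word splitting (approximations, not the paper's exact definitions).
import re

def surface_cues(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    total = max(len(words), 1)
    return {
        "avg_sentence_length": total / max(len(sentences), 1),          # complexity
        "avg_word_length": sum(len(w) for w in words) / total,          # complexity
        "lexical_diversity": len({w.lower() for w in words}) / total,   # diversity
    }
```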
Issues • Different settings for deception could affect the cues (e.g., the length of the messages): • interviews • emails • blogs • lying spontaneously vs. being instructed to lie
Hw2 • Your presentation • Reading assignments • Suggestions for others’ projects