This research project applies a machine learning algorithm, specifically SVM classification, to analyze the Enron email corpus and identify different rhetorical moves within the emails. The dataset includes over 500,000 emails written by upper-level Enron management, and the classifier predicts class labels based on extracted features. The results could have wide applications for organizations looking to review their communication practices.
Mapping Technical Communication: Using SVM Classification to Identify Rhetorical Moves in the Enron Email Corpus
Ryan M. Omizo @OmizoRM
WIDE-MATRIX Research
Supervised Learning
• Machine learning algorithm fits a classifier to data coded by humans according to given classes
• Example classes used in sentiment analysis: “positive” and “negative”
• Textual data is normalized
• Notable features are extracted and associated with the given class
• Classifier predicts class labels on unseen or test data based on this learning
Enron Email Dataset
• Made public by the Federal Energy Regulatory Commission during its investigation
• 150+ email writers from upper-level Enron management
• 500,000+ emails
• Source files: http://www.cs.cmu.edu/~./enron/
• My current sample:
  • Sent emails from 11 users
  • 622 email documents
Why Enron Email Dataset?
• Large volume of textual data by a range of users
• Includes a diversity of writing styles and topics
• Divided by writer name and folders/documents
• Conforms to an established genre of professional communication
• Classifier could have wide applications for organizations seeking to mine/review their communication practices (e.g., customer relations emails; human resource communications; public relations)
Example Enron Emails

I have completed preliminary rankings on the attached sheet. You had included my name, and I did not put in a preliminary rank. Brent Price's name is missing. I would put him in the Excellent category at this time. Let me know if you need anything else. It is possible that I may change a few of the preliminary rankings as I get more information from my direct reports who supervise certain directors. I will let you know. --Sally

saw a lot of the bulls sell summer against length in front to mitigate margins/absolute position limits/var. as these guys are taking off the front, they are also buying back summer. el paso large buyer of next winter today taking off spreads. certainly a reason why the spreads were so strong on the way up and such a piece now. really the only one left with any risk premium built in is h/j now. it was trading equivalent of 180 on access, down 40+ from this morning. certainly if we are entering a period of bearish to neutral trade, h/j will get whacked. certainly understand the arguments for h/j. if h settles $20, that spread is probably worth $10. H 20 call was trading for 55 on monday. today it was 10/17. the market's view of probability of h going crazy has certainly changed in past 48 hours and that has to be reflected in h/j.
Procedure
1st pass text processing for Training and Testing
• Email files stripped of extraneous data (headers, footers, requests, numbers, unconventional characters such as emoticons)
• Tokenized into sentences (broken at end punctuation marks: ‘.’, ‘?’, ‘!’)
• Duplicate sentences are removed
• Exported to Excel for hand coding at the sentence level
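The sentence-splitting and de-duplication steps can be sketched with the standard library. This is a minimal illustration only: the function names `split_sentences` and `dedupe` are hypothetical, the regex is an assumed stand-in for the customized tokenizer used in the project, and header/footer stripping is not attempted here.

```python
import re


def split_sentences(text):
    """Break text after end punctuation ('.', '?', '!') followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def dedupe(sentences):
    """Drop repeated sentences (case-insensitive) while keeping first-seen order."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = sentence.lower()
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


# Toy email body assembled from sentences quoted on the slides.
body = (
    "Let me know if you need anything else. "
    "I will let you know. "
    "Let me know if you need anything else."
)
sentences = dedupe(split_sentences(body))
```

A splitter this simple will also break after abbreviations such as "e.g.", so real email text would likely need extra rules before hand coding.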
Procedure
Coding Scheme
• Staging – provides readers information required to understand a task or a situation
• Guiding – involves author and readers in a goal-oriented process in which actions must be completed and/or decisions must be reached
• Bridging – prompts turn-taking exchanges or phatic communication in order to build and maintain relationships
Code - Staging
Provides readers information required to understand a task or a situation
• Stipulatory
• Describes past actions or conditions of future actions
• Often contains reported speech
• “Larry I am going on vacation for two weeks”
• “You have asked about the sulfur analyzer”
• “These permits went out for public notice on October, which means that the comment period ends on . . .”
Code - Guiding
Involves author and readers in a goal-oriented process in which actions must be completed and/or decisions must be reached
• Stipulatory or deliberative
• Will often be imperative in mood
• Step by step instructions
• Information gathering
• Recommendations
• Options
• “Use one of the following methods when using manual sampling”
• “Please advise so we can keep the project moving”
• “If you have any questions, let me know”
Code - Bridging
Prompts turn-taking exchanges or phatic communication in order to build and maintain relationships
• Salutations that might open a conversation
  • “Dear James, I hope you are doing well.”
• Closings that invite feedback or future correspondence
  • “Hope to hear from you”
  • “Looking forward to dinner”
• Bids for understanding, agreement, or mollification
  • “Sorry for not getting back to you sooner.”
• Offers affective motivation for the completion of an action or the building of relationships, using hortatory language
  • “Let’s do this!” or “Go team!”
Procedure – Code Sample

I think that it is a great idea to compare recommendations for promotions. Our meeting will take place on December 8th, so we can be ready to discuss promotion recommendations as early as Monday, December 11. When would be a good time for you? The invitation to attend our December 8th meeting still stands. I know that it is along way to come for a single meeting (although it will most likely be a fully day), but if there are other things that you could accomplish while here, it could be worth it. --Sally

• Staging
• Guiding
• Bridging
Training and Testing
To ensure an even distribution, each class was capped at 0.8 of the size of the smallest class in the training set (Bridging). The training and testing data are randomized before the creation of each classifier.
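One way to read the capping rule is sketched below: group the coded sentences by label, cap every class at 0.8 times the size of the smallest class, then shuffle. The function name `balance_and_shuffle`, the `seed` parameter, and the toy row counts are assumptions made for illustration, not details taken from the project.

```python
import random
from collections import Counter, defaultdict


def balance_and_shuffle(rows, ratio=0.8, seed=None):
    """Cap each class at ratio * (size of smallest class), then shuffle.

    rows is a list of (sentence, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in rows:
        by_label[label].append((sentence, label))

    # The cap is derived from the smallest class (Bridging in the slides).
    cap = int(ratio * min(len(items) for items in by_label.values()))

    balanced = []
    for items in by_label.values():
        rng.shuffle(items)            # pick a random subset of each class
        balanced.extend(items[:cap])
    rng.shuffle(balanced)             # randomize order before training
    return balanced


# Toy coded data: 20 Staging, 15 Guiding, 10 Bridging sentences.
rows = (
    [(f"staging sentence {i}", "Staging") for i in range(20)]
    + [(f"guiding sentence {i}", "Guiding") for i in range(15)]
    + [(f"bridging sentence {i}", "Bridging") for i in range(10)]
)
balanced = balance_and_shuffle(rows, seed=0)
counts = Counter(label for _, label in balanced)
```

With 10 Bridging sentences the cap works out to 8 per class, so the balanced set holds 24 rows.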
Processing
Processing is completed with a combination of customized tokenizers and modules from http://scikit-learn.org and nltk.org.
• Training and testing data is:
  • Cleaned with a regex filter to remove contractions and other non-content
  • Tokenized at the word level
    • Punctuation marks like “?” and “!” remain as tokens
  • Select parts of speech are tagged and added to the tokenized list
  • Vectorized
  • TF-IDF ranked
  • KN-best features are selected
  • Non-selections are forced by taking the class with the highest decision function value
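The word-level tokenization step, with “?” and “!” kept as tokens, might look roughly like this. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical stand-ins for the customized tokenizers mentioned above; the sketch keeps contractions as single tokens and does not reproduce the regex cleaning filter or the part-of-speech tagging.

```python
import re

# Assumed pattern: lowercase words (with an optional straight-apostrophe
# suffix) plus '?' and '!' as standalone tokens. Other punctuation and
# digits are dropped.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?|[?!]")


def tokenize(sentence):
    """Lowercase a sentence and return word tokens plus '?' and '!' tokens."""
    return TOKEN_RE.findall(sentence.lower())


tokens = tokenize("If you have any questions, let me know!")
```

Keeping “?” and “!” as tokens lets the classifier treat them as features, which seems useful for moves such as questions that invite a reply.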
Classifier – OnevsRest Classifier
• sklearn.multiclass.OneVsRestClassifier
• Multi-class classifier – 1 classifier per class
• Linear SVC
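The vectorizing, TF-IDF weighting, feature selection, and one-vs-rest linear SVM steps can be chained in a scikit-learn pipeline. The sketch below is an assumption-laden illustration rather than the project's code: the `chi2` scoring function, `k=10`, the `token_pattern`, and the eight training sentences (taken from the example quotes on the coding slides) are all chosen here for demonstration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training set built from the slide examples.
train_sentences = [
    "Larry I am going on vacation for two weeks",
    "You have asked about the sulfur analyzer",
    "Use one of the following methods when using manual sampling",
    "Please advise so we can keep the project moving",
    "If you have any questions, let me know",
    "Hope to hear from you",
    "Looking forward to dinner",
    "Sorry for not getting back to you sooner.",
]
train_labels = [
    "Staging", "Staging",
    "Guiding", "Guiding", "Guiding",
    "Bridging", "Bridging", "Bridging",
]

model = make_pipeline(
    # Words plus '?' and '!' as tokens (assumed pattern).
    TfidfVectorizer(token_pattern=r"[A-Za-z]+|[?!]"),
    # Keep the k highest-scoring features; chi2 and k=10 are assumptions.
    SelectKBest(chi2, k=10),
    # One LinearSVC per class.
    OneVsRestClassifier(LinearSVC()),
)
model.fit(train_sentences, train_labels)

# One decision value per class; the largest value picks the label.
scores = model.decision_function(["Please let me know if you have questions"])
labels = model.classes_[np.argmax(scores, axis=1)]
```

Taking the argmax over the per-class decision values guarantees that every sentence receives exactly one label, even when no single binary classifier scores it as positive.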
Classifier – SVM
10-Fold Cross Validation Results
• 0.80 training/0.20 testing
• Training and testing data randomized after each trial
• Accuracy = 0.74044795783926221
*precision = true positive / (true positive + false positive)
*recall = true positive / (true positive + false negative)
*F1 score = 2 / (1/precision + 1/recall)
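The three definitions can be computed per class from lists of true and predicted labels. The sketch below uses hypothetical toy labels, not the project's results, and the function name `per_class_scores` is an assumption. F1 is written as 2 * precision * recall / (precision + recall), which is algebraically the same as 2 / (1/precision + 1/recall) but avoids dividing by zero.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical hand-coded labels and classifier predictions.
y_true = ["Staging", "Staging", "Guiding", "Bridging", "Guiding", "Staging"]
y_pred = ["Staging", "Guiding", "Guiding", "Bridging", "Staging", "Staging"]

guiding = per_class_scores(y_true, y_pred, "Guiding")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy data Guiding has one true positive, one false positive, and one false negative, so precision, recall, and F1 all come out to 0.5.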
Potential Lessons Learned
• Realistic to computationally classify rhetorical moves at the sentence level
• Requires a large corpus of coded data
• Requires more refined classes (i.e., more specific; more numerous)