Distributions and Distributional Lexical Semantics for Stop Lists
Corpus Profiling 2008, BCS London
Neil Cooke BSc DMS CEng FIET, PhD Student, CCSR
Dr Lee Gillam, Computer Science Department
Contents
• Introduction
• Finding Enron's Confidential Information
• Lexical Semantic techniques
• Archaeological remains of Context
• Choosing the right stop words
• Lexical Semantic Similarity
• Questions
Introduction
• Our domain of research: security and intellectual property protection
• Context-sensitive checking of outgoing emails to remove false positives
• The search for accidental stupidity, not for the professional spy
Introduction
• Zipfian Expectations
[Figure: Zipf plot of f*r against log rank]
Introduction
• Zipfian Expectations
[Figure: Zipf plot, annotated at the low-frequency words]
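A minimal sketch of the Zipf check behind these plots: under Zipf's law f*r is roughly constant, so plotting it against log rank should give a near-flat line, and departures in the low-frequency tail point at corpus noise. The whitespace tokenisation and the "corpus.txt" filename are illustrative assumptions, not the authors' pipeline.

```python
# Compute (log rank, f*r) pairs for a tokenised corpus; under Zipf's law
# the second component should be roughly constant across ranks.
from collections import Counter
import math

def zipf_profile(tokens):
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)  # rank 1 = most frequent
    return [(math.log(rank), freq * rank)
            for rank, freq in enumerate(ranked, start=1)]

tokens = open("corpus.txt", encoding="utf8").read().lower().split()
for log_rank, f_times_r in zipf_profile(tokens)[:10]:
    print(f"log rank {log_rank:5.2f}   f*r {f_times_r}")
```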
Introduction
• Sources of corpora variance
• Typos / spelling mistakes
• Duplication: straight (exact) copy; reworded copy (sketched below)
• Sources of Enron variance
• Straight duplicate emails (52%)
• Near-duplicate emails (2%)
• Specialist machine text: email formatting
• Specialist text: business, power generation, social
• Straight & reworded text duplication: banners
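Since straight and near duplicates account for over half of the Enron variance, a hedged sketch of one standard way to flag them; word-shingle Jaccard similarity and the 0.8 threshold are assumptions, not necessarily the method used in this work.

```python
# Flag straight and near-duplicate emails by Jaccard similarity over
# 5-word shingles; exact copies score 1.0, reworded copies slightly less.
def shingles(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def near_duplicates(emails, threshold=0.8):
    profiles = [shingles(e) for e in emails]
    return [(i, j)
            for i in range(len(profiles))
            for j in range(i + 1, len(profiles))
            if jaccard(profiles[i], profiles[j]) >= threshold]
```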
Introduction
[Figure: Enron raw vs Enron clean frequency distributions]
Finding Enron's Confidential Information
Key word "Confidential": banner or real text?
DISCLAIMER: This e-mail message is intended only for the named recipient(s) above and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender and delete this e-mail message.
Finding & using the banner context vector space
• 25 users; 94,005 emails; 4,608 "confidential" emails
• 3,223 banner instances, 22 key words
• 2,663 body instances, 22 key words
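A minimal sketch of the banner-versus-body decision this slide implies: score each occurrence of "confidential" by how many banner key words fall in its context window. The key-word set, the window of 10 tokens, and the threshold of 3 below are illustrative assumptions, not the 22 key words used in the study.

```python
# Decide whether an occurrence of "confidential" sits in a disclaimer
# banner or in real body text by counting banner key words nearby.
BANNER_KEYWORDS = {"recipient", "privileged", "disclosure", "sender",
                   "delete", "notify", "error", "intended"}

def is_banner(tokens, hit_index, window=10, threshold=3):
    lo, hi = max(0, hit_index - window), hit_index + window + 1
    context = set(tokens[lo:hi])
    return len(context & BANNER_KEYWORDS) >= threshold
```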
Choosing the right words
• Collocates with low entropy tend to flat-line
• Collocates with high entropy tend to peak
• Kurtosis is a bit hard to compute, so use energy instead
• Can do this on two axes:
• Collocate: Q_peak
• Nucleate: Q_test
• Q_test = Sum(Q_peak) / number of collocates
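A minimal sketch of the Q scores above: build a positional histogram for each collocate of the nucleate word, score its peakedness, then average over collocates to get Q_test. Here "energy" is taken as the sum of squared positional probabilities (1/width for a flat line, 1.0 for a single spike), an assumption standing in for the slide's kurtosis-like measure.

```python
# Positional histograms of collocates around a nucleate word, scored
# for peakedness; Q_test averages Q_peak over all collocates.
from collections import Counter, defaultdict

def positional_histograms(tokens, nucleate, window=5):
    hists = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != nucleate:
            continue
        for offset in range(-window, window + 1):
            if offset != 0 and 0 <= i + offset < len(tokens):
                hists[tokens[i + offset]][offset] += 1
    return hists

def q_peak(hist):
    total = sum(hist.values())
    return sum((c / total) ** 2 for c in hist.values())

def q_test(tokens, nucleate, window=5):
    hists = positional_histograms(tokens, nucleate, window)
    return sum(q_peak(h) for h in hists.values()) / len(hists) if hists else 0.0
```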
Choosing the right words
• Should be able to identify stop words
• Top 2,000 BNC words used as the stop-word reference list, of which 1,262 match the top 3,992 collocates of "energy"
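The overlap count above reduces to a set intersection; a minimal sketch follows, where the file names and one-word-per-line format are assumptions.

```python
# Intersect the BNC stop-word reference list with the top-ranked collocates.
def load_words(path, limit):
    with open(path, encoding="utf8") as f:
        return {line.strip() for line in list(f)[:limit]}

stops = load_words("bnc_top2000.txt", 2000)
collocates = load_words("collocates_ranked.txt", 3992)
print(len(stops & collocates), "stop words among the top collocates")
```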
Lexical Semantic Similarity
• Should be able to use the same profiles to identify similarity: Dice & cosine
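A minimal sketch of the two measures named here, over collocate profiles held as {word: count} dictionaries; Dice is taken over the shared vocabulary and cosine over the counts, one common convention among several.

```python
# Dice and cosine similarity between two collocate profiles.
import math

def dice(a, b):
    shared = set(a) & set(b)
    return 2 * len(shared) / (len(a) + len(b)) if (a or b) else 0.0

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```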
Lexical Semantic Similarity
• Terms with medium document frequency are used directly
• Terms with high document frequency should be moved to the left by transforming them into entities of lower frequency
• Terms with low document frequency should be moved to the right on the document-frequency spectrum by transforming them into entities of higher frequency
• Depreciating common or stop words
• Appreciating rare words
[Figure: document-frequency spectrum, with poor discriminators at the high-frequency end and good discriminators at medium frequency]
Salton, G., Wong, A. and Yang, C.S., 1975, "A vector space model for automatic indexing", Communications of the ACM, 18(11):613-620.
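One standard way to realise the depreciate/appreciate idea is an IDF-style weight, sketched below; the log form is a conventional choice, not necessarily the transform used in this work. Moving low-frequency terms to the right would correspondingly map variants (stems, synonyms) onto a single higher-frequency entity.

```python
# IDF-style weighting: high-document-frequency (stop-like) terms get
# weights near zero, rare terms get large weights.
import math

def idf_weights(doc_freq, n_docs):
    """doc_freq maps each term to the number of documents containing it."""
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}
```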
Lexical Semantic Similarity
• Width of the collocate window reduces precision
• Shape is important
• It's a broadband/narrowband signal-to-noise ratio issue
[Figure: signal and noise as a function of window size]
Bullinaria, J.A. and Levy, J.P., 2006, "Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study".
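A sketch of the width/shape point: co-occurrence counting with either a flat or a triangular window, where the triangular weighting favours near context and so trades window width against noise. Both shapes and their weights are illustrative assumptions.

```python
# Co-occurrence counts with a configurable window shape: "flat" weights
# all offsets equally, "triangular" down-weights distant neighbours.
from collections import defaultdict

def cooccurrence(tokens, window=5, shape="triangular"):
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                w = 1.0 if shape == "flat" else (window - d + 1) / window
                counts[(word, tokens[i + d])] += w
                counts[(tokens[i + d], word)] += w
    return counts
```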
Further work to do
• Is it better or worse than other methods?
• Carry out a synonym test using the TOEFL data set
• Compare the Qw approach against the frequency-based cosine approach
Bullinaria, J.A. and Levy, J.P., 2006, "Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study".
TOEFL test data provided by Tom Landauer, Institute of Cognitive Science, University of Colorado Boulder.
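For the planned TOEFL comparison, a hedged harness sketch: each item gives a probe word, candidate words, and the answer index, and the model picks the candidate most similar to the probe. The item format is an assumption about the data set's layout, and `vectors` maps words to collocate profiles as in the cosine sketch above.

```python
# Score a model on TOEFL-style synonym items: accuracy is the fraction
# of items where the highest-similarity candidate is the keyed answer.
def toefl_accuracy(items, vectors, similarity):
    correct = 0
    for probe, candidates, answer in items:
        scores = [similarity(vectors[probe], vectors[c]) for c in candidates]
        if scores.index(max(scores)) == answer:
            correct += 1
    return correct / len(items)
```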
Any Questions?