
Distributions and Distributional Lexical Semantics for Stop Lists



Presentation Transcript


  1. Distributions and Distributional Lexical Semantics for Stop Lists. Corpus Profiling 2008, BCS London. Neil Cooke BSc DMS CEng FIET, PhD Student, CCSR; Dr Lee Gillam, Computer Science Department

  2. Contents • Introduction • Finding Enron’s Confidential Information • Lexical Semantic techniques • Archaeological remains of Context • Choosing the right stop words • Lexical Semantic Similarity • Questions

  3. Introduction • Our domain of research: security and intellectual property protection • Context-sensitive checking of outgoing emails to remove false positives • The search for accidental stupidity, not for the professional spy

  4. Introduction • Zipfian Expectations [plot: f*r against log rank]

  5. Introduction • Zipfian Expectations [plot: low-frequency words highlighted]
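
The Zipfian expectation on slides 4–5 can be checked directly: under Zipf’s law the product f*r of a word’s frequency and its rank is roughly constant, so plotting it against log rank should give a near-flat line. A minimal sketch (the function name and structure are illustrative, not from the slides):

```python
# Hedged sketch: compute f*r by rank to check the Zipfian expectation
# that frequency times rank stays roughly constant across the vocabulary.
from collections import Counter

def f_times_r(tokens):
    """Return (rank, frequency, f*r) triples for a ranked frequency list."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return [(rank, f, f * rank) for rank, f in enumerate(freqs, start=1)]
```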

  6. Introduction • Sources of corpora variance: typos and spelling mistakes; duplication (straight/exact copy, reworded copy) • Sources of Enron variance: straight duplicate emails (52%); near-duplicate emails (2%); specialist machine text: email formatting; specialist text: business, power generation, social; straight and reworded text duplication: banners

  7. Introduction • Enron raw vs. Enron clean [comparison plot]

  8. Finding Enron’s confidential information • Key word “Confidential”: banner or real text? Example banner: “DISCLAIMER: This e-mail message is intended only for the named recipient(s) above and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender and delete this e-mail message.”

  9. Finding & using size • Banner context vector space: 25 users; 94,005 emails; 4,608 “confidential” emails • 3,223 banner instances (22 key words) • 2,663 body instances (22 key words)

  10. Choosing the right words • Collocates with low entropy tend to flat-line • Collocates with high entropy tend to peak • Kurtosis: a bit hard to compute and use • For “energy”, this can be done on two axes: collocate: Q_peak; nucleate: Q_test, where Q_test = sum(Q_peak) / (number of collocates)
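
A rough sketch of how a per-collocate peakedness score might be computed and rolled up into Q_test. The positional-profile construction and the max/mean peak measure are assumptions made for illustration; only the final step, Q_test = sum(Q_peak) / number of collocates, comes from the slide.

```python
# Hedged sketch: the profile and peak measure below are assumptions;
# only Q_test = sum(Q_peak) / number_of_collocates is taken from the slide.
from collections import defaultdict

def positional_profiles(tokens, nucleate, window=5):
    """Count how often each collocate appears at each offset from the nucleate."""
    profiles = defaultdict(lambda: [0] * (2 * window + 1))
    for i, tok in enumerate(tokens):
        if tok != nucleate:
            continue
        for off in range(-window, window + 1):
            j = i + off
            if off == 0 or j < 0 or j >= len(tokens):
                continue
            profiles[tokens[j]][off + window] += 1
    return profiles

def q_peak(profile):
    """Peakedness of one collocate's positional profile (assumed measure:
    maximum count relative to the mean count; a flat line scores ~1)."""
    total = sum(profile)
    if total == 0:
        return 0.0
    mean = total / len(profile)
    return max(profile) / mean

def q_test(tokens, nucleate, window=5):
    """Q_test = sum of Q_peak over the number of collocates (from the slide)."""
    profiles = positional_profiles(tokens, nucleate, window)
    if not profiles:
        return 0.0
    return sum(q_peak(p) for p in profiles.values()) / len(profiles)
```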

  11. Choosing the right words • Should be able to identify stop words • Top 2000 BNC words used as the stop-word reference list, of which 1,262 match the top 3,992 collocates of “energy”
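
Slide 11’s check, how well flat-lining collocates line up with a reference stop list such as the top-2000 BNC words, could be sketched as below; the threshold rule for declaring a candidate is an assumption, not the authors’ criterion.

```python
# Hedged sketch: flag collocates with near-flat profiles (low Q_peak) as
# stop-word candidates and measure agreement with a reference stop list.
def stopword_candidates(q_peak_by_collocate, flatness_threshold=1.5):
    """Collocates with near-flat profiles (low Q_peak) are candidate stop words."""
    return {w for w, q in q_peak_by_collocate.items() if q <= flatness_threshold}

def agreement_with_reference(candidates, reference_stoplist):
    """Fraction of the reference stop list recovered among the candidates."""
    reference = set(reference_stoplist)
    return len(candidates & reference) / len(reference) if reference else 0.0
```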

  12. Lexical Semantic Similarity • Should be able to use it to identify similarity: Dice and cosine coefficients
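
The two measures named on slide 12, applied here to sparse collocate vectors stored as {collocate: count} dictionaries; a minimal sketch rather than the authors’ exact setup.

```python
# Hedged sketch of the standard Dice and cosine measures over collocate vectors.
import math

def cosine(u, v):
    """Cosine similarity between two sparse collocate count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def dice(u, v):
    """Dice coefficient over the sets of collocates the two words share."""
    shared = len(u.keys() & v.keys())
    total = len(u) + len(v)
    return 2 * shared / total if total else 0.0
```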

  13. Lexical Semantic Similarity • Terms with medium document frequency are used directly • Terms with high document frequency should be moved to the left on the document-frequency spectrum by transforming them into entities of lower frequency • Terms with low document frequency should be moved to the right by transforming them into entities of higher frequency • i.e. depreciating common or stop words and appreciating rare words • Salton G., Wong A., Yang C.S., 1975, “A Vector Space Model for Automatic Indexing”, Communications of the ACM, 18(11):613-620 [diagram: poor vs. good discriminators against frequency]
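
One conventional way to depreciate high-document-frequency terms and appreciate rare ones, in the spirit of Salton’s discrimination argument, is idf-style reweighting. The slide does not say which transform the authors use, so this is illustrative only.

```python
# Hedged sketch: idf-style reweighting so that common terms count for less
# and rare terms for more; not necessarily the authors' actual transform.
import math

def idf_weights(doc_freq, n_docs):
    """Map raw document frequencies to weights that fall as frequency rises."""
    return {term: math.log(n_docs / df) for term, df in doc_freq.items() if df > 0}

def reweight(term_counts, weights):
    """Scale a document's term counts by the idf-style weights."""
    return {t: c * weights.get(t, 0.0) for t, c in term_counts.items()}
```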

  14. Lexical Semantic Similarity • Bullinaria J.A., Levy J.P., 2006, “Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study” • Wider collocate windows reduce precision • The window’s shape is important • It’s a broadband/narrow-band signal-to-noise ratio issue [plot: signal and noise against window size]
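
A sketch of giving the collocation window a shape: a triangular, distance-weighted window in which nearer co-occurrences contribute more. The specific weighting is an assumption; the slide only states that window width and shape matter.

```python
# Hedged sketch: a triangular (distance-weighted) collocation window.
from collections import defaultdict

def weighted_cooccurrence(tokens, target, window=5):
    """Sum triangular weights for each collocate of `target` within the window."""
    counts = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for off in range(-window, window + 1):
            j = i + off
            if off == 0 or j < 0 or j >= len(tokens):
                continue
            counts[tokens[j]] += (window - abs(off) + 1) / window
    return dict(counts)
```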

  15. Further work to do • Is it better or worse than other methods? • Carry out a synonym test using the TOEFL data set • Compare the Qw approach against a frequency-based cosine approach (Bullinaria J.A., Levy J.P., 2006, “Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study”) • TOEFL test data provided by Tom Landauer, Institute of Cognitive Science, University of Colorado Boulder

  16. Any Questions?
