Distributions and Distributional Lexical Semantics for Stop Lists
Corpus Profiling 2008, BCS London
Neil Cooke BSc DMS CEng FIET, PhD Student, CCSR
Dr Lee Gillam, Computer Science Department
Contents
• Introduction
• Finding Enron's Confidential Information
• Lexical Semantic techniques
• Archaeological remains of Context
• Choosing the right stop words
• Lexical Semantic Similarity
• Questions
Introduction
• Our domain of research: security and intellectual property protection
• Context-sensitive checking of outgoing emails to remove false positives
• The search for accidental stupidity, not for the professional spy
Introduction
• Zipfian Expectations
[Figure: Zipf plot of f*r against log rank]
Introduction
• Zipfian Expectations
[Figure: Zipf plot, annotated at the low-frequency words]
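A minimal sketch of the Zipf check behind these plots: under Zipf's law f*r is roughly constant, so plotting it against log rank should give a near-flat line, and departures in the low-frequency tail point at corpus noise. The whitespace tokenisation and the "corpus.txt" filename are illustrative assumptions, not the authors' pipeline.

```python
# Compute (log rank, f*r) pairs for a tokenised corpus; under Zipf's law
# the second component should be roughly constant across ranks.
from collections import Counter
import math

def zipf_profile(tokens):
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)  # rank 1 = most frequent
    return [(math.log(rank), freq * rank)
            for rank, freq in enumerate(ranked, start=1)]

tokens = open("corpus.txt", encoding="utf8").read().lower().split()
for log_rank, f_times_r in zipf_profile(tokens)[:10]:
    print(f"log rank {log_rank:5.2f}   f*r {f_times_r}")
```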
Introduction
• Sources of corpora variance
• Typos / spelling mistakes
• Duplication: straight (exact) copy; reworded copy (sketched below)
• Sources of Enron variance
• Straight duplicate emails (52%)
• Near-duplicate emails (2%)
• Specialist machine text: email formatting
• Specialist text: business, power generation, social
• Straight & reworded text duplication: banners
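Since straight and near duplicates account for over half of the Enron variance, a hedged sketch of one standard way to flag them; word-shingle Jaccard similarity and the 0.8 threshold are assumptions, not necessarily the method used in this work.

```python
# Flag straight and near-duplicate emails by Jaccard similarity over
# 5-word shingles; exact copies score 1.0, reworded copies slightly less.
def shingles(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def near_duplicates(emails, threshold=0.8):
    profiles = [shingles(e) for e in emails]
    return [(i, j)
            for i in range(len(profiles))
            for j in range(i + 1, len(profiles))
            if jaccard(profiles[i], profiles[j]) >= threshold]
```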
Introduction
[Figure: Enron raw vs Enron clean frequency distributions]
Finding Enron's Confidential Information
Key word "Confidential": banner or real text?
DISCLAIMER: This e-mail message is intended only for the named recipient(s) above and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender and delete this e-mail message.
Finding & using the banner context vector space
• 25 users; 94,005 emails; 4,608 "confidential" emails
• 3,223 banner instances, 22 key words
• 2,663 body instances, 22 key words
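A minimal sketch of the banner-versus-body decision this slide implies: score each occurrence of "confidential" by how many banner key words fall in its context window. The key-word set, the window of 10 tokens, and the threshold of 3 below are illustrative assumptions, not the 22 key words used in the study.

```python
# Decide whether an occurrence of "confidential" sits in a disclaimer
# banner or in real body text by counting banner key words nearby.
BANNER_KEYWORDS = {"recipient", "privileged", "disclosure", "sender",
                   "delete", "notify", "error", "intended"}

def is_banner(tokens, hit_index, window=10, threshold=3):
    lo, hi = max(0, hit_index - window), hit_index + window + 1
    context = set(tokens[lo:hi])
    return len(context & BANNER_KEYWORDS) >= threshold
```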
Choosing the right words
• Collocates with low entropy tend to flat-line
• Collocates with high entropy tend to peak
• Kurtosis is a bit hard to compute, so use energy instead
• Can do this on two axes:
• Collocate: Q_peak
• Nucleate: Q_test
• Q_test = Sum(Q_peak) / number of collocates
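A minimal sketch of the Q scores above: build a positional histogram for each collocate of the nucleate word, score its peakedness, then average over collocates to get Q_test. Here "energy" is taken as the sum of squared positional probabilities (1/width for a flat line, 1.0 for a single spike), an assumption standing in for the slide's kurtosis-like measure.

```python
# Positional histograms of collocates around a nucleate word, scored
# for peakedness; Q_test averages Q_peak over all collocates.
from collections import Counter, defaultdict

def positional_histograms(tokens, nucleate, window=5):
    hists = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != nucleate:
            continue
        for offset in range(-window, window + 1):
            if offset != 0 and 0 <= i + offset < len(tokens):
                hists[tokens[i + offset]][offset] += 1
    return hists

def q_peak(hist):
    total = sum(hist.values())
    return sum((c / total) ** 2 for c in hist.values())

def q_test(tokens, nucleate, window=5):
    hists = positional_histograms(tokens, nucleate, window)
    return sum(q_peak(h) for h in hists.values()) / len(hists) if hists else 0.0
```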
Choosing the right words
• Should be able to identify stop words
• Top 2,000 BNC words used as the stop-word reference list, of which 1,262 match the top 3,992 collocates of "energy"
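The overlap count above reduces to a set intersection; a minimal sketch follows, where the file names and one-word-per-line format are assumptions.

```python
# Intersect the BNC stop-word reference list with the top-ranked collocates.
def load_words(path, limit):
    with open(path, encoding="utf8") as f:
        return {line.strip() for line in list(f)[:limit]}

stops = load_words("bnc_top2000.txt", 2000)
collocates = load_words("collocates_ranked.txt", 3992)
print(len(stops & collocates), "stop words among the top collocates")
```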
Lexical Semantic Similarity
• Should be able to use the same profiles to identify similarity: Dice & cosine
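A minimal sketch of the two measures named here, over collocate profiles held as {word: count} dictionaries; Dice is taken over the shared vocabulary and cosine over the counts, one common convention among several.

```python
# Dice and cosine similarity between two collocate profiles.
import math

def dice(a, b):
    shared = set(a) & set(b)
    return 2 * len(shared) / (len(a) + len(b)) if (a or b) else 0.0

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```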
Lexical Semantic Similarity
• Terms with medium document frequency are used directly
• Terms with high document frequency should be moved to the left by transforming them into entities of lower frequency
• Terms with low document frequency should be moved to the right on the document-frequency spectrum by transforming them into entities of higher frequency
• Depreciating common or stop words
• Appreciating rare words
[Figure: document-frequency spectrum, with poor discriminators at the high-frequency end and good discriminators at medium frequency]
Salton, G., Wong, A. and Yang, C.S., 1975, "A vector space model for automatic indexing", Communications of the ACM, 18(11):613-620.
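One standard way to realise the depreciate/appreciate idea is an IDF-style weight, sketched below; the log form is a conventional choice, not necessarily the transform used in this work. Moving low-frequency terms to the right would correspondingly map variants (stems, synonyms) onto a single higher-frequency entity.

```python
# IDF-style weighting: high-document-frequency (stop-like) terms get
# weights near zero, rare terms get large weights.
import math

def idf_weights(doc_freq, n_docs):
    """doc_freq maps each term to the number of documents containing it."""
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}
```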
Lexical Semantic Similarity
• Width of the collocate window reduces precision
• Shape is important
• It's a broadband/narrowband signal-to-noise ratio issue
[Figure: signal and noise as a function of window size]
Bullinaria, J.A. and Levy, J.P., 2006, "Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study".
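A sketch of the width/shape point: co-occurrence counting with either a flat or a triangular window, where the triangular weighting favours near context and so trades window width against noise. Both shapes and their weights are illustrative assumptions.

```python
# Co-occurrence counts with a configurable window shape: "flat" weights
# all offsets equally, "triangular" down-weights distant neighbours.
from collections import defaultdict

def cooccurrence(tokens, window=5, shape="triangular"):
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                w = 1.0 if shape == "flat" else (window - d + 1) / window
                counts[(word, tokens[i + d])] += w
                counts[(tokens[i + d], word)] += w
    return counts
```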
Further work to do
• Is it better or worse than other methods?
• Carry out a synonym test using the TOEFL data set
• Compare the Qw approach against the frequency-based cosine approach
Bullinaria, J.A. and Levy, J.P., 2006, "Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study".
TOEFL test data provided by Tom Landauer, Institute of Cognitive Science, University of Colorado Boulder.
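For the planned TOEFL comparison, a hedged harness sketch: each item gives a probe word, candidate words, and the answer index, and the model picks the candidate most similar to the probe. The item format is an assumption about the data set's layout, and `vectors` maps words to collocate profiles as in the cosine sketch above.

```python
# Score a model on TOEFL-style synonym items: accuracy is the fraction
# of items where the highest-similarity candidate is the keyed answer.
def toefl_accuracy(items, vectors, similarity):
    correct = 0
    for probe, candidates, answer in items:
        scores = [similarity(vectors[probe], vectors[c]) for c in candidates]
        if scores.index(max(scores)) == answer:
            correct += 1
    return correct / len(items)
```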
Any Questions?