Selecting ‘Suspicious’ Messages in Intercepted Communication
David Skillicorn
School of Computing, Queen’s University
Research in Information Security, Kingston (RISK)
Math and CS, Royal Military College
skill@cs.queensu.ca
Legal interception occurs in three main contexts:
• Government broad-spectrum interception of communication for national defence and intelligence (e.g. Echelon). Usually excludes communication between ‘citizens’, who are hard to identify in practice, so a simple surrogate rule is usually applied.
• Law enforcement interception pursuant to a warrant.
• Organizational interception, typically of email and IM, looking for improper behaviour (e.g. SOX violations), criminal activity, and industrial espionage.
Lots of other communication takes place in public, and so is always available for examination: chat, blogs, web pages.
For governments and organizations, the volumes of data intercepted are large: 3 billion messages per day for Echelon; 1 TB/day for the CIA. Finding anything interesting in this torrent is a challenge – interesting messages are 1 in a million or fewer. Early-stage processing must concentrate on finding the definitely uninteresting, so that it can be discarded. Selecting the potentially interesting can be done downstream, with more sophistication, because the volumes are much smaller.
The main approach to selection: use a set of keywords whose presence causes a message to be selected for further analysis. Example: the German Federal Intelligence Service used watch lists for nuclear proliferation (2000 terms), arms trade (1000), terrorism (500), and drugs (400), as of 2000 (certainly changed now). It also seems plausible that a range of other techniques are applied, based on properties such as content overlap, sender/receiver identities, times of transmission, specialized word use, etc. (Social Network Analysis).
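A keyword watch list amounts to a simple membership test over the tokens of each message. A minimal sketch in Python, with a purely illustrative watch list (the real lists are far larger and not public):

```python
# Minimal keyword-based selection; the watch-list terms here are illustrative
# placeholders, not taken from any real list.
WATCHLIST = {"nuclear", "centrifuge", "detonator"}

def is_selected(message: str) -> bool:
    """Flag a message for further analysis if any watch-list term appears."""
    return not WATCHLIST.isdisjoint(message.lower().split())

messages = ["meeting at noon", "the centrifuge parts arrived"]
print([m for m in messages if is_selected(m)])  # ['the centrifuge parts arrived']
```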
1. Models that assume that the problem is to discover the boundary between ‘good’ and ‘bad’ based on some fixed set of properties can be defeated easily. The carnival booth approach – probe to learn the boundary, then avoid it. Looking for a fixed set of anomalies means missing an unexpected anomaly. Randomizing the boundary can help. Better to look for anything unusual rather than the expected unusual.
2. It is hard for humans to behave unnaturally in a natural way. This is even more true when the behaviour is subconscious, e.g.:
• Stephen Potter’s Oneupmanship for tennis players
• customs interviews
• digit choices on tax returns/accounts/false invoices
So there’s an inherent signature to unnatural behaviour, in any context.
3. Create a big, obvious, primary detection system … then create a secondary detection system that looks for reactions to (evasion of) the first system! Innocent people either don’t know about, or don’t react to, the primary system; but those who are being deceptive cannot afford not to react. (The more the primary system looks for markers that are subconsciously generated, the harder it is to react appropriately.) The boundary between innocence and reaction is often easier to detect than the boundary between innocence and deception.
How does this apply to communication? Most informal communication relies on subconscious mechanisms governing textual markers such as word choice, and voice markers such as pitch. Awareness of simple surveillance measures may cause problems with these mechanisms, creating detectable changes. The presence of a watchlist of words suggests substituting innocuous words – but word choice is also partly a subconscious process.
Replacing words that might be on the keyword watch list by other words or locutions could prevent messages from being selected based on their content. But … knowing that there is a watch list is not the same thing as knowing what’s on it: ‘bomb’ is probably a word not to use; what about ‘fertilizer’, ‘meeting’, ‘suicide’, …? A keyword watch list plays the role of a primary selection mechanism; it doesn’t matter that its existence is known, but it does matter that some of its details are unknown. Randomization can even be useful.
Substitution can be:
* based on a codebook (e.g. ‘attack’ = ‘wedding’)
* generated on the fly
We expect that most substitutions on the fly will replace a word with a new word whose natural frequency is quite different:
‘attack’ is the 1072nd most common English word
‘wedding’ is the 2912th most common English word
This can be avoided, but only with some attention – more later.
The use of a substitution with the ‘wrong’ frequency in a number of messages may make the entire conversation unusual enough to be detected. This has the added advantage that it can put together messages that belong together, even if their endpoints have been obscured.
Linguistic background: the frequency of words in English (and many other languages) follows a Zipf distribution – frequent words are very frequent, and frequency drops off very quickly. We restrict our attention to nouns. In English:
most common noun – time
3262nd most common noun – quantum
A message-frequency matrix has a row corresponding to each message and a column corresponding to each noun. The ij-th entry is the frequency of noun j in message i. The matrix is very sparse. We generate artificial datasets using a Poisson distribution with mean f/(j+1) for column j, where f models the base frequency. We add 10 extra rows representing the correlated threat messages, using a block of 6 columns of uniformly random 0s and 1s, added at columns 301–306.
[Figure: message-frequency matrix – rows are messages, columns are nouns]
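A minimal sketch of generating such an artificial dataset in NumPy; the sizes (1000 background messages, 2000 nouns) and the base frequency f are assumptions, and only the structure (Poisson background plus a correlated block at columns 301–306) comes from the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, f = 1000, 2000, 3.0                  # messages, nouns, base frequency (assumed)

j = np.arange(m)
background = rng.poisson(f / (j + 1), size=(n, m))       # Zipf-like column means

threat = rng.poisson(f / (j + 1), size=(10, m))          # 10 correlated threat messages
threat[:, 300:306] += rng.integers(0, 2, size=(10, 6))   # random 0/1 block, columns 301-306

A = np.vstack([background, threat])        # the message-frequency matrix
print(A.shape)                             # (1010, 2000)
```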
Technology – matrix decompositions. The basic idea:
* Treat the dataset as a matrix, A, with n rows and m columns;
* Factor A into the product of two matrices, C and F: A = C F, where C is n x r, F is r x m, and r is smaller than m.
Think of F as a set of underlying ‘real’ somethings and C as a way of ‘mixing’ these somethings together to get the observed attribute values. Choosing r smaller than m forces the decomposition to somehow represent the data more compactly.
[Figure: A (n x m) factored as C (n x r) times F (r x m)]
Two matrix decompositions are useful:
Singular value decomposition (SVD) – the rows of F are orthogonal axes such that the maximum possible variation in the data lies along the first axis, the maximum of what remains along the second, and so on. The rows of C are coordinates in this space.
Independent component analysis (ICA) – the rows of F are statistically independent factors. The rows of C describe how to mix these factors to produce the original data. Strictly speaking, the rows of C are not coordinates, but we can plot them to get some idea of structure.
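A sketch of applying both decompositions to the matrix A built above, using scikit-learn’s TruncatedSVD and FastICA as stand-ins; keeping r = 3 components is only for plotting the first three dimensions, and the last 10 rows are the injected threat messages:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD, FastICA

r = 3
svd = TruncatedSVD(n_components=r)
C = svd.fit_transform(A)                 # rows of C: coordinates of each message
F = svd.components_                      # rows of F: orthogonal axes, A ~= C F

ica = FastICA(n_components=r, random_state=0)
S = ica.fit_transform(A.astype(float))   # per-message mixing of independent factors

# Plot the first three SVD dimensions, highlighting the 10 threat messages.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(C[:-10, 0], C[:-10, 1], C[:-10, 2], s=2)
ax.scatter(C[-10:, 0], C[-10:, 1], C[-10:, 2], c="red")
plt.show()
```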
[Figure: first 3 dimensions of the SVD – the messages with correlated unusual word usage are marked with red circles]
(Fortunately) both unusual word use and correlated word use are necessary to make such messages detectable.
[Figure: SVD of messages that are correlated but use words with proper frequencies] So ordinary conversations don’t show up as false positives!
[Figure: SVD of messages with unusual word frequencies but no correlation] Conversations about unusual things don’t show up as false positives either!
This trick permits a new level of sophistication in connecting related messages into conversations when the usual indicators are not available. It does exactly the right thing – ignoring conversations about ordinary topics, and conversations about unusual topics, but homing in on conversations about unusual topics using inappropriate words. Because the dataset is sparse, SVD takes time linear in the number of messages. The complexity of ICA is less clear but there are direct hardware implementations (SOSUS).
Message-rank matrices are useful because they defend against the countermeasure of rules like “use the word 5 ranks below the one you want to use”. A message-rank matrix has a row corresponding to each message, and a column corresponding to the rank, in English, of the j th most frequent noun in the message. Message-rank matrices have many fewer columns, which makes them easier and faster to work with (e.g. Enron email dataset: 200,000+ ‘words’ but average number of nouns per message <200).
[Figure: message-rank matrix – rows are messages, column j holds the rank of the j-th most frequent noun in the message]
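A sketch of building a message-rank matrix, assuming ENGLISH_RANK is a lookup from noun to its rank in a reference frequency list (e.g. the BNC) and that nouns have already been extracted from each message; both the lookup entries and the width of 200 columns are illustrative:

```python
from collections import Counter
import numpy as np

ENGLISH_RANK = {"time": 1, "watch": 480, "wedding": 2912}   # illustrative entries only

def message_rank_row(nouns, width=200):
    """Ranks of a message's nouns, ordered by frequency within the message."""
    ordered = [w for w, _ in Counter(nouns).most_common()]
    ranks = [ENGLISH_RANK.get(w, 0) for w in ordered][:width]
    return ranks + [0] * (width - len(ranks))                # pad short messages

messages_nouns = [["time", "watch", "time"], ["wedding"]]
R = np.array([message_rank_row(ns) for ns in messages_nouns])
print(R.shape)                                               # (2, 200)
```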
Replacing words with those, say, five positions down the frequency list does not show up in the SVD of a message-frequency matrix.
What about if the substitution is a word with the same natural frequency? Can we still detect the substitution because of a ‘bump’ in the flow of the sentence? The graph of adjacent words in English has a small-world property – paths outwards to rare words quickly return to the common-word centre. So looking at frequencies of pairs (triples) of words is not very revealing, e.g. for “Everything is ready for the watch”:
“the watch”
“for the watch”
“ready for the watch”
“is ready for the watch”
get slowly more unusual.
We’ve developed a number of measures for oddity of a word in a context. Each one independently is quite weak. However, combining them produces a usable detector. We use Google’s responses as a surrogate for frequencies of words, quoted phrases, and bags of words in English. Google sees a lot of text… but it’s a blunt instrument because we only use the number of documents returned as a measure of frequency (this doesn’t seem to matter); and Google’s treatment of stop words is a bit awkward.
Measures I: Contextualized frequency. When a word is appropriate in a sentence, the frequencies f{the, cat, sat, on, the, mat} and f{the, sat, on, the, mat} should be quite similar. But … f{the, unicorn, sat, on, the, mat} and f{the, sat, on, the, mat} should be very different. This could signal that ‘unicorn’ is a substituted word.
So we define sentence oddity to be:
    sentence oddity = f(bag of words with the word of interest omitted) / f(bag of words containing all the words)
The larger this measure is, the more likely that the word of interest is a substitution (we hope). We use the frequency of a bag of words because most strings of any length don’t occur at all, even at Google. However, short strings might occur with measurable frequency – this is the basis of our second measure.
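A minimal sketch of the sentence-oddity measure; hit_count() is a stand-in for querying a frequency oracle such as Google with a bag of words (no such function is defined here), and the guard against a zero denominator is an added assumption:

```python
def sentence_oddity(words, word_of_interest, hit_count):
    """f(bag without the word of interest) / f(bag with all words);
    larger values suggest the word of interest is a substitution."""
    f_all = hit_count(words)
    f_without = hit_count([w for w in words if w != word_of_interest])
    return f_without / max(f_all, 1)          # avoid division by zero

# Usage, given some hit_count implementation supplied by the caller:
# sentence_oddity(["the", "unicorn", "sat", "on", "the", "mat"], "unicorn", hit_count)
```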
Measures II: k-gram frequency. Given a word of interest, its
left k-gram is the string preceding the word of interest, up to and including the first non-stopword;
right k-gram is the string following the word of interest, up to and including the first non-stopword.
“A nine mile walk is no joke” (f = 33)
left k-gram: “mile walk” (f = 50)
right k-gram: “walk is no joke” (f = 876,000)
Using a k-gram avoids the problems of a small-world adjacency graph – it ignores visits to the (boring) middle region of the graph, but captures connections between close visits to the outer layers. It’s a way to get a kind of 2-gram, both of whose words are non-trivial. If the word of interest is a substitute, both its left and right k-gram frequencies should be small. Left and right k-grams measure very different properties of sentences.
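A sketch of extracting left and right k-grams around a word of interest, using a small illustrative stop-word set; a real system would use a fuller list:

```python
STOPWORDS = {"a", "an", "the", "is", "no", "of", "on", "for", "to", "in"}

def k_grams(tokens, i):
    """Left/right k-grams of tokens[i]: extend outwards through stop words
    until the first non-stop word has been included."""
    left = [tokens[i]]
    for j in range(i - 1, -1, -1):
        left.insert(0, tokens[j])
        if tokens[j] not in STOPWORDS:
            break
    right = [tokens[i]]
    for j in range(i + 1, len(tokens)):
        right.append(tokens[j])
        if tokens[j] not in STOPWORDS:
            break
    return " ".join(left), " ".join(right)

tokens = "a nine mile walk is no joke".split()
print(k_grams(tokens, tokens.index("walk")))   # ('mile walk', 'walk is no joke')
```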
Measures III: Hypernym oddity The hypernym of a noun is a more general term that includes the class of things described by the noun. e.g. broodmare – mare – horse – equine – odd-toed ungulate – hoofed mammal – mammal – vertebrate Notice that the chain oscillates between ordinary words and technical terms. In informal text, ordinary words are much more likely than technical terms. However, a substitution might be a much less ordinary word in this context.
We define the hypernym oddity to be:
    f(bag of words with the word of interest replaced by its hypernym) – f(bag of words with the word of interest)
We expect this measure to be positive when the word of interest is a substitution, and close to zero or negative when the word is appropriate. Although hypernyms are semantic relatives of the original words, we can get them automatically using Wordnet – although there are usually multiple hypernyms and we can’t tell which one is ‘right’ automatically.
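A sketch of hypernym oddity using WordNet through NLTK; hit_count() is again an assumed frequency oracle, and taking only the first sense’s first hypernym is a simplification of the ambiguity noted above:

```python
from nltk.corpus import wordnet as wn      # requires nltk and the wordnet data

def hypernym_oddity(words, word_of_interest, hit_count):
    """f(bag with the word replaced by its hypernym) - f(bag with the word)."""
    synsets = wn.synsets(word_of_interest, pos=wn.NOUN)
    if not synsets or not synsets[0].hypernyms():
        return 0.0                                         # no hypernym available
    hypernym = synsets[0].hypernyms()[0].lemma_names()[0].replace("_", " ")
    swapped = [hypernym if w == word_of_interest else w for w in words]
    return hit_count(swapped) - hit_count(words)           # positive suggests substitution
```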
Measures IV: Pointwise mutual information (PMI)
    PMI = f(word) * f(adjacent region) / f(word + adjacent region)
where + is concatenation in either direction, and the maximum is taken over all adjacent regions that have non-zero frequencies. PMI blends some of the properties of sentence oddity and k-grams. It looks for stable phrases (those with high frequency).
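A sketch of the PMI-style measure, again assuming hit_count() as the frequency oracle and scanning adjacent regions of up to a few words on either side of the word of interest; the cap on region length is an assumption:

```python
def pmi_measure(tokens, i, hit_count, max_len=3):
    """Max over adjacent regions of f(word) * f(region) / f(word + region)."""
    word, best = tokens[i], 0.0
    f_word = hit_count([word])
    for k in range(1, max_len + 1):
        candidates = [
            (tokens[max(0, i - k):i], tokens[max(0, i - k):i + 1]),   # region on the left
            (tokens[i + 1:i + 1 + k], tokens[i:i + 1 + k]),           # region on the right
        ]
        for region, joint in candidates:
            if not region:
                continue
            f_region, f_joint = hit_count(region), hit_count(joint)
            if f_region > 0 and f_joint > 0:
                best = max(best, f_word * f_region / f_joint)
    return best
```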
We use the Enron email corpus (and the Brown news corpus), and extract sentences at random. In email, such sentences are often very unusual – typos, shorthand, technical terms. So it’s difficult data to work with. We replaced the first noun in each Enron email sentence by the noun with closest frequency, using the BNC frequency rankings, and removed sentences where the new noun wasn’t known to Wordnet or the sentence (as a BoW) occurred too infrequently at Google. This left a set of 1714 ordinary sentences and a set of 1714 sentences containing a noun substitution. Having two sets of sentences allowed us to train a decision tree on each of the measures to determine a good boundary value between ordinary and substitution sentences.
Each individual measure is very weak. However, they make their errors on different sentences, so combining their predictions does much better than any individual measure. Results for the Brown corpus are similar, although (surprisingly) a little weaker – we expected that more formal language would make substitutions easier to spot. This may reflect changing writing styles, under-represented at Google. Results are the same when Yahoo is used as a frequency oracle.
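A sketch of the combination step, assuming the per-sentence measure values and substitution labels have already been computed: one shallow decision tree per measure finds a boundary value, and a simple majority vote (an assumption; other combiners would also work) merges their predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def combined_detector(scores, labels):
    """scores: (n_sentences, n_measures); labels: 1 = substitution, 0 = ordinary."""
    stumps = []
    for j in range(scores.shape[1]):
        stump = DecisionTreeClassifier(max_depth=1)      # one boundary value per measure
        stump.fit(scores[:, [j]], labels)
        stumps.append(stump)

    def predict(new_scores):
        votes = np.column_stack([s.predict(new_scores[:, [j]])
                                 for j, s in enumerate(stumps)])
        return (votes.mean(axis=1) >= 0.5).astype(int)   # simple majority vote
    return predict
```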
Analysing a matrix whose rows represent the emails of individuals, whose columns represent words, and whose entries are the frequency of use of words by individuals in Enron emails allows us to address questions such as:
* Does word usage vary with company role, either explicit or implicit?
* Do people who communicate offline develop similarities in their email word usage?
* Can changing word usage over time reveal changing offline relationships?
YES to all three.
Humans leak information about their mental states quite strongly – but we are not wired to notice. The leakage comes via frequencies and frequency changes in ‘little’ words, such as pronouns. Detection via software is straightforward. Detecting mental state means that we can:
* decide which parts of bin Laden’s messages he believes and which are pitched for particular audiences
* distinguish between testosterone and terrorism on Salafist websites
* assess the truthfulness of witnesses (and politicians)