Information Theory and Security
Lecture Motivation • Up to this point we have seen: • Classical Crypto • Symmetric Crypto • Asymmetric Crypto • These systems have focused on issues of confidentiality: Ensuring that an adversary cannot infer the original plaintext message, or cannot learn any information about the original plaintext from the ciphertext. • In today’s lecture we will put a more formal framework around the notion of what information is, and use this to provide a definition of security from an information-theoretic point of view.
Lecture Outline • Probability Review: Conditional Probability and Bayes • Entropy: • Desired properties and definition • Chain Rule and conditioning • Coding and Information Theory • Huffman codes • General source coding results • Secrecy and Information Theory • Probabilistic definitions of a cryptosystem • Perfect Secrecy
The Basic Idea • Suppose we roll a 6-sided die. • Let A be the event that the number of dots is odd. • Let B be the event that the number of dots is at least 3. • A = {1, 3, 5} • B = {3, 4, 5, 6} • If I tell you the roll belongs to both A and B, then you know there are only two possibilities: {3, 5} • In this sense A ∩ B tells you more than just A or just B. • That is, there is less uncertainty in A ∩ B than in A or B. • Information is closely linked with this idea of uncertainty: Information increases when uncertainty decreases.
Probability Review, pg. 1 • A random variable (event) is an experiment whose outcomes are mapped to real numbers. • For our discussion we will deal with discrete-valued random variables. • Probability: We denote pX(x) = Pr(X = x). For a subset A of the outcomes, Pr(X ∈ A) = Σx∈A pX(x). • Joint Probability: Sometimes we want to consider more than one event at the same time, in which case we lump them together into a joint random variable, e.g. Z = (X,Y), with joint probability pX,Y(x,y) = Pr(X = x, Y = y). • Independence: We say that two events are independent if pX,Y(x,y) = pX(x)pY(y) for all x and y.
Probability Review, pg. 2 • Conditional Probability: We will often ask questions about the probability of events Y given that we have observed X = x. In particular, we define the conditional probability of Y = y given X = x by pY|X(y|x) = pX,Y(x,y) / pX(x), provided pX(x) > 0. • Independence: We immediately get that X and Y are independent exactly when pY|X(y|x) = pY(y). • Bayes's Theorem: If pX(x) > 0 and pY(y) > 0 then pX|Y(x|y) = pY|X(y|x) pX(x) / pY(y).
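To make these definitions concrete, here is a minimal Python sketch of the die example from the earlier slide (events A and B as defined there); the helper function and event names are only illustrative.

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

A = {1, 3, 5}       # odd number of dots
B = {3, 4, 5, 6}    # at least 3 dots

def prob(event):
    """Pr(X in event) = sum of p(x) over the outcomes in the event."""
    return sum(p[x] for x in event)

# Conditional probability: Pr(A | B) = Pr(A and B) / Pr(B)
p_A_given_B = prob(A & B) / prob(B)

# Bayes's Theorem: Pr(B | A) = Pr(A | B) * Pr(B) / Pr(A)
p_B_given_A = p_A_given_B * prob(B) / prob(A)

print(p_A_given_B)   # 1/2  (two of B's four outcomes are odd)
print(p_B_given_A)   # 2/3  (two of A's three outcomes are at least 3)
```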
Entropy and Uncertainty • We are concerned with how much uncertainty a random event has, but how do we define or measure uncertainty? • We want our measure to have the following properties: • To each set of nonnegative numbers p1, …, pn with p1 + … + pn = 1, we assign an uncertainty H(p1, …, pn). • H should be a continuous function: A slight change in p should not drastically change H. • H(1/(n+1), …, 1/(n+1)) ≥ H(1/n, …, 1/n) for all n > 0. Uncertainty increases when there are more equally likely outcomes. • If 0 < q < 1, then H(p1, …, pn−1, q·pn, (1−q)·pn) = H(p1, …, pn) + pn·H(q, 1−q): splitting one outcome in two adds the weighted uncertainty of the split.
Entropy, pg. 2 • We define the entropy of a random variable X by H(X) = −Σx pX(x) log2 pX(x), with the convention that outcomes with pX(x) = 0 contribute nothing. • Example: Consider a fair coin toss. There are two outcomes, with probability ½ each. The entropy is H = −½ log2 ½ − ½ log2 ½ = 1 bit. • Example: Consider a non-fair coin toss X with probability p of getting heads and 1−p of getting tails. The entropy is H(X) = −p log2 p − (1−p) log2(1−p). The entropy is maximum when p = ½.
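A short Python sketch of the entropy formula, checked against the coin examples above (the entropy helper simply implements the definition):

```python
import math

def entropy(probs):
    """H = -sum p*log2(p); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: about 0.469 bits
# The biased-coin entropy is maximized at p = 1/2:
print(max(range(1, 100), key=lambda k: entropy([k / 100, 1 - k / 100])))  # 50
```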
Entropy, pg. 3 • Entropy may be thought of as the number of yes-no questions needed to accurately determine the outcome of a random event. • Example: Flip two coins, and let X be the number of heads. The possibilities are {0, 1, 2} and the probabilities are {1/4, 1/2, 1/4}. The entropy is H(X) = −¼ log2 ¼ − ½ log2 ½ − ¼ log2 ¼ = 1.5 bits. • So how can we relate this to questions? • First, ask "Is there exactly one head?" Half the time this settles the outcome. • Otherwise, ask "Are there two heads?" • Half the time you needed one question, half the time you needed two: 1.5 questions on average, matching the entropy.
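A quick simulation of the questioning strategy described above (a sketch; the trial count is arbitrary), showing that the average number of questions comes out at about 1.5, matching H(X):

```python
import random

def questions_needed(num_heads):
    if num_heads == 1:
        return 1   # Q1 "Is there exactly one head?" answered yes: done
    return 2       # otherwise Q2 "Are there two heads?" settles 0 vs 2

trials = 100_000
total = 0
for _ in range(trials):
    heads = sum(random.randint(0, 1) for _ in range(2))   # flip two fair coins
    total += questions_needed(heads)

print(total / trials)   # approximately 1.5, matching H(X)
```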
Entropy, pg. 4 • Suppose we have two random variables X and Y; the joint entropy H(X,Y) is given by H(X,Y) = −Σx Σy pX,Y(x,y) log2 pX,Y(x,y). • Conditional Entropy: In security, we ask whether an observation reduces the uncertainty in something else. In particular, we want a notion of conditional entropy. Given that we observe event X, how much uncertainty is left in Y? We define H(Y|X) = Σx pX(x) H(Y|X = x) = −Σx Σy pX,Y(x,y) log2 pY|X(y|x).
Entropy, pg. 5 • Chain Rule: The Chain Rule allows us to relate joint entropy to conditional entropy via H(X,Y) = H(Y|X)+H(X). (Remaining details will be provided on the white board) • Meaning: Uncertainty in (X,Y) is the uncertainty of X plus whatever uncertainty remains in Y given we observe X.
Entropy, pg. 6 • Main Theorem: • Entropy is non-negative: H(X) ≥ 0. • H(X) ≤ log2 |X|, where |X| denotes the number of elements in the sample space of X. • H(Y|X) ≤ H(Y) (conditioning reduces entropy), with equality if and only if X and Y are independent.
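The chain rule and the inequalities of the main theorem can be checked numerically. Below is a small Python sketch using an arbitrary joint distribution chosen only for illustration (it is not one of the lecture examples):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary joint distribution p(x, y), chosen only for illustration.
joint = {('x0', 'y0'): 0.30, ('x0', 'y1'): 0.10,
         ('x1', 'y0'): 0.20, ('x1', 'y1'): 0.40}

pX, pY = {}, {}
for (x, y), p in joint.items():
    pX[x] = pX.get(x, 0) + p   # marginal of X
    pY[y] = pY.get(y, 0) + p   # marginal of Y

H_XY = H(joint.values())
# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
H_Y_given_X = -sum(p * math.log2(p / pX[x]) for (x, y), p in joint.items())

print(abs(H_XY - (H_Y_given_X + H(pX.values()))) < 1e-12)  # chain rule H(X,Y) = H(Y|X) + H(X)
print(H_Y_given_X <= H(pY.values()))                       # conditioning reduces entropy
print(H(pX.values()) <= math.log2(len(pX)))                # H(X) <= log2 |X|
```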
Entropy and Source Coding Theory • There is a close relationship between entropy and representing information. • Entropy captures the notion of how many “Yes-No” questions are needed to accurately identify a piece of information… that is, how many bits are needed! • One of the main focus areas in the field of information theory is on the issue of source-coding: • How to efficiently (“Compress”) information into as few bits as possible. • We will talk about one such technique, Huffman Coding. • Huffman coding is for a simple scenario, where the source is a stationary stochastic process with independence between successive source symbols
Huffman Coding, pg. 1 • Suppose we have an alphabet with four letters A, B, C, D with frequencies 0.5, 0.3, 0.1, 0.1 respectively. We could represent this with A=00, B=01, C=10, D=11. This would mean we use an average of 2 bits per letter. • On the other hand, we could use the following representation: A=1, B=01, C=001, D=000. Then the average number of bits per letter becomes (0.5)·1 + (0.3)·2 + (0.1)·3 + (0.1)·3 = 1.7. Hence, this representation, on average, is more efficient.
Huffman Coding, pg. 2 • Huffman Coding is an algorithm that produces a representation for a source. • The Algorithm: • List all outputs and their probabilities • Assign a 1 and a 0 to the two smallest probabilities, and combine them to form a new output with probability equal to their sum • Sort the list according to probabilities and repeat the process • The binary strings are then obtained by reading backwards through the procedure • (Huffman tree for the example: C and D combine into 0.2, which combines with B into 0.5, which combines with A into 1.0.) • Symbol Representations: A: 1, B: 01, C: 001, D: 000
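For concreteness, here is a compact Python sketch of the algorithm, using a heap to repeatedly combine the two least-probable outputs. The 0/1 labels may differ from the tree above depending on tie-breaking, but the codeword lengths, and hence the average length, are the same; the function name huffman_code is just a placeholder.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code. freqs maps each symbol to its probability (or count)."""
    tiebreak = count()   # keeps the heap from ever comparing symbols or subtrees
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two smallest weights...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))  # ...are combined
    codes = {}
    def assign(tree, prefix):             # read codewords back off the tree
        if isinstance(tree, tuple):
            assign(tree[0], prefix + '0')
            assign(tree[1], prefix + '1')
        else:
            codes[tree] = prefix or '0'   # lone-symbol edge case
    assign(heap[0][2], '')
    return codes

freqs = {'A': 0.5, 'B': 0.3, 'C': 0.1, 'D': 0.1}
codes = huffman_code(freqs)
print(codes)   # lengths 1, 2, 3, 3 -- the exact 0/1 labels depend on tie-breaking
print(round(sum(freqs[s] * len(codes[s]) for s in freqs), 2))   # 1.7 bits per letter
```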
Huffman Coding, pg. 3 • In the previous example we used probabilities; we may also use raw event counts directly. • Example: Consider 8 symbols, and suppose we have counted how many times each occurred in an output sample: S1: 28, S2: 25, S3: 20, S4: 16, S5: 15, S6: 8, S7: 7, S8: 5 (124 samples in total). • We may derive the Huffman tree (exercise will be done on the whiteboard) • The corresponding length vector is (2,2,3,3,3,4,5,5) • The average codelength is 2.83 bits. If we had used a fully balanced tree (i.e. the straightforward fixed-length representation) we would have had an average codelength of 3.
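Reusing the huffman_code sketch from the previous slide, the raw counts can be fed in directly (the symbol names S1–S8 follow the table above):

```python
counts = {'S1': 28, 'S2': 25, 'S3': 20, 'S4': 16,
          'S5': 15, 'S6': 8, 'S7': 7, 'S8': 5}

codes = huffman_code(counts)        # counts work just as well as probabilities
total = sum(counts.values())        # 124 observed symbols in all
avg = sum(counts[s] * len(codes[s]) for s in counts) / total

print(sorted(len(c) for c in codes.values()))   # [2, 2, 3, 3, 3, 4, 5, 5]
print(round(avg, 2))                            # 2.83 bits per symbol (vs. 3 for fixed-length)
```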
Huffman Coding, pg. 4 • We would like to quantify the average number of bits needed in terms of entropy. • Theorem: Let L = Σx px·lx be the average number of bits per output for the Huffman encoding of a random variable X. Then H(X) ≤ L < H(X) + 1. Here, lx = length of the codeword assigned to symbol x. • Example: Look back at the 4-symbol example: H(X) = −(0.5 log2 0.5 + 0.3 log2 0.3 + 0.1 log2 0.1 + 0.1 log2 0.1) ≈ 1.685 bits, and our average codelength was 1.7 bits, so the bound holds.
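A quick numerical check of the bound for the 4-symbol example (values as computed above):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.3, 0.1, 0.1]
lengths = [1, 2, 3, 3]              # Huffman codeword lengths from the earlier example
L = sum(pi * li for pi, li in zip(p, lengths))
H = entropy(p)

print(round(H, 3), round(L, 3))     # 1.685 1.7
print(H <= L < H + 1)               # True: H(X) <= L < H(X) + 1
```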
Huffman Coding, pg. 5 • An interesting and useful question is: What if I use the wrong distribution when designing the code? How badly will my code perform? • Suppose the true distribution is px, and you used another distribution to find the lengths lx. Define the auxiliary distribution qx = 2^(−lx) (for a Huffman code these sum to 1). • Theorem: If we code the source X with the lengths lx instead of the correct Huffman code, then the resulting average codelength will satisfy H(X) + D(p||q) ≤ L < H(X) + D(p||q) + 1, where the Kullback-Leibler divergence D(p||q) is D(p||q) = Σx px log2(px / qx).
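A sketch of the penalty in Python, assuming for illustration that the wrong code assigns every symbol a 2-bit codeword (as if the four symbols were equally likely); with qx = 2^(−lx) the lower bound is met with equality:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) * log2(p(x)/q(x))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.1, 0.1]                   # true distribution
wrong_lengths = [2, 2, 2, 2]               # code designed as if all symbols were equally likely
q = [2 ** (-l) for l in wrong_lengths]     # auxiliary dyadic distribution (sums to 1 here)

L = sum(pi * li for pi, li in zip(p, wrong_lengths))
print(round(L, 6))                                   # 2.0 bits per symbol
print(round(entropy(p) + kl_divergence(p, q), 6))    # 2.0 -- the extra cost over H(X) is D(p||q)
```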
Another way to look at cryptography, pg. 1 • So far in class, we have looked at the security problem from an algorithm point-of-view (DES, RC4, RSA,…). • But why build these algorithms? How can we say we are doing a good job? • Enter information theory and its relationship to ciphers… • Suppose we have a cipher with possible plaintexts P, ciphertexts C, and keys K. • Suppose that a plaintext P is chosen according to a probability law. • Suppose the key K is chosen independent of P • The resulting ciphertexts have various probabilities depending on the probabilities for P and K.
Another way to look at cryptography, pg. 2 • Now, enter Eve… She sees the ciphertext C and several security questions arise: • Does she learn anything about P from seeing C? • Does she learn anything about the key K from seeing C? • Thus, our questions are associated with H(P|C) and H(K|C). • Ideally, we would like for the uncertainty to not decrease, i.e. H(P | C) = H(P) H(K | C) = H(K)
Another way to look at cryptography, pg. 3 • Example: Suppose we have three plaintexts {a, b, c} with probabilities {0.5, 0.3, 0.2}. Suppose we have two keys k1 and k2, each with probability 0.5, and three ciphertexts U, V, W, with encryption rule Ek1(a)=U, Ek1(b)=V, Ek1(c)=W and Ek2(a)=U, Ek2(b)=W, Ek2(c)=V. • We may calculate the probabilities of the ciphertexts: pC(U) = pP(a)pK(k1) + pP(a)pK(k2) = 0.25 + 0.25 = 0.5. Similarly we get pC(V) = pP(b)pK(k1) + pP(c)pK(k2) = 0.25 and pC(W) = 0.25.
Another way to look at cryptography, pg. 4 • Suppose Eve observes the ciphertext U; then she knows the plaintext was "a". We may calculate the conditional probabilities: pP(a|U) = 1, and pP(b|V) = pP(b)pK(k1) / pC(V) = 0.15/0.25 = 0.6. Similarly we get pP(c|V)=0.4 and pP(a|V)=0. Also pP(a|W)=0, pP(b|W)=0.6, pP(c|W)=0.4. • What does this tell us? Remember, the original plaintext probabilities were 0.5, 0.3, and 0.2. So, if we see a ciphertext, then we may revise the probabilities… Something is "learned".
Another way to look at cryptography, pg. 5 • We use entropy to quantify the amount of information that is learned about the plaintext once the ciphertext is observed. The conditional entropy of P given C is H(P|C) = Σc pC(c) H(P|C=c) = 0.5·0 + 0.25·H(0.6, 0.4) + 0.25·H(0.6, 0.4) ≈ 0.485 bits, while H(P) ≈ 1.485 bits. • Thus an entire bit of information is revealed just by observing the ciphertext!
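The whole example can be reproduced in a few lines of Python; the dictionaries below simply encode the plaintext/key probabilities and the encryption rule from the previous slides:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

pP = {'a': 0.5, 'b': 0.3, 'c': 0.2}
pK = {'k1': 0.5, 'k2': 0.5}
E = {('k1', 'a'): 'U', ('k1', 'b'): 'V', ('k1', 'c'): 'W',
     ('k2', 'a'): 'U', ('k2', 'b'): 'W', ('k2', 'c'): 'V'}

# Joint distribution of (plaintext, ciphertext); the key is chosen independently of P.
joint, pC = {}, {}
for (k, ptxt), c in E.items():
    joint[(ptxt, c)] = joint.get((ptxt, c), 0) + pP[ptxt] * pK[k]
    pC[c] = pC.get(c, 0) + pP[ptxt] * pK[k]

print({c: round(v, 3) for c, v in pC.items()})   # {'U': 0.5, 'V': 0.25, 'W': 0.25}

# Conditional entropy H(P|C) = sum_c pC(c) * H(P | C = c)
H_P_given_C = sum(
    pc * entropy([joint.get((ptxt, c), 0) / pc for ptxt in pP])
    for c, pc in pC.items()
)

print(round(entropy(pP.values()), 3))   # H(P)   ~ 1.485
print(round(H_P_given_C, 3))            # H(P|C) ~ 0.485 -- exactly one bit is leaked
```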
Perfect Secrecy and Entropy • The previous example gives us the motivation for the information-theoretic definition of security (or "secrecy"). • Definition: A cryptosystem has perfect secrecy if H(P|C) = H(P). • Theorem: The one-time pad has perfect secrecy. • Proof: See the book for the details. The basic idea is to show that each ciphertext occurs with equal likelihood, so that H(C) = H(K). We then use manipulations like H(P,C) = H(C|P) + H(P) = H(P|C) + H(C). Equating the two expansions and using H(C|P) = H(K) = H(C) gives the result. Why?
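As a sanity check of the theorem (not a proof), here is a Python sketch that verifies H(P|C) = H(P) numerically for a toy two-bit one-time pad with an arbitrarily chosen plaintext distribution and a uniform, independent key:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two-bit one-time pad: C = P XOR K, with K uniform and independent of P.
pP = {0b00: 0.4, 0b01: 0.3, 0b10: 0.2, 0b11: 0.1}   # any plaintext distribution will do
pK = {k: 0.25 for k in range(4)}                    # uniform key

joint, pC = {}, {}
for ptxt, pp in pP.items():
    for k, pk in pK.items():
        c = ptxt ^ k
        joint[(ptxt, c)] = joint.get((ptxt, c), 0) + pp * pk
        pC[c] = pC.get(c, 0) + pp * pk

H_P_given_C = sum(
    pc * entropy([joint.get((ptxt, c), 0) / pc for ptxt in pP])
    for c, pc in pC.items()
)

print({c: round(v, 3) for c, v in pC.items()})   # every ciphertext has probability 0.25
print(round(entropy(pP.values()), 4))            # H(P)
print(round(H_P_given_C, 4))                     # H(P|C) equals H(P): perfect secrecy
```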