Assumptions in the Use of Heuristic Optimisation in Cryptography John A Clark, Dept. of Computer Science, University of York, UK jac@cs.york.ac.uk
Overview • Purpose of Talk • Introduction to heuristic optimisation techniques • Examples of current use and assumptions made • Why such assumptions may usefully be relaxed • Some idle speculation about possibilities
Purpose of Talk • Heuristic search techniques have proven extraordinarily useful for solving 'hard' problems in a great number of engineering fields • Little application to cryptographic problems – why? • Limited success so far, but close inspection suggests that some of the limitations are self-imposed • Wish to question some basic assumptions about the way these techniques are used
Optimisation • Subject of huge practical importance. An optimisation problem may be stated as follows: • Given a domain D and a function z: D → R, find x in D such that z(x) = sup{z(y): y in D} • Similarly for minimising
Optimisation • Traditional optimisation techniques include: • calculus (e.g. solve differential equations for extrema) • hill-climbing: inspired by notion of calculus • gradient ascent etc. • (quasi-) enumerative or otherwise exact: • brute force • dynamic programming • branch and bound • linear programming
Optimisation Problems • Traditional techniques are not without their problems • assumptions may simply not hold • e.g. non-differentiable or discontinuous functions • non-linear functions • the problem may suffer from the 'curse (joy?) of dimensionality' – the problem is simply too big to handle exactly (e.g. by brute force). NP-hard problems. • Some techniques may tend to get stuck in local optima for non-linear problems (see later) • These difficulties have led researchers to investigate heuristic techniques, typically inspired by natural processes, that usually give good solutions to optimisation problems (but forgo guarantees)
Heuristic Optimisation • A variety of techniques have been developed to deal with non-linear and discontinuous problems • highest profile one is probably genetic algorithms • works with a population of solutions and breeds new solutions by aping the processes of natural reproduction • Darwinian survival of the fittest • proven very robust across a huge range of problems • can be very efficient • Simulated annealing - a local search technique based on cooling processes of molten metals (used in this paper) • Will illustrate problems with non-linearity and then describe simulated annealing.
Local Optimisation - Hill Climbing • Let the current solution be x. • Define the neighbourhood N(x) to be the set of solutions that are ‘close’ to x • If possible, move to a neighbouring solution that improves the value of z(x), otherwise stop. • Choose any y as next solution provided z(y) >= z(x) • loose hill-climbing • Choose y as next solution such that z(y)=sup{z(v): v in N(x)} • steepest gradient ascent
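The steepest-gradient-ascent variant above can be sketched in a few lines of Python (an illustrative sketch, not code from the talk; the toy objective z and the integer neighbourhood are our own choices):

```python
def hill_climb(x, z, neighbours, max_steps=10_000):
    """Steepest gradient ascent: move to the best neighbour while it
    strictly improves z(x); otherwise stop (possibly at a local optimum)."""
    for _ in range(max_steps):
        best = max(neighbours(x), key=z)
        if z(best) <= z(x):
            return x  # no improving neighbour left
        x = best
    return x

# Toy example: maximise z over the integers with N(x) = {x - 1, x + 1}.
z = lambda v: -(v - 3) ** 2
print(hill_climb(0, z, lambda v: [v - 1, v + 1]))
```

Loose hill-climbing differs only in the acceptance test: any neighbour y with z(y) >= z(x) may be taken, rather than the best one.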
Local Optimisation - Hill Climbing [Figure: z(x) plotted against x, showing points x0, x1, x2, x3 and the global optimum xopt.] Really want to obtain xopt. The neighbourhood of a point x might be N(x) = {x+1, x-1}. The hill-climb goes x0 → x1 → x2, since z(x0) < z(x1) < z(x2) > z(x3), and gets stuck at x2 (a local optimum).
Simulated Annealing [Figure: z(x) against x, with a search trajectory x0, x1, …, x13 that descends into and climbs out of local optima.] Allows non-improving moves, so that it is possible to go down in order to rise again and reach the global optimum. In practice the neighbourhood may be very large and a trial neighbour is chosen randomly. It is possible to accept a worsening move even when improving ones exist.
Simulated Annealing • Improving moves always accepted • Non-improving moves may be accepted probabilistically, in a manner depending on the temperature parameter T. Loosely • the worse the move, the less likely it is to be accepted • a worsening move is less likely to be accepted the cooler the temperature • The temperature T starts high and is gradually cooled as the search progresses • Initially virtually anything is accepted; at the end only improving moves are allowed (and the search effectively reduces to hill-climbing)
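A minimal annealing loop matching this description can be sketched as follows (a sketch under our own parameter choices – geometric cooling, a toy minimisation target – with the talk's 400 trial moves per temperature):

```python
import math
import random

def anneal(x, cost, neighbour, T0=10.0, alpha=0.95, cycles=100, moves=400):
    """Simulated annealing (minimisation): at each of `cycles` temperatures,
    try `moves` random neighbours; improving moves are always accepted,
    worsening ones with probability exp(-delta / T)."""
    T = T0
    for _ in range(cycles):
        for _ in range(moves):
            y = neighbour(x)
            delta = cost(y) - cost(x)
            if delta <= 0 or random.random() < math.exp(-delta / T):
                x = y
        T *= alpha  # geometric cooling: by the end, effectively hill-climbing
    return x
```

A usage example on a convex toy cost, where annealing comfortably reaches the optimum at 3: `anneal(20, lambda v: (v - 3) ** 2, lambda v: v + random.choice((-1, 1)))`.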
Simulated Annealing • Current candidate x; minimisation formulation • At each temperature, consider 400 trial moves • Always accept improving moves • Accept worsening moves probabilistically: it gets harder to do this the worse the move, and harder as the temperature decreases • Repeat over the temperature cycle
An Analysis Technique: Problem Fault Injection and Side Channels
Identification Problems • Notion of zero-knowledge introduced by Goldwasser, Micali and Rackoff (1985) • Indicate that you have a secret without revealing it • Early scheme by Shamir • Several schemes of late based on NP-complete problems • Permuted Kernel Problem (Shamir) • Syndrome Decoding (Stern) • Constrained Linear Equations (Stern) • Permuted Perceptron Problem (Pointcheval)
Pointcheval's Perceptron Schemes • Interactive identification protocols based on an NP-complete problem • Perceptron Problem (PP): Given an m × n matrix A with entries in {-1, +1}, find a vector S in {-1, +1}^n so that AS ≥ 0 (every component of the image is non-negative)
Pointcheval's Perceptron Schemes • Permuted Perceptron Problem (PPP): make the problem harder by imposing an extra constraint • Given A, find S in {-1, +1}^n so that AS ≥ 0 and the image AS has a particular histogram H of positive values 1, 3, 5, …
Example: Pointcheval's Scheme • A PP and PPP example: a PP solution need only make the image non-negative; a PPP solution must additionally give the image a particular histogram H of positive values 1, 3, 5, … • Every PPP solution is a PP solution
Generating Instances • Suggested method of generation: • Generate random matrix A • Generate random secret S • Calculate AS • If any (AS)i < 0 then negate the ith row of A • Significant structure in this problem: high correlation between the majority values of the matrix columns and the corresponding secret bits
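The suggested generation method is mechanical enough to sketch directly (function and variable names are our own; the row-negation step is exactly the one listed above, and guarantees the planted secret satisfies AS ≥ 0):

```python
import random

def generate_pp_instance(m, n, rng=random):
    """Generate a Perceptron Problem instance by the suggested method:
    random +/-1 secret S and matrix A; negate row i whenever (AS)_i < 0,
    so that the planted secret S satisfies AS >= 0."""
    S = [rng.choice((-1, 1)) for _ in range(n)]
    A = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
    for i, row in enumerate(A):
        if sum(a * s for a, s in zip(row, S)) < 0:
            A[i] = [-a for a in row]
    return A, S
```

Note that for odd n each row/secret dot product is an odd sum of ±1s, so the image components are the odd values 1, 3, 5, … after negation.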
Instance Properties • Each matrix row/secret dot product is the sum of n Bernoulli (+1/-1) variables • The initial image histogram has a Binomial shape and is symmetric about 0 (values …, -7, -5, -3, -1, 1, 3, 5, 7, …) • After negation it simply folds over to be positive (1, 3, 5, 7, …) • Image elements tend to be small
PP Using Search: Pointcheval • Pointcheval couched the Perceptron Problem as a search problem • Neighbourhood defined by single bit flips on the current solution y • Cost function punishes any negative image components, e.g. if the image of the current solution y has negative components -1 and -3 then costNeg(y) = |-1| + |-3| = 4
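A sketch of this cost function follows. The 3×3 instance below is hypothetical, chosen so that the image of y = (1, 1, 1) is (-3, 1, -1) and hence costNeg(y) = 4, matching the slide's example:

```python
def cost_neg(y, A):
    """Pointcheval-style cost: sum of the magnitudes of the negative
    components of the image A y; zero exactly when y solves the PP."""
    img = [sum(a * v for a, v in zip(row, y)) for row in A]
    return sum(-w for w in img if w < 0)

# Hypothetical instance: image of y is (-3, 1, -1), so costNeg(y) = 3 + 1 = 4.
A = [[-1, -1, -1], [1, 1, -1], [-1, 1, -1]]
y = [1, 1, 1]
print(cost_neg(y, A))
```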
Using Annealing: Pointcheval • A PPP solution is also a PP solution • Based estimates of the cost of cracking PPP on the ratio of PP solutions to PPP solutions • Calculated the matrix sizes for which this should be most difficult, giving rise to (m,n) = (m, m+16) • Recommended (m,n) = (101,117), (131,147), (151,167) • Gave estimates of the number of years needed to solve PPP using annealing as the means of finding PP solutions • Instances with matrices of size 200 'could usually be solved within a day' • But no PPP instance of size greater than 71 was ever solved this way, 'despite months of computation'
Perceptron Problem (PP) • Knudsen and Meier approach (loosely): • Carry out sets of runs • Note where the results obtained all agree • Fix those elements where there is complete agreement, carry out a new set of runs, and so on • If repeated runs give the same value for particular bits, the assumption is that those bits are actually set correctly • Used this sort of approach to solve instances of the PP up to 180 times faster than Pointcheval for the (151,167) problem, but no upper bound was given on the sizes achievable
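The agreement-filtering step can be sketched as follows (a simplified illustration of the idea, not Knudsen and Meier's actual code; names are ours):

```python
def agreed_bits(runs):
    """Given the result vectors of several annealing runs, return a dict
    mapping position -> value for the positions on which every run agrees.
    These are the bits one would fix before the next set of runs."""
    n = len(runs[0])
    return {i: runs[0][i]
            for i in range(n)
            if all(r[i] == runs[0][i] for r in runs)}

# Three runs agree on positions 0 (value 1) and 3 (value -1) only.
runs = [[1, -1, 1, -1], [1, 1, 1, -1], [1, -1, -1, -1]]
print(agreed_bits(runs))
```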
Profiling Annealing [Figure: the actual secret (+1/-1 per bit) alongside the results of Runs 1-6; on some bits all runs agree, and on some they all agree wrongly.] • The approach is not without its problems • Not all bits on which there is complete agreement are correct
Knudsen and Meier • Have used this method to attack PPP problem sizes (101,117) • Needs a hefty enumeration stage (to search for the wrong bits); allowed up to 2^64 search complexity • Used a new cost function with histogram punishment: cost(y) = w1 costNeg(y) + w2 costHist(y), with w1 = 30, w2 = 1
Analogy Time I: Encryption [Figure: key and plaintext P enter a black box; ciphertext C emerges.] • The Black Box Assumption – essentially considering encryption only as a mathematical function • In the public arena this was only really challenged in the 90s, when attacks based on the physical implementation arrived: • Fault Injection Attacks (Bellcore, and others) • Paul Kocher's Timing Attacks • Simple Power Analysis • Differential Power Analysis • The computational dynamics of the implementation can leak vast amounts of information
Analogy Time II: Annealing [Figure: initialisation data and problem P enter a black box; final solution C emerges.] • The Black Box Assumption – virtually every application of annealing simply throws the technique at the problem and awaits the final output • Is this really the most efficient use of information? Let's look inside the box…
Analogy Time III: Internal Computational Dynamics [Figure: initialisation data and problem P, e.g. minimise cost(y, A, Hist), enter the box; final solution C emerges.] • The algorithm carries out hundreds of thousands of cost function evaluations which guide the search • Why did it take the path it did? Bear in mind the whole search process is public, and so we can monitor it
Analogy Time IV: Fault Injection [Figure: initialisation data and a warped or faulty problem P' enter the box; final solution C' emerges.] • Invariably people assume you need to solve the problem at hand – reflected in 'well-motivated' or direct cost functions • What happens if we inject a 'fault' into the process? Mutate the problem into a similar but different one • Can we make use of the solutions obtained to help solve the original problem?
PP Move Effects • What limits the ability of annealing to find a PP solution? • A move changes a single element of the current solution • We want current negative image values to go positive • But changing a bit to make negative values go positive will often make small positive values go negative [Figure: image histograms before and after a move.]
Problem Fault Injection • Can significantly improve results by punishing image values below a positive threshold K • For example, punish any value less than K = 4 during the search • This drags the elements away from the boundary during the search • Also use the square of the deviation, |Wi - K|^2, rather than the simple deviation [Figure: image histogram with values dragged above the threshold K.]
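The fault-injected cost can be sketched as follows (a sketch; the name cost_k and the toy instance are our own – the instance is the same hypothetical 3×3 one used to illustrate costNeg, whose image for y = (1, 1, 1) is (-3, 1, -1)):

```python
def cost_k(y, A, K=4):
    """Fault-injected cost: punish every image component below the
    threshold K (not only the negative ones), and use the squared
    deviation (K - w)^2 rather than the simple deviation."""
    img = [sum(a * v for a, v in zip(row, y)) for row in A]
    return sum((K - w) ** 2 for w in img if w < K)

# Toy instance with image (-3, 1, -1) for y = (1, 1, 1):
# K = 4 punishes all three components: 7^2 + 3^2 + 5^2 = 83.
A = [[-1, -1, -1], [1, 1, -1], [-1, 1, -1]]
y = [1, 1, 1]
print(cost_k(y, A, K=4))
```

With K = 0 the same function reduces to punishing only the negative components (by their squares), which is essentially the direct cost.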
Problem Fault Injection • Comparative results • Generally allows solution within a few runs of annealing for sizes (201,217) • The number of bits correct is generally worst when K = 0 • The best value for K varies between sizes (but profiling can be used to test what it is) • It has proved possible to solve for size (401,417) and higher • An enormous increase in power for what is essentially a change to one line of the program • using squared deviations rather than just the modulus • use of the K factor • Morals… • Small changes may make a big difference • The real issue is how the cost function and the search technique interact • The cost function need not be the most 'natural' direct expression of the problem to be solved • Cost functions are a means to an end • This is a form of fault injection on the problem
Profiling Annealing • But look again at the cost function templates • Different weights w1 and w2 will give different results, yet the resulting cost functions all seem plausibly well-motivated • We can view different choices of weights as different viewpoints on the problem • Now carry out runs using the different cost functions • Very effective – using about 30 cost functions we have managed to get agreement on about 25% of the key, with less than 0.5 bits in error on average • Additional cost functions remove incorrect agreement (but may also reduce correct agreement)
Radical Viewpoint Analysis [Figure: problem P mutated into problems P1, P2, …, Pn-1, Pn.] • Essentially create mutant problems and attempt to solve them • If the solutions agree on particular elements then they generally do so for a reason – generally because they are correct • Can think of mutation as an attempt to blow the search away from the actual original solution
Profiling Annealing: Timing • Simulated annealing can make progress, typically getting solutions with around 80% of the vector entries correct (but we don't know which 80%) • But this throws away a lot of information – better to monitor the search process as it cools down • Based on the notion of thermostatistical annealing • Watch the elements of the secret vector as the search proceeds • Record the temperature cycle at which the last change to an element's value occurs, i.e. +1 to -1 or vice versa • At the end of the search all elements are fixed • Analysis shows that some elements take particular values early in the search and then never subsequently change • They get 'stuck' early in the search • The ones that get stuck early often do so for good reason – they are the correct values
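Recording when each element last changed can be sketched as follows (a simplified illustration under our own naming; in practice one would log the candidate vector once per temperature cycle inside the annealing loop):

```python
def last_change_cycles(trace):
    """Given the candidate vector recorded at the end of each temperature
    cycle, return for each position the last cycle at which its value
    changed (flipped +1 to -1 or vice versa). Positions that freeze early
    are the candidates for correctly 'stuck' bits."""
    n = len(trace[0])
    last = [0] * n
    for t in range(1, len(trace)):
        for i in range(n):
            if trace[t][i] != trace[t - 1][i]:
                last[i] = t
    return last

# Bit 0 never changes, bit 1 freezes after cycle 1, bit 2 after cycle 2.
trace = [[1, 1, -1], [1, -1, -1], [1, -1, 1], [1, -1, 1]]
print(last_change_cycles(trace))
```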
Profiling Annealing: Timing • Tested 30 PPP instances (101,117) with 32 different strategies (different weights wi for the negativity and histogram component costs, and different values of K). Ten runs at each strategy. • Maximum number of initial bits fixed at correct values: <40: 2 | 40-49: 10 | 50-59: 14 | 60-69: 2 | 70-79: 2 • Some strategies far better than others – the value of K is very important: K = 13 seems a very good candidate • The channel is highly volatile – hence the need for repeated runs • Note also that some runs had up to 108 of 117 bits set correctly in the final solution • For small K the minimum number of bits correct in the final solution is radically worse than for larger values of K
Profiling Annealing: Timing • Tested 30 PPP instances (151,167) with 16 different strategies (different weights wi for the negativity and histogram component costs, and different values of K). Ten runs at each strategy. • Maximum number of initial bits fixed at correct values: <40: 1 | 40-49: 5 | 50-59: 9 | 60-69: 11 | 70-79: 2 | 80+: 2 • Similar general results as before • Also tried for (201,217) – some runs had in excess of 100 initial stuck bits correct
Some Questions • Can you fix an element of the solution at +1 or -1 and determine the likelihood of correctness based on the distribution of results obtained? • Effects of different parameters (e.g. power parameters)? • How well can we profile the distribution of results in order to isolate those at the extremes of correctness? • Can we apply similar profiling tricks to other NP-complete problems? • Permuted Kernel Problem • Syndrome Decoding
Example – Permuted Kernel Problem • Given a matrix A and a vector V, find a permutation σ so that the permuted vector Vσ lies in the kernel of A, i.e. AVσ = 0 • Arithmetic carried out mod p
Example – Syndrome Decoding • Given a matrix A and a vector K, find a vector S with a small number k of bits set to 1 so that AS = K • Arithmetic carried out mod 2
Some Questions • Why does everyone try to find the secret/key directly? • e.g. for block ciphers, can we use guided search techniques to generate better approximations? • Use search to generate better (or more) cryptanalytic tools, e.g. multiple approximations? • Very loose: what would happen if you tried to search for a key on a difficult traditional encryption algorithm, Encrypt(K; P) = C? • Suppose you tried a guided search with Encrypt(K'; P) = C' and cost(C', C) = hamming(C, C') (or the sum of such costs over plaintexts Pi) • No chance of success at all. But what is the distribution of the failures? Is there a cost function that would induce an exploitable distribution of solutions?
Some Questions • Does this work combine fault injection and a 'timing' attack? • What is the equivalent of differential power analysis for heuristic search?
Optimisation for Boolean Function Design
Design as Optimisation • Let DS be the design space or search space • Let f(y) be a function over the design space that signifies how good (bad) a candidate y is. • measuring goodness we talk in terms of a fitness function • (measuring badness we talk in terms of a cost function) • Find z in DS such that f(z)=sup{f(y):y in DS} • Traditional techniques such as hill-climbing tend to get stuck in local optima. Need ability to escape from these to achieve global optimum.
Boolean Function Design • A Boolean function f of three variables, with its polar form f̂(x) = (-1)^f(x):

x1 x2 x3 | x | f(x) | f̂(x)
0 0 0 | 0 | 1 | -1
0 0 1 | 1 | 0 | 1
0 1 0 | 2 | 0 | 1
0 1 1 | 3 | 0 | 1
1 0 0 | 4 | 1 | -1
1 0 1 | 5 | 0 | 1
1 1 0 | 6 | 1 | -1
1 1 1 | 7 | 1 | -1

• For present purposes we shall use the polar representation • Will talk only about balanced functions, where there are equal numbers of 1s and -1s
Preliminary Definitions • Definitions relating to a Boolean function f of n variables • Linear function: Lw(x) = w1x1 ⊕ … ⊕ wnxn, with polar form L̂w(x) = (-1)^Lw(x) • Walsh Hadamard transform: F(w) = Σx f̂(x) L̂w(x)
Preliminary Definitions • Non-linearity: Nf = 2^(n-1) - (1/2) maxw |F(w)| • Auto-correlation: ACf = maxs≠0 |Σx f̂(x) f̂(x ⊕ s)| • For present purposes we need simply note that these can be easily evaluated given a function f. They can therefore be used as the functions to be optimised. Traditionally they are.
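Both measures, and the Walsh Hadamard transform they rest on, are indeed easy to evaluate directly; a brute-force sketch for small n (function names are our own, and fhat is the polar truth table listed in integer order of x):

```python
from itertools import product

def walsh(fhat, n):
    """Walsh Hadamard transform: F(w) = sum_x fhat(x) * (-1)^(w.x),
    where w.x is the dot product of the bit vectors w and x mod 2."""
    xs = list(product((0, 1), repeat=n))
    return [sum(fh * (-1) ** (sum(wi * xi for wi, xi in zip(w, x)) % 2)
                for x, fh in zip(xs, fhat))
            for w in xs]

def nonlinearity(fhat, n):
    """N_f = 2^(n-1) - (1/2) max_w |F(w)|  (max |F(w)| is always even)."""
    return 2 ** (n - 1) - max(abs(F) for F in walsh(fhat, n)) // 2

def autocorrelation(fhat, n):
    """AC_f = max over nonzero shifts s of |sum_x fhat(x) fhat(x XOR s)|."""
    N = 2 ** n
    return max(abs(sum(fhat[x] * fhat[x ^ s] for x in range(N)))
               for s in range(1, N))

# The balanced 3-variable function from the truth-table slide.
fhat = [-1, 1, 1, 1, -1, 1, -1, -1]
print(nonlinearity(fhat, 3), autocorrelation(fhat, 3))
```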
Using Parseval's Theorem • Parseval's Theorem: Σw F(w)^2 = 2^(2n) • Loosely, push down on F(w)^2 for some particular w and it pops up elsewhere • Suggests that arranging for uniform values of F(w)^2 will lead to good non-linearity. This is the initial motivation for our NEW cost FUNCTION.
Moves Preserving Balance • Start with a balanced (but otherwise random) solution. The move strategy preserves balance. • Define the neighbourhood of a particular function f to be the set of all functions obtained by exchanging (flipping) any two dissimilar values • Here we have swapped f(2) and f(4):

x1 x2 x3 | x | f(x) | f̂(x) | ĝ(x)
0 0 0 | 0 | 1 | -1 | -1
0 0 1 | 1 | 0 | 1 | 1
0 1 0 | 2 | 0 | 1 | -1
0 1 1 | 3 | 0 | 1 | 1
1 0 0 | 4 | 1 | -1 | 1
1 0 1 | 5 | 0 | 1 | 1
1 1 0 | 6 | 1 | -1 | -1
1 1 1 | 7 | 1 | -1 | -1
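The balance-preserving move can be sketched as follows (names are our own; the function picks one +1 position and one -1 position and exchanges them, so the numbers of 1s and -1s never change):

```python
import random

def balanced_swap(fhat, rng=random):
    """Neighbourhood move: exchange a randomly chosen +1 value with a
    randomly chosen -1 value in the polar truth table, preserving balance."""
    g = list(fhat)
    i = rng.choice([k for k, v in enumerate(g) if v == 1])
    j = rng.choice([k for k, v in enumerate(g) if v == -1])
    g[i], g[j] = g[j], g[i]
    return g

# The balanced function from the truth-table slide; the move always
# changes exactly two table entries and keeps the function balanced.
f = [-1, 1, 1, 1, -1, 1, -1, -1]
print(balanced_swap(f))
```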
Getting in the Right Area • Previous work (QUT) has shown strongly: • Heuristic techniques can be very effective for cryptographic design synthesis • Boolean function, S-box design etc. • Hill-climbing works far better than random search • Combining heuristic search and hill-climbing generally gives the best results • Aside – the notion applies more generally too; it has led to the development of memetic algorithms in GA work. GAs are known to be robust but not suited to 'fine tuning'. • We will adopt this strategy too: use simulated annealing to get in the 'right area', then hill-climb • But we will adopt the new cost function for the first stage