Recovering Data in Presence of Malicious Errors
Atri Rudra, University at Buffalo, SUNY
The setup
(Figure: the sender encodes x into C(x); the channel delivers y = C(x) + error; the decoder outputs x, or gives up)
• Mapping C: an error-correcting code, or just a code
• Encoding: x → C(x)
• Decoding: y → x (or give up)
• C(x) is a codeword
Codes are useful!
• Deep-space communication
• Satellite broadcast
• Internet
• Cellphones
• ECC memory
• RAID
• CDs/DVDs
• Paper bar-codes
Redundancy vs. error-correction
• Repetition code: repeat every bit, say, 100 times
  • Good error-correcting properties
  • Too much redundancy
• Parity code: add a single parity bit
  • Minimum amount of redundancy
  • Bad error-correcting properties
  • Two errors go completely undetected
• Neither of these codes is satisfactory (see the sketch below)
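A minimal sketch (mine, not from the talk) contrasting the two codes on this slide; `reps=5` stands in for the slide's 100 repetitions:

```python
def repetition_encode(bits, reps=5):
    """Repeat every bit `reps` times (rate 1/reps: lots of redundancy)."""
    return [b for b in bits for _ in range(reps)]

def repetition_decode(received, reps=5):
    """Majority vote inside each block of `reps` symbols: corrects up to
    (reps-1)//2 flips per block."""
    blocks = [received[i:i + reps] for i in range(0, len(received), reps)]
    return [int(sum(blk) > reps // 2) for blk in blocks]

def parity_encode(bits):
    """Append a single parity bit (rate k/(k+1): minimal redundancy)."""
    return bits + [sum(bits) % 2]

def parity_check(received):
    """True iff the parity checks out; two flips cancel and go undetected."""
    return sum(received) % 2 == 0

msg = [1, 0, 1, 1]
assert repetition_decode(repetition_encode(msg)) == msg
assert parity_check(parity_encode(msg))
```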
Two main challenges in coding theory
• Problem with the parity example
  • Messages are mapped to codewords that do not differ in many places
  • Need to pick many codewords that all differ a lot from each other
• Efficient decoding
  • Naive algorithm: check the received word against all codewords (sketched below)
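The naive algorithm from the last bullet, as a brute-force sketch (illustration only): with 2^k codewords in the codebook, this takes time exponential in k, which is exactly why efficient decoding is a challenge.

```python
def hamming_distance(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def naive_decode(received, codebook):
    """Return the codeword closest to `received` in Hamming distance."""
    return min(codebook, key=lambda c: hamming_distance(received, c))
```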
The fundamental tradeoff
• Correct as many errors as possible with as little redundancy as possible
• Can one achieve the "optimal" tradeoff with efficient encoding and decoding?
• This talk: the answer is yes
Overview of the talk
• Specify the setup
  • The model
  • What is the optimal tradeoff?
• Previous work
• Construction of a "good" code
  • High-level idea of why it works
• Future directions
  • Some recent progress
Error-correcting codes
(Figure: x encoded as C(x); received word y decoded back to x, or give up)
• Mapping C: Σ^k → Σ^n
  • Message length k, code length n, n ≥ k
  • Rate R = k/n ≤ 1
• Efficient means polynomial in n
  • Encoding and decoding complexity
Shannon's world
• Noise is probabilistic
• Binary symmetric channel: every bit is flipped independently with probability p
• Benign noise model
  • For example, does not capture bursty errors
Claude E. Shannon
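A one-function sketch (my own) of the binary symmetric channel just described:

```python
import random

def bsc(codeword, p):
    """Binary symmetric channel: flip each bit independently w.p. p."""
    return [bit ^ int(random.random() < p) for bit in codeword]
```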
Hamming's world (the channel model we will consider)
• Errors are worst case
  • Arbitrary error locations, arbitrary symbol changes
• Limit on the total number of errors
• Much more powerful than Shannon's model
  • Captures bursty errors
Richard W. Hamming
A "low level" view
• Think of each symbol in Σ as a packet
• The setup
  • Sender wants to send k packets
  • After encoding, sends n packets
  • Some packets get corrupted
  • Receiver needs to recover the original k packets
• Packet size: ideally constant, but can grow with n
Decoding
(Figure: codeword C(x) sent, word y received)
• C(x) sent, y received; x ∈ Σ^k, y ∈ Σ^n
• How much of y must be correct to recover x?
  • At least k packets must be correct
  • So at most a (n-k)/n = 1 - R fraction of errors (R = k/n)
  • 1 - R is the information-theoretic limit
• ρ: the fraction of errors the decoder can handle
  • The information-theoretic limit implies ρ ≤ 1 - R
Can we get to the limit of 1 - R?
(Figure: Hamming balls of radius (1-R)/2 around codewords c1 and c2; the received word y lies between them)
• Not if we always want to uniquely recover the original message
• Limit for unique decoding: ρ < (1-R)/2
List decoding [Elias 57, Wozencraft 58]
• Always insisting on a unique codeword is restrictive
• The "pathological" cases are rare: all but an exponentially small (in n) fraction of the space in high dimension
• A "typical" received word can be decoded beyond (1-R)/2
• Better error-recovery model: output a list of answers
  • List decoding
  • Example: spell checker
Advantages of list decoding
• Typical received words have a unique closest codeword
  • List decoding returns a list of size one for such received words
• Still deals with worst-case errors
• How to deal with a list of size greater than one?
  • Declare an error; or
  • Use some side information (e.g., a spell checker)
The list decoding problem
• Given a code C and an error parameter ρ
• For any received word y, output all codewords c such that c and y disagree in at most a ρ fraction of places
• Fundamental question
  • What is the best possible tradeoff between R and ρ, with "small" lists?
  • Can it approach the information-theoretic limit 1 - R? (see the sketch below)
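The problem statement translates directly into a brute-force sketch (illustration only; real list decoders avoid enumerating the codebook):

```python
def list_decode(received, codebook, rho):
    """All codewords within relative Hamming distance rho of `received`."""
    n = len(received)
    return [c for c in codebook
            if sum(a != b for a, b in zip(c, received)) <= rho * n]
```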
Other applications of list decoding
• Cryptography
  • Cryptanalysis of certain block ciphers [Jakobsen 98]
  • Efficient traitor tracing schemes [Silverberg, Staddon, Walker 03]
• Complexity theory
  • Hardcore predicates from one-way functions [Goldreich, Levin 89; Impagliazzo 97; Ta-Shma, Zuckerman 01]
  • Worst-case vs. average-case hardness [Cai, Pavan, Sivakumar 99; Goldreich, Ron, Sudan 99; Sudan, Trevisan, Vadhan 99; Impagliazzo, Jaiswal, Kabanets 06]
• Other algorithmic applications
  • IP traceback [Dean, Franklin, Stubblefield 01; Savage, Wetherall, Karlin, Anderson 00]
  • Guessing secrets [Alon, Guruswami, Kaufman, Sudan 02; Chung, Graham, Leighton 01]
Overview of the talk
• Specify the setup
  • The model
  • The optimal tradeoff between rate and fraction of errors
• Previous work
• Construction of a "good" code
  • High-level idea of why it works
• Future directions
  • Some recent progress
Information-theoretic limit
(Plot: fraction of errors ρ vs. rate R, comparing unique decoding with the information-theoretic limit)
• ρ < 1 - R is the information-theoretic limit
• Can handle twice as many errors as unique decoding
Achieving the information-theoretic limit
• There exist codes that achieve the information-theoretic limit: ρ ≥ 1 - R - o(1)
  • Random coding argument
• Not a useful result
  • Codes are not explicit
  • No efficient list decoding algorithms
• Need explicit constructions of such codes
• Also need polynomial-time (list) decodability
  • Requires the list size to be polynomial
The challenge
• Explicit construction of codes with efficient list decoding algorithms up to the information-theoretic limit
  • For rate R, correct a 1 - R fraction of errors
• Shannon's work raised a similar challenge
  • Explicit codes achieving the information-theoretic limit for stochastic models
  • That challenge has been met [Forney 66; Luby-Mitzenmacher-Shokrollahi-Spielman 01; Richardson-Urbanke 01]
• Now do it for the stronger adversarial model
The best until 1998
(Plot: unique decoding, the information-theoretic limit, and the Guruswami-Sudan bound)
• ρ = 1 - R^{1/2} for Reed-Solomon codes [Sudan 95, Guruswami-Sudan 98]
• Better than unique decoding; at R = 0.8:
  • Unique decoding: 10%
  • Information-theoretic limit: 20%
  • Guruswami-Sudan: 10.56%
• Motivating question: close the gap between the Guruswami-Sudan bound and the information-theoretic limit with explicit, efficient codes
The best until 2005
(Plot: the Parvaresh-Vardy bound, drawn for s = 2, added to the previous curves)
• ρ = 1 - (sR)^{s/(s+1)} for any s ≥ 1 [Parvaresh-Vardy 05]
• Based on Reed-Solomon codes
• Improves on Guruswami-Sudan for R < 1/16
Our result
(Plot: our bound added, essentially matching the information-theoretic limit)
• ρ = 1 - R - ε for any ε > 0
• Folded RS codes [Guruswami, R. 06]
Overview of the talk
• Specify the setup
  • The model
  • The optimal tradeoff between rate and fraction of errors
• Previous work
• Our construction
  • High-level idea of why it works
• Future directions
  • Recent progress
The main result
• Construction of an algebraic family of codes
  • For every rate R > 0 and every ε > 0
• List decoding algorithm that can correct a 1 - R - ε fraction of errors
• Based on Reed-Solomon codes
Algebra terminology
• F will denote a finite field
  • Think of it as the integers mod some prime
• Polynomials: coefficients come from F
  • A polynomial of degree 3 over Z_7: f(X) = X^3 + 4X + 5
• Evaluate polynomials at points of F
  • f(2) = (8 + 8 + 5) mod 7 = 21 mod 7 = 0
• Irreducible polynomials: no non-trivial polynomial factors
  • X^2 + 1 is irreducible over Z_7, while X^2 - 1 is not (both checked in the sketch below)
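The slide's examples can be checked mechanically; a small sketch (mine) over Z_7. For degree-2 polynomials, irreducibility is equivalent to having no roots, which a loop can verify:

```python
P = 7  # work in Z_7, the integers mod the prime 7

def poly_eval(coeffs, x, p=P):
    """Evaluate a polynomial (coeffs[i] is the coefficient of X^i) at x, mod p."""
    return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p

f = [5, 4, 0, 1]             # f(X) = X^3 + 4X + 5 from the slide
assert poly_eval(f, 2) == 0  # (8 + 8 + 5) mod 7 = 0

# X^2 + 1 has no root mod 7, so it is irreducible over Z_7;
# X^2 - 1 = (X-1)(X+1) has roots 1 and 6.
assert all(poly_eval([1, 0, 1], x) != 0 for x in range(P))
assert [x for x in range(P) if poly_eval([P - 1, 0, 1], x) == 0] == [1, 6]
```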
Reed-Solomon codes
(Figure: the codeword (f(1), f(2), f(3), f(4), ..., f(n)))
• Message: (m_0, m_1, ..., m_{k-1}) ∈ F^k
• View it as the polynomial f(X) = m_0 + m_1 X + ... + m_{k-1} X^{k-1}
• Encoding: RS(f) = (f(1), f(2), ..., f(n)), with evaluation points {1, 2, ..., n} ⊆ F
• [Guruswami-Sudan] Can correct up to a 1 - (k/n)^{1/2} fraction of errors in polynomial time
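A hypothetical toy encoder for the map just defined, over a small prime field Z_p (real deployments use larger fields, but the evaluation map is the same):

```python
def rs_encode(message, n, p):
    """Reed-Solomon over Z_p: view the message as polynomial coefficients
    and evaluate at the points 1, 2, ..., n (requires n < p)."""
    assert n < p
    return [sum(m * pow(x, i, p) for i, m in enumerate(message)) % p
            for x in range(1, n + 1)]

# rate 1/2 example over Z_13: k = 3 message symbols, n = 6 evaluations
codeword = rs_encode([5, 0, 2], n=6, p=13)
```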
Parvaresh-Vardy codes (of order 2)
(Figure: each position i carries the pair (f(i), g(i)), where g(X) = f(X)^q mod E(X))
• The extra information from g(X) helps in decoding
• Rate: R_PV = k/2n
• [PV 05] PV codes can correct a 1 - (k/n)^{2/3} = 1 - (2 R_PV)^{2/3} fraction of errors in polynomial time
Towards our solution
• Suppose g(X) = f(X)^q mod E(X) = f(-X)
• Look again at the PV codeword: position α carries (f(α), g(α)) = (f(α), f(-α)), while position -α carries (f(-α), f(α))
• The two positions carry exactly the same information
Folded Reed-Solomon codes
• Suppose g(X) = f(X)^q mod E(X) = f(-X)
• Don't send the redundant symbols: bundle f(α) and f(-α) into one symbol over the larger alphabet F × F
  • Reduces the length to n/2
  • Rate R = (k/2)/(n/2) = k/n
• Using the PV result, the fraction of errors is 1 - (k/n)^{2/3} = 1 - R^{2/3} (see the sketch below)
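A simplified sketch (my own) of this order-2 folding: one folded symbol per pair {a, -a} of evaluation points, so nothing redundant is transmitted.

```python
def folded_rs_encode(message, points, p):
    """Order-2 folded RS over Z_p: `points` holds one representative per
    pair {a, -a}; each codeword symbol is the pair (f(a), f(-a))."""
    def f(x):
        return sum(m * pow(x, i, p) for i, m in enumerate(message)) % p
    return [(f(a), f((-a) % p)) for a in points]
```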
Getting to 1 - R - ε
• We started with a PV code of order s = 2 to get 1 - R^{2/3}
• Start instead with a PV code of general order s: 1 - R^{s/(s+1)}
• Pick s "large" enough to approach 1 - R - ε
• Decoding complexity increases over Parvaresh-Vardy but remains polynomial
What we actually do
• We show that for any generator γ of F \ {0}: f(X)^q mod E(X) = f(γX)
• Can achieve similar compression by grouping the evaluation points into orbits of γ
(Figure: an m × m' array of evaluations; column j is the folded symbol (f(γ^{jm}), f(γ^{jm+1}), ..., f(γ^{jm+m-1})) for j = 0, ..., m'-1, ending at f(γ^{mm'-1}))
• m' ≈ n/m, and R ≈ (k/m)/(n/m) = k/n
Proving f(X)^q mod E(X) = f(γX)
• First use the fact that f(X)^q = f(X^q) over F
• So it suffices to show f(X^q) mod E(X) = f(γX)
• Proving X^q mod E(X) = γX suffices
• Equivalently, E(X) divides X^q - γX, i.e., E(X) divides X^{q-1} - γ
• E(X) = X^{q-1} - γ is irreducible (see the check below)
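A numeric sanity check (mine, not part of the talk) of the identity over the small field Z_5, where γ = 2 generates Z_5^* and E(X) = X^4 - 2 is irreducible:

```python
q, gamma = 5, 2  # field Z_5; gamma = 2 generates Z_5^* (2, 4, 3, 1)

def poly_mul(a, b):
    """Multiply polynomials (coefficient lists, low degree first) mod q."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % q
    return out

def poly_mod(a, e):
    """Reduce a modulo the monic polynomial e, coefficients mod q."""
    a = a[:]
    while len(a) >= len(e):
        c, d = a[-1], len(a) - len(e)
        for i, y in enumerate(e):
            a[d + i] = (a[d + i] - c * y) % q
        while a and a[-1] == 0:
            a.pop()
    return a + [0] * (len(e) - 1 - len(a))

E = [(-gamma) % q, 0, 0, 0, 1]   # E(X) = X^4 - 2
f = [3, 1, 4]                    # an arbitrary f(X) = 4X^2 + X + 3

fq = [1]
for _ in range(q):               # f(X)^q by repeated multiplication
    fq = poly_mul(fq, f)
lhs = poly_mod(fq, E)
rhs = [(c * pow(gamma, i, q)) % q for i, c in enumerate(f)]  # f(gamma X)
rhs = rhs + [0] * (len(lhs) - len(rhs))
assert lhs == rhs                # f(X)^5 mod E(X) == f(2X) over Z_5
```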
Our result (recap)
(Plot: unique decoding, Guruswami-Sudan, Parvaresh-Vardy, our work, and the information-theoretic limit)
• ρ = 1 - R - ε for any ε > 0
• Folded RS codes [Guruswami, R. 06]
Limitations of our work
• To get to 1 - R - ε, need s > 1/ε
• Alphabet size = n^s > n^{1/ε}
  • Fortunately this can be reduced to 2^{poly(1/ε)} via concatenation and expanders [Guruswami-Indyk 02]
  • The lower bound is 2^{1/ε}
• List size (and running time) > n^{1/ε}
  • Open question: bring this down
Overview of the talk
• List decoding primer
• Previous work on list decoding
• Codes over large alphabets
  • Construction of a "good" code
  • High-level idea of why it works
• Codes over small alphabets
  • The current best codes
• Future directions
  • Some (very) modest recent progress
Optimal tradeoff for list decoding
• Best possible is ρ = H_q^{-1}(1 - R)
  • H_q(ρ) = ρ log_q(q-1) - ρ log_q ρ - (1-ρ) log_q(1-ρ) is the q-ary entropy function
• There exists an (H_q^{-1}(1-R-ε), O(1/ε))-list decodable code
  • A random code of rate R has the property whp
• ρ > H_q^{-1}(1-R+ε) implies super-polynomial list size
  • For any code
• For large q, H_q^{-1}(1-R) ≈ 1 - R (see the sketch below)
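For concreteness, a small sketch (mine) that inverts the q-ary entropy numerically by bisection; it reproduces the familiar binary value H_2^{-1}(1/2) ≈ 0.11 for rate-1/2 codes:

```python
import math

def Hq(rho, q):
    """q-ary entropy function, for 0 < rho < 1 - 1/q."""
    return (rho * math.log(q - 1, q)
            - rho * math.log(rho, q)
            - (1 - rho) * math.log(1 - rho, q))

def Hq_inv(y, q, iters=60):
    """Invert Hq on (0, 1 - 1/q], where it is increasing, by bisection."""
    lo, hi = 0.0, 1.0 - 1.0 / q
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Hq(mid, q) < y else (lo, mid)
    return lo

print(round(Hq_inv(0.5, 2), 3))   # ~0.11 for binary codes of rate 1/2
```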
Our results (q = 2)
(Plot: number of errors vs. rate, comparing the optimal tradeoff H^{-1}(1-R), the Zyablov bound [Guruswami, R. 06], the Blokh-Zyablov bound [Guruswami, R. 07], and the previous best)
How do we get binary codes?
• Concatenation of codes [Forney 66]
  • C1: (GF(2^k))^K → (GF(2^k))^N ("outer" code)
  • C2: GF(2)^k → (GF(2))^n ("inner" code)
  • C1 ∘ C2: (GF(2))^{kK} → (GF(2))^{nN}
• Typically k = O(log N)
  • Brute-force decoding for the inner code
(Figure: message m = m_1 ... m_K, outer codeword C1(m) = w_1 ... w_N, concatenated codeword C1 ∘ C2(m) = C2(w_1) ... C2(w_N); see the sketch below)
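Schematically (a toy sketch of my own), concatenation just inner-encodes each symbol of the outer codeword:

```python
def concat_encode(message, outer_encode, inner_encode):
    """C1 o C2: apply the inner code to every symbol of the outer codeword."""
    return [inner_encode(symbol) for symbol in outer_encode(message)]
```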
List decoding the concatenated code
• C1 = folded RS code, C2 = "suitably chosen" binary code
• Natural decoding algorithm:
  • Divide the received word into blocks of length n
  • Find the closest C2 codeword for each block
  • Run the list decoding algorithm for C1
• Loses information!
List decoding C2
(Figure: received blocks y_1, y_2, ..., y_N ∈ GF(2)^n are list decoded into lists S_1, S_2, ..., S_N ⊆ GF(2)^k)
• How do we "list decode" from lists?
The list recovery problem
• Given a code C and an error parameter ρ
• For any sequence of lists S_1, ..., S_N such that |S_i| ≤ s for every i
• Output all codewords c such that c_i ∈ S_i for at least a 1 - ρ fraction of the i's
• List decoding is the special case with s = 1 (see the sketch below)
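As with list decoding, the definition yields a direct brute-force sketch (illustration only):

```python
def list_recover(lists, codebook, rho):
    """All codewords whose i-th symbol lands in the i-th list for at
    least a (1 - rho) fraction of the positions i."""
    N = len(lists)
    return [c for c in codebook
            if sum(ci in Si for ci, Si in zip(c, lists)) >= (1 - rho) * N]
```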
List decoding C1 ∘ C2
(Figure: list decode C2 on each block y_1, ..., y_N to obtain lists S_1, ..., S_N, then run the list recovery algorithm for C1 on those lists)
Putting it together [Guruswami, R. 06]
• If C1 can be list recovered from a ρ_1 fraction of errors and C2 can be list decoded from a ρ_2 fraction of errors, then C1 ∘ C2 can be list decoded from a ρ_1 ρ_2 fraction of errors
• Folded RS of rate R is list recoverable from 1 - R errors
• There exist inner codes of rate r list decodable from H^{-1}(1-r) errors
  • Can find one by "exhaustive" search
• C1 ∘ C2 is list decodable from (1-R) · H^{-1}(1-r) errors
Multilevel concatenated codes
• C1: (GF(2^k))^K → (GF(2^k))^N ("outer" code 1)
• C2: (GF(2^k))^L → (GF(2^k))^N ("outer" code 2)
• C_in: GF(2)^{2k} → (GF(2))^n ("inner" code)
• C1 and C2 are folded RS codes
(Figure: messages m and M are encoded as C1(m) = w_1 ... w_N and C2(M) = v_1 ... v_N; position i of the final codeword is C_in(v_i, w_i))
Advantage over concatenated codes (which have rate rR)
• C1, C2, C_in have rates R1, R2, and r
• Final rate: r(R1 + R2)/2; choose R1 < R
• Step 1: just recover m
  • List decode C_in up to H^{-1}(1-r) errors
  • List recover C1 up to 1 - R1 errors
• Can handle (1 - R1) · H^{-1}(1-r) > (1 - R) · H^{-1}(1-r) errors
(Figure: same encoding layout as the previous slide)
Advantage over concatenated codes (contd.)
• Step 2: just recover M, given m
  • A subcode of C_in of rate r/2 acts on M
  • List decode the subcode up to H^{-1}(1 - r/2) errors
  • List recover C2 up to 1 - R2 errors
• Can handle (1 - R2) · H^{-1}(1 - r/2) errors
(Figure: same encoding layout as the previous slide)