This paper explores the use of proxy data in data-driven systems, highlighting the need for accountability, oversight, and correction to ensure fairness and privacy. It discusses examples in various domains such as education, law enforcement, credit, web services, healthcare, and online advertising.
Accountable Data-Driven Systems with Proxy Use Matt Fredrikson (with Anupam Datta, Gihyuk Ko, Peter Mardziel, Shayak Sen)
Data-driven systems are ubiquitous Education Law Enforcement Credit … Web services Healthcare
Data-driven systems are opaque Online Advertising System User data Decisions
Opacity threatens fairness 1816 Online Advertising System User data Decisions 311 $200k+ Jobs
Opacity threatens privacy …able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. Take a fictional Target shopper who … bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug. There’s, say, an 87 percent chance that she’s pregnant. • Accountable Data-Driven Systems • Oversight to detect violations and explain behaviors • Correction to prevent future violations
Use restrictions Do not use a protected information type for certain purposes • Non-discrimination • Do not use race or gender for employment decisions • Business necessity exceptions • Privacy • Do not use health information for marketing or housing decisions • Contextual exceptions
Decisions with explanations [Datta, Sen, Zick 2016] • How much causal influence do various inputs have on a classifier’s decision about individuals or groups? Negative Factors: Occupation, Education Level. Positive Factors: Capital Gain.
Explicit use via causal influence Example: Credit decisions [Figure: Age and Income feed a classifier (uses only income) that outputs a Decision] Conclusion: Measures of association are not informative
Key idea: causal intervention Replace feature with random values from the population, and examine distribution over outcomes. [Figure: classifier (uses only income) with inputs Age (21, 44, 28, 63) and Income ($90K, $20K, $100K, $10K), producing a Decision]
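A minimal Python sketch of the intervention idea, under stated assumptions: the income threshold inside `classifier`, the pairing of ages with incomes in the order they are listed on the slide, and the `influence` helper are illustrative and not taken from the talk. It resamples one feature from the population while holding the rest fixed, then counts how often the decision changes.

```python
import random

def classifier(age, income):
    # Hypothetical classifier that looks only at income (threshold is an assumption).
    return income >= 50_000

# Assumed pairing of the slide's age and income values, in listed order.
population = [
    {"age": 21, "income": 90_000},
    {"age": 44, "income": 20_000},
    {"age": 28, "income": 100_000},
    {"age": 63, "income": 10_000},
]

def influence(feature, trials=2000, seed=0):
    """Fraction of trials where resampling `feature` from the population flips the decision."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(trials):
        person = rng.choice(population)
        donor = rng.choice(population)
        # Intervene: overwrite one feature, keep everything else fixed.
        intervened = dict(person, **{feature: donor[feature]})
        if classifier(**person) != classifier(**intervened):
            changed += 1
    return changed / trials

age_influence = influence("age")
income_influence = influence("income")
```

In this toy population age is associated with the decision (the younger individuals have the higher incomes), yet intervening on age never changes the outcome, which is the sense in which association alone is uninformative about use.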
Challenge: Proxy use Example: targeted advertisements Need to determine when information type is inferred [Figure: classifier (targets pregnant women) with inputs Pre-natal vitamins? and Blue rug?, output Ad display]
Proxy use: a closer look • What do we mean by proxy use? • Explicit use is also proxy use [Figure: decision tree testing Pregnant? with branches yes/no and leaves Ad #1 and Ad #2]
Proxy use: a closer look • What do we mean by proxy use? • Explicit use is also proxy use • “Inferred use” is proxy use [Figure: decision tree testing vitamins? and then rug?, with leaves Ad #1 and Ad #2]
Proxy use: a closer look • What do we mean by proxy use? • Explicit use is also proxy use • “Inferred use” is proxy use • Inferred values must be influential [Figure: decision tree with vitamins? at the root and a rug? test under both branches, with leaves Ad #1 and Ad #2]
Proxy use: a closer look • What do we mean by proxy use? • Explicit use is also proxy use • “Inferred use” is proxy use • Inferred values must be influential • Associations must be two-sided [Figure: decision tree testing vitamins?, rug?, and alcohol? in sequence, with leaves Ad #1 and Ad #2]
One- and two-sided associations • What happens if we allow one-sided association? • Consider this model: • Uses postal code to determine state • Zip code predicts race • …but not the other way around • This is a benign use of information associated with a protected information type [Figure: decision tree on zip code with branches Philadelphia and Pittsburgh and leaves Ad #1 and Ad #2]
Proxy use: a closer look • What do we mean by proxy use? • Explicit use is also proxy use • “Inferred use” is proxy use • Inferred values must be influential • Associations must be two-sided • Black-box association is unnecessary for proxy use [Figure: decision tree with rug & vitamins? at the root and an engaged? test under both branches, with leaves Ad #1 and Ad #2]
Towards a formal definition: axiomatic basis • (Axiom 1: Explicit use) If random variable Z is an influential input of the model A, then A makes proxy use of Z. • (Axiom 2: Preprocessing) If a model A makes proxy use of Z, and A’(x) = A(x, f(x)), then A’ also makes proxy use of Z. • Example: A’ infers a protected piece of info given directly to A • (Axiom 3: Dummy) If A’(x, x’) = A(x) for all x and x’, then A’ has proxy use of Z exactly when A does. • Example: feature never touched by the model. • (Axiom 4: Independence) If Z is independent of the inputs of A, then A does not have proxy use of Z. • Example: model obtains no information about protected type
“Ideal” proxy use axioms are inconsistent Key Intuition: • Preprocessing forces us to preserve proxy use under function composition • But the rest of the model can cancel out a composed proxy • Let X, Y, Z be pairwise independent random variables, and Y = X ⊕ Z • Then A(Y, Z) = Y ⊕ Z makes proxy use of Z (explicit use axiom) • So does A’(Y, Z, X) = Y ⊕ Z (dummy axiom) • And so does A’’(Z, X) = A’(X ⊕ Z, Z, X) (preprocessing axiom) • But A’’(Z, X) = X ⊕ Z ⊕ Z = X, and X, Z are independent!
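The cancellation in this counterexample can be checked by brute force over single bits. The sketch below is only an illustration of the slide's argument; the Python names `A`, `A1`, and `A2` stand in for A, A’, and A’’.

```python
from itertools import product

def A(y, z):
    # Z is an explicit, influential input here.
    return y ^ z

def A1(y, z, x):
    # Dummy axiom: the extra input x is never touched.
    return A(y, z)

def A2(z, x):
    # Preprocessing axiom: Y is computed from the inputs as X xor Z.
    return A1(x ^ z, z, x)

# Z changes the output of A when Y is held fixed.
z_influences_A = A(1, 0) != A(1, 1)

# After composition the two copies of Z cancel and only X remains.
outputs = {(z, x): A2(z, x) for z, x in product((0, 1), repeat=2)}
composed_is_just_x = all(out == x for (z, x), out in outputs.items())
```

Both flags come out true: Z is influential in A, yet the composed model is the identity on X and ignores Z entirely.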
Syntactic relaxation • We address this with a more syntactic definition • Composition is tied to how the function is represented as a program • Implicate a specific part of the program as responsible for proxy use [Figure: decision tree with rug & vitamins at the root and an engaged? test under both branches, with leaves offer and no offer]
Modeling Systems | Example Programs are written in a simple expression language:
⟨value⟩ ::= R | True | False | ⟨string⟩
⟨exp⟩ ::= ⟨value⟩ | var | op(⟨exp⟩, …, ⟨exp⟩) | if (⟨exp⟩) then { ⟨exp⟩ } else { ⟨exp⟩ }
Operations: • arithmetic operations: +, -, *, etc. • boolean connectives: or, and, not, etc. • relations: ==, <, ≤, >, etc. Note: our approach generalizes to other inductively-defined languages
Modeling Systems | Example
if (rug-and-vitamins) then
  if (engaged) then Ad1 else Ad2
else
  if (engaged) then Ad2 else Ad1
[Figure: the same program drawn as a decision tree with rug & vitamins? at the root and engaged? under both branches]
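To make the expression language concrete, here is a small interpreter sketch. The nested-tuple encoding (`("var", name)`, `("op", fn, ...)`, `("if", cond, then, else)`) and the function name `evaluate` are assumptions made for illustration; the talk does not specify a concrete representation.

```python
import operator

def evaluate(exp, env):
    """Evaluate a tuple-encoded expression against a dict of variable values."""
    if not isinstance(exp, tuple):
        return exp                      # a plain value: number, bool, or string
    tag = exp[0]
    if tag == "var":
        return env[exp[1]]
    if tag == "op":
        _, fn, *args = exp
        return fn(*(evaluate(arg, env) for arg in args))
    if tag == "if":
        _, cond, then_exp, else_exp = exp
        return evaluate(then_exp if evaluate(cond, env) else else_exp, env)
    raise ValueError(f"unknown expression tag: {tag!r}")

# The ad-targeting program from the slide.
program = ("if", ("var", "rug_and_vitamins"),
           ("if", ("var", "engaged"), "Ad1", "Ad2"),
           ("if", ("var", "engaged"), "Ad2", "Ad1"))

# An op(...) expression: a boolean connective over two variables.
both = ("op", operator.and_, ("var", "rug"), ("var", "vitamins"))

result = evaluate(program, {"rug_and_vitamins": True, "engaged": False})
```

Keeping programs as data rather than as opaque functions is what later allows individual sub-expressions to be singled out and examined.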
Program decomposition Decomposition: Given a program p, a decomposition (p1, X, p2) consists of two programs p1, p2, and a fresh variable X such that replacing X with p1 inside p2 yields p. [Figure: a decision tree over vitamins? and rug? split into a subtree p1 and a remainder p2 in which the subtree is replaced by X]
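A decomposition can be checked mechanically by substituting p1 for X in p2 and comparing against p. The sketch below assumes a hypothetical nested-tuple encoding of programs and a tree loosely modeled on the slide's vitamins/rug example; neither is taken from the talk.

```python
def substitute(exp, name, replacement):
    """Return exp with every occurrence of ("var", name) replaced by `replacement`."""
    if exp == ("var", name):
        return replacement
    if isinstance(exp, tuple):
        return tuple(substitute(part, name, replacement) for part in exp)
    return exp

# Full program p: test vitamins, then rug.
p = ("if", ("var", "vitamins"),
     ("if", ("var", "rug"), "Ad1", "Ad2"),
     "Ad2")

# Candidate decomposition: p1 is the rug subtree, p2 is p with that subtree cut out.
p1 = ("if", ("var", "rug"), "Ad1", "Ad2")
p2 = ("if", ("var", "vitamins"), ("var", "X"), "Ad2")

recomposed = substitute(p2, "X", p1)
is_decomposition = recomposed == p
```

Because the check is purely syntactic, the same function computed by a differently structured program can have a different set of decompositions, which is the point of the syntactic relaxation.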
Characterizing use Influential Decomposition: A decomposition (p1, X, p2) is influential if X can change the outcome of p2. [Figure: a decision tree over vitamins?, clothes?, and rug? split into p1 and p2, with X marking where p1 was removed] This decomposition is influential
Program decomposition Influential Decomposition: A decomposition (p1, X, p2) is influential if X can change the outcome of p2. [Figure: the rug & vitamins? / engaged? tree split into p1 and p2, with X marking where p1 was removed] This one is not
Modeling Systems | Probabilistic Semantics Expression semantics: ⟦exp⟧ : Instance → Value • I is a random variable over dataset instances • ⟦exp⟧ : I → V • V is a random variable over the expression’s value • Joint over input instances (I) and expression values (Vi) for each expression expi: Pr[ I, V0, V1, ..., V9 ] • marginals: Pr[V4 = True, V0 = Ad1] • conditionals: Pr[V4 = True | V0 = Ad1] [Figure: the rug & vitamins? / engaged? decision tree with its sub-expressions labeled exp0 through exp9]
Characterizing proxies Proxy: Given a decomposition (p1, X, p2) and a random variable Z, p1 is a proxy for Z if ⟦p1⟧(I) is perfectly two-way associated with Z. [Figure: the rug & vitamins? / engaged? tree with the rug & vitamins? test marked as p1 and the remainder as p2] p1 is a proxy for pregnancy status
Putting it all together Proxy Use: A program p has proxy use of random variable Z if there exists an influential decomposition (p1, X, p2) of p in which p1 is a proxy for Z. • This is close to our intuition from earlier • Formally, it satisfies similar axioms: • Dummy and independence axioms remain largely unchanged • The explicit use and preprocessing axioms use program decomposition instead of function composition
Still not quite right… • First reaction: this definition is probably too strong • Proxies must be perfect two-way predictors of protected information • This is a practical impossibility • On the other hand, the definition is also too weak • Influence is existential: there must exist some way for the proxy to influence the outcome • The events that lead to this might be exceedingly unlikely Solution: quantify association and influence, use thresholds
Quantifying association • What’s the right way to measure association? • Pearson correlation: “Is there a linear dependence between these variables?” • Spearman correlation: “Is there some monotonic dependence between these variables?” • Information theoretic measures: “How much does knowledge of a variable reduce uncertainty about the other?” • Two key requirements: • Captures two-sided associations • Invariant under renaming of values
Information theory | The Briefest Introduction • Mutual Information [I(X;Y)] – a measure of the dependence of two random variables. • I(X;Y) = H(X,Y) – (H(X|Y) + H(Y|X)) • I(X;Y) = 0 if X, Y are independent • I(X;Y) = H(X) = H(Y) = H(X,Y) if they are identical (perfectly correlated) • 0 ≤ I(X;Y) ≤ min{H(X), H(Y)} ≤ H(X,Y)
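A short sketch of estimating mutual information from paired samples. It uses the identity I(X;Y) = H(X) + H(Y) – H(X,Y), which is equivalent to the slide's form because H(X|Y) = H(X,Y) – H(Y) and H(Y|X) = H(X,Y) – H(X). The function names and the plug-in (empirical frequency) estimator are choices made here, not something the talk prescribes.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
identical = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
# Renaming the values of one variable leaves the measure unchanged.
renamed = mutual_information([0, 1, 0, 1], ["b", "a", "b", "a"])
```

The last call illustrates the invariance-under-renaming requirement: the measure depends only on how the values co-occur, not on what they are called.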
Quantifying use Luckily, prior work gives us a good starting point • Intervention: replace a feature, keep everything else fixed. • Causal Intervention: Replace feature with random values from the population, and examine distribution over outcomes. [Figure: inputs feeding a classifier]
Quantifying decomposition influence • Intervene on p1 • Compare the behavior: • With intervention • As the system runs normally • Measure divergence: ɩ(p1, p2) = E_{X,X’}[ ⟦p⟧(X) ≠ ⟦p2⟧(X, ⟦p1⟧(X’)) ] [Figure: a decision tree over vitamins?, clothes?, and rug? split into p1 and p2, evaluated on example inputs such as vitamins=yes, clothes=yes and clothes=no]
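The influence measure can be computed exactly on a small dataset by averaging over all pairs (X, X’). In this sketch programs are plain Python functions rather than syntax trees, and the toy `p1`, `p2`, and `dataset` are hypothetical; only the formula itself comes from the slide.

```python
from itertools import product

def p1(x):
    # Sub-computation being intervened on.
    return x["rug"]

def p2(x, X):
    # Rest of the program, with p1's value supplied through X.
    return "Ad1" if x["vitamins"] and X else "Ad2"

def influence(p1, p2, dataset):
    """Fraction of pairs (x, x') where feeding p1(x') into p2 changes the outcome for x."""
    pairs = list(product(dataset, repeat=2))
    changed = sum(p2(x, p1(x)) != p2(x, p1(x_prime)) for x, x_prime in pairs)
    return changed / len(pairs)

# Toy dataset: every combination of the two features, equally likely.
dataset = [{"vitamins": v, "rug": r} for v, r in product((True, False), repeat=2)]
iota = influence(p1, p2, dataset)
```

Here the outcome only changes when the instance has vitamins (half the instances) and the resampled rug value differs from the original (half the pairs), giving 0.25. Enumerating all pairs costs N² evaluations, which is where the quadratic factor in the naive detection cost comes from.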
Quantitative proxy use • A decomposition (p1, X, p2) is an (ε, δ)-proxy use of Z when • The association between p1 and Z ≥ ε, and • p1’s influence in p2, ɩ(p1, p2) ≥ δ • A program has (ε, δ)-proxy use of Z when it admits a decomposition that is an (ε, δ)-proxy use of Z • Corresponding definitions of “use privacy” and “proxy non-discrimination” follow
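Putting the two thresholds together, a single decomposition can be tested as follows. This is a sketch under assumptions: association is measured as mutual information in bits, the dataset is a made-up one in which pregnancy coincides exactly with buying both a rug and vitamins, and the names `association`, `influence`, and `is_proxy_use` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(samples):
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def association(xs, zs):
    # Mutual information in bits (assumed association measure).
    return entropy(xs) + entropy(zs) - entropy(list(zip(xs, zs)))

def influence(p1, p2, dataset):
    pairs = list(product(dataset, repeat=2))
    return sum(p2(x, p1(x)) != p2(x, p1(x2)) for x, x2 in pairs) / len(pairs)

def is_proxy_use(p1, p2, dataset, protected, eps, delta):
    """True when p1 is both eps-associated with the protected value and delta-influential in p2."""
    assoc = association([p1(x) for x in dataset], [x[protected] for x in dataset])
    return assoc >= eps and influence(p1, p2, dataset) >= delta

# Made-up data: pregnant exactly when both rug and vitamins were bought.
dataset = [{"vitamins": v, "rug": r, "engaged": e, "pregnant": v and r}
           for v, r, e in product((True, False), repeat=3)]

def p1(x):
    return x["vitamins"] and x["rug"]

def p2(x, X):
    if X:
        return "Ad1" if x["engaged"] else "Ad2"
    return "Ad2" if x["engaged"] else "Ad1"

flagged = is_proxy_use(p1, p2, dataset, "pregnant", eps=0.5, delta=0.3)
```

On this data the association is about 0.81 bits and the influence is 0.375, so the decomposition is flagged at (0.5, 0.3) but not at a stricter ε of 0.9.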
Proxy use | Algorithmics • Accountable Data-Driven Systems • Oversight to detect violations and explain behaviors • Correction to prevent future violations Modeling systems Characterizing proxy use Detecting and repairing violations
Witnesses • Using Witnesses • Localize where the violation occurs in a system • Localize where scrutiny/human eyeballs need to be applied • Localize where repair should be applied [Figure: two expression trees with labeled sub-expressions: rug & vitamins? / engaged? (exp0 through exp9) with leaves Ad #1 and Ad #2, and zip = z1 or z3 (Exp0 through exp2) with leaves offer and no offer]
Detecting and removing proxy use • How do we remove an (ε, δ)-proxy-use violation? • Naive algorithm: replace Expi with a constant • any constant: O(1) • best constant: O(N · M), where M is the number of possible values • Does the system have an (ε, δ)-proxy-use of a protected variable? • Naive algorithm: O(S · N²) • S – number of expressions • N – number of dataset instances
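The "best constant" repair can be sketched as a search over the M candidate values, scoring each on all N instances. The slide does not say what "best" means; the sketch assumes it means the constant that keeps accuracy against known labels highest. The toy `p2`, `dataset`, and `labels` are hypothetical.

```python
def accuracy(predict, dataset, labels):
    return sum(predict(x) == y for x, y in zip(dataset, labels)) / len(dataset)

def best_constant(p2, dataset, labels, candidates):
    """Try each of the M candidates on all N instances: O(N * M) evaluations."""
    return max(candidates,
               key=lambda c: accuracy(lambda x: p2(x, c), dataset, labels))

def p2(x, X):
    # Program with the violating sub-expression replaced by the variable X.
    return "offer" if X and x["interested"] else "no offer"

dataset = [{"interested": True}, {"interested": True},
           {"interested": False}, {"interested": False}]
labels = ["offer", "offer", "no offer", "offer"]

constant = best_constant(p2, dataset, labels, candidates=[True, False])
repaired_accuracy = accuracy(lambda x: p2(x, constant), dataset, labels)
```

Substituting `True` keeps three of the four labels correct, while `False` turns every decision into "no offer", so `True` is chosen. Picking an arbitrary constant instead skips the search entirely, which is the O(1) option on the slide.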
Repair [Figure: the two expression trees from the Witnesses slide (rug & vitamins? / engaged? with leaves Ad #1 and Ad #2; zip = z1 or z3 with leaves offer and no offer), annotated with the constants true and offer]
Experience: Setting Parameters • How do we select (ε, δ)? • Recall that: • ε: mutual information between proxy and protected type • δ: (roughly) probability that proxy changes model’s outcome • Our first-cut approach: • Find the most influential subcomputations in the program • Measure their association, consult a “normative judgement oracle”
Experience: Examples Income prediction using a benchmark census dataset • Gender, Education, Age, Capital Gains, Ethnicity, others • Marital status: Married-civ-spouse, Divorced, Never-married, Separated, … • Classification: Income < 50K, ≥ 50K • ~30,000 individuals • Model accuracy: 83.6% • After repair: 81.7% [Figure: fragment of the income model with tests capital-loss ≤ 1882.5, gender = female, and Age ≤ 30, annotated “Marital status ~”]
Experience: Examples Contraception method of married women predicted from family information. • wife's age, husband's education, # children, wife's occupation, husband's occupation, standard-of-living index, media exposure • Wife's education: 1=low, 2, 3, 4=high • Wife's religion: 0=Non-Islam, 1=Islam • Classification: Contraceptive method used: 1=No-use, 2=Long-term, 3=Short-term • 1473 individuals • Model accuracy: 61.2% • After repair: 52.1% [Figure: fragment of the Indonesian contraception model with tests wife-educ ≤ 3, # children ≤ 3, and age ≤ 31, annotated “Wife’s religion ~”]
Ongoing work • Better program analysis • Wider class of models • Adversarial settings
Thank you! Cornell Tom Ristenpart Helen Nissenbaum CMU Shayak Sen Gihyuk Ko ICSI / Berkeley Michael Tschantz Anupam Datta Piotr Mardziel
The Big Picture Classifier Race Associated • Age • Income • Zip-code • Race • … • Accountable Big Data Systems • Oversight to detect violations and explain behaviors • Correction to prevent future violations Credit offer? Big data system
QII for Individual Outcomes Causal Intervention: Replace feature with random values from the population, and examine distribution over outcomes. [Figure: inputs feeding a classifier that produces an outcome]
Repairing Proxy Usage • Have (ε, δ)-proxy-usage violation, an expression Expi • How can we remove it? • (Slightly less) Naive algorithm • Replace an Expi (nearby violation Exp) with a constant: O(s · N · M) • s – the number of “nearby” expressions
Nearby Expression Repair [Figure: expression tree with sub-expressions Exp0 through Exp5, including the tests Zip-code==z1,z3 and Interested==yes, with leaves offer and no-offer]
Association • Proxy use: Exp0 tests Zip-code=z1,z3 (odd zip-codes are white, even are black); Xrace and V0 = ⟦Exp0⟧X are correlated random variables • No use: Exp1 tests Zip-code=z1,z2; Xrace and V1 = ⟦Exp1⟧X are uncorrelated random variables [Figure: two trees taking Race, Zip-code, … as input, each with true/false branches leading to offer or no offer as the Decision]
Information theory | The Briefest Introduction • Surprise [I(p)] – a measure of “information” or the “surprise” in an event of probability p. • I(p) = -log₂ p • I(1) = 0 • I(0.5) = 1 • I(0.25) = 2 • I(0) = undefined (or “infinite surprise”)
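The listed values correspond to a base-2 logarithm, so surprise is measured in bits. A minimal sketch follows; the function name `surprise` and the choice to return infinity for p = 0 (rather than raising an error) are assumptions made here.

```python
from math import log2

def surprise(p):
    """Self-information of an event with probability p, in bits."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0:
        return float("inf")   # the slide's "infinite surprise" convention
    return -log2(p)

certain = surprise(1)
coin_flip = surprise(0.5)
one_in_four = surprise(0.25)
```

Entropy, as used in the mutual information slide, is the expected value of this quantity over the outcomes of a random variable.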