Privacy-Preserving Data Mining (PPDM)

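With the protection of personal information in the spotlight, Japan will keep falling behind unless companies and organizations cooperate. This field has been growing since around 2002. A considerable number of papers have recently appeared at machine learning and data engineering conferences. Given the times, it may well become an important technical component. Privacy-Preserving Data Mining (PPDM), Hiroshi Nakagawa, The University of Tokyo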

Basic concepts of PPDM

Concepts introduced between 2002 and around 2006
• The motivation behind PPDM
• k-anonymity
• l-diversity
• t-closeness

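Motivation
• Multiple organizations hold critical privacy-related data (sensitive data) and in some cases publish it
• Detailed data called microdata (as opposed to aggregated macrodata) are used for analysis and mining (in the US, publication is mandated by law)
• To protect microdata, they are sanitized (e.g., unnecessary parts are removed)
• For example, explicit identifiers (Social Security Number, name, phone number) are deleted
• But is that enough?
• No! There is the threat of link attacks
• Private information may be inferred from published data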

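Example of a link attack
• According to Sweeney [S01a], the medical record of the governor of Massachusetts could be identified from public information
• MA sanitizes the medical data it collects and publishes them as microdata (left circle in the figure)
• Meanwhile, the voter registration list is also public (right circle in the figure)
• Matching the two:
  • 6 people share the governor's date of birth
  • 3 of them are male
  • 1 of them has the same zipcode
• According to the 1990 US census data, 87% of people are uniquely identifiable by (zipcode, sex, date of birth)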

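Privacy of microdata
• Attributes of microdata
  • Explicit identifiers are deleted
  • Quasi-identifiers (QI) can be used to identify individuals
  • Sensitive attributes carry the sensitive information
• The goal of privacy protection is to make it impossible to link an individual to his or her sensitive information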

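k-anonymity
• Privacy protection by k-anonymity: Sweeney and Samarati [S01, S02a, S02b]
• k-anonymity: hide each individual among k-1 others
• That is, in the published microdata, every combination of quasi-identifier (QI) values is guaranteed to be shared by at least k individuals
• Hence, even under a link attack, the probability of identifying an individual is at most 1/k
• How it is achieved
  • Generalization and suppression
  • For now, perturbation of the data values is not considered; perturbation is used later, in the part on differential privacy
• There is a trade-off between privacy and usefulness for data mining
  • Do not anonymize more than necessary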
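The k-anonymity condition above can be checked mechanically. The following is a minimal sketch, assuming the table is a list of dictionaries; the function name `is_k_anonymous`, the column names, and the sample rows are invented for illustration and do not come from the slides.

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """Return True if every combination of QI values occurs in at least k rows."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(n >= k for n in counts.values())

# Hypothetical, already generalized microdata (values invented for illustration).
rows = [
    {"zipcode": "021**", "sex": "M", "disease": "flu"},
    {"zipcode": "021**", "sex": "M", "disease": "cancer"},
    {"zipcode": "021**", "sex": "F", "disease": "flu"},
    {"zipcode": "021**", "sex": "F", "disease": "asthma"},
]

two_anonymous = is_k_anonymous(rows, ["zipcode", "sex"], 2)
three_anonymous = is_k_anonymous(rows, ["zipcode", "sex"], 3)
print(two_anonymous, three_anonymous)
```

Each QI group here has exactly two rows, so the table passes the check for k = 2 and fails it for k = 3.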

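Example of k-anonymity: anonymization methods
• Generalization
  • For example, if the data in the target domain are organized into a hierarchy by level of abstraction, publish the values from a higher level
• Suppression
  • Delete data items that are distinctive
(Tables: original microdata; 2-anonymous data)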
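Generalization and suppression can be combined in a few lines. This is only a sketch under stated assumptions: the zipcode hierarchy is taken to be "mask the last digits with `*`", and any QI group that is still smaller than k after generalization is suppressed entirely. The helper names and the sample records are hypothetical.

```python
from collections import Counter

def generalize_zip(zipcode, level):
    """Move `level` steps up an assumed zipcode hierarchy by masking trailing digits."""
    if level == 0:
        return zipcode
    return zipcode[:-level] + "*" * level

def anonymize(records, level, k):
    """Generalize the zipcode, then suppress records whose QI group has fewer than k members."""
    generalized = [(generalize_zip(z, level), sex) for z, sex in records]
    counts = Counter(generalized)
    return [qi for qi in generalized if counts[qi] >= k]

# Hypothetical (zipcode, sex) quasi-identifiers.
records = [("02138", "M"), ("02139", "M"), ("02141", "F"), ("02142", "F"), ("94305", "F")]
published = anonymize(records, level=1, k=2)
print(published)
```

With one level of generalization the four Massachusetts-style records form two groups of size 2, while the single outlying record is suppressed.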

Generalization lattice (k-anonymity)
• Assume domain hierarchies exist for all QI attributes (sex, birthdate, zipcode)
• Construct the generalization lattice for the entire QI set (figure axis: from less to more generalization)
• Objective: find the minimum generalization that satisfies k-anonymity, i.e., maximize utility by finding the minimum distance vector with k-anonymity

Generalization lattice: Incognito [LDR05]
• Exploits monotonicity properties regarding the frequency of tuples in the lattice
• Reminiscent of OLAP hierarchies and frequent itemset mining
• (I) Generalization property (~rollup): if k-anonymity holds at some node, then it also holds for any ancestor node
  • e.g., <S1, Z0> is k-anonymous and, thus, so are <S1, Z1> and <S1, Z2>
• (II) Subset property (~apriori): if k-anonymity does not hold for a set of QI attributes, then it does not hold for any of its supersets
  • e.g., <S0, Z0> is not k-anonymous and, thus, <S0, Z0, B0> and <S0, Z0, B1> cannot be k-anonymous
• Note: the entire lattice, which includes three dimensions <S, Z, B>, is too complex to show
• Incognito [LDR05] considers sets of QI attributes of increasing cardinality (~apriori) and prunes nodes in the lattice using the two properties above

Seen in the domain space
• Consider the multi-dimensional domain space
• QI attributes are the dimensions
• Tuples are points in this space
• Attribute hierarchies partition the dimensions
(Figure labels: zipcode hierarchy; sex hierarchy)

Seen in the domain space: Incognito example
• 2 QI attributes (zipcode, sex), 7 tuples; hierarchies shown with bold lines
(Figure panels: original data, not 2-anonymous; zipcode rollup; sex rollup; 2-anonymous)

Seen in the domain space: taxonomy [LDR05, LDR06]
• Generalization taxonomy according to the groupings allowed (figure axis: generalization strength)
• Single-dimensional global recoding: Incognito [LDR05]
• Multi-dimensional global recoding: Mondrian [LDR06]
• Multi-dimensional local recoding: Topdown [XWP+06]

Mondrian [LDR06]
• Define a utility measure: the discernibility metric (DM)
  • Penalizes each tuple with the size of the group it belongs to
  • Intuitively, the ideal grouping is the one in which all groups have size k
• Mondrian tries to construct groups of roughly equal size k
• What else (besides Mondrian) does this painting remind you of?
• It is reminiscent of the kd-tree:
  • Cycle among dimensions
  • Median splits
(Figure: 2-anonymous partition)
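The kd-tree analogy can be made concrete with a short recursive sketch. It is a simplification, not the algorithm of [LDR06]: it assumes numeric QI attributes, splits by position in the sorted order (so tied values may land on both sides), and simply stops when a further split would leave fewer than k points on a side. The function name and the sample points are hypothetical.

```python
def mondrian(points, k, depth=0):
    """Split at the median, cycling among dimensions, while both halves keep >= k points."""
    if len(points) < 2 * k:
        return [points]                      # cannot split without breaking k-anonymity
    d = depth % len(points[0])               # cycle among dimensions, as in a kd-tree
    ordered = sorted(points, key=lambda p: p[d])
    mid = len(ordered) // 2                  # median split
    return (mondrian(ordered[:mid], k, depth + 1)
            + mondrian(ordered[mid:], k, depth + 1))

# Seven hypothetical tuples with two numeric QI attributes.
points = [(1, 1), (2, 5), (3, 2), (4, 7), (5, 3), (6, 6), (7, 4)]
groups = mondrian(points, k=2)
print([len(g) for g in groups])
```

Every resulting group holds between k and 2k-1 points, which is the "roughly equal size k" behavior the slide describes.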

Measuring group quality
• DM depends only on the cardinality of the group
• It gives no measure of how tight the group is
• A good group is one that contains tuples with similar QI values
• Define a new metric [XWP+06]: the normalized certainty penalty (NCP)
• It measures the perimeter of the group
(Figure: bad generalization, long boxes; good generalization, square-like boxes)
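To see why DM alone cannot tell a tight group from a loose one, the sketch below computes both measures for two groupings of the same four points. The NCP here is a simplified form that is assumed for illustration: numeric attributes only, equal attribute weights, each group's side lengths normalized by the range of the whole data set and multiplied by the group size. The function names and the data are hypothetical.

```python
def discernibility(groups):
    """DM: every tuple is charged the size of its group, so a group of size s costs s*s."""
    return sum(len(g) ** 2 for g in groups)

def ncp(groups):
    """Simplified NCP: normalized side lengths of each group's bounding box, times group size.
    Assumes every attribute takes at least two distinct values overall."""
    everything = [p for g in groups for p in g]
    dims = range(len(everything[0]))
    domain = [max(p[d] for p in everything) - min(p[d] for p in everything) for d in dims]
    total = 0.0
    for g in groups:
        sides = sum((max(p[d] for p in g) - min(p[d] for p in g)) / domain[d] for d in dims)
        total += len(g) * sides
    return total

tight = [[(0, 0), (1, 1)], [(3, 3), (4, 4)]]   # neighbors grouped together
loose = [[(0, 0), (4, 4)], [(1, 1), (3, 3)]]   # distant points grouped together
print(discernibility(tight), discernibility(loose))
print(ncp(tight), ncp(loose))
```

Both groupings have the same DM because the group sizes match, while the NCP of the loose grouping is three times larger.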

Topdown [XWP+06]
• Start with the entire data set
• Iteratively split in two
  • Reminiscent of the R-tree quadratic split
  • An R-tree partitions space with hierarchically nested, mutually overlapping minimum bounding rectangles (MBRs)
• Continue until left with groups that contain < 2k-1 tuples
• Split algorithm
  • Find seeds: 2 points that are furthest apart
  • Heuristic, not a complete quadratic search
  • The seeds become the 2 split groups
  • Examine points in random order (unlike the quadratic split)
  • Assign each point to the group whose NCP will increase the least

Boosting privacy with external data
• External databases (e.g., voter lists) are used by attackers
• Can we use them to our benefit?
• Try to improve the utility of anonymized data
• Join k-anonymity (JKA) [SMP]
(Figure labels: microdata; public data; join; k-anonymity; JKA; 3-anonymous microdata; 3-anonymous joined microdata)

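Problems with k-anonymity
• An example of k-anonymity
• Homogeneity attack: everyone in the last group has cancer
• Background-knowledge attack: in the first group, if it is known that Japanese people are unlikely to suffer from heart disease, then ...
(Tables: microdata; 4-anonymous data)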

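l-diversity [MGK+06]
• Aims to keep the sensitive values within each group well controlled
  • Prevents the homogeneity attack
  • Prevents the background-knowledge attack
• l-diversity (simple definition)
  • A group is l-diverse if
  • at least l distinct sensitive values occur within that group
• Ideally the l distinct sensitive values in a group occur as evenly as possible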
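The simple definition of l-diversity translates directly into a check on each QI group. The sketch below covers only this "distinct values" form, not the evenness preference; the function name, column names, and rows are invented for illustration.

```python
from collections import defaultdict

def is_l_diverse(rows, qi_columns, sensitive, l):
    """Return True if every QI group contains at least l distinct sensitive values."""
    values_per_group = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in qi_columns)
        values_per_group[key].add(row[sensitive])
    return all(len(values) >= l for values in values_per_group.values())

# Hypothetical 2-anonymous table whose second group is homogeneous.
rows = [
    {"zipcode": "130**", "age": "<30", "disease": "heart disease"},
    {"zipcode": "130**", "age": "<30", "disease": "flu"},
    {"zipcode": "148**", "age": ">=40", "disease": "cancer"},
    {"zipcode": "148**", "age": ">=40", "disease": "cancer"},
]
diverse = is_l_diverse(rows, ["zipcode", "age"], "disease", 2)
print(diverse)
```

The table is 2-anonymous yet fails 2-diversity, because the second group exposes "cancer" for all of its members (the homogeneity attack).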

Anatomy [XT06]
• A fast l-diversity algorithm
• Anatomy is not generalization
  • Separates sensitive values from tuples
  • Shuffles sensitive values among groups
• Algorithm
  • Assign sensitive values to buckets
  • Create groups by drawing from the l largest buckets
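The two algorithm steps on this slide can be sketched as follows. This covers only the group-creation loop; how the published method treats tuples left over at the end is not described on the slide, so the sketch merely returns them. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict

def anatomy_groups(rows, sensitive, l):
    """Bucket rows by sensitive value, then repeatedly draw one row from each of the l largest buckets."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[sensitive]].append(row)
    groups = []
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted(buckets, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])
    leftovers = [row for b in buckets.values() for row in b]
    return groups, leftovers

# Hypothetical rows: (id, disease).
rows = [{"id": i, "disease": d}
        for i, d in enumerate(["flu", "flu", "flu", "cancer", "cancer", "asthma"])]
groups, leftovers = anatomy_groups(rows, "disease", l=2)
print(len(groups), len(leftovers))
```

Because each group takes one row from each of l different buckets, every group contains l distinct sensitive values by construction.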

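t-closeness
• Even with l-diversity, privacy is at risk when the distribution is heavily skewed, for example when a sensitive attribute takes value a with probability 99% and value b with probability 1%
• A group satisfies t-closeness when the distance between the distribution of the sensitive data within the group and the distribution of the same attribute over the whole table is at most t
• The Earth Mover's Distance (EMD), taken over the values of the attribute, is used as the distance between these distributions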
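A small worked check may help here. The sketch assumes a categorical sensitive attribute in which every pair of distinct values is at ground distance 1; under that assumption the EMD between two distributions equals half of their L1 distance. The function names and the two example groups are invented for illustration.

```python
from collections import Counter

def distribution(values, domain):
    """Relative frequency of each domain value in `values`."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]

def emd_equal_distance(p, q):
    """EMD when all distinct values are equally far apart: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def satisfies_t_closeness(groups, t):
    """Compare each group's sensitive-value distribution with the whole table's."""
    everything = [v for g in groups for v in g]
    domain = sorted(set(everything))
    overall = distribution(everything, domain)
    return all(emd_equal_distance(distribution(g, domain), overall) <= t for g in groups)

# Sensitive values of two hypothetical groups; overall, a and b are equally frequent.
groups = [["a", "a", "a", "b"], ["a", "b", "b", "b"]]
print(satisfies_t_closeness(groups, 0.3), satisfies_t_closeness(groups, 0.2))
```

Each group is at distance 0.25 from the overall distribution, so the table satisfies 0.3-closeness and violates 0.2-closeness.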

References for k-anonymity, l-diversity, and t-closeness
• LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Incognito: Efficient Full-domain k-Anonymity. SIGMOD, 2005.
• LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Mondrian Multidimensional k-Anonymity. ICDE, 2006.
• Samarati, P. Protecting Respondents' Identities in Microdata Release. IEEE TKDE, 13(6):1010-1027, 2001.
• Sweeney, L. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
• Sweeney, L. k-Anonymity: Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
• Li, N., Li, T., Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. ICDE, pp. 106-115, 2007.

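As described so far, countermeasures against attacks that cross-reference multiple published databases seem to have reached a resting point with t-closeness.
• We now assume the attacker is someone who issues queries to a database
• Since 2006, differential privacy, proposed chiefly by Cynthia Dwork of Microsoft, has become the trend: it is a concept that lets the strength of a database's privacy protection be controlled mathematically, regardless of the attacker's prior knowledge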

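Differential privacy
• D1 and D2 are databases over the same domain that differ in exactly one element
• D1 and D2 return indistinguishable results for a query f. Differential privacy is this relative notion of safety: the contents of the database are hard for a user to pin down
• For X = D1 or D2, choose the noise Y appropriately so that
• $t = f(X) + Y$ satisfies $p(t - f(D_1)) \le e^{\varepsilon}\, p(t - f(D_2))$
• Alternatively, we want [equation not recovered]
• Such a distribution of Y is realized by the Laplace distribution
(Figure: D1 and D2 differ in only one element α)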
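The inequality on this slide can be checked numerically for Laplace noise. The sketch below is an illustration under assumptions that are not on the slide: f is a counting query, so its value changes by at most 1 between D1 and D2, and the noise scale is therefore 1/ε. The function names are hypothetical, and the Laplace sample is drawn as the difference of two exponential samples using only the standard library.

```python
import math
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def laplace_density(y, scale):
    """Density p(y) of Laplace(0, scale)."""
    return math.exp(-abs(y) / scale) / (2 * scale)

def private_count(data, predicate, epsilon, rng):
    """t = f(X) + Y for a counting query f (assumed sensitivity 1)."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)        # fixed seed only so the example is repeatable
epsilon = 0.5
noisy = private_count(range(100), lambda v: v % 2 == 0, epsilon, rng)

# Density ratio for neighboring answers f(D1) = 10 and f(D2) = 11 at a few outputs t.
ratios = [laplace_density(t - 10, 1 / epsilon) / laplace_density(t - 11, 1 / epsilon)
          for t in (5.0, 10.0, 10.5, 11.0, 20.0)]
print(noisy, max(ratios), math.exp(epsilon))
```

The largest ratio equals e^ε up to rounding, which is the bound $p(t - f(D_1)) \le e^{\varepsilon} p(t - f(D_2))$; a smaller ε forces the two densities closer together at the price of more noise.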

Tuning the parameter ε. Which functions f fit into this framework is a research question.

References on differential privacy
• C. Dwork. Differential privacy. In ICALP, LNCS, pp. 1-12, 2006.
• C. Dwork. Differential privacy: A survey of results. In TAMC, pp. 1-19, 2008.
• Cynthia Dwork, Frank McSherry, Kunal Talwar. The Price of Privacy and the Limits of LP Decoding. STOC'07, pp. 85-94, 2007.

Recent research trends in PPDM, from KDD 2010

Categories
• Preserving the privacy of the original data when computation is delegated to a cloud server that is not necessarily trusted
  • Privacy-Preserving Outsourcing Support Vector Machines with Random Transformation
  • k-Support Anonymity Based on Pseudo Taxonomy for Outsourcing of Frequent Itemset Mining
• Differential privacy
  • Data Mining with Differential Privacy
  • Discovering Frequent Patterns in Sensitive Data
• Mining distributed data with cryptographic techniques
  • Collusion-Resistant Privacy-Preserving Data Mining
• Other
  • Versatile Publishing for Privacy Preservation

Preserving the privacy of the original data when computation is delegated to a cloud server that is not necessarily trusted

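(1) Privacy-Preserving Outsourcing Support Vector Machines with Random Transformation
• An algorithm that randomly transforms the kernel so that the original data cannot be inferred when an SVM is outsourced to an untrusted external server
• Previous work trained the SVM on a small, randomly chosen part of the training data. The accuracy is reasonable, but at test time the data become known to the external server
• Hence the new proposal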

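Privacy-Preserving Outsourcing Support Vector Machines with Random Transformation
• As preparation, we first explain the Reduced SVM (RSVM), which uses only a subset of m' (<< m) of the m training examples. It was originally an algorithm for running an SVM with little memory
• Reference: Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the 1st SIAM International Conference on Data Mining (SDM), 2001.
• y = classification error, γ = distance from the origin
• A = training data
• D = class label: 1 for a positive example, -1 for a negative example
• w = weight vector = $A^{T} D u$
• Separating plane: $x^{T} w = \gamma$
• $x^{T} A^{T} D u = \gamma$
• With a linear kernel: $K(x^{T}, A^{T})\, D u = \gamma$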

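The SVM optimization then becomes [equation (10)] (where A' = A^T)
• This is an unconstrained minimization problem (10)
• It can be solved with Newton's method and similar techniques
• The kernel matrix K(A, A') is too large to fit in memory. With m the number of training examples in A, RSVM instead takes a much smaller matrix with m' << m rows (a random subset of A, denoted Ā here), substitutes the resulting kernel matrix K(A, Ā') into (10), and solves that optimization problem
• The idea on the next slide is to replace this Ā with random numbers unrelated to A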

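Privacy-Preserving Outsourcing Support Vector Machines with Random Transformation
• Two things are sent to the external server: Mx, the training data x after a random transformation represented by a matrix M, and $(M^{T})^{-1} r$, the m' random vectors r after the inverse transformation. The SVM is trained with the kernel k(x_i, r_j) over pairs drawn from these two sets (m' << m), i.e., as a Reduced SVM
• Because training uses random vectors, the kernel matrix k(x_i, x_j) is not revealed to the external server either. As long as the random vectors do not leak, the training data do not leak
• For the linear kernel, the decision function that the compute server evaluates on the transformed Mx is as follows [equation not recovered]
• Since $(Mx)^{T} (M^{T})^{-1} r = x^{T} M^{T} (M^{T})^{-1} r = x^{T} r$, polynomial kernels can be computed as well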
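The identity $(Mx)^{T} (M^{T})^{-1} r = x^{T} r$ is easy to verify on a toy case. The sketch below uses a hand-picked invertible 2x2 matrix and exact rational arithmetic so that the equality holds without rounding; the matrix, the vectors, and the helper names are all invented for illustration, and a real M would be a random invertible matrix of the data's dimension.

```python
from fractions import Fraction

def transpose(m):
    return [list(col) for col in zip(*m)]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical secret transformation M, one data vector x, one random vector r.
M = [[Fraction(2), Fraction(1)], [Fraction(5), Fraction(3)]]
x = [Fraction(3), Fraction(-4)]
r = [Fraction(7), Fraction(2)]

sent_x = matvec(M, x)                          # what the server sees instead of x
sent_r = matvec(inverse_2x2(transpose(M)), r)  # what the server sees instead of r

linear_on_server = dot(sent_x, sent_r)         # computed from transformed data only
polynomial_on_server = (linear_on_server + 1) ** 2
print(linear_on_server, dot(x, r))
```

The server obtains the same inner product, and hence the same polynomial kernel value, as it would from the untransformed x and r, without ever seeing either of them.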

Therefore, actual test data z can be transformed to Mz before the computation is outsourced; the cost is small, O(m'), and the data contents are (fairly well) protected.
• Experimental results on accuracy are shown below (m' = m/10)
• Reference: Keng-Pei Lin, Ming-Syan Chen. Privacy-Preserving Outsourcing Support Vector Machines with Random Transformation. KDD 2010.

(2) k-Support Anonymity Based on Pseudo Taxonomy for Outsourcing of Frequent Itemset Mining
• This paper focuses on outsourcing frequent itemset mining
• k-support anonymity
• To achieve k-support anonymity, a pseudo taxonomy tree is introduced

3-support anonymity: there are three items with support 2, namely a, g, and h. The number inside each circle is the total number of transactions contained in that subtree.

• Insert: 1 and 2 are already contained in p3, so sup(tea) is unaffected
• Insert: sup(p6) = 3 < sup(wine) and sup(p7) = 1 < sup(wine), so sup(wine) is unaffected
• Split: adding 1 and 5 leaves sup(wine) unchanged

As long as changes to the tree do not break the relation supTN(child node) < supTI(sensitive node) < supTN(parent node), the support computation is preserved.
(Figure labels: sensitive; insert; split; increase)

Differential privacy

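(3) Discovering Frequent Patterns in Sensitive Data
• When extracting the top k most frequent patterns (top-k FPM) from a dataset of sensitive data, the method is engineered to satisfy ε-differential privacy
• Approximate top-k FPM
  • Let $f_k$ be the true frequency of the k-th most frequent pattern. With confidence parameter ρ, the following conditions hold with probability at least (1 - ρ)
  • Soundness: no pattern whose true frequency is less than ($f_k$ - γ) is output
  • Completeness: every pattern whose true frequency is greater than ($f_k$ + γ) is output
  • Precision: the reported frequency of every output pattern lies within ±η of its true frequency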

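Proposed algorithm
• Input: pattern set U, dataset size n
• Preprocessing: set γ = (8k/εn) ln(|U|/ρ) and use an ordinary frequent pattern mining algorithm to extract from U the patterns with frequency > ($f_k$ - γ). The frequency of every remaining pattern is treated as ($f_k$ - γ)
• Noise addition and sampling: add Laplace(4k/εn) noise to the frequency of each pattern in U, and extract the top k patterns from the noisy result with ordinary FPM. Call this set S
• Perturbation: add Laplace(2k/εn) noise to the frequencies of the patterns in S and output the resulting noisy frequencies as the final result
(Slide annotations: the noise addition and sampling step is ε/2-differentially private and the perturbation step is ε/2-differentially private; together they are ε-differentially private)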
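The two noisy steps of the algorithm can be sketched as below. This is a simplified illustration, not the paper's implementation: the preprocessing step that truncates low frequencies is omitted, frequencies are assumed to be fractions of the n transactions, "4k/εn" is read as 4k/(εn), and the function names and sample frequencies are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_top_k(freqs, k, epsilon, n, rng):
    """Select k patterns by noisy frequency, then release freshly perturbed frequencies."""
    # Noise addition and sampling: choose the top k by noisy frequency.
    noisy = {p: f + laplace_noise(4 * k / (epsilon * n), rng) for p, f in freqs.items()}
    selected = sorted(noisy, key=noisy.get, reverse=True)[:k]
    # Perturbation: add independent noise to the true frequencies of the selected patterns.
    return {p: freqs[p] + laplace_noise(2 * k / (epsilon * n), rng) for p in selected}

rng = random.Random(0)   # fixed seed only so the example is repeatable
freqs = {"ab": 0.5, "bc": 0.4, "cd": 0.1, "de": 0.05}   # hypothetical true frequencies
released = private_top_k(freqs, k=2, epsilon=1.0, n=10**9, rng=rng)
print(released)
```

With n this large the noise scales are tiny, so the released patterns are the true top two and their frequencies are almost exact; with small n or small ε the selection itself becomes noisy, which is what the γ and η guarantees quantify.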

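The proposed algorithm is ε-differentially private
• With probability at least 1 - ρ, all patterns whose true frequency is greater than ($f_k$ - γ) can be extracted, and all patterns in U with frequency greater than ($f_k$ + γ) are output, where γ = (8k/εn) ln(|U|/ρ)
• With probability at least 1 - ρ, the difference between the noisy frequency and the true frequency is at most η, where η = (2k/εn) ln(k/ρ)
• The cost of extracting the top k patterns is O(k' + k log k), where k' is the number of patterns with frequency > ($f_k$ - γ)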
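Plugging numbers into the two bounds shows how they scale. The sketch reads "8k/εn" as 8k/(εn), which is an assumption about the slide's notation, and the parameter values are invented for illustration.

```python
import math

def gamma(k, epsilon, n, universe_size, rho):
    """gamma = (8k / (epsilon * n)) * ln(|U| / rho)."""
    return (8 * k / (epsilon * n)) * math.log(universe_size / rho)

def eta(k, epsilon, n, rho):
    """eta = (2k / (epsilon * n)) * ln(k / rho)."""
    return (2 * k / (epsilon * n)) * math.log(k / rho)

# Hypothetical setting: top 10 patterns, 100,000 transactions, 1,000 candidate patterns.
g = gamma(k=10, epsilon=1.0, n=100_000, universe_size=1000, rho=0.1)
e = eta(k=10, epsilon=1.0, n=100_000, rho=0.1)
print(round(g, 5), round(e, 5))
```

Both bounds shrink in proportion to 1/(εn), so doubling the dataset size halves the frequency error, whereas the dependence on |U| and ρ is only logarithmic.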

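(4) Data Mining with Differential Privacy
• When ID3 builds a decision tree, the data hanging under a node of the tree are split, and the split with the largest information gain is chosen
• Noise drawn from a Laplace distribution is therefore added to the counts of the split data, which achieves differential privacy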

$N_{\tau} = |T| + \mathrm{Laplace}(1/\varepsilon)$
• See the next slide: find the attribute that maximizes q (i.e., the information gain)

ExpMech: the exponential mechanism. The score function q is chosen from information gain, the Gini index, Max, and so on.
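A minimal sketch of the exponential mechanism follows, using its standard form: a candidate is chosen with probability proportional to exp(ε q / (2 Δq)), where Δq is the sensitivity of the score function q. The function name, the attribute scores, and the sensitivity value are assumptions for illustration and are not taken from the slides.

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick a key with probability proportional to exp(epsilon * score / (2 * sensitivity))."""
    best = max(scores.values())               # subtract the maximum for numerical stability
    weights = {c: math.exp(epsilon * (s - best) / (2 * sensitivity))
               for c, s in scores.items()}
    threshold = rng.random() * sum(weights.values())
    cumulative = 0.0
    for candidate, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return candidate
    return candidate                          # guard against floating-point rounding

# Hypothetical information-gain scores of three candidate split attributes.
scores = {"age": 0.9, "zipcode": 0.1, "sex": 0.3}
rng = random.Random(0)                        # fixed seed only so the example is repeatable
chosen = exponential_mechanism(scores, epsilon=1.0, sensitivity=1.0, rng=rng)
print(chosen)
```

A large ε makes the choice nearly deterministic in favor of the highest score, while a small ε makes it close to uniform; the noisy node count $N_{\tau}$ on the previous slide is obtained separately, with Laplace noise.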

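Other: (6) Versatile Publishing for Privacy Preservation
• A technique that transforms the database so that, even when microdata are published, inferences of the form quasi-ID → sensitive data cannot be made
• Let {QS} be the set of prohibited inferences QID → S
• All the data are partitioned and transformed as follows so that {QS} is prohibited
  • Cut out a fragment T from the full data; T is anonymized so that the inferences above cannot be made
  • If adding another fragment T' makes the existing {T} satisfy a rule in {QS}, remove S from T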

Here we would also like to briefly introduce our own results. Design of a collusion-resistant PPDM protocol: Secure Product of Sum. Presented at KDD 2010.

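Outline
• Background
• Proposal of a secure protocol
• Overview
• Building blocks and protocol
• Secure product computation protocol: SPoS
• Related protocols derived from SPoS: SRoS and SCoS
• Experimental evaluation
• Conclusion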
