Reconstruction-Based Association Rule Hiding

Author: Yuhong Guo (MS-Ph.D. Candidate, Peking Univ., China), yhguo@pku.edu.cn
Advisor: Prof. Shiwei Tang
Co-Advisors: Prof. Dongqing Yang, Jian Pei
Sunday, June 10, 2007

Association Rule Hiding: what? why? and how?
• Problem: hide sensitive association rules in data without losing the non-sensitive ones
• Motivation: large data repositories contain confidential rules whose disclosure can have serious adverse effects
• Solutions
  • Data modification (traditional: fine-tunes the data and controls the hiding effects only indirectly)
    • distortion
    • blocking
  • Data reconstruction (new and promising: knowledge sanitization, controls the hiding effects directly)
SIGMOD Ph.D. Workshop IDAR’07

Outline
• Background
  • Motivation
  • Problem statement
  • Related work
• Proposed Solution
• Current Progress
• Evaluation Plan

Background: Motivation
Privacy Preserving Data Mining (PPDM)
• Two problems addressed in PPDM
  • the protection of private data
  • the protection of sensitive rules (knowledge) contained in the data
[Figure: data mining, data sharing, privacy preserving]

Background: Problem statement
• Given
  • a database D to be released
  • a minimum support threshold “MST” and a minimum confidence threshold “MCT”
  • a set of association rules R mined from D
  • a set of sensitive rules Rh ⊆ R to be hidden
• Find a new database D’ such that
  • the rules in Rh cannot be mined from D’
  • as many of the rules in R − Rh as possible can still be mined from D’
This is the KHD (Knowledge Hiding in Database) problem

Background: Related work
• Data modification approaches
  • Basic idea: data sanitization, D → D’
  • Current status: distortion and blocking, a flourishing line of work
  • Drawbacks
    • Cannot control the hiding effects intuitively; requires a lot of I/O
• Data reconstruction approaches
  • Basic idea: knowledge sanitization, D → K → D’
  • Current status: limited, 3 papers
  • Advantages
    • Can easily control which rules remain available, and controls the hiding effects directly and intuitively

Background: Classification of current algorithms
[Figure: algorithms grouped by approach (rows) and by hiding target (columns: hide rules, hide large itemsets)]
• Data modification
  • Data distortion: Algo1a, Algo1b, Algo2a, WSDA, PDA, Algo2b, Algo2c, Naïve, MinFIA, MaxFIA, IGA, RRA, RA, SWA, Border-Based, Integer-Programming, Sanitization-Matrix
  • Data blocking: CR, CR2, GIH
• Data reconstruction: CIILM
Much more reconstruction-based work is expected

Outline
• Background
• Proposed Solution
  • Framework
  • Example
  • Discussion
• Current Progress
• Evaluation Plan

Proposed Solution: Framework of our approach
[Figure: D → 1. frequent set mining → FS; FS, R, Rh → 2. sanitization algorithm → FS’ (R − Rh); FS’ → 3. FP-tree-based inverse frequent set mining, via an FP-tree → D’]

Proposed Solution: The first two phases
• 1. Frequent set mining
  • Generate all frequent itemsets FS, with their supports and support counts, from the original database D
• 2. Perform sanitization algorithm
  • Input: FS (the output of phase 1), R, Rh
  • Output: sanitized frequent itemsets FS’
  • Process
    • Select hiding strategy
    • Identify sensitive frequent sets
    • Perform sanitization
In the best case, the sanitization algorithm ensures that exactly the non-sensitive rule set R − Rh can be derived from FS’
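
The deck does not spell out the sanitization algorithm (it is listed as future work), so the following is only a minimal sketch of one possible hiding strategy, not the author's method. It assumes a rule X ⇒ Y is hidden by dropping the itemset X ∪ Y and all of its supersets from FS, which keeps FS’ downward closed. The function name `sanitize` and the dictionary representation are hypothetical; the supports are those of the deck's running example.

```python
def sanitize(fs, sensitive_rules):
    """Return FS' by dropping every itemset that could regenerate a sensitive rule.

    fs: dict mapping frozenset(itemset) -> support count
    sensitive_rules: iterable of (antecedent, consequent) item collections

    Assumed strategy (not taken from the deck): remove X ∪ Y and all its
    supersets, so that FS' stays downward closed.
    """
    blocked = [frozenset(x) | frozenset(y) for x, y in sensitive_rules]
    return {
        itemset: count
        for itemset, count in fs.items()
        if not any(b <= itemset for b in blocked)
    }


# Frequent itemsets of the running example (sigma = 4).
FS = {
    frozenset("A"): 6, frozenset("B"): 4, frozenset("C"): 4, frozenset("D"): 4,
    frozenset("AB"): 4, frozenset("AC"): 4, frozenset("AD"): 4,
}

# Hide the sensitive rule B => A.
FS_prime = sanitize(FS, [("B", "A")])
print(sorted("".join(sorted(s)) for s in FS_prime))
```

Under this assumed strategy only AB is removed; the deck's own example goes further and also drops B from FS’, which is an equally valid choice since a single item generates no rule.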

Proposed Solution: The third phase, FP-tree-based inverse mining
• Basic idea: use the FP-tree as a transitional “bridge” that narrows the gap between a database and its frequent itemsets and makes the transformation easier
• Steps
  (i) Generate an FP-tree compatible with the frequent itemsets FS
  (ii) Generate a temporary database TempD that includes only frequent items
  (iii) Scatter infrequent items into TempD
[Figure: frequent itemsets FS → (i) → FP-tree → (ii) → temporary database TempD → (iii) → a set of compatible databases D1, D2, ...]
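
Step (ii) can be illustrated with a short sketch. It assumes the FP-tree is given as a nested dictionary mapping each item to a `(count, children)` pair, which is a hypothetical representation chosen for brevity, and it uses the tree from the deck's third-phase example. Each node emits as many copies of its prefix path as its count exceeds the sum of its children's counts.

```python
def expand_fp_tree(children, prefix=()):
    """Yield the transactions encoded by an FP-tree.

    children: dict mapping item -> (count, sub_children), a hypothetical
    nested-dict encoding of the tree.
    """
    for item, (count, sub) in children.items():
        path = prefix + (item,)
        passed_down = sum(c for c, _ in sub.values())
        # Transactions that end at this node: the part of the count
        # that is not passed on to any child.
        for _ in range(count - passed_down):
            yield path
        yield from expand_fp_tree(sub, path)


# FP-tree of the running example: A:6 -> (C:4 -> D:2), A:6 -> D:2
tree = {"A": (6, {"C": (4, {"D": (2, {})}), "D": (2, {})})}

temp_d = list(expand_fp_tree(tree))
for tid, items in enumerate(sorted(temp_d, key=lambda t: (-len(t), t)), start=1):
    print(f"T{tid}", " ".join(items))
```

The result contains two copies each of A C D, A C and A D, which is the TempD shown in the example. Finding the FP-tree itself, step (i), is the hard part and is not covered by this sketch.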

Proposed Solution: Example, the first two phases
Thresholds: MST = 66% (σ = 4), MCT = 75%

Original Database: D
TID  Items
T1   A B C E
T2   A B C
T3   A B C D
T4   A B D
T5   A D
T6   A C D

1. Frequent set mining

Frequent Itemsets: FS
Itemset  Count  Support
A        6      100%
B        4      66%
C        4      66%
D        4      66%
AB       4      66%
AC       4      66%
AD       4      66%

Association Rules: R
Rule    Confidence  Support
B ⇒ A   100%        66%
C ⇒ A   100%        66%
D ⇒ A   100%        66%

2. Perform sanitization algorithm

Frequent Itemsets: FS’
Itemset  Count  Support
A        6      100%
C        4      66%
D        4      66%
AC       4      66%
AD       4      66%

Association Rules: R − Rh
Rule    Confidence  Support
C ⇒ A   100%        66%
D ⇒ A   100%        66%
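
The first phase of this example can be checked with a brute-force sketch. It enumerates every candidate itemset, which is only reasonable for a toy database of this size; the function names `frequent_itemsets` and `rules` are hypothetical, and a real implementation would use Apriori or FP-growth instead.

```python
from itertools import combinations

D = ["ABCE", "ABC", "ABCD", "ABD", "AD", "ACD"]  # T1..T6 of the example
SIGMA = 4    # minimum support count (MST = 66% of 6 transactions)
MCT = 0.75   # minimum confidence threshold


def frequent_itemsets(db, sigma):
    """Brute-force enumeration of all itemsets with support count >= sigma."""
    rows = [frozenset(t) for t in db]
    items = sorted(set().union(*rows))
    fs = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for row in rows if row.issuperset(combo))
            if count >= sigma:
                fs[frozenset(combo)] = count
    return fs


def rules(fs, mct):
    """Generate rules X => Y with confidence supp(X ∪ Y) / supp(X) >= mct."""
    out = {}
    for itemset, count in fs.items():
        for k in range(1, len(itemset)):
            for x in combinations(sorted(itemset), k):
                x = frozenset(x)
                conf = count / fs[x]  # every subset of a frequent set is in fs
                if conf >= mct:
                    out[(x, itemset - x)] = conf
    return out


FS = frequent_itemsets(D, SIGMA)
R = rules(FS, MCT)
for (x, y), conf in sorted(R.items(), key=lambda kv: sorted(kv[0][0])):
    print("".join(sorted(x)), "=>", "".join(sorted(y)), f"{conf:.0%}")
```

The reverse rules such as A ⇒ B have confidence 4/6 ≈ 66%, below MCT, which is why only the three rules with consequent A appear in R.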

Proposed Solution: Example, the third phase (σ = 4)

Frequent Itemsets: FS’
Itemset  Count  Support
A        6      100%
C        4      66%
D        4      66%
AC       4      66%
AD       4      66%

FP-tree
A:6
  C:4
    D:2
  D:2

Released Database: D’
TID  Items
T1   A C D
T2   A C D
T3   A C
T4   A C
T5   A D
T6   A D

[Figure: compatible databases D’1, D’2, ..., D’p, ..., D’q, each obtained from the table above by scattering the infrequent item E into some of the transactions]

• Difficulties:
  • How to find the target FP-tree
  • How to control |D’|
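
A candidate released database can be checked against FS’ with a small sketch. Here "compatible" is taken to mean that the database reproduces every support count in FS’ and that no other itemset reaches σ; that reading, the helper names and the two E-scattered variants are assumptions made for illustration, and the check is brute force, suitable only for toy data.

```python
from itertools import combinations


def support(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for row in db if set(itemset) <= set(row))


def is_compatible(db, fs_prime, sigma):
    """True if db reproduces all supports in fs_prime and nothing else is frequent."""
    if any(support(db, s) != c for s, c in fs_prime.items()):
        return False
    items = sorted(set().union(*map(set, db)))
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if frozenset(combo) not in fs_prime and support(db, combo) >= sigma:
                return False
    return True


SIGMA = 4
FS_PRIME = {
    frozenset("A"): 6, frozenset("C"): 4, frozenset("D"): 4,
    frozenset("AC"): 4, frozenset("AD"): 4,
}
TEMP_D = ["ACD", "ACD", "AC", "AC", "AD", "AD"]

# Hypothetical scatterings of the infrequent item E.
with_e = ["ACDE", "ACDE", "ACE", "AC", "AD", "AD"]       # E in 3 transactions
too_many_e = ["ACDE", "ACDE", "ACE", "ACE", "AD", "AD"]  # E in 4 transactions

print(is_compatible(TEMP_D, FS_PRIME, SIGMA))
print(is_compatible(with_e, FS_PRIME, SIGMA))
print(is_compatible(too_many_e, FS_PRIME, SIGMA))
```

The last variant fails because E reaches the support count σ = 4 and so becomes a frequent item that FS’ does not contain; an infrequent item has to stay below σ however it is scattered.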

Proposed Solution: Discussion
• Sanitization algorithm
  • Compared with the earlier, popular data sanitization approaches: performs sanitization directly at the knowledge level of the data
• Inverse frequent set mining algorithm
  • Handles frequent and infrequent items separately: more efficient, and yields a large number of output databases
Our solution gives the user a knowledge-level window for performing sanitization conveniently, and generates a number of secure databases

Outline
• Background
• Proposed Solution
• Current Progress
  • Work to date
  • Future work
  • Expected contributions
• Evaluation Plan

Current Progress: Work to date
• FP-tree-based method for inverse frequent set mining (used in the 3rd phase of our framework)
  • First effort
    • Published in Proc. of BNCOD'06
    • Provides a good heuristic search strategy for rapidly finding an FP-tree that satisfies the given constraints, which in turn leads rapidly to a set of compatible databases
  • Further work
    • Accepted by Journal of Software (JOS)
    • A more mature and well-designed FP-tree-based method for inverse frequent set mining that iteratively solves a sub linear constraint problem

Current Progress: Future work
• Develop a sound sanitization algorithm with the following considerations
  • The support and confidence of the rules in R − Rh should remain unchanged as far as possible
  • Can select appropriate hiding strategies according to the different kinds of correlations among the rules in R and Rh
  • Can prevent rule-based reasoning
• Investigate how to restrict the number of transactions in the newly released database
• Develop an integrated secure association rule mining tool
  • Can protect private data
  • Can protect sensitive rules contained in the data
[Figure: DHD and KHD feeding into an integrated secure tool]

Current Progress: Expected contributions
• Reconstruction-based ARH framework
• Rule sanitization algorithm
• Inverse frequent set mining algorithm
• ARH evaluation metrics
• CHART: Credible Hiding Association Rule Tool

Outline
• Background
• Proposed Solution
• Current Progress
• Evaluation Plan

Evaluation Plan
• Datasets
  • BMS-POS
  • BMS-WebView-1
  • BMS-WebView-2
  • …
• Evaluation
  • Hiding effects
    ① Hiding Failure Ratio: |Rh(D’)| / |Rh(D)|
    ② Lost Rules Ratio: (|~Rh(D)| − |~Rh(D’)|) / |~Rh(D)|
    ③ Ghost Rules Ratio: (|R’| − |R ∩ R’|) / |R’|
  • Data utility
  • Time performance
Here ~Rh denotes the non-sensitive rules R − Rh, and R’ the rules mined from D’
[Figure: Venn diagram of R, Rh, ~Rh and R’ marking the regions ① hiding failure, ② lost rules and ③ ghost rules]
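
The three hiding-effect ratios can be computed directly from rule sets, as in the sketch below. It assumes rules are represented as hashable values (plain strings here), that Rh, R − Rh and R’ are all non-empty, and that ~Rh(D’) means the non-sensitive rules of D that are still mined from D’; the function name `hiding_metrics` and the second rule set are hypothetical.

```python
def hiding_metrics(r, r_h, r_prime):
    """Return (hiding failure, lost rules, ghost rules) ratios.

    r: rules mined from D; r_h: sensitive rules; r_prime: rules mined from D'.
    Assumes r_h, r - r_h and r_prime are all non-empty.
    """
    r, r_h, r_prime = set(r), set(r_h), set(r_prime)
    non_sensitive = r - r_h
    # 1. Sensitive rules that can still be mined from D'.
    hiding_failure = len(r_h & r_prime) / len(r_h)
    # 2. Non-sensitive rules that disappeared.
    lost = (len(non_sensitive) - len(non_sensitive & r_prime)) / len(non_sensitive)
    # 3. Rules in D' that were never in D.
    ghost = (len(r_prime) - len(r & r_prime)) / len(r_prime)
    return hiding_failure, lost, ghost


R = {"B=>A", "C=>A", "D=>A"}
R_H = {"B=>A"}

# The deck's running example: both non-sensitive rules survive, nothing else appears.
ideal = hiding_metrics(R, R_H, {"C=>A", "D=>A"})

# A hypothetical poor release: B=>A survives, D=>A is lost, E=>A is a ghost rule.
poor = hiding_metrics(R, R_H, {"B=>A", "C=>A", "E=>A"})
print(ideal, poor)
```

All three ratios are zero for the running example, which is the best case the sanitization phase aims for.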

Reconstruction-based Association Rule Hiding: Summary
[Figure: the framework again: D → 1. frequent set mining → FS; FS, R, Rh → 2. sanitization algorithm → FS’ (R − Rh); FS’ → 3. FP-tree-based inverse frequent set mining, via an FP-tree → D’]
• 2. Sanitization algorithm: ongoing!
• 3. FP-tree-based inverse frequent set mining: basically completed!

Any suggestions or questions? yhguo@pku.edu.cn
Thanks for your attention
