SUDA: A program for Detecting Special Uniques

Mark Elliot, Anna Manning, Ken Mayes, John Gurd, Michael Bane, University of Manchester. Mark.Elliot@manchester.ac.uk. Cathie Marsh Centre for Census and Survey Research, University of Manchester

Overview • Principle of Special Uniqueness • Basic SUDA algorithms • Description of Software • Current and Future work

Principles of Special Uniqueness

Special Uniqueness Definition: A microdata record that is sample unique on the key variable set K and is also sample unique on a subset of K.
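The definition can be illustrated with a minimal sketch. It assumes records are held as dictionaries and reads "subset of K" as a proper, non-empty subset; the function names and the toy data are hypothetical and are not taken from the SUDA software.

```python
from collections import Counter
from itertools import combinations


def is_unique(records, index, variables):
    """True if record `index` is the only record with its values on `variables`."""
    def key(rec):
        return tuple(rec[v] for v in variables)

    counts = Counter(key(rec) for rec in records)
    return counts[key(records[index])] == 1


def is_special_unique(records, index, key_vars):
    """Sample unique on key_vars AND unique on some proper, non-empty subset.

    Assumption: 'subset of K' is read as a proper subset; otherwise every
    sample unique would trivially qualify.
    """
    if not is_unique(records, index, key_vars):
        return False
    return any(
        is_unique(records, index, subset)
        for size in range(1, len(key_vars))
        for subset in combinations(key_vars, size)
    )


# Hypothetical toy sample with key variable set K = (age, mstat).
sample = [
    {"age": 16, "mstat": "widowed"},
    {"age": 16, "mstat": "single"},
    {"age": 40, "mstat": "widowed"},
    {"age": 40, "mstat": "single"},
    {"age": 90, "mstat": "single"},
]
K = ("age", "mstat")
flags = [is_special_unique(sample, i, K) for i in range(len(sample))]
```

In this toy sample every record is sample unique on K, but only the last one is special: it is already unique on age alone. The other four are random uniques, since each of their values is shared with another record on every single variable.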

Elliot (2000), Elliot and Manning (2002) and Merrett et al. (2004) show that special uniques are rarer in the population than non-special (a.k.a. random) uniques.

Paradigm proposition [Tables: population counts; sample counts] Where a=1: pr(A=1 | a+b=1) ≥ pr(A=1 | a+b>1). Related to the neighbourhoods concept; Rinott and Shlomo (2005)

Hunt for proof • As yet we have no proof for the proposition. • Simulation work shows us that: • if the variables are not independent then the proposition is true. • the degree of marginalisation is related to pr(A=1). This is true by induction.

Basics of SUDA design

Design Principle • SUDA is designed around the observation that 'Every superset of a unique attribute set (minimal or otherwise) is itself unique' (referred to as the Superset Relationship; Elliot et al. 2002).
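The Superset Relationship can be checked exhaustively on a small sample. The sketch below is only an illustration under assumed data structures (records as dictionaries, attribute sets as frozensets); the helper names are hypothetical and this is not how SUDA itself is implemented.

```python
from collections import Counter
from itertools import combinations


def unique_sets(records, index, variables):
    """All non-empty variable subsets on which record `index` is unique."""
    found = set()
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            counts = Counter(tuple(r[v] for v in subset) for r in records)
            if counts[tuple(records[index][v] for v in subset)] == 1:
                found.add(frozenset(subset))
    return found


def superset_relationship_holds(records, variables):
    """Check that every superset of a unique attribute set is itself unique."""
    all_subsets = [
        frozenset(c)
        for size in range(1, len(variables) + 1)
        for c in combinations(variables, size)
    ]
    for i in range(len(records)):
        uniq = unique_sets(records, i, variables)
        for u in uniq:
            for s in all_subsets:
                # `u < s` tests that u is a proper subset of s.
                if u < s and s not in uniq:
                    return False
    return True


# Hypothetical toy sample.
sample = [
    {"age": 16, "sex": "F", "mstat": "widowed"},
    {"age": 16, "sex": "M", "mstat": "single"},
    {"age": 40, "sex": "F", "mstat": "single"},
    {"age": 40, "sex": "M", "mstat": "married"},
    {"age": 40, "sex": "M", "mstat": "married"},
]
holds = superset_relationship_holds(sample, ("age", "sex", "mstat"))
```

The check always succeeds, because adding variables to a key can only split groups of matching records, never merge them. The practical consequence is that a search only needs to locate the minimal unique sets; everything above them in the lattice is known to be unique without being tested.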

The Minimal uniques search • The SUDA algorithm searches the lattice of all possible uniqueness patterns within the data for unique combinations. • The lattice can get very big as the number of variables increases. • Efficiency savings are made through grouping records of
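A brute-force version of the minimal uniques search for a single record might look like the sketch below. It walks the lattice level by level and uses the Superset Relationship to skip any attribute set that contains an already-found minimal sample unique (MSU). This is an assumed, simplified illustration; it does not reproduce SUDA's record grouping or its actual search order, and the function name and `max_size` parameter are hypothetical.

```python
from itertools import combinations


def minimal_sample_uniques(records, index, variables, max_size=None):
    """Return the minimal unique attribute sets (MSUs) of record `index`."""
    if max_size is None:
        max_size = len(variables)
    target = records[index]
    msus = []
    for size in range(1, max_size + 1):  # smallest attribute sets first
        for subset in combinations(variables, size):
            candidate = frozenset(subset)
            # A superset of a known MSU is unique but not minimal: skip it.
            if any(m <= candidate for m in msus):
                continue
            matches = sum(
                all(rec[v] == target[v] for v in subset) for rec in records
            )
            if matches == 1:  # only the target record itself matches
                msus.append(candidate)
    return msus


# Hypothetical toy sample.
sample = [
    {"age": 16, "sex": "F", "mstat": "widowed"},
    {"age": 16, "sex": "M", "mstat": "single"},
    {"age": 40, "sex": "F", "mstat": "single"},
    {"age": 40, "sex": "M", "mstat": "married"},
    {"age": 40, "sex": "M", "mstat": "married"},
]
variables = ("age", "sex", "mstat")
msus_first = minimal_sample_uniques(sample, 0, variables)
msus_duplicate = minimal_sample_uniques(sample, 3, variables)
```

For the first record the search finds two MSUs, one of size 1 (mstat) and one of size 2 (age with sex). The fourth record has an exact duplicate, so it is not unique on any attribute set and has no MSUs. The number of candidate subsets doubles with each added variable, which is why the lattice becomes very large.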

Example lattice

The IS score • The IS metric is used in subsequent output metrics. In essence, it corresponds to the proportion of the lattice which is unique for a given record. • This is a principled construct but not a standard statistical one. It is, though, strongly correlated with the underlying risk measure 1/F_j.
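The "proportion of the lattice which is unique" idea can be sketched as follows. The exact IS formula is not given here, so this is only an assumed illustration: it counts the non-empty attribute sets that contain at least one of a record's minimal unique sets, relying on the fact that every superset of a unique set is unique. The function name and the example MSUs are hypothetical.

```python
from itertools import combinations


def unique_lattice_proportion(msus, variables):
    """Fraction of non-empty attribute sets on which the record is unique.

    `msus` is the record's list of minimal unique attribute sets. An attribute
    set is unique for the record exactly when it contains one of them.
    """
    msus = [frozenset(m) for m in msus]
    total = 0
    unique = 0
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            total += 1
            if any(m <= frozenset(subset) for m in msus):
                unique += 1
    return unique / total


variables = ("age", "sex", "mstat")
# Hypothetical record with one MSU of size 1 and one MSU of size 2.
proportion = unique_lattice_proportion([{"mstat"}, {"age", "sex"}], variables)
# A record with no MSUs is not unique anywhere in the lattice.
no_risk = unique_lattice_proportion([], variables)
```

With three variables there are seven non-empty attribute sets. Four of them contain mstat and one more is the pair age with sex, so the record is unique on five of the seven. Smaller MSUs cover more of the lattice, which matches the intuition that a record unique on very few variables is the riskier one.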

Description of Software

Record level output
• IS metric: This is the total IS metric, calculated as described in section 2 of the paper.
• 3) Scoring metric: The 3rd column contains either the Proportion of lattice metric or the DIS-SUDA metric, depending on which the user asked for.
• 4->N) MSUs: The sequence of columns after the output metrics gives the number of MSUs of each size for the record, up to the maximum size the user specified.
• N+1 -> N+K) Contribution percentage: The final set of columns is headed with the names of the variables the user has chosen. These columns record the percentage contribution of each variable to the total IS metric. This is simply the IS metric for the MSUs involving that variable over the IS metric for the record.
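The contribution percentage described above can be sketched as a small calculation. The per-MSU scores used here are invented for illustration, since the scoring formula is defined elsewhere; the function name and data layout are likewise assumptions rather than the software's actual interface.

```python
def contribution_percentages(msu_scores, variables):
    """Percentage contribution of each variable to a record's total score.

    `msu_scores` maps each MSU (a frozenset of variable names) to its score.
    A variable's contribution is the summed score of the MSUs that involve it,
    divided by the record's total score.
    """
    total = sum(msu_scores.values())
    if total == 0:
        return {v: 0.0 for v in variables}
    return {
        v: 100.0 * sum(s for msu, s in msu_scores.items() if v in msu) / total
        for v in variables
    }


# Hypothetical per-MSU scores for one record.
scores = {
    frozenset({"mstat"}): 4.0,
    frozenset({"age", "sex"}): 2.0,
}
contrib = contribution_percentages(scores, ("age", "sex", "mstat"))
```

Here mstat accounts for about 66.67% of the record's score and age and sex for about 33.33% each. The percentages add up to more than 100 because an MSU with several variables is counted once for each variable it involves.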

File level output
Example Attribute contribution
col#2 att 'age' percentage contribution 88.8954
col#3 att 'sex' percentage contribution 14.5084
col#4 att 'mstat' percentage contribution 26.7168
col#5 att 'econpr' percentage contribution 43.2581
col#6 att 'residents' percentage contribution 47.5376
col#7 att 'depchild' percentage contribution 26.2359

Example Attribute value contribution output
col#2 att 'age'=0 percentage contribution 0.2813
col#2 att 'age'=1 percentage contribution 0.4001
col#2 att 'age'=2 percentage contribution 0.5090
col#2 att 'age'=3 percentage contribution 0.3256

Current and Future work

GRID STAD • This project aims to Grid-enable SUDA. This further increases efficiency and effectively overcomes all normal limits on SUDA's operation. • Distributed analyses might seem like overkill, but we have big plans...

Algorithm improvements • We have so far avoided the lure of making modelling assumptions in SUDA. Our approach has been non-parametric. • However, we are considering biting the bullet and evaluating the combination of SUDA's combinatorial power with a more theoretically grounded, model-based approach.

SUDA 2 • A new recursive algorithm solves many of the computational limitations of SUDA 1. • Full assessment of cross-classifications of up to 50 variables is now possible in usable time. Scenario keys of 15 or so variables run in seconds.
