SUDA: A program for Detecting Special Uniques Mark Elliot, Anna Manning, Ken Mayes, John Gurd, Michael Bane University of Manchester Mark.Elliot@manchester.ac.uk Cathie Marsh Centre for Census and Survey Research, University of Manchester
Overview • Principle of Special Uniqueness • Basic SUDA algorithms • Description of Software • Current and Future work
Special Uniqueness • Definition: a microdata record that is sample unique on a key variable set K and is also sample unique on a proper subset of K.
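The definition can be checked directly by brute force on a toy sample. The sketch below is only an illustration of the definition, not SUDA's implementation; the function names, the toy records and the choice of plain dictionaries are all assumptions made for this example.

```python
from itertools import combinations


def is_unique(records, index, variables):
    """True if records[index] is the only record with its values on `variables`."""
    target = tuple(records[index][v] for v in variables)
    matches = sum(1 for r in records if tuple(r[v] for v in variables) == target)
    return matches == 1


def is_special_unique(records, index, key):
    """True if the record is unique on `key` and on at least one
    non-empty proper subset of `key` (hypothetical helper)."""
    if not is_unique(records, index, key):
        return False
    return any(
        is_unique(records, index, subset)
        for size in range(1, len(key))
        for subset in combinations(key, size)
    )


# Toy sample (invented values, for illustration only).
sample = [
    {"age": 16, "sex": "F", "mstat": "widowed"},
    {"age": 40, "sex": "F", "mstat": "married"},
    {"age": 40, "sex": "F", "mstat": "single"},
    {"age": 40, "sex": "M", "mstat": "married"},
    {"age": 16, "sex": "F", "mstat": "married"},
]
key = ("age", "sex", "mstat")

special = is_special_unique(sample, 0, key)   # unique on 'mstat' alone
random_u = is_special_unique(sample, 1, key)  # unique only on the full key
```

In this toy sample, record 0 is a special unique because it is already unique on 'mstat', while record 1 is unique only on the full key and so is a random unique.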
Elliot (2000), Elliot and Manning (2002) and Merrett et al. (2004) show that special uniques are rarer in the population than non-special (a.k.a. random) uniques.
Paradigm proposition • [Figure: two tables of cell counts, one of population counts and one of sample counts] • Where a = 1: pr(A = 1 | a + b = 1) ≥ pr(A = 1 | a + b > 1) • Related to the neighbourhoods concept (Rinott and Shlomo, 2005)
Hunt for proof • As yet we have no proof for the proposition. • Simulation work shows us that: • if the variables are not independent then the proposition is true. • the degree of marginalisation is related to pr(A=1). This is true by induction.
Design Principle • SUDA is designed around the observation that 'Every superset of a unique attribute set (minimal or otherwise) is itself unique' (referred to as the Superset Relationship; Elliot et al. 2002).
The Minimal uniques search • The SUDA algorithm searches the lattice of all possible uniqueness patterns for unique combinations. • The lattice can get very big as the number of variables increases. • Efficiency savings are made through grouping records of
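A level-by-level search that uses the Superset Relationship to prune the lattice might look like the sketch below. It is a naive illustration for a single record, assuming small toy data; it is not the SUDA algorithm itself and includes none of its efficiency savings. The function name and the sample records are hypothetical.

```python
from itertools import combinations


def minimal_sample_uniques(records, index, key, max_size=None):
    """Return the minimal unique variable subsets (MSUs) of records[index].

    Subsets are visited in order of increasing size, so any unique subset
    that contains no previously found MSU must itself be minimal.
    """
    max_size = len(key) if max_size is None else max_size
    msus = []
    for size in range(1, max_size + 1):
        for subset in combinations(key, size):
            # Superset Relationship: a superset of an MSU is unique but
            # cannot be minimal, so it is skipped without counting.
            if any(set(m) <= set(subset) for m in msus):
                continue
            target = tuple(records[index][v] for v in subset)
            matches = sum(
                1 for r in records if tuple(r[v] for v in subset) == target
            )
            if matches == 1:
                msus.append(subset)
    return msus


# Toy sample (invented values, for illustration only).
sample = [
    {"age": 16, "sex": "F", "mstat": "widowed"},
    {"age": 40, "sex": "F", "mstat": "married"},
    {"age": 40, "sex": "F", "mstat": "single"},
    {"age": 40, "sex": "M", "mstat": "married"},
    {"age": 16, "sex": "F", "mstat": "married"},
]
key = ("age", "sex", "mstat")

msus_by_record = [minimal_sample_uniques(sample, i, key) for i in range(len(sample))]
```

The `max_size` parameter mirrors the idea of capping the MSU size the user is interested in; with a cap, larger MSUs are simply not reported.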
The IS score • The IS metric is used in the subsequent output metrics. In essence, it corresponds to the proportion of the lattice which is unique for a given record. • This is a principled construct but not a standard statistical one. It is, though, strongly correlated with the underlying risk measure 1/Fj.
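One literal reading of "the proportion of the lattice which is unique for a given record" is the share of non-empty variable subsets on which the record is sample unique. The sketch below computes that share by brute force. It is an assumption-laden illustration of the idea only: the actual IS metric is defined in the paper and may weight the lattice differently, and the function name and toy data are invented here.

```python
from fractions import Fraction
from itertools import combinations


def proportion_of_lattice_unique(records, index, key):
    """Share of the non-empty subsets of `key` on which records[index]
    is sample unique (a literal, unweighted reading of the slide)."""
    unique_count = 0
    total = 0
    for size in range(1, len(key) + 1):
        for subset in combinations(key, size):
            total += 1
            target = tuple(records[index][v] for v in subset)
            matches = sum(
                1 for r in records if tuple(r[v] for v in subset) == target
            )
            if matches == 1:
                unique_count += 1
    return Fraction(unique_count, total)


# Toy sample (invented values, for illustration only).
sample = [
    {"age": 16, "sex": "F", "mstat": "widowed"},
    {"age": 40, "sex": "F", "mstat": "married"},
    {"age": 40, "sex": "F", "mstat": "single"},
    {"age": 40, "sex": "M", "mstat": "married"},
    {"age": 16, "sex": "F", "mstat": "married"},
]
key = ("age", "sex", "mstat")

share_special = proportion_of_lattice_unique(sample, 0, key)
share_random = proportion_of_lattice_unique(sample, 1, key)
```

Because every superset of a unique subset is itself unique, a record with a small MSU is unique on a large part of the lattice, while a record that is unique only on the full key is unique on a single node.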
Record level output • IS metric: the total IS metric, calculated as described in section 2 of the paper. • 3) Scoring metric: the 3rd column contains either the Proportion of lattice metric or the DIS-SUDA metric, depending on which the user asked for. • 4 to N) MSUs: the sequence of columns after the output metrics gives the number of MSUs of each size for the record, up to the maximum size the user specified. • N+1 to N+K) Contribution percentage: the final set of columns is headed with the names of the variables the user has chosen. These columns record the percentage contribution of each variable to the total IS metric. This is simply the IS metric for the MSUs involving that variable divided by the IS metric for the record.
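The contribution percentage described above is a simple ratio, sketched below for one record. The per-MSU scores are hypothetical inputs made up for the example (how SUDA scores an individual MSU is defined in the paper, not here), as are the function name and the variable names used in the toy data.

```python
def contribution_percentages(msu_scores, variables):
    """Percentage contribution of each variable to a record's total score.

    `msu_scores` is a list of (msu, score) pairs, where `msu` is a tuple of
    variable names and `score` is that MSU's share of the record's IS metric
    (hypothetical values supplied by the caller).
    """
    total = sum(score for _, score in msu_scores)
    if total == 0:
        return {v: 0.0 for v in variables}
    return {
        v: 100.0 * sum(score for msu, score in msu_scores if v in msu) / total
        for v in variables
    }


# Hypothetical MSUs and scores for a single record.
record_msus = [
    (("age",), 6.0),
    (("age", "sex"), 2.0),
    (("mstat", "econpr"), 2.0),
]
variables = ("age", "sex", "mstat", "econpr")

contributions = contribution_percentages(record_msus, variables)
```

Under this reading, the percentages need not sum to 100, because an MSU that involves several variables is counted once for each of them.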
File level output • Example attribute contribution output:
col#2 att 'age' percentage contribution 88.8954
col#3 att 'sex' percentage contribution 14.5084
col#4 att 'mstat' percentage contribution 26.7168
col#5 att 'econpr' percentage contribution 43.2581
col#6 att 'residents' percentage contribution 47.5376
col#7 att 'depchild' percentage contribution 26.2359
Example attribute value contribution output:
col#2 att 'age'=0 percentage contribution 0.2813
col#2 att 'age'=1 percentage contribution 0.4001
col#2 att 'age'=2 percentage contribution 0.5090
col#2 att 'age'=3 percentage contribution 0.3256
GRID STAD • This project aims to Grid-enable SUDA. This further increases efficiency and effectively overcomes all normal limits on SUDA's operation. • Distributed analyses might seem like overkill, but we have big plans…
Algorithm improvements • We have so far avoided the lure of making modelling assumptions in SUDA. Our approach has been non-parametric. • However, we are considering biting the bullet and evaluating a combination of SUDA's combinatorial power with a more theoretically grounded, model-based approach.
SUDA 2 • A new recursive algorithm solves many of the computational limitations of SUDA 1. • Full assessment of cross-classifications of up to 50 variables is now possible in usable time. Scenario keys of 15 or so variables run in seconds.