Re-development of the Cell Suppression Methodology at the US Census Bureau Philip Steel, James Fagan, Paul Massell, Richard Moore Jr., John Slanta, Bei Wang
Background • Jewett’s network flow program • Need for new program • 2012 economic census • LP (linear programming) methodology • R&M cell suppression team
Processing Model • Preprocessing • Create table description • Determine primaries • Unduplicate • Sequential processing of primaries • Queue reduction • Test company protection (aggregate/supercell) • Sequential processing of supercells
Table relations • Marginals are the sum of interior cells • Geographic relationships tend to generate our most complex sets of table relations • State is the sum of the metropolitan areas within the state and the balance • State is also the sum of counties • Relations are of the form A=B+...+Z, where A,B,…,Z are all rows, all columns, or all levels that define some Cartesian integer space (i,j,k) • Duplicates are recorded as A=B (e.g., a county is also a place)
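The relation forms above (A=B+...+Z and the duplicate form A=B) can be illustrated with a minimal Python sketch. The row names, cell values, and the `violated` helper are all hypothetical and chosen only for illustration; they are not taken from the Census Bureau program.

```python
# Hypothetical representation: each relation says one row is the sum of others.
# All row names and values below are invented for illustration.
relations = [
    ("state", ["metro_a", "metro_b", "balance"]),     # state = metro areas + balance
    ("state", ["county_1", "county_2", "county_3"]),  # state = sum of counties
]
duplicates = [("county_3", "place_x")]  # A = B: a county that is also a place

values = {
    "state": 100, "metro_a": 40, "metro_b": 35, "balance": 25,
    "county_1": 50, "county_2": 30, "county_3": 20, "place_x": 20,
}

def violated(relations, duplicates, values):
    """Return the relations and duplicates that the cell values fail to satisfy."""
    bad = []
    for total, parts in relations:
        if values[total] != sum(values[p] for p in parts):
            bad.append((total, parts))
    for a, b in duplicates:
        if values[a] != values[b]:
            bad.append((a, [b]))
    return bad

print(violated(relations, duplicates, values))  # prints []
```

Note that the same total ("state") appears in two relations, which is the overlap that makes geographic tables harder to describe than a single hierarchy.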
Additivity constraint generator (based on row relations) (b) for ii = 1, ..., rr, j = 1, ..., cols, k = 1, ..., levs : limr(ii) ≥ 1, ws(ii,j,k) = 0
Bounds h(i,j,k) = max(0, v(i,j,k)) for i = 1, ..., rows, j = 1, ..., cols, k = 1, ..., levs : (i,j,k) ∈ A
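As a small worked instance of the bound formula, the sketch below computes h = max(0, v) per cell. It assumes that v holds the cell values and that the set A is simply the set of cells stored in the dictionary; the slide does not define A, so both the reading and the `bounds` helper are assumptions.

```python
def bounds(v):
    """Compute h = max(0, v) for every cell (i, j, k) present in v.

    Assumption: the keys of v play the role of the set A on the slide.
    """
    return {cell: max(0, value) for cell, value in v.items()}

# Invented cell values keyed by (i, j, k); one is negative on purpose.
v = {(1, 1, 1): 120, (1, 2, 1): 0, (2, 1, 1): -15}
h = bounds(v)
print(h[(2, 1, 1)])  # prints 0: a negative value is clipped to a zero bound
```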
Skip P • Model changes only on the target primary constraints. • How can the minimal solution for one target be transformed to be a solution for another target? • By applying a scalar that converts the flow through the second P to the fixed value of the model! • Can be done when the scalar does not violate the bounding conditions and the complementary flow in the target is 0. • I.e., when the solution's flow through the secondary target exceeds its protection requirement.
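The Skip P test can be sketched as follows. This is only an illustration under stated assumptions: flows are stored as a dictionary per direction, "complementary flow" is modeled as the flow in the opposite direction through the same cell, and the function name, cell names, and numbers are hypothetical. The sketch checks only whether the rescaled flow stays within the bounds; it does not verify that the result is minimal.

```python
def skip_p_scalar(flow, complementary_flow, upper, second, requirement):
    """Return the scalar that turns an existing solution into one for `second`,
    or None if the second primary cannot be skipped.

    Assumptions: `flow` and `complementary_flow` map cell -> flow in the two
    directions, `upper` maps cell -> bound, `requirement` is the fixed value
    the model demands through the target.
    """
    through = flow.get(second, 0.0)
    if through <= 0 or complementary_flow.get(second, 0.0) != 0:
        return None  # no usable flow through the second primary
    s = requirement / through
    if any(s * f > upper[cell] for cell, f in flow.items()):
        return None  # rescaled flow would violate a bounding condition
    return s

# Invented solution found while protecting P1; P2 also carries flow.
flow = {"P1": 10.0, "P2": 30.0, "c1": 20.0}
comp = {"P1": 0.0, "P2": 0.0, "c1": 0.0}
upper = {"P1": 50.0, "P2": 50.0, "c1": 50.0}

print(skip_p_scalar(flow, comp, upper, "P2", 15.0))  # prints 0.5
print(skip_p_scalar(flow, comp, upper, "P2", 60.0))  # prints None
```

In the first call the flow through P2 (30) exceeds its requirement (15), so the scalar is below 1 and cannot push any cell past its bound; in the second call the scalar is 2 and the bound on P2 is violated.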
Empirical confirmation • In our large sparse tables, we would see a lot of objective 0 results. • That is, the solver finds a 0 cost pattern to protect the primary … it is already protected! • Skip P eliminated most objective 0 results and left intact the sequence of positive objectives and their solutions.
Fat solution • CPLEX is using a dual simplex method to find solutions. • The solutions have a growing 0 cost component, with many more cells than are required to protect the target P. • The flow in the 0 cost cells far exceeds what is required to protect the target P (except in very small or dense examples). • The solution “lights up” the possible flows in the table’s current state, giving a “fat” solution.
dg10 sector 44 • Cartesian cells: 367,605 (2d) • Non-zero cells: 159,849 • Relations: 283 (row and column) • 14,000 potential tables, linked • P: 95,062 • LP problems: 10,604 • Typical LP size • Reduced LP has 64826 rows, 156809 columns, and 528838 nonzeros • Time: 8hr:37min (includes everything)
Comparison between network and LP on one dataset (of hundreds) from 2007 • Statistics based on unduplicated data with an approximation of a published status flag
Thank you! philip.m.steel@census.gov