Analyzing Socio-Spatial Structure of French Cities Using Gridded Census Data

Using gridded census data to analyze the socio-spatial structure of French cities
Short history of grids at INSEE

1) The use of gridded data at INSEE
2) The production of gridded data in the complex environment of the French census

Starting point: sub-city districts for public action
• A question from the DIV (« Délégation Interministérielle à la Ville »), the ministry responsible for urban social policies (2005)
• Context: the 2005 urban riots. Are public actions ineffective, or are their target areas badly chosen?
  • Redesign new ones
  • Check the relevance of existing ones
• Question: how to check the relevance of deprived districts designed by local authorities?
• Existing zones cannot be used
  • Existing districts: outdated, partial
  • Existing output areas for statistical products: too large, too much internal heterogeneity
• No data source was completely usable at point level.
• => Use more detailed data, but how to transform a set of points into a zone boundary?

The tool: an example, what the data says…
Poitiers, health insurance register
• Blue: existing deprived districts
• Red: areas of high probability of low income
• Grey shade: population density
• 200 m² grid cells

Cluster  Area  Total count  Subpopulation share  x       y
Z0       3017  20620        0.18429              448135  2177247
Z03      24    2113         0.39044              448691  2178662
Z02      24    1583         0.49274              447272  2176006
Z01      32    1075         0.41023              446114  2178859
Z04      15    514          0.47860              444778  2176134

… and an effective result
• Blue: new deprived districts as defined by local government

The tool: how it works. Grids everywhere!
• Probability density estimates using the kernel method
• -> Gridded data instead of individual data
  • Part of the data cannot be fully localized (down to the address)
  • Quicker processing without quality loss
  • Weaker confidentiality issues, allowing use in the regional delegations of INSEE
• Estimate 1: whole population in the data source
• Estimate 2: « deprived » population relative to this data source
• Ratio of probabilities to compute relative risk
• -> Grid cells as the support of the estimated functions
• Cartography of high estimated risk
• Zones are a selection of contiguous grid cells using an automatic rule
• The tool signals areas, it does not design them
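The two density estimates and their ratio can be sketched as follows. This is a minimal illustration, not the INSEE tool itself: the Gaussian kernel, the 300 m bandwidth, the 5 x 5 toy grid and every function and variable name are assumptions made for the example.

```python
import math

def kernel_density(cells, counts, bandwidth):
    """Gaussian kernel density at each cell centre, normalised by the total count."""
    total = sum(counts)
    norm = 2.0 * math.pi * bandwidth ** 2 * total
    density = []
    for gx, gy in cells:
        s = 0.0
        for (px, py), c in zip(cells, counts):
            d2 = (gx - px) ** 2 + (gy - py) ** 2
            s += c * math.exp(-0.5 * d2 / bandwidth ** 2)
        density.append(s / norm)
    return density

def relative_risk(cells, whole_counts, sub_counts, bandwidth):
    """Ratio of the sub-population density to the whole-population density."""
    whole = kernel_density(cells, whole_counts, bandwidth)
    sub = kernel_density(cells, sub_counts, bandwidth)
    return [s / w if w > 0 else 0.0 for s, w in zip(sub, whole)]

# Toy 5 x 5 grid of 200 m cells (hypothetical counts).
cells = [(x * 200.0, y * 200.0) for y in range(5) for x in range(5)]
whole_counts = [100] * 25   # whole population per cell
sub_counts = [10] * 25      # "deprived" population per cell
sub_counts[0] = 60          # one corner cell concentrates the sub-population

risk = relative_risk(cells, whole_counts, sub_counts, bandwidth=300.0)
```

A ratio above 1 marks cells where the sub-population is over-represented relative to its overall share; in this toy grid that is the corner cell and its surroundings.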

From data to final map
• Inputs: sub-population and whole population
• Stages: rough data -> probability estimates -> relative risk estimate
1) Simplify the maps
2) Combine the maps
3) Extract the outlines
4) Superpose the maps
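The outline extraction step relies on selecting contiguous grid cells. The slides do not state the automatic rule, so the sketch below assumes a plain threshold on relative risk followed by grouping of edge-adjacent cells; the threshold value, the toy grid and the function name are hypothetical.

```python
from collections import deque

def contiguous_zones(risk, threshold):
    """Group cells with risk >= threshold into zones of edge-adjacent cells.

    risk maps (col, row) grid indices to a relative risk value.
    """
    selected = {cell for cell, value in risk.items() if value >= threshold}
    seen = set()
    zones = []
    for start in sorted(selected):
        if start in seen:
            continue
        zone = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            col, row = queue.popleft()
            zone.add((col, row))
            for nb in ((col + 1, row), (col - 1, row), (col, row + 1), (col, row - 1)):
                if nb in selected and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        zones.append(zone)
    return zones

# Toy 4 x 4 risk surface with two separate high-risk patches.
risk_grid = {(c, r): 0.8 for c in range(4) for r in range(4)}
risk_grid[(0, 0)] = 1.9
risk_grid[(1, 0)] = 1.6
risk_grid[(3, 3)] = 2.2

zones = contiguous_zones(risk_grid, threshold=1.5)
```

Each returned set is one candidate zone; its boundary can then be drawn around the member cells.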

And the census?
• The tool is now used (within INSEE) to describe other phenomena, with every available source
• Using the census
• Small LAU2s (out of reach for the tool: no detail at small geographical levels, but mainly not urban)
  • Exhaustive
  • Data collection over 5 years (each LAU in one year)
• Large LAU2s (city cores)
  • 40 % sample
  • Address register maintained for sampling purposes, used as a reference when localizing administrative registers
  • Data collection over 5 years

Idea
• Compare census data and administrative data at locations where both are available, to estimate together:
  • The administrative bias
  • The time shift

Filling the gaps of census collection
• Map legend: collected census data; data from an administrative register; an address from the census sampling register
• Can we deduce this from that?

GWR
• A regression, but not a global one
• Standard regression gives correlated residuals: the spatial distribution will be biased
• Regression models with autocorrelated residuals do not seem easy to apply (different variograms for different parts of the city)
• => Local regressions (Geographically Weighted Regression, cf. Fotheringham)

GWR: space as an explanatory factor
• Local subsets for regressions
• Weights decreasing with distance
• Estimates
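A local regression of this kind can be sketched with a single explanatory variable, for which weighted least squares has a closed form. The Gaussian distance decay, the fixed 200 m bandwidth, the synthetic data and all names below are assumptions for illustration only.

```python
import math

def gwr_fit_at(target, sites, x, y, bandwidth):
    """Weighted least squares of y on x, with weights decreasing with distance to target.

    Returns the local (intercept, slope).
    """
    w = [math.exp(-0.5 * (math.dist(target, s) / bandwidth) ** 2) for s in sites]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Synthetic cells along a line: the census/administrative ratio is 0.5 in the
# west and 1.5 in the east, so a single global regression would mix the two.
sites = [(i * 100.0, 0.0) for i in range(20)]
admin = [1.0 + (i % 4) for i in range(20)]                              # explanatory variable
census = [(0.5 if i < 10 else 1.5) * admin[i] for i in range(20)]       # response

_, slope_west = gwr_fit_at((0.0, 0.0), sites, admin, census, bandwidth=200.0)
_, slope_east = gwr_fit_at((1900.0, 0.0), sites, admin, census, bandwidth=200.0)
```

The two fitted slopes recover the local relationships near each end of the line, which is the sense in which space acts as an explanatory factor.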

Data: grids coming back!
• Two kinds of data
  • Census data + explanatory data (administrative data and dwellings from the address register)
  • Explanatory data only
• Administrative data not connected to the address register (20 %) is ignored, but the corresponding addresses are used with zeroed administrative data
• … added up to avoid singularity problems in matrices during estimation
• -> grids
• Multiplication of cells by intersecting with:
  • Housing type
  • Administrative districts
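The aggregation described above can be sketched as a grouping of address records by grid cell, housing type and district. The record layout, the 200 m cell size and all names are hypothetical; an address without matched administrative data is kept with its administrative count set to zero, as stated on the slide.

```python
from collections import defaultdict

CELL_SIZE = 200.0  # assumed cell size in metres

def aggregate_to_cells(addresses):
    """Sum dwellings and administrative counts per (col, row, housing, district)."""
    cells = defaultdict(lambda: {"dwellings": 0, "admin": 0})
    for a in addresses:
        key = (int(a["x"] // CELL_SIZE), int(a["y"] // CELL_SIZE),
               a["housing"], a["district"])
        cells[key]["dwellings"] += a["dwellings"]
        # Unmatched administrative data (None) contributes zero.
        cells[key]["admin"] += a["admin"] if a["admin"] is not None else 0
    return dict(cells)

addresses = [
    {"x": 110.0, "y": 90.0, "housing": "house", "district": "A", "dwellings": 1, "admin": 3},
    {"x": 150.0, "y": 20.0, "housing": "house", "district": "A", "dwellings": 1, "admin": None},
    {"x": 130.0, "y": 60.0, "housing": "apartment", "district": "A", "dwellings": 12, "admin": 25},
    {"x": 410.0, "y": 30.0, "housing": "apartment", "district": "B", "dwellings": 8, "admin": 14},
]
grid_cells = aggregate_to_cells(addresses)
```

Crossing with housing type splits the first geographic cell into two records, which is the "multiplication of cells" mentioned above.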

Internals
• Weights
  • The actual weighting function doesn't really matter
  • Classical
  • Added penalty (doubled distance) when cells have different building types (houses vs. apartments)
• Radius
  • Derived from a fixed number of neighbours
  • The actual number of neighbours minimizes the corrected Akaike Information Criterion (AICc)
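These internals can be sketched under several assumptions that are not in the slides: a bisquare weighting function, a single explanatory variable, the AICc expression commonly given for GWR, 2n ln(σ̂) + n ln(2π) + n (n + tr(S)) / (n − 2 − tr(S)), with σ̂² taken as RSS/n, and synthetic data. All names are hypothetical.

```python
import math

def penalised_distance(a, b, type_a, type_b):
    """Euclidean distance, doubled when the two cells have different building types."""
    d = math.dist(a, b)
    return 2.0 * d if type_a != type_b else d

def local_weights(i, sites, types, k):
    """Bisquare weights; the radius is the penalised distance to the k-th neighbour."""
    d = [penalised_distance(sites[i], s, types[i], t) for s, t in zip(sites, types)]
    radius = sorted(d)[k]  # index 0 is the cell itself
    return [(1.0 - (dj / radius) ** 2) ** 2 if dj < radius else 0.0 for dj in d]

def gwr_aicc(sites, types, x, y, k):
    """AICc of a one-variable GWR using k neighbours (assumed formula, see lead-in)."""
    n = len(sites)
    rss = 0.0
    trace = 0.0
    for i in range(n):
        w = local_weights(i, sites, types, k)
        sw = sum(w)
        mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
        if sxx <= 0.0:
            return math.inf  # local fit is degenerate
        sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
        fitted = my + (sxy / sxx) * (x[i] - mx)
        rss += (y[i] - fitted) ** 2
        trace += w[i] * (1.0 / sw + (x[i] - mx) ** 2 / sxx)  # diagonal of the hat matrix
    denom = n - 2 - trace
    if denom <= 0.0 or rss <= 0.0:
        return math.inf
    sigma = math.sqrt(rss / n)
    return 2 * n * math.log(sigma) + n * math.log(2 * math.pi) + n * (n + trace) / denom

# Synthetic 6 x 5 grid: houses in the west, apartments in the east.
sites = [(c * 200.0, r * 200.0) for r in range(5) for c in range(6)]
types = ["house" if c < 3 else "apartment" for r in range(5) for c in range(6)]
admin = [50 + 7 * ((3 * j) % 11) for j in range(30)]
census = [(0.8 + 0.1 * (j % 6)) * admin[j] + ((j * 37) % 13 - 6) for j in range(30)]

candidates = [5, 10, 15, 20, 25]
scores = {k: gwr_aicc(sites, types, admin, census, k) for k in candidates}
best_k = min(scores, key=scores.get)
```

Deriving the radius from a neighbour count makes the kernel adaptive: it is narrow where cells are dense and wide where they are sparse, and the doubled distance pushes cells of the other building type further out of the local subset.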

Prediction
• Small area estimation is:
  • Unable to compute locally
  • Non-spatial error term: ignored
  • Spatial trend

Accuracy questions
• Key issue is spatial autocorrelation
  • Local regression behaviour (adjusted R², residuals)
  • Classical LISAs (local Moran…)
• But no local accuracy measure
• Just a trend anyway
• => Validation at global level, where the census gives its own figure (mainly the Horvitz-Thompson estimate)
• Must include the omitted correction term
• Theoretically GWR gives the best results, but there is no estimate of accuracy in either case (simulations are now being developed to produce reference charts)
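For reference, the Horvitz-Thompson estimator of a total weights each sampled value by the inverse of its inclusion probability. The sketch below assumes an equal inclusion probability of 0.4 for every sampled address, in line with the 40 % sample mentioned earlier; the actual census weighting is more involved, and the counts are made up.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total as the sum of y_i / pi_i over sampled units."""
    return sum(v / p for v, p in zip(values, inclusion_probs))

sampled_counts = [3, 2, 5, 4]            # persons counted at four sampled addresses
inclusion_probs = [0.4] * len(sampled_counts)

ht_estimate = horvitz_thompson_total(sampled_counts, inclusion_probs)
```

With equal probabilities this reduces to multiplying the sample total by 1 / 0.4 = 2.5.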

Example: young people in Toulouse
• From census: 94183 (5 years)
• From estimations
  • Model (1 year of data collection)
  • With fiscal source: 93500
  • With health insurance source: 95440
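The gap between each model estimate and the census figure can be expressed as a relative deviation. The short computation below uses only the three figures above; the variable names are illustrative.

```python
census_figure = 94183
estimates = {"fiscal": 93500, "health insurance": 95440}

# Relative deviation from the census figure, in percent.
deviation_pct = {
    source: 100.0 * (value - census_figure) / census_figure
    for source, value in estimates.items()
}

for source, pct in deviation_pct.items():
    print(f"{source}: {pct:+.2f} %")
# fiscal: -0.73 %
# health insurance: +1.33 %
```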

Strasbourg: high unemployment areas
• Estimations with 2004, 2005, 2006 census surveys
• Map labels: estimated populations in deprived districts; final census figures (5 years estimation)
• Values shown: 7437, 9665, 13476, 11425, 7821, 11740, 1391

Thank you
Any questions?
