IEEE BDSE2013. Computational Methods for Testing Adequacy and Quality of Massive Synthetic Proximity Social Networks. Huadong Xia, Christopher Barrett, Jiangzhuo Chen, Madhav Marathe. Network Dynamics and Simulation Science Laboratory, Virginia Tech. NDSSL TR-13-153.
Acknowledgement
We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by DTRA Grant HDTRA1-11-1-0016, DTRA CNIMS Contract HDTRA1-11-D-0016-0001, NIH MIDAS Grant 2U01GM070694-09, NSF PetaApps Grant OCI-0904844, and NSF NetSE Grant CNS-1011769.
Outline
• Background and Contributions
• Methods: Network Synthesis
• Comparison of Large Scale Networks
• Conclusions
Importance of Computational Epidemiological Models
• Pandemics cause substantial social, economic and health impacts
  • The 1918 flu pandemic killed 50-100 million people, or 3 to 5 percent of the world population.
  • …
  • SARS 2003, H1N1 2009, Avian flu (H7N9) 2013
• Mathematical and computational models have played an important role in understanding and controlling epidemics
  • Controlled experiments are not permitted for ethical reasons.
  • Models help us understand the space-time dynamics of epidemics
Networked Epidemiology
• Heterogeneous
• Spatial-temporal features of populations
• Massive, irregular, dynamic and unstructured
• Social contact networks are usually synthesized
(Figure from the Internet)
The Four V’s in Networked Epidemiology
Velocity
• Interactions change every second
• Node status changes every second
• They are modeled at minute scale
Veracity
• Data: Do we collect enough raw data to render a clear picture?
• Method: Do we extract all useful information out of the available raw data?
Variety
• Demographics
• Geographic
• Temporal features
• Virus infectivity
• … …
Volume (facts in Delhi)
• 13.85M population
• 2.67M households
• >200M contacts
• 2.64M locations
(Figure panels: 7am, 9am, 3pm, 8pm)
Social Contact Network Modeling and Analysis
• The veracity of the network one makes depends on:
  • Time available to make such a network (human, computational)
  • The data available to make the network
  • The specific question that one would like to investigate
• Networks at different levels of detail may be constructed for the same region.
  • How do we evaluate networks that span large regions?
  • How do we compare two networks constructed for the same population?
  • When is the synthesized network adequate?
Contributions
• Propose a number of network measurements to understand and compare urban-scale social contact networks, which are extremely large, dynamic and unstructured.
• Explore quantitatively the adequacy standards in modeling proximity networks.
Synthetic Populations and Their Contact Networks
Goal:
• Determine who is where, and when.
Process:
• Create a statistically accurate baseline population
• Assign each individual to a home
• Estimate their activities and where these take place
• Determine each individual’s contacts and locations throughout a day.
What Is a Network
• Networks capture social interaction pertinent to the disease
• We focus on flu-like diseases; the appropriate network is a social contact network based on proximity relationships.
• Vertex attributes (people):
  • age
  • household size
  • gender
  • income
  • …
• Vertex attributes (locations):
  • (x,y,z)
  • land use
  • …
• Edge attributes:
  • activity type: shop, work, school
  • (start time 1, end time 1)
  • (start time 2, end time 2)
  • …
Two Sets of Data Sources and Generation Methods for Delhi Synthetic Population and Network
Residential Contacts: for the Detailed Network Only
(Figure labels: Office, Mall, Residential Area, School)
Population Synthesis
(Figure: a pool of individuals with labels such as M13, M65, F46, F6; steps shown: extract individuals, split into HHs)
How to Compare Two Networks
• Metrics
  • Entity level: the population, built infrastructure and their layout
  • Collective level: validate against aggregate statistics
  • Network level: structural properties
  • Epidemic dynamics level: policy effects
Comparison for Synthetic Populations
• Individual level: age-gender structure
• Household level: demographic structure
• Entropy: 1.35 vs. 1.02
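The slide does not say how the entropy values were computed or which distribution they describe. As a minimal sketch, assuming Shannon entropy over the empirical distribution of a categorical attribute (household size is used here purely as an illustration), one could compare two synthetic populations like this. The function name `shannon_entropy` and both sample lists are hypothetical and are not the Delhi data.

```python
import math
from collections import Counter

def shannon_entropy(values, base=math.e):
    """Shannon entropy of the empirical distribution of `values`.

    Assumption: natural log by default; the slide does not state the base.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

# Hypothetical household sizes, for illustration only.
uniform_sizes = [1, 2, 3, 4] * 25            # evenly spread over four sizes
skewed_sizes = [4] * 85 + [1, 2, 3] * 5      # dominated by one size

print("uniform:", shannon_entropy(uniform_sizes))
print("skewed: ", shannon_entropy(skewed_sizes))
```

A more varied household structure gives a higher entropy, so under this reading the population with the larger value carries more demographic diversity.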
Precision of Location Distribution
(Figure labels: the coarse network, LandScan grid, synthetic locations, real locations, the detailed network)
Temporal Visiting Degree in Randomly Selected Locations
Note: first row: the coarse network; second row: the detailed network
GPL: Temporal and Spatial Properties
• Travel distance distribution
• Radius of gyration distribution
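The slide names the radius of gyration without defining it. A common definition in mobility studies is the root-mean-square distance of a person's visited locations from their centroid; the sketch below assumes that definition, planar (x, y) coordinates, and one entry per visit. The function name `radius_of_gyration` is illustrative only.

```python
import math

def radius_of_gyration(points):
    """Root-mean-square distance of visited points from their centroid.

    `points` is a list of (x, y) tuples, one per visit (assumed planar
    coordinates; geographic coordinates would need a proper distance).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

# Hypothetical daily visits of one person: home, work, home.
visits = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
print(radius_of_gyration(visits))
```

Computing this value for every synthetic person gives the distribution that the slide compares between the two networks.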
GPL: Structural Properties
• The people-location network GPL: the degrees of a large portion of non-home locations follow a power-law-like distribution.
Disease Spread in a Social Network
• Within-host disease model: SEIR
• Between-host disease model:
  • probabilistic transmissions along edges of the social contact network
  • from infectious people to susceptible people
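The two model layers can be sketched together as a discrete-time simulation. This is a simplified illustration, not the simulator used in the study: it assumes one time step per day, a single per-contact transmission probability, and fixed incubation and infectious durations (the study uses heterogeneous durations). The name `simulate_seir` and all parameters are hypothetical.

```python
import random

def simulate_seir(contacts, seeds, p_transmit, incubation_days,
                  infectious_days, days, rng):
    """Run a simple SEIR process on a contact network.

    contacts: dict mapping each person to a list of neighbours.
    Returns a dict mapping each person to a final state in "SEIR".
    """
    state = {v: "S" for v in contacts}
    timer = {}  # days remaining in the current E or I state
    for v in seeds:
        state[v] = "I"
        timer[v] = infectious_days
    for _ in range(days):
        # Between-host: infectious people expose susceptible neighbours.
        newly_exposed = []
        for v, s in state.items():
            if s != "I":
                continue
            for u in contacts[v]:
                if state[u] == "S" and rng.random() < p_transmit:
                    newly_exposed.append(u)
        # Within-host: E -> I -> R once the timers run out.
        for v in list(timer):
            timer[v] -= 1
            if timer[v] == 0:
                if state[v] == "E":
                    state[v] = "I"
                    timer[v] = infectious_days
                else:
                    state[v] = "R"
                    del timer[v]
        for u in newly_exposed:
            if state[u] == "S":
                state[u] = "E"
                timer[u] = incubation_days
    return state

# Hypothetical three-person chain: 0 - 1 - 2, seeded at person 0.
chain = {0: [1], 1: [0, 2], 2: [1]}
print(simulate_seir(chain, [0], 0.5, 1, 2, 30, random.Random(7)))
```

In a fuller model the transmission probability would typically depend on the contact duration stored on each edge.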
Epidemic Simulations to Study the Delhi Population
• Disease model
  • Flu similar to H1N1 in 2009: assume R0=1.35, 1.40, 1.45, 1.60 (only the results for R0=1.35 are shown; the others are similar)
  • SEIR model: heterogeneous incubation and infectious durations
  • 10 random seeds every day
• Interventions
  • Vaccination: implemented at the beginning of the epidemic; compliance rate 25%
  • Antiviral: implemented when 1% of the population is infectious; covers 50% of the population; effective for 15 days
  • School closure: implemented when 1% of the population is infectious; compliance rate 60%; lasts for 21 days
  • Work closure: implemented when 1% of the population is infectious; compliance rate 50%; lasts for 21 days
• Five configurations in total (including the base case). Each configuration is simulated for 300 days with 30 replicates.
Comparison in Epidemic Simulations
• Impact on epidemic dynamics (R0=1.35):
  • The coarse network uses generic activity schedules, in which people travel much more frequently. As a result, the two networks show very different epidemic dynamics in the base case.
Epidemic Simulation Results: Interventions
• Similarities between the two networks:
  • Vaccination is still the most effective strategy.
  • Pharmaceutical interventions are more effective than non-pharmaceutical ones.
  • School closure is more effective than work closure.
• Differences between the two networks:
  • Severity is significantly different.
  • In delaying the outbreak, school closure is more effective than antivirals in the coarse network, whereas the opposite holds in the detailed network.
Conclusions
• Novel methodologies for creating a realistic social contact network for a typical urban area in developing countries
• Comparison to a coarser network suggests:
  • Similarities reflect generic properties of social contact networks
  • Region-specific features are captured in the detailed model
  • The epidemic dynamics of the region are strongly influenced by the activity patterns and demographic structure of local residents
  • A higher-resolution social contact network helps us make better public health policy
• A realistic representation of social networks requires adequate empirical input. We propose the following criteria of adequacy:
  • Does the new input decrease the uncertainty of the system?
  • Does the new input significantly change epidemics and intervention policy?
END Questions?
Epidemic Simulation Results: Vulnerability
• Calibrate R0 to be 1.35
• Vulnerability is defined as the normalized number of infections over 10,000 runs of random simulations
• The vulnerability distribution of the detailed network is flat compared with that of the coarse network, and the detailed network is less vulnerable because people travel less frequently.
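The definition above leaves the normalization implicit. One plausible reading, assumed here, is a per-person vulnerability: the fraction of simulation runs in which that person became infected. The sketch below takes the set of infected people from each run as given; the function name `vulnerability` and the toy runs are hypothetical.

```python
from collections import Counter

def vulnerability(infected_per_run, population):
    """Fraction of runs in which each person was infected.

    infected_per_run: list of iterables, one per simulation run, each
    holding the people infected in that run (assumed to be available).
    """
    runs = len(infected_per_run)
    counts = Counter()
    for infected in infected_per_run:
        counts.update(set(infected))  # count each person once per run
    return {person: counts[person] / runs for person in population}

# Four hypothetical runs over a three-person population.
runs = [{"a", "b"}, {"a"}, {"a"}, set()]
print(vulnerability(runs, ["a", "b", "c"]))
```

A histogram of these per-person values would give a vulnerability distribution of the kind compared on the slide.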
Epidemic Simulation Results
• Calibrate R0 to be 1.35
Delhi: National Capital Territory of India
• Case study:
  • Delhi (NCT-I): a representative South Asian city that had not been studied before.
• Statistics:
  • 13.85 million people in 2001; 22 million in 2011
  • Most populous metropolis: 2nd in India; 4th in the world
  • 573 square miles, 9 regions (see the picture)
  • The Yamuna river runs through the urban area.
• Unique socio-cultural characteristics:
  • Large slum area
  • Tropical weather
  • Environmental hygiene
Two Versions of Delhi Networks
• The coarse network:
  • Based on very limited data
  • Generic methodology applicable to any region in the world
• The detailed network:
  • Requires household-level micro-sample data and other detailed data, not available for all countries
• An improvement in results is expected. The comparison serves:
  • to evaluate the network generation model;
  • to understand the importance of different levels of detail.
V1: Synthetic Population Generation
• Population generation
Input: joint distribution of age and gender of the population in Delhi (from the India Census 2001)
Algorithm:
• Normalize the counts in the joint distribution of age and gender into a joint probability table
• Create 13.85 million individuals one by one. For each individual:
  • Randomly select a cell c according to the cell probabilities in the table.
  • Create a person with the age and gender corresponding to the cell c.
Output: 13.85 million individuals, each associated with the disaggregate attributes of gender and age.
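The V1 procedure amounts to weighted sampling from a joint age-gender table. A minimal sketch follows; the function name `synthesize_population`, the age brackets, and the counts are hypothetical stand-ins for the census table, and the population size is scaled down for illustration.

```python
import random

def synthesize_population(joint_counts, n_people, rng):
    """Draw individuals from a joint (age group, gender) count table.

    joint_counts: dict mapping (age_group, gender) cells to census counts.
    """
    cells = list(joint_counts)
    total = sum(joint_counts.values())
    # Step 1: normalize counts into a joint probability table.
    probs = [joint_counts[c] / total for c in cells]
    # Step 2: pick one cell per individual according to those probabilities.
    sampled = rng.choices(cells, weights=probs, k=n_people)
    return [{"age_group": age, "gender": gender} for age, gender in sampled]

# Hypothetical counts, not the India Census 2001 figures.
counts = {("0-17", "M"): 300, ("0-17", "F"): 280,
          ("18+", "M"): 220, ("18+", "F"): 200}
people = synthesize_population(counts, 1000, random.Random(42))
print(people[:3])
```

Because each person is drawn independently, this version carries no household structure, which is the gap the micro-sample based method is meant to close.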
Data Input
• Demographic data: basic census data + India micro-sample
  • India Census 2001
  • Micro-sample for household structure: India Human Development Survey 2005 by the University of Maryland and the National Council of Applied Economic Research. For each household sample it records hh size, hh head’s age, hh income, house type, and animal care; for each individual in the hh it records demographic details, religion, work, marital status, relationship to head, etc.
• Activity data: Thane travel survey + residential contacts survey
  • Activity templates from 2001 Household Travel Survey statistics for Thane, India, and 2005-2009 school attendance statistics from the UNESCO Institute for Statistics (UIS)
    • Activity templates are extracted with CART and assigned to the synthetic population with a decision tree.
  • Survey on residential area contacts in India, conducted by NDSSL
    • Approximately 40% of adults in India do not travel to work. The survey focused on them.
    • Collected people’s age, gender, and contact durations/frequencies near their home.
• Location data: MapMyIndia data
  • Ward-wise statistics for population and households.
  • Coordinates for locations such as schools, shopping centers, hotels, etc.
  • Infrastructure such as roads, railway stations, land use, etc.
  • Boundary for each city, town and ward.
V2: synthetic population creation method
• Same methodology as we used for US populations:
Input:
• total # of households
• aggregate distribution of demographic properties from Census: hh size, householder’s age
• household micro-samples
Output: synthetic population with household structure. Each individual is assigned an age and gender.
Algorithm:
1. Estimate the joint distribution of household size and householder’s age:
  1) construct a joint table of hh size and householder’s age: fill in the # of samples for each cell
  2) multiply the total # of households by the distributions to calculate marginal totals for the table
  3) run IPF to get a convergent joint table
  4) normalize: divide the count in each cell by (total # of samples) to obtain a probability for each cell (illustrated in the next slide)
2. Create the synthetic households and population:
  1) randomly select a cell with the probability in the joint table
  2) select a household sample h uniformly at random from all samples associated with that cell
  3) create a synthetic household H, so that H has the same members as h and each member in H has the same demographic attributes as in h
  4) repeat steps 2.1-2.3 until the # of synthetic households equals the total # of households from Census.
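The two stages can be sketched as iterative proportional fitting (IPF) on a two-way table followed by sampling household templates from the fitted cells. This is an illustration under assumptions, not the authors' implementation: the names `ipf` and `draw_households`, the 2x2 seed table, the marginal totals, and the sample households are all hypothetical, and cells with no micro-sample are simply skipped.

```python
import random

def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Scale a seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        for i, target in enumerate(row_targets):      # fit row marginals
            s = sum(table[i])
            if s > 0:
                table[i] = [x * target / s for x in table[i]]
        for j, target in enumerate(col_targets):      # fit column marginals
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        if all(abs(sum(row) - t) < tol for row, t in zip(table, row_targets)):
            break
    return table

def draw_households(table, samples_by_cell, n_households, rng):
    """Pick a cell by fitted weight, then copy a random sample household."""
    cells = [(i, j) for i, row in enumerate(table)
             for j in range(len(row)) if samples_by_cell.get((i, j))]
    weights = [table[i][j] for i, j in cells]  # choices() normalizes these
    households = []
    for _ in range(n_households):
        cell = rng.choices(cells, weights=weights, k=1)[0]
        template = rng.choice(samples_by_cell[cell])
        households.append([dict(member) for member in template])
    return households

# Rows: hh size (small, large); columns: householder age (young, old).
fitted = ipf([[1.0, 1.0], [1.0, 1.0]], [30.0, 70.0], [40.0, 60.0])
samples = {(0, 0): [[{"age": 28, "gender": "F"}]],
           (1, 1): [[{"age": 61, "gender": "M"}, {"age": 58, "gender": "F"}]]}
print(fitted)
print(draw_households(fitted, samples, 3, random.Random(1)))
```

Copying whole sample households is what preserves within-household correlations such as the ages of spouses and children.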
V2: household distribution – a snapshot
• Households are distributed along real streets/community blocks.
• V2 avoids placing households on rivers, lakes, green land, etc. (V1 distributes them uniformly within each 1 mile × 1 mile block.)
Flowchart: Generating Activity Sequences Based on Thane Survey for Delhi-V2
• Activity templates generation
• Data sources: demographics of the Thane sample population; UIS statistics
• Intermediate distributions: frequency distribution of reported activity sequences; frequency distribution of trips (trip start time, trip length)
• Steps shown: decision tree, sampling, sampling
• Outcome: commute categories; activity sequences
• Final output: 1) demographics; 2) activity template (activity sequence, activity duration)
Generation of the Residential Network
• Motivation for the residential contact network:
  • Approximately 40% of adults in India do not travel to work. The network models interactions among them around their homes (within the residential area).
• Survey data collected:
  • age and gender of people staying at home: node label
  • contact durations/frequencies of each person near their home: edge label/node degree
• Formal question: generate a random network such that
  • the degree distribution of the nodes is as given
  • the label of each node is as given
• Assumption: the network tends to be homophilous (nodes with similar labels are connected with higher probability)
• Method:
  • Configuration model with the added feature of node homophily.
  • Refer to the next slide for details.
Random Network Generation: configuration model with the added feature of node homophily.
For each edge-type in (long-dur, mid-dur, short-dur), do:
1. Initialize each node with a degree drawn i.i.d. from the degree distribution according to its label (age/gender)
2. Form a list of “stubs”: connections of nodes that have not yet been matched with neighbors. Call it stubList.
3. Pick a starting node v0 randomly.
4. For each of v0’s stubs, choose an element v1 from the stubList as follows:
  1) v1 is chosen randomly from the stubList;
  2) if v1 is the same as v0 or already connected to v0, go to 4.1);
  3) with probability p (>0.5), test whether v1 is similar to v0; if not, go to 4.1) and repeat the selection;
  4) create an edge between v0 and v1; its duration is computed randomly based on the edge-type (long, mid or short duration).
Done.
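The matching procedure for a single edge type can be sketched as follows. This is a simplified variant under stated assumptions, not the authors' code: it processes one stub at a time rather than one node at a time, treats two nodes as "similar" when their labels are equal, gives up on a stub after `max_tries` rejected candidates (the slide does not say how dead ends are handled), and omits edge durations. The names `homophilous_configuration_model` and `degree_sampler` are hypothetical.

```python
import random

def homophilous_configuration_model(labels, degree_sampler, p_homophily,
                                    rng, max_tries=100):
    """Return a set of undirected edges (frozensets of two nodes).

    labels: dict mapping node -> label (e.g. an age/gender class).
    degree_sampler: function (label, rng) -> target degree for that label.
    """
    # Step 1: draw a degree for every node according to its label.
    degrees = {v: degree_sampler(labels[v], rng) for v in labels}
    # Step 2: one stub per unit of degree.
    stubs = [v for v, d in degrees.items() for _ in range(d)]
    rng.shuffle(stubs)
    edges = set()
    while stubs:
        v0 = stubs.pop()                      # Step 3: next stub to match
        for _ in range(max_tries):            # Step 4: search for a partner
            if not stubs:
                break
            k = rng.randrange(len(stubs))
            v1 = stubs[k]
            if v1 == v0 or frozenset((v0, v1)) in edges:
                continue                      # no self-loops or multi-edges
            if rng.random() < p_homophily and labels[v1] != labels[v0]:
                continue                      # homophily test failed, redraw
            stubs[k] = stubs[-1]              # remove the matched stub
            stubs.pop()
            edges.add(frozenset((v0, v1)))
            break
        # If no partner was found, the stub v0 is dropped (an assumption).
    return edges

# Hypothetical example: 20 people in two age groups, target degree 2 each.
labels = {i: ("young" if i < 10 else "old") for i in range(20)}
net = homophilous_configuration_model(labels, lambda label, r: 2, 0.8,
                                      random.Random(3))
print(len(net), "edges")
```

Running the function once per edge type (long, mid, short duration) and attaching a sampled duration to each edge would complete the loop described on the slide.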