Generate country-scale networks of interaction from scattered statistics

Samuel Thiriot (Computer Science Laboratory – University Paris 6; Orange Labs – France Télécom R&D) • Jean-Daniel Kant (Computer Science Laboratory – University Paris 6)

Social networks for agent-based modeling • "social networks" in agent-based models are really interaction networks: they define who interacts with whom within the population • when the population is small, the network may be collected in the field. However, data collection becomes intractable at the scale of a whole population; in that case it is common to use a generator to produce the network • a network generator is an algorithm which, given parameters, generates networks constrained by properties observed in real networks • many agent-based models have been shown to be highly sensitive to this structure (opinion dynamics, diffusion of innovations…) • => the descriptive power of the structure determines the relevance of simulation results

Case study: interactions in rural Kenya • example: modeling the diffusion of contraceptive use in rural Kenya from field studies [Watkins et al. 2005]: • women discuss contraceptive solutions mainly with other women (sisters-in-law, co-wives). Discussions often take place during everyday activities: when they fetch water, walk together to the market, or in the beginning of office. They rarely discuss the problem with their husbands, but speak more often with their brothers-in-law • stronger normative influence comes from mothers and husbands • => family structure is a determining factor, as are affiliations (workplace, everyday activities) • how can we generate a plausible network of interactions for rural Kenya that complies with these observations? • more generally, how can a social scientist constrain networks using field observations?

Requirements • R1: generate models of large populations • many models aim to reproduce dynamics in large populations (e.g. opinion dynamics, consumer behavior, diffusion of innovations [Thiriot & Kant, 2005]) • R2: represent the different kinds of relationships linking two agents • the nature of the relationship changes the influence between these agents (finding a job [Granovetter 1973], conversations about products [Carl 2006], recommendations inside families [Engel et al. 1996]) • R3: detail the attributes of agents in the network of interactions • attributes change the frequency and nature of interactions (e.g., the content of word-of-mouth changes with distance [Carl 2006]) • attributes often condition the creation of relationships (homophily principle, as shown later) • many individual processes require attributes (e.g. adoption of contraceptive use depends on a woman's age and number of children) • R4: a relevant network generator should comply with processes of social selection

Evidence on social networks • social selection processes [Wasserman & Faust, 1994] • homophily: individuals exhibit a strong tendency to create relationships with people who share similar characteristics [McPherson et al. 2001] • affiliation: two individuals sharing a common affiliation (project, workplace, event) are more likely to bond and interact frequently • transitivity: two individuals are more likely to bond if they share a common friend ("friends of my friends are also my friends") • large-scale statistics • low density: few links compared to the number of possible pairs of nodes • high clustering: many strongly interconnected groups • short average path length (Milgram's experiment) • power-law distribution of degrees: a few individuals have a high degree of connectivity, while most have a low degree

Existing models compliant with (part of) evidence • random graphs with attributes • L(a1,a2)=1 is the random variable representing the existence of a link between a1 and a2 • the probability of a link depends on the agents' attributes Att(a) • p( L(a1,a2)=1 | Att(a1), Att(a2) ) • Markov random graphs [Frank & Strauss, 1986] • also comply with transitivity: two links may be dependent if they have a node in common • p( L(a1,a2)=1 | L(a1,a3)=1, L(a3,a2)=1 ) • recently unified with random graphs with attributes [Robins et al. 2001] • small-world and scale-free models reproduce large-scale statistics: high clustering rate, short paths and a power-law degree distribution • agent-based models have been proposed to build networks, but their purpose is to test hypotheses, and they do not aim to be descriptive (to date) • => none of these models satisfies all our requirements
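As an illustration of the first family above (random graphs with attributes), here is a minimal Python sketch in which each possible link is drawn independently with a probability that depends only on the two agents' attributes. The `village` attribute and the values 0.30 and 0.01 are assumptions made for the example; they do not come from the presentation.

```python
import random

def link_probability(att1, att2):
    """Hypothetical p(L(a1,a2)=1 | Att(a1), Att(a2)): sharing a village makes a link more likely."""
    return 0.30 if att1["village"] == att2["village"] else 0.01

def attribute_random_graph(agents, rng):
    """Draw every possible link independently, conditioned only on the two agents' attributes."""
    links = set()
    ids = sorted(agents)
    for i, a1 in enumerate(ids):
        for a2 in ids[i + 1:]:
            if rng.random() < link_probability(agents[a1], agents[a2]):
                links.add((a1, a2))
    return links

rng = random.Random(0)
# Illustrative population: 50 agents spread over two villages.
agents = {i: {"village": rng.choice(["A", "B"])} for i in range(50)}
links = attribute_random_graph(agents, rng)
```

Because links are drawn independently, such a generator can express homophily but not transitivity, which is the limitation the Markov random graph bullet addresses.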

Evidence: scattered statistics • plenty of knowledge about the population is available from national censuses, sociological studies and other field studies • who individuals are: gender, ethnicity, socioeconomic classes, incomes, family structure (number of children, marital status, etc.) • affiliations (employment, participation in associative life, sport, etc.), at a more detailed level than network statistics • in the case of Kenya: • sociological studies on the structure of family [Mburugu & Adams 2004] • Kenya Demographic and Health Survey • specific studies on the modeled phenomenon [Watkins et al. 1995a,b] [Rutenberg & Watkins 1997] • these "scattered statistics" are already collected and published at a country scale. They constitute large-scale, detailed knowledge on the structure of society • surprisingly, no network generator relies on this information to constrain networks • R5: take into account scattered statistics during generation

Approach • a methodology to piece together scattered statistics (R5) in the form of Bayesian networks • an algorithm to generate the network given these parameters, based on known social selection processes (R4) • a method to measure the compliance of the generated network with the parameters • the resulting network of interaction describes relationships at a country scale (R1), and includes different kinds of relationships (R2) and agents' attributes (R3)

Methodology to codify evidence

Step 1&2: define relationships & links • step 1: define the types of social links T that should be represented in the relationships network • identify links leading to different interactions in the model, or created by different processes • in Kenya, we know that contraceptive adoption by women depends on social interactions with parents, siblings, husband, brothers of husband, colleagues • as usual in social network analysis, we distinguish: • links created from agents' attributes, TAtt • for our illustration, TAtt={spouses, motherOf, colleagues, friends} • links created by transitivity, TTrans • here TTrans={fatherOf, siblings} • step 2: define attributes • select the agents' attributes expected to influence the probability that a link is created, or that are useful for individual behavior (given data availability) • in our application, we retain marital status, age, gender, work (which determines colleague links) and spatial location

Step 3: represent attributes with BN • characteristics of individuals are often interdependent: • number of children given woman's age • employment given educational level • age given spatial location • type of work given gender • one can consider these attributes as random variables • Bayesian Networks enable an intuitive representation of attribute dependencies • [Figure: Bayesian network of agent attributes, annotated "number of links to create for each kind of link t ∈ TAtt"]
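To make the idea of an agent BN concrete, the following Python sketch encodes one of the dependencies listed above (type of work given gender) as conditional probability tables and recovers a marginal by enumeration. The dictionary layout, the variable names and every probability are hypothetical; real values would come from census or survey data.

```python
from itertools import product

# Hypothetical agent BN: each variable lists its parents and a conditional
# probability table (CPT) indexed by the tuple of parent values.
AGENT_BN = {
    "gender": {"parents": (), "cpt": {(): {"F": 0.5, "M": 0.5}}},
    "work": {
        "parents": ("gender",),
        "cpt": {
            ("F",): {"farm": 0.7, "trade": 0.3},
            ("M",): {"farm": 0.5, "trade": 0.5},
        },
    },
}

def joint_probability(bn, assignment):
    """Probability of a full assignment: product of each variable's CPT entry."""
    probability = 1.0
    for variable, node in bn.items():
        parent_values = tuple(assignment[parent] for parent in node["parents"])
        probability *= node["cpt"][parent_values][assignment[variable]]
    return probability

def marginal(bn, variable, value):
    """p(variable=value), obtained by summing the joint over all other variables."""
    names = list(bn)
    domains = [list(next(iter(bn[name]["cpt"].values()))) for name in names]
    total = 0.0
    for combination in product(*domains):
        assignment = dict(zip(names, combination))
        if assignment[variable] == value:
            total += joint_probability(bn, assignment)
    return total

# Share of farm workers implied by the tables: 0.5 * 0.7 + 0.5 * 0.5
p_farm = marginal(AGENT_BN, "work", "farm")
```

Enumeration is only practical for a handful of variables; a dedicated Bayesian network library would be the natural choice for realistic attribute sets.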

Step 4: represent links with BN • the probability of creating a link of type t ∈ TAtt given the agents' attributes may also be represented using a Bayesian network, named here the matching BN • example: matching BN for link type "spouses" • [Figure: matching BN for "spouses", with nodes for the attributes of agent 1, the attributes of agent 2, the conditions on linking, and the variable "link as spouses (yes/no)"]

Step 4: represent links with BN • Bayesian networks make it easy to express constraints on matching: • equality: spouses must live in the same location • Boolean operators: two agents can be linked as spouses if they live in the same location and their ages are compatible and they are of different gender […] • mathematical difference: wives are on average ten years younger than their husbands • qualitative knowledge, by using approximate probabilities (e.g.: most colleagues live in the same town) • one matching BN per type in TAtt
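The constraint types listed above can be sketched as a plain Python function returning p(link_spouses=yes | attributes of agent 1, attributes of agent 2). This is a stand-in for a matching BN, not the authors' encoding: the function name, the attribute keys, the age-gap bands and the probabilities 0.8 and 0.2 are all assumptions chosen for illustration.

```python
def spouse_link_probability(a1, a2):
    """Hypothetical matching rule for the "spouses" link type."""
    # Equality constraint: spouses must live in the same location.
    if a1["location"] != a2["location"]:
        return 0.0
    # Boolean constraint: the two agents must be of different gender.
    if a1["gender"] == a2["gender"]:
        return 0.0
    husband, wife = (a1, a2) if a1["gender"] == "M" else (a2, a1)
    # Difference constraint: wives are on average about ten years younger.
    gap = husband["age"] - wife["age"]
    if 5 <= gap <= 15:
        return 0.8  # assumed: most couples fall near the ten-year gap
    if 0 <= gap <= 25:
        return 0.2  # assumed: other non-negative gaps are possible but rarer
    return 0.0

man = {"gender": "M", "age": 35, "location": "village_1"}
woman = {"gender": "F", "age": 25, "location": "village_1"}
p_link = spouse_link_probability(man, woman)
```

The function is symmetric in its arguments, so it does not matter which agent is passed first.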

Generation of networks

Generation of the population • all variables in the agent BN represent agents' attributes • the Bayesian network defines the order in which variables are processed • algorithm: • for each individual to create • for each attribute • use Monte Carlo sampling to draw the value of the attribute at random • at the end of this process, all agents in the population have their attributes fully determined
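The algorithm above amounts to sampling each variable after its parents. A minimal Python sketch follows, assuming the BN is given as a list already sorted so that parents precede children; the variables, age classes and probabilities are hypothetical.

```python
import random

# Hypothetical agent BN, listed in an order where parents come before children.
AGENT_BN = [
    ("gender", (), {(): {"F": 0.5, "M": 0.5}}),
    ("age", (), {(): {"15-29": 0.5, "30-49": 0.5}}),
    ("marital_status", ("gender", "age"), {
        ("F", "15-29"): {"married": 0.5, "single": 0.5},
        ("F", "30-49"): {"married": 0.8, "single": 0.2},
        ("M", "15-29"): {"married": 0.3, "single": 0.7},
        ("M", "30-49"): {"married": 0.8, "single": 0.2},
    }),
]

def sample_agent(bn, rng):
    """Draw one agent: each attribute is sampled given its already-drawn parents."""
    agent = {}
    for variable, parents, cpt in bn:
        distribution = cpt[tuple(agent[parent] for parent in parents)]
        values = list(distribution)
        weights = [distribution[value] for value in values]
        agent[variable] = rng.choices(values, weights=weights)[0]
    return agent

def generate_population(bn, size, seed=0):
    """Create `size` agents whose attributes are all determined."""
    rng = random.Random(seed)
    return [sample_agent(bn, rng) for _ in range(size)]

population = generate_population(AGENT_BN, 10_000)
```

Fixing the seed makes a generated population reproducible, which helps when comparing runs with different parameters.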

creation of links TAtt • the matching BN describes the probability that two agents drawn at random from the population are tied together • [Figure: matching BN, showing the attributes of agent 1 and the attributes of agent 2]

creation of links TAtt • we "force" evidence for the creation of the link, p(link_spouses=yes)=1, and update the probabilities of the agents' attributes • the BN now describes two population subsets of candidate agents for linking, C1,t and C2,t

creation of links TAtt • for each kind of link t in TAtt • for each agent a1 in the set of candidates C1,t • use a1's attribute values as evidence in the matching BN • set the evidence for linking, p(create_link=yes)=1 • the attributes of a2 in the matching BN then describe the characteristics of potential candidates for a link of type t with agent a1, given a1's attributes • search for probable candidates in C2,t: same process as population generation, with Monte Carlo sampling (see paper) • creation of links by transitivity is then trivial
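The loop above can be approximated in a few lines of Python. Note that this sketch swaps in a simpler technique than the one described: instead of propagating evidence in a matching BN and sampling the attributes of a2, it weights every still-available candidate in C2,t by a hypothetical `match_probability` function and draws one of them. Names, attributes and probabilities are assumptions.

```python
import random

def match_probability(a1, a2):
    """Hypothetical stand-in for querying the matching BN with the link forced to yes."""
    if a1["location"] != a2["location"]:
        return 0.0
    gap = a1["age"] - a2["age"]
    return 0.8 if 5 <= gap <= 15 else 0.1

def create_links(candidates_1, candidates_2, rng):
    """For each a1 in C1, draw one partner from the unused agents of C2, if any is suitable."""
    links = []
    available = dict(candidates_2)  # copy, so the caller's dictionary is left untouched
    for id1, a1 in candidates_1.items():
        weighted = [(id2, match_probability(a1, a2)) for id2, a2 in available.items()]
        weighted = [(id2, weight) for id2, weight in weighted if weight > 0.0]
        if not weighted:
            continue  # no suitable candidate: the link is simply not created
        ids, weights = zip(*weighted)
        chosen = rng.choices(ids, weights=weights)[0]
        links.append((id1, chosen))
        del available[chosen]
    return links

men = {"m1": {"location": "A", "age": 40}, "m2": {"location": "B", "age": 30}}
women = {
    "w1": {"location": "A", "age": 30},
    "w2": {"location": "B", "age": 22},
    "w3": {"location": "C", "age": 25},
}
links = create_links(men, women, random.Random(0))
```

Removing a chosen candidate from the pool allows at most one link per agent of C2,t, which is itself an assumption; link types such as "colleagues" or polygamous "spouses" would need a per-agent quota instead.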

creation of links TTrans • parameters for links created by transitivity TTrans are of the form: • p( L(a1,a2,t1)=1 | L(a1,a3,t2)=1, L(a3,a2,t3)=1 ), where t1, t2, t3 are types of relationships • example: one creates "fatherOf" links from "motherOf" and "spouses" links: p( L(a1,a2,fatherOf)=1 | L(a1,a3,motherOf)=1, L(a3,a2,spouses)=1 ) • creation of these transitive links is simple: • for each pair of agents (a1,a3) linked by a relationship of type t2 • for each pair of agents (a3,a2) linked by a relationship of type t3 • create a link of type t1 between a1 and a2 with probability p( L(a1,a2,t1)=1 | L(a1,a3,t2)=1, L(a3,a2,t3)=1 )
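The two nested loops above translate directly into Python. In this sketch the links of types t2 and t3 are given as sets of ordered pairs and the conditional probability is a single number; the function name and the toy identifiers are assumptions.

```python
import random

def create_transitive_links(links_t2, links_t3, probability, rng):
    """Create t1 links (a1, a2) from t2 links (a1, a3) and t3 links (a3, a2)."""
    # Index t3 links by their first agent so that each a3 is looked up directly.
    targets_of = {}
    for a3, a2 in links_t3:
        targets_of.setdefault(a3, []).append(a2)
    created = set()
    for a1, a3 in sorted(links_t2):
        for a2 in sorted(targets_of.get(a3, [])):
            # p( L(a1,a2,t1)=1 | L(a1,a3,t2)=1, L(a3,a2,t3)=1 )
            if a1 != a2 and rng.random() < probability:
                created.add((a1, a2))
    return created

links_t2 = {(1, 2), (4, 5)}  # toy t2 links (a1, a3)
links_t3 = {(2, 3)}          # toy t3 links (a3, a2)
links_t1 = create_transitive_links(links_t2, links_t3, 1.0, random.Random(0))
```

Indexing the t3 links first avoids scanning every pair of links, which matters at the population sizes discussed here.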

Generated network

resulting graph • the resulting graph includes links of the different kinds in T, and attribute values for each agent • each agent is positioned in its social environment (family, colleagues, friends) • this structure is replicated at a large scale (here, 10,000 agents)

measure errors: bias in statistical distribution • to check the compliance of the generated population with the parameters, we learn BNs from the generated population and quantify the difference between theoretical and measured probabilities (average difference) • bias in statistical distribution • while BNs describe a theoretical population with continuous probabilities, we generate a discrete population and link agents only when a suitable candidate is found • a minimum population size is required to reach statistical representativeness • the bias depends on the complexity of the parameter BNs and on the population size • => here, given our BNs, the minimal population required to generate a representative population is 10,000
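A simplified version of this check can be sketched in Python for a single variable: compare the theoretical distribution with the frequencies measured in the generated population and average the absolute differences. The presentation learns full BNs from the population; this sketch only covers one marginal, and the function name is an assumption.

```python
from collections import Counter

def average_abs_difference(theoretical, observed_values):
    """Mean absolute gap between theoretical probabilities and measured frequencies."""
    counts = Counter(observed_values)
    size = len(observed_values)
    gaps = [abs(p - counts[value] / size) for value, p in theoretical.items()]
    return sum(gaps) / len(gaps)

theoretical = {"F": 0.5, "M": 0.5}
# Tiny illustrative sample: 3 women and 1 man, against an expected 50/50 split.
error = average_abs_difference(theoretical, ["F", "F", "F", "M"])
```

Running the same measure on populations of increasing size would show the sampling bias shrinking, while an error caused by inconsistent parameters would stay flat.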

measure errors: errors in parameters • errors in parameters • an error rate that stays stable, independent of the population size, reveals a discrepancy between parameters • illustrated in our example by the "spouses" link • => we can detect major problems in the parameters • here, the discrepancy is: • number of men × number of wives per man > number of women having marital status = "married" • as a consequence, the theoretical probabilities of linking men with women through the "spouses" link cannot be satisfied • once detected, the problem is easily corrected
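The inequality above can be checked directly on a generated population. The Python sketch below follows the slide's wording literally (all men are counted); the function name, the attribute keys and the figure of 1.5 wives per man are assumptions for illustration.

```python
def spouse_link_shortfall(population, wives_per_man):
    """Number of "spouses" links that cannot be created; 0.0 means the parameters agree."""
    men = sum(1 for agent in population if agent["gender"] == "M")
    married_women = sum(
        1 for agent in population
        if agent["gender"] == "F" and agent["marital_status"] == "married"
    )
    required = men * wives_per_man  # links demanded on the men's side
    return max(0.0, required - married_women)

# Toy population: 10 men, 12 married women and 3 single women.
population = (
    [{"gender": "M", "marital_status": "married"} for _ in range(10)]
    + [{"gender": "F", "marital_status": "married"} for _ in range(12)]
    + [{"gender": "F", "marital_status": "single"} for _ in range(3)]
)
shortfall = spouse_link_shortfall(population, wives_per_man=1.5)  # 10 * 1.5 - 12
```

A positive shortfall does not shrink when the population grows, since both sides of the inequality scale together, which matches the stable error rate described above.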

statistical properties of the network • statistical properties of the generated network • average path length increases very slowly with the population size, exhibiting a small-world effect. The average path length is about 5 • average degree, density and transitivity indicators are stable above the minimum population size • all these values are consistent with statistics of real networks
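For reference, the indicators named above can be computed with standard definitions, sketched here in plain Python on an undirected graph. The presentation does not say how its indicators were computed, so treat these as the usual textbook formulas rather than the authors' implementation; at country scale a graph library and sampled path lengths would be more practical.

```python
from collections import deque
from itertools import combinations

def build_adjacency(nodes, edges):
    """Undirected adjacency sets from a list of (a, b) links."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def average_degree(adjacency):
    return sum(len(neighbours) for neighbours in adjacency.values()) / len(adjacency)

def density(adjacency):
    """Existing links divided by the number of possible pairs of nodes."""
    n = len(adjacency)
    links = sum(len(neighbours) for neighbours in adjacency.values()) / 2
    return links / (n * (n - 1) / 2)

def transitivity(adjacency):
    """Share of connected triples (a - centre - b) that are closed by a link a - b."""
    closed = triples = 0
    for neighbours in adjacency.values():
        for a, b in combinations(sorted(neighbours), 2):
            triples += 1
            closed += b in adjacency[a]
    return closed / triples if triples else 0.0

def average_path_length(adjacency):
    """Mean shortest-path length over reachable ordered pairs, using one BFS per node."""
    total = pairs = 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else 0.0

# Toy graph: a triangle 0-1-2 with a fourth node attached to node 2.
graph = build_adjacency(range(4), [(0, 1), (1, 2), (0, 2), (2, 3)])
```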

usage for social simulation • we generated a network of relationships ("who knows whom") from knowledge about the social system • for agent-based simulation, we need a network of interactions, that is "who interacts with whom" • eliminate relationships which don't lead to interaction • e.g.: mothers don't discuss contraceptive solutions with their young children • if necessary, tag networks with probabilities • in the case of Kenya, we know that long-distance relationships don't lead to communication (except for mother-daughter normative influence)
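The filtering step above can be sketched as a pass over typed relationships that drops those not leading to interaction and tags the rest with a probability. Everything numeric here is hypothetical (the probabilities, the age threshold of 15), and the exception for long-distance links is applied to all "motherOf" links, which is a simplification of the mother-daughter case mentioned above.

```python
# Hypothetical probability of interacting across each kind of relationship.
INTERACTION_PROBABILITY = {"spouses": 0.1, "siblings": 0.4, "colleagues": 0.6, "motherOf": 0.3}

def to_interaction_network(relationships, agents, minimum_age=15):
    """Keep only relationships that lead to interaction, tagged with a probability."""
    interactions = []
    for a1, a2, kind in relationships:
        # Mothers don't discuss contraceptive solutions with their young children.
        if kind == "motherOf" and agents[a2]["age"] < minimum_age:
            continue
        # Long-distance relationships don't lead to communication,
        # except for the normative influence of mothers.
        same_place = agents[a1]["location"] == agents[a2]["location"]
        if not same_place and kind != "motherOf":
            continue
        probability = INTERACTION_PROBABILITY.get(kind, 0.0)
        if probability > 0.0:
            interactions.append((a1, a2, kind, probability))
    return interactions

agents = {
    "ann": {"age": 45, "location": "A"},
    "bea": {"age": 22, "location": "B"},
    "cat": {"age": 6, "location": "A"},
    "dee": {"age": 40, "location": "B"},
}
relationships = [
    ("ann", "bea", "motherOf"),  # kept: adult child, distance does not matter here
    ("ann", "cat", "motherOf"),  # dropped: young child
    ("ann", "dee", "siblings"),  # dropped: long-distance relationship
    ("bea", "dee", "colleagues"),  # kept: same location
]
interactions = to_interaction_network(relationships, agents)
```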

Summary • methodology: • the social scientist describes attribute interdependencies and link conditions using graphical models • generation of the population and of the network is automated • result: • a population of heterogeneous agents with attribute values compliant with complex interdependencies • a network of relationships constrained by properties of the target population • kinds of relationships explicitly represented in the network, enabling modelers to describe more precisely the probabilities of interacting across links • a large population compliant with family structure, affiliations, etc. • indicators to check the coherency of parameters and determine the minimum population required

Discussion • George Box: "All models are wrong, but some are useful" • This model (by definition) is wrong: • some phenomena cannot be represented easily • representation with Bayesian Networks is more restrictive than ad-hoc graph generation • However, it aims to be useful: • Bayesian Networks provide an intuitive tool to piece together scattered statistics and qualitative knowledge • generation of graphs is simplified for social scientists, and requires neither development time nor expert knowledge in programming • the network of interactions is not the real one, but is constrained given available data (best effort)

Future work • formal analysis of the statistical properties of generated networks given the input BNs • improve generation efficiency • sensitivity analysis, in order to evaluate the risk when translating qualitative knowledge into probabilities • application to larger and more complex populations • the case of Kenya was chosen as a "relatively simple" case (few ethnic groups, few socio-economic differences [Watkins 1995]) • applying such a method to a larger and more complex population, like that of an "industrialized" country, will require a very large number of parameters • such a generation will enable more realistic simulations of social dynamics, including innovation diffusion

Thanks for your attention! Feel free to contact us with any question, remark or criticism: samuel.thiriot@lip6.fr
