What's the topic for this lecture?
• Introduction to the use of a multivariate modeling framework for network data
• Exponential random graph models (ERGMs, also known as p* models)
• The software used in this presentation is the "statnet" package in R
• Information on statnet is available in a special volume of the Journal of Statistical Software
• "Statistical Modeling of Social Networks with statnet", Journal of Statistical Software, Vol. 24, no. 1-9 (2008)
Why use multivariate statistics?
• A phenomenon must sometimes be explained with more than one variable
• This is particularly true for social phenomena
• In a multivariate analysis we combine different independent variables in order to predict the values of a dependent variable
• In other words, we construct a model where we assume a causal relationship between the variable we want to explain and a set of variables that our theory says cause the phenomenon
Example of a multivariate model
• We want to explain income variation in an organisation
• Annual salary is our dependent variable (Y)
• What explains income?
  • Number of years that the individual has been employed (X1)
  • Education (X2)
  • Position in organisation (X3)
  • Gender (X4)
Bivariate or multivariate models
• If the relationship between the variables is linear:
• The bivariate linear model
  Y = a + bX
• The multivariate linear model
  Y = a + b1X1 + b2X2 + b3X3 + b4X4
How to interpret a multivariate model
• A multivariate model is not a series of bivariate relationships calculated at the same time
• i.e. it is not Y = a + b1X1 OR Y = a + b2X2 OR Y = a + b3X3
• In a multivariate model we calculate the partial effect of an independent variable, i.e. its unique contribution to the model
• X1 has a partial effect on Y when X2 and X3 are held constant
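The difference between a bivariate slope and a partial effect can be sketched with simulated data. Everything below is an illustrative assumption (the variable names, the true coefficients 2 and 3, the correlation between x1 and x2); it is not the salary data from the example.

```r
# Simulated data only: x1 and x2 are correlated, and both affect y.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)              # x2 depends partly on x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # assumed true partial effects: 2 and 3

# Bivariate model: the slope of x1 also absorbs the effect that runs through x2
bivariate <- coef(lm(y ~ x1))[["x1"]]

# Multivariate model: the slope of x1 is its partial effect, with x2 held constant
multivariate <- coef(lm(y ~ x1 + x2))[["x1"]]

print(c(bivariate = bivariate, multivariate = multivariate))
```

Under these assumptions the bivariate slope should land near 2 + 3 × 0.8 = 4.4, while the multivariate slope should land near the assumed partial effect of 2.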
Why use an ERGM?
• Test hypotheses about the processes that generate a particular network structure
• An ERG model that contains only dyad-independent terms (such as the node-attribute terms used in this lecture) can be estimated with logistic regression analysis
• Our goal is to build a model that can predict links between nodes in the network
Logistic regression
• The dependent variable is binary (1 or 0)
• What is estimated in a logistic regression is the log-odds of the dependent variable
• The log-odds can be written as a linear function:
  ln(P(Y=1)/(1-P(Y=1))) = a + b1X1 + b2X2 + b3X3
• The logistic regression is estimated with maximum likelihood estimation (MLE)
• MLE finds the parameter estimates (i.e. a, b1, b2, b3) that maximise the likelihood of the observed data under the model
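A minimal sketch of how log-odds and probabilities map onto each other, using only base R. The three log-odds values are arbitrary illustrations, not estimates from the lecture.

```r
# Arbitrary example values on the log-odds scale
log_odds <- c(-3, 0, 2)

# Inverse of the logit: p = 1 / (1 + exp(-log_odds))
p <- 1 / (1 + exp(-log_odds))

# Base R's plogis() is the same transformation
same_as_plogis <- isTRUE(all.equal(p, plogis(log_odds)))

# Going back: ln(p / (1 - p)) recovers the log-odds
back <- log(p / (1 - p))

print(data.frame(log_odds, p, back))
```

A log-odds of 0 corresponds to a probability of 0.5; negative log-odds mean the event is less likely than not.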
Homophily theory
• Lazarsfeld and Merton (1954)
• Most human communication will occur between a source and a receiver who are alike
• When individuals share common meanings, beliefs, and mutual understandings, communication between them is more likely to be effective
• Gender homophily in organisations has been observed in many studies
• Our hypothesis is that gender homophily is a salient factor in research collaboration
Does gender structure co-authorship networks?
• Data: publications from the Department of Psychology, Umeå University
• Publication years: 2007 onwards
• Number of published items = 51 articles
• Number of authors = 114 (male = 64, female = 50)
• Number of authors employed at the psychology department = 24 (male = 16, female = 8)
The ERG model
• The dependent variable is binary (Y = 1 if there is a co-authorship link between the authors)
• We will build a model that tries to predict the existence of co-authorship links
• We will use a set of node attributes as independent variables
• It is also possible to use edge attributes in an ERG model
Independent variables and hypothesis
• Node attributes
  • Number of authorships for each node
  • Employed at the department (1 = employed, 0 = not employed)
  • Gender (1 = female, 0 = male)
• The hypothesis
  • Co-authorships are affected by gender homophily, i.e. a link is more probable if the authors have the same gender
  • Two types of homophily: baseline and inbreeding homophily
  • We will estimate the effect of inbreeding homophily
Intercept (baseline) model
• Intercept model
==========================
Summary of model fit
==========================
Formula:   psyk ~ edges
Maximum Likelihood Results:
      Estimate Std. Error MCMC s.e. p-value
edges -3.06538    0.06039        NA  <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For this model, the pseudolikelihood is the same as the likelihood.
Interpretation of the intercept
• -3.06538 is the log-odds for the existence of a co-authorship link
• i.e. ln(P(Y=1)/(1-P(Y=1)))
• Converted to a probability, this is the same as the connectivity (density) of the network, i.e. the number of existing links divided by the number of possible links
• The interpretation of the intercept changes when we introduce our independent variables
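The link between the intercept and the network density can be checked with a short calculation. This sketch assumes the network is undirected and has one node per author (114 nodes, as reported in the data description); the edge count it prints is implied by the estimate, not taken from the data.

```r
# Intercept (edges) estimate from the baseline model output
edges_coef <- -3.06538

# Log-odds -> probability of a link, i.e. the density of the network
density <- plogis(edges_coef)           # same as 1 / (1 + exp(-edges_coef))

# Assumption: undirected network with 114 authors as nodes
n_authors <- 114
n_dyads   <- choose(n_authors, 2)       # number of possible links

# Number of links implied by that density (approximate)
implied_edges <- density * n_dyads

print(c(density = density, n_dyads = n_dyads, implied_edges = implied_edges))
```

Under these assumptions the density comes out at roughly 0.045, meaning that about 4.5 percent of all possible author pairs have co-authored.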
Model 2
• Formula in R console:
  psyk ~ edges + nodematch("gender", diff = FALSE) + nodefactor("gender") + nodematch("department", diff = FALSE) + nodefactor("department") + nodecov("production")
• Terms used in formula:
  • edges is the intercept term in the model
  • nodefactor() gives the main effect of a categorical attribute
  • nodematch(, diff = FALSE) gives uniform homophily
  • If diff = TRUE we get differential homophily
  • nodecov() gives the main effect of a numeric attribute
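A hedged sketch of how a model with this formula might be fitted using the network and ergm packages from statnet. The lecture's `psyk` object is not available here, so a small random network called `toy` (a hypothetical stand-in with made-up attributes) is used instead; its estimates mean nothing substantively.

```r
# Sketch only: synthetic data stands in for the lecture's `psyk` network.
library(network)
library(ergm)

set.seed(1)
n <- 30

# Random symmetric 0/1 adjacency matrix (synthetic, not real co-authorships)
adj <- matrix(0, n, n)
adj[upper.tri(adj)] <- rbinom(n * (n - 1) / 2, 1, 0.1)
adj <- adj + t(adj)

toy <- network(adj, directed = FALSE)

# Made-up node attributes with the same names as in the lecture's formula
toy %v% "gender"     <- rbinom(n, 1, 0.5)   # 1 = female, 0 = male
toy %v% "department" <- rbinom(n, 1, 0.3)   # 1 = employed at the department
toy %v% "production" <- rpois(n, 2) + 1     # number of authorships

fit <- ergm(toy ~ edges +
              nodematch("gender", diff = FALSE) + nodefactor("gender") +
              nodematch("department", diff = FALSE) + nodefactor("department") +
              nodecov("production"))
summary(fit)
```

Because every term in the formula is dyad-independent, the fit reduces to a logistic regression on the dyads, which is why the lecture's output notes that the pseudolikelihood equals the likelihood.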
Model 2
Maximum Likelihood Results:
                    Estimate Std. Error MCMC s.e. p-value
edges               -3.94783    0.22617        NA  <1e-04 ***
nodematch.gender     0.12807    0.12478        NA  0.3048
nodefactor.gender.1  0.18282    0.09190        NA  0.0467 *
nodematch.dept      -0.25089    0.16747        NA  0.1342
nodefactor.dept.1   -0.13087    0.14687        NA  0.3729
nodecov.prod         0.21710    0.02113        NA  <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For this model, the pseudolikelihood is the same as the likelihood.
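One way to read the Model 2 estimates is to convert them from log-odds to odds ratios and to recompute the Wald tests. The sketch below copies the estimates and standard errors from the output above; the use of a normal reference distribution for the p-values is an assumption on my part.

```r
# Estimates and standard errors copied from the Model 2 output
est <- c(edges               = -3.94783,
         nodematch.gender    =  0.12807,
         nodefactor.gender.1 =  0.18282,
         nodematch.dept      = -0.25089,
         nodefactor.dept.1   = -0.13087,
         nodecov.prod        =  0.21710)
se  <- c(0.22617, 0.12478, 0.09190, 0.16747, 0.14687, 0.02113)

# exp(coefficient) is the multiplicative change in the odds of a link
odds_ratio <- exp(est)

# Wald z statistic and two-sided p-value (normal approximation assumed)
z       <- est / se
p_value <- 2 * pnorm(-abs(z))

print(round(cbind(odds_ratio, z, p_value), 4))
```

Read this way, the gender homophily term (nodematch.gender) raises the odds of a link only slightly and is not statistically significant, while each additional authorship (nodecov.prod) multiplies the odds of a link by about 1.24.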
Model with a combined gender-department node attribute
Maximum Likelihood Results:
                      Estimate Std. Error MCMC s.e. p-value
edges                -3.780522   0.318871        NA  <1e-04 ***
nodematch.gendept.1  -0.470124   0.427773        NA   0.272
nodematch.gendept.2   0.809230   0.821364        NA   0.325
nodematch.gendept.3   0.042986   0.278974        NA   0.878
nodematch.gendept.4   0.115658   0.279259        NA   0.679
nodefactor.gendept.2 -0.148699   0.232859        NA   0.523
nodefactor.gendept.3 -0.172505   0.191351        NA   0.367
nodefactor.gendept.4  0.004574   0.194113        NA   0.981
nodecov.prod          0.211575   0.021842        NA  <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1