Introduction to Multivariate Analysis: Epidemiological Applications in Health Services Research. Dr. Ibrahim Awad Ibrahim
Areas to be addressed today • Introduction to variables and data • Simple linear regression • Correlation • Population covariance • Multiple regression • Canonical correlation • Discriminant analysis • Logistic regression • Survival analysis • Principal component analysis • Factor analysis • Cluster analysis
Types of variables (Stevens’ classification, 1951) • Nominal: distinct categories (race, religion, county, sex) • Ordinal: rankings (education, health status, smoking level) • Interval: equal differences between levels (time, temperature, blood glucose level) • Ratio: interval with a natural zero (bone density, weight, height)
Variables used in data analysis • Dependent: result, outcome (e.g. developing CHD) • Independent: explanatory (age, sex, diet, exercise) • Latent constructs: SES, satisfaction, health status • Measurable indicators: education, employment, revisit, miles walked
Data • Data screening and transformation • Normality • Independence • Correlation (or lack of independence)
Variable types and measures of central tendency • Nominal: mode • Ordinal: median • Interval: mean • Ratio: geometric mean and harmonic mean
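A minimal sketch of the pairing above, using Python's standard `statistics` module. The data values and variable names are hypothetical and chosen only for illustration; `geometric_mean` needs Python 3.8 or later.

```python
import statistics

# Hypothetical data, one list per measurement level (illustrative only).
blood_type = ["A", "O", "O", "B", "O", "AB"]   # nominal
health_status = [1, 2, 2, 3, 4, 5, 5]          # ordinal ranks (1 = poor, 5 = excellent)
temperature_c = [36.5, 37.0, 37.5, 38.0]       # interval
weight_kg = [50.0, 60.0, 72.0]                 # ratio

nominal_mode = statistics.mode(blood_type)             # most frequent category
ordinal_median = statistics.median(health_status)      # middle rank
interval_mean = statistics.mean(temperature_c)         # arithmetic mean
ratio_geometric = statistics.geometric_mean(weight_kg)
ratio_harmonic = statistics.harmonic_mean(weight_kg)

print(nominal_mode, ordinal_median, interval_mean, ratio_geometric, ratio_harmonic)
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean.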
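Simple linear regression • Y = A + BX • A: intercept (value of Y when X = 0) • B: slope (change in Y per unit change in X) • [Figure: regression line on X and Y axes, with A and B marked]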
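The slide does not say how A and B are estimated. The sketch below assumes ordinary least squares, the usual choice; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates of A (intercept) and B (slope) in Y = A + BX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data lying exactly on the line Y = 1 + 2X
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(x, y)
print(a, b)
```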
Correlation • Mean = $\mu$ • Variance = (SD)$^2$ = $\sigma^2$ • Population covariance: $\sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$ • Product moment coefficient: $\rho = \sigma_{xy} / (\sigma_x \sigma_y)$ • $\rho$ lies between -1 and 1
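The covariance and the product moment coefficient can be computed directly from their definitions. This sketch treats a small hypothetical data set as the whole population, so it divides by n rather than n - 1; the function name is an assumption.

```python
import math

def covariance_and_correlation(x, y):
    """Population covariance and product moment coefficient of paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov, cov / (sd_x * sd_y)

# Hypothetical paired measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov, rho = covariance_and_correlation(x, y)
print(cov, rho)
```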
Population covariance • [Figure: four panels labeled $\rho$ = 0.00, $\rho$ = 0.33, $\rho$ = 0.6, and $\rho$ = 0.88]
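Multiple regression and correlation • Simple linear: $Y = \alpha + \beta X$ • Multiple regression: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$ • [Figure: ejection fraction (EF) as a function of exercise and body fat]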
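A sketch of fitting the multiple regression equation by least squares, solving the normal equations in plain Python so that it runs without extra libraries. The ejection fraction data are hypothetical and generated without noise; `solve` and `fit_multiple` are illustrative names.

```python
def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef

def fit_multiple(xs, y):
    """Least-squares [intercept, slope_1, ..., slope_p] from rows of predictors."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 carries the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data: EF = 50 + 2 * exercise - 0.5 * body_fat, no noise
predictors = [(1, 20), (2, 25), (3, 18), (4, 30), (5, 22)]
ef = [50 + 2 * ex - 0.5 * bf for ex, bf in predictors]
alpha, beta_exercise, beta_body_fat = fit_multiple(predictors, ef)
print(alpha, beta_exercise, beta_body_fat)
```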
Issues with regression • Missing values: random or in a pattern; handled by mean substitution or maximum likelihood (ML) • Dummy variables: equal intervals! • Multicollinearity: independent variables are highly correlated • Garbage can method
Canonical correlation • An extension of multiple regression • Multiple Y variables and multiple X variables • Finds several linear combinations of the X variables and the same number of linear combinations of the Y variables. • These combinations are called canonical variables, and the correlations between the corresponding pairs of canonical variables are called canonical correlations.
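One common way to screen for multicollinearity is to look at the correlation between predictors. The sketch below also reports the variance inflation factor (VIF), a diagnostic that the slide does not mention; with only two predictors it reduces to 1 / (1 - r^2). All data and names are hypothetical.

```python
import math

def correlation(x, y):
    """Product moment correlation between two lists of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors: weight and BMI move together, age does not
weight = [60, 65, 70, 75, 80, 85]
bmi = [21, 22.5, 24.5, 26, 27.5, 30]
age = [30, 62, 41, 55, 28, 47]

vif_weight_bmi = vif_two_predictors(weight, bmi)   # very large: collinear pair
vif_weight_age = vif_two_predictors(weight, age)   # close to 1: nearly independent
print(vif_weight_bmi, vif_weight_age)
```

A VIF far above 10 is often read as a sign that one of the two predictors should be dropped or combined with the other.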
Correlation matrix • Data screening and transformation • Normality • Independence • Correlation (or lack of independence)
Discriminant analysis • A method used to classify an individual into one of two or more groups based on a set of measurements • Examples: at risk for heart disease, cancer, diabetes, etc. • It can be used for prediction and description
Discriminant analysis • [Figure: groups A and B, with two individual points labeled a and b] • Points a and b are wrongly classified • The discriminant function describes the probability of being classified in the right group.
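The slide does not name a specific discriminant function. As one possible illustration, the sketch below assumes Fisher's linear discriminant for two groups measured on two variables, with a pooled covariance matrix and a cutoff halfway between the group means. The blood pressure and cholesterol values are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_covariance(group_a, group_b):
    """Pooled within-group 2x2 covariance matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[v / dof for v in row] for row in s]

def fisher_discriminant(group_a, group_b):
    """Return weights w and a cutoff; a case is assigned to A when w.x > cutoff."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    s = pooled_covariance(group_a, group_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    cutoff = 0.5 * (w[0] * (ma[0] + mb[0]) + w[1] * (ma[1] + mb[1]))
    return w, cutoff

def classify(x, w, cutoff):
    return "A" if w[0] * x[0] + w[1] * x[1] > cutoff else "B"

# Hypothetical (systolic blood pressure, cholesterol) pairs
at_risk = [(150, 260), (160, 240), (155, 250), (165, 270)]       # group A
not_at_risk = [(120, 190), (125, 210), (130, 200), (118, 195)]   # group B
w, cutoff = fisher_discriminant(at_risk, not_at_risk)
print(classify((160, 250), w, cutoff), classify((120, 195), w, cutoff))
```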
Logistic regression • An alternative to discriminant analysis for classifying an individual into one of two populations based on a set of criteria. • It is appropriate for any combination of discrete or continuous variables. • It uses maximum likelihood estimation to fit the model; individuals are then classified based on the independent variable list.
Survival analysis (event history analysis) • Analyzes the length of time it takes for a specific event to occur. • Time to death, organ failure, retirement, etc. • Length of time = function of {explanatory variables (covariates)}
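A minimal sketch of maximum likelihood fitting for a single predictor, assuming the standard logistic form P(Y = 1 | x) = 1 / (1 + exp(-(a + bx))) and plain gradient ascent on the log-likelihood. Statistical packages typically use faster iterative methods; the data, step size, and function names here are hypothetical.

```python
import math

def predict(a, b, x):
    """Fitted probability that Y = 1 at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def fit_logistic(x, y, steps=5000, rate=0.1):
    """Maximize the log-likelihood by gradient ascent; returns (a, b)."""
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y):
            residual = yi - predict(a, b, xi)  # observed minus fitted probability
            grad_a += residual
            grad_b += residual * xi
        a += rate * grad_a / n
        b += rate * grad_b / n
    return a, b

# Hypothetical risk score centered at zero, and whether the event occurred
score = [-3, -2, -1, 0, 1, 2, 3]
event = [0, 0, 1, 0, 1, 1, 1]
a, b = fit_logistic(score, event)

# Classify into the population with fitted probability of at least 0.5
labels = [1 if predict(a, b, s) >= 0.5 else 0 for s in score]
print(a, b, labels)
```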
Survival data example • [Figure: patient follow-up timelines on a time axis marked 1980, 1985, and 1990, with outcomes labeled died, died, died, lost, and surviving]
Log-linear regression • A regression model in which the dependent variable is the log of survival time (t) and the independent variables are the explanatory variables. • Multiple regression: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$ • Log-linear: $\log(t) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p + e$
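A sketch of the log-linear model with one explanatory variable, fitted by least squares on log(t). It assumes every survival time is fully observed; lost or still-surviving cases (censoring) need methods not shown here. The data and names are hypothetical.

```python
import math

def fit_log_linear(x, t):
    """Least-squares intercept and slope in log(t) = intercept + slope * x + e."""
    log_t = [math.log(ti) for ti in t]
    n = len(x)
    mean_x = sum(x) / n
    mean_l = sum(log_t) / n
    sxy = sum((xi - mean_x) * (li - mean_l) for xi, li in zip(x, log_t))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_l - slope * mean_x, slope

# Hypothetical data: survival time in years by number of risk factors (no noise)
risk_factors = [0, 1, 2, 3]
years = [math.exp(3.0 - 0.5 * x) for x in risk_factors]
alpha, beta = fit_log_linear(risk_factors, years)

# exp(beta) is the multiplicative change in survival time per extra risk factor
time_ratio = math.exp(beta)
print(alpha, beta, time_ratio)
```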
Cox proportional hazards model • Another method to model the relationship between survival time and a set of explanatory variables. • [Figure: curve over a time axis marked 1980, 1985, and 1990, with the area up to time t lined] • The proportion of the population who die up to time t is the lined area.
Cox proportional hazards model • The hazard function h at time t is proportional between groups 1 and 2, so that • $h_1(t)/h_2(t)$ is constant over time t.
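The slide does not write the model out. In its usual formulation, the hazard is a baseline hazard $h_0(t)$ multiplied by an exponential function of the covariates, which is why the ratio between two groups does not depend on t. Here $X_{gj}$ denotes covariate j in group g, a notation introduced only for this sketch.

```latex
h(t \mid X_1, \dots, X_p) = h_0(t)\,\exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)

\frac{h_1(t)}{h_2(t)}
  = \frac{h_0(t)\,\exp(\beta_1 X_{11} + \dots + \beta_p X_{1p})}
         {h_0(t)\,\exp(\beta_1 X_{21} + \dots + \beta_p X_{2p})}
  = \exp\bigl(\beta_1 (X_{11} - X_{21}) + \dots + \beta_p (X_{1p} - X_{2p})\bigr)
```

The baseline hazard cancels, so the ratio is the same at every time point.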
Principal component analysis • Aimed at simplifying the description of a set of interrelated variables. • All variables are treated equally. • You end up with uncorrelated new variables called principal components (PCs). • Each one is a linear combination of the original variables. • The measure of the information conveyed by each PC is its variance. • The PCs are arranged in descending order of the variance explained.
Principal component analysis • A general rule is to keep the PCs that each explain at least 5% of the total variance, but you can use a higher cutoff for parsimony. • Theory should guide the selection of the cutoff point. • PCA is sometimes used to alleviate multicollinearity.
Factor analysis • The objective is to understand the underlying structure that explains the relationships among the original variables. • We use the factor loadings of each variable on the generated factors to determine whether a given variable is usable. • Theory again guides the interpretation of the structures depicted by the common factors encompassing the selected variables.
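A sketch of principal components for just two variables, where the eigenvalues of the 2x2 sample covariance matrix have a closed form. Real analyses use many variables and a linear algebra library; the data and the function name `pca_2d` are hypothetical, and the 5% cutoff is the rule of thumb quoted above.

```python
import math

def pca_2d(x, y):
    """Eigenvalues (descending) and first component direction for two variables."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    first, second = half_trace + root, half_trace - root
    # Eigenvector belonging to the larger eigenvalue
    if sxy != 0:
        v = (sxy, first - sxx)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (first, second), (v[0] / norm, v[1] / norm)

# Hypothetical paired measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
eigenvalues, direction = pca_2d(x, y)
total = sum(eigenvalues)
explained = [ev / total for ev in eigenvalues]         # share of variance per PC
kept = [i + 1 for i, share in enumerate(explained) if share >= 0.05]
print(explained, kept)
```

The two eigenvalues add up to the total variance of the original variables, so the shares in `explained` sum to 1.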
Cluster analysis • A method for classifying individuals into previously unknown groups • It proceeds from the most general to the most specific: • Kingdom: Animalia • Phylum: Chordata • Subphylum: Vertebrata • Class: Mammalia • Order: Primates • Family: Hominidae • Genus: Homo • Species: sapiens
Patient clustering • Major: patients • Type: medical • Subtype: neurological • Class: genetic • Order: late onset • Disease: Guillain-Barré syndrome • Hierarchical: divisive or agglomerative
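A sketch of agglomerative hierarchical clustering on a single variable. It assumes single linkage (distance between the closest members of two clusters), which the slide does not specify; the ages and the function name are hypothetical.

```python
def agglomerative(points, k):
    """Start with one cluster per point; merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical ages at disease onset
onset_age = [5, 7, 6, 40, 42, 70, 71, 69]
groups = agglomerative(onset_age, 3)
print(groups)
```

A divisive method would work in the opposite direction, starting from one cluster that holds every patient and splitting it repeatedly.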
Presentation Schedule • 4 each on 4/22 and 4/27 • 5 on 4/29 • Each presentation should be a maximum of 10 minutes, plus 5 minutes for discussion • E-mail me your software and hardware requirements for your presentation. • Final projects due 5/7/99 by 5:00 pm in my office.