Dive into the comprehensive world of multivariate analysis in epidemiological health research with a focus on variables, data types, regression techniques, correlation, and more. Explore practical applications in health services research to enhance data analysis skills.
Introduction to Multivariate Analysis: Epidemiological Applications in Health Services Research • Dr. Ibrahim Awad Ibrahim
Areas to be addressed today • Introduction to variables and data • Simple linear regression • Correlation • Population covariance • Multiple regression • Canonical correlation • Discriminant analysis • Logistic regression • Survival analysis • Principal component analysis • Factor analysis • Cluster analysis
Types of variables (Stevens’ classification, 1951) • Nominal • distinct categories: race, religion, county, sex • Ordinal • rankings: education, health status, smoking levels • Interval • equal differences between levels: time, temperature, blood glucose levels • Ratio • interval with natural zero: bone density, weight, height
Variable use in data analysis • Dependent: result, outcome • developing CHD • Independent: explanatory • age, sex, diet, exercise • Latent constructs • SES, satisfaction, health status • Measurable indicators • education, employment, revisit, miles walked
Data • Data screening and transformation • Normality • Independence • Correlation (or lack of independence)
Variable types and measures of central tendency • Nominal: mode • Ordinal: median • Interval: Mean • Ratio: Geometric mean and harmonic mean
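The pairing of measurement level and summary statistic above can be tried out with Python's standard `statistics` module. This is a minimal sketch; the four small samples are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical samples, one per measurement level.
blood_type = ["A", "O", "O", "B", "O", "AB"]   # nominal
health_status = [1, 2, 2, 3, 4, 5, 5]          # ordinal ranks (1 = poor, 5 = excellent)
temperature_c = [36.5, 37.0, 37.5, 38.0]       # interval
weight_kg = [50.0, 80.0, 100.0]                # ratio

most_common = statistics.mode(blood_type)            # nominal: mode
middle_rank = statistics.median(health_status)       # ordinal: median
average_temp = statistics.mean(temperature_c)        # interval: arithmetic mean
geo_weight = statistics.geometric_mean(weight_kg)    # ratio: n-th root of the product
harm_weight = statistics.harmonic_mean(weight_kg)    # ratio: n / sum of reciprocals

print(most_common, middle_rank, average_temp, geo_weight, harm_weight)
```

For positive values the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.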
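Simple linear regression • Y = A + BX, where A is the intercept and B is the slope • [Figure: regression line on X and Y axes, labeled with A and B]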
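A minimal sketch of how A and B in Y = A + BX can be estimated by ordinary least squares. The function name `fit_line` and the age and blood pressure values are hypothetical, chosen only for illustration.

```python
def fit_line(x, y):
    """Return (A, B) for the least-squares line Y = A + B*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope B: sum of cross-products divided by the sum of squares of X.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    B = sxy / sxx
    # Intercept A: the fitted line passes through (mean_x, mean_y).
    A = mean_y - B * mean_x
    return A, B

# Hypothetical data: age (years) and systolic blood pressure (mmHg).
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 139, 143]

A, B = fit_line(age, sbp)
print(f"Y = {A:.2f} + {B:.2f} X")
```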
Correlation • Mean = $\mu$ • Variance (SD)$^2$ = $\sigma^2$ • Population covariance = $\sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$ • Product moment coefficient = $\rho = \sigma_{xy} / (\sigma_x \sigma_y)$ • It lies between -1 and 1
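The covariance and product moment coefficient can be computed directly from their definitions. The sketch below treats a small hypothetical data set as the whole population (dividing by n rather than n - 1); the function names are illustrative.

```python
import math

def population_covariance(x, y):
    """Mean of the products of deviations from the two means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(population_covariance(x, x))  # the variance is cov(x, x)
    sd_y = math.sqrt(population_covariance(y, y))
    return population_covariance(x, y) / (sd_x * sd_y)

# Hypothetical paired measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = population_covariance(x, y)
rho = correlation(x, y)
print(cov_xy, rho)
```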
Population covariance • [Figure: four scatter plots with $\rho$ = 0.00, 0.33, 0.6, and 0.88]
Multiple regression and correlation • Simple linear: $Y = \alpha + \beta X$ • Multiple regression: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$ • [Figure labels: EF (ejection fraction), Exercise, Body fat]
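A minimal sketch of fitting a multiple regression by least squares with NumPy. The slide labels suggest ejection fraction modeled on exercise and body fat; the numbers below are hypothetical and are generated from known coefficients without noise, so the fit can be checked against them.

```python
import numpy as np

# Hypothetical predictors.
exercise = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hours per week
body_fat = np.array([30.0, 22.0, 27.0, 18.0, 25.0, 15.0])  # percent

# Hypothetical outcome built from assumed coefficients (no error term).
ef = 40.0 + 2.0 * exercise - 0.5 * body_fat

# Design matrix: a column of ones for the intercept, then one column per X.
X = np.column_stack([np.ones_like(exercise), exercise, body_fat])

# Least-squares estimates of the intercept and the two slopes.
coef, _, rank, _ = np.linalg.lstsq(X, ef, rcond=None)
intercept, slope_exercise, slope_body_fat = coef
print(intercept, slope_exercise, slope_body_fat)
```

With real data the outcome includes an error term, so the estimates only approximate the underlying coefficients.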
Issues with regression • Missing values • random • pattern • mean substitution and ML • Dummy variables • equal intervals! • Multicollinearity • independent variables are highly correlated • Garbage can method
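One reading of the dummy variable point above is that coding an ordered category as 1, 2, 3 assumes equal intervals between levels, whereas 0/1 indicator variables do not. The sketch below shows indicator coding with a reference category; `dummy_code` and the smoking data are hypothetical.

```python
def dummy_code(values, reference):
    """Return (levels, rows): one 0/1 indicator per non-reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical smoking status for five subjects.
smoking = ["never", "former", "current", "never", "current"]

levels, rows = dummy_code(smoking, reference="never")
print(levels)
for status, row in zip(smoking, rows):
    print(status, row)
```

The reference level is represented by all zeros, so each regression coefficient compares one level with the reference.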
Canonical correlation • An extension of multiple regression • Multiple Y variables and multiple X variables • It finds several linear combinations of the X variables and the same number of linear combinations of the Y variables. • These combinations are called canonical variables, and the correlations between the corresponding pairs of canonical variables are called canonical correlations.
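A minimal sketch of one standard way to compute canonical correlations with NumPy: center both sets of variables, take an orthonormal basis of each with a QR decomposition, and read the correlations off the singular values. It assumes both data matrices have full column rank; the simulated data are hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and Y, largest first."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx' Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Hypothetical data: Y's first column nearly copies X's first column,
# and Y's second column is unrelated noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])

r = canonical_correlations(X, Y)
print(r)
```

The number of canonical correlations equals the smaller of the two variable counts, here two.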
Correlation matrix • Data screening and transformation • Normality • Independence • Correlation (or lack of independence)
Discriminant analysis • A method used to classify an individual into one of two or more groups based on a set of measurements • Examples: • at risk for • heart disease • cancer • diabetes, etc. • It can be used for prediction and description
Discriminant analysis • [Figure: two overlapping groups A and B; points a and b fall on the wrong side of the boundary] • a and b are wrongly classified • The discriminant function describes the probability of being classified in the right group.
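A minimal sketch of a two-group linear discriminant (Fisher's approach, which the slide does not name, so treat it as one possible choice). Two hypothetical groups are simulated, and the points that land on the wrong side of the cutoff play the role of a and b.

```python
import numpy as np

def fit_discriminant(A, B):
    """Return (w, cutoff) for a linear discriminant between groups A and B."""
    mean_a, mean_b = A.mean(axis=0), B.mean(axis=0)
    n_a, n_b = len(A), len(B)
    # Pooled within-group covariance matrix.
    pooled = ((n_a - 1) * np.cov(A, rowvar=False)
              + (n_b - 1) * np.cov(B, rowvar=False)) / (n_a + n_b - 2)
    w = np.linalg.solve(pooled, mean_a - mean_b)
    # Cutoff halfway between the two projected group means.
    cutoff = w @ (mean_a + mean_b) / 2
    return w, cutoff

def classify(points, w, cutoff):
    """Assign 'A' when the discriminant score exceeds the cutoff, else 'B'."""
    return np.where(points @ w > cutoff, "A", "B")

# Hypothetical measurements: two overlapping bivariate normal groups.
rng = np.random.default_rng(0)
group_a = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
group_b = rng.normal([3.0, 3.0], 1.0, size=(100, 2))

w, cutoff = fit_discriminant(group_a, group_b)
pred_a = classify(group_a, w, cutoff)
pred_b = classify(group_b, w, cutoff)
print("A members classified as B:", int((pred_a == "B").sum()))
print("B members classified as A:", int((pred_b == "A").sum()))
```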
Logistic regression • An alternative to discriminant analysis for classifying an individual into one of two populations based on a set of criteria. • It is appropriate for any combination of discrete or continuous variables. • It uses maximum likelihood estimation to fit the model coefficients; the predicted probabilities are then used to classify individuals based on the independent variable list.
Survival analysis (event history analysis) • Analyzes the length of time it takes for a specific event to occur. • Time to death, organ failure, retirement, etc. • Length of time = f(explanatory variables (covariates))
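A minimal sketch of maximum likelihood fitting for logistic regression, using Newton-Raphson iterations in NumPy. The age and CHD values are hypothetical, the function names are illustrative, and the data are deliberately not perfectly separable so that the estimates exist.

```python
import numpy as np

def predict_proba(X, beta):
    """P(y = 1) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def fit_logistic(X, y, iterations=25):
    """Maximize the log-likelihood by Newton-Raphson, starting from zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = predict_proba(X, beta)
        gradient = X.T @ (y - p)                         # score vector
        information = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(information, gradient)
    return beta

# Hypothetical data: age in years and CHD status (1 = yes, 0 = no).
age = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70], dtype=float)
chd = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(age), age])  # intercept plus age
beta = fit_logistic(X, chd)
p = predict_proba(X, beta)
predicted = (p >= 0.5).astype(int)  # classify at a 0.5 probability cutoff
print(beta, predicted)
```

The 0.5 cutoff is an assumption; a different cutoff trades one kind of misclassification for the other.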
Survival data example • [Figure: patient follow-up lines on a time axis from 1980 to 1990, with outcomes labeled died, died, died, lost, surviving]
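Survival data of this kind are commonly stored as a follow-up time plus an event flag per patient. The sketch below assumes the usual convention that patients who are lost or still surviving at the end of the study are treated as censored (event flag 0); the entry and exit years are hypothetical, since the figure's exact dates are not recoverable.

```python
STUDY_END = 1990

# Hypothetical patients: (entry year, year last seen, status at last contact).
patients = [
    (1980, 1984, "died"),
    (1981, 1989, "died"),
    (1983, 1986, "died"),
    (1982, 1987, "lost"),
    (1985, STUDY_END, "surviving"),
]

def to_survival_record(entry, last_seen, status):
    """Return (years of follow-up, event flag); the flag is 1 only for an observed death."""
    return last_seen - entry, 1 if status == "died" else 0

records = [to_survival_record(*p) for p in patients]
for (time, event), (_, _, status) in zip(records, patients):
    print(f"{status:10s} time={time} event={event}")
```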
Log-linear regression • A regression model in which the dependent variable is the log of survival time (t) and the independent variables are the explanatory variables. • Multiple regression: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$ • Log-linear: $\log(t) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p + e$
Cox proportional hazards model • Another method to model the relationship between survival time and a set of explanatory variables. • The proportion of the population who die up to time (t) is the lined area. • [Figure: curve over a time axis (1980, 1985, 1990) with the area up to t lined]
Cox proportional hazards model • The hazard function (h) at time (t) is proportional among groups 1 & 2, so that • h1(t)/h2(t) is constant over time.
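The constant ratio follows from the usual form of the Cox model, which the slide does not spell out and is assumed here: a shared baseline hazard $h_0(t)$ multiplied by an exponential function of the covariates. The symbols $h_0$, $\beta_j$, and $x_{ij}$ are introduced only for this illustration.

```latex
% Assumed standard Cox form: hazard for group i with covariates x_{i1}, ..., x_{ip}
h_i(t) = h_0(t)\,\exp\!\left(\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\right)

% Ratio of the hazards of groups 1 and 2 at the same time t
\frac{h_1(t)}{h_2(t)}
  = \frac{h_0(t)\,\exp\!\left(\sum_{j=1}^{p} \beta_j x_{1j}\right)}
         {h_0(t)\,\exp\!\left(\sum_{j=1}^{p} \beta_j x_{2j}\right)}
  = \exp\!\left(\sum_{j=1}^{p} \beta_j \,(x_{1j} - x_{2j})\right)
```

The baseline hazard cancels, so the ratio depends on the covariate differences but not on t.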
Principal component analysis • Aimed at simplifying the description of a set of interrelated variables. • All variables are treated equally. • You end up with uncorrelated new variables called principal components (PCs). • Each one is a linear combination of the original variables. • The measure of the information conveyed by each is its variance. • The PCs are arranged in descending order of the variance explained.
Principal component analysis • A general rule is to keep the PCs that each explain at least 5% of the total variance, but you can set a higher cutoff for parsimony. • Theory should guide the selection of the cutoff point. • It is sometimes used to alleviate multicollinearity.
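A minimal sketch of principal component analysis by eigendecomposition of the covariance matrix, applying the 5% rule above as the default cutoff. The function name, the simulated data, and the choice of the covariance matrix (rather than the correlation matrix) are assumptions made for illustration.

```python
import numpy as np

def principal_components(data, min_share=0.05):
    """Return (shares, scores): variance shares of all PCs and scores of the kept PCs."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; reorder to descending variance.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    shares = eigenvalues / eigenvalues.sum()
    keep = shares >= min_share
    # Each PC score is a linear combination of the original variables.
    scores = centered @ eigenvectors[:, keep]
    return shares, scores

# Hypothetical data: two nearly identical variables plus one independent variable.
rng = np.random.default_rng(0)
z1 = rng.normal(size=300)
z2 = rng.normal(size=300)
data = np.column_stack([z1 + 0.1 * rng.normal(size=300),
                        z1 + 0.1 * rng.normal(size=300),
                        0.7 * z2])

shares, scores = principal_components(data)
print(np.round(shares, 3), scores.shape)
```

Because the first two variables are nearly collinear, two components carry almost all the variance and the third falls below the cutoff.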
Factor analysis • The objective is to understand the underlying structure explaining the relationship among the original variables. • We use the factor loading of each of the variables on the factors generated to determine the usability of a certain variable. • It is guided again by theory as to what are the structures depicted by the common factors encompassing the selected variables.
Cluster analysis • A method for classifying individuals into previously unknown groups • It proceeds from the most general to the most specific: • Kingdom: Animalia • Phylum: Chordata • Subphylum: Vertebrata • Class: Mammalia • Order: Primates • Family: Hominidae • Genus: Homo • Species: sapiens
Patient clustering • Major: patients • Types: medical • Subtype: neurological • Class: genetic • Order: late onset • Disease: Guillain-Barré syndrome • Hierarchical: divisive or agglomerative
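A minimal sketch of the agglomerative variant of hierarchical clustering: every individual starts as its own cluster, and the two closest clusters are merged repeatedly. Single linkage (distance between the nearest members) and the one-dimensional age-at-onset values are assumptions chosen to keep the example short.

```python
def agglomerate(points, n_clusters):
    """Single-linkage agglomerative clustering of 1-D values."""
    clusters = [[p] for p in points]  # start with one cluster per individual
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose nearest members are closest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical ages at disease onset for eight patients.
ages_at_onset = [2, 3, 5, 40, 42, 45, 70, 72]

groups = agglomerate(ages_at_onset, 3)
print(groups)
```

A divisive method works in the opposite direction, starting from one cluster that holds everyone and splitting it step by step.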
Presentation Schedule • 4 each on 4/22 and 4/27 • 5 on 4/29 • Each presentation should be maximum of 10 minutes and 5 minutes for discussion • E-mail me your requirements of software and hardware for your presentation. • Final projects due 5/7/99 by 5:00 pm in my office.