This study evaluates the performance of clustering algorithms on grouping children based on their early childhood growth patterns, using different inputs and algorithms. Analysis includes BMIz scores, principal component analysis, and mixed modeling.
Evaluating the performance of clustering using different inputs and algorithms to group children based on early childhood growth patterns
September 20, 2012
Jobayer Hossain, Ph.D.; Tim Wysocki, Ph.D.; Samuel S. Gidding, MD; MingXing Gong, M.Sc.; H. Timothy Bunnell, Ph.D.
Nemours Biomedical Research

Early Childhood Growth Pattern
• A childhood growth pattern is
  • the pattern of temporal change in height, weight, and head circumference
  • a determinant of body composition and body weight
• Commonly used measures of body composition and weight are
  • weight-for-length for ages < 2 years
  • body mass index (BMI) for ages ≥ 2 years
• The standardized scores of these two measures are the weight-for-length z-score and the BMI z-score (both termed BMIz in this presentation)

Enrollment Criteria
• Inclusion criteria:
  • Born between 2001 and 2005
  • Had a first well-child visit at a Nemours clinic at < 1 month of age
  • Had at least one well-child visit each year for the next 5 years
• Exclusion criteria: children with medical diagnoses associated with poor growth and development, such as cancer and cystic fibrosis

Data Collection and BMIz Calculation
• 3365 children were enrolled in the study
• Height, weight, age, other demographics, and comorbidities of childhood obesity were collected retrospectively from the Nemours electronic medical record (EMR) for children aged 0-5 years
• BMIz was calculated from height, weight, age, and gender data at each clinic visit
• BMIz scores were interpolated (LOESS) at ages (months) 1, 6, 12, 18, 24, 30, 36, 42, 48, 60 for each subject

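The slides do not say which reference method was used to turn BMI into a z-score. A common approach for growth charts is the LMS method, so the sketch below assumes it. The reference parameters `L_REF`, `M_REF`, and `S_REF` are made-up placeholders, not real chart values; a real calculation would look them up by age and sex.

```python
import math


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def lms_zscore(x: float, L: float, M: float, S: float) -> float:
    """z-score of measurement x under the LMS method (assumed here, not stated in the slides)."""
    if abs(L) < 1e-12:
        # Limiting case L = 0: the Box-Cox transform reduces to a log transform.
        return math.log(x / M) / S
    return ((x / M) ** L - 1.0) / (L * S)


# Hypothetical reference parameters for one age/sex stratum (NOT real chart values).
L_REF, M_REF, S_REF = -2.0, 16.0, 0.08

child_bmi = bmi(weight_kg=16.0, height_cm=100.0)
z = lms_zscore(child_bmi, L_REF, M_REF, S_REF)
print(f"BMI = {child_bmi:.2f}, BMIz = {z:.2f}")
```

A measurement equal to the reference median `M` maps to a z-score of zero, which is a quick sanity check for any LMS implementation.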
Demographics

Temporal Change in Mean BMIz
• Trend in mean BMIz
  • A cubic polynomial trend over ages 0-5 years
  • Approximately linear in three intervals of time: 1-9 months, 9-27 months, and 27-60 months

Natural Structure of the Temporal Change in BMIz of Individual Children

Mean BMIz Over Time: By Demographics
• The mean change in BMIz over time follows a cubic polynomial

Objectives
• To group children (0-5 years of age) using a cluster analysis that captures the natural structure of the temporal changes in BMIz
• To achieve this objective, we
  • performed many sets of cluster analyses using different clustering methods, algorithms, software, and cluster inputs
  • used a mixed model to evaluate and compare the performance of the many sets of clusters
  • selected the optimum clustering(s) based on the statistics of model fit to the data

Rationale of Our Work
• Cluster analysis is
  • mainly an exploratory data analysis
  • limited in its scope for evaluating and comparing the results of two or more sets of clusterings
• A large number of clustering methods/algorithms have been developed
• Different statistical software packages accommodate different algorithms
• Cluster inputs can be raw data or some form of standardized data
• No method/algorithm is uniquely best

Cluster Methods and Software Used
• We used the following software and cluster methods, with several of the suggested distance/similarity measures
• SAS procedures: CLUSTER (hierarchical), FASTCLUS (k-means), MODECLUS (clusters based on nonparametric density estimation using several algorithms), and ancillary procedures such as VARCLUS, TREE, ACECLUS, DISTANCE, PRINCOMP, and STDIZE
• SPSS cluster analysis: TwoStep (hierarchical clustering in two steps) and k-means
• R cluster analysis: package mclust (model-based clustering), package cluster (hierarchical), and function kmeans (k-means)

Inputs Used for Cluster Analyses
We performed cluster analyses of
• BMIz scores at different ages (at six-month intervals)
• Selected principal component (PC)/factor scores, after PC/factor analyses of the 11 BMIz variables
• Random coefficients of a mixed effects model of BMIz
• Coefficients of auto-regression (AR)/autocorrelation (AC) for each individual

Model Used for Cluster Evaluation
• We used a mixed model that fits the BMIz of children, nested within a cluster group, as a polynomial function of time

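The model equation from this slide is not available in the text. As a rough sketch only, a model of this kind could be written as follows, assuming a polynomial of degree $p$ with cluster-specific fixed coefficients and child-level random coefficients. All symbols are illustrative and are not the authors' notation: $y_{ij}$ is the BMIz of child $i$ at occasion $j$, $t_{ij}$ is the corresponding time, and $c(i)$ is the cluster to which child $i$ was assigned.

```latex
y_{ij} = \sum_{k=0}^{p} \left( \beta_{k,\,c(i)} + b_{ki} \right) t_{ij}^{\,k} + \varepsilon_{ij},
\qquad
\mathbf{b}_i \sim N(\mathbf{0}, \mathbf{G}),
\qquad
\varepsilon_{ij} \sim N(0, \sigma^2)
```

Under this reading, each candidate clustering supplies a different assignment $c(i)$, and the fit statistics of the resulting models can be compared across clusterings.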
Cluster Analysis of BMIz score ? Natural structure of the temporal change in BMI

Principal Component Analysis (PCA)
• PCA is a powerful exploratory tool for identifying patterns in data
• PCs reflect the interrelation of the similarities and dissimilarities between observed variables
• We performed PCA of BMIz
• The first 4 PCs (PC4) explain about 99.19% of the total variation in the data

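The sketch below shows how cumulative explained variance of the kind quoted on this slide can be computed. It uses synthetic data in place of the study's BMIz matrix and assumes PCA on the covariance matrix; the slides do not say whether the covariance or the correlation matrix was used. The function name `pca_explained_variance` is made up for this example.

```python
import numpy as np


def pca_explained_variance(X: np.ndarray):
    """Return (explained-variance ratios, PC scores), ordered from first to last PC."""
    Xc = X - X.mean(axis=0)                  # center each column (one column per age)
    cov = np.cov(Xc, rowvar=False)           # covariance matrix across ages
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()
    scores = Xc @ eigvecs                    # PC scores, usable as cluster inputs
    return ratios, scores


# Synthetic stand-in for a children-by-ages BMIz matrix (NOT the study data).
rng = np.random.default_rng(0)
n_children, n_ages = 200, 11
level = rng.normal(size=(n_children, 1))
slope = rng.normal(size=(n_children, 1))
ages = np.linspace(0.0, 1.0, n_ages)
X = level + slope * ages + 0.1 * rng.normal(size=(n_children, n_ages))

ratios, scores = pca_explained_variance(X)
cumulative = np.cumsum(ratios)
k = int(np.searchsorted(cumulative, 0.99) + 1)  # smallest number of PCs reaching 99%
print("PCs needed for 99% of the variation:", k)
```

The first `k` columns of `scores` would then play the role of the PC2, PC3, or PC4 inputs used for clustering.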
Cluster Analysis of PC Scores
• PC4 (the first four PCs, which explain 99.19% of the variation in the data): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC3 (the first three PCs, which explain 97.80% of the variation in the data): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC2 (the first two PCs, which explain 93.16% of the variation in the data): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC11 (all eleven PCs): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm (cluster analysis using all 11 PCs is perhaps equivalent to cluster analysis of the raw data)

Cluster Analysis of PC4 ? Natural structure of the temporal change in BMI
Cluster Analysis of PC3 ? Natural structure of the temporal change in BMI
Cluster Analysis of PC2 ? Natural structure of the temporal change in BMI

Cluster Analysis of Factor Scores
• Fac4 (the first four factors, which explain 99.19% of the variation in the data): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac3 (the first three factors, which explain 97.80% of the variation in the data): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac2 (the first two factors, which explain 93.16% of the variation in the data): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac11 (all eleven factors): four clusterings with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm (cluster analysis using all 11 factors is perhaps equivalent to cluster analysis of the raw data)

Cluster Analysis of Fac4 ? Natural structure of the temporal change in BMI
Cluster Analysis of Fac3 ? Natural structure of the temporal change in BMI
Cluster Analysis of Fac2 ? Natural structure of the temporal change in BMI

Piece-wise Mixed Effects Model with Random Coefficients
• Recall: the change in the population mean of BMIz is a cubic polynomial trend over ages 0-5 years, i.e., approximately linear in three intervals of time
• We can model this dataset with an intercept and three slopes, one for each interval
• We used the parameters (intercept and slopes) as both fixed and random effects in the model
• The fixed effects of the slopes explain the rate of change in the population level of BMIz in each interval of time
• The random effects explain the individual-to-individual variation
  • at the beginning of life (age of one month)
  • in the change in BMIz trajectories in the three intervals
• That is, the random effects account for the sources of heterogeneity in the change in population BMIz

Piece-wise Mixed Effects Model with Random Coefficients
• Exploratory analyses indicate that splits at 9 and 27 months of age yield the best fit of the piece-wise linear mixed effects model to the individual and population levels of change in BMIz

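The model equation itself is not available in the text. One parameterization consistent with the description (an intercept at one month, knots at 9 and 27 months, and individual-level slopes of the form $b_{2i}$, $b_{2i}+b_{3i}$, $b_{2i}+b_{3i}+b_{4i}$ as listed later in the deck) is sketched below. It assumes time is measured from one month of age; $a_{ij}$ denotes the age in months of child $i$ at occasion $j$, and $(u)_+ = \max(u, 0)$. The fixed-effect symbols $\beta_1,\dots,\beta_4$ are illustrative.

```latex
y_{ij} = (\beta_1 + b_{1i})
       + (\beta_2 + b_{2i})\,(a_{ij} - 1)
       + (\beta_3 + b_{3i})\,(a_{ij} - 9)_+
       + (\beta_4 + b_{4i})\,(a_{ij} - 27)_+
       + \varepsilon_{ij}
```

Under this form the population slope is $\beta_2$ for 1-9 months, $\beta_2 + \beta_3$ for 9-27 months, and $\beta_2 + \beta_3 + \beta_4$ for 27-60 months, so $\beta_3$ and $\beta_4$ measure the change in slope at each split.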
Piece-wise Mixed Effects Model with Random Coefficients
Estimates of the regression coefficients (population)
Estimated variances of the random effects (individual)

Piece-wise Mixed Effects Model with Random Coefficients
• The model just discussed is our initial model for tracking the trajectories of individual- and population-level BMIz
• We also observed significant differences in mean BMIz by gender, race, and ethnicity
• In addition to the initial model, we fitted a model that tracks the change in BMIz after adjustment for gender and race-ethnicity

Cluster Analysis of the Random Effects
• We performed a cluster analysis of four individual-level parameters: the intercept (b1i) and the three slopes (b2i, b2i + b3i, b2i + b3i + b4i)

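The following sketch shows how the four cluster inputs named on this slide could be assembled from per-child coefficients. It assumes a truncated-line basis with time measured from one month of age and knots at 9 and 27 months; the helper names `piecewise_basis` and `interval_slopes` and the coefficient values are hypothetical.

```python
import numpy as np

KNOTS = (9.0, 27.0)  # split ages in months, as reported in the slides


def piecewise_basis(age_months, origin=1.0, knots=KNOTS):
    """Design columns: 1, (age - origin), (age - 9)_+, (age - 27)_+."""
    a = np.asarray(age_months, dtype=float)
    cols = [np.ones_like(a), a - origin]
    cols += [np.clip(a - k, 0.0, None) for k in knots]
    return np.column_stack(cols)


def interval_slopes(coefs):
    """Map rows (b1, b2, b3, b4) to (intercept, slope 1-9, slope 9-27, slope 27-60)."""
    c = np.asarray(coefs, dtype=float)
    return np.column_stack([c[:, 0], np.cumsum(c[:, 1:], axis=1)])


# Hypothetical coefficients for one child (NOT estimates from the study).
coefs = np.array([[0.5, 0.10, -0.15, 0.08]])
inputs = interval_slopes(coefs)

# Trajectory implied by the same coefficients at selected ages.
trajectory = piecewise_basis([1, 9, 27, 60]) @ coefs[0]
print(inputs)
print(trajectory)
```

Each row of `inputs` is one child's point in the four-dimensional space that the clustering algorithms would then partition.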
Cluster Analysis of the Random Coefficients of Piece-wise Mixed Effects Model ? Natural structure of the temporal change in BMI

Time Series Analysis
• Performed
  • auto-regression (AR) on the BMIz of each individual and extracted the coefficients (Burg, Yule-Walker, and MLE methods were used) of the best model suggested by AIC
  • auto-correlation on the BMIz of each individual and extracted the autocorrelations
  • spectral analysis on each individual's BMIz and extracted the frequencies
• Performed cluster analysis of the AR coefficients, autocorrelations, and frequencies

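As an illustration of the Yule-Walker step, the sketch below estimates AR coefficients from sample autocovariances and picks an order by AIC. The AIC form used here (n·log σ² + 2p) is a common approximation and an assumption, not necessarily what the authors' software computed. A long simulated series is used so the estimate is stable; a real per-child BMIz series has only about ten points, so only very low orders would be estimable.

```python
import numpy as np


def yule_walker(x, order):
    """Estimate AR(order) coefficients and innovation variance via the Yule-Walker equations."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased sample autocovariances r[0], ..., r[order].
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz system R @ phi = (r[1], ..., r[order]).
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    phi = np.linalg.solve(R, r[1:])
    sigma2 = r[0] - np.dot(phi, r[1:])
    return phi, sigma2


def aic(n, sigma2, order):
    """Approximate AIC for an AR fit (assumed form)."""
    return n * np.log(sigma2) + 2 * order


# Simulated AR(1) series standing in for one individual's BMIz (NOT study data).
rng = np.random.default_rng(1)
n, true_phi = 2000, 0.6
x = np.zeros(n)
for t in range(1, n):
    x[t] = true_phi * x[t - 1] + rng.normal()

fits = {p: yule_walker(x, p) for p in (1, 2, 3)}
best_order = min(fits, key=lambda p: aic(n, fits[p][1], p))
phi_hat = fits[1][0]
print("AR(1) coefficient estimate:", phi_hat, "order chosen by AIC:", best_order)
```

The coefficient vector of the selected model is what would be passed on as the cluster input for that individual.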
Cluster Analysis of the AR Coefficients (Yule-Walker) ? Natural structure of the temporal change in BMI

Model-Based Evaluation, Ordered by Log-likelihood (Largest to Smallest)

Choosing an Optimum Clustering
• The following are the candidates for an optimum clustering:
• Optimum choice: based on the likelihood score and the number of groups, prefer the smaller number of groups with the maximum possible likelihood score

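The slide states the selection rule only in words. One possible way to make it concrete is sketched below: among candidates whose log-likelihood lies within a tolerance of the best one, pick the clustering with the fewest groups. The tolerance, the `Candidate` type, and all numbers are assumptions for illustration; they are not the authors' criterion or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    n_groups: int
    loglik: float


def choose_clustering(candidates, tolerance):
    """Fewest groups among candidates within `tolerance` of the best log-likelihood."""
    best = max(c.loglik for c in candidates)
    near_best = [c for c in candidates if best - c.loglik <= tolerance]
    # Ties on group count are broken in favor of the higher log-likelihood.
    return min(near_best, key=lambda c: (c.n_groups, -c.loglik))


# Hypothetical candidates (NOT values from the study).
candidates = [
    Candidate("input A, 4 groups", 4, -5120.0),
    Candidate("input A, 6 groups", 6, -5100.0),
    Candidate("input B, 5 groups", 5, -5400.0),
]

chosen = choose_clustering(candidates, tolerance=25.0)
print(chosen.name)
```

With a tolerance of zero the rule reduces to picking the maximum log-likelihood; a larger tolerance trades fit for fewer groups.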
Best Clustering Based on the Model Fit Statistics
Optimum Clustering Based on the Number of Groups and Model Fit Statistics
Acknowledgement • Li Xie, Biostatistician
Thank you very much

The FASTCLUS Procedure
• The FASTCLUS procedure combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means. This kind of clustering method is often called a k-means model.
• By default, the FASTCLUS procedure uses Euclidean distances.
• The FASTCLUS procedure can use an Lp (least pth powers) clustering criterion instead of the least squares (L2) criterion used in k-means clustering methods.
• PROC FASTCLUS uses algorithms that place a larger influence on variables with larger variance, so it might be necessary to standardize the variables before performing the cluster analysis.

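To illustrate the least-squares k-means idea and why standardization matters, here is a minimal sketch of the standard Lloyd iteration on standardized inputs. It is not PROC FASTCLUS: the initial centers are simply sampled from the data, which differs from the procedure's own seed-selection method, and the data are synthetic.

```python
import numpy as np


def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) for each row."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with randomly sampled initial centers (an assumption)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Recompute each center as the mean of its members; keep empty clusters in place.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return assign(X, centers), centers


# Two synthetic groups; the second variable has a much larger variance than the first.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(50, 2))
blob_b = rng.normal(loc=[10.0, 1000.0], scale=[1.0, 100.0], size=(50, 2))
Z = standardize(np.vstack([blob_a, blob_b]))

labels, centers = kmeans(Z, k=2)
print(np.bincount(labels))
```

Without the `standardize` step, the Euclidean distances would be dominated by the second variable, which is the concern raised in the last bullet of the slide.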
The MODECLUS Procedure
• The MODECLUS procedure clusters observations in a SAS data set by using any of several algorithms based on nonparametric density estimates.
• PROC MODECLUS can perform approximate significance tests for the number of clusters.
• PROC MODECLUS produces output data sets containing density estimates and cluster membership, various cluster statistics including approximate p-values, and a summary of the number of clusters generated by various algorithms, smoothing parameters, and significance levels.
• For nonparametric clustering methods, a cluster is loosely defined as a region surrounding a local maximum of the probability density function.

The CLUSTER Procedure
• The CLUSTER procedure hierarchically clusters the observations in a SAS data set by using one of 11 methods. The data can be coordinates or distances.
• The clustering methods are: average linkage, the centroid method, complete linkage, density linkage (including Wong's hybrid and kth-nearest-neighbor methods), maximum likelihood for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions, the flexible-beta method, McQuitty's similarity analysis, the median method, single linkage, two-stage density linkage, and Ward's minimum-variance method.

The SPSS TwoStep Cluster
• Handles both continuous and categorical variables by extending the model-based distance
• Uses a two-step clustering approach similar to BIRCH (Zhang et al. 1996)
• Step 1 (pre-cluster): uses a sequential clustering approach (Theodoridis and Koutroumbas 1999). It scans the records one by one and decides, based on the distance criterion, whether the current record should merge with a previously formed cluster or start a new cluster.
• Step 2 (group the sub-clusters): takes the sub-clusters resulting from step 1 and groups them into the desired number of clusters
• Can automatically find the optimal number of clusters

R mclust Package
• mclust is a contributed R package for model-based clustering, classification, and density estimation based on finite normal mixture modeling.
• It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models.
• Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation, and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis.
• Can automatically find the optimal number of clusters

Principal Component Analysis
• PCA transforms a set of interrelated variables into a new set of uncorrelated variables called principal components (PCs).
• The variance of a PC indicates the amount of total variation (information) in the original variables conveyed by that particular PC.
• The transformation is chosen so that the PCs are ordered: the first PC accounts for as much of the variability in the original variables as possible, and each succeeding PC in turn has the highest variance possible under the constraint that it be uncorrelated with the preceding PCs.
• Thus the most informative PC is the first and the least informative is the last.
• PCA is a powerful exploratory tool for identifying patterns in data because PCs reflect the interrelation of the similarities and dissimilarities between observed variables.

Piece-wise Mixed Effects Model with Random Coefficients
Estimates of the regression coefficients (population)
• A significant increasing trend in population BMIz at ages 1 to 9 months and at ages 27 to 60 months
• A significant decreasing trend in population BMIz at ages 9 to 27 months
• The rates of change in BMIz in the three time intervals are significantly different

Piece-wise Mixed Effects Model with Random Coefficients
Estimated variances of the random effects (individual)
• Substantial individual-to-individual variability in
  • BMIz at the age of one month
  • the rate of change in BMIz in each time interval
