The Interface of Functional and Longitudinal Data. Raymond J. Carroll Department of Statistics Member, Center for Statistical Bioinformatics Director, Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll. My Charge.
My Charge • “Please feel free to talk about anything you wish” (Dangerous) • “Your thinking about longitudinal data and perhaps functional data from a wider perspective” • “Goals of the workshop are to inspire new researchers, and to take stock of where the interface of longitudinal-functional data and dynamics is headed”
What I Want to Talk about Mother and joey, Tidbinbilla (outside Canberra), September 2010
What I Want to Talk about Namadgi National Park, July 2005
What I Will Talk About • I will talk about some of the problems I have worked on • No technical solutions; the other speakers look to be providing them • Investigators think marginally, statisticians think of random effects
Some Observations • In my work, there is a tension between • Providing answers to my collaborators that they can understand • Developing new general methodology publishable in statistics and that can solve more general problems • Thinking about parts of the actual problem that my collaborators would not have thought about • It’s easy to get caught up in either of the first two
Some Observations • When I am simply providing answers to stated questions, I find themes similar to the distinction between marginal models such as GEE and nonlinear mixed effects models for longitudinal data • GEE is simply easier • Most scientists think marginally because they are uncomfortable with the idea of variability
What I Will Talk About • Think about what the typical smart biologist knows about statistics. • t-tests, ANOVA, simple linear regression • All the focus is on the mean, none on the variability
Some Observations • What we have to do is to deliver the analysis the data collectors can understand, and teach them about variability • Pictures work wonders: functions are no harder to understand than histograms, and understanding variability can help investigators tell stories
Some Observations • We need to advance the field of statistics • Deeper understanding of the underlying process, through random effects modeling, often helps inform future studies and helps investigators tell their story
An Old Colon Carcinogenesis Project • Experiment with 2 lipids (fish oil and corn oil) with and without butyrate (a fatty acid) supplementation, with p27 or MGMT repair measured as the response • Longitudinal, maybe even dynamic, hierarchical and functional. • Hierarchical because each of the treatment groups has multiple samples, and each of them has multiple functions • Functional because of the biology
Colon Cancer Data Jeff Morris Ciprian Crainiceanu Ana-Maria Staicu Naisyin Wang Veera B Yehua Li
Functional • The colonic crypts have cells: near the bottom (x=0) are the stem cells, and near the top (x=1) are the differentiated cells
MGMT Repair Enzyme, 1 crypt • MGMT curve in one crypt. • Original analysis found large diet effects
MGMT Repair Enzyme, 1 crypt • The large diet effects on the MGMT repair enzyme are real. • There are also large diet effects on apoptosis
MGMT Repair Enzyme, 1 crypt • What do biologists do (define original analysis)? • They simplify the data so that they can do ANOVA, duh! • They average all the responses (p27 or MGMT, about 200 observations in each analysis) in the bottom third, middle third and top third. Then they run 3 ANOVAs.
MGMT Repair Enzyme, 1 crypt • They then tell a story about all the ANOVAs they have done. • We all smile about this, but my collaborator (Joanne Lupton) just got elected into the U.S. National Academy of Science.
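The thirds-and-ANOVA reduction described above can be sketched in a few lines. This is only an illustration, not the investigators' code: the function names `third_means` and `one_way_anova_f` are hypothetical, positions are assumed to be relative cell positions in [0, 1], and each third is assumed to contain at least one observation. Only the F statistic is computed; a p-value would need the F distribution from a statistics library.

```python
from statistics import mean


def third_means(positions, responses):
    """Average the responses in the bottom, middle and top third of a crypt.

    positions: relative cell positions in [0, 1] (0 = bottom, 1 = top).
    Assumes every third contains at least one observation.
    """
    bins = {"bottom": [], "middle": [], "top": []}
    for x, y in zip(positions, responses):
        if x < 1 / 3:
            bins["bottom"].append(y)
        elif x < 2 / 3:
            bins["middle"].append(y)
        else:
            bins["top"].append(y)
    return {name: mean(values) for name, values in bins.items()}


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: one list of summary values per treatment group.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

One would call `third_means` once per crypt and then run `one_way_anova_f` separately on the bottom, middle and top averages, with one group per diet. Everything about the shape of the curve within a third, and all of the between-crypt variability structure, is discarded along the way.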
MGMT Repair Enzyme, 1 crypt • I like to think that our more nuanced analyses help her tell her stories, which is hopefully not wishful thinking!
MGMT Repair Enzyme, 1 crypt • Wavelet functional coefficients for apoptotic index in the top 1/3 of the crypt, for fish oil and for corn oil. From Morris and Carroll (2006): “fish-oil-fed animals who had a large amount of apoptosis near their lumenal surface also had high levels of the DNA repair enzyme MGMT near their lumenal surface, meaning that the two major mechanisms for dealing with DNA damage were correlated. This relationship was not so strong for corn-oil-fed animals”.
MGMT Repair Enzyme, the Story • We did a full-blown wavelet-based functional mixed model analysis to get these conclusions. Could it have been done marginally? • Probably yes, but then that’s dull. • However, we (a) know much more about the pattern of variability and (b) we built up methods and software that can be used in a wide variety of settings
Longitudinal • Colon carcinogenesis is a localized phenomenon. The crypts closest to one another are highly correlated
Colon Cancer Data • The locality hypothesis says that colon cancer starts because of highly localized damage. • Longitudinal and hierarchical FDA can tell us many things about this hypothesis, e.g., where is localized damage more likely to occur? • While most research focuses on the proximal and distal portions of the colon, FDA reveals that there is as much or more in the middle
Colon Cancer Data • Lots of fun fitting this longitudinal, hierarchical functional data set • What did the investigators want to know? • They were interested in how correlated neighboring crypts are, consistent with the locality hypothesis.
Colon Cancer Data • The Bayesian analysis gives them strong point-wise evidence (can supplement with FDR) • Allows summary measures
Colon Cancer Data • Acknowledging the longitudinal nature led to much more precise inferences. This is the interaction function between diet and treatment: guess which one allows for locality?
Cell Signaling Data • Myometrial cells meant to mimic what goes on near birth were either exposed to dioxin (TCDD) or not exposed. • They were then exposed to a hormone, oxytocin, that stimulates calcium ion (Ca2+) signaling • The Ca2+ signal was observed at many pixels of each cell for 512 time points (85 minutes)
Cell Signaling Data Josue Martinez Jianhua Huang
Cell Signaling Data • The cells were segmented, and the intensities of the signals were obtained for each pixel, each cell and all time points. • Roughly 25 cells in each treatment group (control and TCDD) • Hierarchical because of pixels within cells within treatments
Cell Signaling Data • Functional because pixels are measured over time • Possibly different levels of spatial because the cells are in spatial alignment • Lots of preprocessing: cell segmentation, adjustment for saturation, and more
Cell Signaling Data First two minutes of the experiment for the TCDD-treated plate. Next come two movies of the data
Cell Signaling Data All cells (Control and TCDD), at a basal state in which the cells were cultured, 0-4 minutes and 40-80 minutes after oxytocin exposure
Cell Signaling Data All cells (Control and TCDD), at a low estrogen state, just before pregnancy (note the delayed response due to TCDD)
Cell Signaling Data All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy
Cell Signaling Data All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration
Cell Signaling Data All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration. Areas under the curve (p < 0.001)
Cell Signaling Data • You should see that in this analysis, we have not made use of the structure of the data. • We have thought like GEE people, and indeed reduced the comparison of control and TCDD to single numbers, e.g., peak time and area under the curve. • We did lots of dimension reduction (4 weighted SVDs) to get here
Cell Signaling Data • There was a lot of work to get the data into a format for analysis • Question: what can hierarchical, possibly spatial FDA do for us here, and given the structure, how should an analysis proceed? • I feel that there is a lot more that we can learn about the process by thinking more deeply about the modeling
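As a rough illustration of what reducing a signal curve to single numbers means, here is a minimal sketch of the two summaries named above, peak time and area under the curve. The function names are hypothetical, the curve is assumed to be a plain list of intensities at increasing time points, and the area is taken by the trapezoidal rule, which is an assumption rather than a description of what was actually done with these data.

```python
def peak_time(times, signal):
    """Time at which the signal attains its maximum (first occurrence)."""
    i = max(range(len(signal)), key=signal.__getitem__)
    return times[i]


def area_under_curve(times, signal):
    """Area under the signal by the trapezoidal rule.

    times must be increasing and have the same length as signal.
    """
    return sum(
        0.5 * (signal[i] + signal[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )
```

Each cell's curve collapses to two scalars, which can then be compared between control and TCDD with a two-sample test. The hierarchy of pixels within cells and the shape of the curve over time play no role after this step.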
Bat Chirp Data • Bats of the same species, residing in Austin (city bats) and College Station (Aggie bats)
Bat Chirp Data Josue Martinez Jeff Morris
Bat Chirp Data • Bat chirps were recorded, with multiple recordings for some bats. • The hierarchy is species, bat, replicate • I believe this analysis is a poster child for why to think functionally and hierarchically
Bat Chirp Data • The chirp is mainly composed of frequencies that start at about 40 kilohertz (kHz) and slowly decrease to 20 kHz from 0 to 8 milliseconds into the chirp. • The bat then transitions to predominant frequencies at 60 kHz that slowly decrease back down to 40 kHz and then rise up to 60 kHz towards the end of the chirp. • Frequencies above ∼ 80 kHz are harmonics of the fundamental signal.
Bat Chirp Data • It seems clear to me that this is an inherently functional problem. • Trying to reduce it to a single number to do a t-test seems difficult to contemplate, but it is not impossible. • People have tried t-tests and classification based on measures such as duration, start frequency, end frequency, etc.
Bat Chirp Data • One could simply take each pixel of the spectrogram and do t-tests, with FDR control • This would ignore the replicate data, would ignore the correlated nature of the data, would do no dimension reduction, etc. • What did the biologist want to know? Kisi Bohn
Bat Chirp Data • She wanted to know if the bats from the same species (City Bats and Aggie Bats) have evolved different vocalizations • What did we want to do? • Answer her question precisely, and let her tell a story (the marginal question, imprecisely framed) • Use all the data • Understand the variability
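For contrast, the pixel-wise route mentioned earlier (a t-test at every spectrogram pixel, followed by FDR control) needs little more than a multiple-testing step. The sketch below assumes the per-pixel p-values have already been computed and applies the Benjamini-Hochberg step-up procedure, which is one common way to control FDR and is an assumption here, not something the slides specify. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected
    at false discovery rate level q.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

This treats every pixel as a separate test. It does not use the replicate chirps, the correlation between neighboring pixels, or any dimension reduction, which is exactly the objection raised above.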
Bat Chirp Data • We wavelet transformed the spectrograms, fit a 2-D hierarchical WFMM (wavelet-based functional mixed model), transformed back, and did analysis of the results (see next)
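The transform, model, back-transform pipeline relies on the wavelet transform being invertible, so that whatever is estimated in coefficient space can be mapped back to the original domain. The sketch below shows only that round trip, using a 1-D orthonormal Haar transform as a stand-in. The actual analysis used a 2-D transform of spectrograms and a mixed model fit in between; the choice of Haar, the function names, and the power-of-2 length requirement are all assumptions made for illustration.

```python
import math


def haar_forward(x):
    """Full orthonormal Haar decomposition.

    x: sequence whose length is a power of 2.
    Returns (approx0, details), with details ordered finest level first.
    """
    details = []
    approx = list(x)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx[0], details


def haar_inverse(approx0, details):
    """Invert haar_forward, rebuilding the signal from coarse to fine."""
    approx = [approx0]
    for detail in reversed(details):
        finer = []
        for a, d in zip(approx, detail):
            finer.append((a + d) / math.sqrt(2))
            finer.append((a - d) / math.sqrt(2))
        approx = finer
    return approx
```

In a wavelet-space analysis, the modeling step would operate on the output of `haar_forward` (or its 2-D analogue) for every curve, and `haar_inverse` would carry the fitted effects back to the time-frequency domain for interpretation.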