Explore the intersection of theory and practical applications in biostatistics, focusing on methods like survival analysis, empirical processes, and consultative problem-solving. Discuss the importance of theory in deriving valid inferential tools for complex models and its role in scientific discovery. Gain core literacy and problem-solving skills in this advanced statistical field.
Biostatistics 760 Random Thoughts
Upcoming Classes • Bios 761: Advanced Probability and Statistical Inference • Bios 767: Longitudinal Data Analysis • Bios 780: Theory and Methods for Survival Analysis • Bios 841: Statistical Consulting
Bios 761 • Frequentist and Bayesian decision theory • Hypothesis testing: UMP tests, etc. • Bootstrap and other methods of inference • Stochastic processes: • Poisson processes • Markov chains • Martingales • Brownian motion
Bios 780 • Time-to-event data • Right censoring • Counting processes; martingales • Semiparametric approaches • Kaplan-Meier estimator • Log-rank statistic • Cox model • Data analysis
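To make the Bios 780 topics concrete, here is a minimal sketch of a Kaplan-Meier fit and a two-group log-rank test. It is not from the course materials; the toy data and the choice of the lifelines package are my own assumptions.

import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Toy right-censored data: follow-up time, event indicator (1 = event observed,
# 0 = censored), and a two-level group. All values are hypothetical.
df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 14, 7, 11, 6, 10],
    "event": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "group": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

# Kaplan-Meier estimate of the survival function, pooled over groups.
km = KaplanMeierFitter()
km.fit(durations=df["time"], event_observed=df["event"])
print(km.survival_function_)

# Log-rank test comparing the two groups.
a, b = df[df.group == "A"], df[df.group == "B"]
result = logrank_test(a["time"], b["time"],
                      event_observed_A=a["event"], event_observed_B=b["event"])
print(result.p_value)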
Bios 841 • Consulting versus collaboration • Bringing it all together to solve problems • Communicating about statistics • Three real problems • Three journal style reports • One final oral presentation • Real time problem solving • What is the role of statistical theory?
A Few War Stories • As a student: thesis on surrogates • As a postdoc: infectious diseases • As a new professor: cystic fibrosis (CF)* • Working on tenure: empirical processes • Empirical processes and cancer* • Chair of the DSMC for NICHD • Artificial intelligence and NSCLC
CF Neonatal Screening • 1992: Joined Phil Farrell’s CF study team • 1997: Farrell, Kosorok, Laxova, et al., published in NEJM • 2004 (Oct. 15): CDC recommended CF newborn screening; the 1997 article was judged the only valid randomized trial • States offering CF newborn screening: 3 in 1997, 12 in 2004, 45 today
What Role Did “Theory” Play? • Used state-of-the-art statistical methods that were robust (GEE) • In other CF research we have used: • Current status methods (parametric, robust) • Constrained regression estimation • Semiparametric bootstrap inference • Martingale-based survival analysis • New work using artificial intelligence
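As a sketch of the GEE analysis mentioned in the first bullet: the variable names, formula, and toy data below are hypothetical; only the statsmodels GEE interface is assumed.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical longitudinal data: repeated outcomes within subjects.
df = pd.DataFrame({
    "subject":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "visit":    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    "screened": [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "outcome":  [2.1, 2.4, 2.8, 1.7, 1.9, 2.0, 2.3, 2.6, 3.0, 1.6, 1.8, 2.1],
})

# GEE with an exchangeable working correlation; the sandwich (robust) standard
# errors remain valid even if the working correlation is misspecified, which is
# the sense in which the method is "robust".
model = smf.gee("outcome ~ screened + visit", groups="subject", data=df,
                family=sm.families.Gaussian(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())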
Empirical Processes and Cancer • Non-Hodgkin’s Lymphoma Prognostic Factors Project (1993, NEJM) • Cox proportional hazards model used to assess the risks associated with five prognostic factors: age, performance status, serum lactate dehydrogenase level, number of extranodal disease sites, and tumor stage • Diagnostics show the model fits poorly
What is the Problem? • Poor survival function prediction • Possibly incorrect interpretation of risk factor effects • A model that adds a single parameter to the Cox model was developed and fit • This new model fits well (Kosorok, Lee, and Fine, 2004) • Inference for the new model is complicated
What Does Theory Tell Us? • We can derive valid inferential tools for the new model: estimation and bootstrap • Robustness was also studied: we learn theoretically that the Cox model is robust to this kind of model misspecification: • The direction of the regression coefficients is preserved • Should use robust variance for Cox model
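A minimal sketch of the last bullet, again assuming the lifelines package and toy data: requesting a robust (sandwich) variance when fitting the Cox model.

import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical right-censored data with one covariate.
df = pd.DataFrame({
    "time":  [4, 6, 6, 9, 11, 13, 15, 17],
    "event": [1, 1, 0, 1, 0, 1, 1, 0],
    "x":     [0, 1, 0, 1, 0, 1, 0, 1],
})

# robust=True requests the sandwich variance estimator, so standard errors for
# the regression coefficients remain valid under model misspecification.
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event", robust=True)
cph.print_summary()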
Theory Versus Applications • The title implies there is conflict between theory and applications • This isn’t true! • Theory provides a basis for correct thinking and problem solving for applications • Applications drive new theoretical development
Theory Can Be Impractical • Law of the iterated logarithm: needs a sample size of about 10^8 (“asymptopia”). • Sometimes higher-order approximations are needed before it becomes useful. • Sometimes computational properties of asymptotically optimal estimators are poor. • Some hard problems take years to solve.
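One informal way to see the slow convergence (a sketch of my own, not from the slides): simulate partial sums of iid N(0,1) variables and compare them with the LIL envelope sqrt(2 n log log n), whose limsup ratio is 1 almost surely. Even with millions of observations, the normalized sum at any fixed n is typically well below 1.

import numpy as np

rng = np.random.default_rng()

# LIL: limsup_n S_n / sqrt(2 n log log n) = 1 a.s. for iid standard normals.
n = 10**7
s = np.cumsum(rng.standard_normal(n))
for k in (10**3, 10**5, 10**7):
    print(k, s[k - 1] / np.sqrt(2 * k * np.log(np.log(k))))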
Why Theory is Needed • Asymptotic theory often does work at practical sample sizes. • It can reveal properties that are universally valid; simulation studies are limited to the scenarios investigated. • Theory can lead toward methodological solutions (Cook and Kosorok, 2004, JASA). • Theory can drive scientific discovery. • Some results are beautiful.
Data Mining Versus Inference • Data mining is summarizing and representing data no matter how complicated • Inference is determining valid measures of uncertainty • Patterns obtained from data mining can be misleading • Inference without data mining may miss important structure
The Core of Statistics • Statistics is the science of science • How do we learn from our world and draw meaningful and valid conclusions from it? • Need both data mining and valid inference • Requires a unique kind of intuition • Needs many different intellectual perspectives • One of the most challenging of all fields
Everyone Needs Core Literacy • All statisticians need to know enough theory to have core literacy about statistics and to be able to problem solve • All statisticians need to know enough about applications to know what is important • All biostatisticians need to know enough statistical methods to be useful in practice • The purpose of a Ph.D. in Biostatistics is to enable the creation of new methodology
Semiparametric Inference • The study of statistical models with parametric and/or nonparametric parts • Can achieve trade-off between scientific meaning and model “robustness” • Estimation and inference are often hard • There exists an efficiency bound for parametric and some nonparametric parts • NPMLE, testing and estimating equations
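For reference, the efficiency bound mentioned above is usually stated as follows (standard semiparametric notation assumed here; this is a textbook fact rather than material from these slides). With \(\dot\ell_{\theta_0}\) the score for the parametric part and \(\Pi(\cdot \mid \mathcal{T}_{\eta})\) the projection onto the nuisance tangent space, the efficient score and efficient information are
\[
\tilde\ell_{\theta_0} = \dot\ell_{\theta_0} - \Pi\bigl(\dot\ell_{\theta_0} \mid \mathcal{T}_{\eta}\bigr),
\qquad
\tilde I(\theta_0) = E\bigl[\tilde\ell_{\theta_0}\,\tilde\ell_{\theta_0}^{\top}\bigr],
\]
and no regular estimator of \(\theta_0\) has asymptotic variance smaller than \(\tilde I(\theta_0)^{-1}\).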
Empirical Processes • Tools for complex model inference and high dimensional data • Can determine universal properties of semiparametric methods: • Consistency • Rate of convergence • Limiting distributions • Valid inference (empirical process bootstrap) • Empirical processes are everywhere
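A minimal sketch of the last two bullets (my own illustration, assuming only numpy): bootstrap the empirical process by resampling the data and recomputing the supremum distance between the resampled and original empirical CDFs.

import numpy as np

rng = np.random.default_rng(1)

def ecdf_values(sample, grid):
    # Right-continuous empirical CDF of `sample` evaluated at the points in `grid`.
    return np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

# Observed data; all jump points of the empirical CDFs lie in the sorted sample.
x = rng.standard_normal(200)
grid = np.sort(x)
Fn = ecdf_values(x, grid)

# Nonparametric bootstrap: sup_t |F_n*(t) - F_n(t)| over resampled data sets
# approximates the sampling variability of the empirical process sup_t |F_n - F|.
boot = np.array([
    np.max(np.abs(ecdf_values(rng.choice(x, size=len(x), replace=True), grid) - Fn))
    for _ in range(1000)
])
print("bootstrap 95th percentile:", np.quantile(boot, 0.95))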
The Road Ahead • Whatever you choose to do, the core statistical theory classes will help you. • Be patient as you learn. • Be willing to work hard (struggle is good). • It takes many different kinds of thinkers with different learning styles. • There are important discoveries to be made in both applications and theory.