Exploratory Failure Time Analysis and Copy Number Variation Inference

Cheng Cheng, Department of Biostatistics, St. Jude Children’s Research Hospital

Outline • Part I: Background • Part II: Exploratory Failure Time Analysis • Part III: Copy Number Variation Inference

I. Background • Nucleus, nucleotides, DNA, chromosomes, SNP • SNP arrays • Genome Wide Association Study (GWAS) • Multiple tests • Cause-specific failure and competing risk • Cumulative incidence function, Gray's test, Fine-Gray hazard rate regression model • Censoring at the time of a competing event: OK for testing stochastic independence, biased for estimation

Animal Cell Organelles • Nucleus • Nucleolus • Endoplasmic Reticulum • Centriole • Centrosome • Golgi • Cytoskeleton • Cytosol • Mitochondrion • Secretory Vesicle • Lysosome • Peroxisome • Vacuole

Nucleus Functions The cell nucleus is an organelle that forms the package for our genes and their controlling factors. • Store genes on chromosomes • Organize genes into chromosomes to allow cell division. • Transport regulatory factors & gene products via nuclear pores • Produce messages (messenger Ribonucleic acid or mRNA) that code for proteins • Produce ribosomes in the nucleolus • Organize the uncoiling of DNA to replicate key genes

Chromosome inside nucleus DNA = deoxyribonucleic acid • What is a chromosome? • In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes. • Each chromosome is made up of DNA tightly coiled many times around proteins called histones that support its structure.

Human chromosomes • In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46. • Twenty-two of these pairs, called autosomes, look the same in both males and females. • The 23rd pair, the sex chromosomes, differs between males and females. • Females have two copies of the X chromosome. • Males have one X and one Y chromosome.

Chromosome Structure • Each chromosome has a constriction point called the centromere, which divides the chromosome into two sections, or “arms.” • The short arm of the chromosome is labeled the “p arm.” The long arm of the chromosome is labeled the “q arm.” • Each chromosome has two chromatids as a result of duplication of the DNA which took place during interphase. The two chromatids are linked together at a centromere.

DNA structure DNA is a double-stranded molecule twisted into a helix (think of a spiral staircase). Each spiraling strand, comprised of a sugar-phosphate backbone and attached bases, is connected to a complementary strand by non-covalent hydrogen bonding between paired bases. The bases are adenine (A), thymine (T), cytosine (C) and guanine (G).

The genetic code is specified by the four nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine). A Single Nucleotide Polymorphism (SNP) is a change of a single nucleotide within a person's DNA sequence, such as when a T replaces one of the other three nucleotide letters (A, C, or G). SNPs occur in human DNA at a frequency of about one every 1,000 bases. These variations can be used to track inheritance in families.

SNP Array Design • Example SNP: T/G within the genomic sequence (5´ to 3´) • SNP probe = 25 bases • Probe quartet: perfect match and mismatch for allele ‘A’, perfect match and mismatch for allele ‘B’

Hundreds of Millions of Pixel Intensities…..

Genotype Calling: AA, AB, BB

Genome Wide Association Study (GWAS) • Typically 400,000 to 900,000 SNPs are investigated in a single study • The number of subjects in a study typically ranges from a few hundred to 20,000 • Each SNP takes one of three possible (generic) values “AA”, “AB”, “BB”, often coded as 0, 1, 2 • Each SNP in each individual has exactly one of the values 0, 1, or 2 • A small number of phenotypes: disease status (yes/no) or a quantitative trait • This lecture: time to a cause-specific failure • Data: n subjects with n observed trait values Y1, …, Yn and, for the ith SNP, n observed SNP values Xi1, …, Xin • Inference: test for stochastic dependence of the ith SNP with the trait based on the dataset (Xij, Yj), j=1,…,n; repeating this for each SNP gives many tests of the null hypothesis of stochastic independence.

Massive Multiple Tests • “Genome-wide significance”, Bonferroni-type adjustment: declare statistical significance if P ≤ 10^-7 (0.05/500K) • FDR and q value: Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS-B, 57, 289–300. Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. JRSS-B, 66, 187–205. • Profile information criteria: Cheng, C., Pounds, S., Boyett, J. M. et al. (2004). Statistical significance threshold criteria for analysis of microarray gene expression data. Statistical Applications in Genetics and Molecular Biology 3, Article 36. URL //www.bepress.com/sagmb/vol3/iss1/art36. Cheng, C. (2006). An adaptive significance threshold criterion for massive multiple hypotheses testing. IMS Lecture Notes - Monograph Series, 2nd Lehmann Symposium – Optimality, 49, 51–76.
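The two thresholding ideas on this slide can be compared on a toy example. The sketch below computes the Bonferroni-type cutoff 0.05/500K and applies the Benjamini-Hochberg (1995) step-up rule to a short list of p-values. The function names and the p-values are made up for illustration; the q-value and profile information criteria cited above are not implemented here.

```python
def bonferroni_threshold(alpha, m):
    """Per-test cutoff that keeps the family-wise error rate at alpha over m tests."""
    return alpha / m


def benjamini_hochberg(pvalues, q):
    """Indices of hypotheses rejected by the BH step-up rule at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    # Find the largest rank k whose ordered p-value is at most k * q / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    return sorted(order[:k_max])


# Genome-wide cutoff quoted on the slide: 0.05 / 500,000 tests.
genome_wide_cutoff = bonferroni_threshold(0.05, 500_000)

# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
rejected = benjamini_hochberg(pvals, q=0.05)
print(genome_wide_cutoff, rejected)
```

With these eight p-values the BH rule rejects the two smallest, whereas none of them would pass the genome-wide Bonferroni cutoff.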

Cause-specific failure and competing risk • Alive • Relapse: failure type 1 (of interest) • 2nd cancer: failure type 2 (competing risk/event) • Die in remission: failure type 3 (competing risk) • Klein, J. P. (2010) Competing risks. WIREs Comp Stat, www.wiley.com/wires/compstats, DOI: 10.1002/wics.83

Cumulative incidence function (CIN) • Data (T, δ); F_j(t) = Pr(T ≤ t and δ = j) • Gray’s test: compares the CIN across K groups; an analog of the weighted log-rank test. Gray, R. J. (1988) A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann. Statist. 16, 1141-1154. • Fine-Gray CIN hazard rate regression model: an analog of Cox’s hazard rate regression model. Fine, J. P., Gray, R. J. (1999) A proportional hazards model for the subdistribution of a competing risk. JASA, 94, 496-509. • Censoring at the time of a competing event
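To make F_j(t) = Pr(T ≤ t and δ = j) concrete, here is a minimal sketch of the standard nonparametric estimator of the cumulative incidence function. It is not Gray's test or the Fine-Gray model. The function name, the coding of δ (0 = censored, 1, 2, … = failure types) and the six-subject toy data are assumptions made for illustration.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Nonparametric estimate of F_j(t) as a list of (time, value) steps.

    causes[i] is 0 for a censored subject and a positive integer for the
    failure type observed at times[i].
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0   # probability of being free of any failure just before t
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_all = d_j = n_t = 0
        # Collect everyone whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            cause = data[i][1]
            n_t += 1
            d_all += cause != 0
            d_j += cause == cause_of_interest
            i += 1
        if d_all > 0:
            cif += surv * d_j / n_at_risk      # increment of F_j at t
            surv *= 1.0 - d_all / n_at_risk    # all-cause survival update
            curve.append((t, cif))
        n_at_risk -= n_t
    return curve


# Toy data: 1 = relapse, 2 = competing event, 0 = censored.
times = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 0, 1, 2, 1]
cif_relapse = cumulative_incidence(times, causes, 1)[-1][1]
cif_competing = cumulative_incidence(times, causes, 2)[-1][1]

# Treating competing events as censored turns the estimate into 1 - Kaplan-Meier.
naive_causes = [c if c == 1 else 0 for c in causes]
naive_relapse = cumulative_incidence(times, naive_causes, 1)[-1][1]
print(cif_relapse, cif_competing, naive_relapse)
```

In this toy data the naive estimate that censors at the competing event is larger than the cumulative incidence estimate, which illustrates why censoring at the competing event is described as biased for estimation.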

II. Exploratory Failure Time Analysis • Large-scale Genomic Association Analysis • Feature (variable) screening and feature extraction • A Motivating Example from a GWAS • Correlation Profile Test (CPT) • Hypotheses • Correlation profile function • CPT statistic • Hybrid permutation test of significance • A Simulation Study: Strength and Weakness • Example: Analysis of SNPs on Chromosome 9 • Summary and Remarks • Feature Extraction (sparse regression) • Example: “Prognostic” Gene (RNA) expression • Summary and remarks

Large-scale Genomic Association Analysis • Feature (variable) screening • Find individual genomic features (factor/predictor variables) associated with one or more phenotypes (response variables) • GWAS • Association: stochastic dependence • Parametric/semi-parametric approaches: linear models, GLMs, hazard rate (Cox) regression • Feature extraction • Find (linear) combinations (or sets) of genomic features (variables) associated with one or more phenotypes • Determine sets of variables using biological knowledge (gene signaling pathways, functional/ontology groups, etc.): GSEA • Variable/Model selection methods: ridge regression, LASSO, SCAD, SEAMLESS, sparse regression

A Motivating Example • GWAS to screen SNP markers for risk of relapse in childhood leukemia patients

A Motivating Example Need: a more omnibus and algorithmically robust test procedure

Correlation Profile Test (CPT) • Model, Null and alternative hypotheses (classical survival setting)

Correlation Profile Test (CPT) • Sample correlation profile function (involves the observed event point process of individual i) • A rank transformation can be applied for continuous X

Correlation Profile Test (CPT)

Correlation Profile Test (CPT) • CPT statistic, hybrid permutation test

Back to the SNP Example

A Simulation Study • A model mimicking the SNP example • Generate X: Pr(X=0)=0.98, Pr(X=1)=0.015, Pr(X=2)=0.005 • Generate censoring time T_C ~ Exp(0.2) • Generate failure indicator I_F | X ~ Bernoulli(π_F), π_F = 0.2 exp{-θ(X-2)} • If I_F = 1, generate failure time T_F | X ~ LogNormal(βX, 1); else set T_F = ∞ • Generate competing risk indicator I_R ~ Bernoulli(0.1) • If I_R = 1, generate competing failure time T_R ~ Unif(0,7); else set T_R = ∞ • Observed failure time T = min{T_C, T_F, T_R} • Repeat the above n times to simulate n individuals
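The data-generating steps on this slide can be sketched in a few lines of standard-library Python. Several choices below are assumptions rather than details given in the slides: Exp(0.2) is read as rate 0.2, π_F is capped at 1 so that it is a valid probability, LogNormal(βX, 1) is read as log-mean βX and log-sd 1, the values of θ and β are arbitrary, and the status code (0 = censored, 1 = failure of interest, 2 = competing event) is added for convenience.

```python
import math
import random


def simulate_subject(rng, theta, beta):
    """Simulate one subject; returns (x, t, status)."""
    # Genotype X with Pr(0)=0.98, Pr(1)=0.015, Pr(2)=0.005.
    u = rng.random()
    x = 0 if u < 0.98 else (1 if u < 0.995 else 2)
    # Censoring time; assumption: 0.2 is the rate of the exponential.
    t_c = rng.expovariate(0.2)
    # Failure probability; the cap at 1 is an assumption, not in the slides.
    pi_f = min(1.0, 0.2 * math.exp(-theta * (x - 2)))
    t_f = rng.lognormvariate(beta * x, 1.0) if rng.random() < pi_f else math.inf
    # Competing event occurs with probability 0.1, uniformly on (0, 7).
    t_r = rng.uniform(0.0, 7.0) if rng.random() < 0.1 else math.inf
    t = min(t_c, t_f, t_r)
    status = 1 if t == t_f else (2 if t == t_r else 0)
    return x, t, status


def simulate_cohort(n, theta, beta, seed=0):
    """Repeat the single-subject simulation n times with a seeded generator."""
    rng = random.Random(seed)
    return [simulate_subject(rng, theta, beta) for _ in range(n)]


# Hypothetical parameter values, for illustration only.
cohort = simulate_cohort(n=1000, theta=0.5, beta=0.3, seed=1)
print(cohort[:3])
```

Because the censoring time is always finite, every simulated subject has a finite observed time even when neither failure type occurs.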

A Simulation Study • A model mimicking the SNP example • Table: power estimates (Pwr est.) and standard errors (s.e.)

A Simulation Study • Exact proportional hazard, continuous predictor

A Simulation Study • Continuous predictor, deviation from proportional hazard

A Simulation Study • Ordinal predictor (AA, AB, BB), deviation from proportional hazard • Opposite scenario of the SNP example

Example: Germline SNPs on Chr 9 and risk of relapse in childhood Acute Lymphoblastic Leukemia (ALL) • Outcomes: alive; relapse, failure type 1 (of interest); 2nd cancer, failure type 2 (competing risk/event); die in remission, failure type 3 (competing risk) • 21,909 SNPs on Chr 9, obtained by Affymetrix 100K and 500K SNP arrays, were tested for association with relapse of childhood ALL

Example: Germline SNPs on Chr 9 and risk of relapse in childhood Acute Lymphoblastic Leukemia (ALL) • n=707 subjects from the two most recent clinical trials at SJCRH • 21,909 SNPs • CPT performed on each SNP, with 200 permutations in the hybrid permutation test • Significance determined by the profile information criterion Ip (Cheng et al. 2004); 200 SNPs were considered statistically significant, with estimated FDR=48.7%

ρ̂(t_j), j = 1, …, J = 9 • Test statistic = -3.478

Genotype frequencies: AA 5.1%, AB 28.7%, BB 66.2% • P values: Gray’s test 0.0451; Fine-Gray regression 0.0380 (coeff = -0.3905)

ABL1 Gene Germline SNP
Genotype counts (proportion): AA 36 (0.051); AB 201 (0.287); BB 464 (0.662); Tot 701 (1.00)
Allele counts (proportion): A 273 (0.195); B 1129 (0.805)
Genotype counts by group (AA, AB, BB):
T13B intermediate/high risk: 12, 27, 75 (0.152)
T13B low risk: 7, 33, 67 (0.065)
T15 standard/high risk: 11, 74, 161 (0.047)
T15 low risk: 6, 67, 161 (0.026)
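The allele counts and proportions on this slide follow from the genotype counts by simple counting: each AA subject carries two A alleles, each AB subject one A and one B, and each BB subject two B alleles. The short check below reproduces the slide's numbers; the variable names are illustrative.

```python
# Genotype counts for the ABL1 germline SNP, taken from the slide.
genotype_counts = {"AA": 36, "AB": 201, "BB": 464}

n_subjects = sum(genotype_counts.values())
genotype_freq = {g: c / n_subjects for g, c in genotype_counts.items()}

# Each subject contributes two alleles.
count_a = 2 * genotype_counts["AA"] + genotype_counts["AB"]
count_b = 2 * genotype_counts["BB"] + genotype_counts["AB"]
n_alleles = 2 * n_subjects

freq_a = count_a / n_alleles
freq_b = count_b / n_alleles
print(n_subjects, count_a, count_b, round(freq_a, 3), round(freq_b, 3))
```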

Extension to Recurrent Events • Multiple event times • Model, null and alternative hypotheses • # events occurring ≤ t

Extension to Recurrent Events • N = # events occurring ≤ t

Summary and Remarks • Correlation Profile Test: • Computationally more robust • More omnibus: covers certain deviations from the semi-parametric hazard regression model • Highly competitive with other non-parametric procedures (Gray’s test, Jung’s test) • Relative deficiency vs. Cox model under PH ?? • Extension to recurrent-event phenotypes • Informative censoring in the presence of competing risk

Feature Extraction (Sparse regression) • Identify (linear) combinations of covariate variables that are associated with the failure phenotype

Feature Extraction (Sparse regression) • Sparse regression by the General Path Seeking (GPS) algorithm (Friedman 2008) • Exploratory failure time analysis by weighted least squares: the association criteria • The modified GPS algorithm to find a solution • A small simulation study • Example: Gene (RNA) expression “prognostic” for relapse of childhood ALL

Sparse Regression by General Path Seeking (GPS, Friedman 2008) http://www-stat.stanford.edu/~jhf//ftp/GPSpub.pdf • General setup • Lasso (Tibshirani 1996), grouped lasso (Yuan and Lin 2006), SCAD (Fan and Li 2001), elastic net (Zou and Hastie 2005), SEAL (Xihong Lin, 2009 JSM)

Feature Extraction (Sparse regression) The general GPS algorithm

Feature Extraction (Sparse regression) • Exploratory failure time analysis: setup

Feature Extraction (Sparse regression) • Association criteria: penalized weighted least squares

Feature Extraction (Sparse regression) • The power penalty function |β|^γ, 0 < γ ≤ 1 • Panels: γ=0.0001, γ=0.5, γ=1
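A small numerical sketch may help show how the power penalty behaves across the three γ values on this slide. It assumes the penalty is summed over coefficients, i.e. P(β) = Σ_j |β_j|^γ; the function name and the coefficient vector are illustrative. At γ = 1 the sum is the lasso (L1) penalty, and as γ approaches 0 it approaches the number of nonzero coefficients.

```python
def power_penalty(beta, gamma):
    """Sum of |beta_j| ** gamma over coefficients, for 0 < gamma <= 1."""
    if not 0 < gamma <= 1:
        raise ValueError("gamma must satisfy 0 < gamma <= 1")
    return sum(abs(b) ** gamma for b in beta)


# Hypothetical coefficient vector with one exact zero.
beta = [0.0, 0.5, -2.0]
for gamma in (0.0001, 0.5, 1.0):
    print(gamma, power_penalty(beta, gamma))
```

For this vector the penalty is 2.5 at γ = 1 (the L1 norm) and very close to 2 at γ = 0.0001 (the count of nonzero coefficients).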
