Data Mining in Schizophrenia Research -preliminary. Stefan Arnborg, KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet ’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002.
Data Mining in Schizophrenia Research -preliminary Stefan Arnborg, KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet ’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002 http://www.nada.kth.se/~stefan
HUBIN - a project to accelerate research in human brain diseases • Carefully selected patients, relatives and controls • Each participant characterized over many domains • DNA stored in bio-bank • Each research team collects high-quality information, analyzes it, and stores it in an archive for inter-domain analyses.
Hubin organization • Management group: Håkan Hall, Assoc. Prof. (project manager); Stig Larsson, T.D. hc; Göran Sedvall, Prof.; Stefan Arnborg, Prof.; Tom McNeil, Prof.; Lars Terenius, Prof. • Scientific advisory board: Göran Sedvall, Chairman; Nancy Andreasen, Univ of Iowa; Paul Greengard, Rockefeller Univ; Tomas Hökfelt, Karolinska Inst. • Ethical group: Göran Sedvall, Chairman • Hubin AB: Stig Larsson, Chairman; Håkan Hall, CEO • Data domain responsibles • Project staff
Leading causes of disability in the world, WHO (1990)
Cause of disability                          Total (millions)   % of world total
1. Unipolar major depression                 50.8               10.7
2. Iron deficiency anemia                    22.0               4.7
3. Falls                                     22.0               4.6
4. Alcohol use                               15.8               3.3
5. Chronic obstructive pulmonary disease     14.7               3.1
6. Bipolar disorder                          14.1               3.0
7. Congenital anomalies                      13.5               2.9
8. Osteoarthritis                            13.3               2.8
9. Schizophrenia                             12.1               2.6
10. Obsessive compulsive disorder            10.2               2.2
Schizophrenia - Questions and Clues • Cause(s) of schizophrenia not known. • Medication effective against some symptoms - discovered by chance 100-2000 years ago. • Does not appear in animals - no experimental clues. • Explanation models vary over time. • Disturbed neuronal circuitry in schizophrenia? (currently hottest hypothesis) • Influenced by genotype and/or environment? (clustering in families)
Schizophrenia - Questions and Clues • Which processes result in disease? • Traces of disturbed development visible in MRI (anatomy) and blood tests? • Genetic risk factors? • Causal pathways? • MAIN PROBLEM: Connect psychiatry to physiology
Preliminary analysis • Test case: • 144 subjects: 61 affected, 83 controls • Variables: • Diagnosis (DSM-IV) • Demography (age, gender, ...) • Blood tests (liver, heart, …) • Genetics (20 SNPs, receptor, growth factors, …) • Anatomy (MRI) • Neuropsychology (working memory, reactions) • Clinical test batteries (type of delusions, history, medication)
Types of images used in HUBIN [Figure panels: LAR, MRI, PET, ISHH] • In vivo imaging: Magnetic resonance images (MRI), Functional magnetic resonance images (fMRI), Positron emission tomography (PET), Single photon emission tomography (SPECT) • In vitro imaging (whole hemispheres): Autoradiography, In situ hybridization
Brain boxes Picture from BRAINS II manual, Magnotta et al, University of Iowa
Manually drawn vermis regions ROIs drawn by Gaku Okugawa
Single Nucleotide Polymorphism • RNA: AUG UUC CAU UAU UGU gives Protein A: Phe His Tyr Cys • RNA: AUG UUU CAU UAU UAU gives Protein A´: Phe His Tyr Tyr • Non-coding SNP: UUC to UUU (still Phe) • Coding SNP: UGU to UAU (Cys becomes Tyr) • Protein A can be slightly different from A´
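The distinction between the two SNPs on this slide can be checked mechanically. The sketch below translates the two RNA strings from the slide with a partial codon table (only the codons that occur here, taken from the standard genetic code) and labels each differing codon, following the slide's terminology where "non-coding" means the amino acid is unchanged. The function names are illustrative, not part of any HUBIN tooling.

```python
# Partial codon table: only the codons that appear on the slide
# (standard genetic code; AUG is the start codon, Met).
CODON_TABLE = {
    "AUG": "Met", "UUC": "Phe", "UUU": "Phe",
    "CAU": "His", "UAU": "Tyr", "UGU": "Cys",
}

def translate(rna):
    """Split an RNA string into codons and map each to its amino acid."""
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    return [CODON_TABLE[c] for c in codons]

def classify_snps(ref, alt):
    """List (codon index, ref codon, alt codon, kind) for each differing codon.

    'coding' means the amino acid changes; 'non-coding' (the slide's term)
    means the substitution leaves the amino acid unchanged.
    """
    found = []
    for i in range(0, len(ref), 3):
        r, a = ref[i:i + 3], alt[i:i + 3]
        if r != a:
            same = CODON_TABLE[r] == CODON_TABLE[a]
            found.append((i // 3, r, a, "non-coding" if same else "coding"))
    return found

REF = "AUGUUCCAUUAUUGU"   # variant giving Protein A
ALT = "AUGUUUCAUUAUUAU"   # variant giving Protein A'

if __name__ == "__main__":
    print(translate(REF))
    print(translate(ALT))
    print(classify_snps(REF, ALT))
```

Only the last codon changes the protein, which is why the two proteins differ in a single residue.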
Genes studied • DBH dopamine beta-hydroxylase • DRD2 dopamine receptor D2 + • DRD3 dopamine receptor D3 • HTR5A serotonin receptor 5A • NPY neuropeptide Y • SLC6A4 serotonin transporter • BDNF brain derived neurotrophic factor • NRG1 neuregulin +
Elementary Visualizations: MRI intracranial volume [Figure: cumulative distribution of intracranial volume (ml) for affected (+) and controls]
Elementary Visualizations: MRI data [Figure: cumulative distribution of total CSF volumes (ml) for affected (+) and controls; p < 0.0002]
Blood data: Gamma GT - alcohol marker [Figure: cumulative distribution of Gamma GT for affected (+) and controls; p < 0.01]
Gender differences: MRI [Figure: two panels, Men and Women, each showing Subcortical white for affected (+) and controls]
Which methods to use? • Visualizations, cdf and scatter plots, give intuitive grasp of variables - problems with many interrelated variables • Statistical modelling required to decide significance of visible trend, and to rank effects
Statistical methods • Bayesian methods intuitive and rational - but conventional testing required for publications • Linear models - need to account for mixing and over-dispersion (Glenn Lawyer thesis project). • Discretization and Bayesian analysis of discrete distributions - intuitive, but information lost • Non-parametric randomization tests - most sensitive and accommodate modern multiple testing paradigms
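A non-parametric randomization test of the kind mentioned in the last bullet can be sketched as follows. This is a minimal illustration, assuming the test statistic is the absolute difference in group means and that group labels are exchangeable under "no effect"; the slides do not say which statistic was actually used, and the numbers in the usage example are made up, not HUBIN data.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(affected, controls, n_perm=2000, seed=0):
    """Two-sided randomization test on the difference of group means.

    Returns an estimated p-value: the fraction of random relabellings whose
    absolute mean difference is at least as large as the observed one
    (with the usual +1 correction so the estimate is never exactly 0).
    """
    rng = random.Random(seed)
    observed = abs(mean(affected) - mean(controls))
    pooled = list(affected) + list(controls)
    k = len(affected)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # random relabelling
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    # Hypothetical toy measurements for two groups.
    group_a = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    group_b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(permutation_test(group_a, group_b))
```

Because the p-value comes from relabelling the observed data, no distributional assumption is needed, and the same shuffles can be reused across many variables when correcting for multiple tests.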
Bayes’ factor • Choice between two hypotheses, H1 and H2, given experimental/observational data D: $\frac{P(H_1|D)}{P(H_2|D)} = \frac{P(D|H_1)}{P(D|H_2)} \cdot \frac{P(H_1)}{P(H_2)}$, that is, posterior odds = Bayes factor × prior odds
Hypotheses in test matrix • H1: (no effect) a data column is generated independently of diagnosis (composite model) • H2: the data for controls are generated by one composite model, for affected by another one.
Hierarchical models • Continuous model: H: • Parameterized model: • Hierarchical model: prior distribution f(λ) for λ, • Inference for the parameter
Model adequate? • Best tested with classical p-values. • Determine posterior for parameter: • Design test function • Compute p-value: • Reject model if p small, e.g., <1%, <5%
Bayes’ example • Result D from test: s heads, f tails, n = s + f • H0: Coin is balanced, $P(D|H_0) = 2^{-n}$ • $H_p$: Coin has head probability p, $P(D|H_p) = p^s (1-p)^f$ • H1: $H_p$ with uniform prior for p, hierarchical: $P(D|H_1) = \int_0^1 P(D|H_p)\,dp = \frac{s!\,f!}{(n+1)!}$
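The coin example can be evaluated numerically. The sketch below computes the two marginal likelihoods from the slide and their ratio, the Bayes factor of H1 (uniform prior on the head probability) against H0 (balanced coin). The function names are illustrative; the formulas are the ones stated on the slide, for a specific observed sequence with s heads and f tails.

```python
from math import factorial

def likelihood_balanced(s, f):
    """P(D|H0) = 2^(-n) for a specific sequence with s heads and f tails."""
    return 0.5 ** (s + f)

def marginal_uniform(s, f):
    """P(D|H1) = integral of p^s (1-p)^f over [0, 1] = s! f! / (n+1)!."""
    n = s + f
    return factorial(s) * factorial(f) / factorial(n + 1)

def bayes_factor(s, f):
    """Bayes factor of H1 against H0; values above 1 favour a biased coin."""
    return marginal_uniform(s, f) / likelihood_balanced(s, f)

if __name__ == "__main__":
    for s, f in [(5, 5), (8, 2), (10, 0)]:
        print(s, f, bayes_factor(s, f))
```

With equal prior odds the Bayes factor equals the posterior odds, so an even split of heads and tails favours the balanced coin while a run of ten heads strongly favours H1.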
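Graphical models • Edges X-Y and X-Z only (Y and Z independent given X): f(x,y,z) = f(x,z) f(x,y) / f(x) • No edges: f(x,y,z) = f(x) f(y) f(z) • Complete graph on X, Y, Z: f(x,y,z), no factorization
Multivariate characterization by graphical models: MRI volumes, blood, demography [Figure: graph with nodes Dia, TemCSF, BrsCSF, SubCSF, TotCSF]
Adding Vermis variables [Figure: graph with nodes Dia, PSV, TemCSF, BrsCSF]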
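V-structures, causality • Two variables: X → Y and Y → X are indistinguishable, since f(x,y) = f(y|x) f(x) = f(x|y) f(y) • A → B → C, A ← B ← C and A ← B → C all imply A ⊥ C | B: indistinguishable • V-structure A → B ← C: A ⊥ C, but not A ⊥ C | B • V-structures detectable from observational data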
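The claim that V-structures are detectable from observational data can be illustrated by simulation. The sketch below is a toy under stated assumptions (linear relations, unit-variance Gaussian noise, and partial correlation as a stand-in for a conditional independence test); none of this is taken from the HUBIN analysis. It generates a chain A → B → C and a V-structure A → B ← C and compares the marginal and partial correlations of A and C.

```python
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly removing zs."""
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

def simulate(kind, n=5000, seed=1):
    """Draw (a, b, c) from a linear Gaussian chain or V-structure."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]
    if kind == "chain":            # A -> B -> C
        b = [x + rng.gauss(0, 1) for x in a]
        c = [x + rng.gauss(0, 1) for x in b]
    elif kind == "v":              # A -> B <- C
        c = [rng.gauss(0, 1) for _ in range(n)]
        b = [x + y + rng.gauss(0, 1) for x, y in zip(a, c)]
    else:
        raise ValueError(kind)
    return a, b, c

if __name__ == "__main__":
    for kind in ("chain", "v"):
        a, b, c = simulate(kind)
        print(kind, round(corr(a, c), 3), round(partial_corr(a, c, b), 3))
```

In the chain, A and C are correlated but the dependence vanishes given B; in the V-structure the pattern is reversed, which is what makes it identifiable without intervention.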
Pairs associated to Diagnosis [Figure: graphs over D, Y and Z] Y and Z co-vary differently for Affected and Controls
Age-dependency of Posterior Superior Vermis [Figure: Post sup vermis against Age at MRI for affected (+) and controls]
No co-variation between Posterior inferior vermis and parietal white for affected [Figure: Post inf vermis against Parietal white for affected (+) and controls]
PSV has best explanatory power [Figure: Posterior superior vermis, affected - healthy; affected (+) and controls]
Decision tree for Diagnosis: MRI data [Figure: decision tree; A = schiz, C = controls, ( ) = misclassified]
Classification explains data! (Can Mert thesis project) [Figure: graphs over X, Y, Z, W, one with an added node H]
Autoclass1 [Figure: A = schiz, C = controls; axis: Total gray]
Weak signals in genetics data • Numerous investigations have indicated ‘almost significant’ signals linking SNPs to diagnosis • Typically, these findings cannot be confirmed in other studies - populations are genetically heterogeneous and measurements non-standardized. • We try to connect SNPs both to diagnosis and to other phenotypical variables • Multiple testing and weak signal problems.
Genetics data - weak statistics
Gene      SNP                  Type   Genotypes (counts)
DRD3      SerGly               A/C    49    59    14
DRD2      Ser311Cys            C/G    118   4     0
NPY       Leu7Pro              A/C    1     7     144
DBH       Ala55Ser             G/T    98    24    0
BDNF      Val66Met             A/G    5     37    80
HTR5A     Pro15Ser             C/T    109   11    2
PNOC      Gln172Arg            A/G    11    37    28
SLC6A4    (del(44bp) in pr)    S/L    20    60    42
Empirical distribution by genotype: Gene BDNF (schiz + controls) [Figure: cumulative distribution of Frontal CSF for genotypes A/A, A/G, G/G]
Bonferroni-Hochberg-Benjamini methods: MRI and lab data [Figure: observed p-values compared with ‘no effect’ p-values; axis: Number of p-values; FDRi 71, FDRd 62] Benjamini & Yekutieli, Annals of Math Stat, (ta)
Multiple comparisons: what is the significance of minimum p-values 1, 1, 2, 3% in 20 tests? What is the probability of obtaining a more extreme result?
Compensating multiple comparisons • Bonferroni 1937: For level α and n tests, use level α/n • Hochberg 1988: step-up procedure • Benjamini, Hochberg 1995: False Discovery Rate • J. Storey, 2002: pFDRi, pFDRd • Bayesian interpretations being developed (Wasserman & Genovese, 2002)
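The first and third corrections in this list are short enough to sketch directly. The code below implements the Bonferroni rule and the Benjamini-Hochberg step-up rule for controlling the false discovery rate, assuming independent (or positively dependent) tests; the p-values in the usage example are made up for illustration and are not from the HUBIN data.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject test i when p_i <= alpha / n."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level alpha.

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / n,
    and reject the k smallest p-values.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / n:
            k = rank                      # keep the largest qualifying rank
    rejected = [False] * n
    for i in order[:k]:
        rejected[i] = True
    return rejected

if __name__ == "__main__":
    # Hypothetical p-values from ten tests.
    toy = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(sum(bonferroni(toy)), "rejected by Bonferroni")
    print(sum(benjamini_hochberg(toy)), "rejected by Benjamini-Hochberg")
```

On these toy values Bonferroni keeps only the smallest p-value, while the step-up rule also keeps the second, which shows the gain in sensitivity from controlling the false discovery rate instead of the family-wise error.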
Diagnosis-genotype: 21 tests on three genes
bdnf      drd2      nrg1
0.1137    0.0749    0.8744
0.7293    0.7276    0.0096

bdnf      drd2      nrg1
0.1136    0.0735    0.8709
0.0801    0.2213    0.7666
0.0316    0.0823    0.6426
0.5499    1.0000    0.0244
0.7314    0.7312    0.0103
p-values 3%, 2%, 1%, 1% in 20 tests: 7% - not quite significant! But better than Bonferroni: 20% (20 tests × smallest p-value 1%)