Data Mining in Schizophrenia Research. Stefan Arnborg, KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet ’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002. http://www.nada.kth.se/~stefan.
Schizophrenia - Questions and Clues
• Cause(s) of schizophrenia not known.
• Medication effective against some symptoms - discovered by chance 100-2000 years ago.
• Does not appear in animals - no experimental clues.
• Explanation models vary over time.
• Disturbed neuronal circuitry in schizophrenia? (currently hottest hypothesis)
• Influenced by genotype and/or environment? (clustering in families)
Schizophrenia - Questions and Clues
• Which processes result in disease?
• Traces of disturbed development visible in MRI (anatomy) and blood tests?
• Genetic risk factors?
• Causal pathways?
• MAIN PROBLEM: Connect psychiatry to physiology
Preliminary analysis
• Test case:
  • 144 subjects: 61 affected, 83 controls
• Variables:
  • Diagnosis (DSM-IV)
  • Demography (age, gender, ...)
  • Blood tests (liver, heart, ...)
  • Genetics (20 SNPs, receptor, growth factors, ...)
  • Anatomy (MRI)
  • Neuropsychology (working memory, reactions)
  • Clinical test batteries (type of delusions, history, medication)
Brain boxes Picture from BRAINS II manual, Magnotta et al, University of Iowa
Manually drawn vermis regions ROIs drawn by Gaku Okugawa
Single Nucleotide Polymorphism
RNA: AUG UUC CAU UAU UGU → Protein A: Phe His Tyr Cys
RNA: AUG UUU CAU UAU UAU → Protein A': Phe His Tyr Tyr
• Non-coding SNP: UUC → UUU (Phe unchanged)
• Coding SNP: UGU → UAU (Cys → Tyr)
Protein A can be slightly different from A'
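The coding versus non-coding distinction on this slide can be checked mechanically. The sketch below is illustrative only: it uses a partial codon table covering just the codons on the slide (taken from the standard genetic code), and the helper names `translate` and `classify_snps` are hypothetical. It includes the start codon AUG (Met), which the slide's protein labels leave out.

```python
# Partial codon table: only the codons that appear on the slide.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUC": "Phe",
    "UUU": "Phe",
    "CAU": "His",
    "UAU": "Tyr",
    "UGU": "Cys",
}

def translate(rna):
    """Translate an RNA string codon by codon using CODON_TABLE."""
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    return [CODON_TABLE[c] for c in codons]

def classify_snps(reference, variant):
    """Return (position, ref_base, var_base, kind) for each differing base.

    kind is "coding" when the amino acid of the affected codon changes,
    otherwise "non-coding" (the slide's terminology).
    """
    ref_protein = translate(reference)
    var_protein = translate(variant)
    result = []
    for pos, (a, b) in enumerate(zip(reference, variant)):
        if a != b:
            codon_index = pos // 3
            changed = ref_protein[codon_index] != var_protein[codon_index]
            result.append((pos, a, b, "coding" if changed else "non-coding"))
    return result

reference = "AUGUUCCAUUAUUGU"  # first RNA sequence on the slide
variant = "AUGUUUCAUUAUUAU"    # second RNA sequence on the slide

print(translate(reference))
print(translate(variant))
print(classify_snps(reference, variant))
```

Positions are 0-based offsets into the RNA string.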
Some genes studied
• DBH: dopamine beta-hydroxylase
• DRD2: dopamine receptor D2 +
• DRD3: dopamine receptor D3
• HTR5A: serotonin receptor 5A
• NPY: neuropeptide Y
• SLC6A4: serotonin transporter
• BDNF: brain derived neurotrophic factor
• NRG1: neuregulin +
Elementary Visualizations: MRI
[Figure: cumulative distribution of intracranial volume (ml); + = schizophrenia, controls plotted with a second marker]
Elementary Visualizations: MRI data
[Figure: cumulative distribution of total CSF volumes (ml), p < 0.0002; + = schizophrenia, controls plotted with a second marker]
Gender differences: MRI
[Figure: two panels, Men and Women, showing subcortical white; + = schizophrenia, controls plotted with a second marker]
Which methods to use?
• Visualizations, CDF and scatter plots, give an intuitive grasp of variables - problems with many interrelated variables
• Statistical modelling required to decide significance of a visible trend, and to rank effects
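As a minimal sketch of the step from a CDF plot to a significance statement, the code below computes the empirical CDFs of two groups, the largest vertical gap between them (the Kolmogorov-Smirnov statistic) and a permutation p-value. The choice of test is an assumption made for illustration; the slides do not say which test produced their p-values. The numbers are made-up volumes, not study data.

```python
import random

def ecdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    def cdf(x):
        # Linear scan keeps the sketch simple.
        return sum(1 for v in ordered if v <= x) / n
    return cdf

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(a), ecdf(b)
    return max(abs(cdf_a(x) - cdf_b(x)) for x in list(a) + list(b))

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Share of random relabelings whose KS statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = ks_statistic(pooled[:len(a)], pooled[len(a):])
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up volumes (ml) for illustration only.
affected = [168, 175, 181, 190, 196, 203, 210, 214, 222, 230]
controls = [120, 128, 133, 139, 144, 150, 155, 170, 178, 185]

print("KS statistic:", ks_statistic(affected, controls))
print("permutation p-value:", permutation_pvalue(affected, controls))
```

A permutation test avoids distributional assumptions, at the cost of a p-value that can be no smaller than 1/(n_perm + 1).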
Statistical methods
• Bayesian methods intuitive and rational - but conventional testing required for publications
• Linear models - need to account for mixing and over-dispersion (Glenn Lawyer thesis project)
• Mixture models (Valery Savcenko thesis project)
• Discretization and Bayesian analysis of discrete distributions - intuitive, but information lost
Model adequate? - Gelman's posterior predictive check
• Best tested with classical p-values.
• Determine posterior for parameter
• Design test function
• Compute p-value
• Reject model if p small, e.g., <1%, <5%
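The steps listed above can be sketched by simulation. Everything specific here is an assumption chosen for illustration, not taken from the slides: a normal model with known standard deviation `sigma`, a flat prior on the mean (so the posterior of the mean is normal around the sample mean), and the sample maximum as test function.

```python
import random
import statistics

def posterior_predictive_pvalue(data, test_fn, sigma=1.0, n_draws=2000, seed=0):
    """Estimate P(test_fn(replicate) >= test_fn(data) | data) by simulation.

    Assumed model: data ~ Normal(mu, sigma) with known sigma and a flat prior
    on mu, which gives the posterior mu ~ Normal(mean(data), sigma / sqrt(n)).
    """
    rng = random.Random(seed)
    n = len(data)
    post_mean = statistics.fmean(data)
    post_sd = sigma / n ** 0.5
    observed = test_fn(data)
    hits = 0
    for _ in range(n_draws):
        mu = rng.gauss(post_mean, post_sd)                     # draw parameter from posterior
        replicate = [rng.gauss(mu, sigma) for _ in range(n)]   # simulate replicated data
        if test_fn(replicate) >= observed:                     # compare test function values
            hits += 1
    return hits / n_draws

# Made-up data for illustration only.
clean = [-1.2, -0.5, 0.1, 0.4, 0.9, -0.3, 0.7, -0.8, 0.2, 1.1]
with_outlier = clean + [8.0]

print(posterior_predictive_pvalue(clean, max))
print(posterior_predictive_pvalue(with_outlier, max))
```

With the outlier included, replicated data sets almost never reach a maximum as large as the observed one, so the p-value is near zero and the normal model would be rejected at the thresholds named on the slide.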
Graphical models
• Y - X - Z (no edge between Y and Z): f(x,y,z) = f(x,z) f(x,y) / f(x)
• X, Y, Z with no edges: f(x,y,z) = f(x) f(y) f(z)
• X, Y, Z fully connected: f(x,y,z), no factorization
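The first factorization says that Y and Z are independent given X. A small numeric check on binary variables, with made-up probabilities and hypothetical helper names, shows a joint distribution that satisfies it and one that does not:

```python
from itertools import product

def marginal(joint, keep):
    """Sum a joint table {(x, y, z): p} down to the variable positions in keep."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

def factorizes_given_x(joint, tol=1e-12):
    """Check f(x,y,z) == f(x,z) f(x,y) / f(x) for every cell of the table."""
    fx = marginal(joint, (0,))
    fxy = marginal(joint, (0, 1))
    fxz = marginal(joint, (0, 2))
    return all(
        abs(p - fxz[(x, z)] * fxy[(x, y)] / fx[(x,)]) <= tol
        for (x, y, z), p in joint.items()
    )

# Made-up conditional tables: Y and Z each depend on X only.
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
p_z_given_x = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}

chain = {
    (x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_x[x][z]
    for x, y, z in product((0, 1), repeat=3)
}

# Z copies Y, so Y and Z remain dependent even when X is known.
coupled = {
    (x, y, z): (p_x[x] * p_y_given_x[x][y] if y == z else 0.0)
    for x, y, z in product((0, 1), repeat=3)
}

print(factorizes_given_x(chain))    # True
print(factorizes_given_x(coupled))  # False
```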
Multivariate characterization by graphical models
MRI volumes, blood, demography
[Figure: graphical model with nodes Dia, TemCSF, BrsCSF, SubCSF, TotCSF]
Adding Vermis variables
[Figure: graphical model with nodes Dia, PSV, TemCSF, BrsCSF]
Learning, Intelligence, Executive vs Anatomy
[Figure: color-coded map (color code 0 to 1) of positive and negative correlations between tests 1:RAVLTA1, 6:RAVLTATOT, 11:TMT, 14:WAIS-R, 15:WCST64 and anatomical areas, left and right, for CSF, grey and white. Area code: Cerebellum, Vermis, CC, Brainstem, Frontal, Parietal, Vent2, Temporal, Total ic]
Covariances: Cognitive Performance Index
1. RAVLTA1: Verbal learning
2. RAVLTA2: Verbal learning
3. RAVLTA3: Verbal learning
4. RAVLTA4: Verbal learning
5. RAVLTA5: Verbal learning
6. RAVLTATOT: Verbal learning, total
7. RAVLTB: Verbal learning, distraction
8. RAVLTA6: Verbal memory, immediate
9. RAVLTA7: Verbal memory, delayed
10. CPT d': Attention
11. TMT A: Visuomotor function, speed
12. TMT B: Visuomotor function, flexibility
13. LNS: Working memory
14. WAIS-R: Verbal ability, IQ
15. WCST64 categories: Executive function
16. WCST64 total errors: Executive function
17. WCST64 pers. errors: Executive function
18. WCST64 pers. responses: Executive function
[Figure: cpd; labels wc, mc, wa, ma]
[Figure: executive ability; labels wc, mc, wa, ma]
Pairs associated with Diagnosis
[Figure: graphical models over D, Y and Z]
Y and Z co-vary differently for affected and controls
Age-dependency of Posterior Superior Vermis
[Figure: posterior superior vermis vs. age at MRI; + = schizophrenia, controls plotted with a second marker]
No co-variation between posterior inferior vermis and parietal white for affected
[Figure: posterior inferior vermis vs. parietal white; + = schizophrenia, controls plotted with a second marker]
PSV has best explanatory power
[Figure: posterior superior vermis, affected vs. healthy; + = schizophrenia, controls plotted with a second marker]
Classification explains data! (Can Mert thesis project)
[Figure: graphs over H, X, Y, Z, W]
But where is the model?
Model is parametrized mixture
[Figure: graph over H, X, Y, Z, W]
Parameter is mixing vector and classification partition!
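To make "mixing vector and classification partition" concrete, here is a minimal sketch of a two-component Gaussian mixture fitted by EM on one variable. It is plain maximum-likelihood EM, not the Bayesian AutoClass procedure named on the following slide, and the data are made up. The fitted `weights` play the role of the mixing vector and `labels` the role of the classification partition.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_component_mixture(values, n_iter=100):
    """EM for a 1-D mixture of two Gaussians.

    Returns (weights, means, sds, labels): weights is the mixing vector and
    labels assigns each value to its most probable component.
    """
    lo, hi = min(values), max(values)
    weights = [0.5, 0.5]
    means = [lo, hi]                      # crude initialisation at the extremes
    spread = (hi - lo) / 4.0 or 1.0
    sds = [spread, spread]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate mixing vector, means and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse onto a point
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return weights, means, sds, labels

# Made-up measurements with two obvious groups.
values = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, sds, labels = fit_two_component_mixture(values)
print("mixing vector:", weights)
print("component means:", means)
print("partition:", labels)
```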
Autoclass 1
[Figure: total gray; A = schizophrenia, C = controls]
Weak signals in genetics data
• Numerous investigations have indicated 'almost significant' signals linking SNPs to diagnosis
• Typically, these findings cannot be confirmed in other studies - populations are genetically heterogeneous and measurements nonstandardized.
• We try to connect SNPs both to diagnosis and to other phenotypical variables
• Multiple testing and weak signal problems.
Genetics data - weak statistics
Gene    SNP                    Type  Genotype counts
DRD3    SerGly                 A/C    49   59   14
DRD2    Ser311Cys              C/G   118    4    0
NPY     Leu7Pro                A/C     1    7  144
DBH     Ala55Ser               G/T    98   24    0
BDNF    Val66Met               A/G     5   37   80
HTR5A   Pro15Ser               C/T   109   11    2
PNOC    Gln172Arg              A/G    11   37   28
SLC6A4  del(44bp) in promoter  S/L    20   60   42
Empirical distribution by genotype
Gene BDNF (schiz + controls)
[Figure: cumulative distribution of frontal CSF by genotype A/A, A/G, G/G]
Bonferroni-Hochberg-Benjamini methods
MRI and lab data
[Figure: observed p-values and 'no effect' p-values against number of p-values; FDRi 71, FDRd 62]
Benjamini & Yekutieli, Annals of Math Stat (to appear)
Compensating for multiple comparisons
• Bonferroni 1937: for level α and n tests, use level α/n
• Hochberg 1988: step-up procedure
• Benjamini, Hochberg 1995: False Discovery Rate
• J. Storey, 2002: pFDRi, pFDRd
• Bayesian interpretations being developed (Wasserman & Genovese, 2002)
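The first three procedures in the list can be sketched in a few lines each. The function names and the example p-values are made up for illustration; the pFDR variants and the Bayesian interpretations are not covered.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H_i when p_i <= alpha / n."""
    n = len(pvalues)
    return [p <= alpha / n for p in pvalues]

def hochberg(pvalues, alpha=0.05):
    """Hochberg step-up: largest k with p_(k) <= alpha / (n - k + 1); reject the k smallest."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha / (n - rank + 1):
            k_max = rank
    reject = [False] * n
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up: largest k with p_(k) <= k * q / n; reject the k smallest."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / n:
            k_max = rank
    reject = [False] * n
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

# Made-up p-values for illustration only.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", sum(bonferroni(example)))          # 1
print("Hochberg rejections:", sum(hochberg(example)))              # 1
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(example)))  # 2
```

Bonferroni and Hochberg control the probability of any false rejection, while Benjamini-Hochberg controls the expected share of false rejections, which is why it rejects more here.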
Diagnosis-genotype
21 tests on three genes

  bdnf    drd2    nrg1
0.1137  0.0749  0.8744
0.7293  0.7276  0.0096

  bdnf    drd2    nrg1
0.1136  0.0735  0.8709
0.0801  0.2213  0.7666
0.0316  0.0823  0.6426
0.5499  1.0000  0.0244
0.7314  0.7312  0.0103
Multiple comparisons: what is the significance of the smallest p-values 1%, 1%, 2%, 3% in 20 tests? What is the probability of obtaining a more extreme result when there is no effect?
p-values 3%, 2%, 1%, 1% in 20 tests: 7% - not quite significant! But better than Bonferroni: 20%
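The 20% Bonferroni figure is simple arithmetic on the smallest p-value. The sketch below reproduces it and, for comparison, the exact probability that the smallest of 20 independent p-values falls at or below 1% when there is no effect. Independence of the tests is an assumption. The 7% figure combines the four smallest p-values by a method the slide does not spell out, so it is not reproduced here.

```python
n_tests = 20
smallest = [0.01, 0.01, 0.02, 0.03]  # the four smallest p-values from the slide

# Bonferroni: multiply the minimum p-value by the number of tests (capped at 1).
bonferroni_bound = min(1.0, n_tests * min(smallest))

# Exact probability that the minimum of n independent uniform p-values is
# at most min(smallest): 1 - (1 - p)^n.
exact_min_prob = 1.0 - (1.0 - min(smallest)) ** n_tests

print(round(bonferroni_bound, 4))  # 0.2
print(round(exact_min_prob, 4))    # 0.1821
```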
Beyond Robustness: SVM
Support Vector Methods are
• distribution-independent
• insensitive to dimension
• error of classifier no more than (bound given as a formula on the slide) with probability d, for training set size l, unclassified examples b, margin g, pdf support within R-sphere
[Figure: + and - examples with separating margin, b=1]
Beyond Robustness: SVM
Support Vector Methods are
• distribution-independent
• insensitive to dimension
• error of classifier no more than (bound given as a formula on the slide) with probability d, for training set size l, unclassified examples b, margin g, all examples within R-sphere
[Figure: + and - examples with separating margin, b=0]
That's all, folks!
• High-quality databases for medical research of the HUBIN type open the way for intelligent data analysis methods used in engineering and business
• Already with the limited data presently available, interesting clues emerge
• Multiple testing considerations are important
• Long term effort - stable economy and engagement are vital.