

Data Mining in Schizophrenia Research. Stefan Arnborg, KTH, SICS; Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet. 'Principles of Data Mining and Knowledge Discovery', Helsinki, Aug 2002. http://www.nada.kth.se/~stefan


Presentation Transcript


  1. Data Mining in Schizophrenia Research. Stefan Arnborg, KTH, SICS; Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet. 'Principles of Data Mining and Knowledge Discovery', Helsinki, Aug 2002. http://www.nada.kth.se/~stefan

  2. HUBIN page: 1000 hits/day

  3. Schizophrenia: Questions and Clues
     • Cause(s) of schizophrenia not known.
     • Medication effective against some symptoms, discovered by chance 100-2000 years ago.
     • Does not appear in animals, so no experimental clues.
     • Explanation models vary over time.
     • Disturbed neuronal circuitry in schizophrenia? (currently the hottest hypothesis)
     • Influenced by genotype and/or environment? (clustering in families)

  4. Schizophrenia: Questions and Clues
     • Which processes result in disease?
     • Traces of disturbed development visible in MRI (anatomy) and blood tests?
     • Genetic risk factors?
     • Causal pathways?
     • MAIN PROBLEM: Connect psychiatry to physiology

  5. Preliminary analysis
     • Test case:
       • 144 subjects: 61 affected, 83 controls
     • Variables:
       • Diagnosis (DSM-IV)
       • Demography (age, gender, ...)
       • Blood tests (liver, heart, ...)
       • Genetics (20 SNPs, receptor, growth factors, ...)
       • Anatomy (MRI)
       • Neuropsychology (working memory, reactions)
       • Clinical test batteries (type of delusions, history, medication)

  6. Brain boxes Picture from BRAINS II manual, Magnotta et al, University of Iowa

  7. Manually drawn vermis regions ROIs drawn by Gaku Okugawa

  8. Single Nucleotide Polymorphism
     RNA:  A U G  U U C  C A U  U A U  U G U    Protein A:  Phe His Tyr Cys
     RNA:  A U G  U U U  C A U  U A U  U A U    Protein A': Phe His Tyr Tyr
     • Coding SNP: U G U becomes U A U, so Cys becomes Tyr.
     • Non-coding SNP: U U C becomes U U U, and the amino acid stays Phe.
     • Protein A can be slightly different from A'.
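The distinction on this slide can be checked mechanically. The following is a minimal sketch, assuming a partial codon table restricted to the six codons that appear in the two RNA strings; the function names `translate` and `compare` are illustrative and not part of the presentation. It follows the slide's wording and labels a substitution "non-coding" when the amino acid is unchanged. The start codon AUG is translated as Met here, although the slide lists the proteins without it.

```python
# Partial codon table: only the codons present in the slide's two RNA strings.
CODON_TABLE = {
    "AUG": "Met", "UUC": "Phe", "UUU": "Phe",
    "CAU": "His", "UAU": "Tyr", "UGU": "Cys",
}

def translate(rna):
    """Translate an RNA string codon by codon (length must be a multiple of 3)."""
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3)]

def compare(rna_a, rna_b):
    """List codon differences and whether each one changes the amino acid."""
    differences = []
    for i in range(0, len(rna_a), 3):
        codon_a, codon_b = rna_a[i:i + 3], rna_b[i:i + 3]
        if codon_a != codon_b:
            aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
            kind = "non-coding" if aa_a == aa_b else "coding"
            differences.append((codon_a, codon_b, aa_a, aa_b, kind))
    return differences

RNA_A = "AUGUUCCAUUAUUGU"        # first sequence on the slide
RNA_A_PRIME = "AUGUUUCAUUAUUAU"  # second sequence on the slide

if __name__ == "__main__":
    print(translate(RNA_A))
    print(translate(RNA_A_PRIME))
    for diff in compare(RNA_A, RNA_A_PRIME):
        print(diff)
```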

  9. Some genes studied
     • DBH: dopamine beta-hydroxylase
     • DRD2: dopamine receptor D2 +
     • DRD3: dopamine receptor D3
     • HTR5A: serotonin receptor 5A
     • NPY: neuropeptide Y
     • SLC6A4: serotonin transporter
     • BDNF: brain derived neurotrophic factor
     • NRG1: neuregulin +

  10. Elementary Visualizations: MRI. [Figure: cumulative distribution of intracranial volume (ml), plotted separately for schizophrenia subjects (+) and controls.]

  11. Elementary Visualizations: MRI data. [Figure: cumulative distribution of total CSF volumes (ml), plotted separately for schizophrenia subjects (+) and controls; p < 0.0002.]
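The cumulative distribution plots on slides 10 and 11 are empirical distribution functions, one per group. A minimal sketch of how such curves and a simple distance between them could be computed follows. The slide does not say which test produced p < 0.0002, so the two-sample Kolmogorov-Smirnov statistic used here is an assumption chosen for illustration, and the volumes are invented numbers, not HUBIN data.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    xs = sorted(sample)
    n = len(xs)

    def F(x):
        # Fraction of observations less than or equal to x.
        return bisect_right(xs, x) / n

    return F

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    Fa, Fb = ecdf(a), ecdf(b)
    return max(abs(Fa(x) - Fb(x)) for x in list(a) + list(b))

# Hypothetical total CSF volumes (ml); illustrative values only.
affected = [182.0, 201.5, 176.3, 220.1, 195.4, 210.8]
controls = [150.2, 162.7, 171.9, 158.4, 180.3, 166.0]

if __name__ == "__main__":
    F_affected = ecdf(affected)
    print("F_affected(200) =", F_affected(200.0))
    print("KS statistic    =", ks_statistic(affected, controls))
```

A plotting library would draw each curve as a step function through the sorted sample; the statistic only summarizes how far apart the two curves get.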

  12. Gender differences: MRI. [Figure: two panels, Men and Women, each showing subcortical white for schizophrenia subjects (+) and controls.]

  13. Which methods to use?
     • Visualizations, cdf and scatter plots, give an intuitive grasp of variables, but run into problems with many interrelated variables.
     • Statistical modelling is required to decide the significance of a visible trend, and to rank effects.

  14. Statistical methods
     • Bayesian methods are intuitive and rational, but conventional testing is required for publications.
     • Linear models: need to account for mixing and over-dispersion (Glenn Lawyer thesis project).
     • Mixture models (Valery Savcenko thesis project).
     • Discretization and Bayesian analysis of discrete distributions: intuitive, but information is lost.

  15. Model adequate? Gelman's posterior predictive check
     • Best tested with classical p-values.
     • Determine the posterior for the parameter.
     • Design a test function.
     • Compute the p-value.
     • Reject the model if p is small, e.g., <1% or <5%.
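The steps of the check can be sketched in a few lines. The slide's formulas did not survive, so everything concrete below is an assumption made for illustration: a normal model with known unit variance and a flat prior on the mean (so the posterior of the mean is normal around the sample mean), the maximum as test function, and invented data containing one outlier. The p-value is the fraction of replicated data sets whose test value is at least the observed one.

```python
import random
from statistics import mean

def posterior_predictive_pvalue(y, test_fn, draws=2000, seed=0):
    """Posterior predictive p-value for a N(mu, 1) model with a flat prior on mu.

    Assumed model, not taken from the slides: the posterior is
    mu | y ~ N(mean(y), 1/n).
    """
    rng = random.Random(seed)
    n = len(y)
    y_bar = mean(y)
    t_observed = test_fn(y)
    exceed = 0
    for _ in range(draws):
        mu = rng.gauss(y_bar, (1.0 / n) ** 0.5)         # draw from the posterior
        y_rep = [rng.gauss(mu, 1.0) for _ in range(n)]  # replicate the data
        if test_fn(y_rep) >= t_observed:
            exceed += 1
    return exceed / draws

# Hypothetical observations; the last value is a deliberate outlier.
y_obs = [0.1, -0.4, 0.3, 1.2, -0.9, 0.5, -0.2, 0.0, 0.7, 8.0]

if __name__ == "__main__":
    p = posterior_predictive_pvalue(y_obs, max)
    print("posterior predictive p-value:", p)
    print("reject model at 1%:", p < 0.01)
```

The test function should target the aspect of the data the model is suspected of missing; here the maximum is sensitive to the outlier that a unit-variance normal model cannot reproduce.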

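  16. Graphical models
     • Graph with edges X-Y and X-Z only: f(x,y,z) = f(x,z) f(x,y) / f(x)
     • Graph with no edges: f(x,y,z) = f(x) f(y) f(z)
     • Complete graph on X, Y, Z: f(x,y,z), with no factorization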
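The first factorization says that Y and Z are independent given X. A minimal numerical check, assuming three binary variables and made-up conditional probability tables, builds a joint distribution with that structure and confirms that f(x,y,z) equals f(x,z) f(x,y) / f(x) in every cell.

```python
from itertools import product

# Hypothetical probability tables for binary X, Y, Z (illustrative values).
p_x = {0: 0.4, 1: 0.6}
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_x = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

# Joint distribution in which Y and Z depend on each other only through X.
joint = {
    (x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_x[x][z]
    for x, y, z in product((0, 1), repeat=3)
}

def marginal(table, keep):
    """Sum the joint table over all positions not listed in keep."""
    out = {}
    for key, p in table.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

f_x = marginal(joint, (0,))
f_xy = marginal(joint, (0, 1))
f_xz = marginal(joint, (0, 2))

# Largest deviation between the joint and its graphical-model factorization.
max_error = max(
    abs(joint[(x, y, z)] - f_xz[(x, z)] * f_xy[(x, y)] / f_x[(x,)])
    for (x, y, z) in joint
)

if __name__ == "__main__":
    print("max factorization error:", max_error)
```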

  17. Multivariate characterization by graphical models: MRI volumes, blood, demography. [Figure: graph with nodes Dia, TemCSF, BrsCSF, SubCSF, TotCSF.]

  18. Adding Vermis variables. [Figure: graph with nodes Dia, PSV, TemCSF, BrsCSF.]

  19. Learning, Intelligence, Executive vs Anatomy. [Figure: color-coded correlation matrix (scale 0 to 1, positive and negative correlations) of tests 1: RAVLTA1, 6: RAVLTATOT, 11: TMT, 14: WAIS-R, 15: WCST64 against anatomy. Area code: Cerebellum, Vermis, CC, Brainstem, Frontal, Parietal, Vent2, Temporal, Total ic; left and right; CSF, grey, white.]

  20. Covariances: Cognitive Performance Index
      1  RAVLTA1               Verbal learning
      2  RAVLTA2               Verbal learning
      3  RAVLTA3               Verbal learning
      4  RAVLTA4               Verbal learning
      5  RAVLTA5               Verbal learning
      6  RAVLTATOT             Verbal learning, total
      7  RAVLTB                Verbal learning, distraction
      8  RAVLTA6               Verbal memory, immediate
      9  RAVLTA7               Verbal memory, delayed
      10 CPT d'                Attention
      11 TMT A                 Visuo-motor function, speed
      12 TMT B                 Visuo-motor function, flexibility
      13 LNS                   Working memory
      14 WAIS-R                Verbal ability, IQ
      15 WCST64 categories     Executive function
      16 WCST64 total errors   Executive function
      17 WCST64 pers. errors   Executive function
      18 WCST64 pers. responses  Executive function

  21. MR brain volumes

  22. [Figure; labels: cpd, wc, mc, wa, ma.]

  23. Executive ability. [Figure; labels: wc, mc, wa, ma.]

  24. Pairs associated with Diagnosis. [Figure: graphs over D, Y and Z.] Y and Z co-vary differently for Affected and Controls.

  25. Age-dependency of Posterior Superior Vermis. [Figure: posterior superior vermis against age at MRI, for schizophrenia subjects (+) and controls.]

  26. No co-variation between posterior inferior vermis and parietal white for affected. [Figure: posterior inferior vermis against parietal white, for schizophrenia subjects (+) and controls.]

  27. PSV has best explanatory power. [Figure: posterior superior vermis, affected vs healthy; schizophrenia subjects (+) and controls.]

  28. Classification explains data! (Can Mert thesis project) [Figure: graphs over H, X, Y, Z, W.] But where is the model?

  29. Model is a parametrized mixture. [Figure: graph over H, X, Y, Z, W.] The parameter is the mixing vector and the classification partition!

  30. Autoclass 1. [Figure: total gray; A = schizophrenia, C = controls.]
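Slide 29 describes the model as a mixture whose parameters are a mixing vector and a partition of the subjects into classes. The sketch below illustrates that idea with a plain EM fit of a two-component one-dimensional Gaussian mixture. This is a simpler stand-in and not the Bayesian classification that AutoClass performs; the data are synthetic, and the function name `em_two_gaussians` is hypothetical.

```python
import math
import random
from statistics import pstdev

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=100):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns the mixing vector, component means, standard deviations,
    and the partition (most probable class per observation).
    """
    mu = [min(data), max(data)]   # crude initialization at the extremes
    spread = pstdev(data)
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate mixing vector, means and spreads.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return weight, mu, sigma, labels

# Synthetic standardized measurements: two groups with different means.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(60)] + [rng.gauss(5.0, 1.0) for _ in range(80)]

if __name__ == "__main__":
    weight, mu, sigma, labels = em_two_gaussians(data)
    print("mixing vector:", weight)
    print("means:", mu)
    print("class sizes:", labels.count(0), labels.count(1))
```

Comparing the recovered partition with the known diagnosis, as the Autoclass slide does with its A and C labels, shows whether the unsupervised classes line up with the clinical groups.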

  31. Weak signals in genetics data
     • Numerous investigations have indicated 'almost significant' signals linking SNPs to diagnosis.
     • Typically, these findings cannot be confirmed in other studies: populations are genetically heterogeneous and measurements are not standardized.
     • We try to connect SNPs both to diagnosis and to other phenotypic variables.
     • Multiple testing and weak signal problems.

  32. Genetics data: weak statistics

      Gene     SNP                 Type   Genotype counts
      DRD3     SerGly              A/C     49    59    14
      DRD2     Ser311Cys           C/G    118     4     0
      NPY      Leu7Pro             A/C      1     7   144
      DBH      Ala55Ser            G/T     98    24     0
      BDNF     Val66Met            A/G      5    37    80
      HTR5A    Pro15Ser            C/T    109    11     2
      PNOC     Gln172Arg           A/G     11    37    28
      SLC6A4   (del(44bp) in pr)   S/L     20    60    42

  33. Empirical distribution by genotype. Gene BDNF (schiz + controls). [Figure: cumulative distribution of frontal CSF for genotypes A/A, A/G and G/G.]

  34. Bonferroni-Hochberg-Benjamini methods: MRI and lab data. [Figure: observed p-values and 'no effect' p-values against number of p-values; FDRi 71, FDRd 62.] Benjamini & Yekutieli, Annals of Math Stat, (ta)

  35. Compensating multiple comparisons
     • Bonferroni 1937: for level α and n tests, use level α/n
     • Hochberg 1988: step-up procedure
     • Benjamini, Hochberg 1996: False Discovery Rate
     • J. Storey, 2002: pFDRi, pFDRd
     • Bayesian interpretations being developed (Wasserman & Genovese, 2002)
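The first three procedures on this slide differ only in the thresholds applied to the sorted p-values. A minimal sketch, assuming independent tests and using invented p-values: Bonferroni compares every p-value with α/n, Hochberg's step-up procedure compares the k-th smallest with α/(n-k+1), and the Benjamini-Hochberg procedure compares it with kα/n. Both step-up procedures reject everything up to the largest k that passes. The helper name `_step_up` is illustrative.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / n."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def _step_up(pvals, thresholds):
    """Generic step-up rule: find the largest rank k whose sorted p-value
    is within thresholds[k-1], then reject the k smallest p-values."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= thresholds[rank - 1]:
            k_max = rank
    reject = [False] * len(pvals)
    for i in order[:k_max]:
        reject[i] = True
    return reject

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up procedure (controls the family-wise error rate)."""
    n = len(pvals)
    return _step_up(pvals, [alpha / (n - k + 1) for k in range(1, n + 1)])

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    n = len(pvals)
    return _step_up(pvals, [alpha * k / n for k in range(1, n + 1)])

# Hypothetical p-values from ten tests (illustrative only).
example_pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

if __name__ == "__main__":
    print("Bonferroni rejections:", sum(bonferroni(example_pvals)))
    print("Hochberg rejections:  ", sum(hochberg(example_pvals)))
    print("BH (FDR) rejections:  ", sum(benjamini_hochberg(example_pvals)))
```

Controlling the false discovery rate is the less strict criterion, which is why the last procedure can reject more hypotheses than the two family-wise methods on the same p-values.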

  36. Diagnosis-genotype: 21 tests on three genes

      bdnf     drd2     nrg1
      0.1137   0.0749   0.8744
      0.7293   0.7276   0.0096

      bdnf     drd2     nrg1
      0.1136   0.0735   0.8709
      0.0801   0.2213   0.7666
      0.0316   0.0823   0.6426
      0.5499   1.0000   0.0244
      0.7314   0.7312   0.0103

  37. Multiple comparisons: what is the significance of minimum p-values of 1%, 1%, 2% and 3% in 20 tests? What is the probability of obtaining a more extreme result when there is no effect?

  38. p-values 3%, 2%, 1%, 1% in 20 tests: 7%, not quite significant! But better than Bonferroni: 20%.
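The Bonferroni figure on slide 38 follows directly from the smallest p-value. The sketch below computes it and, for comparison, the exact probability that the smallest of 20 null p-values falls at or below 1%, which assumes the tests are independent. The 7% figure on the slide combines all four small p-values by a method the slide does not spell out, so it is not reproduced here. Both function names are illustrative.

```python
def bonferroni_adjusted(p_min, n_tests):
    """Bonferroni-adjusted significance of the smallest p-value."""
    return min(1.0, n_tests * p_min)

def prob_min_below(p_min, n_tests):
    """P(smallest of n independent null p-values <= p_min)."""
    return 1.0 - (1.0 - p_min) ** n_tests

if __name__ == "__main__":
    # Smallest p-value 1% in 20 tests, as on the slide.
    print("Bonferroni:", bonferroni_adjusted(0.01, 20))         # about 0.20
    print("Exact under independence:", prob_min_below(0.01, 20))  # about 0.18
```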

  39. q-value: the FDR of a prefix of the sorted p-values
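One common way to turn this idea into numbers is the Benjamini-Hochberg adjusted p-value: for the k-th smallest of n p-values, the estimated FDR of rejecting the first k is n·p(k)/k, and the q-value is the smallest such estimate over all prefixes that include it. The sketch below implements that version. Storey's q-value additionally estimates the proportion of true null hypotheses, which is omitted here, so this is a conservative simplification rather than the method shown on the slide.

```python
def bh_q_values(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values with pi0 taken as 1).

    q for the k-th smallest p-value is min over j >= k of n * p_(j) / j.
    Results are returned in the original order of pvals.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the minimum covers all longer prefixes.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        q[i] = running_min
    return q

if __name__ == "__main__":
    # Hypothetical p-values (illustrative only).
    for p, q in zip([0.01, 0.04, 0.03, 0.5], bh_q_values([0.01, 0.04, 0.03, 0.5])):
        print(f"p = {p:.2f}  q = {q:.4f}")
```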

  40. Beyond Robustness: SVM
     Support Vector Methods are
     • distribution-independent
     • insensitive to dimension
     • error of classifier no more than a bound that holds with probability d, given training set size l, unclassified examples b, margin g, and pdf support within an R-sphere
     [Figure: + and - examples separated by a margin; b=1]

  41. Beyond Robustness: SVM
     Support Vector Methods are
     • distribution-independent
     • insensitive to dimension
     • error of classifier no more than a bound that holds with probability d, given training set size l, unclassified examples b, margin g, and all examples within an R-sphere
     [Figure: + and - examples separated by a margin; b=0]
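The bound itself did not survive in the transcript, but the quantities it depends on are easy to compute for a given linear classifier. The sketch below does so for a toy two-dimensional data set with a hand-picked separating direction. Reading "unclassified examples b" as the number of training examples whose margin is smaller than g is an assumption, as are the data and all function names.

```python
import math

def geometric_margins(points, labels, w, bias):
    """Signed distance of each example to the hyperplane w.x + bias = 0,
    positive when the example lies on the side given by its label."""
    norm_w = math.sqrt(sum(c * c for c in w))
    return [
        y * (sum(wi * xi for wi, xi in zip(w, x)) + bias) / norm_w
        for x, y in zip(points, labels)
    ]

def radius(points):
    """Radius R of the smallest origin-centered sphere containing all examples."""
    return max(math.sqrt(sum(c * c for c in x)) for x in points)

def count_within_margin(margins, g):
    """Number of examples closer to the hyperplane than the margin g
    (the assumed reading of b)."""
    return sum(1 for m in margins if m < g)

# Toy data: two positive and two negative examples (illustrative only).
points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]
w, bias = (1.0, 1.0), 0.0   # hand-picked separating hyperplane

if __name__ == "__main__":
    margins = geometric_margins(points, labels, w, bias)
    print("margins:", margins)
    print("R:", radius(points))
    print("b for g = 2.5:", count_within_margin(margins, 2.5))
    print("b for g = 3.0:", count_within_margin(margins, 3.0))
```

Asking for a wider margin g leaves more examples inside it, which is the trade-off the b=1 and b=0 variants of the slide illustrate.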

  42. That's all, folks!
     • High-quality databases for medical research of the HUBIN type open the way for intelligent data analysis methods used in engineering and business.
     • Already with the limited data presently available, interesting clues emerge.
     • Multiple testing considerations are important.
     • Long-term effort: stable funding and commitment are vital.
