
Stefan Arnborg, KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson,

Data Mining in Schizophrenia Research -preliminary. Stefan Arnborg, KTH, SICS Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet ’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002.


Presentation Transcript


  1. Data Mining in Schizophrenia Research - preliminary. Stefan Arnborg, KTH, SICS; Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet. ’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002. http://www.nada.kth.se/~stefan

  2. HUBIN - a project to accelerate research in human brain diseases • Carefully selected patients, relatives and controls • Each participant characterized over many domains • DNA stored in bio-bank • Each research team collects high-quality information, analyzes it, and stores it in an archive for inter-domain analyses.

  3. Hubin organization
     • Management group: Håkan Hall, Assoc. Prof. (project manager); Stig Larsson, T.D. hc; Göran Sedvall, Prof.; Stefan Arnborg, Prof.; Tom McNeil, Prof.; Lars Therenius, Prof.
     • Scientific advisory board: Göran Sedvall, Chairman; Nancy Andreasen, Univ of Iowa; Paul Greengard, Rockefeller Univ; Tomas Hökfelt, Karolinska Inst.
     • Ethical group: Göran Sedvall, Chairman
     • Hubin AB: Stig Larsson, Chairman; Håkan Hall, CEO
     • Data domain leads
     • Project staff

  4. Leading causes of disability in the world, WHO (1990)
     Cause of disability                          Total (millions)   % of world total
     1.  Unipolar major depression                50.8               10.7
     2.  Iron deficiency anemia                   22.0                4.7
     3.  Falls                                    22.0                4.6
     4.  Alcohol use                              15.8                3.3
     5.  Chronic obstructive pulmonary disease    14.7                3.1
     6.  Bipolar disorder                         14.1                3.0
     7.  Congenital anomalies                     13.5                2.9
     8.  Osteoarthritis                           13.3                2.8
     9.  Schizophrenia                            12.1                2.6
     10. Obsessive compulsive disorder            10.2                2.2

  5. Schizophrenia -Questions and Clues • Cause(s) of schizophrenia not known. • Medication effective against some symptoms - discovered by chance 100-2000 years ago. • Does not appear in animals - no experimental clues. • Explanation models vary over time. • Disturbed neuronal circuitry in schizophrenia? (currently hottest hypothesis) • Influenced by genotype and/or environment? (clustering in families)

  6. Schizophrenia -Questions and Clues • Which processes result in disease? • Traces of disturbed development visible in MRI (anatomy) and blood tests? • Genetic risk factors? • Causal pathways? • MAIN PROBLEM: Connect psychiatry to physiology

  7. Preliminary analysis • Test case: 144 subjects (61 affected, 83 controls) • Variables: • Diagnosis (DSM-IV) • Demography (age, gender, ...) • Blood tests (liver, heart, ...) • Genetics (20 SNPs, receptor, growth factors, ...) • Anatomy (MRI) • Neuropsychology (working memory, reactions) • Clinical test batteries (type of delusions, history, medication)

  8. Types of images used in HUBIN (panel labels: LAR, MRI, PET, ISHH)
     • In vivo imaging: Magnetic resonance images (MRI); Functional magnetic resonance images (fMRI); Positron emission tomography (PET); Single photon emission tomography (SPECT)
     • In vitro imaging (whole hemispheres): Autoradiography; In situ hybridization

  9. Brain boxes Picture from BRAINS II manual, Magnotta et al, University of Iowa

  10. Manually drawn vermis regions ROIs drawn by Gaku Okugawa

  11. Single Nucleotide Polymorphism
     • RNA for Protein A:  A U G  U U C  C A U  U A U  U G U  →  Phe His Tyr Cys
     • RNA for Protein A’: A U G  U U U  C A U  U A U  U A U  →  Phe His Tyr Tyr
     • Non-coding SNP: U U C → U U U (Phe in both proteins)
     • Coding SNP: U G U → U A U (Cys → Tyr)
     • Protein A can be slightly different from A’
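The coding/non-coding distinction on this slide can be checked mechanically. The sketch below is illustrative only: it translates the two RNA strings from the slide with a codon table restricted to the codons that appear there (standard genetic code), and labels each differing base the way the slide does. The function names are made up for this example. Note that the slide's protein labels omit the leading Met encoded by AUG.

```python
# Codon table limited to the codons on the slide (standard genetic code).
CODON_TABLE = {
    "AUG": "Met", "UUC": "Phe", "UUU": "Phe",
    "CAU": "His", "UAU": "Tyr", "UGU": "Cys",
}

RNA_A = "A U G U U C C A U U A U U G U"        # Protein A
RNA_A_PRIME = "A U G U U U C A U U A U U A U"  # Protein A'

def translate(rna):
    """Translate a space-separated RNA string into a list of amino acids."""
    bases = rna.replace(" ", "")
    return [CODON_TABLE[bases[i:i + 3]] for i in range(0, len(bases), 3)]

def classify_snps(rna_a, rna_b):
    """List differing bases and whether they change the amino acid."""
    a, b = rna_a.replace(" ", ""), rna_b.replace(" ", "")
    found = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            start = 3 * (i // 3)  # first base of the codon containing position i
            codon_a, codon_b = a[start:start + 3], b[start:start + 3]
            changed = CODON_TABLE[codon_a] != CODON_TABLE[codon_b]
            # The slide's terminology: "coding" changes the protein, "non-coding" does not.
            found.append((codon_a, codon_b, "coding" if changed else "non-coding"))
    return found

if __name__ == "__main__":
    print(translate(RNA_A))
    print(translate(RNA_A_PRIME))
    print(classify_snps(RNA_A, RNA_A_PRIME))
```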

  12. Genes studied • DBH dopamine beta-hydroxylase • DRD2 dopamine receptor D2 + • DRD3 dopamine receptor D3 • HTR5A serotonin receptor 5A • NPY neuropeptide Y • SLC6A4 serotonin transporter • BDNF brain derived neurotrophic factor • NRG1 neuregulin +

  13. Elementary Visualizations: MRI. [Figure: cumulative distribution of intracranial volume (ml); markers: + = schizo, other symbol = controls]

  14. Elementary Visualizations: MRI data. [Figure: cumulative distribution of total CSF volumes (ml), p < 0.0002; markers: + = schizo, other symbol = controls]

  15. Blood data: Gamma GT - alcohol marker. [Figure: cumulative distribution of Gamma GT, p < 0.01; markers: + = schizo, other symbol = controls]

  16. Gender differences: MRI. [Figure: two panels, Men and Women, each showing subcortical white; markers: + = schizo, other symbol = controls]

  17. Which methods to use? • Visualizations, cdf and scatter plots, give intuitive grasp of variables - problems with many interrelated variables • Statistical modelling required to decide significance of visible trend, and to rank effects

  18. Statistical methods • Bayesian methods intuitive and rational - but conventional testing required for publications • Linear models - need to account for mixing and over-dispersion (Glenn Lawyer thesis project). • Discretization and Bayesian analysis of discrete distributions - intuitive, but information lost • Non-parametric randomization tests - most sensitive and accommodate modern multiple testing paradigms
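As an illustration of the non-parametric randomization tests mentioned on this slide, here is a minimal sketch of a two-sample permutation test on the difference in group means. It is not the project's actual procedure: the test statistic, the number of permutations, and the example values are assumptions chosen for the illustration, and the numbers are made up rather than taken from the HUBIN data.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in means.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute mean difference is at least the observed one,
    with the usual +1 correction so that p is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    affected = [float(v) for v in range(11, 21)]
    controls = [float(v) for v in range(1, 11)]
    print(permutation_test(affected, controls))
```

Because the null distribution is built from the data themselves, the same shuffling scheme can be reused across many variables, which is one reason such tests combine well with multiple-testing corrections.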

  19. Bayes’ factor • Choice between two hypotheses, H1 and H2, given experimental/observational data D:
     $\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}$
     Posterior odds = Bayes factor × prior odds

  20. Hypotheses in test matrix • H1: (no effect) a data column is generated independently of diagnosis (composite model) • H2: the data for controls are generated by one composite model, the data for affected by another one.
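For a discretized data column, the comparison of H1 and H2 can be sketched with Dirichlet-multinomial marginal likelihoods. This is a minimal illustration under stated assumptions, not necessarily the composite models used in the project: each model is a multinomial over the column's categories with a uniform Dirichlet prior, and the category counts in the example are hypothetical.

```python
from math import lgamma

def log_marginal(counts):
    """Log marginal likelihood of an observed sequence with these category
    counts, under a multinomial model with a uniform Dirichlet prior:
    Gamma(K) * prod(Gamma(n_k + 1)) / Gamma(N + K)."""
    k = len(counts)
    n = sum(counts)
    return lgamma(k) + sum(lgamma(c + 1) for c in counts) - lgamma(n + k)

def log_bayes_factor(affected_counts, control_counts):
    """log P(D|H2) - log P(D|H1).

    H1: one model generates the column for everybody (counts pooled).
    H2: affected and controls each get their own model.
    """
    pooled = [a + c for a, c in zip(affected_counts, control_counts)]
    return (log_marginal(affected_counts) + log_marginal(control_counts)
            - log_marginal(pooled))

if __name__ == "__main__":
    # Hypothetical two-category counts, for illustration only.
    print(log_bayes_factor([30, 0], [0, 30]))    # positive: favours H2
    print(log_bayes_factor([15, 15], [15, 15]))  # negative: favours H1
```

A positive log Bayes factor favours H2 (the column depends on diagnosis); a negative one favours H1.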

  21. Hierarchical models • Continuous model: H: • Parameterized model: • Hierarchical model: prior distribution f(λ) for λ, • Inference for parameter

  22. Model adequate? • Best tested with classical p-values. • Determine posterior for parameter: • Design test function • Compute p-value: • Reject model if p small, e.g., <1%, <5%

  23. Bayes’ example
     • Result D from test: s heads, f tails, n = s + f
     • H0: Coin is balanced, $P(D \mid H_0) = 2^{-n}$
     • H_p: Coin has head probability p, $P(D \mid H_p) = p^s (1-p)^f$
     • H1: H_p with uniform prior for p, hierarchical: $P(D \mid H_1) = \int_0^1 P(D \mid H_p)\,dp = \frac{s!\,f!}{(n+1)!}$
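The coin example can be evaluated exactly. The sketch below computes P(D|H0) = 2^(-n) and P(D|H1) = s! f! / (n+1)! with exact fractions and forms the Bayes factor of H1 against H0. The function names are chosen for this illustration.

```python
from fractions import Fraction
from math import factorial

def p_data_balanced(s, f):
    """P(D|H0): a specific sequence of s heads and f tails from a fair coin."""
    return Fraction(1, 2 ** (s + f))

def p_data_uniform_prior(s, f):
    """P(D|H1): integral of p^s (1-p)^f over a uniform prior on p."""
    return Fraction(factorial(s) * factorial(f), factorial(s + f + 1))

def bayes_factor(s, f):
    """Bayes factor of H1 (unknown bias) against H0 (balanced coin)."""
    return p_data_uniform_prior(s, f) / p_data_balanced(s, f)

if __name__ == "__main__":
    print(bayes_factor(5, 5))    # 256/693: balanced outcome favours H0
    print(bayes_factor(10, 0))   # 1024/11: ten heads in a row favour H1
```

With equal prior odds, the Bayes factor is also the posterior odds, so five heads and five tails leave the balanced coin about 2.7 times more probable than the alternative.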

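  24. Graphical models
     • f(x,y,z) = f(x,z) f(x,y) / f(x): Y and Z each linked to X only (Y and Z independent given X)
     • f(x,y,z) = f(x) f(y) f(z): no links, X, Y and Z independent
     • f(x,y,z): no factorization, all three variables linked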

  25. Multivariate characterization by graphical models: MRI volumes, blood, demography. [Figure: graph with nodes Dia, TemCSF, BrsCSF, SubCSF, TotCSF]

  26. Adding Vermis variables. [Figure: graph with nodes Dia, PSV, TemCSF, BrsCSF]

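  27. V-structures, causality
     • Chains and forks through B (A → B → C, A ← B ← C, A ← B → C): A ⊥ C | B. Indistinguishable.
     • Two variables: f(x,y) = f(y|x) f(x) = f(x|y) f(y), so X → Y and Y → X are indistinguishable.
     • V-structure A → B ← C: A ⊥ C, but not A ⊥ C | B.
     • V-structures are detectable from observational data.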
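Why a V-structure leaves a detectable trace in observational data can be seen in a small simulation. The sketch below is an illustration with assumed linear-Gaussian models and made-up coefficients, not an analysis from the project: in a V-structure A → B ← C, A and C are uncorrelated but become correlated once B is controlled for, while in a chain A → B → C the pattern is reversed.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Residuals of a least-squares regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

def partial_corr(a, c, b):
    """Correlation of a and c after removing the linear effect of b."""
    return corr(residuals(a, b), residuals(c, b))

def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    noise = lambda: [rng.gauss(0.0, 1.0) for _ in range(n)]
    # V-structure: A -> B <- C, with A and C generated independently.
    a, c = noise(), noise()
    b = [x + y + e for x, y, e in zip(a, c, noise())]
    # Chain: A2 -> B2 -> C2.
    a2 = noise()
    b2 = [x + e for x, e in zip(a2, noise())]
    c2 = [x + e for x, e in zip(b2, noise())]
    return {
        "v_marginal": corr(a, c),               # close to 0
        "v_given_b": partial_corr(a, c, b),     # clearly negative (about -0.5)
        "chain_marginal": corr(a2, c2),         # clearly positive (about 0.58)
        "chain_given_b": partial_corr(a2, c2, b2),  # close to 0
    }

if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```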

  28. Pairs associated to Diagnosis. [Figure: graphs over D, Y and Z] Y and Z co-vary differently for Affected and Controls

  29. Age-dependency of Posterior Superior Vermis. [Figure: posterior superior vermis vs. age at MRI; markers: + = schizo, other symbol = controls]

  30. No co-variation between Posterior inferior vermis and parietal white for affected. [Figure: posterior inferior vermis vs. parietal white; markers: + = schizo, other symbol = controls]

  31. PSV has best explanatory power. [Figure: posterior superior vermis, affected vs. healthy; markers: + = schizo, other symbol = controls]

  32. Decision tree for Diagnosis: MRI data. [Figure: decision tree; A = schiz, C = controls, () = misclassified]

  33. Classification explains data! (Can Mert thesis project) [Figure: graphs over X, Y, Z and W, one with a class node H]

  34. Autoclass1. [Figure: total gray; A = schiz, C = controls]

  35. Weak signals in genetics data • Numerous investigations have indicated ‘almost significant’ associations of SNPs with diagnosis • Typically, these findings cannot be confirmed in other studies - populations are genetically heterogeneous and measurements non-standardized. • We try to connect SNPs both to diagnosis and to other phenotypic variables • Multiple testing and weak signal problems.

  36. Genetics data - weak statistics
     Gene     SNP                 Type   Genotype counts
     DRD3     SerGly              A/C     49   59   14
     DRD2     Ser311Cys           C/G    118    4    0
     NPY      Leu7Pro             A/C      1    7  144
     DBH      Ala55Ser            G/T     98   24    0
     BDNF     Val66Met            A/G      5   37   80
     HTR5A    Pro15Ser            C/T    109   11    2
     PNOC     Gln172Arg           A/G     11   37   28
     SLC6A4   (del(44bp) in pr)   S/L     20   60   42

  37. Empirical distribution by genotype: gene BDNF (schiz + controls). [Figure: cumulative distribution of frontal CSF for genotypes A/A, A/G and G/G]

  38. Bonferroni-Hochberg-Benjamini methods: MRI and lab data. [Figure: observed p-values and ‘no effect’ p-values against number of p-values; FDRi 71, FDRd 62] Benjamini & Yekutieli, Annals of Math Stat, (ta)

  39. Multiple comparisons: what is the significance of smallest p-values of 1%, 1%, 2% and 3% in 20 tests? What is the probability of obtaining a more extreme result?

  40. Compensating multiple comparisons • Bonferroni 1937: For level α and n tests, use level α/n • Hochberg 1988: step-up procedure • Benjamini, Hochberg 1996: False Discovery Rate • J. Storey, 2002: pFDRi, pFDRd • Bayesian interpretations being developed (Wasserman & Genovese, 2002)

  41. Diagnosis-genotype: 21 tests on three genes
     bdnf     drd2     nrg1
     0.1137   0.0749   0.8744
     0.7293   0.7276   0.0096

     bdnf     drd2     nrg1
     0.1136   0.0735   0.8709
     0.0801   0.2213   0.7666
     0.0316   0.0823   0.6426
     0.5499   1.0000   0.0244
     0.7314   0.7312   0.0103
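To make the corrections from slide 40 concrete, the sketch below applies Bonferroni adjustment and the Benjamini-Hochberg step-up adjustment to the 21 p-values listed on this slide. Treating all 21 values as one family of tests is an assumption made for the illustration, and the helper names are invented here. It is not necessarily the procedure behind the figures quoted elsewhere in the talk.

```python
# The 21 diagnosis-genotype p-values from the slide, read row by row.
P_VALUES = [
    0.1137, 0.0749, 0.8744, 0.7293, 0.7276, 0.0096,
    0.1136, 0.0735, 0.8709, 0.0801, 0.2213, 0.7666,
    0.0316, 0.0823, 0.6426, 0.5499, 1.0000, 0.0244,
    0.7314, 0.7312, 0.0103,
]

def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure).

    The i-th smallest p-value is scaled by n / i, and a running minimum
    taken from the largest p-value downwards keeps the result monotone.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

if __name__ == "__main__":
    print(min(bonferroni(P_VALUES)))
    print(min(benjamini_hochberg(P_VALUES)))
```

Under these assumptions neither adjustment brings any of the 21 tests below a 5% threshold, which illustrates the weak-signal problem discussed for the genetics data.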

  42. p-values 3%, 2%, 1%, 1% in 20 tests: 7%, not quite significant, but better than Bonferroni (20%)

  43. q-value - FDR rate in prefix
