Data Mining in Schizophrenia Research
Stefan Arnborg, KTH, SICS
Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet
’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002
http://www.nada.kth.se/~stefan

HUBIN page: 1000 hits/day

Schizophrenia - Questions and Clues
• Cause(s) of schizophrenia not known.
• Medication effective against some symptoms - discovered by chance 100-2000 years ago.
• Does not appear in animals - no experimental clues.
• Explanation models vary over time.
• Disturbed neuronal circuitry in schizophrenia? (currently hottest hypothesis)
• Influenced by genotype and/or environment? (clustering in families)

Schizophrenia - Questions and Clues
• Which processes result in disease?
• Traces of disturbed development visible in MRI (anatomy) and blood tests?
• Genetic risk factors?
• Causal pathways?
• MAIN PROBLEM: Connect psychiatry to physiology

Preliminary analysis
• Test case:
  • 144 subjects: 61 affected, 83 controls
• Variables:
  • Diagnosis (DSM-IV)
  • Demography (age, gender, ..)
  • Blood tests (liver, heart, …)
  • Genetics (20 SNPs, receptor, growth factors, …)
  • Anatomy (MRI)
  • Neuropsychology (working memory, reactions)
  • Clinical test batteries (type of delusions, history, medication)

Brain boxes Picture from BRAINS II manual, Magnotta et al, University of Iowa

Manually drawn vermis regions ROIs drawn by Gaku Okugawa

Single Nucleotide Polymorphism
RNA:         AUG UUC CAU UAU UGU
Protein A:       Phe His Tyr Cys
RNA:         AUG UUU CAU UAU UAU
Protein A':      Phe His Tyr Tyr
• Coding SNP: UGU → UAU changes Cys to Tyr.
• Non-coding SNP: UUC → UUU leaves Phe unchanged.
Protein A can be slightly different from A'.
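The difference between a coding and a non-coding SNP can be checked by translating both RNA strings codon by codon. The sketch below is only an illustration: the names `CODON_TABLE` and `translate` are hypothetical, and the table holds just the six codons that appear on the slide (taken from the standard genetic code), not the full code.

```python
# Partial codon table: only the codons used in the slide's two sequences.
CODON_TABLE = {
    "AUG": "Met", "UUC": "Phe", "UUU": "Phe",
    "CAU": "His", "UAU": "Tyr", "UGU": "Cys",
}

def translate(rna):
    """Split an RNA string into codons and map each to its amino acid."""
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    return [CODON_TABLE[c] for c in codons]

protein_a = translate("AUGUUCCAUUAUUGU")          # original sequence
protein_a_variant = translate("AUGUUUCAUUAUUAU")  # sequence with two SNPs

# Positions where the amino acid actually changes (coding SNPs only).
changed = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(protein_a, protein_a_variant))
    if a != b
]
print(protein_a)
print(protein_a_variant)
print(changed)
```

Only the last codon shows up in `changed`; the UUC → UUU substitution is silent because both codons encode Phe.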

Some genes studied
• DBH dopamine beta-hydroxylase
• DRD2 dopamine receptor D2 +
• DRD3 dopamine receptor D3
• HTR5A serotonin receptor 5A
• NPY neuropeptide Y
• SLC6A4 serotonin transporter
• BDNF brain derived neurotrophic factor
• NRG1 neuregulin +

Elementary Visualizations: MRI
[Plot: cumulative distribution of intracranial volume (ml); + = schizo, second marker = controls]

Elementary Visualizations: MRI data
[Plot: cumulative distribution of total CSF volumes (ml), p < 0.0002; + = schizo, second marker = controls]

Gender differences: MRI
[Plots: subcortical white, one panel for men and one for women; + = schizo, second marker = controls]

Which methods to use?
• Visualizations, cdf and scatter plots, give intuitive grasp of variables - problems with many interrelated variables
• Statistical modelling required to decide significance of visible trend, and to rank effects
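As a rough illustration of the step from a cdf plot to a number, the sketch below builds an empirical cumulative distribution and the largest vertical gap between two of them (the two-sample Kolmogorov-Smirnov statistic). The volumes are made-up values, not HUBIN data, and `ecdf` and `ks_statistic` are hypothetical helper names; the slides do not say which test was used for the plotted p-values.

```python
def ecdf(values):
    """Return (value, cumulative fraction) pairs for a sample."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def ks_statistic(a, b):
    """Largest vertical distance between the two empirical cdfs."""
    def frac_le(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(frac_le(a, t) - frac_le(b, t)) for t in points)

# Hypothetical volumes in ml, for illustration only.
affected = [1400, 1450, 1500, 1550]
controls = [1420, 1480, 1520, 1600]

print(ecdf(affected))
print(ks_statistic(affected, controls))
```

The gap alone does not give a significance level; turning it into one still needs a reference distribution, which is where the statistical modelling on the slide comes in.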

Statistical methods
• Bayesian methods intuitive and rational - but conventional testing required for publications
• Linear models - need to account for mixing and over-dispersion (Glenn Lawyer thesis project).
• Mixture models (Valery Savcenko thesis project).
• Discretization and Bayesian analysis of discrete distributions - intuitive, but information lost

Model adequate? - Gelman’s post-predictive check
• Best tested with classical p-values.
• Determine posterior for parameter:
• Design test function
• Compute p-value:
• Reject model if p small, e.g., <1%, <5%
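The steps of the check can be sketched as follows. This is a minimal illustration under loud assumptions that are not on the slide: the model is a normal distribution with known unit variance and a flat prior on the mean (so the posterior for the mean is normal around the sample mean), the data are invented, and `posterior_predictive_pvalue` is a hypothetical name.

```python
import random
import statistics

def posterior_predictive_pvalue(data, test_fn, n_draws=2000, seed=0):
    """Fraction of replicated data sets whose test value is at least the observed one.

    Assumed model: y_i ~ Normal(mu, 1) with a flat prior on mu,
    so the posterior is mu ~ Normal(mean(data), 1/n).
    """
    rng = random.Random(seed)
    n = len(data)
    ybar = statistics.fmean(data)
    observed = test_fn(data)
    exceed = 0
    for _ in range(n_draws):
        mu = rng.gauss(ybar, 1.0 / n ** 0.5)                # draw from the posterior
        replicate = [rng.gauss(mu, 1.0) for _ in range(n)]  # simulate new data
        if test_fn(replicate) >= observed:
            exceed += 1
    return exceed / n_draws

# Invented data with one value far from the rest.
data = [-0.3, 0.1, 0.4, -1.2, 0.8, 0.2, -0.5, 6.0, 0.0, 0.3]

p_max = posterior_predictive_pvalue(data, max)                # test function: largest value
p_mean = posterior_predictive_pvalue(data, statistics.fmean)  # test function: sample mean
print(p_max, p_mean)
```

With the maximum as test function the p-value is close to zero, so the unit-variance normal model would be rejected. With the mean as test function it is close to one half, which shows why the choice of test function is listed as its own step.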

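Graphical models
• Y and Z both linked to X, not to each other: f(x,y,z) = f(x,z) f(x,y) / f(x)
• No links between X, Y and Z: f(x,y,z) = f(x) f(y) f(z)
• X, Y and Z all linked: f(x,y,z), no simplification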
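The first factorization can be verified numerically on a small discrete example. In the sketch below the three binary variables and all probabilities are invented for illustration; the joint distribution is built so that Y and Z are independent given X, and the code then checks that f(x,y,z) = f(x,z) f(x,y) / f(x) holds while the fully independent form f(x) f(y) f(z) does not.

```python
from itertools import product

# Invented conditional probability tables: Y and Z each depend only on X.
p_x = {0: 0.4, 1: 0.6}
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_x = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {
    (x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_x[x][z]
    for x, y, z in product((0, 1), repeat=3)
}

def marginal(joint, keep):
    """Sum the joint over all variables except those at the given positions."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

f_x, f_y, f_z = marginal(joint, (0,)), marginal(joint, (1,)), marginal(joint, (2,))
f_xy, f_xz = marginal(joint, (0, 1)), marginal(joint, (0, 2))

# Largest deviation from each candidate factorization.
err_conditional = max(
    abs(p - f_xz[(x, z)] * f_xy[(x, y)] / f_x[(x,)]) for (x, y, z), p in joint.items()
)
err_independent = max(
    abs(p - f_x[(x,)] * f_y[(y,)] * f_z[(z,)]) for (x, y, z), p in joint.items()
)
print(err_conditional, err_independent)
```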

Multivariate characterization by graphical models: MRI volumes, blood, demography
[Graph nodes: Dia, TemCSF, BrsCSF, SubCSF, TotCSF]

Adding Vermis variables
[Graph nodes: Dia, PSV, TemCSF, BrsCSF]

Learning, Intelligence, Executive vs Anatomy
[Figure: correlation map with color code (0 to 1) for positive and negative correlation. Tests: 1: RAVLTA1, 6: RAVLTATOT, 11: TMT, 14: WAIS-R, 15: WCST64. Areas (left, right; CSF, grey, white): Cerebellum, Vermis, CC, Brainstem, Frontal, Parietal, Vent2, Temporal, Total ic.]

Covariances: Cognitive Performance Index
1 RAVLTA1 Verbal learning
2 RAVLTA2 Verbal learning
3 RAVLTA3 Verbal learning
4 RAVLTA4 Verbal learning
5 RAVLTA5 Verbal learning
6 RAVLTATOT Verbal learning, total
7 RAVLTB Verbal learning, distraction
8 RAVLTA6 Verbal memory, immediate
9 RAVLTA7 Verbal memory, delayed
10 CPT d' Attention
11 TMT A Visuomotor function, speed
12 TMT B Visuomotor function, flexibility
13 LNS Working memory
14 WAIS-R Verbal ability, IQ
15 WCST64 categories Executive function
16 WCST64 total errors Executive function
17 WCST64 pers errors Executive function
18 WCST64 pers. respons Executive function

MR brain volumes

cpd wc mc wa ma

wc mc wa ma Executive ability

Pairs associated to Diagnosis
[Graphs over nodes D, Y, Z]
Y and Z co-vary differently for Affected and Controls

Age-dependency of Posterior Superior Vermis
[Scatter plot: post sup vermis against age at MRI; + = schizo, second marker = controls]

No co-variation between Posterior inferior vermis and parietal white for affected
[Scatter plot: post inf vermis against parietal white; + = schizo, second marker = controls]

PSV has best explanatory power
[Plot: posterior superior vermis, affected - healthy; + = schizo, second marker = controls]

Classification explains data! (Can Mert Thesis project)
[Graphs: hidden class node H with variables X, Y, Z, W]
But where is the model?

Model is parametrized mixture
[Graph: hidden class node H linked to X, Y, Z, W]
Parameter is mixing vector and classification partition!
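To make "mixing vector" and "classification partition" concrete, here is a minimal sketch of a two-component Gaussian mixture fitted by EM on one variable. It is not AutoClass and not the thesis implementation; the data are synthetic, and the function names and starting values are assumptions chosen for the illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    s = sorted(xs)
    mu = [s[len(s) // 4], s[3 * len(s) // 4]]  # crude start: lower and upper quartile
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]                        # the mixing vector
    resp = [0.5] * len(xs)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 0.
        resp = []
        for x in xs:
            a = weight[0] * normal_pdf(x, mu[0], sigma[0])
            b = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(a / (a + b))
        # M-step: re-estimate weight, mean and spread of each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            total = sum(r)
            weight[k] = total / len(xs)
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / total
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / total
            sigma[k] = max(math.sqrt(var), 1e-3)
    labels = [0 if v >= 0.5 else 1 for v in resp]  # the classification partition
    return weight, mu, sigma, labels

# Synthetic data: 60 points around 0 and 40 points around 6.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(60)] + [rng.gauss(6.0, 1.0) for _ in range(40)]

weight, mu, sigma, labels = em_two_gaussians(xs)
print(weight, mu)
```

The returned `weight` plays the role of the mixing vector and `labels` the role of the partition; a Bayesian classifier would put priors on both rather than taking point estimates.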

Autoclass 1
[Plot: total gray; A = schiz, C = controls]

Weak signals in genetics data
• Numerous investigations have indicated ‘almost significant’ associations of SNPs with diagnosis
• Typically, these findings cannot be confirmed in other studies - populations genetically heterogeneous and measurements nonstandardized.
• We try to connect SNPs both to diagnosis and to other phenotypical variables
• Multiple testing and weak signal problems.

Genetics data - weak statistics
Gene     SNP                 type   genotypes
DRD3     SerGly              A/C     49   59   14
DRD2     Ser311Cys           C/G    118    4    0
NPY      Leu7Pro             A/C      1    7  144
DBH      Ala55Ser            G/T     98   24    0
BDNF     Val66Met            A/G      5   37   80
HTR5A    Pro15Ser            C/T    109   11    2
PNOC     Gln172Arg           A/G     11   37   28
SLC6A4   (del(44bp) in pr)   S/L     20   60   42
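One way to see why these statistics are weak is to turn the genotype counts into allele frequencies. The sketch below assumes, without confirmation from the slide, that the three counts per gene are ordered as homozygote for the first listed allele, heterozygote, homozygote for the second; `allele_frequencies` is a hypothetical helper, and only a few rows of the table are used.

```python
def allele_frequencies(n_hom1, n_het, n_hom2):
    """Allele frequencies from genotype counts (each subject carries two alleles)."""
    total_alleles = 2 * (n_hom1 + n_het + n_hom2)
    freq1 = (2 * n_hom1 + n_het) / total_alleles
    return freq1, 1.0 - freq1

# Genotype counts copied from the table; the column order is an assumption.
counts = {
    "DRD3": (49, 59, 14),
    "DRD2": (118, 4, 0),
    "DBH": (98, 24, 0),
    "HTR5A": (109, 11, 2),
}

minor = {gene: min(allele_frequencies(*c)) for gene, c in counts.items()}
for gene, maf in minor.items():
    print(f"{gene}: minor allele frequency {maf:.3f}")
```

For DRD2 only 4 of 244 alleles are the rare variant, so any comparison between affected and controls rests on a handful of carriers.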

Empirical distribution by genotype
Gene BDNF (schiz + controls)
[Plot: cumulative distribution of frontal CSF by genotype A/A, A/G, G/G]

Bonferroni-Hochberg-Benjamini methods: MRI and lab data
[Plot: observed p-values against ‘no effect’ p-values, by number of p-values; FDRi 71, FDRd 62]
Benjamini & Yekutieli, Annals of Math Stat, (ta)

Compensating multiple comparisons
• Bonferroni 1937: For level α and n tests, use level α/n
• Hochberg 1988: step-up procedure
• Benjamini, Hochberg 1995: False Discovery Rate
• J. Storey, 2002: pFDRi, pFDRd
• Bayesian interpretations being developed (Wasserman & Genovese, 2002)
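The first two corrections in the list can be sketched in a few lines. The p-values below are hypothetical, chosen to show how the step-up procedure can reject more hypotheses than Bonferroni at the same level; the function names are illustrative and the code is a plain reading of the two rules, not the analysis used in the study.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def hochberg_reject(p_values, alpha=0.05):
    """Hochberg step-up: find the largest rank k with p_(k) <= alpha / (n - k + 1)
    and reject the k hypotheses with the smallest p-values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for rank in range(n, 0, -1):  # start from the largest p-value
        if p_values[order[rank - 1]] <= alpha / (n - rank + 1):
            k_max = rank
            break
    reject = [False] * n
    for rank in range(1, k_max + 1):
        reject[order[rank - 1]] = True
    return reject

# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.012, 0.02, 0.04, 0.045]

print(sum(bonferroni_reject(p_values)))  # number rejected by Bonferroni
print(sum(hochberg_reject(p_values)))    # number rejected by Hochberg
```

Here Bonferroni rejects one hypothesis and Hochberg rejects all five, because the largest p-value already falls below the unadjusted level.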

Diagnosis-genotype: 21 tests on three genes
bdnf     drd2     nrg1
0.1137   0.0749   0.8744
0.7293   0.7276   0.0096

bdnf     drd2     nrg1
0.1136   0.0735   0.8709
0.0801   0.2213   0.7666
0.0316   0.0823   0.6426
0.5499   1.0000   0.0244
0.7314   0.7312   0.0103

Multiple comparisons: what is the significance of minimum p-values 1, 1, 2, 3% in 20 tests? What is the probability of obtaining a more extreme result when there is no effect?

p-values 3%, 2%, 1%, 1% in 20 tests: 7% - not quite significant! But better than Bonferroni: 20%
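The Bonferroni figure of 20% follows directly from the smallest p-value and the number of tests. The short sketch below reproduces it and, for comparison, the slightly smaller value obtained if the 20 tests are assumed independent (the Šidák form, which is not mentioned on the slide). The method behind the 7% figure is not stated in the deck and is not reproduced here.

```python
n_tests = 20
min_p = 0.01  # smallest of the reported p-values

# Bonferroni bound on the chance that the smallest of n p-values is this small.
bonferroni = min(1.0, n_tests * min_p)

# Same probability if the tests are independent (an added assumption).
sidak = 1.0 - (1.0 - min_p) ** n_tests

print(f"Bonferroni: {bonferroni:.2f}")
print(f"Independent tests: {sidak:.3f}")
```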

q-value - FDR rate in prefix
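One common reading of this definition is the Benjamini-Hochberg adjusted p-value: sort the p-values, compute the FDR bound p_(i) · n / i for each prefix of length i, and give each p-value the smallest bound over all prefixes that contain it. The sketch below applies that reading to the 21 diagnosis-genotype p-values listed earlier in this deck. It is an assumption that this matches the plotted quantity; Storey's pFDR-based q-value additionally estimates the share of true null hypotheses, which is not done here.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q[idx] = running_min
    return q

# The 21 diagnosis-genotype p-values from the deck (bdnf, drd2, nrg1 columns).
p_values = [
    0.1137, 0.0749, 0.8744, 0.7293, 0.7276, 0.0096,
    0.1136, 0.0735, 0.8709, 0.0801, 0.2213, 0.7666,
    0.0316, 0.0823, 0.6426, 0.5499, 1.0000, 0.0244,
    0.7314, 0.7312, 0.0103,
]

q = bh_q_values(p_values)
print(f"smallest q-value: {min(q):.4f}")
print("tests with q <= 0.05:", sum(v <= 0.05 for v in q))
print("tests with q <= 0.20:", sum(v <= 0.20 for v in q))
```

Under this reading no test passes a 5% false discovery rate, and four pass a 20% rate.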

Beyond Robustness: SVM
Support Vector Methods are
- distribution-independent
- insensitive to dimension
- error of classifier no more than [formula not recovered] with probability d, training set size l, unclassified examples b, margin g, pdf support within R-sphere
[Figure: + and - examples separated by a margin; b=1]

Beyond Robustness: SVM
Support Vector Methods are
- distribution-independent
- insensitive to dimension
- error of classifier no more than [formula not recovered] with probability d, training set size l, unclassified examples b, margin g, all examples within R-sphere
[Figure: + and - examples separated by a margin; b=0]

That’s all, folks!
• High-quality databases for medical research of the HUBIN type open up for intelligent data analysis methods used in engineering and business
• Already with the limited data presently available, interesting clues emerge
• Multiple testing considerations are important
• Long term effort - stable economy and engagement is vital.
