
Identifying Potential Biomarkers for Chronic Fatigue Syndrome via Classification Model Ensemble Mining Ben Goertzel, PhD Biomind LLC www.biomind.com

General Methodology Using machine learning to find nonlinear combinations of genes, mutations or clinical indicators that are associated with diseases, toxic reactions, symptoms, or other phenotypic qualities

Summary • CAMDA 2006 data was analyzed using a novel “classification model ensemble mining” methodology • Genetic programming and heuristic search are applied to learn ensembles of classification rules that distinguish CFS from Control (one set of classifiers based on microarray data, one set based on SNP data) • These ensembles are then statistically analyzed to identify genes, gene categories, and combinations thereof that appear to play important roles in characterizing CFS • The results of this analysis include • potential microarray- and SNP-based diagnostic rules for CFS • lists of SNPs, genes and gene categories that are potentially significant biomarkers for CFS (and that differ from those found via simple statistical category-differentiation analysis)

Summary Overall, our results appear compatible with a system-theoretic view of CFS, in which the disorder is a complex pattern of activity across the organism, including interlinked disturbances in neural and endocrine systems.

Conceptual Hypothesis

Recent Related Work • Goertzel BN, Pennachin C, de Souza Coelho L, Maloney EM, Jones JF, Gurbaxani B. Allostatic load is associated with symptoms in chronic fatigue syndrome patients. Pharmacogenomics. 2006 Apr;7(3):485-94. • Goertzel BN, Pennachin C, de Souza Coelho L, Gurbaxani B, Maloney EM, Jones JF. Combinations of single nucleotide polymorphisms in neuroendocrine effector and receptor genes predict chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):475-83. • Maloney EM, Gurbaxani BM, Jones JF, de Souza Coelho L, Pennachin C, Goertzel BN. Chronic fatigue syndrome and high allostatic load. Pharmacogenomics. 2006 Apr;7(3):467-73. • Gurbaxani BM, Jones JF, Goertzel BN, Maloney EM. Linear data mining the Wichita clinical matrix suggests sleep and allostatic load involvement in chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):455-65.

Overview • Microarray Data Analysis • SNP Data Analysis

Analyzing Microarray Data via Supervised Categorization • Often more statistically meaningful than clustering • and allows one to do clustering of features based on whether they’re used in the same categorization models • The researcher must divide the data into two or more categories, e.g. • Case vs. Control • Early vs. Late (in a time series experiment) • Multiclass categorization: which kind of cancer? • Algorithms learn rules (“models”) that predict which category a microarray gene expression profile falls into, via combining expression values in an automatically learned mathematical formula

Supervised Categorization Algorithms Many supervised categorization algorithms exist, each with strengths and weaknesses. Unlike with clustering, a choice among them may be made based on a rigorous validation methodology. • Decision trees • Neural networks • Logistic regression • Support vector machines • Genetic programming • Etc.

Applications of Supervised Categorization Analysis • Classification models may be used as diagnostic rules • Classification models may be studied to yield intuitive insight • particularly interesting in the case of model ensembles

Example Classification Rule Learned via Genetic Programming if (NM_005110 + NM_001614)/NM_002230 - .3* NM_002297 > 1 then Case else Control
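
The rule above can be read as a small function of four expression values. The following is a minimal sketch, assuming a profile is stored as a dictionary keyed by accession ID; the function name `classify_profile` and the sample values are illustrative and are not taken from the study.

```python
def classify_profile(expr):
    """Apply the example rule to one expression profile.

    `expr` maps accession IDs to expression values. The container type and
    the function name are assumptions made for this illustration.
    The rule divides by NM_002230, so that value is assumed to be nonzero.
    """
    score = (expr["NM_005110"] + expr["NM_001614"]) / expr["NM_002230"] \
        - 0.3 * expr["NM_002297"]
    return "Case" if score > 1 else "Control"


# Made-up expression values, for illustration only.
sample = {
    "NM_005110": 1.2,
    "NM_001614": 0.9,
    "NM_002230": 1.0,
    "NM_002297": 0.5,
}
print(classify_profile(sample))
```

Because the rule is a short arithmetic expression over a handful of genes, it can be inspected directly, which is what makes such models usable both as diagnostic rules and as a source of qualitative insight.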

Using Ontologies to Make Enhanced Feature Vectors Traditional statistical and machine learning methods characterize individuals by expression values alone. One may extend these values with entries representing the inferred expression levels of interesting categories of genes or proteins, as well as clinical data. These enhanced feature vectors are then fed into machine learning algorithms carrying out categorization, clustering, etc.
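
One way to picture an enhanced feature vector is sketched below. The slides do not say how a category's expression level is inferred, so this sketch assumes the simplest option, the mean over the category's measured member genes. The function name `enhance`, the category membership, and the values are all hypothetical.

```python
from statistics import mean


def enhance(expr, categories):
    """Return a copy of `expr` extended with one value per category.

    Assumption (not stated in the slides): a category's inferred level is
    the mean expression of its member genes that were actually measured.
    """
    enhanced = dict(expr)
    for category_id, genes in categories.items():
        measured = [expr[g] for g in genes if g in expr]
        if measured:  # skip categories with no measured member gene
            enhanced[category_id] = mean(measured)
    return enhanced


# Made-up values and a made-up category membership, for illustration only.
expr = {"NM_001527": 1.5, "NM_005110": 0.5, "NM_002230": -0.2}
categories = {"GO:0004407": ["NM_001527", "NM_005110"]}

enhanced = enhance(expr, categories)
print(enhanced["GO:0004407"])
```

The enhanced dictionary keeps every original gene-level entry and adds the category-level entries alongside them, so downstream learners can use either kind of feature.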

Gene Expression Data Enhancement

Example Classification Rule Learned via Genetic Programming (figure: a learned rule with one “Enhanced Feature” based on GO and one “Enhanced Feature” based on PIR)

Example Ontology-Based Classification Rule in Biomind ArrayGenius User Interface

Classification Accuracy Using Biomind Tools

Categorization Model Ensembles • Generally there will be many qualitatively different “classification models” distinguishing one category from another • For diagnostics, one needs only a single good rule • though voting across a model ensemble may give better accuracy than any individual learned rule • For gaining qualitative understanding, statistical analysis of feature usage across models in an ensemble appears to be quite valuable • Genetic programming is a particularly useful technique here, because each learned model tends to use a relatively small number of features
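
Voting across a model ensemble can be sketched as a simple majority vote. This is a minimal illustration, assuming each model is a callable that returns "Case" or "Control"; the three toy rules and the feature names `g1`, `g2`, `g3` are hypothetical and do not come from the study.

```python
from collections import Counter


def ensemble_vote(models, profile):
    """Return the label predicted by the most models.

    With an even number of models a tie is possible; Counter.most_common
    then returns the label that was counted first, so an odd ensemble size
    is the safer choice.
    """
    votes = Counter(model(profile) for model in models)
    return votes.most_common(1)[0][0]


# Three toy rules over hypothetical features g1, g2, g3.
models = [
    lambda p: "Case" if p["g1"] > 0 else "Control",
    lambda p: "Case" if p["g2"] > 1 else "Control",
    lambda p: "Case" if p["g1"] + p["g3"] > 0.5 else "Control",
]

profile = {"g1": 0.4, "g2": 0.2, "g3": 0.3}
print(ensemble_vote(models, profile))
```

Here two of the three rules vote "Case", so the ensemble outvotes the one rule that disagrees.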

Important Features Analysis • Given a classification model ensemble, one can list the features that occur in the greatest number of models • These are NOT necessarily the same features that provide the greatest differentiation between the two categories, considered individually
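
Counting feature usage across an ensemble is straightforward once each model is reduced to the set of features it uses. The sketch below assumes that representation; the function name `important_features` and the toy ensemble are illustrative only.

```python
from collections import Counter


def important_features(ensemble, top_n=10):
    """Rank features by the number of models in which they occur.

    `ensemble` is an iterable of feature collections, one per model.
    Each feature is counted at most once per model.
    """
    counts = Counter()
    for features in ensemble:
        counts.update(set(features))
    return counts.most_common(top_n)


# Toy ensemble of three models; the feature choices are made up.
ensemble = [
    {"NM_005110", "GO:0004407"},
    {"GO:0004407", "NM_002230"},
    {"GO:0004407", "NM_005110"},
]
print(important_features(ensemble, top_n=2))
```

A feature can rank highly here because it combines well with others in many rules, even if it separates the two categories poorly on its own.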

Classification Model Utilization Based Clustering • Associate each feature (gene, GO category, etc.) with the set of successful classification models that use this feature • Interpret these sets as “meta feature vectors”, one for each direct or enhanced feature in the original dataset • Cluster these meta feature vectors • The resulting clusters are sets of genes or GO categories that have interesting interactions in the context of the classification problem at hand
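
The steps above can be sketched as follows. The slides do not name a distance measure or a clustering algorithm, so this sketch assumes Jaccard distance between usage sets and a simple single-linkage grouping with a cutoff; the function names, the `max_distance` parameter, and the toy ensemble are all hypothetical.

```python
from itertools import combinations


def meta_feature_vectors(ensemble):
    """Map each feature to the set of model indices that use it."""
    usage = {}
    for idx, features in enumerate(ensemble):
        for feature in features:
            usage.setdefault(feature, set()).add(idx)
    return usage


def jaccard_distance(a, b):
    """1 minus the share of models common to both usage sets."""
    return 1.0 - len(a & b) / len(a | b)


def cluster_by_usage(usage, max_distance=0.5):
    """Group features whose usage sets lie within max_distance (single linkage)."""
    parent = {f: f for f in usage}

    def find(f):
        # Follow parent links to the group representative.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for a, b in combinations(usage, 2):
        if jaccard_distance(usage[a], usage[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for f in usage:
        clusters.setdefault(find(f), set()).add(f)
    return list(clusters.values())


# Toy ensemble: four models over made-up features A to D.
ensemble = [{"A", "B"}, {"A", "B", "C"}, {"C", "D"}, {"D"}]
clusters = cluster_by_usage(meta_feature_vectors(ensemble))
print(sorted(sorted(c) for c in clusters))
```

In this toy case A and B are used by exactly the same models and end up together, while C and D share too few models with anything else to be linked.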

Example Application: Analysis of CFS Data • Collaboration with Dr. Suzanne Vernon at CDC • Specific Aims • To determine which genes are consistently expressed in the peripheral blood • To determine if genes and background knowledge could be used to classify CFS • To determine if there was a common (perturbed) CFS pathway • To understand differences between post-EBV fatigue and other types of CFS

Details of CFS Data • We analyzed • CAMDA (Wichita) dataset • Exercise dataset • Post-Infectious Fatigue dataset • Gene expression: • Noise filtering: started with 30,000 genes • Exercise dataset reduced to 1,921 • Wichita dataset reduced to 10,812 • Log-transformed and Z-score normalized • Background knowledge • Exercise dataset added 377 GO and 145 PIR features • Wichita dataset added 1405 GO and 1413 PIR features
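
The normalization step can be illustrated on a single gene. The slides do not give the logarithm base or state whether Z-scores were computed per gene or per array, so this sketch assumes log base 2 and a per-gene Z-score across samples using the population standard deviation; the function name `log_zscore` and the intensities are made up.

```python
import math
from statistics import mean, pstdev


def log_zscore(values):
    """Log-transform positive intensities, then Z-score them.

    Assumptions: base-2 logarithm and population standard deviation.
    A gene with no variation is mapped to all zeros.
    """
    logged = [math.log2(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged)
    if sigma == 0:
        return [0.0 for _ in logged]
    return [(x - mu) / sigma for x in logged]


# Made-up raw intensities for one gene across three samples.
z = log_zscore([1.0, 2.0, 4.0])
print(z)
```

After this step every gene has mean 0 and unit variance across samples, which puts genes with very different raw intensity ranges on a comparable scale.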

Results Confusion Matrix on CAMDA Data (Leave-One-Out Validation)

10 Most Important Features

Comparison of Clustering Approaches (Example Clusters) Expression-Based Clustering • GO:0000118 histone deacetylase complex • GO:0004407 histone deacetylase activity • GO:0005667 transcription factor complex • GO:0006476 protein amino acid deacetylation • GO:0016570 histone modification • GO:0016575 histone deacetylation • GO:0019213 deacetylase activity • NM_001527 Homo sapiens histone deacetylase 2 (HDAC2), mRNA Model Usage Based Clustering • GO:0004407 histone deacetylase activity • AC016882 • GO:0042221 response to chemical substance • GO:0008628 induction of apoptosis by hormones • ENSG00000086758 • AB053232 • GO:0009991 response to extracellular stimulus

Interpretation of Histone Cluster • The relation between histone deacetylase and apoptosis is now well known • Caspase-2 and -3, members of the caspase superfamily, the major group of proteins responsible for triggering apoptosis (Cryns and Yuan, 1998), have been shown to interact with and cleave the amino-terminal portion of histone deacetylase 4, which accumulates in the nucleus and interacts physically with the transcription factor MEF2C, thus preventing this factor from activating anti-death signals that would allow cell survival (Paroni et al, 2004). • Coexposure of cells to histone deacetylase inhibitors (HDIs) in conjunction with STI571 has been observed to down-regulate proteins related to response to extracellular stimulus, such as phospho-extracellular signal-regulated kinase (ERK) (Yu et al, 2003).

Signal interaction map for cluster derived from CFS data

Signal interaction map for cluster derived from CFS data • 13 of 17 identified genes appear in the “hairball” (320 nodes, 1603 edges) • Produced in the MetaCore product by GeneGo

Measuring Clustering Quality • The quality of a clustering was measured as the product homogeneity × separation. • Homogeneity is calculated as the average of the distances of all members of the cluster to their nearest cluster-mates. • Separation is simply the minimum distance from any given member of the cluster to elements outside the cluster. • These particular definitions were used in order to minimize the influence of the size of the cluster on its quality.
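
The two definitions above translate directly into code. The sketch below follows them literally; the distance function is left as a parameter because the slides do not say which one was used. The function names and the one-dimensional toy points are illustrative assumptions.

```python
def homogeneity(cluster, dist):
    """Average distance from each member to its nearest cluster-mate."""
    nearest = []
    for i, member in enumerate(cluster):
        mates = [m for j, m in enumerate(cluster) if j != i]
        nearest.append(min(dist(member, m) for m in mates))
    return sum(nearest) / len(nearest)


def separation(cluster, outside, dist):
    """Minimum distance from any cluster member to any outside element."""
    return min(dist(member, o) for member in cluster for o in outside)


def clustering_quality(cluster, outside, dist):
    """Product homogeneity x separation, as defined on the slide."""
    return homogeneity(cluster, dist) * separation(cluster, outside, dist)


def abs_diff(a, b):
    return abs(a - b)


# Made-up 1-D points: the cluster needs at least two members.
cluster = [0.0, 1.0, 3.0]
outside = [10.0, 12.0]
quality = clustering_quality(cluster, outside, abs_diff)
print(quality)
```

Both quantities are an average or a minimum over nearest neighbours, not a sum over all pairs, which is why adding members to a cluster does not by itself inflate the score.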

Model Utilization Based versus Conventional Clustering

Useful feature map – Exercise Challenge • 62 of the 100 most useful features were genes that map to pathways in the figure • 31 of the 100 most useful features were GO categories that map to pathways in the figure • Pathway labels in the figure: RNA Pol II; Phosphorylation of platelet sec-1 and Syntaxin-4; RNA Pol I; Caspases; Lipid metabolism (triglyceride biosynthesis); Recognition and binding of core promoter elements by TFIID; Ribosome formation and translation initiation; mRNA processing, ribosomes, translation; RAD (DNA repair)

Useful feature map – Wichita • Wichita – 80 of 100 most useful features were GO categories and genes that map to several pathways. • Noticeably absent are features in DNA repair initiation, transcription, gene expression and mRNA processing. • Pathway labels in the figure: Metabolism; Apoptosis; DNA repair; Hemostasis; Cell cycle; Binding of TFIID

Useful feature map – Post-Infective Fatigue • Post-infective fatigue – 79 of 100 most useful features were GO categories and genes that map to mRNA processing and spliceosome formation. • Pathway label in the figure: spliceosome formation and mRNA processing

Interpretation of CFS Microarray Results • Features that make up these models indicate widespread disruption of cell homeostasis in both the Exercise and Wichita studies. • Features that make up the PIF classification model identify mRNA processing pathways known to be disrupted by EBV.

CFS SNP data analysis • CFS SNP data publicly available through the CAMDA 2006 challenge (see http://www.camda.duke.edu/camda06) • Genes pre-selected for SNP analysis because of possible CFS involvement • SNP data processed with Biomind software

SNP-Sets as Pattern Strength Classifiers Each Pattern Strength Classifier is simply a list of SNPs and a threshold. For a given individual being evaluated by a given rule, the “sum of SNP incidences” is computed in the following way: • if the individual has a SNP s (present in the SNP list of the rule) in heterozygosis, then the value 2 is summed for s; • if s is present in homozygosis, then 1 is summed; • finally, if s is undetermined for that individual, then 0 is summed. After this sum is computed for all SNPs in the rule list, the value is compared with the rule threshold: if it is greater than the threshold, the individual is classified as CFS, otherwise Control. For each SNP-set, the threshold value is selected that allows the SNP-set to achieve the maximum accuracy for distinguishing Case vs. Control.
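
The scoring and threshold selection described above can be sketched in a few lines. The weights are taken exactly as stated on the slide (2 for heterozygous, 1 for homozygous, 0 for undetermined). The genotype encoding as strings, the choice to give 0 to any SNP missing from an individual's record, the integer threshold search, the function names, and the toy data are all assumptions made for illustration.

```python
# Weights as stated on the slide; the string encoding is an assumption.
WEIGHTS = {"het": 2, "hom": 1, "undetermined": 0}


def snp_sum(genotypes, snp_list):
    """Sum of SNP incidences for one individual over the rule's SNP list."""
    return sum(WEIGHTS[genotypes.get(s, "undetermined")] for s in snp_list)


def classify(genotypes, snp_list, threshold):
    """CFS if the sum is greater than the threshold, otherwise Control."""
    return "CFS" if snp_sum(genotypes, snp_list) > threshold else "Control"


def best_threshold(individuals, labels, snp_list):
    """Pick the integer threshold with maximum accuracy (lowest one on ties)."""
    sums = [snp_sum(g, snp_list) for g in individuals]
    best_t, best_acc = None, -1.0
    for t in range(-1, 2 * len(snp_list) + 1):
        preds = ["CFS" if s > t else "Control" for s in sums]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical SNP names and made-up genotypes for four individuals.
snp_list = ["snp_a", "snp_b", "snp_c"]
individuals = [
    {"snp_a": "het", "snp_b": "het"},
    {"snp_a": "het", "snp_b": "hom", "snp_c": "hom"},
    {"snp_a": "hom"},
    {"snp_b": "hom", "snp_c": "hom"},
]
labels = ["CFS", "CFS", "Control", "Control"]

threshold, accuracy = best_threshold(individuals, labels, snp_list)
print(threshold, accuracy)
```

Note that selecting the threshold on the same individuals used to report accuracy gives an optimistic estimate; a held-out or leave-one-out evaluation would be needed to judge how well a SNP-set generalizes.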

SNP Based Classifiers for CFS vs. Control

Important Genes for Differentiating CFS vs. Control

Conceptual Hypothesis
