The metabonomics

Combination of Independent Component Analysis and statistical modeling for the identification of metabonomic biomarkers in 1H-NMR spectroscopy
Réjane Rousseau (Institut de Statistique, UCL, Belgium)

The metabonomics
[Figure: a biofluid (urine) carrying the metabolites of tissues and organs is analysed by 1H-NMR spectroscopy; spectra in the frequency domain without contact and after contact, with the specific altered region marked as the biomarker.]
• One molecule = several peaks (1 to 3) with specific positions in the frequency domain.
• The concentration of a molecule is proportional to the area under the curve of its peaks.
• Spectral alterations = altered metabolites = detection.
Identification of biomarkers: « which part of the spectrum to examine? »
• How?
  - in an experimental database
  - with a methodology based on Principal Component Analysis (“PCA”)
Objective: propose a methodology combining ICA and statistical modeling.

Identification of biomarkers
• Experimental data:
  - Biologists have a pool of n rats with a controlled pathological state.
  - They collect n samples of urine. 1H-NMR produces n spectra of m values = spectral data X (n x m).
  - Each sample (spectrum) is also described by l variables in a design matrix Y (n x l), ex: y_1 = age of the rat, y_2 = severity of diabetes.
  - One of these variables describes the characteristic related to the biomarkers: y_k (here y_k = y_2).
• Biomarker identification = statistical methods answering the question: « In these multivariate data, which variables x_j are the most altered when y_k changes? »

Example: controlled data
• Advantage of controlled data: we know the spectral regions that should be identified as biomarkers.
• The controlled data:
  - 28 spectra of 600 points: X (28 x 600)
  - each spectrum = a sample of urine + a chosen concentration of citrate + a chosen concentration of hippurate
  - design matrix Y (28 x 2): y_1 = concentration of citrate, y_2 = concentration of hippurate
• We need a biomarker to detect changes in the level of citrate described by y_1: « Which spectral regions x_j are the most altered when y_1 changes? »
• The spectral regions corresponding to citrate = the biomarkers to identify.
[Figure: spectrum with the citrate (y_1) and hippurate (y_2) regions labelled.]

• Spectrum: 3000 points … a spectrum of 600 values x_j with ∑ x_j = 1
• Each sample = natural urine + hippurate (y_2) + citrate (y_1)
• 14 mixtures in 2 replications = 28 samples
• The biomarkers to identify = spectral regions corresponding to citrate
[Figure: spectra of urine, citrate and hippurate.]

USUAL versus NEW
USUAL: I. Reduction of the dimension: PCA, X_C = T P
• Principal components are:
  - uncorrelated
  - in the direction of maximum variance
• Examination of the first 2 components: score plot and loadings L1, L2 (ex: citrate plays an important role) → identification of biomarkers
• This is only powerful if the biological question is related to the highest variance in the dataset!
NEW: I. Reduction of the dimension: ICA, X_C^T = S A^T
• Components:
  - are independent
  - have a biological meaning
• Examination of ALL the components: to visualize unconnected molecules in the samples
NEW: II. Biomarker discovery through statistical modelling
• Identification of biomarkers
• Comparison of the intensities of the biomarkers between spectra from different conditions
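The "usual" decomposition X_C = T P can be sketched with a singular value decomposition. This is only an illustration: the random matrix below is a hypothetical stand-in for the 28 x 600 spectral data, and the names `T` (scores) and `P` (loadings) follow the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((28, 600))            # hypothetical stand-in for the spectra (n x m)
Xc = X - X.mean(axis=0)              # column-centered data X_C

# SVD: Xc = U diag(sv) Vt, so scores T = U diag(sv) and loadings P = Vt.
U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * sv                           # scores, one row per spectrum
P = Vt                               # loadings, one row per component

explained = sv**2 / np.sum(sv**2)    # share of variance carried by each component
print(explained[:2])                 # the two components usually examined
```

The scores are uncorrelated and sorted by decreasing variance, which is why only the first two are usually plotted.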

The proposed methodology:
Part I: Dimension reduction with ICA on the spectral data: X_C^T = S A^T
• Resulting components:
  - meaningful components
  - with some advantages over the principal components
Part II: Biomarker discovery through statistical modelling
• Identification of biomarkers
• Comparison of the values in the biomarkers between spectra from different conditions

What is Independent Component Analysis (ICA)?
• The idea:
  - Each observed vector of data is a linear combination of unknown independent (not only linearly independent) components.
  - ICA provides the independent components (sources, s_k) that created a vector of data, and the corresponding mixing weights a_ki.
• How do we estimate the sources? With linear transformations of the observed signals that maximize the independence of the sources.
• How do we evaluate this property of independence? By the Central Limit Theorem (*), a mixture of independent components tends to be closer to Gaussian than the components themselves, so the independence of the sources can be reflected by their non-gaussianity.
• Solving the ICA problem consists of finding a demixing matrix that maximises the non-gaussianity of the estimated sources under the constraint that their variances are constant.
• Fast-ICA algorithm:
  - uses an objective function related to negentropy
  - uses a fixed-point iteration scheme.
(*) Almost any measured quantity that depends on several underlying independent factors has a Gaussian PDF.
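The link between independence and non-gaussianity can be checked numerically. The sketch below uses excess kurtosis as a simple non-gaussianity measure (an assumption made for brevity; the slide mentions negentropy) and two hypothetical uniform sources.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis: close to 0 for Gaussian data, non-zero otherwise."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(0)
n_points = 50_000
# Two independent, non-Gaussian sources (hypothetical stand-ins for s_k).
s1 = rng.uniform(-1.0, 1.0, n_points)
s2 = rng.uniform(-1.0, 1.0, n_points)
# An observed signal is a linear mixture of the sources.
x = 0.6 * s1 + 0.8 * s2

k_source = excess_kurtosis(s1)
k_mix = excess_kurtosis(x)
print(k_source, k_mix)   # the mixture is closer to 0, i.e. closer to Gaussian
```

Because mixing pushes the signal towards gaussianity, a demixing transformation that maximises non-gaussianity moves back towards the individual sources.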

I. Dimension reduction by ICA: X_C^T = S A^T
• X (n x m): n spectra defined by m variables, ex: (28 x 600).
• Each spectrum is a weighted sum of independent spectral expressions, each of which can correspond to an independent (composite) metabolite contained in the studied sample (a^T: weight, related to the quantity).
• Transposition: X^T (m x n).
• Centering: the mixtures and the s_j have zero mean; X_C^T (m x n).
• “Whitening” by PCA: T (m x q) = X_C^T P
  - we obtain uncorrelated scores with unit variance → the demixing matrix is orthogonal
  - possibility to discard irrelevant scores = choose the number of sources to estimate
• ICA: S (m x q) = X_C^T P W, where the product P W is the demixing matrix applied to X_C^T.
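The chain transposition, centering, whitening and ICA can be sketched as follows. This is a minimal symmetric fixed-point iteration with a tanh nonlinearity written for illustration, not the implementation used for the slides; the synthetic sources, the helper names (`sym_orth`, `whiten`, `fastica`) and the choice of tanh are all assumptions.

```python
import numpy as np

def sym_orth(W):
    """Nearest orthogonal matrix to W (symmetric decorrelation)."""
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt

def whiten(Xt_c, q):
    """PCA whitening: scores T = Xt_c @ P are uncorrelated with unit variance."""
    m = Xt_c.shape[0]
    eigval, eigvec = np.linalg.eigh(Xt_c.T @ Xt_c / m)   # ascending eigenvalues
    keep = np.argsort(eigval)[::-1][:q]                   # keep the q largest
    P = eigvec[:, keep] / np.sqrt(eigval[keep])           # (n x q)
    return Xt_c @ P, P

def fastica(T, n_iter=500, tol=1e-10, seed=0):
    """Fixed-point iteration on whitened scores; returns an orthogonal W (q x q)."""
    m, q = T.shape
    W = sym_orth(np.random.default_rng(seed).standard_normal((q, q)))
    for _ in range(n_iter):
        G = np.tanh(T @ W)
        # Column j: E[z g(w_j'z)] - E[g'(w_j'z)] w_j, then re-orthogonalize.
        W_new = sym_orth(T.T @ G / m - W * (1.0 - G**2).mean(axis=0))
        # Converged when every new column is parallel to the old one (sign ignored).
        done = np.max(np.abs(np.abs((W_new * W).sum(axis=0)) - 1.0)) < tol
        W = W_new
        if done:
            break
    return W

# Synthetic check: 2 hypothetical non-Gaussian sources mixed into 6 "spectra".
rng = np.random.default_rng(42)
m, n, q = 5000, 6, 2
S_true = np.column_stack([rng.laplace(size=m), rng.uniform(-1, 1, size=m)])
A_true = rng.standard_normal((n, q))
Xt = S_true @ A_true.T              # plays the role of X^T (m x n)
Xt_c = Xt - Xt.mean(axis=0)         # centering: each mixture has zero mean

T, P = whiten(Xt_c, q)              # T (m x q) = X_C^T P
W = fastica(T)
S = T @ W                           # S (m x q) = X_C^T P W
At = np.linalg.pinv(P @ W)          # mixing weights A^T (q x n)
```

In this sketch the mixing weights are recovered as the pseudo-inverse of the demixing matrix P W, so that S multiplied by A^T reproduces the centered data restricted to the q retained components.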

Example: I. Dimension reduction by ICA
• X_C^T = S A^T, with X_C^T (600 x 28) = S (600 x 6) A^T (6 x 28)
[Figure: the 28 centered spectra (urine + citrate + hippurate, columns 1 to 28 of X_C^T), the 6 sources s_1 to s_6 and the 6 weight vectors a_1^T to a_6^T.]

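[Figure: the sources S (600 x 6) and their weights A^T over the 28 spectra; sources labelled natural urine, citrate and hippurate; one weight a^T_{i,8} is marked.]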

Note: Comparison with the usual PCA
• Similarities: both are projection methods linearly decomposing multi-dimensional data into components.
• Differences:
  - The number of sources, q, has to be fixed.
  - Sources are not naturally sorted according to their importance.
  - The independence condition = the biggest advantage of ICA:
    - independent components are more meaningful than uncorrelated components
    - more suitable for our question, in which the components of interest are not always in the direction of maximum variance.
[Figure: PCA (components 1 and 2) versus ICA; panels labelled hippurate, citrate and natural urine.]

[Figure: PCA versus ICA. PCA side: loadings 1 to 3 and the score plot PC1 versus PC2. ICA side: sources s_1 to s_3 and the weights a_2^T versus a_3^T. Panel labels: hippurate & citrate, natural urine, citrate, hippurate.]

The proposed methodology:
Part I: Dimension reduction with ICA on the spectral data: X_C^T = S A^T
• q sources representing the spectra of independent (unrelated) composite metabolites contained in the samples.
Part II: Biomarker discovery through statistical modeling
• on the mixing matrix A^T
• with covariates chosen in the design matrix
• Identification of biomarkers
• Comparison of the intensities in the biomarkers between spectra from different conditions

PART II: Biomarker discovery with a statistical model
The idea: among the q recovered s_j, we suppose that some sources:
• present biomarker regions for a chosen factor y_k
• are interpretable as the spectra of pure or composite independent metabolites whose concentration in the samples is influenced by the chosen factor y_k
• have weights influenced by the chosen factor y_k
→ Modelling of the relation between the weight vectors in A^T (q x n) and the design variables.

PART II: Biomarker discovery through statistical modelling
(Part I: ICA on the spectral data: X_C^T = S A^T)
Step 1: Fit a linear model on A^T
• relation between the weight vector and the covariates in the design variables
• different models.
Step 2: Biomarker identification
• apply statistical tests on the parameters of the models
• selection of the sources with significant effects.
Step 3: Comparison of the intensities in the biomarkers between spectra from different conditions
• prediction of the mixing weights by the model
• reconstruction with the biomarker sources
• comparison between factor levels.

Step 1: Fit a model on A^T
• The design matrix Y is rewritten as 2 separate matrices:
  - Z_1, the (n x p_1) incidence matrix for the p_1 covariates with fixed effects
  - Z_2, the (n x p_2) incidence matrix for the p_2 covariates with random effects
• For each of the q recovered s_j, we assume a linear relation between its vector of weights and the design variables: a_j = Z_1 β_j + Z_2 γ_j + ε_j
• Models with only fixed-effects covariates: a_j = Z_1 β_j + ε_j
  - Case 1: categorical covariates: ANOVA
    → ex1: biomarker to discriminate 2 groups of subjects: diseased & healthy
    → ex2: biomarker to discriminate 3 groups of subjects: disease 1, disease 2 & healthy
  - Case 2: quantitative covariates: linear regression
    → ex: biomarker to explore the severity of an illness, the concentration of a drug
• Models with only random-effects covariates: a_j = Z_2 γ_j + ε_j
    → ex: biomarker to explore variance components (machines, subjects, laboratories)
• Models with fixed- and random-effects covariates: mixed model: a_j = Z_1 β_j + Z_2 γ_j + ε_j
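For the fixed-effects case with one categorical covariate, the incidence matrix Z_1 and the least-squares fit can be sketched as below. The group labels and weights are hypothetical, and the cell-means (one-hot) coding is an assumption, since the slides do not state which coding was used.

```python
import numpy as np

# Hypothetical group labels for n = 6 subjects and weights a_j of one source.
groups = np.array(["disease1", "disease1", "disease2", "disease2", "healthy", "healthy"])
a_j = np.array([2.1, 1.9, 3.0, 3.2, 1.0, 1.2])

# Incidence matrix Z_1 (n x p_1): one indicator column per group level.
levels, codes = np.unique(groups, return_inverse=True)
Z1 = np.eye(len(levels))[codes]

# Least-squares estimate of beta_j in a_j = Z_1 beta_j + eps_j.
beta_j, *_ = np.linalg.lstsq(Z1, a_j, rcond=None)
print(dict(zip(levels, beta_j)))   # with this coding, beta_j holds the group means
```

With this coding the ANOVA question "does the group change the weight?" becomes a comparison of the entries of β_j.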

Step 1: Fit a model: example
• For each of the q = 6 recovered s_j, we construct a multiple linear regression model with 2 quantitative covariates (p = 3) and no interaction: a_j = β_j0 + β_j1 y_1 + β_j2 y_2 + ε_j, with
  - β_j0 = the intercept
  - y_1 = the citrate concentration in mg (quantitative)
  - y_2 = the hippurate concentration in mg (quantitative)
  - β_j1 = the effect of citrate on the mean of a_j for a fixed value of hippurate
  - β_j2 = the effect of hippurate on the mean of a_j for a fixed value of citrate
  - ε_j = the vector of independent random errors ~ N(0, σ^2)
• For each of the q recovered s_j, the model fitted by least squares is: â_j = b_j0 + b_j1 y_1 + b_j2 y_2
• In this example, we want to identify biomarkers for the concentration of citrate: the covariate of interest is y_k = y_1.
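The q regressions share the same covariates, so they can be fitted in one least-squares call. The sketch below uses simulated concentrations and weights (hypothetical values, not the data of the slides) with n = 28 and q = 6 as in the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 28, 6
y1 = rng.uniform(0, 10, n)                 # hypothetical citrate concentrations
y2 = rng.uniform(0, 10, n)                 # hypothetical hippurate concentrations
Z = np.column_stack([np.ones(n), y1, y2])  # columns: intercept, y_1, y_2 (p = 3)

# Simulated mixing weights: column j plays the role of a_j.
B_true = rng.normal(size=(3, q))
A = Z @ B_true + 0.01 * rng.normal(size=(n, q))

# One call fits all q models; column j of B_hat is (b_j0, b_j1, b_j2).
B_hat, *_ = np.linalg.lstsq(Z, A, rcond=None)
A_fit = Z @ B_hat                          # fitted weights â_j
```

Row 1 of `B_hat` then holds the q coefficients b_j1 of the covariate of interest y_1.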

Step 1: Fit a model: example
[Figure: fitted weights a_2 (s_2: citrate) and a_3 (s_3: hippurate) plotted against citrate (y_1) and hippurate (y_2).]

Step 2: Biomarker identification
• Goal:
  - Among the q sources, we want to select the ones presenting a significant effect of the chosen covariate y_k on their weights.
  - These “discriminant sources” represent the spectrum of an independent metabolite with a concentration depending on the chosen covariate = biomarkers.
• For each source s_j, test the significance of the parameter β_jk of the covariate of interest y_k (ex: searching for biomarkers for the dose of citrate y_1, we test each of the 6 β_j1):
  - H0: β_jk = 0 vs H1: β_jk ≠ 0
  - compute the statistic t_j = b_jk / s(b_jk), which follows t(n-p) under H0
  - take the corresponding p-value: p_j = P( |t(n-p)| > |t_j| )
• We are in a multiple-testing situation: the selection of a significant set of r coefficients β_jk is based on the q values p_j obtained from q individual tests.
  → Bonferroni correction: select, in an (m x r) matrix S*, the r sources with p_j < 0.05/q
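The test and the Bonferroni selection can be sketched as follows. The data are simulated (only two of the six hypothetical sources depend on y_1), the helper name `coef_t_test` is illustrative, and SciPy is assumed to be available for the Student t distribution.

```python
import numpy as np
from scipy import stats

def coef_t_test(Z, a, k):
    """Two-sided t-test of H0: beta_k = 0 in the least-squares fit of a on Z."""
    n, p = Z.shape
    b, *_ = np.linalg.lstsq(Z, a, rcond=None)
    resid = a - Z @ b
    s2 = resid @ resid / (n - p)                       # residual variance
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[k, k])    # s(b_k)
    t = b[k] / se
    return t, 2.0 * stats.t.sf(abs(t), df=n - p)       # P(|t(n-p)| > |t_j|)

rng = np.random.default_rng(5)
n, q = 28, 6
y1 = rng.uniform(0, 10, n)
y2 = rng.uniform(0, 10, n)
Z = np.column_stack([np.ones(n), y1, y2])

# Hypothetical weights: only sources 0 and 1 depend on y1.
A = 0.1 * rng.standard_normal((n, q))
A[:, 0] += 1.0 * y1
A[:, 1] += -0.5 * y1 + 0.3 * y2

# One test per source on the coefficient of y1 (column index 1 of Z).
p_values = np.array([coef_t_test(Z, A[:, j], k=1)[1] for j in range(q)])
selected = np.flatnonzero(p_values < 0.05 / q)         # Bonferroni threshold
print(selected)
```

The columns of S whose indices appear in `selected` would form the matrix S* of discriminant sources.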

Step 2: Biomarker identification: example
• We search for biomarkers for the dose of citrate, y_k = y_1 → we test each of the 6 β_j1.
• Threshold: α = 0.05/6.
[Figure: p-values per source; the values printed are 9.18 x 10^-13, 1.84 x 10^-15 and 2.86 x 10^-31.]

Step 3: Comparison of the intensities in biomarkers
• Goal: comparison of the effects on the biomarker caused by different changes in y_k.
• Choose 3 or more values of y_k:
  - y_k1: a first, reference value of y_k
  - y_k2: a new value of interest of y_k
  - y_k3: a second new value of interest of y_k
• Compute, with β_k* the vector of the r estimated coefficients of y_k for the sources selected in S*:
  - the effect on the biomarker of the change of y_k from y_k1 to y_k2: C_1 = S* β_k* (y_k2 - y_k1)
  - the effect on the biomarker of the change of y_k from y_k1 to y_k3: C_2 = S* β_k* (y_k3 - y_k1)
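The two effects are plain matrix-vector products. The sketch below uses random placeholders for the selected sources and made-up coefficients and factor values; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
m, r = 600, 2
S_star = rng.random((m, r))        # hypothetical selected sources S* (m x r)
beta_k = np.array([0.5, -0.2])     # hypothetical coefficients of y_k, one per selected source

y_ref, y_2, y_3 = 0.0, 10.0, 20.0  # reference value and two values of interest

# Effect on the m spectral points of moving y_k away from the reference value.
C1 = S_star @ beta_k * (y_2 - y_ref)
C2 = S_star @ beta_k * (y_3 - y_ref)
```

Each effect is a vector of m values, so C_1 and C_2 can be overlaid as spectra to compare the two changes.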

Step 3: example
[Figure: effects on the spectrum for the values y_k1, y_k2, y_k3 and y_k4 of citrate (y_k); citrate and hippurate regions labelled.]

Conclusions:
• With the presented methodology combining ICA with statistical modeling:
  - we visualize the independent metabolites contained in the studied biofluid (through the sources) and their quantity (through the mixing weights)
  - we identify biomarkers, or spectral regions changing according to a chosen factor, by a selection of sources
  - we compare the effects on these spectral biomarkers caused by different changes of this factor.
• In comparison with PCA, ICA:
  - gives more biologically meaningful and natural representations of these data.

Thank you for your attention

Example 2: the data
• 18 spectra of 600 values: X (18 x 600)
• 1 characteristic in Y: Y (18 x 1), y_1 = disease group of the rat (qualitative)
  - Group 1 = disease 1
  - Group 2 = disease 2
  - Group 3 = no disease
• We want biomarkers for the disease group described in y_1 → a model with qualitative covariates
[Figure: spectrum with the hippurate and citrate regions labelled.]

Example 2: Part I. Dimension reduction by ICA
• X_C^T = S A^T, with S (600 x 5) and A^T (5 x 18)

Example 2: Part II: biomarker discovery through statistical modeling
Step 1: Fit a model on A^T
• Model with only one categorical covariate with fixed effects: ANOVA I, a_j = Z_1 β_j + ε_j
Step 2: Biomarker identification
• For each of the q recovered s_j, test the effect of y_1 → F_j statistic → p_j
• Bonferroni correction: select, in an (m x r) matrix S*, the r sources with p_j < 0.05/q
[Figure: p-values printed on the slide: 0.009797431, 0.0002412604, 0.005710213.]
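The Bonferroni rule can be checked on the p-values printed on this slide, taking q = 5 from the Example 2 decomposition. Which source each p-value belongs to is not recoverable from the text, so the list order below is arbitrary.

```python
# Bonferroni threshold for q = 5 tests at a global level of 0.05.
q = 5
alpha = 0.05
threshold = alpha / q

# The three p-values printed on the slide (source assignment unknown).
reported = [0.009797431, 0.0002412604, 0.005710213]

selected = [p for p in reported if p < threshold]
print(threshold, selected)
```

All three printed values fall below the threshold of 0.01, the largest one only narrowly.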

Step 3: Comparison of the intensities in biomarkers
• Goal: comparison of the effects on the biomarker caused by the changes of group.
[Figure: effects on the spectrum; hippurate and citrate regions labelled.]

Other slides

Example 1: Step 4: Comparison of the intensities in biomarkers
• For a first chosen value of interest of y_k (not necessarily observed): y_k0
• Choose a reference value for the other factors: y_0≠k → Z_01, the vector of values (1, y_k0, y_0≠k)
• For each of the r sources selected in S*, use the model to predict the weights for the biomarkers: â_j*(y_k0) = b_j* Z_01 → â*(y_k0), a vector of r new weights
• Reconstruction of the values to expect in the biomarkers: S* â*(y_k0)
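Prediction and reconstruction amount to two products. In the sketch below the selected sources, the fitted coefficients and the chosen factor values are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
m, r = 600, 2
S_star = rng.random((m, r))           # hypothetical selected sources S* (m x r)

# Hypothetical fitted coefficients, one column per selected source.
B_star = np.array([[0.1, 0.3],        # intercepts b_j0
                   [0.5, -0.2],       # coefficients of y_k (citrate)
                   [0.0, 0.4]])       # coefficients of the other factor (hippurate)

# Z_01 = (1, y_k0, y_0≠k): chosen value of y_k and reference value of the other factor.
z01 = np.array([1.0, 5.0, 2.0])

a_hat = z01 @ B_star                  # r predicted weights â*(y_k0)
x_hat = S_star @ a_hat                # expected values in the biomarkers, S* â*(y_k0)
print(a_hat)
```

Repeating the last two lines for several values of y_k0 gives reconstructed spectra that can be compared between conditions.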

Example 2: the reconstructed spectra
