Bioinformatics Laboratory

Objectives • Understand the approaches by which microarray technology can be used to identify: • Prognostic/diagnostic markers of disease

Example • We will analyse how molecular markers of disease are identified by dissecting the approach presented in: PNAS 2005, 102:11023-28; PNAS 2007, 104:14424-29

The biological question • Huntington’s disease (HD) is an autosomal dominant disorder caused by an expansion of glutamine repeats in the ubiquitously distributed huntingtin protein. • Mutant huntingtin interferes with the function of widely expressed transcription factors, suggesting that gene expression may be altered in a variety of tissues in HD, including peripheral blood. • Highly quantitative biomarkers of neurodegenerative disease remain an important need in the urgent quest for disease-modifying therapies. • For Huntington’s disease (HD), a genetic test is available (trait marker), but the necessary state markers are still in development. • Tested hypothesis: • Two studies exist: • Borovecki et al.: detecting biomarkers by profiling whole blood from HD patients (hd), pre-HD patients (pre) and normal donors (n). • Runne et al.: detecting biomarkers by profiling lymphocytes from HD patients (hd) and normal donors (n). • Is it possible to identify disease biomarkers using these data sets?

Experimental groups • Borovecki: • HD group: • 12 HD-affected (stage I-II) subjects • 5 early presymptomatic carriers of the gene mutation, as determined by genetic testing • Normal group: • 14 healthy control subjects • Affymetrix hgu133a • Runne: • HD group: • 12 moderate-stage HD subjects • Normal group: • 10 healthy control subjects • Affymetrix hgu133plus2

Experimental design

Recognition and statement of the problem • The problem should be specified in enough detail, and the conditions under which the experiment will be performed should be understood well enough, that the appropriate design for the experiment can be selected.

Example • We are investigating the effect of a drug, by BrdU incorporation, considering three concentrations (10 nM, 100 nM, 1 mM), over 3 different tumor cell lines (CL). • In this example the factors are two: • CL, qualitative factor with 3 levels • Drug concentration, quantitative factor with 3 levels
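The two factors in this example can be enumerated with a short Python sketch. It assumes a full factorial layout (every cell line tested at every concentration), which the slide does not state explicitly, and the cell line names CL1 to CL3 are placeholders.

```python
from itertools import product

# Factor levels from the slide; the cell line names are hypothetical placeholders.
cell_lines = ["CL1", "CL2", "CL3"]            # qualitative factor, 3 levels
concentrations = ["10 nM", "100 nM", "1 mM"]  # quantitative factor, 3 levels

# Assumption: full factorial design, i.e. every combination of levels is tested.
conditions = list(product(cell_lines, concentrations))

for cell_line, concentration in conditions:
    print(cell_line, concentration)

print(len(conditions), "experimental conditions")
```

With two factors of three levels each, the number of conditions is the product of the level counts.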

Identify the factors involved in the Borovecki study • The study comprises: • HD patients, pre-HD patients and donors • How many factors are involved? • 1 • Which? • patients • Are the factors quantitative or qualitative? • Qualitative • How many levels are there? • 3 (HD, preHD, N) • Factor: patients; levels: HD, Pre HD, N

How can I obtain the experimental data? • International journals now require, as a condition for accepting a paper, that the data be deposited in public databases: • Europe: ArrayExpress • USA: GEO

The data can be downloaded: • as an Excel-like (tab-delimited) table containing all the information on the experiment • as array images (in this case the Affymetrix .CEL files)

Header Matrix series file

Affymetrix GeneChips

Probe set (Affymetrix) • Figure labels: probe set, probe pair, probe cell, perfect match (PM), mismatch (MM), gene sequence • PM: ACCAGATCTGTAGTCCATGCGATGC • MM: ACCAGATCTGTAATCCATGCGATGC
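The relationship between the two probes of a pair can be checked directly on the sequences shown in the slide. The helper name `mismatch_positions` is introduced here for illustration only.

```python
PM = "ACCAGATCTGTAGTCCATGCGATGC"  # perfect match probe, copied from the slide
MM = "ACCAGATCTGTAATCCATGCGATGC"  # mismatch probe, copied from the slide

def mismatch_positions(pm, mm):
    """Return the 1-based positions at which the two probes differ."""
    if len(pm) != len(mm):
        raise ValueError("probes must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(pm, mm)) if a != b]

positions = mismatch_positions(PM, MM)
print("probe length:", len(PM))
print("mismatch positions:", positions)
```

For these two sequences the probes are 25 bases long and differ only at the central base.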

Analysing microarray data requires dedicated software • Microarray data cannot be analysed with a simple Excel spreadsheet; they require rather sophisticated statistical tools. • Both commercial and open-source software exist. • In this course the practical sessions will use open-source software: • Bioconductor

Analysis pipeline • Platform-specific devices: sample preparation, array fabrication, hybridization, scanning + image analysis • Bioconductor: quality control, normalization, filtering, statistical analysis, annotation, biological knowledge extraction

How does the data analysis start? • If the .CEL files are available, a thorough quality control is performed. • Without the .CEL files, when only the matrix series file is available, a more limited set of quality controls can be performed.

Analysis pipeline • Quality control • Normalization • Filtering • Statistical analysis • Annotation • Biological knowledge extraction

Why perform quality controls (QC)? • QC is a very important step in a microarray data analysis. • The number of available experiments is usually limited, and one or more arrays with a high number of experimental artefacts could compromise the analysis. • QC identifies outlier arrays and lets the researcher judge whether they need to be removed.

Quality control to identify outlier arrays • When only the matrix series file (MSF) is available, the presence of outlier arrays is assessed by inspecting: • Box plots of the intensity distributions of the individual arrays.
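The numbers behind a box plot can be computed per array to spot an unusually narrow or shifted intensity distribution. This is a minimal sketch: the log2 intensities below are invented for illustration, and the array names are placeholders, not arrays from the data sets discussed here.

```python
from statistics import quantiles

# Hypothetical log2 intensities for three arrays (invented for illustration).
arrays = {
    "array1": [6.1, 7.0, 7.9, 8.8, 9.7, 10.6, 11.5],
    "array2": [6.0, 7.1, 8.0, 8.9, 9.6, 10.5, 11.6],
    "array3": [8.0, 8.2, 8.4, 8.6, 8.8, 9.0, 9.2],  # deliberately narrow
}

def box_stats(values):
    """Return the summary values that a box plot displays."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }

stats = {name: box_stats(values) for name, values in arrays.items()}
for name, summary in stats.items():
    print(name, summary)

# The array with the smallest interquartile range has the narrowest box.
narrowest = min(stats, key=lambda name: stats[name]["iqr"])
print("narrowest distribution:", narrowest)
```

Whether a narrow or shifted array should actually be removed remains a judgment for the researcher, as noted in the QC slide.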

Quality control to assess the homogeneity of the experimental groups • Principal component analysis • Hierarchical clustering

Principal component analysis • Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. • The first principal component accounts for as much of the variability in the data as possible. • Each succeeding component accounts for as much of the remaining variability as possible. • The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

PCA [figure: samples plotted on the PCA1 and PCA2 axes]

PCA • The 2nd PC is orthogonal to the 1st. • In many data sets the first three components account for most of the variability, so the PCA can reasonably be represented in a 3D space.
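The properties listed for PCA (decreasing variability per component, orthogonal components) can be verified on a toy matrix. The sketch below uses power iteration with deflation on the covariance matrix, chosen only because it needs no external library; it is not the method used by any particular Bioconductor tool. The data (4 samples, 3 genes) are invented.

```python
import math

def center(rows):
    """Subtract the column (gene) mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of the columns of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def leading_component(cov, iterations=200):
    """Power iteration: returns (variance, unit vector) of the top component."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    variance = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return variance, v

def principal_components(rows, k=2):
    """Return the total variance and the first k (variance, direction) pairs."""
    cov = covariance(center(rows))
    total = sum(cov[i][i] for i in range(len(cov)))
    components = []
    for _ in range(k):
        variance, v = leading_component(cov)
        components.append((variance, v))
        # Deflation: remove the variability already explained by this component.
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return total, components

# Hypothetical data: rows are samples, columns are genes.
samples = [
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 1.0],
    [6.0, 0.0, 2.0],
    [8.0, 1.0, 2.0],
]
total, components = principal_components(samples)
(var1, pc1), (var2, pc2) = components
dot = sum(a * b for a, b in zip(pc1, pc2))
print("fraction explained by PC1:", var1 / total)
print("fraction explained by PC2:", var2 / total)
print("PC1 . PC2:", dot)
```

In this toy case the first component carries most of the variance and the dot product of the two directions is numerically zero, which is the orthogonality stated above.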

Hierarchical Clustering (HCL) • HCL is an agglomerative/divisive clustering method. • The iterative process continues until all groups are connected in a hierarchical tree.

Hierarchical Clustering (agglomerative) [figure: samples s1 to s8 being grouped step by step] • s1 is most like s8 • s4 is most like {s1, s8} • Modified from a TMEV presentation (www.tigr.org)

Hierarchical Clustering [figure: samples s1 to s8 being grouped step by step] • s5 is most like s7 • {s5, s7} is most like {s1, s4, s8} • Modified from a TMEV presentation (www.tigr.org)

Hierarchical Tree [figure: final tree, leaf order s1, s8, s4, s5, s7, s2, s3, s6] • Modified from a TMEV presentation (www.tigr.org)

Hierarchical Clustering • During construction of the hierarchy, decisions must be made to determine which clusters should be joined. • The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

Agglomerative Linkage Methods • Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked. • Three linkage methods that are commonly used are: • Single Linkage • Average Linkage • Complete Linkage • Modified from a TMEV presentation (www.tigr.org)
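The three linkage rules and the agglomerative loop can be sketched in a few lines. The sample names are borrowed from the figures, but the one-dimensional values are invented so that the merge order resembles the one illustrated; Euclidean distance is an assumption, since the slides do not name a distance measure.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def linkage_distance(cluster_a, cluster_b, points, method):
    """Distance between two clusters under the chosen linkage rule."""
    distances = [euclidean(points[i], points[j])
                 for i in cluster_a for j in cluster_b]
    if method == "single":      # closest pair of members
        return min(distances)
    if method == "complete":    # farthest pair of members
        return max(distances)
    if method == "average":     # mean over all pairs of members
        return sum(distances) / len(distances)
    raise ValueError("unknown linkage method: " + method)

def agglomerate(points, method="average"):
    """Merge the two closest clusters repeatedly until one cluster remains."""
    clusters = [(name,) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], points, method),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges

# Hypothetical one-dimensional expression values (invented for illustration).
points = {
    "s1": (1.0,),
    "s8": (1.2,),
    "s4": (1.5,),
    "s5": (6.0,),
    "s7": (6.6,),
}

merges = agglomerate(points, "average")
for a, b in merges:
    print(a, "+", b)
```

With these values s1 joins s8 first, s4 then joins {s1, s8}, s5 joins s7, and the two groups are linked last. On less clear-cut data the three linkage rules can produce different trees.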

t4 is clearly an outlier!

Exercise • Use the target file target.GSE8762.classif.txt and the file esperimental.design.names.gse8762.txt to assess with PCA how the factors disease status and gender behave in the data set under examination.

Exercise • Open R • Load oneChannelGUI • Start a new project: • Change the working directory to dataset.huntington • Load the target file • Set the project name to: ronne

Exercise • Starting from the data set you have loaded: • Check the data box plots • Answer the following questions: • Is there any array characterized by a very narrow probe intensity distribution? • YES (which? …………………………….) NO • Is there any array that is markedly different from the others? • YES (which? …………………………….) NO

Exercise • Check whether the experimental groups of our ronne data set (HD, N) are relatively homogeneous using PCA and hierarchical clustering. • Is it easy to discriminate on the basis of disease status? • Yes • No

Summarizing the individual probe data into a single value per probe set • Analysis steps: • Calculating probe set summaries: • RMA • GCRMA • Normalization: • Quantile method • Fluorescence intensity is expressed as log2(intensity)

Brief summary of probe set intensity calculation • The RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take non-specific probe hybridization into account in the probe set background calculation. • GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004).

Why Normalization? • To remove systematic biases, which include: • Sample preparation • Variability in hybridization • Spatial effects • Scanner settings • Experimenter bias • Extracted from D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray
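The quantile method named among the analysis steps forces all arrays to share the same intensity distribution. The sketch below is a simplified version of the idea applied to log2 intensities: ties are broken by position rather than averaged, the raw values are invented, and it is not the implementation used inside RMA.

```python
import math

def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    arrays: dict mapping array name -> list of intensities (equal lengths).
    Simplification: tied values are ranked by position, not averaged.
    """
    names = list(arrays)
    n = len(arrays[names[0]])
    sorted_arrays = [sorted(arrays[name]) for name in names]
    # Mean intensity at each rank across all arrays.
    rank_means = [sum(col[r] for col in sorted_arrays) / len(names)
                  for r in range(n)]
    normalized = {}
    for name in names:
        values = arrays[name]
        order = sorted(range(n), key=lambda i: values[i])
        result = [0.0] * n
        for rank, index in enumerate(order):
            result[index] = rank_means[rank]
        normalized[name] = result
    return normalized

# Hypothetical raw intensities for three probe sets on two arrays.
raw = {
    "array_a": [16.0, 2.0, 8.0],
    "array_b": [64.0, 4.0, 256.0],
}

# Intensities are expressed as log2(intensity) before normalization.
log2_data = {name: [math.log2(v) for v in values] for name, values in raw.items()}
normalized = quantile_normalize(log2_data)

for name, values in normalized.items():
    print(name, values)
```

After normalization each array contains the same set of values, while the ranking of probe sets within each array is preserved.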

Multiple testing errors • When performing multiple statistical tests, two types of errors can occur: • Type I error (false positive) • Type II error (false negative) • Reducing type I errors increases the number of type II errors. • It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).
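The scale of the problem can be shown with simple arithmetic. In the sketch below the number of probe sets is a round, hypothetical figure, and the Bonferroni threshold is included only as a familiar example of a strict correction; the slides do not name a specific correction method.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of type I errors if no gene is truly changed."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_probe_sets = 20000  # hypothetical number of probe sets tested
alpha = 0.05          # conventional per-test significance level

false_positives = expected_false_positives(n_probe_sets, alpha)
threshold = bonferroni_threshold(n_probe_sets, alpha)

print("expected false positives without correction:", false_positives)
print("Bonferroni per-test threshold:", threshold)
```

A threshold this strict removes most false positives but also discards many true differences, which is the type I versus type II trade-off described above. Reducing the number of tests by filtering is one way to soften it.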

Filtering can be performed at various levels: • Annotation features: • Specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.) • Signal features: • Percentage of intensities greater than a user-defined value • Interquartile range (IQR) greater than a defined value
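The two signal-based filters can be sketched as follows. The probe set names, log2 intensities and thresholds are all invented for illustration, and combining both criteria in one function is a choice made here for brevity; each filter can equally be applied on its own.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range of one probe set across arrays."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def fraction_above(values, threshold):
    """Fraction of arrays in which the intensity exceeds the threshold."""
    return sum(v > threshold for v in values) / len(values)

def filter_probe_sets(expression, min_iqr, min_intensity, min_fraction):
    """Keep probe sets that vary enough and are expressed in enough arrays."""
    kept = []
    for probe_set, values in expression.items():
        varies = iqr(values) > min_iqr
        expressed = fraction_above(values, min_intensity) >= min_fraction
        if varies and expressed:
            kept.append(probe_set)
    return kept

# Hypothetical log2 intensities across five arrays.
expression = {
    "ps_flat": [7.0, 7.1, 7.0, 7.2, 7.1],       # expressed but nearly constant
    "ps_low": [3.0, 3.5, 4.5, 3.2, 5.0],        # variable but at background level
    "ps_variable": [6.5, 9.0, 7.0, 10.5, 8.0],  # expressed and variable
}

kept = filter_probe_sets(expression, min_iqr=0.5, min_intensity=6.0, min_fraction=0.5)
print(kept)
```

Only the probe set that is both expressed and variable survives, which reduces the number of statistical tests performed afterwards.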

Intensity distributions: RMA vs GCRMA [figure: background-level probe sets marked on the intensity distributions]

How to define the efficacy of a filtering procedure? • This enrichment measure is very similar to the one used to evaluate the fold purification of a protein after a chromatographic step.
