Bioinformatics. Master's Degree Course in Computer Science (Corso di Laurea Specialistica in Informatica). Microarrays and Biomarkers. 06/05/2011
Classification of microarray samples
• We are given a set (called the learning set) of microarray expression data coming from several classes of samples (patients).
• To simplify the problem we consider only two classes, case and control, so the learning set consists of case/control pairs.
• Examples: cancer/normal, metastatic/non-metastatic, etc.
• Goal: build a classifier able to decide to which class a new, unclassified sample belongs.
Expression profiling data analysis
• A supervised approach to classification:
  • Identify genes (or microRNAs) that are differentially expressed in the two classes of samples.
  • Discretize the expression values of the discriminant genes.
  • Use these genes to build a classifier able to classify new (unknown) samples.
Two classes/1
• Rank Product
• Rank Product is a non-parametric statistical method based on the ranks of fold changes. Given n genes and k replicates, let e_{g,i} be the fold change (case/control ratio) of gene g in the i-th replicate, and r_{g,i} the rank of e_{g,i} among the fold changes of all n genes in that replicate.
• The rank product is the geometric mean of the ranks:
$$RP(g) = \Big( \prod_{i=1}^{k} r_{g,i} \Big)^{1/k}$$
• A simple permutation-based estimate is used to determine how likely it is that a given RP value was obtained by chance.
R. Breitling, P. Armengaud, A. Amtmann, P. Herzyk. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters 573:1 (2004) 83-92.
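The rank product and its permutation-based significance can be sketched in a few lines. This is a minimal illustration, not the Bioconductor implementation: the function names, the convention that rank 1 is the largest fold change, and the way the null distribution is pooled over genes are assumptions made for the example.

```python
import math
import random

def rank_products(fold_changes):
    """fold_changes[g][i] is the fold change of gene g in replicate i.
    Assumption: rank 1 = largest fold change within each replicate."""
    n, k = len(fold_changes), len(fold_changes[0])
    ranks = [[0] * k for _ in range(n)]
    for i in range(k):
        order = sorted(range(n), key=lambda g: fold_changes[g][i], reverse=True)
        for position, g in enumerate(order, start=1):
            ranks[g][i] = position
    # Geometric mean of the k ranks, computed through logarithms.
    return [math.exp(sum(math.log(r) for r in ranks[g]) / k) for g in range(n)]

def permutation_p_values(fold_changes, n_perm=1000, seed=0):
    """Estimate how often a rank product at least as small as the observed
    one arises when fold changes are shuffled independently per replicate."""
    rng = random.Random(seed)
    n, k = len(fold_changes), len(fold_changes[0])
    observed = rank_products(fold_changes)
    counts = [0] * n
    for _ in range(n_perm):
        columns = []
        for i in range(k):
            column = [fold_changes[g][i] for g in range(n)]
            rng.shuffle(column)
            columns.append(column)
        permuted = [[columns[i][g] for i in range(k)] for g in range(n)]
        null_rp = rank_products(permuted)
        for g in range(n):
            counts[g] += sum(1 for value in null_rp if value <= observed[g])
    return [count / (n_perm * n) for count in counts]

# Toy data: 4 genes, 3 replicates (values are made up).
toy = [[4.0, 5.0, 6.0], [1.0, 1.2, 0.9], [2.0, 0.5, 3.0], [0.4, 0.8, 0.5]]
rp = rank_products(toy)
p = permutation_p_values(toy, n_perm=200)
```

Running the same procedure with the ranking reversed (rank 1 = smallest fold change) would target down-regulated genes instead.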
Two classes/2
• Identification of differentially expressed genes between two classes. The identification consists of two parts: finding the genes that are up-regulated in class a compared to class b, and finding those that are down-regulated.
• These results have been obtained using the Rank Product package (v. 2.16.0) of the Bioconductor library under the R system.
More than two classes
• Many statistical tests are available:
  • Kruskal-Wallis
  • ANOVA (for Gaussian data only)
  • SAM (?)
  • Linear models (R limma package)
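As an illustration of the first test in the list, the Kruskal-Wallis H statistic for one gene across several classes can be computed directly from ranks. This is a sketch under simplifying assumptions: no tie correction is applied, the p-value (usually obtained from a chi-square distribution with one degree of freedom fewer than the number of classes) is not computed, and the function names are invented for the example.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """groups: one list of expression values per class (for a single gene).
    Returns H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1)."""
    pooled = [value for group in groups for value in group]
    total = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (total * (total + 1)) * weighted - 3 * (total + 1)

# Made-up expression values for one gene in three classes.
h_separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
h_mixed = kruskal_wallis_h([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
```

A large H (well-separated classes) suggests the gene is differentially expressed; a small H suggests it is not.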
Discretization
• Discretization algorithms play an important role in data mining and knowledge discovery.
• They not only produce a concise summary of continuous attributes, which helps experts understand the data more easily, but also make learning more accurate and faster.
• Discretization algorithms can be classified along five different axes:
  • supervised versus unsupervised;
  • static versus dynamic;
  • global versus local;
  • top-down (splitting) versus bottom-up (merging);
  • direct versus incremental.
Class-Attribute Contingency Coefficient
• Given the quanta matrix, the contingency coefficient is usually used to measure the strength of dependence between the variables.
• Notation:
  • q_{ir} (i = 1,2,...,S; r = 1,2,...,n) denotes the total number of examples belonging to the i-th class that are within the interval (d_{r-1}, d_r];
  • M_{i+} is the total number of examples belonging to the i-th class;
  • M_{+r} is the total number of examples that are within the interval (d_{r-1}, d_r];
  • n is the number of intervals.
C.J. Tsai, C.-I. Lee, W.-P. Yang. A discretization algorithm based on Class-Attribute Contingency Coefficient. Information Sciences 178:3 (2008) 714-731.
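The quantities above are enough to compute a contingency coefficient from a quanta matrix. The sketch below uses the standard Pearson form, C = sqrt(χ² / (χ² + M)) with χ² = M · (Σ_i Σ_r q_ir² / (M_i+ · M_+r) − 1), where M is the total number of examples. The `cacc` variant, which divides χ² by log(n) before applying the same square root, is how the criterion of the cited paper is recalled here and should be treated as an assumption to check against the paper; the function names are illustrative.

```python
import math

def contingency_coefficient(quanta):
    """quanta[i][r] = q_ir, examples of class i falling in interval r.
    Returns (C, chi2) using the Pearson chi-square statistic."""
    class_totals = [sum(row) for row in quanta]            # M_i+
    interval_totals = [sum(col) for col in zip(*quanta)]   # M_+r
    total = sum(class_totals)                              # M
    acc = 0.0
    for i, row in enumerate(quanta):
        for r, q in enumerate(row):
            if q:
                acc += q * q / (class_totals[i] * interval_totals[r])
    chi2 = max(0.0, total * (acc - 1.0))  # guard against rounding below zero
    return math.sqrt(chi2 / (chi2 + total)), chi2

def cacc(quanta):
    """Assumed CACC form: same as C, but with chi2 scaled by 1 / log(n).
    Requires at least two intervals (n > 1)."""
    n = len(quanta[0])
    total = sum(sum(row) for row in quanta)
    _, chi2 = contingency_coefficient(quanta)
    y = chi2 / math.log(n)
    return math.sqrt(y / (y + total))

# Two classes, two intervals: perfectly separated versus independent.
c_separated, _ = contingency_coefficient([[5, 0], [0, 5]])
c_independent, _ = contingency_coefficient([[5, 5], [5, 5]])
cacc_separated = cacc([[5, 0], [0, 5]])
```

A discretization scheme that separates the classes well yields a coefficient far from zero, while intervals that carry no class information yield zero.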
Associative classification
• Associative classification mining is a successful approach that uses association rule discovery techniques to build classification systems.
Maximal Frequent Itemset (i.e. MAFIA algorithm)
• We are given the set of discretized discriminant genes. All the pairs [gene, interval] are the items of our data mining analysis. For each class k we compute a set of maximal frequent itemsets (MFI). A frequent itemset for a class k is a set of items that appear together in a fraction of the elements of the class greater than a given percentage threshold t. It is maximal if no proper superset of it is frequent.
• For each class k = 0,…,K−1, the set of all MFI, MFI(k) = {mfi_1(k), ..., mfi_{h_k}(k)}, is computed. Class k is then assigned the set of rules R_k:
  • ∧ mfi_1(k) → class k
  • ...
  • ∧ mfi_{h_k}(k) → class k
  where the antecedent of each rule is the conjunction of the items in the itemset.
Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T: MAFIA: A Maximal Frequent Itemset Algorithm. IEEE Transactions on Knowledge and Data Engineering 2005, 17:1490–1504.
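To make the definition concrete, the following sketch enumerates maximal frequent itemsets for one class by brute force. It is not the MAFIA algorithm, which prunes the search space far more efficiently; it only illustrates the definitions of "frequent" and "maximal" on small inputs. The function names and the toy samples are invented for the example.

```python
from itertools import combinations

def frequent_itemsets(samples, threshold):
    """samples: list of sets of (gene, interval) items for one class.
    An itemset is frequent if it occurs in more than threshold * len(samples)
    samples (strict inequality, following the definition above)."""
    min_count = threshold * len(samples)
    items = sorted(set().union(*samples))
    frequent = []
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for sample in samples if itemset <= sample)
            if support > min_count:
                frequent.append(itemset)
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none of a larger size
    return frequent

def maximal_frequent_itemsets(samples, threshold):
    """Keep only the frequent itemsets with no frequent proper superset."""
    frequent = frequent_itemsets(samples, threshold)
    return [f for f in frequent if not any(f < other for other in frequent)]

# Toy class with four discretized samples (made-up genes and intervals).
case_samples = [
    {("g1", "high"), ("g2", "low"), ("g3", "high")},
    {("g1", "high"), ("g2", "low")},
    {("g1", "high"), ("g2", "low"), ("g3", "low")},
    {("g1", "low"), ("g3", "high")},
]
mfi_case = maximal_frequent_itemsets(case_samples, threshold=0.5)
```

Each itemset in `mfi_case` would become the antecedent of one rule pointing to the class.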
Evaluation
• Unknown phenotypes are discretized and then assigned to a class k with a score, using the association rules. The assignment that yields the highest score determines the class.
• Let x = {I_1, ..., I_m} be an unknown discretized phenotype. We evaluate how many rules are satisfied, even partially, in each R_k. The sample is assigned to the class with the largest number of satisfied rules.
• For a fixed class k, we evaluate x under a generic rule r_v^k = {I_i, ..., I_j} and assign it a score [score formula not recoverable from the transcript].
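The rule-based assignment can be sketched as follows. The exact score used on the slide is not available, so the scoring rule here is purely hypothetical: a rule contributes the fraction of its items found in the sample (so partially satisfied rules still count), and a class score is the sum over its rules. All names are invented for the example.

```python
def rule_score(sample_items, rule_items):
    """Hypothetical score: fraction of the rule's items present in the sample."""
    return len(sample_items & rule_items) / len(rule_items)

def classify(sample_items, rules_by_class):
    """Return the class with the highest total score, plus all class scores."""
    scores = {
        label: sum(rule_score(sample_items, rule) for rule in rules)
        for label, rules in rules_by_class.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Made-up rules: one maximal frequent itemset per class.
rules = {
    "case": [frozenset({("g1", "high"), ("g2", "low")})],
    "control": [frozenset({("g1", "low"), ("g2", "high")})],
}
unknown = {("g1", "high"), ("g2", "low"), ("g3", "low")}
label, class_scores = classify(unknown, rules)
```

Summing favours classes with many rules; averaging over the rules of a class would be one alternative if the rule sets differ a lot in size.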
General schema
[Workflow diagram; boxes listed in transcript order]
• Profiling data
• Discretization
• Filtering (i.e. discriminant genes)
• Binary strategy
• Model validation (k-fold cross-validation, KFCV)
• Filtering based on permutation test
• Gene patterns (data mining: maximal frequent itemsets)
• Bayesian network construction (reverse engineering)
• Pathway perturbation
• microRNA analysis
• Superset of robust biomarkers
Bayesian networks
• Two components:
  • G: a directed acyclic graph whose nodes are the random variables X1,…,Xn.
  • For each variable, a conditional probability distribution given its parents (precursors) in G.
• Together, these two components represent a unique distribution over X1,…,Xn.
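Markov assumption
• The joint distribution satisfies the assumption that each variable Xi is directly influenced only by the variables that precede it in G (its parents), so it factorizes as:
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))$$
where parents(Xi) = the set of precursors of Xi in G.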
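Under the Markov assumption the joint probability of a full assignment is the product of one conditional probability per node, P(X1,…,Xn) = ∏ P(Xi | parents(Xi)). The sketch below evaluates that product on a tiny three-node chain with binary variables. The structure and the probability values are made up for illustration and are not learned from data.

```python
from itertools import product

# Hypothetical network: Raf -> Mek -> Erk, each variable is 0 or 1.
parents = {"Raf": [], "Mek": ["Raf"], "Erk": ["Mek"]}

# cpt[node][parent values] = P(node = 1 | parents); numbers are illustrative.
cpt = {
    "Raf": {(): 0.3},
    "Mek": {(0,): 0.1, (1,): 0.8},
    "Erk": {(0,): 0.2, (1,): 0.9},
}

def joint_probability(assignment):
    """Multiply P(node | parents(node)) over all nodes of the graph."""
    prob = 1.0
    for node, node_parents in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in node_parents)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

all_on = joint_probability({"Raf": 1, "Mek": 1, "Erk": 1})
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
```

Summing the product over every assignment gives 1, which is a quick check that the graph and the conditional tables together define a proper distribution.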
Tools for Bayesian networks construction
• Banjo
• Biolearn
  • Dana Pe'er Lab
  • http://www.c2b2.columbia.edu/danapeerlab/html/biolearn.html
Hartemink, A., Gifford, D., Jaakkola, T., & Young, R. (2001) "Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks." In Pacific Symposium on Biocomputing 2001 (PSB01), Altman, R., Dunker, A.K., Hunter, L., Lauderdale, K., & Klein, T., eds. World Scientific: New Jersey. pp. 422–433.
Build a Bayesian network
[Figure: a Bayesian network built from the MFI(k) set, with nodes PKC, PKA, Raf, Jnk, P38, Mek, Erk, Akt]
Pathway Perturbation
• Our goal is to apply an analysis model that uses both:
  • a statistically significant number of differentially expressed genes (or miRNAs);
  • biologically meaningful changes on a given pathway.
• The analysis takes as input a set of pathways describing sub‐systems of the given organism that involve the given variables (genes).
S. Draghici, P. Khatri, A.L. Tarca, K. Amin, A. Done, C. Voichita, C. Georgescu, and R. Romero. A systems biology approach for pathway level analysis. Genome Research, 17:1537-1545, 2007.
Output
• Rank the sub‐systems in decreasing order of the amount of disruption suffered.
• If possible, identify those sub‐systems for which the disruption is "significant".
Gene perturbation factor
• PF(g) = perturbation factor of gene g, defined in terms of:
  • α = a priori type of impact expected from that gene
  • ΔE(g) = change in expression level of g (fold change)
  • US_g = set of genes directly upstream of g in the pathway
  • N_ds(u) = number of genes directly downstream of u in the pathway
  • β_ug = efficiency of the connection between u and g
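With these symbols, a gene's perturbation combines its own expression change with the perturbation inherited from its upstream genes. The sketch below assumes the form PF(g) = α(g) · ΔE(g) + Σ_{u ∈ US_g} β_ug · PF(u) / N_ds(u), which matches the listed quantities and the model of Draghici et al. as recalled here; the exact role of α on the original slide is an assumption. The code also assumes an acyclic pathway so that a simple recursion terminates, and all names and values are illustrative.

```python
def perturbation_factors(delta_e, upstream, beta, alpha=None):
    """delta_e[g]: expression change of gene g (0.0 if unchanged).
    upstream[g]: genes directly upstream of g (US_g).
    beta[(u, g)]: efficiency of the connection u -> g.
    alpha[g]: a priori impact weight, assumed to default to 1.0.
    Assumes the pathway graph has no cycles."""
    alpha = alpha or {}
    # N_ds(u): number of genes directly downstream of u.
    n_ds = {}
    for g, ups in upstream.items():
        for u in ups:
            n_ds[u] = n_ds.get(u, 0) + 1
    cache = {}

    def pf(g):
        if g not in cache:
            inherited = sum(
                beta[(u, g)] * pf(u) / n_ds[u] for u in upstream.get(g, [])
            )
            cache[g] = alpha.get(g, 1.0) * delta_e.get(g, 0.0) + inherited
        return cache[g]

    return {g: pf(g) for g in delta_e}

# Toy pathway: A activates B (beta = 1) and inhibits C (beta = -1).
pf_values = perturbation_factors(
    delta_e={"A": 2.0, "B": 1.0, "C": 0.0},
    upstream={"B": ["A"], "C": ["A"]},
    beta={("A", "B"): 1.0, ("A", "C"): -1.0},
)
```

In the toy pathway, gene C has no expression change of its own but still receives a negative perturbation from its inhibitor A.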
Pathway perturbation factor
• N_de(P_i) = number of differentially expressed genes on the given pathway P_i
• PF(g) = perturbation factor of the gene g
• Mean fold change of the differentially expressed genes
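One way to combine these three quantities into a pathway-level score is to sum the absolute gene perturbations and normalize by the mean fold change times the number of differentially expressed genes, i.e. Σ_g |PF(g)| / (mean |ΔE| · N_de(P_i)). This normalization follows the perturbation term of the cited impact analysis as recalled here and should be read as an assumption. Using the mean of absolute fold changes is also a choice made for the example, as are the function and argument names.

```python
def pathway_perturbation(pf, delta_e_de_genes):
    """pf[g]: perturbation factor of every gene g on the pathway.
    delta_e_de_genes[g]: fold change of each differentially expressed gene
    on the pathway (its length is N_de)."""
    n_de = len(delta_e_de_genes)
    mean_abs_change = sum(abs(v) for v in delta_e_de_genes.values()) / n_de
    total_perturbation = sum(abs(v) for v in pf.values())
    return total_perturbation / (mean_abs_change * n_de)

# Made-up values: three genes on the pathway, two differentially expressed.
score = pathway_perturbation(
    pf={"A": 2.0, "B": 2.0, "C": -1.0},
    delta_e_de_genes={"A": 2.0, "B": 1.0},
)
```

Pathways can then be ranked by this score in decreasing order.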
• In this model, the significance (p-value) of the impact factor IF of a set of genes on a pathway P_i (for example, the genes of an MFI that belong to P_i) can be estimated by replacing that set with random sets of genes in P_i of the same cardinality.
• The perturbation factor of P_i and this p-value together measure the relevance of the MFI on that pathway.
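The random-replacement idea can be sketched as a small permutation test. The statistic used here, the sum of absolute perturbation factors over the gene set, is an assumption standing in for the impact factor; the function name, the number of random draws, and the toy values are also invented for the example.

```python
import random

def gene_set_p_value(pf, gene_set, n_perm=2000, seed=0):
    """pf[g]: perturbation factor of each gene on the pathway.
    gene_set: genes of interest (e.g. those of one MFI) on the same pathway.
    Returns the fraction of random same-size gene sets whose statistic is at
    least as large as the observed one (with a +1 correction)."""
    rng = random.Random(seed)
    genes = sorted(pf)

    def statistic(selected):
        return sum(abs(pf[g]) for g in selected)

    observed = statistic(gene_set)
    hits = sum(
        1
        for _ in range(n_perm)
        if statistic(rng.sample(genes, len(gene_set))) >= observed
    )
    return (hits + 1) / (n_perm + 1)

# Toy pathway: eight weakly perturbed genes and two strongly perturbed ones.
toy_pf = {f"g{i}": 0.1 for i in range(8)}
toy_pf.update({"h1": 5, "h2": 4})
p_value = gene_set_p_value(toy_pf, {"h1", "h2"})
```

A small p-value indicates that the chosen gene set is more perturbed than a typical set of the same size drawn from the pathway.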