Marianna De Santis, Francesco Rinaldi, Emmanuela Falcone, Stefano Lucidi, Giulia Piaggio, Aymone Gurtner and Lorenzo Farina BIOINFORMATICS - Systems biology Vol. 30 no. 2 2014, pages 228–233.
Combining optimization and machine learning techniques for genome-wide prediction of human cell cycle-regulated genes
Outline • Background • Motivation • Problem • Proposed Solution • Experiment & Results • Comment
Background • The cell division cycle has 4 phases (G1, S, G2 and M):
Background • The cell division cycle is driven by the expression of various genes • The genes are transcribed: • In an interdependent sequence • At peak levels when they are needed • To discover these genes, microarrays are applied to synchronously dividing cells
Motivation • Identification of cycling genes on a genome-wide scale is difficult: • Cell synchronization loss • Intrinsic microarray noise • Computational methods have been developed to filter out the noise. • Even when noise is reduced, the results still have low reproducibility
Problem Definition • Input: • A microarray time series dataset with n time samples. • Each dataset has m genes • Each gene expression profile is represented as an n-dimensional vector • Output: • A cyclicity score for each gene (m scores in total) • Range: [0, 1]
Proposed Solution • Solution proposed by the authors: • LEON (LEarning and OptimizatioN) algorithm: • Learning Step • Identify cycling genes with a pre-trained SVM classifier • Features used: Intensity of expression levels at different time samples • Optimization Step • Approximate, by optimization, the expression level of each gene as a function of time • Evaluate cyclicity based on the time between minima and between maxima • Used to address the cell synchronization problem • Combine the results above to calculate the final cyclicity score
Details of learning step • Step 1: Train the SVM: • Instances used in training: • Positive instances: 50 cycling genes known from the literature • Regulated during the cell cycle by changes in mRNA level • Verified by traditional experimental methods • Negative instances: 50 genes whose expression level profiles are randomly shuffled in time • Features used in training: • 17 expression levels recorded at 17 different time points
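The negative instances above are described as profiles with a random time shuffling. A minimal sketch of that idea, assuming a profile is a plain list of 17 expression levels; the function name `shuffled_negative`, the seed, and the example values are hypothetical and not taken from the paper:

```python
import math
import random


def shuffled_negative(profile, rng):
    """Return a copy of `profile` with its time points in random order."""
    shuffled = list(profile)  # copy so the original profile is left untouched
    rng.shuffle(shuffled)     # random permutation of the time points
    return shuffled


# Hypothetical periodic profile sampled at 17 time points (values are made up).
profile = [2.0 + math.sin(2.0 * math.pi * t / 8.0) for t in range(17)]

rng = random.Random(0)  # fixed seed, used only to make the sketch reproducible
negative = shuffled_negative(profile, rng)
```

Shuffling keeps the same set of expression values but destroys their ordering in time, so any periodic structure is lost while the intensity range stays comparable to the positive instances.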
Details of learning step (1) • Software package used: • LIBSVM (an SVM library developed by the Machine Learning Group at National Taiwan University) • Kernel: • Radial basis function (RBF) kernel. • k-fold cross-validation (k=5) and grid search are used to prevent over-fitting • Step 2: Evaluate each gene with the SVM above • If a gene is classified as cycling, its partial cyclicity score is 1; otherwise its score is 0
Details of optimization step • How cyclicity can be measured from the expression data profile: • Naive approach: First Fourier coefficient • Problems with this approach: • The profile may not be a pure sinusoid. • Not robust to cell synchronization error • No consistent function shape between two subsequent cycles of the same transcript. • This approach is therefore not adequate
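To make the naive approach concrete, here is a minimal sketch of one common way to compute a Fourier-based score: the magnitude of the Fourier component at the cell cycle frequency. The slides do not give the exact formula, so the mean removal, the normalization by the number of samples, the hourly sampling, and the 12-hour period are all assumptions for illustration:

```python
import cmath
import math


def fourier_score(profile, times, period):
    """Magnitude of the Fourier component at frequency 1/period.

    The profile is mean-centered first, and the result is divided by the
    number of samples (an assumed normalization).
    """
    mean = sum(profile) / len(profile)
    coeff = sum(
        (x - mean) * cmath.exp(-2j * math.pi * t / period)
        for x, t in zip(profile, times)
    )
    return abs(coeff) / len(profile)


# Hypothetical data: 24 hourly samples covering two cycles of a 12-hour period.
times = list(range(24))
sinusoid = [math.sin(2.0 * math.pi * t / 12.0) for t in times]
flat = [1.0] * 24

score_sinusoid = fourier_score(sinusoid, times, 12.0)
score_flat = fourier_score(flat, times, 12.0)
```

A pure sinusoid at the assumed period gets the largest score and a flat profile gets zero. A profile whose peaks drift between cycles or whose shape is not sinusoidal spreads its energy over other frequencies, which is the weakness listed above.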
Details of optimization step • The new approach proposed by the authors: • Plot the expression level over time • Measure the 2 features below: • Distance in time dmin between two subsequent minima • Distance in time dmax between two subsequent maxima • If dmin and dmax are close to the duplication period, the gene has a high chance of being cyclic. • The duplication period is measured by flow cytometry (Fluorescence-Activated Cell Sorting) analysis
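The two features dmin and dmax can be illustrated on a densely sampled curve. This is a minimal sketch, assuming the curve is already available as lists of times and values; the helper names, the simple neighbor comparison used to find extrema, and the 16-hour example period are hypothetical:

```python
import math


def local_extrema(times, values):
    """Return (minima_times, maxima_times) for strict interior local extrema."""
    minima, maxima = [], []
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(times[i])
        elif values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(times[i])
    return minima, maxima


def extrema_distances(times, values):
    """Return (dmin, dmax): time between the first two minima and maxima.

    A distance is None when fewer than two extrema of that kind exist.
    """
    minima, maxima = local_extrema(times, values)
    dmin = minima[1] - minima[0] if len(minima) >= 2 else None
    dmax = maxima[1] - maxima[0] if len(maxima) >= 2 else None
    return dmin, dmax


# Hypothetical curve: hourly samples of a sinusoid with a 16-hour period.
times = list(range(41))
values = [math.sin(2.0 * math.pi * t / 16.0) for t in times]
dmin, dmax = extrema_distances(times, values)
```

For this example both distances equal the 16-hour period of the curve, which is the situation the slide describes as indicating a cyclic gene when the period matches the measured duplication period.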
Details of optimization step • Step 1: Data is preprocessed to: • Reduce noise • Extrapolate values outside the sampling time range • Step 2: Plotting the chart • Approximate the expression level function, as there are not enough data points • Express the expression level function as a linear combination of radial basis functions
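A linear combination of radial basis functions can be sketched as follows. The slides do not say which basis function is used, so the Gaussian form, the shared width, and the example weights and centers are assumptions chosen only for illustration:

```python
import math


def rbf_model(t, weights, centers, width):
    """Evaluate f(t) = sum_i w_i * exp(-(t - c_i)^2 / (2 * width^2)).

    Gaussian basis functions with a common width are an assumption here.
    """
    return sum(
        w * math.exp(-((t - c) ** 2) / (2.0 * width ** 2))
        for w, c in zip(weights, centers)
    )


# Hypothetical parameters: three Gaussian bumps centered at 0, 8 and 16 hours.
weights = [1.0, 0.5, 1.0]
centers = [0.0, 8.0, 16.0]
width = 2.0

# Evaluate the smooth curve on a dense grid (every 0.1 h from 0 to 16 h).
grid = [i / 10.0 for i in range(161)]
curve = [rbf_model(t, weights, centers, width) for t in grid]
```

Because the model is defined for every t, it can be evaluated between and beyond the measured time points, which is what makes it possible to locate minima and maxima more finely than the raw samples allow.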
Details of optimization step • Minimize error in the function approximation by formulating the regularized problem below: • Solve the optimization problem with the PRICE algorithm • Search for the parameters that give the best ROC curve. • dmin and dmax are measured. • Step 3: Evaluate cyclicity:
Final Cyclicity Score Calculation • The final cyclicity score pcomb of a gene can be calculated by: • Where: • c is the partial cyclicity score calculated by the learning step • p is the partial cyclicity score calculated by the optimization step
Validation Experiment • To validate their approach, the data below are used: • Synthetic data • Generated using the algorithm developed by Zhao et al. • The genes are transcribed at one invariant time • The cells de-synchronize and the peaks are smoothed over time • Multiplicative white Gaussian noise is added, with noise standard deviation 10% for positive samples and 20% for negative samples • Real data • Microarray data from Bar-Joseph et al. • Cell line used: Synchronized foreskin fibroblast cells • Synchronized using double-thymidine block arrest
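Multiplicative white Gaussian noise means each sample is scaled by an independent random factor around 1. A minimal sketch, assuming the 10% figure is the standard deviation of that factor; the clean profile, the function name, and the seed are hypothetical, and this is not the generator of Zhao et al.:

```python
import math
import random


def add_multiplicative_noise(profile, noise_sd, rng):
    """Scale each sample by (1 + e), with e drawn from N(0, noise_sd^2)."""
    return [x * (1.0 + rng.gauss(0.0, noise_sd)) for x in profile]


# Hypothetical clean profile: 24 samples covering two cycles of period 12.
clean = [2.0 + math.sin(2.0 * math.pi * t / 12.0) for t in range(24)]

rng = random.Random(1)  # fixed seed, used only to make the sketch reproducible
noisy_positive = add_multiplicative_noise(clean, 0.10, rng)  # 10% noise
```

With multiplicative noise the absolute perturbation grows with the expression level, so peaks are disturbed more than troughs.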
Validation Experiment (Synthetic) • Dataset generated: • Positive instances: • 1000 synthetic time courses covering two cell cycles • Negative instances: • 1000 randomly fluctuating profiles, obtained by random time shuffling of cyclic data • The SVM in the learning step was trained with: • 50 extra positive examples • 50 extra negative examples
Validation Experiment (Synthetic) • Evaluation of the results on synthetic data • Ratios of genes scored > 0.5 and < 0.5: • Cyclic: 1 / 0.997 = 1.003 • Non-cyclic: 1 / 0.993 = 1.007 • Conclusion • A high differentiating power is observed • Cyclic genes are slightly favored.
Validation Experiment (Real) • 480 cycling genes reported in the literature are considered: • Their pcomb scores are calculated and their distribution is plotted below:
Validation Experiment (Real) • From the graph in the previous slide: • The scores of flat-profile genes are uniformly distributed • The p score adds no information and is consistent with the c score. • The scores of fluctuating-profile genes are skewed toward the maximum value • The p score provides additional information about cyclicity. • The combination of the two scores performs better than either single one. • These results indicate that the authors' approach performs well
Validation Experiment (Real) • The Cyclebase database is used: • It classifies genes as cyclic based on experiments on HeLa cells • The 91 low-pcomb genes are classified as follows: • 18 as cyclic, 44 as non-cyclic, 29 as not classified • Excluding the unclassified genes, 71% of the genes are classified by Cyclebase as non-cyclic: • This matches the analysis of the authors. • The gene PRKD1 is confirmed to be non-cyclic
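The 71% figure follows from the counts on the slide once the unclassified genes are set aside. A short worked check (variable names are chosen here for illustration):

```python
# Counts reported for the 91 low-pcomb genes.
cyclic, non_cyclic, unclassified = 18, 44, 29

# Only genes that Cyclebase actually classifies enter the denominator.
classified = cyclic + non_cyclic

fraction_non_cyclic = non_cyclic / classified
percent_non_cyclic = round(100 * fraction_non_cyclic)
```

So 44 of the 62 classified genes are non-cyclic, which rounds to 71%.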
Discovery • The 50 genes with the highest pcomb are selected • 5 known cycling genes were excluded [Bar-Joseph et al. (2008)] • 9 of them are chosen to be validated by experiment • Experiment on cell cycle-dependent expression: • Human fibroblasts are prepared from human foreskin • They were grown to 50% confluence • They were synchronized in G0 by serum deprivation. • Cultures were then released from arrest • Sampling (RT-PCR) is done regularly to cover a cell cycle.
Discovery (2) • In the experiment: • All 9 genes are cell cycle-regulated • Their expression levels peak in S phase for six of them and in G1 phase for four of them. • NCOR1 and EDF-1 are already known from other literature to be cell cycle-regulated. • The predictive power of the LEON algorithm is confirmed
Discovery (3) • Four genes with low combined scores are considered: • Their expression is not regulated during the cell cycle • The LEON algorithm identifies cell cycle gene expression only. • Further analysis is performed on another 4 genes: • Cyclin A and B1 genes as positive controls • GAPDH and aldolase genes as negative controls • Cyclin A and B1 are cell cycle dependent • The expression of the GAPDH and aldolase genes is constant • Cell cycling genes are expressed along cell division only. • These results demonstrate that the authors' approach is successful in identifying cell cycle-regulated genes.