Analysis and Integration of Large-scale Molecular and Clinical Data in Cancers. Sampsa Hautaniemi, DTech Systems Biology Laboratory Institute of Biomedicine Genome-Scale Biology Research Program Centre of Excellence in Cancer Genetics Faculty of Medicine University of Helsinki.
Table of Contents • The essence of systems biology: Iteration and collaboration. • Iteration in ovarian cancer. • The essence of systems biology II: Multi-level data. • Multi-levelity of breast cancer. • The essence of systems biology III: Computation. • Anduril computational framework & glioblastoma multiforme.
Systems Biology: Iteration Adapted from a slide by Peter Sorger
Ovarian Cancer • Epithelial ovarian cancer is the fifth most frequent cause of female cancer deaths, with an overall 5-year survival rate below 50%. • The standard chemotherapy for high-grade serous ovarian cancer (HGS-OvCa) is a platinum-taxane combination. • The majority of patients relapse within 18 months. • There are no clinically applicable methods to predict prognosis or even to identify the patients who do not respond to current therapies.
Aims of the HGS-OvCa Study • To identify poor-response and good-response subtypes of HGS-OvCa. • To report biomarkers that identify whether an HGS-OvCa patient responds to platinum treatment. • We developed a computational method that integrates transcriptomics and clinical data in the subtype-finding step. • We used transcriptomics and clinical data from the TCGA repository for 184 HGS-OvCa patients treated with platinum and taxane.
Three Subtypes of HGS-OvCa Chen et al. In preparation.
Validation, validation, validation • We also used an independent prospective HGS-OvCa cohort of 29 patients. • Data measured with qRT-PCR. Chen et al. In preparation.
Pathway Analysis • Our pathway analysis (too) identified TR3 as a potential driver for platinum resistance.
TR3 Inhibition with Two Drugs • We identified two signaling pathway regulators for TR3 and associated inhibitors. • The use of the two inhibitors should render the HGS-OvCa cells sensitive to platinum. AKT inh + AKT inh + ERK5 inh Chen et al. In preparation.
eAtlas of Pathology Systems Biology II: Multi-level Data • While cancer cells are clearly visible, the exact molecular causes are still unknown. • Need to study cancer samples at multiple levels.
Multiple Levels of Data • Genetics • Transcriptome • Proteomics • Epigenetics • Clinical • 100 samples lead to ~200 million data points.
Multiple-level Data: Estrogen Receptor [Figure labels: Nuclear receptor: estrogen receptor; Gene regulation; Transcription factor; Non-genomic action; Genomic action]
Why Is This Important? • Estrogen receptor is the most important clinical variable in determining how to treat a breast cancer patient. • There are several anti-cancer drugs targeting the estrogen receptor pathway. • It is currently unknown which tumors do not respond to therapy. • Finding genes that respond to estrogen receptor stimulus may give clues as to which genes are important in resistance to ER inhibition. Hugo Simberg: Garden of Death
Data • We used chromatin immunoprecipitation combined with massively parallel sequencing (ChIP-seq) to determine genome-wide occupancy at eight time points after estradiol stimulation in the MCF-7 breast cancer cell line for: • Estrogen receptor α • RNA polymerase II • Histone marks (H3K4me3, H2A.Z) • These experiments produced >2.0 billion data points for the initial analysis.
SYNERGY database • SYNERGY database is available and fully operational. • http://csblsynergy.fimm.fi/
Results • We identified 777 estrogen receptor early responding genes. • Interestingly, the major estrogen receptor related changes in cells were due to non-genomic action.
Results • Next we searched for genes associated with survival in a cohort of 150 ER+/HER2-/postmenopausal breast cancer patients from The Cancer Genome Atlas (TCGA). • Based on Kaplan-Meier analysis we identified 23 genes with survival p<0.05. • The gene most strongly associated with survival was ATAD3B.
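The Kaplan-Meier step can be illustrated with a minimal sketch. It assumes patients are split into two groups (for example, high versus low activity of one gene) and compared with a log-rank test. The slides do not state how groups were formed or which test produced the p-values, so the grouping, the function names and the toy data below are all illustrative assumptions.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each event time.

    times  : follow-up time per patient
    events : 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the p-value (chi-square, 1 df)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Toy data (hypothetical, not from the study): survival in months.
low_times, low_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
high_times, high_events = [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]
print(kaplan_meier(low_times, low_events))
print(logrank_p(low_times, low_events, high_times, high_events))
```

In a gene-by-gene screen, a function like `logrank_p` would be called once per gene, with the two groups redefined each time.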
Intermission • Pol2 activity is a much better way than mRNA of searching for genes responsive to a cue. • In deep sequencing, the sequencing depth is important (with our 200 million short-read Pol2 data, we found many ER-responsive genes not found in 20 million short-read GRO-seq data). • How to systematically analyze multi-level data?
Multi-level Cancer Research Requires Computational Methods • Storing the data and computing power are the first (but relatively small) hurdles. • Analysis of large-scale, heterogeneous data is much more challenging than single genomics or proteomics data analysis. • There is a need for computational infrastructure. • Writing an analysis program fast without proper infrastructure will lead to delays and errors in larger projects.
Infrastructure: Anduril • Anduril is a computational framework to integrate large-scale and heterogeneous data, knowledge in bio-databases and analysis tools. • The main design principles are: • Modular pipeline analysis approach • Scalable • Open source, thorough documentation • http://www.anduril.org/ • Methods written in any programming language that can be executed from the command prompt can be included. • Automatically produces a PDF report and a website containing the results.
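The modular idea, where each step is any program runnable from the command prompt, can be sketched in a few lines. This is not Anduril's API or configuration language. The `Component` class, `run_pipeline` function and the two toy steps are hypothetical names, chosen only to show how command-line executables can be chained through files.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Component:
    """Hypothetical pipeline step: an executable that reads one file and writes another."""
    name: str
    command: list  # argv prefix; the input and output paths are appended


def run_pipeline(components, input_path, workdir):
    """Run the components in order, feeding each output file to the next step."""
    current = Path(input_path)
    for step in components:
        output = Path(workdir) / f"{step.name}.out"
        subprocess.run(step.command + [str(current), str(output)], check=True)
        current = output
    return current


# Two toy components written as inline Python scripts. In practice each could
# be an R script, a shell tool or a compiled binary with the same calling
# convention (input path, then output path).
LOG2 = Component("log2", [sys.executable, "-c", (
    "import sys, math\n"
    "vals = [float(x) for x in open(sys.argv[1]).read().split()]\n"
    "open(sys.argv[2], 'w').write('\\n'.join(str(math.log2(v)) for v in vals))\n"
)])
CENTER = Component("center", [sys.executable, "-c", (
    "import sys\n"
    "vals = [float(x) for x in open(sys.argv[1]).read().split()]\n"
    "mean = sum(vals) / len(vals)\n"
    "open(sys.argv[2], 'w').write('\\n'.join(str(v - mean) for v in vals))\n"
)])


def demo():
    """Log-transform and mean-center four toy expression values."""
    with tempfile.TemporaryDirectory() as workdir:
        raw = Path(workdir) / "raw.txt"
        raw.write_text("1\n2\n4\n8")
        result = run_pipeline([LOG2, CENTER], raw, workdir)
        return [float(x) for x in result.read_text().split()]


print(demo())
```

Passing data between steps as files keeps each component independent of the language it is written in, which is the property the slide emphasizes.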
Glioblastoma Multiforme (GBM) • Glioblastoma multiforme (GBM) is one of the deadliest cancers. • The Cancer Genome Atlas (TCGA) has published data from >500 GBM patients: • comparative genomic hybridization arrays • single nucleotide polymorphism arrays • exon and gene expression arrays • microRNA arrays • methylation arrays • clinical data • Which genes or genetic regions have survival effect?
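Asking which genes have a survival effect means testing many genes at once, so some p-values below 0.05 are expected by chance alone. One common way to account for this is the Benjamini-Hochberg adjustment, sketched below. The slides do not say whether or how multiple testing was handled, so the technique, the function name and the example p-values are assumptions for illustration only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene survival p-values (not from the study).
gene_p = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.03, "geneD": 0.5}
adjusted = benjamini_hochberg(list(gene_p.values()))
for gene, q in zip(gene_p, adjusted):
    print(gene, round(q, 4))
```

Genes whose adjusted value stays below the chosen false discovery rate would be the ones carried forward for validation.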
(Sequence) Component Libraries • Over 400 Anduril components already available. • Pipelines: • ChIP-seq (EMBO J 2011, Cancer Res 2012, ...) • RNA-seq (not published) • miRNA-seq (not published) • DNA methylation-seq (not published) • Whole-genome sequence & exome-sequence (not published) • Image analysis (manuscript)
Summary • Characterization of a complex disease first requires identifying the key variables. • This requires integration of data from multiple levels, an iterative mode of research and collaboration. • Multi-level data integration requires computational infrastructure and data-intensive computing. • We have developed Anduril to organize large-scale data analysis projects (imaging, deep sequencing, database usage, conversions, etc.). • The need for computational infrastructure is particularly evident when analyzing deep sequencing data. • All our methods are (will be) freely available. http://research.med.helsinki.fi/gsb/hautaniemi/software.html
Acknowledgements Systems Biology Lab Funding Academy of Finland Finnish Cancer Organizations Sigrid Jusélius Foundation EU FP7 ERA-NET SysBio+ Biocenter Finland Biocentrum Helsinki Collaborators Olli Carpén Henk Stunnenberg George Reid Jukka Westermarck