
Dragon Star Course (龙星课程): Cancer Bioinformatics Hands-on Lab



Presentation Transcript


  1. Dragon Star Course (龙星课程): Cancer Bioinformatics Hands-on Lab • Sha Cao (曹莎) • Email: scaorobin@sina.com

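  2. Course schedule • Introduction to the data types and a brief introduction to R • Correlation between gene expression data and protein expression data • Testing for differential expression, false discovery rate (FDR), batch effects • Gene mutation data and pathway enrichment analysis • Correlation within gene expression data and biclustering analysis • Integration of data types: gene expression data with metabolic profiling data; gene expression data with epigenetic data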

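  3. Introduction to data types: gene expression data • Microarray • High-throughput measurement of tens of thousands of probes • Relatively low precision • How can the data be obtained? • GEO Dataset, ArrayExpress, TCGA • What information do these data contain?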

  4. What to check before using microarray data • Organism • Experimental design • Sample list (sample distribution, sample size) • Platform • Important!!!!

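  5. Introduction to data types: gene expression data • RNA-seq • How can the data be obtained? • TCGA, SRA • What do these data measure, and what information do they contain?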

  6. Data levels and data types • https://tcga-data.nci.nih.gov/tcga/tcgaDataType.jsp

  7. Introduction to data types: genomic data • Somatic point mutation • How can the data be obtained? • TCGA, GEO SRA • What do these data measure? What information do they contain?

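  8. Introduction to data types: epigenetic data • DNA methylation data • How can the data be obtained? • TCGA, GEO Dataset • What do these data measure, and what information do they contain?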

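  9. Introduction to data types: epigenetic data • Histone modification data • How can the data be obtained? • Very limited • What do these data measure, and what information do they contain?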

  10. Introduction to data types: proteomics data • Protein array • How can the data be obtained? • TCGA, literature search • What do these data measure? What information do they contain?

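  11. Introduction to data types: metabolomics data • Metabolic profiling • How can the data be obtained? • Literature search • What do these data measure? What information do they contain?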

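  12. A brief introduction to R • Simple data processing • Statistical tests • Statistical modeling (regression, matrix factorization, etc.) • Visualization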

  13. Print • print(matrix(c(1,2,3,4), 2, 2)) • print(list("a","b","c"))

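  14. Basic functions • ls() # list the objects in the workspace • rm() # remove objects • c() # create a vector; c() is a function • mode() # storage mode of an object • class() # class of an object • mean(x) • median(x) • sd(x) • var(x) • cor(x, y) # correlation • cov(x, y) # covariance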
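As a minimal sketch, the functions listed above can be tried on two small vectors. The values of `x` and `y` are made up for illustration and are not course data.

```r
# Illustrative vectors (made-up values, not from the course data)
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)

mode(x)      # storage mode: "numeric"
class(x)     # class: "numeric"
mean(x)      # 5
median(x)    # 5
var(x)       # sample variance (divides by n - 1)
sd(x)        # standard deviation, the square root of var(x)
cor(x, y)    # Pearson correlation between x and y
cov(x, y)    # sample covariance between x and y

ls()         # names of the objects currently in the workspace
```

Calling `rm(y)` afterwards would remove `y` from the workspace, and `rm(list = ls())` would remove everything.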

  15. Creating Sequences • 1:5 • 5:1 • seq(from=0, to=20, by=5) • 1.1:10.1 • 1.1:10.3 • a<-rep(0,3) • rep(c(1,2,a),2)
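The sequence commands above can be checked as follows. This is a small sketch; the names `s1`, `s2` and `r` are introduced here only to hold the results. The `:` operator always steps by 1 from its left-hand value and stops before exceeding the right-hand value, which is why the two non-integer examples have the same length.

```r
1:5                              # 1 2 3 4 5
5:1                              # 5 4 3 2 1
seq(from = 0, to = 20, by = 5)   # 0 5 10 15 20

s1 <- 1.1:10.1                   # starts at 1.1 and steps by 1
s2 <- 1.1:10.3                   # also starts at 1.1; never exceeds 10.3
length(s1)                       # 10
length(s2)                       # 10

a <- rep(0, 3)                   # 0 0 0
r <- rep(c(1, 2, a), 2)          # 1 2 0 0 0 1 2 0 0 0
```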

  16. Basic calculations • + • - • * • / • %% # remainder (modulo) • ^ • %*% # matrix multiplication • log(x) • sin(x) • exp(x) • exp(1) # Euler's number e; R has no built-in constant named e • pi • Inf • NA
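A short sketch of the less obvious operators follows; the matrix `A` and the result name `P` are illustrative assumptions. Note that in R the constant is spelled `pi` in lower case, and Euler's number is obtained as `exp(1)`.

```r
7 %% 3                     # remainder after division: 1
2^10                       # exponentiation: 1024

A <- matrix(1:4, 2, 2)     # filled column by column: rows are (1, 3) and (2, 4)
A * A                      # element-wise product
P <- A %*% A               # matrix product: rows are (7, 15) and (10, 22)

exp(1)                     # Euler's number
log(exp(1))                # natural logarithm: 1
pi                         # the constant pi
1 / 0                      # Inf
NA + 1                     # NA: arithmetic on a missing value stays missing
is.na(NA + 1)              # TRUE
```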

  17. Data mode: physical type • > mode(3.1415) # Mode of a number • [1] "numeric" • > mode(c(2.7182, 3.1415)) # Mode of a vector of numbers • [1] "numeric" • > mode("Moe") # Mode of a character string • [1] "character"

  18. Data Class: Abstract type • scalar • array (vector) • matrix • From array to matrix • factor (looks like a vector, but has special properties, for Categorical variables or grouping) • data.frame

  19. data.frame vs. matrix • data.frame: same data mode within each column • Unique row/column names (rownames, colnames) • One row of a data.frame is a data.frame • as.data.frame(****) • matrix: same data mode in the whole matrix • Can have repeated row/column names • One row of a matrix is an array (vector) • as.matrix(****)
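The contrast between the two classes can be seen on toy objects. This is a sketch only: the data frame `df`, the matrix `m`, and the factor `f` are invented for illustration.

```r
# A small data.frame: each column has its own mode
df <- data.frame(id = c("s1", "s2"), age = c(61, 47),
                 stringsAsFactors = FALSE)
sapply(df, mode)          # id: "character", age: "numeric"

# A matrix: one mode for every cell
m <- matrix(1:4, 2, 2)
mode(m)                   # "numeric"

class(df[1, ])            # "data.frame": one row is still a data.frame
is.vector(m[1, ])         # TRUE: one row of a matrix is a plain vector

# Converting a mixed data.frame to a matrix forces a single mode
m2 <- as.matrix(df)
mode(m2)                  # "character"

# A factor looks like a vector but stores a fixed set of categories
f <- factor(c("tumor", "normal", "tumor"))
levels(f)                 # "normal" "tumor"
```

The coercion in `as.matrix(df)` is the practical reason to keep clinical tables, which mix text and numbers, as data.frames.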

  20. Data types handled in this course • Clinical data -> data.frame • Experimental data -> data.frame or matrix • Microarray data • RNA-seq data • Somatic mutation data • Protein array • DNA methylation data

  21. Data combining • cbind • Combine data by column • rbind • Combine data by row • E.g. a <- matrix(0,2,2); b <- matrix(1,2,2); cbind(a,b); rbind(a,b)
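Checking the dimensions makes the difference between the two commands concrete. A minimal sketch using the same two matrices, with `ab_cols` and `ab_rows` as illustrative result names:

```r
a <- matrix(0, 2, 2)
b <- matrix(1, 2, 2)

ab_cols <- cbind(a, b)   # columns of b are placed to the right of a
ab_rows <- rbind(a, b)   # rows of b are placed below a

dim(ab_cols)             # 2 4
dim(ab_rows)             # 4 2
```

For matrices, `cbind` needs the same number of rows in both inputs and `rbind` needs the same number of columns.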

  22. length • a<-c(1:5) • length(a)

  23. apply • Apply functions over array margins • apply(DATA, MARGIN, FUNCTION, ...) • MARGIN = 1 for rows; 2 for columns • E.g. m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2); apply(m, 1, mean); apply(m, 2, mean)
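The example above can be extended slightly to show what each margin returns. This is a sketch: the names `row_means`, `col_means` and `col_ranges` are illustrative, and the anonymous function is one arbitrary choice of custom statistic.

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

row_means <- apply(m, 1, mean)   # one value per row: 6, 7, ..., 15
col_means <- apply(m, 2, mean)   # one value per column: 5.5 and 15.5

# rowMeans() and colMeans() are built-in shortcuts for these two calls
all.equal(row_means, rowMeans(m))
all.equal(col_means, colMeans(m))

# Any function of a vector can be supplied, including an anonymous one
col_ranges <- apply(m, 2, function(v) max(v) - min(v))   # 9 and 9
```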

  24. Finding patterns • The which command • which(****): **** should be a logical operation • which(****) returns the indices of the TRUE elements of the logical operation • E.g. x <- floor(10*runif(10)); x; which(x<5); x[which(x<5)]
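Because `runif` gives different numbers on every run, a fixed vector makes the behaviour of `which` easier to follow. The values below are made up for illustration.

```r
x <- c(7, 2, 9, 4, 0, 5)

x < 5                   # logical vector: FALSE TRUE FALSE TRUE TRUE FALSE
idx <- which(x < 5)     # positions of the TRUE elements: 2 4 5
x[idx]                  # the matching values: 2 4 0

# Indexing with the logical vector directly selects the same values
x[x < 5]                # 2 4 0
```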

  25. For loop • For loop: http://en.wikipedia.org/wiki/For_loop • In computer science, a for loop is a programming language statement that allows code to be executed repeatedly. • Question: calculate the sum of all the values in the vector x <- floor(10*runif(10))

  26. For loop • Real computer program! • E.g. for(i in 1:100){ print("Hello world!"); print(i*i) }

  27. For loop • for(*** in ***){} • for(VARIABLE in TARGETSET){} • for(i in 1:100){} • x <- floor(10*runif(10)); total_x <- 0; for(i in 1:length(x)){ print(i); print(x[i]); total_x <- total_x + x[i] }
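The loop above answers the question of summing the vector. A compact version, given as a sketch, fixes the random seed (an assumption added here so that repeated runs give the same `x`) and compares the result with the built-in `sum`.

```r
set.seed(1)                    # fixed seed so the run is reproducible
x <- floor(10 * runif(10))     # ten random integers between 0 and 9

total_x <- 0
for (i in seq_along(x)) {      # seq_along(x) is 1, 2, ..., length(x)
  total_x <- total_x + x[i]    # add the i-th value to the running total
}

total_x == sum(x)              # TRUE: the loop reproduces sum(x)
```

`seq_along(x)` is used instead of `1:length(x)` because it also behaves correctly when `x` is empty.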

  28. Working directory • getwd() • setwd("****") • list.files() • load("****") • save.image("****")
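The round trip between saving and loading can be tried safely in a temporary folder. This is a sketch under assumptions: the file name `example.RData` and the object `x` are hypothetical, and `save` is used in place of `save.image` so that only one object is written (`save.image` writes the whole workspace).

```r
old_wd <- getwd()                  # remember where we started
setwd(tempdir())                   # work in R's temporary directory

x <- 1:3
save(x, file = "example.RData")    # write x to a file in the working directory
list.files()                       # the new file is listed here

rm(x)                              # remove x from the workspace ...
load("example.RData")              # ... and restore it from the file
x                                  # 1 2 3

setwd(old_wd)                      # return to the original directory
```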

  29. Example • Extract all stage I and stage II samples from the clinical information of the colon cancer data

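  30. Steps • Load the data • Find all the stage values present in the data • Use a for loop to extract all stage I and stage II samples and combine them into one data set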

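  31. R code
      # Set the working directory and start from a clean workspace
      setwd("D:\\DragonStar\\dragon_star_data\\TCGA_colon_cancer_data")
      rm(list=ls())
      list.files()
      # Load the clinical table and store the stage as character strings
      load("COAD_clinical_data.RData")
      data_clinical <- COAD_clinical_data
      data_clinical$pathologic_stage <- as.character(data_clinical$pathologic_stage)
      head(data_clinical)
      table(data_clinical$pathologic_stage)
      # Stage labels present in the data
      all_stages <- unique(sort(data_clinical$pathologic_stage))
      # Pick the stage I and II labels by position (depends on the sorted order) ...
      all_I_II_stages <- all_stages[c(2,3,4,5,6,7)]
      # ... or by name, which overrides the line above
      all_I_II_stages <- c("Stage I","Stage IA","Stage II","Stage IIA","Stage IIB","Stage IIC")
      # Collect the matching rows, one stage label at a time
      data_stage_I_II <- c()
      for(i in 1:length(all_I_II_stages)){
        id <- which(data_clinical$pathologic_stage == all_I_II_stages[i])
        data_stage_I_II <- rbind(data_stage_I_II, data_clinical[id,])
      }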
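The same selection can also be written without a loop by using `%in%`. The sketch below runs on a made-up data frame, since the course file is not available here: `toy`, its `sample_id` column, and its stage values are assumptions for illustration, and only the `pathologic_stage` column name is taken from the code above.

```r
# Toy clinical table (invented rows, not TCGA data)
toy <- data.frame(
  sample_id = c("A", "B", "C", "D", "E"),
  pathologic_stage = c("Stage I", "Stage IIA", "Stage III",
                       "Stage IV", "Stage II"),
  stringsAsFactors = FALSE
)

# Stage labels to keep
keep <- c("Stage I", "Stage IA", "Stage II",
          "Stage IIA", "Stage IIB", "Stage IIC")

# %in% tests every row against the whole set of labels at once
toy_I_II <- toy[toy$pathologic_stage %in% keep, ]
toy_I_II$sample_id       # "A" "B" "E"
```

Unlike the loop, which groups the output by stage label, this version keeps the rows in their original order.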
