Single cell RNAseq Kathie Mihindukulasuriya, PhD Senior Scientist, Cruchaga Lab Department of Psychiatry Washington University in St. Louis
Plan:
• Single cell RNA-seq vs bulk RNA-seq
• Current single cell protocols and platforms
• Processing single cell RNA-seq data
• Biology-based analysis
• Current challenges in single cell RNA-seq processing and analysis
What are some types of questions that can be answered by scRNAseq?
Droplet-based
Methods of single-cell isolation:
• Limiting dilution: not very efficient
• Micromanipulation: time consuming; low throughput
• FACS: yields highly purified single cells, but only if the cells express a cell surface marker
Droplet-based
Methods of single-cell isolation:
• Microfluidic technology
  - low sample consumption
  - low analysis cost
  - precise fluid control
  - decreased risk of external contamination
• CellSearch
  - antibody conjugated to magnetic particles to isolate desired cells
  - good for rare cell types
• Laser capture microdissection
  - isolates cells from solid samples
Droplet-based
cell lysis -> reverse transcription into first-strand cDNA -> second-strand synthesis -> cDNA amplification
• UMIs (unique molecular identifiers):
  - 4–10 random nucleotides introduced with the primer used for cDNA generation, before amplification
  - multiple reads with the same UMI sequence that map to the same gene are counted as one molecule
• Cell barcodes:
  - a cell-specific DNA sequence that labels the cDNA and allows multiplexing at an early stage
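The UMI rule above (reads sharing a UMI and a gene count as one molecule) can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular pipeline: the function name `count_molecules`, the barcodes, the UMIs and the read tuples are all made up, and sequencing errors within UMIs are ignored.

```python
from collections import defaultdict

def count_molecules(reads):
    """Collapse reads into molecule counts per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples.
    Reads that share barcode, UMI and gene are treated as PCR copies
    of a single original molecule.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}

# Hypothetical reads, for illustration only.
reads = [
    ("AAAC", "GGTCA", "APOE"),
    ("AAAC", "GGTCA", "APOE"),  # PCR duplicate of the read above
    ("AAAC", "TTAGC", "APOE"),  # second APOE molecule in the same cell
    ("AAAC", "GGTCA", "GFAP"),  # same UMI but a different gene
    ("CCGT", "GGTCA", "APOE"),  # same UMI but a different cell
]
molecule_counts = count_molecules(reads)
```

Here five reads collapse to four molecules: two APOE and one GFAP in cell AAAC, and one APOE in cell CCGT.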
Plate-based Template Switching Oligonucleotide
Processing scRNA-seq data
• Map reads to the genome, not the transcriptome
  - decreases multi-mapping reads
  - critical for snRNA-seq
• Splice-aware aligners (STAR)
• Pseudoaligners (faster)
• Associate reads with genes or transcripts
  - featureCounts
  - HTSeq
• Remove PCR noise using UMIs
• Demultiplex to identify cells
• Remove barcodes derived from cell-free mRNA (these have a much lower average read count than barcodes derived from intact cells)
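The last step, separating barcodes from intact cells from barcodes that only captured cell-free mRNA, relies on the gap in read counts. A minimal sketch follows, assuming a deliberately simple rule: keep barcodes with at least 10% of the reads of the best-covered barcode. The function `call_cells`, the `min_fraction` value and the counts are assumptions for illustration; real pipelines use more careful cutoffs.

```python
def call_cells(reads_per_barcode, min_fraction=0.1):
    """Return the barcodes that look like intact cells.

    A barcode is kept when its read count is at least `min_fraction`
    of the highest read count observed (an assumed, simplistic rule).
    """
    threshold = min_fraction * max(reads_per_barcode.values())
    return sorted(
        barcode
        for barcode, n_reads in reads_per_barcode.items()
        if n_reads >= threshold
    )

# Hypothetical read counts per barcode.
reads_per_barcode = {
    "AAAC": 52000,
    "CCGT": 47000,
    "GGTA": 310,    # low count: likely cell-free mRNA
    "TTCA": 120,    # low count: likely cell-free mRNA
    "ACGT": 39000,
}
cell_barcodes = call_cells(reads_per_barcode)
```

With these numbers the threshold is 5,200 reads, so the two low-count barcodes are dropped.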
Processing scRNA-seq data
• Remove low-quality ‘cells’ based on mapping statistics:
  - overrepresentation of mitochondrial RNAs, ribosomal RNAs (>40%), spike-ins, adapters
  - and/or reads that map outside of exons
• Normalize to correct for unwanted variation among cells caused by technical variation
• Remove batch effects
• Biology-based analysis (such as differential expression)
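The filtering and normalization steps above might look like the following sketch. Several things here are assumptions rather than the presenter's settings: the 40% cutoff is borrowed from the slide but applied to the mitochondrial fraction, mitochondrial genes are recognized by the human `MT-` symbol prefix, normalization is plain library-size scaling to 10,000 counts followed by log1p, and the counts are invented.

```python
import math

def mito_fraction(counts):
    """Fraction of a cell's counts that come from mitochondrial genes."""
    total = sum(counts.values())
    mito = sum(n for gene, n in counts.items() if gene.startswith("MT-"))
    return mito / total if total else 0.0

def log_normalize(counts, scale=10_000):
    """Scale each cell to `scale` total counts, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(n / total * scale) for gene, n in counts.items()}

# Hypothetical UMI counts per cell.
cells = {
    "cell_1": {"APOE": 30, "GFAP": 10, "MT-CO1": 10},  # 20% mitochondrial
    "cell_2": {"APOE": 5, "GFAP": 5, "MT-CO1": 40},    # 80% mitochondrial
}

MAX_MITO = 0.40  # assumed cutoff, for illustration only
kept = {name: c for name, c in cells.items() if mito_fraction(c) <= MAX_MITO}
normalized = {name: log_normalize(c) for name, c in kept.items()}
```

Only `cell_1` passes the filter. Its normalized values are comparable across cells because each cell is scaled to the same total before the log transform.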
Some examples of biology-based analysis
Purpose: to directly investigate AD brain changes in cell proportion and gene expression at single-cell resolution
Del-Aguila, J.L. et al. A single-nuclei RNA sequencing study of Mendelian and sporadic AD in the human brain. bioRxiv, Mar. 30, 2019. doi: http://dx.doi.org/10.1101/593756
To identify different cell types in brain samples: cells were clustered by a CGS approach (unsupervised graph-based clustering) and the clusters were then annotated by cell type using marker genes
t-distributed Stochastic Neighbor Embedding (tSNE) is a dimensionality reduction technique
Differences with PCA:
1. tSNE is typically used to produce a 2D separation
2. tSNE is non-deterministic (you won't get exactly the same output each time you run it)
3. tSNE tends to cope better with non-linear signals in your data (less impact of outliers; visible separation between relevant groups is improved)
4. After tSNE, input features are no longer identifiable, and you cannot make any inference based only on the output of tSNE
NOTE: very computationally intensive (may need to apply another dimensionality reduction technique like PCA first)
To identify different cell types in brain samples: Classic Gene Set (CGS) from pooled subjects:
Seurat FindVariableGenes -> 2,360 genes -> calculate 100 PCs -> identify the optimal number of PCs (65)
Result: 25 clusters; 6 cell types
To identify different cell types in brain samples: Consensus Gene Set (ConGen) from each subject:
Seurat FindVariableGenes -> 2,447 (S1); 2,354 (S2); 1,972 (S3) genes -> R function intersect to identify common genes (1,434) -> calculate 100 PCs -> identify the optimal number of PCs (25)
Result: 14 cell types; better resolution
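The consensus step is a set intersection across the per-subject variable-gene lists. The slide does this in R; the Python sketch below shows the same idea with tiny, made-up gene lists standing in for the real sets of roughly 2,000 genes each.

```python
# Hypothetical variable-gene sets per subject (the real sets had
# 2,447, 2,354 and 1,972 genes for S1, S2 and S3).
variable_genes = {
    "S1": {"APOE", "GFAP", "MOBP", "SNAP25"},
    "S2": {"APOE", "GFAP", "SNAP25", "CSF1R"},
    "S3": {"APOE", "GFAP", "MOBP"},
}

# Keep only the genes flagged as variable in every subject.
consensus_genes = set.intersection(*variable_genes.values())
```

Only genes that are variable in all three subjects survive, which is why the consensus set (1,434 genes on the slide) is smaller than any single subject's list.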
Cluster annotation
Evaluate the expression of marker genes (from the literature) for neurons, astrocytes, oligodendrocytes, microglia, oligodendrocyte precursor cells, endothelial cells, and excitatory and inhibitory neurons -> Seurat DotPlot to visualize the average expression of the marker genes in each cluster
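The numbers behind a dot plot are simple per-cluster summaries: for each marker gene, the average expression in the cluster and the fraction of cells in which the gene is detected. The sketch below computes both in plain Python. It is not Seurat's implementation; the function `dotplot_summary`, the cell names, cluster labels and expression values are all hypothetical.

```python
from collections import defaultdict

def dotplot_summary(expression, clusters, markers):
    """Return {(cluster, gene): (mean expression, fraction of cells expressing)}."""
    cells_by_cluster = defaultdict(list)
    for cell, cluster in clusters.items():
        cells_by_cluster[cluster].append(cell)

    summary = {}
    for cluster, cells in cells_by_cluster.items():
        for gene in markers:
            # Genes missing from a cell's record are treated as zero.
            values = [expression[cell].get(gene, 0.0) for cell in cells]
            mean = sum(values) / len(values)
            fraction = sum(v > 0 for v in values) / len(values)
            summary[(cluster, gene)] = (mean, fraction)
    return summary

# Hypothetical normalized expression values.
expression = {
    "c1": {"GFAP": 4.0},
    "c2": {"GFAP": 2.0, "MOBP": 0.0},
    "c3": {"MOBP": 3.0},
    "c4": {"MOBP": 5.0, "GFAP": 1.0},
}
clusters = {"c1": 0, "c2": 0, "c3": 1, "c4": 1}
summary = dotplot_summary(expression, clusters, markers=["GFAP", "MOBP"])
```

In this toy data, cluster 0 has high GFAP in every cell and no MOBP, so it would be read as astrocyte-like, while cluster 1 shows the opposite pattern.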
Single cell analysis: current challenges
• Biggest challenge: missing data (excess zeros), known as “dropout”
  - technical (not captured)
  - biological (really no expression)
  - sampling (sequencing just not deep enough)
  - can’t distinguish between these
  - dropout = largest source of variation
• How to deal with missing data?
  - Increase read depth
  - Impute the missing data based on clustered cells (DrImpute, CIDR, MAGIC, scImpute)
  - Impute the missing data based on bulk RNAseq data (SCRABBLE)
  - Use biological knowledge: gene-gene coexpression (netNMF-sc)
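To make the idea of imputing from clustered cells concrete, here is a toy sketch that replaces each zero with the mean of the non-zero values for that gene among cells in the same cluster. It is not the algorithm of DrImpute, CIDR, MAGIC or scImpute, which are considerably more sophisticated; the function `impute_zeros_by_cluster` and the data are made up for illustration.

```python
def impute_zeros_by_cluster(matrix, clusters):
    """Fill zeros with the cluster-wide mean of non-zero values.

    matrix:   {cell: {gene: value}}
    clusters: {cell: cluster_id}
    Genes with no non-zero value in a cluster are left at zero.
    """
    imputed = {cell: dict(values) for cell, values in matrix.items()}
    genes = {gene for values in matrix.values() for gene in values}
    for cluster in set(clusters.values()):
        members = [cell for cell in matrix if clusters[cell] == cluster]
        for gene in genes:
            observed = [
                matrix[cell].get(gene, 0.0)
                for cell in members
                if matrix[cell].get(gene, 0.0) > 0
            ]
            if not observed:
                continue  # nothing to borrow from within this cluster
            fill = sum(observed) / len(observed)
            for cell in members:
                if imputed[cell].get(gene, 0.0) == 0:
                    imputed[cell][gene] = fill
    return imputed

# Hypothetical values: c2 has a zero for APOE that its neighbours express.
matrix = {
    "c1": {"APOE": 4.0, "GFAP": 0.0},
    "c2": {"APOE": 0.0, "GFAP": 0.0},
    "c3": {"APOE": 6.0, "GFAP": 0.0},
}
clusters = {"c1": 0, "c2": 0, "c3": 0}
imputed = impute_zeros_by_cluster(matrix, clusters)
```

The APOE zero in `c2` becomes 5.0, while GFAP stays at zero because no cell in the cluster expresses it. The sketch also shows the limitation named on the slide: it fills every zero the same way and cannot tell a technical dropout from a true biological absence.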
Single cell analysis: current challenges
Explosion of methods and software, but best practices are not yet clear
https://github.com/seandavi/awesome-single-cell
• Doublet identification
  - demuxlet - [shell] - Multiplexed droplet single-cell RNA-sequencing using natural genetic variation
  - DoubletFinder - [R] - Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. BioRxiv
  - DoubletDecon - [R] - Cell-State Aware Removal of Single-Cell RNA-Seq Doublets. BioRxiv
  - DoubletDetection - [R, Python] - A Python3 package to detect doublets (technical errors) in single-cell RNA-seq count matrices. An R implementation is in development.
  - Scrublet - [Python] - Computational identification of cell doublets in single-cell transcriptomic data. BioRxiv
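Several of the tools listed share one idea: simulate artificial doublets by combining pairs of real cells, then flag real cells whose nearest neighbors are mostly artificial. The sketch below illustrates that idea only and does not reproduce any listed tool. It assumes already-normalized two-gene profiles, averages random pairs, and uses Euclidean distance; `doublet_scores`, its parameters and the data are hypothetical.

```python
import math
import random

def doublet_scores(profiles, n_artificial=10, k=3, seed=0):
    """Score each real cell by the fraction of its k nearest
    neighbours that are artificial doublets."""
    rng = random.Random(seed)
    artificial = []
    for _ in range(n_artificial):
        a, b = rng.sample(profiles, 2)  # pick two distinct real cells
        artificial.append([(x + y) / 2 for x, y in zip(a, b)])

    scores = []
    for i, cell in enumerate(profiles):
        # Distances to every other real cell (flag False) ...
        neighbours = [
            (math.dist(cell, other), False)
            for j, other in enumerate(profiles)
            if j != i
        ]
        # ... and to every artificial doublet (flag True).
        neighbours += [(math.dist(cell, fake), True) for fake in artificial]
        neighbours.sort()
        nearest = neighbours[:k]
        scores.append(sum(is_fake for _, is_fake in nearest) / k)
    return scores

# Hypothetical normalized profiles over two genes.
profiles = [
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.9, 0.0],  # cell type A
    [0.0, 1.0], [0.1, 0.9], [0.1, 1.0], [0.0, 0.9],  # cell type B
    [0.5, 0.5],                                      # resembles an A+B doublet
]
scores = doublet_scores(profiles)
```

Each score lies between 0 and 1, and higher values mean the cell sits among simulated doublets. Doublets formed from two cells of the same type look like ordinary cells and are not expected to stand out with this kind of approach.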
Single cell analysis: current challenges
• Assigning cell types to clusters of cells:
  - dimensionality reduction (tSNE, PCA, UMAP) -> unsupervised clustering -> annotation of clusters
• Use of marker genes
  - Known marker genes
  - Expression must be high enough to be measured (not always true for known cell surface markers)
  - Subjective (different researchers choose different markers)
  - Novel cell types?
• Use of annotated training data (e.g. reference atlas)
  - comparisons with annotated reference data using automatically chosen genes that optimally discriminate between cell types (scmap, SingleR)
  - allow the assignment of cells to an intermediate or unassigned type (CHETAH)
Challenge: human data often cluster by individual, rather than by cell type
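Reference-based annotation with an "unassigned" fallback can be sketched as follows: correlate a query profile with each reference cell-type profile, take the best match, and refuse to label the cell if even the best correlation is weak. This is a simplified illustration and not the algorithm of scmap, SingleR or CHETAH; the function names, the 0.8 cutoff, the three-gene reference and the query profiles are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def annotate(profile, reference, min_correlation=0.8):
    """Return the best-matching reference type, or "unassigned"."""
    best_type, best_r = max(
        ((cell_type, pearson(profile, ref)) for cell_type, ref in reference.items()),
        key=lambda pair: pair[1],
    )
    return best_type if best_r >= min_correlation else "unassigned"

# Hypothetical reference profiles over the genes (GFAP, MOBP, SNAP25).
reference = {
    "astrocyte": [9.0, 1.0, 1.0],
    "oligodendrocyte": [1.0, 9.0, 1.0],
    "neuron": [1.0, 1.0, 9.0],
}

clear_label = annotate([8.0, 2.0, 1.0], reference)      # close to the astrocyte profile
ambiguous_label = annotate([5.0, 5.0, 1.0], reference)  # halfway between two types
```

The first query correlates strongly with the astrocyte reference and is labeled accordingly. The second correlates only 0.5 with both astrocyte and oligodendrocyte, so it is left unassigned rather than forced into a type.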
Single cell analysis: current challenges
• How to combine datasets for analysis:
  - scmap: projection of single-cell RNA-seq data across data sets
  - scMerge: uses genes that do not change across samples, together with a robust algorithm, to infer pseudoreplicates between datasets
Single cell analysis: current challenges
Look to advances in single cell RNA-seq cancer research for solutions to these problems