Single cell RNAseq Kathie Mihindukulasuriya, PhD Senior Scientist, Cruchaga Lab

Single cell RNAseq Kathie Mihindukulasuriya, PhD Senior Scientist, Cruchaga Lab Department of Psychiatry Washington University in St. Louis

Plan:
• Single cell RNA-seq vs bulk RNA-seq
• Current single cell protocols and platforms
• Processing single cell RNA-seq data
• Biology based analysis
• Current challenges in single cell RNA-seq processing and analysis

Bulk RNAseq vs single cell RNAseq

What are some types of questions that can be answered by scRNAseq?

Fluidigm C1

Droplet-based
Methods of single-cell isolation:
• Limiting dilution: not very efficient
• Micromanipulation: time consuming; low throughput
• FACS: highly purified single cells, if the cells express a cell surface marker

Droplet-based
Methods of single-cell isolation:
• Microfluidic technology
  - low sample consumption
  - low analysis cost
  - precise fluid control
  - decreased risk of external contamination
• CellSearch
  - antibody conjugated to magnetic particles to isolate desired cells
  - good for rare cell types
• Laser capture microdissection
  - isolate cells from solid samples

Droplet-based
cell lysis -> reverse transcription into first-strand cDNA -> second-strand synthesis -> cDNA amplification
• UMIs:
  - 4–10 random nucleotides that are introduced with the primer used for cDNA generation before amplification
  - multiple reads with the same UMI sequence that map to the same gene = one molecule
• Cell barcodes:
  - labeling of cDNA with a cell-specific DNA sequence, which allows multiplexing at an early stage
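The UMI rule on this slide (reads sharing a UMI and a gene count as one molecule) can be sketched in a few lines. This is a minimal illustration, not the method of any particular pipeline: the read tuples, barcodes, and UMI sequences are made up, and it assumes UMIs are error-free, whereas real tools also correct sequencing errors in UMIs.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell_barcode, umi, gene); all values are made up.
reads = [
    ("AAAC", "GTTCA", "APOE"),
    ("AAAC", "GTTCA", "APOE"),  # PCR duplicate of the read above
    ("AAAC", "CCGTA", "APOE"),  # same gene, different UMI: a second molecule
    ("AAAC", "GTTCA", "GFAP"),  # same UMI, different gene: counted separately
    ("TTGG", "GTTCA", "APOE"),  # different cell barcode: a different cell
]

def count_molecules(reads):
    """Count distinct UMIs per (cell barcode, gene) pair."""
    umis_seen = defaultdict(set)
    for cell, umi, gene in reads:
        umis_seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}

counts = count_molecules(reads)
print(counts)
```

Five reads collapse to four molecules here, because the two identical APOE reads in cell AAAC are treated as amplification copies of one original transcript.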

Plate-based Template Switching Oligonucleotide

Processing scRNA-seq data
• Map reads to genome, not transcriptome
• Decreases multi-mapping reads
• Critical for snRNA-seq
• Splice-aware aligners (STAR)
• Pseudoaligners (faster)
• Associate reads with genes or transcripts
  - featureCounts
  - HTSeq
• Remove PCR noise using UMIs
• Demultiplex to identify cells
• Remove barcodes from cell-free mRNA (much lower average read count than barcodes derived from intact cells)
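The last step above relies on cell-free (ambient) barcodes having far fewer reads than barcodes from intact cells. A minimal sketch of that idea follows. The barcode totals are invented, and the cutoff rule (10% of the largest barcode total) is an assumption chosen for illustration; dedicated tools use more careful knee-point or statistical models.

```python
# Hypothetical total UMI counts per barcode (made-up numbers).
barcode_totals = {
    "AAAC": 5200, "TTGG": 4800, "CCAT": 3900,  # likely intact cells
    "GGTA": 45, "ACGT": 30, "TGCA": 12,        # likely cell-free mRNA
}

def keep_cell_barcodes(totals, fraction_of_top=0.1):
    """Keep barcodes whose total is at least fraction_of_top of the largest total.

    fraction_of_top is an illustrative assumption, not a recommended value.
    """
    cutoff = fraction_of_top * max(totals.values())
    return {barcode for barcode, n in totals.items() if n >= cutoff}

cell_barcodes = keep_cell_barcodes(barcode_totals)
print(sorted(cell_barcodes))
```

With these numbers the cutoff is 520 counts, so only the three high-count barcodes are kept.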

Processing scRNA-seq data
• Remove low-quality ‘cells’ based on mapping statistics:
  - overrepresentation of mitochondrial RNAs, ribosomal RNAs (>40%), spike-ins, adapters
  - and/or reads that map outside of exons
• Normalization to correct for unwanted variation among cells caused by technical variation
• Remove batch effects
• Biology-based analysis (like differential expression)
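The QC and normalization steps can be sketched as follows. The counts are made up, the "MT-" prefix for mitochondrial genes is an assumption about gene naming, the 0.40 cutoff borrows the ">40%" figure from the slide and applies it to the mitochondrial fraction as an illustration, and the library-size scaling to 10,000 followed by log1p is one common convention rather than the only option.

```python
import math

# Hypothetical raw counts per cell; the "MT-" prefix marks mitochondrial genes.
raw_counts = {
    "cell_1": {"APOE": 30, "GFAP": 10, "MT-CO1": 10},
    "cell_2": {"APOE": 5, "GFAP": 0, "MT-CO1": 45},  # mostly mitochondrial reads
}

def mito_fraction(counts):
    """Fraction of a cell's counts that come from mitochondrial genes."""
    total = sum(counts.values())
    mito = sum(n for gene, n in counts.items() if gene.startswith("MT-"))
    return mito / total if total else 0.0

def log_normalize(counts, scale=10_000):
    """Scale each cell to the same total, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(n / total * scale) for gene, n in counts.items()}

MAX_MITO = 0.40  # illustrative cutoff, assumed from the slide's ">40%"
kept = {name: c for name, c in raw_counts.items() if mito_fraction(c) <= MAX_MITO}
normalized = {name: log_normalize(c) for name, c in kept.items()}
print(sorted(kept), normalized)
```

Here cell_2 is dropped because 90% of its counts are mitochondrial, and cell_1 is rescaled so that its expression values are comparable with cells sequenced to a different depth.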

Some examples of biology-based analysis
Purpose: to directly investigate AD brain changes in cell proportion and gene expression at single-cell resolution
Del-Aguila, J.L. et al. A single-nuclei RNA sequencing study of Mendelian and sporadic AD in the human brain. bioRxiv. Mar. 30, 2019. doi: http://dx.doi.org/10.1101/593756.

To identify different cell types in brain samples, cells were clustered with a CGS approach (unsupervised graph-based clustering) and the clusters were then annotated by cell type using marker genes
t-distributed Stochastic Neighbor Embedding (tSNE) is a dimensionality reduction technique
Differences with PCA:
1. tSNE is normally used to produce a 2D (or 3D) embedding
2. tSNE is non-deterministic (you won't get exactly the same output each time you run it)
3. tSNE tends to cope better with non-linear signals in your data (less impact of outliers; visible separation between relevant groups is improved)
4. After tSNE, input features are no longer identifiable, and you cannot make any inference based only on the output of tSNE
NOTE: very computationally intensive (may need to apply another dimensionality reduction technique like PCA first)

To identify different cell types in brain samples:
Classic Gene Set (CGS) from pooled subjects:
Seurat FindVariableGenes -> 2,360 genes -> calculate 100 PCs -> identify the optimal number of PCs (65)
• 25 clusters
• 6 cell types

To identify different cell types in brain samples:
Consensus Gene Set (ConGen) from each subject:
Seurat FindVariableGenes -> 2,447 (S1); 2,354 (S2); 1,972 (S3) genes -> R set intersection to identify common genes (1,434) -> calculate 100 PCs -> identify the optimal number of PCs (25)
• 14 cell types; better resolution
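The intersection step of the consensus approach amounts to keeping only the variable genes found in every subject. A minimal sketch in Python, with short made-up gene lists standing in for the roughly 2,000 genes per subject (the study itself used R):

```python
# Hypothetical per-subject variable gene lists; real lists hold ~2,000 genes each.
variable_genes = {
    "S1": ["APOE", "GFAP", "SNAP25", "MBP", "CSF1R"],
    "S2": ["GFAP", "SNAP25", "MBP", "PDGFRA"],
    "S3": ["SNAP25", "MBP", "GFAP", "CLDN5"],
}

def consensus_gene_set(genes_by_subject):
    """Return the genes that are variable in every subject, sorted by name."""
    gene_sets = [set(genes) for genes in genes_by_subject.values()]
    return sorted(set.intersection(*gene_sets))

consensus = consensus_gene_set(variable_genes)
print(consensus)
```

Genes that vary in only one subject (APOE, CSF1R, PDGFRA, CLDN5 in this toy input) are dropped, which is one plausible reason the consensus set is less driven by individual-specific variation.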

Cluster annotation
Evaluate the expression of marker genes (from the literature) for neurons, astrocytes, oligodendrocytes, microglia, oligodendrocyte precursor cells, endothelial cells, and excitatory and inhibitory neurons -> Seurat DotPlot to visualize the average gene expression for the marker genes in each cluster
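The numbers behind a dot plot can be sketched without any plotting: for each cluster and marker gene, compute the average expression and the fraction of cells with non-zero expression. The cells, cluster labels, and expression values below are made up, and the fraction-expressing column is an addition that dot plots commonly encode as dot size; this is not Seurat's implementation.

```python
from collections import defaultdict

# Hypothetical cells as (cluster label, {gene: normalized expression}).
cells = [
    (0, {"SNAP25": 2.0, "GFAP": 0.0}),
    (0, {"SNAP25": 4.0, "GFAP": 0.0}),
    (1, {"SNAP25": 0.0, "GFAP": 3.0}),
    (1, {"SNAP25": 0.0, "GFAP": 1.0}),
    (1, {"SNAP25": 0.0, "GFAP": 0.0}),
]
markers = ["SNAP25", "GFAP"]

def dotplot_summary(cells, markers):
    """Return {(cluster, gene): (mean expression, fraction of cells expressing)}."""
    by_cluster = defaultdict(list)
    for cluster, expression in cells:
        by_cluster[cluster].append(expression)
    summary = {}
    for cluster, members in by_cluster.items():
        for gene in markers:
            values = [m.get(gene, 0.0) for m in members]
            mean = sum(values) / len(values)
            fraction = sum(v > 0 for v in values) / len(values)
            summary[(cluster, gene)] = (mean, fraction)
    return summary

summary = dotplot_summary(cells, markers)
for key in sorted(summary):
    print(key, summary[key])
```

In this toy input, cluster 0 expresses the neuronal marker SNAP25 and cluster 1 the astrocyte marker GFAP, which is the kind of pattern used to assign a cell type to each cluster.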

Workflow Analysis Plan

Single cell analysis: current challenges
• Biggest challenge: missing data (excess zeros), "dropout"
  - technical (not captured)
  - biological (really no expression)
  - sampling (just not deep enough sequencing)
  - can’t distinguish between these
  - dropout = largest source of variation
• How to deal with missing data?
  - Increase read depth
  - Impute the missing data based on clustered cells (DrImpute, CIDR, MAGIC, scImpute)
  - Impute the missing data based on bulk RNAseq data (SCRABBLE)
  - Use biological knowledge: gene-gene coexpression (netNMF-sc)
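The general idea of cluster-based imputation can be sketched as replacing a zero with the mean of the non-zero values for the same gene in the same cluster. This is a deliberately naive illustration with made-up values, not the algorithm of DrImpute, CIDR, MAGIC, or scImpute, and it shares the limitation noted on the slide: it cannot tell a technical zero from a true biological zero.

```python
# Hypothetical expression of one gene across five cells, with cluster labels.
expression = {"c1": 4.0, "c2": 0.0, "c3": 6.0, "c4": 0.0, "c5": 0.0}
cluster_of = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B"}

def impute_zeros_by_cluster(expression, cluster_of):
    """Replace zeros with the mean of non-zero values from the same cluster.

    Zeros are left unchanged when no cell in the cluster expresses the gene.
    """
    nonzero = {}
    for cell, value in expression.items():
        if value > 0:
            nonzero.setdefault(cluster_of[cell], []).append(value)
    imputed = {}
    for cell, value in expression.items():
        observed = nonzero.get(cluster_of[cell])
        if value == 0 and observed:
            imputed[cell] = sum(observed) / len(observed)
        else:
            imputed[cell] = value
    return imputed

imputed = impute_zeros_by_cluster(expression, cluster_of)
print(imputed)
```

Cell c2 receives the cluster A mean of 5.0, while c4 and c5 stay at zero because nothing in cluster B expresses the gene, which is treated here as evidence of real absence.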

Single cell analysis: current challenges
Explosion of methods and software, but best practices are not yet clear
https://github.com/seandavi/awesome-single-cell
• Doublet Identification
  • demuxlet - [shell] - Multiplexed droplet single-cell RNA-sequencing using natural genetic variation
  • DoubletFinder - [R] - Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. BioRxiv
  • DoubletDecon - [R] - Cell-State Aware Removal of Single-Cell RNA-Seq Doublets. BioRxiv
  • DoubletDetection - [R, Python] - A Python3 package to detect doublets (technical errors) in single-cell RNA-seq count matrices. An R implementation is in development.
  • Scrublet - [Python] - Computational identification of cell doublets in single-cell transcriptomic data. BioRxiv

Single cell analysis: current challenges
• Assigning cell types to clusters of cells:
  - dimensionality reduction (tSNE, PCA, UMAP) -> unsupervised clustering -> annotation of clusters
• Use of marker genes
  - Known marker genes
  - Expression high enough to be measured (not always true for known cell surface markers)
  - Subjective (different researchers choose different markers)
  - Novel cell types?
• Use of annotated training data (e.g. reference atlas)
  - comparisons with annotated reference data using automatically chosen genes that optimally discriminate between cell types (scmap, SingleR)
  - allow the assignment of cells to an intermediate or unassigned type (CHETAH)
Challenge: human data often clusters by individual, rather than cell type
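Reference-based annotation with an "unassigned" outcome can be sketched as: correlate a cell's profile with each annotated reference profile, take the best match, and refuse to assign if the best correlation is too low. The reference profiles, gene order, and the 0.7 threshold are all assumptions made for illustration; this is not the algorithm used by scmap, SingleR, or CHETAH.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

# Hypothetical reference profiles over the genes [SNAP25, GFAP, MBP, CSF1R].
reference = {
    "neuron": [9.0, 1.0, 1.0, 0.0],
    "astrocyte": [1.0, 9.0, 1.0, 0.0],
    "oligodendrocyte": [1.0, 1.0, 9.0, 0.0],
}

def assign_cell_type(profile, reference, min_corr=0.7):
    """Return the best-correlated reference type, or 'unassigned' below min_corr."""
    best_type, best_corr = max(
        ((name, pearson(profile, ref)) for name, ref in reference.items()),
        key=lambda pair: pair[1],
    )
    return best_type if best_corr >= min_corr else "unassigned"

print(assign_cell_type([8.0, 0.0, 2.0, 0.0], reference))  # clear neuronal profile
print(assign_cell_type([5.0, 5.0, 0.0, 0.0], reference))  # mixed profile
```

The mixed profile matches neuron and astrocyte equally and neither well enough, so it is left unassigned instead of being forced into a known type. That is the behaviour the slide attributes to methods that allow intermediate or unassigned calls.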

Single cell analysis: current challenges
• How to combine datasets for analysis:
  - scmap: projection of single-cell RNA-seq data across data sets
  - scMerge: uses genes that do not change across all samples and a robust algorithm to infer pseudoreplicates between datasets

Single cell analysis: current challenges
Look to advances in single cell RNA-seq cancer research for solutions to these problems
