Course Expectations High-throughput sequencing technology Very large datasets

6/13/19

Goals for the course • Understand how next-generation sequencing technologies are used in biomedical research • Learn how to use publicly available databases/websites to find specific information about genes • Learn how to analyze gene lists to form hypotheses that can be tested experimentally • Learn to write a results section for a manuscript

Logistics • Course website: • http://biochem.slu.edu/bchm628/ • Contact: • Phone: 977-8858 • Email: Maureen.donlin@health.slu.edu • Office – DRC 503 • Call or email. • Usually at WashU on Thursday afternoons • Lab – DRC 551

Grading • Grading: • Exercises 65 % • Final exam 25 % • Class attendance 10 % • Grading policy handout • Details about late assignment and tests • Computer exercise handout • Example computer exercise

Exercise format • There will be 7 exercises, each consisting of 2-4 sections which represent a biological question to be answered with bioinformatics tools/resources from that week or earlier weeks. • You’ll provide the answer in the same format as you would write for the results section of a paper • Why did you do this experiment or analysis? • What did you actually do? • What did you observe? • What does it mean? • Include supporting data • Figures with figure legends • Correctly formatted tables of data.

Exercise due dates • Exercises handed out on Thursday are due the following Wednesday at 4:00 pm. • Exercises handed out on Tuesday are due on Friday of the same week at 4:00 pm. • All exercises are to be sent by email. • There is a penalty for turning in your exercises after the deadline. • The timestamp on your email is the final determination of whether an exercise is on time or not.

Exercises, cont • Exercise in Word or PDF format • Supplemental data in Excel, Word or PDF format. • The exercise should print in portrait orientation. • The exercise should include a header with your name at the top. • All files should have a name that includes your last name: • Your Name-Ex # or Name_SuppData#

Final project • This will be a project summary of the analyses that you will have done in exercises 1-7. • You will be asked to choose 3 genes from your gene lists that you would follow-up on at the bench. • You will be asked to give a rationale for making the choices that you did. • You will analyze the three genes virtually using some of the tools from the exercises • You will be asked to propose hypothetical bench experiments for the genes • Final project will be due July 18th at 4:00 pm.

A few tips on data presentation

Data tables • Columns describe attributes. • Rows contain the individual data. • The first row contains a header. • If you have lots of data, it is generally formatted to have more rows than columns. • Example captions: Table 1: Gene expression for WT cells under conditions X, Y, Z. Table 2: Comparison of clinical parameters for groups 1 and 2. • Example footnotes: 1 Statistical significance was determined by a Mann-Whitney test. 2 Statistical significance was determined by a 2-tailed t-test.

Data tables, cont • For the purposes of this class, the tables should be formatted to fit onto a letter size page in portrait orientation. • If your table is so wide that it forces the page into landscape orientation, then it should be included as a supplemental attachment to the exercise. If the table extends past 1 page, then include it as a supplemental attachment. • Refer to supplemental tables in your write-up, and number them and name the files as YourName_SuppTable1, etc. • Supplemental tables can be in Excel format.

Figures • Export the figure from whatever program in jpeg or png format; those can be inserted into a Word document easily. • PDFs can be converted to other formats using Illustrator • There are some online converters • http://www.wikihow.com/Convert-PDF-to-JPEG • Screen capture and placement may also work. • Talk to me if you have issues. • Super high resolution is not necessary.

Figures, cont. • Figures should have figure legends. The figure legends should describe the experiment that led to the data in the figure and include an explanation for any symbols used. • Figures should be numbered consecutively and should not take up more than ¼ of the page. If larger than that, include as supplemental data. • Create a text box in Word, write the figure legend and then insert the figure above the figure legend. This will allow you to resize as necessary. • Again, talk to me if you have issues.

Remainder of this lecture • Overview of sequencing a genome • Next generation sequencing • High-throughput experiments by sequencing • Biomolecular databases

Genome sequencing Approach depends on the source, size, complexity and goals for a given organism • Goal? • De novo sequencing • Re-sequencing for annotation • Sequencing to identify variations • Size and complexity • Virus, bacterial, single-celled eukaryote, mammal, plant • Quasi-species or repetitive sequences • Sample prep • Can it be cultured? • Tissue source: unlimited or limited quantities? • Virus levels, RNA or DNA

Genome sizes • Homo sapiens: 3 Gbp • Hepatitis C virus: 10,000 bp • Arabidopsis thaliana: 135 Mbp • Saccharomyces cerevisiae: 12 Mbp • Axolotl: 32 Gbp
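
To give a feel for what these genome sizes mean in practice, here is a minimal sketch that estimates how many reads a sequencing run would need for each organism. It uses the simple relation coverage = (number of reads × read length) / genome size. The 150 bp read length and the 30x coverage target are illustrative assumptions, not course requirements, and the function name `reads_needed` is made up for this example.

```python
# Genome sizes in base pairs, taken from the slide above.
GENOME_SIZES_BP = {
    "Homo sapiens": 3_000_000_000,
    "Hepatitis C virus": 10_000,
    "Arabidopsis thaliana": 135_000_000,
    "Saccharomyces cerevisiae": 12_000_000,
    "Axolotl": 32_000_000_000,
}


def reads_needed(genome_size_bp, read_length_bp=150, coverage=30):
    """Smallest number of reads with reads * read_length >= coverage * genome size.

    read_length_bp and coverage are assumed defaults chosen for illustration.
    """
    total_bases = coverage * genome_size_bp
    # Integer ceiling division, so the result is always a whole number of reads.
    return (total_bases + read_length_bp - 1) // read_length_bp


if __name__ == "__main__":
    for organism, size_bp in GENOME_SIZES_BP.items():
        print(f"{organism}: {reads_needed(size_bp):,} reads")
```

The estimate ignores uneven coverage, duplicate reads and repetitive regions, so real projects usually need more sequencing than this lower bound suggests.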

Types of sequencing • Throughput • Accuracy • Read length • Cost • Library prep

Sanger sequencing technology • Figure panels: 1970s through 1980s; 1990s to today • Example read: >SEQ ATAGCCGTACTTAGCTGAGGAGTCGATAAC • Long read lengths (500-900 bp) and >99.9% correct • Need to clone or PCR amplify the DNA to obtain enough for the sequencing reaction; no library preparation • Summary: very high accuracy, relatively long reads, very low throughput
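
The example read on the slide is written in FASTA style: a header starting with ">" followed by the sequence. As a minimal sketch of how such a record can be handled in code, the helpers below parse FASTA text and report the read length and GC content. The function names `parse_fasta` and `gc_fraction` are made up for this example; in real work a library such as Biopython would normally be used instead.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after ">".
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {rec_name: "".join(parts) for rec_name, parts in records.items()}


def gc_fraction(sequence):
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


# The example read from the slide.
EXAMPLE = ">SEQ\nATAGCCGTACTTAGCTGAGGAGTCGATAAC\n"

if __name__ == "__main__":
    for rec_name, seq in parse_fasta(EXAMPLE).items():
        print(rec_name, len(seq), "bp", f"GC = {gc_fraction(seq):.2f}")
```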

Illumina NGS • Read lengths up to 150 bp • Need to make barcoded libraries, which can be technically challenging • Longer run time • Very high throughput: up to 1 TB of data per run • High accuracy • Short reads

PacBio • Single molecule sequencing • Very long read lengths (up to 10 kb) • High error rate, but errors are stochastic and can be dealt with by multiple passes • No cloning • Summary: very long read lengths, low-medium accuracy, medium throughput

Ion Torrent: semi-conductor sequencing

Illumina vs. Ion Torrent • Illumina has greater capacity but longer run times • Ion Torrent has longer read lengths (~200 bp) • Library prep similar to Illumina in complexity • SLU has an Ion Torrent machine • Cost is ~$250/sample, including the sequencing • Get strand-specific sequencing without additional library prep

Oxford nanopore sequencing • A protein nanopore is set in an electrically resistant polymer membrane. • An ionic current is passed through the nanopore by applying a voltage across the membrane. • As the analyte passes through the pore, it creates a characteristic disruption in the current that differs depending on the base. • MinION flow cell: attaches directly to a computer for data analysis • Long read lengths • High error rate • No cloning, and direct RNA sequencing is possible • Medium to high throughput • Cheapest of the current technologies, with the simplest library prep

Bioinformatics challenges • Each flow cell in the Illumina HiSeq 2500 can generate a billion bases of sequence • Raw read files are TB in size • Processed read files are 700-800 MB • Alignment files are 150-300 MB • Assembly of millions of short (75-100 bp) reads into a vertebrate genome • Need a high-performance computing (HPC) cluster for vertebrate-sized genomes* • What biomolecular species to interrogate? • 25,000 genes • 160,000 transcripts • miRNA, non-coding RNA
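
The file sizes above add up quickly across an experiment. The sketch below works through the arithmetic for a hypothetical 24-sample study. It assumes the quoted processed-read and alignment sizes apply per sample, which the slide does not state explicitly, and the function name `storage_range_gb` is made up for this example.

```python
# Size ranges in megabytes, taken from the slide above and
# ASSUMED here to be per-sample figures.
PROCESSED_READS_MB = (700, 800)
ALIGNMENT_MB = (150, 300)


def storage_range_gb(n_samples):
    """Return (low, high) storage in GB for processed reads plus alignments.

    Raw read files are excluded because the slide only describes them
    as terabyte scale without giving a per-sample figure.
    """
    low_mb = n_samples * (PROCESSED_READS_MB[0] + ALIGNMENT_MB[0])
    high_mb = n_samples * (PROCESSED_READS_MB[1] + ALIGNMENT_MB[1])
    return low_mb / 1000, high_mb / 1000


if __name__ == "__main__":
    low, high = storage_range_gb(24)
    print(f"24 samples: {low:.1f}-{high:.1f} GB before counting raw reads")
```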

Sequencing has become a standard technique • RNA sequencing for expression • ChIP sequencing for TF site identification • DNA sequencing for variants • Identification of populations/genetic changes in highly variable viruses and bacteria • Single cell RNA sequencing (Rich DiPaolo) • Metagenomics • Identification of unknown/non-culturable communities of bacteria/viruses/fungi

Which technology? • De novo genome sequencing and assembly: • Combination of Illumina and PacBio • Or PacBio and Nanopore • Resequencing for variant analysis • Illumina, Ion Torrent (smaller genomes) • RNAseq: • Illumina or Ion Torrent • Nanopore for direct RNA sequencing (no cDNA step) • Exome sequencing • Illumina or Ion Torrent • Metagenomics (16S sequencing) • Nanopore

Where is all this data stored? • National Center for Biotechnology Information (NCBI): • >45,000 genomes • >26,000 RNAseq datasets • European Bioinformatics Institute (EBI): • >51,000 genomes • No simple web interface to expression data • Joint Genome Institute (JGI, DOE funded): • >150,000 genomes • No expression data • Genome data by download or programming interface

Who analyzes all of this data? • Our ability to generate sequencing data is vastly outstripping our ability to analyze and annotate new genomes • Annotation: prediction or identification of all functional genetic elements, including protein-coding genes, ncRNAs, etc. • Two major public databases store genomes and either annotate them or validate their annotation

Main sequence archives https://www.ncbi.nlm.nih.gov/ https://useast.ensembl.org/index.html

Pros and Cons of different archives • NCBI: National Center for Biotechnology Information • Databases are well integrated • Well integrated with literature (PubMed) • EBI: European Bioinformatics Institute • Same base data as NCBI, but offers a different front end • Much better list-based searching • Not as well integrated with literature • Transcript variants differ from NCBI because of different annotation pipelines • UniProt • All protein information from EBI is hosted here

Exercise 1: Finding gene-related information • Genes are annotated by: • NCBI/EBI for certain organisms (Human, Chicken, Dog) • Organism-specific groups for model organisms like yeast, mouse, C. elegans, Drosophila • Genome sequencing centers • Ideally, genes have an official gene symbol and gene name, and that is what is used in manuscripts, etc. • In reality, genes are identified by different groups at the same time and given different names • The Human Genome Organization (HUGO), Mouse Genome Informatics (MGI), etc. are organizations that define official gene nomenclature

Exercise 1: Transcripts • Transcripts: • NCBI & EBI use different computational pipelines for predicting and annotating transcripts • There can be differences between them, but typically at least the verified transcripts agree • Transcript isoforms have different accession numbers • BRCA1 has 4 transcript isoforms:

Human TDP-43 gene • HGNC: HUGO Gene Nomenclature Committee • RNA binding protein with a role in neurodegenerative disorders • HGNC official symbol: TARDBP • HGNC official name: TAR DNA binding protein • However: a search of PubMed using TARDBP returns 434 records, while using TDP-43 returns 2543 records • Use the most common name, but refer to the official gene symbol at least once in your manuscript, especially in the abstract
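
The PubMed comparison above can also be repeated from a script. The sketch below uses the NCBI E-utilities `esearch` endpoint with JSON output to count the records matching a term. The helper names (`build_pubmed_query_url`, `parse_record_count`, `pubmed_record_count`) are made up for this example. The counts returned today will differ from the 2019 numbers on the slide because PubMed keeps growing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(term):
    """Build an esearch URL that asks PubMed for the hit count of a term."""
    # retmax=0 requests the count only, without a list of record IDs.
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": 0}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_record_count(response_text):
    """Extract the record count from an esearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])


def pubmed_record_count(term):
    """Query PubMed over the network and return the number of matching records."""
    with urlopen(build_pubmed_query_url(term), timeout=30) as response:
        return parse_record_count(response.read().decode("utf-8"))
```

Calling `pubmed_record_count("TARDBP")` and `pubmed_record_count("TDP-43")` requires an internet connection. NCBI publishes usage guidelines for E-utilities, so keep scripted requests infrequent.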

Genome viewers • Provide chromosomal context for the gene(s) of interest • Show transcript variants in a graphical view • Have “tracks” of additional information: • Variants (SNPs) • Expression data • Repetitive sequences • Comparative data (with other species) • Allow download of genomic sequence • Ensembl genome viewer (useast.ensembl.org) • UCSC genome viewer (genome.ucsc.edu)

TARDBP (TDP-43) in UCSC Genome Browser • UCSC has its own annotation pipeline, so its annotated transcripts differ from those at NCBI and EBI

Take home points • Rapid, high-throughput sequencing has opened up new ways to interrogate biological systems • Generates Tb (Fb, Pb) of data • Need computers to find it and analyze it • Hypothesis generating (usually), with follow-up in bench experiments on a limited number of genes • Public databases are a treasure trove of data that can be mined with different questions

Today in computer lab • Exercise 1 is due on Wednesday, June 19th • Finding genes and transcripts using NCBI and EBI • Finding gene-specific information using NCBI & EBI • Visualization of genes and transcripts with the EBI and UCSC genome browsers • Extra credit: write a 500-700 word abstract on why you would study Lyme disease and/or B. burgdorferi

Source of data for the exercises • Lyme borreliosis (LB) • Ixodes scapularis • Borrelia burgdorferi • LB is the most prevalent arthropod-borne infectious disease in N. America • Photo credit: CDC. This media comes from the Centers for Disease Control and Prevention's Public Health Image Library (PHIL), with identification number #6631. • Carreras-Gonzalez A. et al. “A multi-omics analysis reveals the regulatory role of CD180 during the response of macrophages to Borrelia burgdorferi” Emerging Microbes and Infections (2018) 7: 1-13

A few background papers (optional) • Available for download or linked from the course website • https://www.ncbi.nlm.nih.gov/pubmed/27976670 “Lyme Borreliosis”, Nat. Rev. Dis. Primers (2016) • https://www.ncbi.nlm.nih.gov/pubmed/27900646 “The Potential of Omics Technologies in Lyme Disease Biomarker Discovery and Early Detection”, Infect. Dis. Ther. (2017) • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5986905/ “Ixodes Immune Responses Against Lyme Disease Pathogens”, Frontiers in Cell. and Infec. Microbiology (2018) • https://www.ncbi.nlm.nih.gov/books/NBK532894/ “Borrelia burgdorferi”, StatPearls (2018). Full text available.
