Learn how next-generation sequencing technologies are used in biomedical research, analyze gene lists, and write a results section. Course website: http://biochem.slu.edu/bchm628/
6/13/19 Course Expectations High-throughput sequencing technology Very large datasets
Goals for the course • Understand how next-generation sequencing technologies are used in biomedical research • Learn how to use publicly available databases/websites to find specific information about genes • Learn how to analyze gene lists to form hypotheses that can be tested experimentally • Learn to write a results section for a manuscript
Logistics • Course website: • http://biochem.slu.edu/bchm628/ • Contact: • Phone: 977-8858 • Email: Maureen.donlin@health.slu.edu • Office – DRC 503 • Call or email. • Usually at WashU on Thursday afternoons • Lab – DRC 551
Grading • Grading: • Exercises 65 % • Final exam 25 % • Class attendance 10 % • Grading policy handout • Details about late assignment and tests • Computer exercise handout • Example computer exercise
Exercise format • There will be 7 exercises, each consisting of 2-4 sections which represent a biological question to be answered with bioinformatics tools/resources from that week or earlier weeks. • You’ll provide the answer in the same format as you would write for the results section of a paper • Why did you do this experiment or analysis? • What did you actually do? • What did you observe? • What does it mean? • Include supporting data • Figures with figure legends • Correctly formatted tables of data.
Exercise due dates Exercises handed out on Thursday are due the following Wednesday at 4:00 pm Exercises handed out on Tuesday are due on Friday of the same week at 4:00 pm All exercises are to be sent by email There is a penalty for turning in your exercises after the deadline. The timestamp on your email is the final determination of whether an exercise is on-time or not.
Exercises, cont • Exercise in Word or PDF format • Supplemental data in Excel, Word or PDF format. • The exercise should print in portrait orientation. • The exercise should include a header with your name at the top. • All files should have a name that includes your last name: • Your Name-Ex # or Name_SuppData#
Final project • This will be a project summary of the analyses that you will have done in exercises 1-7. • You will be asked to choose 3 genes from your gene lists that you would follow-up on at the bench. • You will be asked to give a rationale for making the choices that you did. • You will analyze the three genes virtually using some of the tools from the exercises • You will be asked to propose hypothetical bench experiments for the genes • Final project will be due July 18th at 4:00 pm.
Data tables Table 1: Gene expression for WT cells under conditions X,Y, Z. Table 2: Comparison of clinical parameters for groups 1 and 2. 1 Statistical significance was determined by a Mann-Whitney test 2 Statistical significance was determined by 2-tailed t-test Columns describe attributes Rows contain the individual data. The first row contains a header. If you have lots of data, it is generally formatted to have more rows than columns.
Data tables, cont • For the purposes of this class, the tables should be formatted to fit onto a letter size page in portrait orientation. • If your table is so wide that it forces the page into landscape orientation, then it should be included as a supplemental attachment to the exercise. If the table extends past 1 page, then include it as a supplemental attachment. • Refer to supplemental tables in your write-up and number them and the file as YourName_SuppTable1, etc. • Supplemental tables can be in Excel format.
Figures • Export the figure from whatever program in jpeg or png format; those can be inserted into a Word document easily. • PDFs can be converted to other formats using Illustrator • There are some online converters • http://www.wikihow.com/Convert-PDF-to-JPEG • Screen capture and placement may also work. • Talk to me if you have issues. • Super high resolution is not necessary.
Figures, cont. • Figures should have figure legends. The figure legends should describe the experiment that led to the data in the figure and include an explanation for any symbols used. • Figures should be numbered consecutively and should not take up more than ¼ of the page. If larger than that, include as supplemental data. • Create a text box in Word, write the figure legend and then insert the figure above the figure legend. This will allow you to resize as necessary. • Again, talk to me if you have issues.
Remainder of this lecture Overview of sequencing a genome Next generation sequencing High-throughput experiments by sequencing Biomolecular databases
Genome sequencing Approach depends on the source, size, complexity and goals for a given organism • Goal? • De novo sequencing • Re-sequencing for annotation • Sequencing to identify variations • Size and complexity • Virus, bacterial, single-celled eukaryote, mammal, plant • Quasi-species or repetitive sequences • Sample prep • Can it be cultured? • Tissue source: unlimited or limited quantities? • Virus levels, RNA or DNA
Genome sizes • Homo sapiens 3 Bbp • Hepatitis C virus 10,000 bp • Arabidopsis thaliana 135 Mbp • Saccharomyces cerevisiae 12 Mbp • Axolotl 32 Bbp
Types of sequencing Throughput Accuracy Read-length Cost Library prep
Sanger sequencing technology 1970s thru 1980s: >SEQ ATAGCCGTACTTAGCTGAGGAGTCGATAAC 1990s to today: Long read lengths (500-900 bp) & >99.9% correct Need to clone or PCR amplify the DNA to obtain enough for sequencing reaction, no library preparation Very high accuracy, relatively long reads, very low throughput
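The `>SEQ` read shown on the Sanger slide is in FASTA format: a header line starting with `>` followed by the bases. As a minimal sketch (the function names `parse_fasta` and `gc_fraction` are illustrative, not part of the course materials), such a record could be read and summarized like this:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is everything after ">".
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line.upper()
    return records


def gc_fraction(seq):
    """Return the fraction of G and C bases in a sequence."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


# The example read from the slide.
example = ">SEQ\nATAGCCGTACTTAGCTGAGGAGTCGATAAC\n"
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), round(gc_fraction(seq), 3))
```

Real sequencing output usually arrives as FASTQ (with per-base quality scores), so this sketch only covers the plain FASTA case shown on the slide.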
Illumina NGS • Read lengths up to 150 bp • Need to make bar-coded libraries, which can be technically challenging • Longer run time • Very high throughput: up to 1 TB of data per run • High accuracy • Very high throughput • Short reads
PacBio Single molecule sequencing Very long read lengths (up to 10 Kb) High error rate, but stochastic and can be dealt with by multiple passes No cloning Very long read lengths, low-medium accuracy, medium throughput
Illumina vs Ion Torrent • Illumina has greater capacity but longer run times • Ion Torrent has longer read lengths (~200 bp) • Library prep similar to Illumina in complexity • SLU has an Ion Torrent machine • Cost is ~$250/sample, including the sequencing • Get strand specific sequencing without additional library prep
Oxford nanopore sequencing A protein nanopore is set in an electrically resistant polymer membrane. An ionic current is passed through it to generate a charge. As analyte passes through the pore it creates a characteristic disruption in current which is different depending on the base. MinION flow cell Attaches directly to computer for data analysis Long read lengths High error rate No cloning and direct RNA sequencing Medium to high-throughput Cheapest of the current technologies with simplest library prep
Bioinformatics challenges • Each flow cell in the Illumina HiSeq 2500 can generate a billion bases of sequence • Raw read files are Tb in size • Processed read files are 700-800 Mb • Alignment files 150-300 Mb • Assembly of millions of short (75-100 bp) reads into vertebrate genome • Need high-performance computing (HPC) cluster for vertebrate sized genomes* • What biomolecular species to interrogate? • 25,000 genes • 160,000 transcripts • miRNA, non-coding RNA
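To get a feel for the scale behind these numbers, the mean sequencing depth can be estimated as C = N × L / G (number of reads times read length, divided by genome size). The sketch below rearranges that to estimate how many reads a target depth requires. The 30x target and the function name `reads_for_coverage` are assumptions chosen for illustration; the genome sizes and the 100 bp read length come from the slides.

```python
def reads_for_coverage(genome_size_bp, read_length_bp, target_depth):
    """Estimate the number of reads needed to reach a mean depth.

    Uses C = N * L / G rearranged to N = C * G / L, rounded up.
    This ignores unmappable regions, duplicates, and uneven coverage.
    """
    total_bases = target_depth * genome_size_bp
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bases // read_length_bp)


# Genome sizes taken from the "Genome sizes" slide (in base pairs).
genomes = {
    "Hepatitis C virus": 10_000,
    "Saccharomyces cerevisiae": 12_000_000,
    "Arabidopsis thaliana": 135_000_000,
    "Homo sapiens": 3_000_000_000,
}

TARGET_DEPTH = 30   # assumed depth, for illustration only
READ_LENGTH = 100   # bp, upper end of the 75-100 bp range on the slide

for name, size in genomes.items():
    n_reads = reads_for_coverage(size, READ_LENGTH, TARGET_DEPTH)
    print(f"{name}: {n_reads:,} reads")
```

Under these assumptions a human-sized genome needs on the order of hundreds of millions of reads, which is why assembly and alignment at that scale call for a cluster and not a laptop.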
Sequencing has become a standard technique • RNA sequencing for expression • ChIP sequencing for TF site identification • DNA sequencing for variants • Identification of populations/genetic changes in highly variable viruses and bacteria • Single cell RNA sequencing (Rich DiPaolo) • Metagenomics • Identification of unknown/non-culturable communities of bacteria/viruses/fungi
Which technology? • De novo genome sequencing and assembly: • Combination of Illumina and PacBio • Or PacBio and Nanopore • Resequencing for variant analysis • Illumina, Ion Torrent (smaller genomes) • RNAseq: • Illumina or Ion Torrent • Nanopore for direct RNA sequencing (no cDNA step) • Exome sequencing • Illumina or Ion Torrent • Metagenomics (16S sequencing) • Nanopore
Where is all this data stored? • National Center for Biotechnology Information (NCBI): • >45,000 genomes • >26,000 RNAseq datasets • European Bioinformatics Institute (EBI): • >51,000 genomes • No simple web interface to expression data • Joint Genome Institute (JGI, DOE funded): • >150,000 genomes • No expression data • Genome data by download or programming interface
Who analyzes all of this data? Our ability to generate sequencing data is vastly outstripping our ability to analyze and annotate new genomes Annotation: prediction or identification of all functional genetic elements including protein coding genes, ncRNAs, etc. Two major public databases store and annotate or validate annotation for all genomes
Main sequence archives https://www.ncbi.nlm.nih.gov/ https://useast.ensembl.org/index.html
Pros and Cons of different archives • NCBI: National Center for Biotechnology Information • Databases are well integrated • Well integrated with literature (PubMed) • EBI: European Bioinformatics Institute • Same base data as NCBI, but offers different front-end • Much better list-based searching • Not as well integrated with literature • Transcript variants differ from NCBI because of different annotation pipelines • UNIPROT • All protein information from EBI is hosted here
Exercise 1: Finding gene related information • Genes are annotated by: • NCBI/EBI for certain organisms (Human, Chicken, Dog) • Organism specific groups for model organisms like yeast, mouse, C. elegans, Drosophila • Genome sequencing centers • Ideally, genes have an official gene symbol and gene name and that is what is used in manuscripts, etc. • In reality, genes are identified by different groups at same time and named different things • The Human Genome Organization (HUGO), Mouse Genome Informatics (MGI), etc. are organizations that define official gene nomenclature
Exercise 1: Transcripts • Transcripts: • NCBI & EBI use different computational pipelines for predicting and annotating transcripts • There can be differences between them but typically at least the verified transcripts agree • Transcript isoforms have different accession numbers • BRCA1 has 4 transcript isoforms:
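Because NCBI and EBI assign their own accession numbers, it helps to recognize which archive an identifier came from. The sketch below classifies an accession by its prefix. The prefix table covers only common NCBI RefSeq prefixes and the human Ensembl prefixes, and the accessions in the usage lines are made-up placeholders in the right format, not real BRCA1 records.

```python
# Accession prefixes and what they denote. RefSeq (NCBI) prefixes end in "_";
# the Ensembl prefixes listed here are the human ones only (other species
# insert a species code, e.g. mouse transcripts start with ENSMUST).
PREFIXES = {
    "NM_": ("NCBI RefSeq", "mRNA transcript"),
    "NR_": ("NCBI RefSeq", "non-coding RNA transcript"),
    "XM_": ("NCBI RefSeq", "predicted mRNA transcript"),
    "XR_": ("NCBI RefSeq", "predicted non-coding RNA transcript"),
    "NP_": ("NCBI RefSeq", "protein"),
    "ENSG": ("Ensembl", "gene"),
    "ENST": ("Ensembl", "transcript"),
    "ENSP": ("Ensembl", "protein"),
}


def classify_accession(accession):
    """Return the source, record type, base accession, and version."""
    # Versions follow a dot, e.g. "NM_123456.2" is version 2 of NM_123456.
    base, _, version = accession.strip().partition(".")
    for prefix, (source, kind) in PREFIXES.items():
        if base.upper().startswith(prefix):
            return {"source": source, "kind": kind,
                    "accession": base, "version": version or None}
    return {"source": "unknown", "kind": "unknown",
            "accession": base, "version": version or None}


# Hypothetical placeholder accessions, used only to show the format.
for acc in ["NM_123456.2", "XM_123456.1", "ENST00000123456"]:
    print(acc, "->", classify_accession(acc))
```

When comparing isoforms between the two archives, recording the version number along with the accession makes it clear exactly which sequence was analyzed.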
Human TDP-43 gene HGNC: HUGO gene nomenclature committee RNA binding protein with role in neurodegenerative disorders HGNC official symbol: TARDBP HGNC official name: TAR DNA binding protein However: A search of PubMed using TARDBP returns 434 records while using TDP-43 returns 2543 records Use the most common name but refer to the official gene symbol at least once in your manuscript and especially in the abstract
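The PubMed comparison above can also be scripted through NCBI's E-utilities `esearch` endpoint. The sketch below only builds the query URLs; the helper `fetch_pubmed_count` is defined but not called, and the JSON field names it reads (`esearchresult`, `count`) are an assumption that should be checked against the current E-utilities documentation. Counts returned today will differ from the 2019 numbers on the slide.

```python
import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="pubmed"):
    """Build an E-utilities esearch URL that asks only for the hit count."""
    params = {"db": db, "term": term, "rettype": "count", "retmode": "json"}
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def fetch_pubmed_count(term):
    """Query PubMed and return the number of matching records.

    Requires network access. The response layout is assumed, not verified.
    """
    with urllib.request.urlopen(esearch_url(term)) as response:
        data = json.load(response)
    return int(data["esearchresult"]["count"])


# Compare the official symbol with the common name used in the literature.
for term in ["TARDBP", "TDP-43"]:
    print(term, esearch_url(term))
    # To run the live query, uncomment the next line:
    # print(term, fetch_pubmed_count(term))
```

NCBI asks that scripted queries stay within its published rate limits, so a loop over a long gene list should pause between requests.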
Genome viewers • Provides chromosomal context to the gene(s) of interest • See transcript variants in graphical view • Have “tracks” of additional information: • Variants (SNPs) • Expression data • Repetitive sequences • Comparative data (with other species) • Download genomic sequence • Ensembl genome viewer (useast.ensembl.org) • UCSC genome viewer (genome.ucsc.edu)
TARDBP (TDP-43) in UCSC Genome Browser UCSC has its own annotation pipeline and so has different annotated transcripts
Take home points Rapid, high-throughput sequencing has opened up new ways to interrogate biological systems Generates Tb (Fb, Pb) of data Need computers to find it and analyze it Hypothesis generating (usually) with follow-up in bench experiments on limited number of genes Public databases are a treasure trove of data that can be mined with different questions
Today in computer lab Exercise 1 is due on Wednesday, June 19th Finding genes and transcripts using NCBI and EBI Finding gene specific information using NCBI & EBI Visualization of genes and transcripts with the EBI and UCSC genome browsers Extra credit: write a 500-700 word abstract on why you would study Lyme disease and/or B. burgdorferi
Source of data for the exercises Lyme Borreliosis Ixodes scapularis Borrelia burgdorferi LB most prevalent arthropod-borne infectious disease in N. America Photo Credit: Content Providers(s): CDC - This media comes from the Centers for Disease Control and Prevention's Public Health Image Library (PHIL), with identification number #6631. Carreras-Gonzalez A. et al. “A multi-omics analysis reveals the regulatory role of CD180 during the response of macrophages to Borrelia burgdorferi” Emerging Microbes and Infections (2018) 7: 1-13
A few background papers (optional) Available for download or linked from the course website https://www.ncbi.nlm.nih.gov/pubmed/27976670 “Lyme Borreliosis”, Nat. Rev. Dis. Primers (2016) https://www.ncbi.nlm.nih.gov/pubmed/27900646 “The Potential of Omics Technologies in Lyme Disease Biomarker Discovery and Early Detection”, Infect. Dis. Ther. (2017) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5986905/ “Ixodes Immune Responses Against Lyme Disease Pathogens” Frontiers in Cell. and Infec. Microbiology (2018) https://www.ncbi.nlm.nih.gov/books/NBK532894/ Borrelia Burgdorferi. StatPearls (2018). Full text available.