1 / 15

Genomics data

Genomics data. Production and databasing of genome and transcriptome sequences. Sequencing projects. Human genome : consortium of international centers Bacterial genomes : usually single centers Non-human eukaryotes C. elegans , yeast , Drosophila , Arabidopsis : finished

jimbo
Download Presentation

Genomics data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Genomics data Production and databasing of genome and transcriptome sequences

  2. Sequencing projects • Human genome: consortium of international centers • Bacterial genomes: usually single centers • Non-human eukaryotes • C. elegans, yeast, Drosophila, Arabidopsis: finished • In progress: mouse, Fugu, many parasites, ...

  3. Human Genome Project • Starting material for sequencing: genome or chromosome libraries • BAC, PAC, YAC, cosmid, fosmid… • Clone map based on STS content and end sequencing (STC) • Tiling path based on clone map • Sequencing based on tiling path

  4. Mapping resources • Genetic and physical maps: Genethon, GDB, NCBI • Radiation hybrid map: Sanger • BAC production & mapping: Oakland, Caltech, others • Clone information and retrieval: RZPD (Germany) • Physical maps in ACEDB format from chromosome coordinators

  5. Sequencing • Create shotgun library from BAC/PAC • Sequence individual clones to get a ten-fold coverage • Phases: • 0 = single sequence (like STS) • 1 = unordered contigs • 2 = ordered, oriented contigs • 3 = finished, annotated sequence

  6. Chromosome size sequences • Problem: full chromosomes or entire bacterial genomes are too long to fit the database entry specifications • Solution: split the sequence in overlapping “chunks” • New problem: have to reassemble chunks if you want to analyze the whole sequence • GenBank provides “meta-entries” (CON division) with assembly instructions

  7. What genome data are available? • Deposited in EMBL/GenBank • BAC/PAC sequences • Tile path contigs (overlapping chunks) • HTG sequences: phases 0,1,2 • Available on FTP sites • Full bacterial genomes • Large eucaryotic contigs • Web interfaces • Many, usually graphical and species-specific

  8. What is the transcriptome? • The set of all functional mRNAs that can be transcribed from the genome • The documentation of the localisation (cell type) and conditions under which these mRNAs are expressed • The documentation of the biological function(s) of each mRNA species

  9. The public draft transcriptome • “Full” cDNA sequences of well-characterized genes • “Full” cDNA sequences of genes of unknown function (e.g. KIAA or DKFZ collections) • Traditional ESTs from oligo(dT) primed libraries • ORESTES sequences • Limited set of expression data

  10. ESTs - the basics • cDNA libraries prepared from various tissues and cell lines, using directional cloning • Gridding of individual clones using robots • For each clone, sequencing of both ends of insert in single pass • Deposit readable part of sequence in database

  11. DNA Primary transcript Splice variants cDNA clones EST sequences 5’ 3’ 5’ 3’ Anatomy of EST sequences

  12. EST databases • EMBL/GenBank have separate sections for EST sequences • ESTs are the most abundant entries in the databases (>60%) • ESTs are now partially separated by species in the databases • EST sequences are submitted in bulk, but do have to meet minimal quality criteria

  13. The ORESTES project • Sponsors: Ludwig Institute for Cancer Research, FAPESP (São Paulo State funding agency) • Goal: to obtain EST sequences from the under-represented, often coding, central portions of mRNAs • Methodology: use low-stringency semi-random priming followed by PCR, producing low complexity libraries • Results: 1’000’000 ESTs produced, of which about half produce novel information

  14. Anatomy of ORESTES sequences DNA Primary transcript Splice variants cDNA clones= EST sequences

  15. How to organize EST collections? • Clustering: associate individual EST sequences with unique transcripts or genes • Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster • Mapping: associate ESTs (or EST contigs) with exons in genomic sequences • Interpreting: find and correct coding regions

More Related