150 likes | 299 Views
Genomics data. Production and databasing of genome and transcriptome sequences. Sequencing projects. Human genome : consortium of international centers Bacterial genomes : usually single centers Non-human eukaryotes C. elegans , yeast , Drosophila , Arabidopsis : finished
E N D
Genomics data Production and databasing of genome and transcriptome sequences
Sequencing projects • Human genome: consortium of international centers • Bacterial genomes: usually single centers • Non-human eukaryotes • C. elegans, yeast, Drosophila, Arabidopsis: finished • In progress: mouse, Fugu, many parasites, ...
Human Genome Project • Starting material for sequencing: genome or chromosome libraries • BAC, PAC, YAC, cosmid, fosmid… • Clone map based on STS content and end sequencing (STC) • Tiling path based on clone map • Sequencing based on tiling path
Mapping resources • Genetic and physical maps: Genethon, GDB, NCBI • Radiation hybrid map: Sanger • BAC production & mapping: Oakland, Caltech, others • Clone information and retrieval: RZPD (Germany) • Physical maps in ACEDB format from chromosome coordinators
Sequencing • Create shotgun library from BAC/PAC • Sequence individual clones to get a ten-fold coverage • Phases: • 0 = single sequence (like STS) • 1 = unordered contigs • 2 = ordered, oriented contigs • 3 = finished, annotated sequence
Chromosome size sequences • Problem: full chromosomes or entire bacterial genomes are too long to fit the database entry specifications • Solution: split the sequence in overlapping “chunks” • New problem: have to reassemble chunks if you want to analyze the whole sequence • GenBank provides “meta-entries” (CON division) with assembly instructions
What genome data are available? • Deposited in EMBL/GenBank • BAC/PAC sequences • Tile path contigs (overlapping chunks) • HTG sequences: phases 0,1,2 • Available on FTP sites • Full bacterial genomes • Large eucaryotic contigs • Web interfaces • Many, usually graphical and species-specific
What is the transcriptome? • The set of all functional mRNAs that can be transcribed from the genome • The documentation of the localisation (cell type) and conditions under which these mRNAs are expressed • The documentation of the biological function(s) of each mRNA species
The public draft transcriptome • “Full” cDNA sequences of well-characterized genes • “Full” cDNA sequences of genes of unknown function (e.g. KIAA or DKFZ collections) • Traditional ESTs from oligo(dT) primed libraries • ORESTES sequences • Limited set of expression data
ESTs - the basics • cDNA libraries prepared from various tissues and cell lines, using directional cloning • Gridding of individual clones using robots • For each clone, sequencing of both ends of insert in single pass • Deposit readable part of sequence in database
DNA Primary transcript Splice variants cDNA clones EST sequences 5’ 3’ 5’ 3’ Anatomy of EST sequences
EST databases • EMBL/GenBank have separate sections for EST sequences • ESTs are the most abundant entries in the databases (>60%) • ESTs are now partially separated by species in the databases • EST sequences are submitted in bulk, but do have to meet minimal quality criteria
The ORESTES project • Sponsors: Ludwig Institute for Cancer Research, FAPESP (São Paulo State funding agency) • Goal: to obtain EST sequences from the under-represented, often coding, central portions of mRNAs • Methodology: use low-stringency semi-random priming followed by PCR, producing low complexity libraries • Results: 1’000’000 ESTs produced, of which about half produce novel information
Anatomy of ORESTES sequences DNA Primary transcript Splice variants cDNA clones= EST sequences
How to organize EST collections? • Clustering: associate individual EST sequences with unique transcripts or genes • Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster • Mapping: associate ESTs (or EST contigs) with exons in genomic sequences • Interpreting: find and correct coding regions