Genomics Databases and Bioinformatics Applications
Wailap Victor Ng
Institute of Biotechnology in Medicine
Institute of Bioinformatics
Dept Biotechnology and Lab Science in Medicine
National Yang Ming University
wvng@ym.edu.tw
March 22, 2005
Goals of the Human Genome Project (1990 ~)
• Map and sequence the 3,000 Mb human genome
• Map and sequence the genomes of model organisms
  - The bacterium E. coli (4.6 Mb)
  - The yeast S. cerevisiae (12 Mb)
  - The roundworm C. elegans (100 Mb)
  - The fruit fly D. melanogaster (180 Mb)
  - The mouse M. musculus (3,000 Mb)
• Collect and distribute data
• Study the ethical, legal, and social implications of genetic research
• Train researchers
• Develop technologies
• Transfer technology to the private sector
• http://www.genome.gov/Pages/EducationKit/online.htm
Milestones of Genome Projects
• 1995 Haemophilus influenzae (1.83 Mb; 1,742 genes)
• 1996 Saccharomyces cerevisiae (12 Mb; 6,000 genes)
• 1998 Caenorhabditis elegans (97 Mb; 19,000 genes)
• 2000 Arabidopsis thaliana (115/125 Mb; 25,000 genes)
• 2000 Drosophila melanogaster (~120 Mb; 13,600 genes)
• 2001 Homo sapiens (90%; 2,900 Mb; ~30k genes)
• 2002 Mus musculus (96%; 2,500 Mb; ~30k genes)
• 2002 Oryza sativa L. ssp. indica (92%; 466 Mb; 46-56k genes)
• 2002 Fugu rubripes (95%; 365 Mb; 33,609 genes)
• 2004 H. sapiens (99% euchromatin; 2,850 Mb; 20,000-25,000 genes)
Homo sapiens
• Number of cells: ~1 x 10^14
• Number of genes encoded by the genome: 20,000-25,000
• Number of expressed genes per cell type: 10,000-15,000
Complexity: Genome → Transcriptome → Proteome
• Genome: 25-30K genes (human)
• Transcriptome (mRNAs): alternative splicing
• Proteome (proteins): post-translational modifications
Genome Sequencing Strategies
• Top-down approach
  - Clone large genomic DNA fragments into a special vector, e.g. BAC (bacterial artificial chromosome)
  - Create an ordered array of BAC clones
  - Carry out full-length BAC clone sequencing
  - Assemble the BAC insert sequences
  - Identify the next BAC for full-length sequencing (hybridization method or searching a BAC end sequence library)
• Bottom-up approach
  - Whole genome shotgun sequencing
Top-down genome sequencing method
• Method I. Systematic sequencing of ordered clones
• Construct a shotgun genomic library in a YAC (yeast artificial chromosome) or BAC vector
• Use the YAC or BAC clone DNAs to construct a smaller-insert shotgun cosmid DNA library (~45 kb inserts)
• Multiple Complete Digest (MCD) mapping of cosmid DNAs → ordered cosmid clone library
• Choose the minimal overlap set of cosmid DNAs to construct shotgun libraries in M13 or plasmid vector → DNA sequencing → Assembly
[Diagram: Genomic DNA → large insert library in BAC/YAC → medium insert library in cosmid → ordered cosmid library → small insert library in plasmid]
Multiple Complete Digest Mapping (YAC DNA) Flow chart of wet bench procedures for YAC → cosmid and BAC → cosmid MCD mapping. The main difference is that, while BAC DNA can readily be purified from bacterial chromosomal DNA, there is no good preparative method to separate YAC DNA from yeast chromosomal DNA. In the YAC case, the few percent of the cosmids that are derived from the YAC are identified by a hybridization-based colony-screening protocol. With BAC-derived cosmids, this step is unnecessary because the mapping software can readily eliminate the small number of cosmids that do not originate from the BAC. Proc Natl Acad Sci U S A. 94: 5225 (1997)
Schematic representation of the MCD mapping process.
(a) Gel image.
(b) List of fragment sizes for each enzyme domain in each clone. Lanes labeled with a number identify the clone as c01 or c02. Lanes labeled with the letter M identify size markers.
(c) Three single-enzyme maps are independently constructed (Right). Synchronization across enzyme domains results in a composite map (Left). Long tick marks indicate boundaries between ordered groups of fragments; short tick marks demarcate unordered fragments within a group, arbitrarily drawn in order of decreasing size.
Proc Natl Acad Sci U S A. 94: 5225 (1997)
Gray scale image of a typical mapping gel poststained with SYBR–green I. There are five marker lanes, at positions 1, 8, 15, 22, and 29. Two clones, each independently digested with EcoRI, HindIII, and NsiI (and loaded in that order) are placed between every pair of marker lanes. Proc Natl Acad Sci U S A. 94: 5225 (1997)
Representative MCD map from chromosome 7 Proc Natl Acad Sci U S A. 94: 5225 (1997)
Top-down genome sequencing method
• Method II. BAC by BAC sequencing
• Choose BAC clone seeds
• Construct a BAC shotgun library in plasmid vector
• Sequence the shotgun plasmid DNAs
• Assemble the shotgun reads
• Look for adjacent BAC clones for sequencing
  - By colony array hybridization or
  - BAC end sequence library
[Diagram: Genomic DNA → large insert library in BAC → small insert shotgun library in plasmid → DNA sequencing and assembly → identify neighboring BAC clones for sequencing]
BAC colony array hybridization assay
• Libraries: Genomic DNA → large insert library in BAC → small insert shotgun library in plasmid vector
• E. coli transformants
• Array E. coli on nylon membrane (25 x 25 cm2) and grow cells on agar plate
• Lyse E. coli colonies on nylon membrane
• Fix the DNA onto nylon membrane
• Hybridize with PCR amplified BAC end probes
• Autoradiogram
BAC colony array hybridization
[Diagram labels: BAC clone genomic DNA insert (sequenced); PCR-1; PCR-2; restriction fingerprinting]
How many reads are needed to determine a genome sequence?
• Usually ~8X coverage of each base pair
• # reads = (8 x genome_size) / (av._read_length)
• e.g. Haloarcula marismortui (4,274,315 bp)
• # reads = (8 x 4,274,315 bp) / (550 bp) = 62,172 sequencing reactions
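The read-count estimate above can be sketched in a few lines of Python. The function name `reads_needed` and its parameter names are illustrative assumptions, not part of the original slides; the formula and the Haloarcula marismortui numbers are taken from the slide.

```python
import math


def reads_needed(genome_size_bp, avg_read_length_bp, coverage=8.0):
    """Estimate the number of sequencing reads for a target coverage.

    Implements the slide's formula:
        # reads = (coverage x genome_size) / average_read_length
    The result is rounded up, since a fraction of a read is still one reaction.
    """
    if genome_size_bp <= 0 or avg_read_length_bp <= 0 or coverage <= 0:
        raise ValueError("genome size, read length and coverage must be positive")
    return math.ceil(coverage * genome_size_bp / avg_read_length_bp)


# Slide example: Haloarcula marismortui, 4,274,315 bp, 550 bp average read, 8X
print(reads_needed(4_274_315, 550))  # 62172
```

This is only a back-of-the-envelope figure: it assumes every read contributes its full average length and ignores failed reactions and uneven coverage.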
USB Principle of Sanger Dideoxy DNA sequencing http://genetics.nbii.gov/basic2.html
Simple one step fluorescent dye-terminator DNA cycle sequencing
• Reaction mix: DNA, primer, Taq DNA Pol, -ddATP, -ddCTP, -ddGTP, -ddTTP, reaction buffer
• Thermocycling → 2-propanol precipitation → DNA analyzer
Applied Biosystems Capillary DNA Sequencer: ABI 3730 xl
Electropherogram (chromatogram): GATCAGGGTTACATGCTACGGCTTCACACGTCGACCCATATTAC...................
phred
• Function: base calling and quality assignment
• chromat files (input) → phd files (output)
Example of phd file
• q value: numbers in the middle column
• q = -10 log10(P), where q is the quality value and P is the estimated error rate
• q20: 1 error in 100 bases (P = 0.01)
• q40: 1 error in 10,000 bases (P = 0.0001)
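As a minimal sketch of the quality-value relationship above, the Python below converts between q and P and reads the three-column base lines of a phd file. It assumes the usual phd layout, in which the base calls sit between `BEGIN_DNA` and `END_DNA` as "base, quality, peak position". The function names and the sample text are made up for illustration and are not phred output.

```python
import math


def phred_to_error(q):
    """Error probability P for a quality value q, from q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Quality value q for an estimated error probability P."""
    return -10 * math.log10(p)


def parse_phd_dna(text):
    """Return (base, quality, peak_position) tuples from the DNA block.

    Assumes each line between BEGIN_DNA and END_DNA has three
    whitespace-separated columns, with the quality in the middle.
    """
    records = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, qual, pos = line.split()[:3]
            records.append((base, int(qual), int(pos)))
    return records


# Invented sample lines for illustration only
sample = """BEGIN_DNA
g 20 12
a 40 25
t 9 37
END_DNA
"""

for base, q, pos in parse_phd_dna(sample):
    print(base, q, pos, phred_to_error(q))
```

Summing `phred_to_error(q)` over all bases of a read gives its expected number of miscalled bases, which is one simple way to compare read quality.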
Sequence Assembly Software
• phredPhrap (Phil Green)
• cap3 (Xiaoqiu Huang)
• TIGRAssembler (TIGR)
• ATLAS (BCM)
• SPS phrap (Geospiza)
• Genome Assembler (Paracel)
• Celera Assembler (Celera)
• BGI Assembler (BGI)
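To make the idea of assembling shotgun reads concrete, here is a toy greedy overlap-merge sketch in Python. It is not how phrap, cap3, or any of the assemblers listed above work: it assumes error-free reads from one strand, uses exact suffix/prefix matching, and ignores quality values and repeats. The function names, the `min_len` parameter, and the way the reads are cut are all assumptions made for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Three overlapping 20-base "reads" cut from a 44-base example sequence
reference = "GATCAGGGTTACATGCTACGGCTTCACACGTCGACCCATATTAC"
reads = [reference[12:32], reference[24:44], reference[0:20]]

contigs = greedy_assemble(reads)
print(contigs)
```

With these three reads the adjacent pairs each share an 8-base overlap, so the sketch rebuilds the 44-base sequence as a single contig. Real assemblers must additionally handle sequencing errors, both strands, repeats, and millions of reads.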
Basic Functional Genomic Analysis
• Gene Prediction (P: Prokaryotes; E: Eukaryotes)
  - Glimmer (P)
  - GeneMark (PE)
  - Genscan (E)
  - X-grail (E)
  - Fgenes (E)
  - est2genome (E; EST driven prediction)
  * others (http://www.cs.jhu.edu/~salzberg/appendixa.html#Gene_finders)
• Gene Functional Analysis
  - Blast searches
  - Motif analysis
  - Structure prediction and homology searches
Sources of genomics databases and bioinformatics applications
• Public Data Banks: NCBI, EMBL-EBI, and DDBJ
• Genome Centers
  - DOE Joint Genome Institute
  - Baylor College of Med. Human Genome Sequencing Center
  - The Wellcome Trust Sanger Institute
  - Washington Univ. School of Med. Genome Sequencing Center
  - Whitehead Institute/MIT Center for Genome Research
  - Others (www.ornl.gov/sci/techresources/Human_Genome/research/centers.shtml)
NCBI Map Viewer Human Genome Resources