CSCE555 Bioinformatics Lecture 2 Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu
Roadmap • DNA, Chromosomes, Genomes • Genome Sequencing and whole genomes • DNA Sequence Representation, Models • Sequence Retrieval, Manipulation • Basic Analysis and Questions of Genomes • Summary
Tools to Learn Concepts Quickly • Wikipedia.org • Searching “Genome” brings up a lot of related information • In Google, type “keywords wiki” • Google search tips • Find info from university websites • Genome, site:edu • Find info as PowerPoint files • Genome, tutorial, filetype:ppt
DNA Bases A: adenine C: cytosine G: guanine T: thymine • Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms. • Backbone: sugars and phosphate groups • DNA is a long polymer of simple units called nucleotides
Microbial Genome: Clostridium sp. OhILAs CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT Complementary Base Pairing: A T C G Write a program to export complementary sequence?
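One possible answer to the exercise above, as a minimal Python sketch. The function names `complement` and `reverse_complement` are illustrative choices, not part of the course material; any character other than A, C, G or T is passed through unchanged.

```python
# Pairing rule from the slide: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(seq)[::-1]


# First ten bases of the Clostridium sequence shown above.
fragment = "CTGCTGTACT"
print(complement(fragment))
print(reverse_complement(fragment))
```

The reverse complement is included because the two strands of DNA run in opposite directions, so the partner strand is usually reported reversed.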
Genome of organisms • The genome of an organism is the complete DNA sequence of one set of its chromosomes
Sequencing: Basic Ideas • Current lab techniques can sequence small DNA pieces (say, 700 base pairs). • Use restriction enzymes to cut DNA into pieces • Sort pieces of different sizes using gel electrophoresis and use the sorting to read them • Mapping and Walking • Sequence one piece, get 700 letters, make a primer that allows you to read the next 700, and work sequentially down the clone • Estimate for human genome sequencing using this method: 100 years • Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing genomes • Obtain random sequence reads from a genome • Assemble them into contigs on the basis of sequence overlaps • Straightforward for simple genomes (with few or no repeat sequences) • Merge reads containing overlapping sequence • Shotgun sequencing is more challenging for complex (repeat-rich) genomes: two approaches
How Sequencing Works Beckman CEQ 8000
Sequencing small DNA pieces • Use DNA cloning or PCR to make multiple copies. • Put the copies in 4 test tubes marked G, A, T and C. • In test tube G, use restriction enzymes that cut at G. • Do the same for the other test tubes. • Run gel electrophoresis separately on the contents of each test tube. • The resulting data are shown in the table on the left. • Reading the table, we get: G has lengths 1, 7, 12, 13, 19; A has lengths 2, 6, 8, 11, 14, 15, 16; T has lengths 4, 5, 9, 18; and C has lengths 3, 10, 17. • This gives us the sequence: GACTTAGATCAGGAAACTG.
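The lane-reading step above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, assuming that a fragment of length n in lane X means the base at position n is X; the helper name `read_gel` is made up for this example.

```python
# Fragment lengths read off the four lanes, as listed on the slide.
lanes = {
    "G": [1, 7, 12, 13, 19],
    "A": [2, 6, 8, 11, 14, 15, 16],
    "T": [4, 5, 9, 18],
    "C": [3, 10, 17],
}


def read_gel(lanes):
    """Rebuild the sequence from per-lane fragment lengths."""
    base_at = {}
    for base, lengths in lanes.items():
        for n in lengths:
            if n in base_at:
                raise ValueError(f"position {n} appears in two lanes")
            base_at[n] = base
    length = max(base_at)
    missing = [n for n in range(1, length + 1) if n not in base_at]
    if missing:
        raise ValueError(f"no fragment observed for positions {missing}")
    # Read the positions in increasing order to spell out the sequence.
    return "".join(base_at[n] for n in range(1, length + 1))


sequence = read_gel(lanes)
print(sequence)
```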
Methods for very large scale sequencing • A hierarchical approach • Map on a large scale (physical mapping), then sequence specific clones whose position in the genome is known • Shotgun sequencing • “Tear up” the genome and sequence random fragments until it is done • Sequence tagged connectors (STC) • Sequence the ends of many clones and use this info to pick overlapping clones
“Shotgun” sequencing (figure labels: Subclone; Copy; Clone to sequence; Sequence and “assemble”) • Example overlapping reads: ….GTCTACCTGTACTGATCTAGC... …. CCTGTACTGATCTAGCATTA... …. GTACTGATCTAGCATTACG...
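To make the “assemble” step concrete, here is a minimal Python sketch that merges the three example reads above by their suffix/prefix overlaps. It assumes the reads are already given in left-to-right order and contain no sequencing errors; a real assembler has to discover the order and cope with repeats. The names `overlap` and `merge_in_order` are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge_in_order(reads, min_len=3):
    """Merge reads that are assumed to be sorted left to right."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        contig += read[k:]  # append only the part that is new
    return contig


reads = [
    "GTCTACCTGTACTGATCTAGC",
    "CCTGTACTGATCTAGCATTA",
    "GTACTGATCTAGCATTACG",
]
contig = merge_in_order(reads)
print(contig)
```

Requiring a minimum overlap (`min_len`) is a simple guard against merging reads that share only a base or two by chance.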
Emerging Sequencing Methods • Sequencing by Hybridization (SBH) • Mass Spectrometric Sequencing • Direct Visualization of Single DNA Molecules by Atomic Force Microscopy (AFM) • Single-Molecule Sequencing Techniques • Single-Nucleotide Cutting • Nanopore Sequencing • Readout of Cellular Gene Expression
Whole Genomes of Species • Bacterial Genomes • Eukaryotic Genomes • Human Genome Project • Other Animal and Plant Genomes • Model Genomes The genomes of more than 180 organisms have been sequenced since 1995 http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml
Sizes of Genomes • You will learn to download all these genomes onto your computer’s hard drive • Refer to Table 1.1, page 2 of the Intro to Comp Genomics book.
DNA Sequence Representation • DNA Sequence: a string of letters with alphabet {A, C, G, T} • Protein sequence: a string of amino acids with alphabet {ARNDCEQGHILKMFPSTWYV} • 20 standard amino acids • Genetic code:
Genetic Code: Codon • DNA (ATCG) → RNA (AUCG) • Three bases of DNA (a codon) encode one amino acid
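As a rough illustration of the DNA → RNA → amino acid mapping, the Python sketch below builds the standard genetic code table from the compact 64-letter string used in NCBI translation table 1 (bases ordered T, C, A, G; `*` marks a stop codon). The function names and the example sequence are illustrative, and the input is assumed to be the coding strand read in frame from its first base.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate DNA codon by codon; trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(transcribe("ATCG"))
print(translate("ATGGCCTAA"))  # hypothetical three-codon example
```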
Representation of Sequences • Single DNA sequence • ATCCTTAAGGAAA • Multiple sequences with similarity • Regular Expression • ATAAA • ACAAAA • ATAAAAAA • A[TC]A+
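The regular expression on this slide can be checked directly with Python's standard `re` module. The sketch below tests whether each whole sequence is described by `A[TC]A+`; the sequence `AGAAA` is a made-up counterexample added for contrast.

```python
import re

# A, then T or C, then one or more A's.
pattern = re.compile(r"A[TC]A+")

sequences = ["ATAAA", "ACAAAA", "ATAAAAAA", "AGAAA"]
matches = {s: pattern.fullmatch(s) is not None for s in sequences}

for seq, ok in matches.items():
    print(seq, "matches" if ok else "does not match")
```

`fullmatch` is used so that the whole sequence must fit the pattern; `search` would instead report any substring that fits.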
Representation of Sequences • Probabilistic Model: Position-specific scoring matrices (PSSM)
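One common way to build a PSSM is sketched below in Python: count the bases in each column of a set of aligned sequences, add a pseudocount, and take the log-odds against a background frequency. The aligned sites, the pseudocount of 1 and the uniform background of 0.25 are all assumptions made for this example, not values from the lecture.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned)
        pssm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background)
            for b in alphabet
        })
    return pssm


def score(pssm, seq):
    """Sum the column scores for a sequence of the same length as the PSSM."""
    return sum(column[base] for column, base in zip(pssm, seq))


# Hypothetical aligned sites, all of length 5.
sites = ["ATAAA", "ACAAA", "ATAAA", "ATAAG"]
pssm = build_pssm(sites)
print(score(pssm, "ATAAA"), score(pssm, "GGGGG"))
```

Sequences that resemble the aligned sites score above zero, while unrelated sequences score below zero.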
Representation of Sequence: FASTA format • A text-based format for representing either nucleic acid sequences or peptide sequences • Allows sequence names and comments to precede the sequences
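A minimal FASTA reader might look like the Python sketch below. It assumes the usual convention that a header line starts with `>` and that the sequence may be wrapped over several lines; the two records in `example` are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example record
ATCCTTAAGG
AAA
>seq2 another hypothetical record
ATAAA
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```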
Sequence Retrieval, Manipulation • Where to download genome/sequence data • Online databases: EMBL, GenBank • Entrez cross-database search (life science search engine) • Google
Example: Download H. influenzae Genome • First bacterial genome: H. influenzae, 1,830 kb • http://www.ncbi.nlm.nih.gov/sites/entrez • NC_007146: Haemophilus influenzae 86-028NP, complete genome • DNA; circular • Length: 1,914,490 nt • Replicon type: chromosome • Created: 2005/06/27
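The same record can also be fetched from a script. The Python sketch below builds a request URL for NCBI's E-utilities `efetch` service using only the standard library; it assumes the service is reachable and that the accession NC_007146 is still current. The helper names `efetch_url` and `download_fasta` and the output file name are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL that returns the record as plain text."""
    query = urlencode({
        "db": db,
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def download_fasta(accession, path):
    """Download the record and save it to a local file (needs network access)."""
    with urlopen(efetch_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())


url = efetch_url("NC_007146")
print(url)
# To actually download (about 2 MB for this genome):
# download_fasta("NC_007146", "NC_007146.fasta")
```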
Simple Questions and Analysis of Genome Sequence • Frequencies of bases A/C/G/T by simple counting • Sliding windows to check local density • k-mers: frequent/unusual words • 2-mers: AT, AG, AC, TA, TG, TC, etc. • 3-mers
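Both counting tasks above fit in a few lines of Python with `collections.Counter`. This is a minimal sketch; the function names are illustrative, and the short test sequence is the single-sequence example from the earlier representation slide, not real genome data.

```python
from collections import Counter


def base_frequencies(seq):
    """Fraction of A, C, G and T in the sequence."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq.upper())
    return {base: counts[base] / len(seq) for base in "ACGT"}


def kmer_counts(seq, k):
    """Count every overlapping word of length k."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "ATCCTTAAGGAAA"
freqs = base_frequencies(seq)
dimers = kmer_counts(seq, 2)
print(freqs)
print(dimers.most_common(3))
```

Unusually frequent or rare words can then be found by comparing the observed k-mer counts with what the base frequencies alone would predict.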
Genomic landscape: GC content analysis • The overall GC content of the human genome is 41%. • A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right. Page 627
GC content of the human genome: mean 41% Fig. 17.15 Page 628 Source: IHGSC (2001)
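GC content per window can be computed as in the Python sketch below. The slides use 20 kb windows on the human genome; the toy sequence and window size here are deliberately tiny so the result can be checked by hand, and the function names are illustrative.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window, step):
    """GC content of each full window, as (start position, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


toy = "AAAAGGGGATGC"  # hypothetical 12-base sequence
profile = sliding_gc(toy, window=4, step=4)
print(profile)
```

Setting `step` equal to `window` gives non-overlapping windows; a smaller step gives a smoother, overlapping profile.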
Genomic landscape: CpG islands • CpG dinucleotides are under-represented in genomic DNA, occurring at one-fifth the expected frequency. • CpG dinucleotides are often methylated on cytosine (and subsequently may be deaminated to thymine). • Methylated CpG residues are often associated with house-keeping genes in the promoter and exonic regions. • Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression. • They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
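The under-representation of CpG can be quantified with an observed/expected ratio. The Python sketch below uses one commonly used definition, (number of CG dinucleotides × sequence length) / (number of C × number of G); this particular formula is an assumption of the example, not something stated on the slide. A ratio near 0.2 would correspond to the one-fifth figure above, while CpG islands score much higher.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cg * n / (c * g)


# Hypothetical toy sequences.
print(cpg_obs_exp("CGCGCGCG"))
print(cpg_obs_exp("CCGG"))
print(cpg_obs_exp("CATGCATG"))
```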
Broad genomic landscape: CpG islands • Findings: • 50,267 CpG islands in human genome • 28,890 after masking repeats with RepeatMasker • 5-15 CpG islands per megabase • (about <40 genes per megabase)
Summary • DNA, Chromosome, Genome • Sequence models • Sequence database, retrieval • Whole genome sequence analysis
Slide Credits • Slides in this presentation are partially based on slides found on the Internet.