Genome Sequences

Genome Sequences Ka-Lok Ng Asia University

History of genome sequencing • 1995: Craig Venter’s group at The Institute for Genomic Research (TIGR) in Maryland reported the complete DNA seq. of the bacterium Haemophilus influenzae • The first viral genome seq. (phage phiX174) was produced by Fred Sanger’s group in 1977 • Insulin A and B chains: the first amino acid sequence to be determined, in 1951 by F. Sanger (Cambridge U) • Sanger was awarded two Nobel Prizes, both in chemistry: the first in 1958 for the structure of insulin, and the second in 1980 for developing DNA sequencing techniques (shared with Paul Berg and Walter Gilbert)

Genome sequencing up to year 2001 http://www.biochem.arizona.edu/classes/bioc471/pages/Lecture7/Lecture7.html

Timeline of genome sequencing http://www.biochem.arizona.edu/classes/bioc471/pages/Lecture7/Lecture7.html

First draft of human genome F. Collins and C. Venter

Biological sequence space • DNA sequence • a seq. of symbols from the alphabet A, T, C, and G • IUPAC notation • R denotes A or G • Y denotes C or T • - denotes a gap • RNA sequence • a seq. of symbols from the alphabet A, U, C, and G • IUPAC notation • R denotes A or G • Y denotes C or U • - denotes a gap • Protein sequence • a seq. of symbols from an alphabet of 20 letters (all letters except B, J, O, U, X, and Z) • Figure: RNA secondary structure
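The IUPAC ambiguity codes can be put to work when searching for patterns. The sketch below turns a pattern containing such codes into a regular expression. Only R and Y come from the slide; the remaining entries are the standard IUPAC nucleotide codes, and the names `IUPAC_DNA`, `iupac_to_regex` and `matches` are illustrative.

```python
import re

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'GRY' into a compiled regex."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC_DNA[symbol]  # raises KeyError for unknown symbols
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))

def matches(pattern, seq):
    """Return True if seq matches the IUPAC pattern over its full length."""
    return iupac_to_regex(pattern).fullmatch(seq.upper()) is not None
```

For example, `matches("GRY", "GAC")` is true because A is a purine (R) and C is a pyrimidine (Y).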

Biological sequence space • Convenient to model biological seq. as a one-dimensional (1D) object • It is also incorrect • It neglects all the information that might be contained in the 3D structure of the molecule • We make this approximation in this course

Building blocks of DNA sequences • Backbone • Pyrimidines – single ring • Thymine • Cytosine • Purines – double rings • Adenine • Guanine • Complementary pairs: (A,T), (C,G)
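As a small illustration of the complementary pairing (A,T) and (C,G), the sketch below computes the complement and the inverted (reverse) complement of a strand. The function names are illustrative and the code assumes an unambiguous A, C, G, T alphabet.

```python
# Watson-Crick pairs: A pairs with T, C pairs with G.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Complement each base, keeping the original left-to-right order."""
    return seq.translate(_COMPLEMENT)

def reverse_complement(seq):
    """Complementary strand written 5' to 3' (the inverted complement)."""
    return complement(seq)[::-1]
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check.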

Building blocks of protein sequences • N-terminus, C-terminus (protein sequences are read from N to C) • peptide bond O==C–N–H, alpha carbon, the R group

Central dogma of molecular biology • More with coding DNA • DNA is double-stranded, so there are a total of 6 possible reading frames (3 on each strand) in which to look for open reading frames (ORFs)
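The six reading frames of a double-stranded DNA segment can be enumerated with a few lines of code. This is a minimal sketch, assuming an unambiguous A, C, G, T sequence; the frame labels `+1` to `-3` and the function name are illustrative conventions, not taken from the slides.

```python
def six_frames(seq):
    """Return the six reading frames as lists of codons.

    Keys +1, +2, +3 are read from the given strand; -1, -2, -3 from its
    reverse complement. Trailing bases that do not fill a codon are dropped.
    """
    comp = str.maketrans("ACGT", "TGCA")
    forward = seq.upper()
    reverse = forward.translate(comp)[::-1]
    frames = {}
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[sign + str(offset + 1)] = codons
    return frames
```

Scanning each frame for a start codon followed by an in-frame stop codon is then one simple way to look for candidate ORFs.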

Codon translation
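Codon translation can be sketched as a table lookup. The example below builds the standard genetic code from the conventional TCAG ordering and translates a coding DNA sequence; other genetic codes (for example the mitochondrial one) are not covered, and the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG order for the
# first, second and third codon position; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, to_stop=True):
    """Translate DNA (or RNA) codon by codon, starting at position 0.

    Unknown codons become 'X'. With to_stop=True translation ends at the
    first stop codon; otherwise stops are written as '*'.
    """
    seq = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)
```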

Alternative splicing

Genome sequences • Prokaryotic genomes • Eubacteria and archaea are the two major groups of prokaryotes, organisms without nuclei • Generally have a single, circular genome between 0.5 and 13 Mbp long • Simple genes and genetic control seqs. • Viral genomes • Not free-living organisms • Can be either single- or double-stranded, and either DNA or RNA, that is ssDNA, ssRNA, dsDNA or dsRNA • HIV, SARS • Eukaryotic genomes • Ranging in size from 8 Mbp for some fungi to 670 Gbp • Human genome is about 3 Gbp long • Baker’s yeast, worm, zebrafish, fruit fly, mosquito; mammals such as human and mouse; and plants such as rice • Organellar genomes • Mitochondrion (mtDNA) and chloroplast genomes • Only tens or hundreds of thousands of bases long, circular, and contain a few essential genes

Working with whole genomes • Figure: a circular representation of the E. coli genome.

DNA and Protein Sequences Databases • Protein Sequence Databases • NCBI → Molecular databases http://www.ncbi.nlm.nih.gov/Database/ → RefSeq • UniProt http://www.pir.uniprot.org/ • UniProt = Swiss-Prot + TrEMBL + PIR-PSD • UniProt = UniProt Archive (UniParc) + UniProt Knowledgebase (UniProtKB) + UniProt nonredundant reference database (UniRef) • ExPASy http://us.expasy.org/ • PIR http://www-nbrf.georgetown.edu/

The Entrez system • Redundancy in GenBank • Many different GenBank entries are relevant to a specific gene, esp. for human, E.coli, yeast, fruit fly • 4 entries encompass the same E.coli dUTPase gene

Entrez Gene • Example: MEN1 AND human[ORGN] • where ORGN = organism
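The same query can also be expressed programmatically. The sketch below only builds a request URL for NCBI's E-utilities ESearch service and does not contact the network. Using E-utilities in place of the web form shown on the slide is an assumption made for illustration, and `esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, db="gene", retmax=20):
    """Build an ESearch URL for the given Entrez query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# The slide's example: gene symbol MEN1 restricted to human.
url = esearch_url("MEN1 AND human[ORGN]")
```

Spaces and square brackets are percent-encoded by `urlencode`, so the field tag `[ORGN]` survives the trip into the URL.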

Entrez Gene • Read the summary - Summary • Official Symbol • Gene type • Gene name • Gene description • RefSeq status • Organism • Lineage • Gene aliases • Summary • Reference • Protein-protein interaction

FASTA format
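A FASTA record consists of a header line starting with ">" followed by one or more lines of sequence. A minimal reader, assuming the whole file fits in memory and ignoring blank lines, might look like the sketch below; `parse_fasta` is an illustrative name.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Sequence lines are concatenated, so a record wrapped at 70 characters per line comes back as a single string.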

Batch Entrez Gene • NCBI → site map

Batch Entrez Gene • Retrieve information on multiple sequences at one time • Prepare a text file of UniProt seq. IDs and upload it (use database = protein), e.g. Q9XX00 Q8MQ56 Q9XWS4 Q9XU77 Q9XWH5 Q9N2K7

Eukaryotic entry example: AF018430 Use CoreNucleotide to search for the seq.

Retrieving GenBank entries without accession number • Entrez - human[organism] AND dUTPase[protein name] • AND must be in capital letters !

Whole Genome DB • NCBI home page → Genome Biology → Entrez Genome → Viral genome DB, Microbial genome, etc.

Microbial genome – TIGR • http://www.tigr.org/tdb/ • Comprehensive Microbial Resource (CMR)

Genome databases • allow you to browse genomes starting from a chromosome down to a single gene, an individual exon, or a nucleotide • Ensembl database • http://www.ensembl.org • UCSC database • http://genome.ucsc.edu

Microbial Database : GOLD • http://www.genomesonline.org

Statistical analysis of biological sequences • Look for sequence structures in biological sequences, whether DNA, RNA or protein seqs. • Assume one starts from the 1D structure • Take DNA as an example: for a random sequence one expects the nucleotides A, T, C and G to appear with equal frequency, %A = %T = %C = %G = 25% • In actual DNA seqs., this is not true!
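Checking a sequence against the 25% expectation is a matter of counting. The sketch below reports the percentage of each base, ignoring any symbol other than A, C, G and T; the function name is illustrative.

```python
from collections import Counter

def base_composition(seq):
    """Return the percentage of A, C, G and T among the unambiguous bases."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}
```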

Statistical analysis of DNA sequences • Study the base composition • GC content • Frequent or rare words – words of length k • Biological relevance of unusual words (motifs)

Counting words in DNA seqs. • http://www.genomatix.de/cgi-bin/tools/tools.pl → create seq. statistics

Counting words in DNA seqs. • NCBI → Genome (complete genome sequences) → microbial → Haemophilus influenzae Rd KW20, NC_000907.1 (TIGR, dated 1995) → Link: RefSeq FTP or GenBank FTP (L42023.fna)

Counting words in Haemophilus influenzae genome • The total number of bp and the GC content agree with the NCBI record

Counting words in Haemophilus influenzae genome • (%A) strand + = (%T) strand -, • (%C) strand + = (%G) strand -, • …. • Because of the complementary principle, i.e. A-T, and C-G

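Counting words in Haemophilus influenzae genome • Percentage of each dinucleotide • A sequence of length L contains L-k+1 overlapping words of length k, so use L-k+1 as the total when computing percentages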

Counting words in Haemophilus influenzae genome • Nucleotide words of length 2 (called dimer) or higher (trimers, k-mers) • Words of length k are called k-grams or k-tuples in computer science, or k-mer in biological science Frequency of 3-mers
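Counting k-mers amounts to sliding a window of width k along the sequence one base at a time. A minimal sketch, with illustrative function names:

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping word of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_frequencies(seq, k):
    """Relative frequency of each observed k-mer."""
    counts = count_kmers(seq, k)
    total = sum(counts.values())  # L - k + 1 for a sequence of length L
    return {word: n / total for word, n in counts.items()}
```

`count_kmers(genome, 3).most_common(10)` would then list the ten most frequent 3-mers of a sequence held in a variable `genome`.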

Finding unusual DNA words • A simple statistical analysis can be used to find under- and over-representation of motifs (i.e. k-mers) • It helps us to decide when an observed bias is significant • For the case of 2-mers: compare the observed frequency of each 2-mer with the one expected under a background model, typically a multinomial model • The ratio between the two quantities indicates how much a certain word deviates from the background model and is called the odds ratio: $r_{xy} = N(xy) / (N(x)\,N(y))$, where N(xy) is the frequency of the dinucleotide xy, and N(x) and N(y) denote the frequencies of the nucleotides x and y respectively • If rxy > 1 or rxy < 1, the dinucleotide xy is considered to be of high or low relative abundance, respectively, compared with a random seq.
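The odds ratio for 2-mers, the observed dinucleotide frequency divided by the product of the two single-nucleotide frequencies, can be computed directly from counts. The sketch below takes N(.) to mean relative frequency and reports only dinucleotides that actually occur; the function name is illustrative.

```python
from collections import Counter

def dinucleotide_odds_ratios(seq):
    """Return N(xy) / (N(x) * N(y)) for every observed dinucleotide xy,
    where N denotes relative frequency."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    ratios = {}
    for word, count in di.items():
        f_xy = count / n_di
        f_x = mono[word[0]] / n_mono
        f_y = mono[word[1]] / n_mono
        ratios[word] = f_xy / (f_x * f_y)
    return ratios
```

In the strictly alternating sequence ATATATAT, for instance, AT and TA both come out well above 1, while AA and TT never occur at all.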

Finding unusual DNA words • Clearly, dimers whose odds ratio deviates from 1 are unusually represented, although the amount of deviation needed to consider this a significant pattern needs to be analyzed with the tools discussed later in this course • The dimer GG looks extremely infrequent in that table, but this analysis reveals that this is not likely to be a significant bias because the nucleotide G is low in frequency to begin with • AA and TA seem to be unusual

Finding unusual DNA words • The odds ratio can be generalized to k-mers • For k-mers there are 4 to the k-th power, $4^k$, possible different patterns • Frequent words in H. influenzae: the words AAAGTGCGGT and ACCGCACTTT both appear more than 500 times.

Biological relevance of unusual motifs • Frequent words may be due to repetitive elements • Rare motifs include binding sites for transcription factors • Words such as CTAG have undesirable structural properties, because they lead to “kinking” of the DNA • Virus vs. bacteria • Words that are not compatible with the internal immune system of a bacterium: bacterial cells can be infected by viruses, and in response they produce restriction enzymes, proteins that are capable of cutting DNA at specific nucleotide words, known as restriction sites. The nucleotide motifs recognized by restriction enzymes are under-represented in many viral genomes, so as to avoid the bacterial hosts’ restriction enzymes.

Analyzing DNA seq. http://bioweb.pasteur.fr/intro-uk.html#dna

Analyzing DNA seq. GC composition • Calculates the fractional GC content of nucleic acid sequences • C+G content: a G-C base pair is held together by three hydrogen bonds (an A-T pair by two) • GEECEE http://bioweb.pasteur.fr/seqanal/interfaces/geecee.html
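Fractional GC content is simple to compute by hand. The sketch below is not the GEECEE implementation, only an illustration of the same quantity; the sliding-window variant and both function names are additions made for this example.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / acgt

def gc_windows(seq, window, step):
    """GC content of successive windows, useful for spotting local bias."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```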

Counting long words in DNA seqs. • http://bioweb.pasteur.fr/intro-uk.html • Use AK003076
>gi|12833508|dbj|AK003076.1| Mus musculus adult male spleen cDNA, RIKEN full-length enriched library, clone:0910001I10 product:DUTPASE homolog [Mus musculus], full insert sequence
GGCTTTTTCCACGCCCGCCGCCATGCCCTGCTCGGAAGATGCCGCGGCCGTCTCTGCCTCCAAGAGGGCT
CGAGCGGAGGATGGCGCTTCTCTGCGCTTCGTGCGGCTCTCGGAGCACGCCACGGCGCCCACCCGCGGGT
CCGCGCGCGCTGCCGGCTACGACCTATTCAGTGCCTATGATTATACAATATCACCCATGGAGAAAGCCAT
CGTGAAGACAGACATTCAGATAGCTGTCCCTTCTGGGTGCTATGGAAGAGTAGCTCCACGTTCTGGCTTG
GCTGTAAAGCACTTCATAGATGTAGGAGCTGGTGTCATAGACGAGGATTACAGAGGAAACGTTGGGGTCG
TGCTGTTTAACTTTGGGAAAGAGAAGTTTGAAGTGAAAAAAGGTGATCGGATTGCGCAGCTCATCTGTGA
GCGGATTTCTTATCCAGACTTAGAGGAAGTGCAGACCCTGGATGACACCGAGAGAGGCTCAGGAGGCTTC
GGCTCCACCGGGAAGAATTAGAACTTTGCTGGAAGTATCTCGCTGTTTCAACACTGGAAACCAGAAGCTC
TAACTTCGGAAGCATTTGGTGTTCTAGGATGCAGGAAAGGAGACCTCGATCACATCACGTTGGAACGATT
CTGTTCCCTGGTTGAGGTCGCCTGTAAGTCTGCACTGTGAGCATGGCATTGACATGCAGACTTGGTAAAA
CCCAGGGTACAGTTAGATTTTTTGTTGTTGTTGTATTATTTAAATTATAGCCTTCCAAAAACTGTTTTTG
ATCATAATTGCTGTATCATTTGTAATTTTTTTTAATCCAATAAAGTTGCTTTTAGC

Analyzing DNA seq. composition

Unusual words in different organisms or chromosomes • The measure rxy is suitable for a single seq. • When comparing seqs. from different organisms or chromosomes, account for the complementary anti-parallel structure of DNA → modify rxy • Reference: Burge, Campbell and Karlin (1992), PNAS, 89, 1358 • Double helix: Sa = 5’-ATCG....-3’, SaI = 3’-TAGC….-5’; Sb = 5’-CAGT….-3’, SbI = 3’-GTCA….-5’ • Let I = inverted complementary seq. • X = A, T, C, G • a, b = species • faX = freq. of X for species a • Observation: Chargaff’s rule → in double-stranded DNA, total number of A = total number of T, and total number of C = total number of G

Unusual words in different organisms or chromosomes • Question: how to compare faX and fbX? • Need to consider the union of S and SI • Why? Consider the case in which one seq. has lots of A and the other has lots of T → the second in fact has lots of A in its complementary seq.! • Sa = 5’-AAAACGT....-3’, SaI = 3’-TTTTGCA….-5’ • Sb = 5’-TTTTCGA….-3’, SbI = 3’-AAAAGCT….-5’ • Need to symmetrize the nucleotide frequencies, taking the complementary seq. into account • Define S* = S + SI → $f^*_X = (f_X + f_{I(X)})/2$ • * means the union, that is, count the freq. of X in both strands and take the average • I(X) = inverted complement of X • Work with a single DNA strand only; there is no need to construct the complementary seq. • Compare the double-strand quantity f*, that is, compare f*aX and f*bX
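The symmetrized frequency, the average of the frequency of X and the frequency of its complement, can be computed from a single strand because the count of X on the opposite strand equals the count of its complement on the given strand. A minimal sketch with an illustrative function name:

```python
from collections import Counter

_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def symmetrized_base_frequencies(seq):
    """Average each base frequency with that of its complement,
    using one strand only."""
    counts = Counter(base for base in seq.upper() if base in _PAIR)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    freq = {base: counts[base] / total for base in "ACGT"}
    return {base: (freq[base] + freq[_PAIR[base]]) / 2 for base in "ACGT"}
```

For the two example strands AAAACGT and TTTTCGA the raw frequencies of A differ (4/7 versus 1/7), while the symmetrized frequencies are identical (5/14 for both A and T).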

Unusual words in different organisms or chromosomes • How about counting the frequency of 2-mers? • $f^*_{XY} = (f_{XY} + f_{I(XY)})/2$, where I(XY) = inverted complement of XY

Unusual words in different organisms or chromosomes • How about the odds ratio for 2-mers? • $r^*_{XY} = f^*_{XY} / (f^*_X f^*_Y)$ • Conservative thresholds for low and high odds ratios are below 0.78 and above 1.22 respectively.
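A strand-symmetric odds ratio for 2-mers can be sketched as follows, assuming it is defined as the symmetrized dinucleotide frequency divided by the product of the two symmetrized single-nucleotide frequencies. That definition and the helper names are assumptions made for illustration; the thresholds 0.78 and 1.22 are the ones quoted on the slide.

```python
from collections import Counter
from itertools import product

_COMP = str.maketrans("ACGT", "TGCA")

def _revcomp(word):
    """Inverted complement of a word."""
    return word.translate(_COMP)[::-1]

def _symmetrized(seq, k):
    """Average the frequency of each k-mer with that of its inverted
    complement; windows containing other symbols are skipped."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(w for w in words if set(w) <= set("ACGT"))
    total = sum(counts.values())
    return {w: (counts[w] + counts[_revcomp(w)]) / (2 * total)
            for w in map("".join, product("ACGT", repeat=k))}

def symmetrized_odds_ratios(seq):
    """Symmetrized odds ratio for all 16 dinucleotides."""
    seq = seq.upper()
    f1, f2 = _symmetrized(seq, 1), _symmetrized(seq, 2)
    return {xy: f2[xy] / (f1[xy[0]] * f1[xy[1]])
            for xy in f2 if f1[xy[0]] > 0 and f1[xy[1]] > 0}

def classify(ratio, low=0.78, high=1.22):
    """Label an odds ratio using the slide's conservative thresholds."""
    if ratio < low:
        return "under-represented"
    if ratio > high:
        return "over-represented"
    return "normal"
```

By construction a word and its inverted complement receive the same value, and a sequence and its reverse complement yield the same table.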

Unusual words in different organisms or chromosomes • How about the odds ratio for 3-mers?

Compare statistical properties (1-mers and 2-mers) of human and chimp complete mitochondrial DNA • Human: NC_001807 • Chimp: NC_001643 • Both species have similar fX

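Compare statistical properties (1-mers and 2-mers) of human and chimp complete mitochondrial DNA • Tables for human and chimp • 4x4 = 16 dinucleotides, but the symmetrized table satisfies f*XY = f*I(XY) → only 10 distinct numbers need to be computed, not 16 (6 pairs of inverted complements plus the 4 words AT, TA, CG, GC that are their own inverted complement)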
