CSCI6904 Genomics and Biological Computing

CSCI6904 Genomics and Biological Computing
Lecture 2 – Genomics
• Encoding Molecules
• Sequencing DNA
• Genome Projects
• Finding Genes

How to encode information into molecules
Adenine (A), Guanine (G), Thymine (T), Cytosine (C)

Nucleotides are also building blocks for energy and signaling pathways
Adenosine (A): ATP, AMP

The tale of a structure
A structure for Deoxyribose Nucleic Acid
J. D. WATSON, F. H. C. CRICK, 2 April 1953
MOLECULAR STRUCTURE OF NUCLEIC ACIDS
http://www.chemheritage.org/EducationalServices/chemach/ppb/cwwf.html

A double helix as encoding medium
• Protection against environment: the informative unit is stowed inside, away from water-soluble toxins. Cancer-causing agents typically can penetrate this defense.
• Redundancy: proofreading by comparing to the complementary strand.
• Mechanical protection: torque, stretching…
• Control of information flow: “archive” the information when not in use.
• Fancy control structures: hairpins, turns, twists and other frills.

Transcription
• Retrieve copies of accessible genes: all genes on the chromosome that are exposed, for any reason, are transcribed into mobile and unstable molecules called messenger RNAs [A, U, C, G].
• Exporting, editing, processing: these are exported out of the nucleus; some are edited according to some control scheme.
• Taken up by the translation machinery: each messenger enters a complex molecular machine that translates the gene into a protein written in a 20-character alphabet.
• Destroyed quickly: this avoids the need for a control mechanism at this step.

Translation
Convert words of three characters into protein chains. These three-character words are called “codons”.
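
The codon-to-protein conversion can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a DNA (not RNA) input string; the names `CODON_TABLE` and `translate` are made up for this example, and unknown codons are written as `X`.

```python
from itertools import product

# Standard genetic code, laid out in the usual T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":  # "*" stands for a stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # ATG GCC TAA
```

Reading always starts at the first character here, so the choice of reading frame is left to the caller.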

Translation
The code is universal and degenerate. Most organisms use the universal code.

Translation
The code is universal and degenerate. Different organisms have different codon frequencies.
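
Codon frequencies are easy to tabulate from a coding sequence. The sketch below assumes the input is already in the right reading frame; `codon_frequencies` is an illustrative name, not a function from any library.

```python
from collections import Counter


def codon_frequencies(cds):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# GCC and GCA both encode alanine, but this toy sequence prefers GCC.
print(codon_frequencies("GCCGCCGCA"))
```

Comparing such tables between organisms is one way to see the differences in codon preference.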

Translation
From there, the new chain spontaneously adopts a 3D structure and starts to do something. Other proteins require further editing and are exported to their destination.

Protein Alphabet
• 20 universal characters.
• Genetically encoded extra AA are very rare.
• Each character has a set of properties:
  • Electrostatic charge (+/-)
  • “Hydrophobicity” (don’t mix with water)
  • Chemical reactivity

Sequencing DNA
Why: All genetic information is encoded in the DNA molecule. Sequencing DNA is necessary to create a representation of the information on which computation can be performed.
Principle: Reading cannot be done directly, as the individual nucleotides cannot be visually resolved. Using a natural protein cloned from a bacterium, a collection of molecules of every possible length is generated. These artificial replicates are separated on the basis of their size in a gel matrix using a powerful electric field (electrophoresis). Individual replicates are then resolved because they are tagged with either fluorescent or radioactive markers.

Replication in vivo
Principle: A protein called a DNA polymerase steps along a single strand of the DNA, finding the complementary character for each nucleotide and attaching it to the growing new chain.
Polymerase enzyme (proteins)

Replication in vitro and sequencing

Sequencing DNA
Nowadays:
• Create a mixture of chains of all possible lengths by using unreactive capped ends.
• Separate the four mixtures together on the basis of their size.
• Read the sequence as a string (anywhere between 300 and 900 characters*).
*includes 0!
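
The read-out logic can be mimicked with a toy simulation. This is only a sketch of the bookkeeping, not of the chemistry: it assumes one terminated fragment of every length, an alphabet of A, C, G, T only, and perfect size separation. The name `sanger_read` and the `lanes` dictionary are invented for the example.

```python
def sanger_read(synthesized):
    """Recover a sequence from the lengths of capped fragments, one lane per base."""
    synthesized = synthesized.upper()
    # Each lane collects the fragment lengths that ended with that capped base.
    lanes = {base: [] for base in "ACGT"}
    for length in range(1, len(synthesized) + 1):
        lanes[synthesized[length - 1]].append(length)
    # Separation sorts fragments by size; reading from shortest to longest
    # gives the bases in order.
    by_size = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in by_size)


print(sanger_read("GATTACA"))
```

The point is that the terminal base of each fragment, ordered by fragment length, spells out the sequence.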

Sequencing DNA
Errors: Errors in reading are more frequent at the extremities of the readable sequence:
• Very compact reads (early)
• Less defined reads (late) due to “smearing”
Polymerase has an intrinsic rate of replication error. In nature, these would be called mutations. In a lab, these are just called annoying.
Proofreading DNA: Since there are two single strands, both are usually sequenced and cross-checked for inconsistencies.
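
The cross-check between strands amounts to comparing one read with the reverse complement of the other. The sketch below assumes the two reads cover exactly the same region and have the same length (real reads would need aligning first); `cross_check` is a hypothetical helper name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def cross_check(forward_read, reverse_read):
    """List (position, forward base, base implied by reverse read) where the reads disagree."""
    implied = reverse_complement(reverse_read)
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(forward_read.upper(), implied))
        if a != b
    ]


print(cross_check("ATGC", "GCAT"))  # consistent reads
print(cross_check("ATGC", "GCAA"))  # one disagreement to resolve
```

A disagreement does not say which read is wrong, only which position needs another look.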

Representation
FASTA format: Most basic format to store sequence information only. This is what is usually downloaded from a database.
>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQIESIWAAELDRYKLVKTNCSNVS
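
Because the format is so simple, a reader for it fits in a few lines. This is a minimal sketch that assumes each record is a `>` header line followed by one or more sequence lines; `parse_fasta` is an illustrative name, and real projects would more likely use an existing library.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>Example2 synthetic peptide
HITREPLKHIPKERYRGTND
TLSPQIESIWAAELDRYKLVKTNCSNVS
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Sequence lines that appear before any header are ignored in this sketch.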

Representation
ASN.1 format: This is a structured data format used by GenBank. I’m not sure who directly uses this. Example:
Seq-entry ::= set { class pop-set , descr { pub { pub { gen { cit "Unpublished" , authors { names std { { name name { last "Burda" , first "Sherri" , initials "S.T." } } , { name name { last "Konings" , first "Frank" , initials "F.A.J." } } , …
XML formats: There is a collection of markup language derivatives for sequence information which may be more convenient to deal with.
GenPept format: Human-readable formatting of the content; this is the default representation if one uses the online portal to the GenBank database.

Size of genomes

Size of genomes
Variance in size: It costs energy to replicate a genome. Organisms with a short generation time (~minutes) are under strong pressure to dump garbage and duplicated DNA so as to maximize their efficiency. This is not a problem for higher life forms, for which the availability of energy isn’t a problem. Some plants have the largest genomes, although only a small fraction of these actually encodes genes.

Vocabulary
Plasmid: Small circular DNA molecule; engineered versions are used as constructs to manipulate sequences.
Cloning: Making a copy of a segment of DNA.

Reading a whole genome
BAC and YAC: artificial chromosomes (or plasmids) from bacteria and yeast, respectively.
Library:
• Extract whole-cell DNA, clean it up, and break it into random fragments: 150 kbp (BAC), 0.15-1.5 Mbp (YAC).
• Paste the fragments into BACs or YACs.
• Introduce the BACs or YACs into host cells.

Chromosome walk Sequencing
Principle:
• Isolate a DNA fragment / chromosome.
• Create a specific replication primer.
• Sequence as far as possible.
• Use a region near the end of the current read to design a new primer.
The initial primers are known because they are located on the BAC/YAC.
Slow and expensive.
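
The walking loop can be sketched as a toy simulation on a known string. Everything here is an assumption made for illustration: the function name `primer_walk`, the fixed `read_length` and `primer_length`, and the idea that each new primer anneals exactly where the previous read ended. It mainly shows why the method is slow: each round depends on the previous one.

```python
def primer_walk(template, first_primer, read_length=20, primer_length=6):
    """Simulate primer walking; return the assembled sequence and the number of rounds."""
    if read_length <= primer_length:
        raise ValueError("read_length must exceed primer_length to make progress")
    pos = template.find(first_primer)  # the first primer sits on the known vector
    if pos < 0:
        raise ValueError("first primer not found in template")
    assembled = ""
    rounds = 0
    while True:
        read = template[pos:pos + read_length]  # sequence as far as possible
        rounds += 1
        # After the first round, the start of each read repeats the primer region.
        assembled += read[primer_length:] if assembled else read
        if pos + read_length >= len(template):
            break
        # Design the next primer from the end of the current read.
        pos = pos + read_length - primer_length
    return assembled, rounds


sequence, rounds = primer_walk("GATTACA" * 10, "GATTAC")
print(rounds, sequence == "GATTACA" * 10)
```

With real read lengths of a few hundred characters, a 150 kbp BAC would need hundreds of strictly sequential rounds.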

Shotgun Sequencing
Principle:
• Start with a BAC/YAC construct from a library, again.
• Create random replication primers.
• Sequence an arbitrarily large number of samples.
• Assemble based on sequence identity.

Shotgun Sequencing
Assembly into CONTIGS: A case of the shortest common superstring problem, aided by knowledge of Sequence-Tagged Sites (STS). STS are pretty much just substrings unique to a genome which have been mapped to a chromosome.
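
A common textbook heuristic for the shortest common superstring problem is greedy merging: repeatedly join the two reads with the longest overlap. The sketch below is that heuristic only, not the method used by any particular assembler; it assumes error-free reads from a single strand, and `overlap`, `greedy_assemble` and `min_len` are names chosen for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge reads by largest overlap; returns the resulting contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: what remains are separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

Greedy merging is not guaranteed to find the shortest superstring, which is one reason extra mapping information such as STS is useful.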

Shotgun Sequencing
Assembly into CONTIGS: The main caveat with the method is that it tends to delete repetitive regions, or to get stuck in local minima in situations like the following:
...ADGHKJGKJXXXXXXXXXSDGDKJHDGFXXXXXXXSADGUYDSSDGK…

Public vs Private Genomes
Shotgun vs. systematic walk: The Human Genome Project broke into two components somewhere along the road:
• Private company: shotgun sequencing only
• Public project: chromosome mapping

Public vs Private Genomes
TIGR: Non-profit.
• First full genome: Haemophilus influenzae
• Human Genome Draft 1 (Public data, 2001)

Public vs Private Genomes
CELERA:
• Drosophila Genome (Public collaboration)
• Bacterial genomes (Proof of concept)
• Human Genome Draft 1 (Proprietary data, 2001)
Both still have gaps and typos.

Public vs Private Genomes
Which one is the best?
• CELERA’s draft has been shown to collapse regions of high sequence identity.
• CELERA has access to the public database to correct this problem.
• CELERA charges a high price for access to their data!

Whose DNA was sequenced
Nine persons (anonymous): 8 males, 1 female.
• Males have a Y chromosome, females don’t.
• 3/9 were from germ line cells (sperm). Some genes are known to be pre-processed in non-germ line cells directly on the DNA.
Craig Venter, CELERA’s CEO, admitted that ~3/5 of CELERA’s DNA is his! Sigh.

What is in the HGP

OK, so what is a gene?
STOP codons: TAA, TGA, TAG
START codon: ATG (also codes for protein character M)

Open Reading Frame (ORF).
Definition: Any segment of DNA which starts with a start codon and ends with a stop codon in phase.

Open Reading Frame (ORF).
The purple protein in this figure is responsible for finding stop codons.

Open Reading Frame (ORF).
There are six possible translational frames to worry about.
Sequence as in the DB:
TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG
Three first frames:
TCC AAC TCG GGG TCC GCA TCG CTC CGC CGG CGA CCG ACG AAG CCG A
T CCA ACT CGG GGT CCG CAT CGC TCC GCC GGC GAC CGA CGA AGC CGA
TC CAA CTC GGG GTC CGC ATC GCT CCG CCG GCG ACC GAC GAA GCC GA
But DNA is a double strand…
5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’
3’-AGGTTGAGCCCCAGGCGTAGCGAGGCGGCCGCTGGCTGCTTCGGC-5’
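
The six frames can be generated mechanically: three offsets on the given strand and three on its reverse complement. A minimal sketch, assuming an uppercase A/C/G/T alphabet; `six_frames` and the frame labels `+1` … `-3` are naming choices made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a sequence into complete codons for all six reading frames."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; leftover bases are dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


dna = "TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG"
for label, codons in six_frames(dna).items():
    print(label, " ".join(codons))
```

Note that the three reverse frames are read from the other strand in its own 5’ to 3’ direction, which is why the sequence is reversed as well as complemented.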

What is 5’ and 3’? This is derived from the chemical notation for the sugar molecule ribose. Directionality of the Chain

Open Reading Frame (ORF).
• But DNA is a double strand
  • 5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’
  • 3’-AGGTTGAGCCCCAGGCGTAGCGAGGCGGCCGCTGGCTGCTTCGGC-5’
• Principle (in bacteria)
  • Find the longest possible sequence beginning with an ATG and terminated by a TAA, TAG or TGA.
  • There may be multiple ATGs inside the gene, but only a single stop codon.
  • Real genes will have regulatory regions upstream of the ORF. Use pattern detection to find these.
  • Real genes are typically 100-500 codons long.
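
The bacterial principle above can be sketched as a scan over all six frames that remembers the first ATG seen and closes the ORF at the next in-frame stop. This is an illustration only: `find_orfs` and `min_codons` are hypothetical names, coordinates on the reverse strand refer to the reverse-complemented string, and no regulatory-region check is attempted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Return (strand, frame, start, end, orf) for the longest ORF before each in-frame stop."""
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i  # keep the first ATG: it gives the longest ORF
                elif codon in STOPS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, s[start:i + 3]))
                    start = None  # look for a new ORF after the stop
    return orfs


print(find_orfs("GGATGAAAATGCCCTAAGG"))
```

For real bacterial data one would raise `min_codons` toward the 100-codon range mentioned above to filter out short random ORFs.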

Open Reading Frame (ORF).
The regulatory regions cannot be found using exact string matching, or even regular expressions. The following slides will give you an idea of how this can be done.

Promoter regions

Calculating a Sequence Logo Information theory (Sequence Logos)
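
As a rough sketch of the calculation behind a sequence logo: for each column of a set of aligned sites, the information content is the maximum possible entropy (2 bits for DNA) minus the observed entropy of that column. The code below omits the small-sample correction that is normally applied, and the aligned sites are made-up examples, not real promoter data; `information_content` is an illustrative name.

```python
import math
from collections import Counter


def information_content(aligned_sites, alphabet="ACGT"):
    """Bits of information per column: log2(|alphabet|) minus the column entropy."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        n = len(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits.append(math.log2(len(alphabet)) - entropy)
    return bits


sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]  # hypothetical aligned sites
for position, value in enumerate(information_content(sites), start=1):
    print(position, round(value, 2))
```

In the logo itself, each letter in a column is drawn with a height equal to its frequency times the information content of that column, so conserved positions stand tall.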

Promoter regions Information theory (Sequence Logos)

Finding Real Start codon
Something like an HMM can be trained to classify whether an ATG codon really is a start codon.
Yin and Wang, GeneScout paper, see course website

Things get complicated with eukaryotes
Eukaryotic genes contain sub-strings of spliced-out junk called introns.

Things get complicated with eukaryotes However, the splicing sites are made of statistically correlated sub-strings.

Things get complicated with eukaryotes A similar HMM strategy can be used to find all splicing sites.

Things get complicated with eukaryotes
• The weights in the model are calculated based on a so-called coding potential:
  • No stop codon
  • Codon preferences in an organism (the right frame will give a much better score)
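
One simple way to turn these two criteria into a number is a log-odds score: reject any frame containing a stop codon, and otherwise add up how much more likely each codon is under the organism's codon preferences than under a uniform 1/64 background. This is a sketch of the idea, not the weighting used in any specific gene finder; `coding_potential`, the `pseudo` frequency for unseen codons and the tiny frequency table are all assumptions made for the example.

```python
import math

STOPS = {"TAA", "TAG", "TGA"}


def coding_potential(seq, codon_freq, frame=0, pseudo=1e-3):
    """Log2-odds score of a candidate coding segment (without its final stop codon)."""
    seq = seq.upper()
    score = 0.0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOPS:
            return float("-inf")  # an internal stop rules the frame out
        # Unseen codons get a small pseudo-frequency instead of zero.
        score += math.log2(codon_freq.get(codon, pseudo) / (1 / 64))
    return score


toy_freq = {"GCC": 0.5, "AAA": 0.5}  # hypothetical codon preferences
segment = "GCCAAAGCC"
for frame in range(3):
    print(frame, round(coding_potential(segment, toy_freq, frame), 2))
```

On this toy input the correct frame scores far higher than the two shifted frames, which is the effect the slide describes.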

Open Reading Frame (ORF).
Things that can go wrong with ORFs:
• The N-terminal part of the gene is truncated because of the presence of a downstream ATG.
• Random occurrence of ORF-like patterns. These would not have a “promoter” pattern upstream of the ATG.
• Eukaryotic genes are internally spliced, which complicates the story, a lot.

Real genes vs. ORF
Real genes are likely to be already documented in the databases (I know, a circular argument), as we will see in the next series of slides. If not, an ORF has to be sequenced from a cDNA library instead of a genomic DNA library to be proven to be a gene.

BLASTING sequences
