Automation for Genomics Discovery at the Oklahoma Genome Center

Bruce A. Roe, Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019 • Working Innovation into the Drug Discovery Pipeline, June 3, 2004, Houston Marriott Medical Center

Central Dogma of Molecular Biology • Each chromosome (DNA) contains hundreds of genes • A gene is transcribed into RNA • RNA is processed and transported to give mRNA or stable RNAs • mRNA is translated into protein

What is a GENOME? For humans, it is the complete set of 23 chromosome pairs that we inherited from our parents. The human genome contains all the information needed to make a human. Most bacteria have only a single chromosome that represents their genome and contains all the information needed to make that bacterium.

Human Genome Project Goals 1998-2003 • Achieve ~5-fold coverage of at least 90% of the genome in a “working draft” based on mapped clones, and finish one-third of the 3 billion base pair human genomic DNA sequence by the end of 2000 • Finish the complete human genome sequence by the end of April 2003, marking the 50th anniversary of the discovery of the double helix structure of DNA by Watson and Crick • Make the sequence totally and freely accessible • Reduce the cost of DNA sequencing to 25 cents/base over this 5 year period by developing new technologies • Study human genome sequence variation by creating a Single Nucleotide Polymorphism (SNP) map with at least 100,000 markers
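The ~5-fold coverage goal can be related to expected completeness with a simple Poisson model of random sequencing (the Lander-Waterman idealization). This is a minimal sketch and not the project's own calculation: it assumes reads land uniformly at random, and the 500 base read length is a hypothetical value chosen only for illustration.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Poisson estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


def reads_needed(genome_bp: int, read_len_bp: int, coverage: float) -> int:
    """Reads required so that reads * read length / genome size reaches the target coverage."""
    return math.ceil(coverage * genome_bp / read_len_bp)


# 5-fold coverage of the sampled region under the idealized model
print(f"{expected_fraction_covered(5):.4f}")

# Hypothetical 500-base reads over a 3 billion base pair genome
print(reads_needed(3_000_000_000, 500, 5))
```

Under these assumptions, 5-fold random coverage leaves well under 1% of the sampled bases unread, which is why a separate closure and finishing step is still needed for the remaining gaps.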

How Far Have We Come as of June 2004? • Over 99% of the ~3.15 billion bases in the human genome were sequenced to completion as of April 2003. All the data are available in the public databases. • Ten human chromosomes (7, 9, 10, 13, 14, 19, 20, 21, 22, Y) have been annotated and published, and the remaining 14 are in the final phases of annotation. • There are fewer than 400 gaps in the sequence of the 24 chromosomes (22 numbered chromosome pairs plus X and Y). • The cost of completed genomic DNA sequencing is slightly less than 8 cents/finished base with the development of improved automation. • Three quality-checking exercises were held, in which two groups checked the quality of another group's sequence both in silico and by re-sequencing. http://www.ncbi.nlm.nih.gov/genome/seq/HsHome.shtml

How do we sequence DNA? • The process is similar to taking many copies of a newspaper, shredding them, then trying to put together a copy of the original newspaper. • This is accomplished by breaking many copies of the DNA into small pieces and determining the order of the four bases in each of these small pieces. • Then, we overlap the small sequenced pieces to obtain the sequence of the original, larger DNA.
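The shred-and-overlap idea can be sketched as a toy greedy assembler. This is an illustration only, not the method used by Phrap or any production assembler: it assumes error-free reads from a single strand, no repeats, and no read fully contained in another. The function names and the example sequence are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; remaining pieces stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping "shredded" pieces of a made-up 15-base sequence, in arbitrary order
pieces = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
print(greedy_assemble(pieces))
```

Reads that share no sufficient overlap are left as separate contigs, which is the toy analogue of the gaps handled in the closure step.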

Sequence Pipeline at the University of Oklahoma Genome Center, OU-ACGT • DNA • DNA shearing (Hydroshear™) • Colony picking (QPixII™) • Growing subclones (HiGro™) • Subclone isolation I (Mini-Staccato™) • Subclone isolation II (VPrep™) • Thermocycling (ABI 9700) • Sequencing (ABI 3700) • Data assembly and analysis • Closure • Primer synthesis • Liquid handling • AMS-90 for PCR product analysis • GenBank

Hydroshear • GeneMachines, Inc., San Carlos, CA • Precision-drilled ruby orifice • 500 µl syringe pump • Pump retraction speed range 0-40 • A 100 to 300 µl sample sheared at a retraction speed setting of 10 produces 1-4 Kbp DNA fragments

Genetix QPixII Colony Picker • Digitizes colonies and picks in batches of 96 into 384-well plates • Pins are sterilized after each set of 96 colonies is picked

Cell Growth in 384 well plates in a HiGro • Capacity: 48 shallow 384-well plates or 24 deep-well plates. • Cells are grown in TB medium supplemented with salts and antibiotic. • Cells are shaken at 520 rpm for 22 hours at 37°C. • After 3.5 hours, oxygen is added at 0.5 ft³/min for 0.5 second every 30 seconds.

Zymark SciClone with Twister II • 384-tip pipettor • 4 built-in shakers • Robotic 384-well plate loader and stacker

Subclone Isolation I (Mini-Staccato) • This Zymark robot has a 384-cannula array, four built-in shakers, three attached storage racks, built-in barcoding and a Twister II robotic arm. • This automation has allowed us to perform the DNA isolation completely unattended from as many as 80 384-well plates of bacterial cells per day.
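The stated daily maximum can be turned into a subclone count with simple arithmetic. This is a back-of-the-envelope sketch: the batch size of four plates is an assumption taken from the four built-in shakers, and the variable names are illustrative.

```python
WELLS_PER_PLATE = 384
PLATES_PER_DAY = 80    # "as many as 80" plates per day, from the slide
PLATES_PER_BATCH = 4   # assumption: one plate per built-in shaker

subclones_per_day = PLATES_PER_DAY * WELLS_PER_PLATE
batches_per_day = PLATES_PER_DAY // PLATES_PER_BATCH

print(f"Up to {subclones_per_day:,} subclones per day in {batches_per_day} batches")
```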

Subclone Isolation I (Mini-Staccato) The initial lysis solution (NaOH and SDS) is added to each of four 384 well plates containing bacterial cells that were loaded onto the built-in shakers incorporated into the SciClone workspace deck.

Subclone Isolation I (Mini-Staccato) The second solution, TE-RNase A, is added to each of the 384 well plates and again shaken on the four auto-centering magnetic shakers on the SciClone workspace deck.

Subclone Isolation I (Mini-Staccato) Once all three lysis solutions are added and the plates are shaken after each addition, the plates are transferred from the SciClone workspace deck to a storage rack by the Twister II robotic arm.

Fluorescent DNA Sequencing • A dye terminator-labeled nested fragment set of DNA copies is made from a template with unknown sequence in a single reaction tube • Reaction products are applied to a single gel lane or capillary and electrophoresed to separate the nested fragment set • A laser and detector read the labeled fragments, and the sequence information is fed into a computer

Subclone Isolation and Sequencing Reaction Pipetting (Velocity 11 VPrep) • Liquid handling station with 384-channel pipettor head • Four movable shelves on either side of the pipettor head • Used for subclone isolation, sequencing reaction set-up and, as shown here, the ethanol-acetate precipitation clean-up step.

Thermocycling (ABI 9700) • Subclone sequencing conditions: 95°C for 2:00 • Then 60 cycles of 95°C for 0:30, 50°C for 0:20 and 60°C for 4:00 • Then hold at 4°C (∞)
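The cycling conditions on this slide imply a minimum run time that can be estimated directly. The sketch below assumes the usual ordering for cycle sequencing (an initial 2:00 at 95°C, then 60 cycles of 95°C, 50°C and 60°C, then a 4°C hold); that ordering is read from the slide layout rather than stated, and ramp times between temperatures are ignored.

```python
# (temperature in degrees C, hold time in seconds); ordering is an assumption
INITIAL_DENATURE = (95, 120)
CYCLE_STEPS = [(95, 30), (50, 20), (60, 240)]
N_CYCLES = 60


def program_seconds(initial, cycle_steps, n_cycles):
    """Total programmed hold time, excluding ramping and the final 4 C hold."""
    return initial[1] + n_cycles * sum(seconds for _, seconds in cycle_steps)


total = program_seconds(INITIAL_DENATURE, CYCLE_STEPS, N_CYCLES)
hours, remainder = divmod(total, 3600)
print(f"{total} s = {hours} h {remainder // 60} min before the 4 C hold")
```

Because the 4:00 extension step dominates each cycle, the number of cycles is the main lever on thermocycler occupancy.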

Capillary Electrophoresis DNA Sequencing • Our present capacity is fourteen 96-capillary ABI 3700 electrophoresis-based DNA sequencing instruments, each capable of analyzing two 384-well thermocycle plates or eight 96-well thermocycle plates per day. • The DNA sequencing data are transferred to the Sun computer workgroup for base calling (Phred), assembly (Phrap) and analysis (Consed).
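The instrument count and plate throughput give a rough ceiling on daily sequencing reactions. This sketch assumes the per-day plate figure applies to each of the fourteen instruments and that every well yields one read; both are simplifying assumptions.

```python
INSTRUMENTS = 14
PLATES_384_PER_INSTRUMENT_PER_DAY = 2
PLATES_96_PER_INSTRUMENT_PER_DAY = 8

reads_via_384 = INSTRUMENTS * PLATES_384_PER_INSTRUMENT_PER_DAY * 384
reads_via_96 = INSTRUMENTS * PLATES_96_PER_INSTRUMENT_PER_DAY * 96

# Two 384-well plates and eight 96-well plates hold the same number of samples
print(f"{reads_via_384:,} reads/day (384-well) vs {reads_via_96:,} reads/day (96-well)")
```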

Primer synthesis (Mermade IV) for PCR-based closure and finishing • Standard phosphoramidite chemistry in an argon-filled reaction chamber. • 192 primers synthesized at 2.5 nmole scale, twice each day. • The 2.5 nanomole synthesis (50 cents/oligo) typically is used for either PCR or DNA sequencing primers, but can be scaled to 10 nanomole.

Data assembly and Analysis • Phred/Phrap/Consed and Exgap • Sun V880 server with 32 GB RAM running Solaris 8 OS and 3 TB of data stored on RAID-5 arrays with autoloader tape backup • Also: 12 workstations, each with 1 GB RAM
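Phred reports a quality value for each called base, defined as Q = -10 log10(P), where P is the estimated probability that the call is wrong. The helper names and the example read below are illustrative, not part of the Phred program itself.

```python
import math


def error_probability(q: float) -> float:
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p: float) -> float:
    """Phred quality value for an estimated error probability p."""
    return -10 * math.log10(p)


def expected_errors(qualities) -> float:
    """Expected number of wrong calls in a read, given per-base qualities."""
    return sum(error_probability(q) for q in qualities)


print(error_probability(20))                 # Q20 corresponds to a 1 in 100 error rate
print(round(phred_quality(0.001)))           # a 1 in 1000 error rate is Q30
print(expected_errors([20, 20, 30, 40]))     # hypothetical 4-base read
```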

Sanger, Keio, Wash U, OU

Human Chromosome 22 Sequence Features • 39 % of the sequence is occupied by genes including their introns, 5’ and 3’ non-translated regions. • 3 % of the complete sequence encodes the protein products of these genes. • 42 % of the sequence is composed of repetitive sequences, compared to 46 % for the entire genome. • Only slightly over half of the genes predicted for human chromosome 22 can be experimentally validated.* * Shoemaker DD., et al. Experimental annotation of the human genome using microarray technology. Nature. 409, 922-7 (2001).

An Individual’s Genome Differs from the DNA of: • Siblings by 1 to 2 million bases, ~99.98% identical, with coding regions 99.99999% identical • Unrelated humans by 6 million bases, ~99.8% identical overall, with coding regions 99.9999% identical • Chimpanzees by about 100 million base pairs, ~98% identical • Baboons by about 300 million base pairs, ~92% identical • Mice by about 2.8 billion bases, but coding regions are ~90% identical • Leaf spinach by about 2.9 billion bases, but coding regions are ~40% identical

Differences between individuals AGCCACACAGTGTCCACCGGATGGTTGATTTTGAAGCAGAGTTAGCTTGTCACCTGCCTCCCTTTCCCGGGACAACAGAAGCTGACCTCTTTGNTCTCTTGCGCAGATGATGAGTCTCCGGGGCTCTATGGGTTTCTGAATGTCATCGTCCACTCAGCCACTGGATTTAAGCAGAGTTCAAGTAAGTACTGGTTTGGGGAGNAGGGTTGCAGCGGCNGAGCCAGGGTCTCCACCCAGGAAGGACTNATCGGGCAGGGTGTGGGGAAACAGGGAGGTTGTTCAGATGACCACGGGACACCTTTGACCCTGGCCGCTGTGGAGTGTTTGTGCTGGTTGATGCCTTCTGGGTGTGGAATTGTTTTTCCCGGAGTGGCCTCTGCCCTCTCCCCTAGCCTGTCTCAGATCCTGGGAGCTGGTGAGCTGCCCCCTGCAGGTGGATCGAGTAATTGCAGGGGTTTGGCAAGGACTTTGACAGACATCCCCAGGGGTGCCCGGGAGTGTGGGGTCCNAGCCAG The yellow underlined sequence is the first exon of the BCR gene involved in leukemia. Only 5 bases (N) differ in non-gene regions.

Human Chromosome 22 Single Nucleotide Polymorphisms* • Number of overlaps: 335 • Size of overlaps: 13,203,147 bp • Number of SNPs: 11,116 (~1/1000 bp) • Number of substitutions: 9,123 (82%) • Number of ins/del: 1,993 (18%) • Only 48 of the 11,116 SNPs were in coding regions, ~10 fold lower than in non-coding regions • * E. Dawson, et al. A SNP Resource For Human Chromosome 22: Extracting Dense Clusters of SNPs from the Genomic Sequence. Genome Research, 11, 170-178 (2001).
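The summary figures on this slide can be checked from the raw counts. This is a minimal sketch using only the overlap size, the SNP total, the substitution count and the coding-region count quoted above; the variable names are illustrative.

```python
OVERLAP_BP = 13_203_147
TOTAL_SNPS = 11_116
SUBSTITUTIONS = 9_123
CODING_SNPS = 48

bp_per_snp = OVERLAP_BP / TOTAL_SNPS
substitution_pct = 100 * SUBSTITUTIONS / TOTAL_SNPS
coding_pct = 100 * CODING_SNPS / TOTAL_SNPS

print(f"About 1 SNP per {bp_per_snp:,.0f} bp of overlap")
print(f"Substitutions: {substitution_pct:.0f}% of SNPs")
print(f"SNPs in coding regions: {coding_pct:.2f}% of SNPs")
```

The density works out to roughly one SNP per 1,200 bp, consistent with the rounded "~1/1000 bp" on the slide.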

“We each are like a different symphony orchestra” “All playing the same instruments slightly differently”

Good news and Bad news • Good news: <40,000 genes (counting dark space?) • Bad news: 2-4 times as many proteins as other species due to extensive alternative splicing in humans. • We only know the function of about half the predicted genes. • Likely > 1 million different gene products based on alternative splicing and post-translational modifications.

Where we stand now • We essentially have the ‘dictionary’ with all the words (genes) spelled correctly, but only slightly more than half of the words (genes) have definitions. • Slightly over half of the 936 genes predicted for human chromosome 22 have been experimentally validated. • 223 have a known function and expression • 172 have no known function but evidence for expression • 182 have no known function and no evidence for expression • 228 pseudogenes • Through comparative genomic sequencing we can annotate the human genome based on evolutionarily conserved gene sequences and use model systems to study gene expression.

If a genomic region is conserved in evolutionarily distant organisms, it likely has been maintained through selective pressure over evolutionary time because it performs a necessary function.

Chimpanzee and Baboon Genomic Sequencing • Medically important model eukaryotic organisms • The chimpanzee is our nearest evolutionary relative with a genome that has ~98 % sequence identity with the human genome • The baboon genome has ~92 % sequence identity with the human genome

PIP plot of a region of human chr22 compared to syntenic regions of baboon and mouse • Annotations: human-specific repeat regions; a questionable gene present in primates but not in rodents

34 Kbp deletion in baboon

Exons in one copy of a zebrafish duplicated gene with 75% homology to human but greatly diverged, <50% homology, in the other copy

A complementary approach is to determine if the predicted protein coding conserved elements are functional by investigating their expression profiles during development.

Whole mount in situ hybridization using zebra fish as the model organism • “Small people that swim in the water and breathe through gills…” (Han Wang, OU)

Zebra fish as a model system • Have a short, ~3 month, time to reproductive maturity. • Can be easily bred in the lab in large numbers. • Are small in size: an adult is just a few centimeters long. • Have an ~5 day embryonic development period from fertilized egg to a swimming fish. • The embryos are transparent, making it easy to see internal organs during development. • Are well established as a resource for genetic studies. • The Sanger Institute is completing the genome sequence, which presently is ~50% complete and publicly available. • More than 90% of the predicted human genes have a zebra fish ortholog.

Whole mount in situ hybridization • 1. Add digoxigenin (DIG)-labeled ssDNA or RNA probe (carrying digoxigenin-labeled uridine) complementary to the mRNA of interest, then wash • 2. Add alkaline phosphatase-conjugated anti-DIG antibody that binds to digoxigenin, then wash • 3. Add BCIP* + NBT**, which turn into a dark purple dye when dephosphorylated by the alkaline phosphatase, thereby coloring the cell • *BCIP = 5-bromo-4-chloro-indoxyl phosphate **NBT = nitro-blue-tetrazolium

Exon-specific ssDNA probes • 1. Mermade synthesis of unique exon-specific primers of the gene of interest • 2. PCR off zebra fish genomic DNA • 3. Unidirectional amplification with either forward or reverse (nested) primers in the presence of DIG-labeled dUTP • 4. ssDNA (sense and antisense probes) • These steps now have been automated in a 96 well format
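Sense and antisense probes are reverse complements of each other, so a probe made from one strand hybridizes to the transcript while the other serves as a control. The sketch below shows that relationship for a made-up exon fragment; the sequence and function name are hypothetical and do not come from the slides.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical exon fragment written 5'->3' in the sense (mRNA-like) orientation
sense_probe = "ATGGCGTACGTTAGC"
antisense_probe = reverse_complement(sense_probe)

print(sense_probe)
print(antisense_probe)  # this strand can base pair with the mRNA
```

Applying the function twice returns the original sequence, which is a quick sanity check when designing forward and reverse primers.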

Ethidium bromide stained 1% agarose gel of dsPCR off genomic DNA and subsequently unidirectionally amplified single stranded DNA probes • Lanes: size markers (1078, 603 and 310), then PCR, forward (F) and reverse (R) products for each of five samples • These studies clearly demonstrate that, contrary to popular belief, single stranded DNA contains regions that fold into sufficient double stranded secondary structure for ethidium bromide to bind. • However, agarose gel electrophoresis is labor intensive (slab gel preparation and loading), electrophoresis is time consuming, and detection typically requires the use of carcinogenic ethidium bromide.

AMS-90 for ssPCR primer, dsPCR and single strand unidirectional exon amplification

PCR and Unidirectional Single Primer Amplification on the AMS-90 • Lanes: ds PCR product followed by forward (F) and reverse (R) single strand uni-directional products, for each of four samples • Size scale (bases): 7000, 4900, 2900, 1900, 1100, 700, 500, 300, 100, 15 • Both double and single stranded DNA can be rapidly resolved, detected and archived on the AMS-90

Custom MerMade Synthesized 20-mer DNA Primers Rapidly Analyzed on the AMS-90 • Lanes: decreasing 20-mer concentration (2.0, 1.0, 0.5, 0.25, 0.12, 0.06 µg/µl) followed by increasing 20-mer concentration (0.06, 0.12, 0.25, 0.5, 1.0, 2.0 µg/µl) • Size scale (bases): 7000, 4900, 2900, 1900, 1100, 700, 500, 300, 100, 15 • Rapid analysis of single stranded oligonucleotides: 30 seconds/lane run time vs over an hour/sample via capillary electrophoresis

AMS-90 vs Ethidium Bromide Stained Agarose Gels or Capillary Electrophoresis • All can be used to resolve and view both double stranded and single stranded DNAs • However, analysis on the AMS-90: • requires minimal human interaction, • requires no separate photography, • requires much less technician time, • eliminates the use of carcinogenic ethidium bromide, • is less error prone and • takes much less time.

Human hypothetical protein-KIAA0819 • One gene with 11 exons on Hu Chr 22 • This one gene is split into 2 genes in zebra fish • ZF1 - Genomic location: 307,280-316,461 bp on Sanger Institute chromosome fragment ctg14067, with the first 4 exons • ZF2 - Genomic location: 107,344-119,287 bp on Sanger Institute chromosome fragment ctg11065, with the remaining 7 exons • Note: 4 + 7 = 11
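The genomic spans of the two zebra fish copies follow from the coordinates on this slide. The sketch assumes the coordinates are 1-based and inclusive, which the slide does not state; the dictionary layout is illustrative.

```python
# (contig, start, end, exon count) as given on the slide
ZEBRAFISH_COPIES = {
    "ZF1": ("ctg14067", 307_280, 316_461, 4),
    "ZF2": ("ctg11065", 107_344, 119_287, 7),
}
HUMAN_EXONS = 11


def span_bp(start: int, end: int) -> int:
    """Length of an inclusive coordinate range (assumed 1-based, inclusive)."""
    return end - start + 1


spans = {name: span_bp(start, end) for name, (_, start, end, _) in ZEBRAFISH_COPIES.items()}
total_exons = sum(exons for *_, exons in ZEBRAFISH_COPIES.values())

for name, length in spans.items():
    print(f"{name}: {length:,} bp on {ZEBRAFISH_COPIES[name][0]}")
print(f"Exons across both copies: {total_exons} (human gene: {HUMAN_EXONS})")
```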

A multiPIP analysis of the predicted genes from human, rat, mouse, fugu and zebra fish (ZF1 and ZF2) with homology to cDNA probe KIAA0819 (percent identity axis: 50% to 100%)

Orthologous duplicated copies of a single copy human KIAA0819 gene in zebra fish • Single human KIAA0819 gene • Two zebra fish KIAA0819 gene orthologs (ZF1 and ZF2)
