BINF 101 Introduction to Bioinformatics

BINF 101 Introduction to Bioinformatics. Arthur W. Chou Dept of Math and Computer Science Clark University February 11, 2008. Human Genome Project. Historical context Goals of the HGP Strategy Results Impact on biomedical research. February 2001. « Finished » sequence April 2003.

Presentation Transcript


  1. BINF 101 Introduction to Bioinformatics Arthur W. Chou Dept of Math and Computer Science Clark University February 11, 2008

  2. Human Genome Project • Historical context • Goals of the HGP • Strategy • Results • Impact on biomedical research

  3. February 2001 « Finished » sequence April 2003

  4. Brief history of HGP • 1984 to 1986 - first proposed at US DOE meetings • 1988 - endorsed by the US National Research Council (funded by NIH and US DOE; $3 billion set aside) • 1990 - Human Genome Project started (NHGRI); later joined by the UK, France, Japan, Germany, and China • 1998 - Celera announces a 3-year plan to complete the project years early • February 2001 - first draft published in Science and Nature • April 2003 - finished human genome sequence announced

  5. Goals of HGP • Create a genetic and physical map of the 24 human chromosomes (22 autosomes, X & Y) • Identify the entire set of genes & map them all to their chromosomes • Determine the nucleotide sequence of the estimated 3 billion base pairs • Analyze genetic variation among humans • Map and sequence the genomes of model organisms

  6. Model organisms • Bacteria (E. coli, Haemophilus influenzae, several others) • Yeast (Saccharomyces cerevisiae) • Plant (Arabidopsis thaliana) • Roundworm (Caenorhabditis elegans) • Fruit fly (Drosophila melanogaster) • Mouse (Mus musculus)

  7. Goals of HGP (II) • Develop new laboratory and computing technologies to make all this possible • Disseminate genome information • Consider ethical, legal, and social issues associated with this research

  8. Two Competing Strategies for Human Genome • Hierarchical shotgun [Public human genome project] Map First, Sequence Later • Create a set of mapped large-insert clones to use as sequencing substrate • Whole-genome Shotgun [Celera project] Sequence First, Map Last • Create a genomic library (or libraries), sequence the clone ends at random, and use a computational approach to assemble the random fragments into contiguous stretches of sequence

  9. 2 Strategies for Sequencing Human Genome

  10. Map First Sequence Later • Sort chromosomes • For each chromosome clone large fragments of DNA • Map clones • Identify set of clones that span the chromosome • Shotgun sequence each clone • Finish (close gaps)
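
The "identify set of clones that span the chromosome" step amounts to choosing a minimal tiling path through the mapped clones. A minimal sketch, assuming each clone has already been placed on the map as a `(name, start, end)` interval; the function name, the tuple layout, and the coordinates are illustrative assumptions, not taken from the slides:

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) in mapped coordinates (assumed layout)


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover: among clones starting at or before the
    position covered so far, take the one that reaches furthest right."""
    path: List[Clone] = []
    covered = region_start
    remaining = sorted(clones, key=lambda c: c[1])
    i = 0
    while covered < region_end:
        best = None
        # Scan every not-yet-seen clone that overlaps the covered region.
        while i < len(remaining) and remaining[i][1] <= covered:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 150), ("D", 170, 300), ("E", 140, 200)]
path_names = [name for name, _, _ in minimal_tiling_path(clones, 0, 300)]
```

Here clones C and E are redundant, so only A, B, and D would go on to shotgun sequencing. A gap in the map raises an error, which corresponds to a region that still needs clones before sequencing can start.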

  11. Assembling Genome Sequencing Data STS: Sequence-Tagged Sites

  12. Sequenced-clone contigs are merged to form scaffolds of known order and orientation

  13. Sequence First Map Last (Whole Genome Shotgun) • Isolate genomic DNA • Construct clone libraries of varying sizes • Make sure library is random or nearly so • Sequence both ends of each clone • Assemble the sequences computationally • Finish (close gaps)

  14. Whole-genome shotgun sequencing • Whole genome randomly sheared three times • Plasmid library constructed with ~2 kb inserts • Plasmid library with ~10 kb inserts • BAC library with ~200 kb inserts • Computer program assembles sequences into chromosomes • No physical map construction

  15. Genome Sequencing Strategies: WGS • Restrict and make small- and large-insert clone libraries • End-sequence all clones and retain pairing information (“mate-pairs”) • Find sequence overlaps among all clone end sequences • Collapse overlaps into contigs (WGS contigs)
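
The "find overlaps, collapse into contigs" steps can be illustrated with a toy greedy assembler. This is a sketch under strong assumptions (error-free reads, one strand only, no repeats, no read contained inside another), not the method used by Celera or the public project; `overlap`, `greedy_assemble`, and `min_len` are hypothetical names:

```python
from itertools import permutations
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if that length is below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least min_len bases."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_pair = 0, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_pair = k, (a, b)
        if best_pair is None:
            return contigs  # nothing left to merge
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_k:])  # collapse the shared bases once


contigs = greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG", "GGGGAAAA"])
```

The three overlapping reads collapse into one contig, while the unrelated read stays separate; in a real project that separation is a gap to be closed in finishing.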

  16. Genome Sequencing Strategies Constructing Supercontigs (scaffolds)
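
Scaffold construction uses the retained mate-pair information to order contigs that share no sequence overlap. A minimal sketch, assuming the mate-pairs have already been reduced to "left contig precedes right contig" links, that each contig has at most one neighbor on each side, and that orientation is already resolved; all names here are illustrative:

```python
from typing import Dict, List, Set, Tuple


def build_scaffolds(contigs: List[str], links: List[Tuple[str, str]]) -> List[List[str]]:
    """Chain contigs into scaffolds from (left, right) mate-pair links.
    Assumes at most one successor and predecessor per contig, and no cycles."""
    successor: Dict[str, str] = {}
    has_predecessor: Set[str] = set()
    for left, right in links:
        successor[left] = right
        has_predecessor.add(right)

    scaffolds: List[List[str]] = []
    for contig in contigs:
        if contig in has_predecessor:
            continue  # this contig sits inside a chain, not at its start
        chain = [contig]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


scaffolds = build_scaffolds(
    ["c1", "c2", "c3", "c4"],
    [("c3", "c1"), ("c1", "c4")],  # hypothetical mate-pair links
)
```

Contig c2 has no links and remains a scaffold of its own. Real scaffolders also weigh conflicting links and use insert sizes to estimate the gap between neighboring contigs.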

  17. Working Draft Sequence (figure: draft sequence with gaps)

  18. The Genome is Who We Are on the inside! Information coded in DNA • Chromosomes consist of DNA • molecular strings of A, C, G, & T • base pairs, A-T, C-G • Genes • DNA sequences that encode proteins • less than 3% of human genome
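
Because A pairs with T and C pairs with G, one strand determines the other. A short sketch of computing the reverse complement, which is how the opposite strand is read in its own 5' to 3' direction (the function name is an illustrative choice):

```python
# Translation table for the base-pairing rules A-T and C-G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


opposite = reverse_complement("ATGC")
```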

  19. 5000 bases per page (the slide shows one full page of raw genomic sequence as an illustration; it begins CACACTTGCATGTGAGAGCTTCTAATATCTAAATTAATGTTGAATCATTATTCAGAAACAGAGAGCTAACTGTTATCCCATCCTGACTTTATTCTTTATG and continues in the same form for the rest of the page)

  20. How much data make up the human genome? • 3 pallets x 40 boxes per pallet x 5000 pages per box x 5000 bases per page = 3,000,000,000 bases! • Accurate sequence requires 6-fold coverage • Now: shred 18 pallets (6 x 3) and reassemble.
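
The arithmetic on this slide can be checked directly. The last function is an addition not on the slide: the standard Poisson (Lander-Waterman) approximation, which assumes reads land uniformly at random and gives one common rationale for aiming at roughly 6-fold coverage.

```python
import math

# Figures from the slide.
PALLETS = 3
BOXES_PER_PALLET = 40
PAGES_PER_BOX = 5000
BASES_PER_PAGE = 5000
COVERAGE = 6

genome_bases = PALLETS * BOXES_PER_PALLET * PAGES_PER_BOX * BASES_PER_PAGE
pallets_to_shred = PALLETS * COVERAGE
bases_to_sequence = genome_bases * COVERAGE


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson approximation (an assumption, not from the slide): with
    uniformly random reads, a given base is missed with probability e^-coverage."""
    return math.exp(-coverage)


missed = expected_uncovered_fraction(COVERAGE)
```

Under that approximation, 6-fold coverage leaves roughly a quarter of one percent of bases unsampled, which is why a finishing step to close gaps is still needed.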

  21. Important features of Human Genome • 20,000 – 25,000 protein-coding genes (2006) • Proteome (full set of proteins) more complex than those of invertebrates • Pre-existing components arranged into richer architectures • Hundreds of genes seem to come from horizontal transfer from bacteria.

  22. Human races have similar genes • Genome sequencing centers have sequenced significant portions of the genomes of at least three races (DNA from 5 humans: 2 males, 3 females; 2 Caucasians, one each of African, Asian, Hispanic) • The range of polymorphisms within a race can be much greater than the range of differences between any two individuals of different races • Very few genes are race specific.

  23. Comparative Genomics

  24. Genome Sizes (MegaBases)

  25. Questions Remain about the Human Genome • Difficult to precisely estimate number of genes at this time • Small genes are hard to identify • Some genes are rarely expressed and do not have normal codon usage patterns – thus hard to detect
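
One way to see why small genes are hard to identify is a naive open reading frame (ORF) scan with a minimum-length cutoff. This is a deliberately simplified sketch: it scans only the forward strand and ignores introns, so it is not how human genes are actually predicted, and `find_orfs` and `min_codons` are hypothetical names.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int) -> List[Tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs (ATG ... stop) in all
    three frames. `end` is exclusive and includes the stop codon; the
    length cutoff counts codons from ATG up to, not including, the stop."""
    seq = seq.upper()
    orfs: List[Tuple[int, int]] = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


seq = "CCATGAAATTTGGGTAACC"
short_cutoff = find_orfs(seq, min_codons=3)
long_cutoff = find_orfs(seq, min_codons=5)
```

Short ORFs occur by chance throughout any sequence, so a scan needs a length cutoff to limit false positives, and that same cutoff discards genuinely small genes, as the 4-codon ORF here shows when the cutoff is raised to 5.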

  26. Trends in Genomics (figure labels: Data Acquisition, Computation, Annotation, Data Integration; Now vs. Next)

  27. Impact of Human Genome on Biomedical domain

  28. Applications to medicine and biology • Disease genes: human genomic sequence in public databases allows rapid identification of disease genes in silico • Drug targets: the pharmaceutical industry has depended upon a limited set of drug targets to develop new therapies; new targets can now be found in silico • Basic biology

  29. Genomic Medicine • Anticipatory, not reactive • Predictive, preventive and personalized • Knowledge from genomics and derivative disciplines • Screening of individuals and populations • New analytical technologies and bioinformatics approaches

  30. Genomic Medicine: Challenges • Requires change in culture • Practitioners: preventive medicine approach, less independence • Population: lifestyle, participation in large clinical trials

  31. Genomic Medicine: Challenges • Requires change in medical data analysis • Driven by lab data • Huge amount of information • Pattern recognition

  32. Genomic Medicine: Challenges • Requires change in diagnostic approach • Trust in computer algorithms • Evaluation of matrices and dynamic complex systems, not linear pathways

  33. Genomic Medicine: Consequences • Drugs tailored to individual’s genetic make-up to improve efficacy and reduce side effects • Reduction of the burden of chronic illness • Decrease in the prevalence of common complex diseases
