
What is bioinformatics?


claire

Presentation Transcript


  1. What is bioinformatics? • Development of databases to store and manipulate genomic and proteomic data? • Or more broadly does it mean Computational Biology? • Curricula review suggests it is the study of two information flows in molecular biology* • *Altman, RB. 1998. A curriculum for bioinformatics: the time is ripe. Bioinformatics 14:549-550.

  2. First information flow is the central dogma of molecular biology • Use bioinformatics applications to address transfer of information within the central dogma, including organization and control of transcriptional units, prediction of protein structure from sequence, and the analysis of molecular function.

  3. Second flow is based on scientific method • We create hypotheses, design experiments to test these hypotheses, evaluate data, and extend or modify hypotheses • Bioinformatic applications address the transfer of info within this protocol

  4. BIOS 480 Goals • Provide a comprehensive understanding of current methods in biological sequence analysis • Assess challenges and approaches in new bioinformatics-related disciplines • Provide in-depth, hands-on experience in design and implementation of bioinformatics tools

  5. Grades • Attendance + participation = 20% • Homework and assignments = 25% • Laboratory assignments = 40% • Final exam = 15% • www.uwp.edu/~barber/bioinformatics/BIOS480.htm has lectures and important materials for this class

  6. Datasets for bioinformatics analyses • Genome sequences • Macromolecular structures • Functional genomics experiments

  7. Wide range of bioinformatics techniques • Sequence alignments • Motif identification • Gene prediction • Phylogeny • RNA and protein structural bioinformatics • Proteomics • Microarrays, protein chips, two-hybrid screens • Metabolomics

  8. The advent of genome sequencing brought bioinformatics into its own • Yeah, but now that the human genome is “done”, isn’t genomics “done”? • No

  9. Capillary and slab gel electrophoresis use a modified Sanger technology with fluorescent dyes • [Figure: sequencing trace reading A G C T A G C A T C C G T A T] • Typical reads of 500-750 nt on an hour timescale • Variation depending on sequencer

  10. Microfabricated Capillary Arrays • Etch a glass chip with T-shaped channels that are 7 cm long and µm-scale in depth and width; one can devise a 96-well chip capable of 150,000 bases/h • Miniaturization is one booming field driving bioinformatics

  11. Free Solution Electrophoresis • Possibly will improve separation time (no matrix) without losing read length • Label DNA molecules with friction increasing molecule such as streptavidin • Currently can read 100 bp, a long way to go…

  12. Who needs electrophoresis? • Pyrosequencing • MALDI-TOF Mass Spectrometry • Sequencing by Hybridization • Massively Parallel Signature Sequencing • A testimony to innovative molecular biology • Single molecule methods

  13. Pyrosequencing • Real-time sequencing that measures the release of PPi during DNA synthesis • Has been of particular use for SNP analysis • Each of the four deoxynucleotide triphosphates is added to the reaction in turn; when the correct one is incorporated, PPi is released and measured using ATP sulfurylase-coupled ATP synthesis and luciferase; wash and repeat

  14. Put the sequencing reactions through a mass spectrometer • Spectra of the C- and G-terminated oligonucleotides • Current limit ~100 bp • Facilitated by sensitivity and high-throughput loading

  15. Potential innovations in DNA sequencing • Sequencing by hybridization • Cot-based analysis • http://www.msstate.edu/research/mgel/cotfig.htm • Chip-based analysis • http://www.hyseq.com/content/131.php • http://citeseer.nj.nec.com/context/471959/0 • Linear Read http://www.usgenomics.com/about/index.shtml

  16. Cot analysis

  17. Growth in genomic technology • U.S. Genomics's technology platform, the GeneEngine™, has two components, (1) nanotechnology systems for positioning DNA so that it can be read linearly (broadly termed DNA Delivery Mechanism(s)™) and (2) detection technologies that allow the reading of information from the DNA Delivery Mechanism(s)™.(FRET-based??)

  18. Overview of “Shotgun” Genomic Sequencing Original DNA • Break DNA into random fragments (8-10X Coverage) The future looks bright, but what about right now?

  19. Overview of Genomic Sequencing Original DNA • Break DNA into random fragments (8-10X Coverage) • Amplify fragments in a vector and sequence 500-700 bases in from each end Base calling performed by Phred software: http://www.phrap.org/ http://www.genome.org/cgi/reprint/8/3/175.pdf
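
The coverage figure on this slide is simple arithmetic: total bases sequenced divided by genome size. A minimal sketch, assuming idealized reads of uniform length; the 2 Mb genome size and the function names are hypothetical and chosen only for illustration.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Reads required so that total sequenced bases = coverage * genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def coverage_achieved(n_reads, read_length_bp, genome_size_bp):
    """Average number of times each base is sequenced."""
    return n_reads * read_length_bp / genome_size_bp


# Hypothetical 2 Mb genome, 600-base reads (within the 500-700 range on the slide)
GENOME_SIZE = 2_000_000
for cov in (8, 10):
    print(f"{cov}X coverage: {reads_needed(GENOME_SIZE, cov, 600)} reads")
```

Real projects need more reads than this estimate, because low-quality read ends are trimmed and random fragmentation leaves some regions under-sampled.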

  20. Cloning vectors • 2-5 kb in pUC or M13 • 5-50 kb in phage or cosmid • 30-100 kb in P1 bacteriophage • 60-300 kb in BAC • 60-2000 kb in YAC

  21. Phred Software • Calls bases in four phases: • Predicting peaks (ideal locations) • Locating observed peaks • Matching observed to predicted • Finding missing peaks • http://www.genome.org/cgi/reprint/8/3/186.pdf • http://www.genome.org/cgi/reprint/8/3/175.pdf

  22. Errors in Sequencing Reads • Each base call is assigned a quality score: q = -10 × log10(p), where p is the estimated error probability of the call (higher quality scores correspond to lower error probabilities) • Errors are associated with the vicinity of the peak; the following parameters are used to determine error probability on a TRAINING SET: • Peak spacing • Uncalled/called ratio (two window sizes) • Peak resolution • The result is a look-up table inherent to the Phred software
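
The quality-score formula above can be checked numerically. A small sketch of the conversion in both directions; the function names are illustrative and are not part of the Phred software.

```python
import math


def phred_quality(p_error):
    """Quality score q = -10 * log10(p) for an error probability p."""
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Inverse of the formula above: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


for p in (0.1, 0.01, 0.001):
    print(f"p = {p}: q = {phred_quality(p):.0f}")
```

Each increase of 10 in q corresponds to a tenfold drop in error probability, so q = 20 means 1 error in 100 calls and q = 30 means 1 in 1,000.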

  23. Common Sources of Sequencing Errors • The first fifty or so peaks of a trace are noisy and unevenly spaced due to anomalous migration of short DNA fragments and to unreacted dye-primer and dye-terminator molecules. • Near the end of the trace, peaks become less evenly spaced due to less accurate trace processing and less well resolved as diffusion effects increase; the number of labeled molecules also decreases. • Compressions: most common in GC-rich regions, when bases near the end of a single-stranded fragment bind to a complementary region, forming a hairpin (which migrates more rapidly than expected) • The dye-terminator sequencing method helps resolve compressions, but has its own problems: “About 85% of high quality dye terminator errors resulted from a missing G peak following an A, or a missing A following a T,…” Ewing and Green, 1998.

  24. Overview of Genomic Sequencing Original DNA • Break DNA into random fragments (8-10X Coverage) • Amplify fragments in a vector and sequence 500-700 bases in from each end • Assemble fragments of sequence that have been read: Contig 1 Contig 2

  25. Assembly of large DNA sequences • Several assembly programs exist and can be run with different degrees of success: Phrap, TIGR Assembler, CAP, STROLL, etc.

  26. Overlap-layout-consensus • Most fragment assembly algorithms include the following three steps: • Overlap. Finding potentially overlapping fragments. • Layout. Finding the order of fragments. • Consensus. Deriving the DNA sequence from the layout. • New method: http://www.cs.ucsd.edu/groups/bioinformatics/software.html

  27. Assemble these fragments • F1 ATAT • F2 TATT • F3 TTAT • F4 TATA • F5 TAAT • F6 AATA

  28. Did you use a Greedy approach? • Most assemblers use a greedy algorithm: one that always takes the best immediate local solution, picking the highest-scoring overlap, merging the two fragments, and repeating until no more merges can be made
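
A minimal sketch of the greedy approach applied to fragments F1-F6 from the previous slide, assuming error-free reads and exact suffix-prefix overlaps. The function names are illustrative and are not taken from any of the assemblers named in these slides; ties between equal overlaps are broken by list order, so a different input order can give a different layout.

```python
def overlap(a, b):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair with the largest overlap until none remain."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlapping pair left
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


fragments = ["ATAT", "TATT", "TTAT", "TATA", "TAAT", "AATA"]
print(greedy_assemble(fragments))
```

Every input fragment ends up as a substring of some returned contig, but the greedy choice is not guaranteed to give the shortest possible assembly.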

  29. Overlap • The overlap problem is to find the best match between the suffix of one sequence and the prefix of another. • If there are no sequencing errors, simply find the longest suffix of one string that exactly matches the prefix of another string. • Since error rates are small, the common practice is to use a filtration method that filters out pairs of fragments that do not share a significantly long common substring.
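
The two ideas on this slide can be sketched as follows: an exact suffix-prefix match for the error-free case, and a simple filter that keeps only pairs sharing a common substring of length k. This is an illustrative sketch; the names and the choice of k are assumptions, not the method of any particular assembler.

```python
def longest_overlap(a, b):
    """Longest suffix of a that exactly equals a prefix of b (error-free case)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def share_kmer(a, b, k):
    """Filtration step: True if a and b share at least one substring of length k."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers_a for i in range(len(b) - k + 1))


print(longest_overlap("ATAT", "TATT"))   # suffix TAT matches prefix TAT
print(share_kmer("ATAT", "TATT", 3))     # pair survives the filter
print(share_kmer("AAAA", "CCCC", 2))     # pair is filtered out
```

Only pairs that pass the cheap filter need to go on to a full, error-tolerant alignment.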

  30. TIGR assembler • Finds exact 32-base matches between sequences; an alignment between two sequences is scored based on the number and uniqueness of the 32-mer matches (how often does each 32-mer appear?) • Interestingly, 32 was not chosen in a particularly rigorous manner: 16 gave too many alignments, >32 too few

  31. 32-mer table example • AGCTTAGATCTACAAGAGGTATTAGATCTACGGACTA… • 8-mer Occurrences • AGCTTAGA 1 • GCTTAGAT 1 • CTTAGATC 1 • TTAGATCT 2 • Internal repeat sequences are ignored, because they confuse the assembler
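
The table on this slide lists 8-mers rather than 32-mers, and it can be reproduced with a sliding window. A minimal sketch; `kmer_counts` is an illustrative name, not a TIGR Assembler function.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq using a sliding window."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "AGCTTAGATCTACAAGAGGTATTAGATCTACGGACTA"
counts = kmer_counts(seq, 8)

for kmer in ("AGCTTAGA", "GCTTAGAT", "CTTAGATC", "TTAGATCT"):
    print(kmer, counts[kmer])

# k-mers seen more than once mark an internal repeat within the read
repeated = sorted(kmer for kmer, n in counts.items() if n > 1)
print(repeated)
```

The repeated 8-mers all come from the stretch TTAGATCTAC, which appears twice in this sequence.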

  32. 32-mer table…cont. • SeqA: …CCTGATTAGACATTGCATGAAGT… • SeqB: …ATAACATTGCATGAAGTCGAAC… • 8-mer Occurrences Belongs to: • … • ACATTGCA 10 seqA, seqB,… • … • Sequences seqA and seqB are said to overlap when they share 32-mers. • Quality of overlap depends on the number of shared 32-mers and their uniqueness
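
A sketch of the "belongs to" index from this slide, again using 8-mers and only the two sequence excerpts shown. The function names are hypothetical; a real table would cover every read, which is why the slide shows a count of 10 for ACATTGCA while this two-read toy cannot.

```python
from collections import defaultdict


def build_kmer_index(named_seqs, k):
    """Map each k-mer to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in named_seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_overlaps(index):
    """Count shared k-mers for every pair of sequences in the index."""
    pairs = defaultdict(int)
    for names in index.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs[(a, b)] += 1
    return dict(pairs)


reads = {
    "seqA": "CCTGATTAGACATTGCATGAAGT",
    "seqB": "ATAACATTGCATGAAGTCGAAC",
}
index = build_kmer_index(reads, 8)
print(sorted(index["ACATTGCA"]))
print(candidate_overlaps(index))
```

The number of shared k-mers per pair gives a rough overlap score; weighting each k-mer by how rarely it occurs would reflect the "uniqueness" criterion mentioned on the slide.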

  33. Layout • Many algorithms select the pair of fragments with the best overlap at every step. • The overlap score is either the similarity score or a more involved probabilistic score. • The selected pair of fragments with the best overlap score is checked for consistency. • If the check passes, the two fragments are merged.

  34. Sorting fragments • The assembler sorts all potential merges according to their 32-mer scores • Merges are performed in order of their scores (subject to quality restrictions = Phred scores) • After half of the merges are performed, all scores are re-evaluated and the list is re-sorted; this continues until no more merges can be made

  35. Merging two sequences • …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC • CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… • Percent identity = 18/19 = 94.7% • Overlap = region of similarity between the two sequences • Overhang = unaligned sequences at ends (underlined) • The assembler screens merges based on: • Length of overlap • % identity in overlap region (TIGR default = 97.5%) • Maximum overhang size (can be trimmed)
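
The 18/19 figure can be verified from the two sequences shown: the 19-base overlap regions differ at a single position. In the sketch below only the 97.5% identity threshold comes from the slide; the minimum overlap length and maximum overhang are hypothetical placeholders, and the function names are illustrative.

```python
def percent_identity(x, y):
    """Percent of matching positions between two equal-length aligned regions."""
    if len(x) != len(y):
        raise ValueError("aligned regions must have equal length")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return 100.0 * matches / len(x)


def passes_screen(overlap_a, overlap_b, overhang_a, overhang_b,
                  min_overlap=15, min_identity=97.5, max_overhang=10):
    """Apply the three screening criteria; only min_identity is from the slide."""
    return (len(overlap_a) >= min_overlap
            and percent_identity(overlap_a, overlap_b) >= min_identity
            and max(len(overhang_a), len(overhang_b)) <= max_overhang)


# Overlap regions and overhangs taken from the two sequences on the slide
OVERLAP_A = "GGATGCGCGGACACGTAGC"
OVERLAP_B = "GGATGCGCTGACACGTAGC"
OVERHANG_A = "CAGGAC"      # unaligned end of the first sequence
OVERHANG_B = "CAGTACTT"    # unaligned start of the second sequence

print(round(percent_identity(OVERLAP_A, OVERLAP_B), 1))
print(passes_screen(OVERLAP_A, OVERLAP_B, OVERHANG_A, OVERHANG_B))
```

At 94.7% identity this merge falls below the 97.5% default, so it would be rejected unless the threshold were relaxed.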

  36. Layout • At later stages of the algorithm, collections of fragments (contigs), rather than individual fragments, are merged. • The difficulty with the layout step is deciding whether two fragments with a good overlap really overlap (i.e. their differences are caused by sequencing errors) or represent a repeat in a genome (i.e. their differences are caused by mutations). • Use additional “scaffolding” measures: mapping

  37. Consensus • The simplest way to build the consensus is to report the most frequent character in the substring layout that is (implicitly) constructed after the layout step is completed.
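
A minimal sketch of majority-vote consensus, assuming the layout is given as ungapped reads with known start offsets (no insertions or deletions). The reads below are made up for illustration, and ties are resolved in favor of the base seen first.

```python
from collections import Counter


def consensus(layout):
    """Most frequent base per column; layout is a list of (offset, read) pairs."""
    width = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(width)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # "N" marks any column that no read covers
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Hypothetical layout: the second read carries one miscalled base (A instead of T)
layout = [
    (0, "ACGTTA"),
    (2, "GTAAGG"),
    (3, "TTAGG"),
]
print(consensus(layout))
```

The miscalled base is outvoted two to one by the other reads covering that column, which is why higher coverage gives a more reliable consensus.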

  38. The Human Touch • Consed: A Graphical Tool for Editing Phrap Assemblies.

  39. Assembly can be greatly enhanced through use of maps • Genetic maps are based on recombination frequencies at meiosis. Linked markers are co-inherited (the closer the markers, the higher the frequency of co-inheritance); only maps genes… • Physical maps describe the location of DNA sequences and use several physical mapping markers. • Expression maps: mRNA

  40. Sequence tagged sites (STS) are used for each map • An STS is a stretch of DNA ~300 bp in length generated using PCR, which tags the larger DNA molecule from which it is derived • The nucleotide sequence of the STS is used to specify the sequence of two synthetic oligonucleotides that will bind in opposite orientations at either end of the STS • Can be used to detect length polymorphisms or ESTs

  41. STSs • Allow different sources of DNA fragments to be examined for common sequences • Sequences for STS are widely available • Small number of false positives • Automation

  42. Genetic Maps • Linkage between markers measured in cM • Haplotypes • Closely linked alleles that tend to be co-inherited (can be >2) • CEPH families • Permanent cell lines derived from Mormon, French, and Venezuelan families (Centre d'Etude du Polymorphisme Humain). Each family consists of three generations with four grandparents, 2 parents and a minimum of 6 children: great pedigrees

  43. Physical mapping markers • RFLPs • Minisatellites • VNTR’s • Microsatellites • Radiation hybrid mapping • FISH • EST maps • Clone maps

  44. Restriction fragment length polymorphism • Based on the presence or absence of a target for a restriction enzyme, usually due to a polymorphism at one base (only two alleles at any one locus; either there or not) • Used extensively in pre-natal screening • Can be performed on high MW fragments using Pulsed Field Gel Electrophoresis and agarose • Can also be used for long range restriction mapping (i.e., 8 bp or 16 bp cutters)

  45. Minisatellites • Variable number tandem repeats • Determine the different lengths by PCR or Southerns • Multiple AluI repeats at a particular locus… • However, use is limited by their distribution in the genome, as they tend to be clustered near telomeres • Southerns can be laborious and PCR can be difficult with large minisatellites

  46. Microsatellites • More common and more evenly distributed than minisatellites • These are variable number of dinucleotide repeats • Microsatellite based on CA repeats is the standard in construction of genetic maps • Both mini and microsatellites are used in forensics as DNA fingerprints

  47. Radiation Hybrid mapping • Cells (human) are irradiated to fragment chromosomes • Irradiated cells are fused with a cell line (rat) to form a panel of hybrids (each retains ~20% of donor fragments of ~10 Mb) • Radiation hybrids have an assortment of human chromosome fragments; the further apart two markers are, the less likely they are to be on the same fragment (map units are centiRays, analogous to cM but dependent on radiation dose)

  48. Clone maps • Generate YAC, PAC, or BAC library • Order by detecting sequences in common (overlapping clones): STS content, hybridizations (using EST cDNAs), and fingerprinting

  49. The human genetic map • Took 15 million separate PCR reactions performed by a robotic line • Describing the results in the ensuing paper required 900 printed pages • Check out: • www.chlc.org/homepage.html • www.ncbi.nlm.nih.gov/SCIENCE96/ • http://www-genome.wi.mit.edu • http://www-shgc.stanford.edu/
