What is bioinformatics? • Development of databases to store and manipulate genomic and proteomic data? • Or more broadly does it mean Computational Biology? • Curricula review suggests it is the study of two information flows in molecular biology* • *Altman, RB. 1998. A curriculum for bioinformatics: the time is ripe. Bioinformatics 14:549-550.
First information flow is the central dogma of molecular biology • Use bioinformatics applications to address transfer of information within the central dogma, including organization and control of transcriptional units, prediction of protein structure from sequence, and the analysis of molecular function.
Second flow is based on scientific method • We create hypotheses, design experiments to test these hypotheses, evaluate data, and extend or modify hypotheses • Bioinformatic applications address the transfer of info within this protocol
BIOS 480 Goals • Provide a comprehensive understanding of current methods in biological sequence analysis • Assess challenges and approaches in new bioinformatics-related disciplines • Provide in-depth, hands-on experience in design and implementation of bioinformatics tools
Grades • Attendance + participation = 20% • Homework and assignments = 25% • Laboratory assignments = 40% • Final exam = 15% • www.uwp.edu/~barber/bioinformatics/BIOS480.htm has lectures and important materials for this class
Datasets for bioinformatics analyses • Genome sequences • Macromolecular structures • Functional genomics experiments
Wide range of bioinformatics techniques • Sequence alignments • Motif identification • Gene prediction • Phylogeny • RNA and protein structural bioinformatics • Proteomics • Microarrays, protein chips, two-hybrid screens • Metabolomics
The advent of genome sequencing brought bioinformatics into its own • Yeah, but now that the human genome is “done”, isn’t genomics “done”? • No
Capillary and slab gel electrophoresis use a modified Sanger technology with fluorescent dyes • Typical reads of 500-750 nt on a timescale of about an hour • Varies with the sequencer
Microfabricated Capillary Arrays • Etch a glass chip with T-shaped channels that are 7 cm long and micrometers (µm) in depth and width • A 96-well chip could be capable of 150,000 bases/h • Miniaturization is one booming field driving bioinformatics
Free Solution Electrophoresis • May improve separation time (no matrix) without losing read length • Label DNA molecules with a friction-increasing molecule such as streptavidin • Currently can read 100 bp, a long way to go…
Who needs electrophoresis? • Pyrosequencing • MALDI-TOF Mass Spectrometry • Sequencing by Hybridization • Massively Parallel Signature Sequencing • A testimony to innovative molecular biology • Single molecule methods
Pyrosequencing • Real-time sequencing that measures the release of PPi during DNA synthesis • Has been of particular use for SNP analysis • The first of four deoxynucleotide triphosphates is added to the reaction; when the correct one is incorporated, PPi is released and measured using ATP sulfurylase-coupled ATP synthesis and luciferase • Wash and repeat with the next nucleotide
Put the sequencing reactions through a mass spectrometer • Figure: spectra of the C- and G-terminated oligonucleotides • Current limit ~100 bp • Facilitated by sensitivity and high-throughput loading
Potential innovations in DNA sequencing • Sequencing by hybridization • Cot-based analysis • http://www.msstate.edu/research/mgel/cotfig.htm • Chip-based analysis • http://www.hyseq.com/content/131.php • http://citeseer.nj.nec.com/context/471959/0 • Linear Read http://www.usgenomics.com/about/index.shtml
Growth in genomic technology • U.S. Genomics's technology platform, the GeneEngine™, has two components, (1) nanotechnology systems for positioning DNA so that it can be read linearly (broadly termed DNA Delivery Mechanism(s)™) and (2) detection technologies that allow the reading of information from the DNA Delivery Mechanism(s)™.(FRET-based??)
Overview of “Shotgun” Genomic Sequencing Original DNA • Break DNA into random fragments (8-10X Coverage) The future looks bright, but what about right now?
Overview of Genomic Sequencing Original DNA • Break DNA into random fragments (8-10X Coverage) • Amplify fragments in a vector and sequence 500-700 bases in from each end Base calling performed by Phred software: http://www.phrap.org/ http://www.genome.org/cgi/reprint/8/3/175.pdf
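The 8-10X coverage target can be turned into a rough read count. The sketch below assumes the usual definition, coverage = (number of reads × read length) / genome size, and, for the uncovered fraction, the idealized Lander-Waterman model of uniformly random reads. The function names and the 1 Mb example genome are illustrative and not taken from the slides.

```python
import math

def reads_needed(genome_size, read_length, coverage):
    """Smallest read count with reads * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)

def expected_uncovered_fraction(coverage):
    """Idealized fraction of bases hit by no read (uniform random reads assumed)."""
    return math.exp(-coverage)

# Hypothetical 1 Mb genome, 500-base reads, 10X coverage
n_reads = reads_needed(1_000_000, 500, 10)
print(n_reads, expected_uncovered_fraction(10))
```

Real projects deviate from this model because cloning bias and repeats make read placement non-uniform, so the numbers are best treated as a lower bound on effort.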
Cloning vectors • 2-5 kb in pUC or M13 • 5-50 kb in phage or cosmid • 30-100 kb in P1 bacteriophage • 60-300 kb in BAC • 60-2000 kb in YAC
Phred Software • Calls bases in four phases: • Predicting peaks (ideal locations) • Locating observed peaks • Matching observed to predicted • Finding missing peaks • http://www.genome.org/cgi/reprint/8/3/186.pdf • http://www.genome.org/cgi/reprint/8/3/175.pdf
Errors in Sequencing Reads • Each base call is assigned a quality score: q = -10 × log10(p), where p is the estimated error probability of the call (higher quality scores correspond to lower error probabilities) • Errors are associated with the peak vicinity; the following parameters are used to determine error probability on a TRAINING SET: • Peak spacing • Uncalled/called ratio (two window sizes) • Peak resolution • The result is a look-up table built into the Phred software
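The quality-score formula above is easy to check numerically. This is a minimal sketch of the conversion in both directions; the function names are illustrative and are not part of the Phred software itself.

```python
import math

def phred_quality(error_probability):
    """q = -10 * log10(p), where p is the estimated probability the call is wrong."""
    return -10 * math.log10(error_probability)

def error_probability(quality):
    """Inverse of the formula above: p = 10 ** (-q / 10)."""
    return 10 ** (-quality / 10)

for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p)))
```

Each additional 10 quality points corresponds to a tenfold drop in error probability, which is why thresholds such as q = 20 or q = 30 are convenient cut-offs.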
Common Sources of Sequencing Errors • The first fifty or so peaks of a trace are noisy and unevenly spaced due to anomalous migration of short DNA fragments and to unreacted dye-primer and dye-terminator molecules. • Near the end of the trace, peaks become less evenly spaced due to less accurate trace processing and less well resolved as diffusion effects increase; the number of labeled molecules also decreases. • Compressions: most common in GC-rich regions, when bases near the end of a single-stranded fragment bind to a complementary region, forming a hairpin (which migrates more rapidly than expected) • The dye-terminator sequencing method helps resolve compressions, but has its own problems: “About 85% of high quality dye terminator errors resulted from a missing G peak following an A, or a missing A following a T,…” Ewing and Green, 1998.
Overview of Genomic Sequencing Original DNA • Break DNA into random fragments (8-10X Coverage) • Amplify fragments in a vector and sequence 500-700 bases in from each end • Assemble fragments of sequence that have been read: Contig 1 Contig 2
Assembly of large DNA sequences • Several assembly programs exist and can be run with different degrees of success: Phrap, TIGR Assembler, CAP, STROLL, etc.
Overlap-layout-consensus • Most fragment assembly algorithms include the following three steps: • Overlap. Finding potentially overlapping fragments. • Layout. Finding the order of fragments. • Consensus. Deriving the DNA sequence from the layout. • New method: http://www.cs.ucsd.edu/groups/bioinformatics/software.html
Assemble these fragments • F1 ATAT • F2 TATT • F3 TTAT • F4 TATA • F5 TAAT • F6 AATA
Did you use a Greedy approach? • Most assemblers use a greedy algorithm: one that takes the best immediate local solution, picking the highest-scoring overlap, merging the two fragments, and repeating until no more merges can be made
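The greedy procedure can be tried on the six fragments from the previous slide. This is a toy sketch that assumes error-free reads, scores an overlap simply by its length, and breaks ties by whichever pair is found first; it is not how Phrap or TIGR Assembler are implemented, and the function names are illustrative.

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_overlap=1):
    """Repeatedly merge the pair with the longest overlap until none remains."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_len(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k < min_overlap:
            break  # no remaining pair overlaps enough to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

fragments = ["ATAT", "TATT", "TTAT", "TATA", "TAAT", "AATA"]
print(greedy_assemble(fragments))
```

Because several pairs tie at the same overlap length, a different tie-breaking rule can give a different (and longer or shorter) result, which illustrates why the greedy choice is not guaranteed to find the shortest assembly.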
Overlap • The overlap problem is to find the best match between the suffix of one sequence and the prefix of another. • If there are no sequencing errors, simply find the longest suffix of one string that exactly matches the prefix of another string. • Since error rates are low, the common practice is to use a filtration method: filter out pairs of fragments that do not share a sufficiently long common substring.
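Both ideas on this slide, the exact suffix-prefix match and the filtration step, fit in a few lines. The sketch below assumes error-free reads for the overlap itself and uses a shared exact k-long substring as the filter; the function names and the choice of k are illustrative.

```python
from itertools import permutations

def longest_suffix_prefix(a, b):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def shares_kmer(a, b, k):
    """Filtration test: do a and b have any exact k-long substring in common?"""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers_a for i in range(len(b) - k + 1))

def candidate_overlaps(fragments, k):
    """Score only the ordered pairs that pass the filter."""
    return {
        (a, b): longest_suffix_prefix(a, b)
        for a, b in permutations(fragments, 2)
        if shares_kmer(a, b, k)
    }

print(candidate_overlaps(["ATAT", "TATT", "TAAT"], 3))
```

The filter is cheap compared with a full alignment, so it pays off when most pairs of reads come from unrelated parts of the genome.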
TIGR assembler • Finds exact 32 base matches between sequences; alignment between two sequences is scored based on the number and uniqueness of the 32-mer match (how often does 32-mer appear?) • Interestingly, 32 was not chosen in a particularly rigorous manner, 16 gave too many alignments, >32 too few
32-mer table example (8-mers shown for brevity) • AGCTTAGATCTACAAGAGGTATTAGATCTACGGACTA…. • 8-mer Occurrences • AGCTTAGA 1 • GCTTAGAT 1 • CTTAGATC 1 • TTAGATCT 2 • Internal repeat sequences are ignored, because they confuse the assembler
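The occurrence table on this slide can be reproduced directly from the example sequence. This sketch uses a plain counter with k = 8, as on the slide; it only illustrates the idea and makes no claim about how TIGR Assembler stores its table.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping k-long substring of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

seq = "AGCTTAGATCTACAAGAGGTATTAGATCTACGGACTA"
table = kmer_counts(seq, 8)

# k-mers seen more than once within one read mark an internal repeat
repeated = {kmer for kmer, n in table.items() if n > 1}
print(table["AGCTTAGA"], table["TTAGATCT"], sorted(repeated))
```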
32-mer table…cont. • SeqA: …CCTGATTAGACATTGCATGAAGT… • SeqB: …ATAACATTGCATGAAGTCGAAC… • 8-mer Occurrences Belongs to: • … • ACATTGCA 10 seqA, seqB,… • … • Sequences seqA and seqB are said to overlap when they share 32-mers. • Quality of the overlap depends on the number of shared 32-mers and their uniqueness
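The "belongs to" column amounts to an index from each k-mer to the reads that contain it. Here is a minimal sketch using only the visible parts of seqA and seqB from the slide, again with k = 8; the function names are illustrative.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map each k-mer to the set of read names that contain it."""
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def shared_kmers(index, name_a, name_b):
    """k-mers present in both reads; more shared k-mers suggests a better overlap."""
    return sorted(kmer for kmer, names in index.items()
                  if name_a in names and name_b in names)

reads = {
    "seqA": "CCTGATTAGACATTGCATGAAGT",
    "seqB": "ATAACATTGCATGAAGTCGAAC",
}
index = build_kmer_index(reads, 8)
common = shared_kmers(index, "seqA", "seqB")
print(common)
```

A fuller version would also weight each shared k-mer by how rarely it occurs across all reads, which is the uniqueness component the slide mentions.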
Layout • Many algorithms select a pair of fragments with the best overlap at every step. • The score of overlap is either the similarity score or a more involved probabilistic score. • The selected pair of fragments with the best overlap score is checked for consistency. • If the check passes, the two fragments are merged.
Sorting fragments • Assembler sorts all potential merges according to their 32-mer scores • Merges are performed in order of their scores (subject to quality restrictions = Phred scores) • After half of the merges are performed, all scores are re-evaluated and the list is re-sorted • This continues until no more merges are possible
Merging two sequences • …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC • CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… • Percent identity = 18/19 = 94.7% • Overlap = region of similarity between the two sequences • Overhang = unaligned sequences at ends (underlined) • The assembler screens merges based on: • Length of overlap • % identity in overlap region (TIGR default = 97.5%) • Maximum overhang size (can be trimmed)
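The percent-identity figure for this example can be recomputed from the 19-base overlap region. The sketch below assumes an ungapped alignment of equal-length regions; the `passes_screen` helper and the minimum length of 10 are hypothetical, while the 97.5% threshold is the TIGR default quoted on the slide.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two ungapped, equal-length regions."""
    if not a or len(a) != len(b):
        raise ValueError("aligned regions must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def passes_screen(a, b, min_length, min_identity):
    """Accept a merge only if the overlap is long enough and similar enough."""
    return len(a) >= min_length and percent_identity(a, b) >= min_identity

# Overlap regions taken from the two sequences on the slide
region_1 = "GGATGCGCGGACACGTAGC"
region_2 = "GGATGCGCTGACACGTAGC"
print(round(percent_identity(region_1, region_2), 1))
print(passes_screen(region_1, region_2, min_length=10, min_identity=97.5))
```

With a single mismatch in 19 bases the pair falls below the 97.5% default, so under these assumptions the merge would be rejected.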
Layout • At later stages of the algorithm, collections of fragments (contigs), rather than individual fragments, are merged. • The difficulty with the layout step is deciding whether two fragments with a good overlap really overlap (i.e. their differences are caused by sequencing errors) or represent a repeat in a genome (i.e. their differences are caused by mutations). • Use additional “scaffolding” measures, such as mapping
Consensus • The simplest way to build the consensus is to report the most frequent character in the substring layout that is (implicitly) constructed after the layout step is completed.
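The majority-vote rule described above can be sketched as follows, assuming the layout is given as (offset, read) pairs on an ungapped coordinate system. That representation, the "N" placeholder for uncovered columns, and the example reads are all illustrative assumptions.

```python
from collections import Counter

def consensus(layout):
    """Report the most frequent base in each column of the layout.

    layout: list of (offset, read) pairs giving each read's start position.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)

# Made-up reads; the second carries one miscalled base (T instead of A)
layout = [(0, "ACGTAC"), (2, "GTTCGG"), (3, "TACGGA")]
print(consensus(layout))
```

Weighting each vote by its Phred quality score, rather than counting all calls equally, is a natural refinement.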
The Human Touch • Consed – A Graphical Tool for Editing Phrap Assemblies.
Assembly can be greatly enhanced through use of maps • Genetic maps are based on recombination frequencies at meiosis. Linked markers are co-inherited (the closer the markers, the higher the frequency of co-inheritance) – only maps genes… • Physical maps describe the location of DNA sequences, using several kinds of physical mapping markers. • Expression maps - mRNA
Sequence tagged sites (STS) are used for each map • An STS is a stretch of DNA ~300 bp in length generated using PCR, which tags the larger DNA molecule from which it is derived • The nucleotide sequence of the STS is used to specify the sequence of two synthetic oligonucleotides that will bind in opposite orientations at either end of the STS • Can be used to detect length polymorphisms or EST’s
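The relationship between an STS and its two oligonucleotides can be illustrated in code: the forward primer matches the start of the STS, and the reverse primer is the reverse complement of its end. The sequence and primer length below are made up, and real primer design also considers melting temperature and specificity, which this sketch ignores.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse, to read the opposite strand 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def sts_primers(sts, primer_length):
    """Return (forward, reverse) primers that bind at either end of the STS."""
    forward = sts[:primer_length]
    reverse = reverse_complement(sts[-primer_length:])
    return forward, reverse

# Hypothetical STS fragment, far shorter than the ~300 bp typical of a real one
sts = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(sts_primers(sts, 10))
```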
STSs • Allow different sources of DNA fragments to be examined for common sequences • Sequences for STS are widely available • Small number of false positives • Automation
Genetic Maps • Linkage between markers measured in cM • Haplotypes • Closely linked alleles that tend to be co-inherited (can be >2) • CEPH families • Permanent cell lines derived from Mormon, French, and Venezuelan families (Centre d'Etude du Polymorphisme Humain). Each family consists of three generations with four grandparents, 2 parents and a minimum of 6 children: great pedigrees
Physical mapping markers • RFLPs • Minisatellites • VNTR’s • Microsatellites • Radiation hybrid mapping • FISH • EST maps • Clone maps
Restriction fragment length polymorphism • Based on the presence or absence of a target site for a restriction enzyme, usually due to a polymorphism at one base (only two alleles at any one locus; the site is either there or not) • Used extensively in prenatal screening • Can be performed on high MW fragments using Pulsed Field Gel Electrophoresis and agarose • Can also be used for long range restriction mapping (e.g. 8 bp or 16 bp cutters)
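A single-base polymorphism that creates or destroys a restriction site changes the fragment lengths seen on a gel. The sketch below digests two made-up alleles, using the EcoRI site GAATTC (cut after the first G) as the default; the sequences and function name are illustrative.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting a linear sequence at every site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

allele_with_site = "AAAAGAATTCTTTT"     # hypothetical allele carrying the site
allele_without_site = "AAAAGAGTTCTTTT"  # one base changed, site destroyed
print(digest(allele_with_site), digest(allele_without_site))
```

The first allele yields two fragments and the second a single uncut fragment, which is the two-allele, present-or-absent pattern the slide describes.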
Minisatellites • Variable number tandem repeats • Determine the different lengths by PCR or Southerns • Multiple AluI repeats at a particular locus… • However, use is limited by their distribution in the genome, as they tend to be clustered near telomeres • Southerns can be laborious and PCR can be difficult with large minisatellites
Microsatellites • More common and more evenly distributed than minisatellites • These are variable numbers of dinucleotide repeats • Microsatellites based on CA repeats are the standard in construction of genetic maps • Both mini- and microsatellites are used in forensics as DNA fingerprints
Radiation Hybrid mapping • Cells (human) are irradiated to fragment chromosomes • Irradiated cells are fused with a cell line (rat) to form a panel of hybrids (each retains ~20% of donor fragments of ~10 Mb) • Radiation hybrids have an assortment of human chromosome fragments; the further apart two markers are, the less likely they are to be on the same fragment (map units are centiRays, analogous to cM but dependent on radiation dose)
Clone maps • Generate YAC, PAC, or BAC library • Order by detecting sequences in common (overlapping clones): STS content, hybridizations (using EST cDNA’s), and fingerprinting
The human genetic map • Took 15 million separate PCR reactions performed by a robotic line • Results description of ensuing paper required 900 printed pages • Check out: • www.chlc.org/homepage.html • www.ncbi.nlm.nih.gov/SCIENCE96/ • http://www-genome.wi.mit.edu • http://www-shgc.stanford.edu/