Genome assembly and an initial look at features

Genome assembly and an initial look at features

Celera believed they could assemble the whole human genome from shotgun sequence fragments in this way. But this approach failed. They had to use the public domain map data to resolve problems in their assembly.

Assembly of large DNA sequences • Several assembly programs exist and can be run with different degrees of success: Phrap, TIGR Assembler, CAP, STROLL, etc.

Overlap-layout-consensus • Most fragment assembly algorithms include the following three steps: • Overlap. Finding potentially overlapping fragments. • Layout. Finding the order of fragments. • Consensus. Deriving the DNA sequence from the layout. • New method: http://www.cs.ucsd.edu/groups/bioinformatics/software.html

Overlap • The overlap problem is to find the best match between the suffix of one sequence and the prefix of another. • If no sequencing errors, simply find the longest suffix of one string that exactly matches the prefix of another string. • Since errors are small, the common practice is to use filtration method and to filter out pairs of fragments that do not share a significantly long common substring.

TIGR assembler • Finds exact 32 base matches between sequences; alignment between two sequences is scored based on the number and uniqueness of the 32-mer match (how often does 32-mer appear?) • Interestingly, 32 was not chosen in a particularly rigorous manner, 16 gave too many alignments, >32 too few

32-mer table example • AGCTTAGATCTACAAGAGGTATTAGATCTACGGACTA…. • 8-MER Occurences • AGCTTAGA 1 • GCTTAGAT 1 • CTTAGATC 1 • TTAGATCT 2 Internal repeat sequences are ignored, because they confuse the assembler

32-mer table…cont. • SeqA: …CCTGATTAGACATTGCATGAAGT… • SeqB: …ATAACATTGCATGAAGTCGAAC… • 8-mer Occurences Belongs to: • … • ACATTGCA 10 seqA, seqB,… • … Sequences seqA and seqB are said to overlap when they share 32-mers. Quality of overlap depends on number of 32-mers and their uniqueness

Layout • Many algorithms select a pair of fragments with the best overlap at every step. • The score of overlap is either the similarity score or a more involved probablilistic score. • The selected pair of fragments with the best overlap score is checked for consistency. • If this check is accepted, the two fragments are merged.

Sorting fragments • Assembler sorts all potential merges according to their 32-mer scores • Merges are performed in order of their scores (subject to quality restrictions = Phred scores) • After half of the merges are performed, all scores are re-evaluated and list is re-sorted..continued until no more merges

Merging two sequences • …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC • CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… • Percent identity = 18/19% = 94.7% • Overlap = region of similarity between regions • Overhang = unaligned sequences at ends (underlined) • The assembler screens merges based on: • Length of overlap • % identity in overlap region (TIGR default = 97.5%) • Maximum overhang size (can be trimmed)

Layout • At later stages of the algorithm the collections of fragments (contig) – rather than individual fragments – are merged. • The difficulty with the layout step is deciding whether two fragments with a good overlap really overlap (i.e. their differences are caused by sequencing errors) or represent a repeat in a genome (i.e. their differences are caused by mutations). • Use additional “scaffolding” measures –mapping

Consensus • The simplest way to build the consensus is to report the most frequent character in the substring layout that is (implicitly) constructed after the layout step is completed.

The Human Touch • Consed – AGraphical Tool for Editing Phrap Assemblies.

Assembly can be greatly enhanced through use of maps • Genetic maps based on recombination frequencies at meiosis. Linked markers are co-inherited (closer the higher frequency of co-inheritance) – only maps genes… • Physical maps describe location of DNA sequences, use several physical mapping markers. • Expression maps - mRNA

Sequence tagged sites (STS) are used for each map • An STS is a stretch of DNA ~300 bp in length generated using PCR, which tags the larger DNA molecule from which it is derived • The nucleotide sequence of the STS is used to specify the sequence of two synthetic oligonucleotides that will bind in opposite orientations at either end of the STS • Can be used to detect length polymorphisms or EST’s

STSs • Allow different sources of DNA fragments to be examined for common sequences • Sequences for STS are widely available • Small number of false positives • Automation

Genetic Maps • Linkage between markers measured in cM • Haplotypes • Closely linked alleles that tend to be co-inherited (can be >2) • CEPH families • Permanent cell lines derived from Mormons and French-Venezuelian families (Centre dEtude Polymorphism Human). Each family consists of three generations with four grandparents, 2 parents and minimum of 6 children – great pedigrees

Physical mapping markers • RFLPs • Minisatellites • VNTR’s • Microsatellites • Radiation hybrid mapping • FISH • EST maps • Clone maps

Restriction fragment length polymorphism • Based on presence or absence of a target for a restriction enzyme usually due to a polymorphism at one base (only two alleles at any one locus; either there or not) • Used extensively in pre-natal screening • Can be performed on high MW fragments using Pulsed Field Gel Electrophoresis and agarose • Can also be used for long range restriction mapping (ie. 8 bp or 16 bp cutters)

Minisatellites • Variable number tandem repeats • Determine the different lengths by PCR or Southerns • Multiple AluI repeats at a particular locus… • However, use is limited by their distribution in the genome, as they tend to be clustered near telomeres • Southerns can be laborious and PCR can be difficult with large minisatellites

Microsatellites • More common and more evenly distributed than minisatellites • These are variable number of dinucleotide repeats • Microsatellite based on CA repeats is the standard in construction of genetic maps • Both mini and microsatellites are used in forensics as DNA fingerprints

Radiation Hybrid mapping Cells (human) are irradiated to fragment chromosomes Irradiated cells fused with a cell line (rat) to form a panel of hybrids (retains ~20% of donor fragments of ~ 10Mb) Radiation hybrids have an assortment of human chromosome fragments; further apart two markers are, less likely to be on same fragment (map units are centiRays, analogous to cM but depend on radiation dose)

Clone maps • Generate YAC, PAC, or BAC library • Order by detecting sequences in common (overlapping clones): STS content, hybridizations (using EST cDNA’s), and fingerprinting

The human genetic map • Took 15 million separate PCR reactions performed by a robotic line • Results description of ensuing paper required 900 printed pages • Check out: • www.chlc.org/homepage.html • www.ncbi.nlm.nih.gov/SCIENCE96/ • http://www-genome.wi.mit.edu • http://www-shgc.stanford.edu/

Restriction mapping on a genomic scale • http://www.lmcg.wisc.edu/

The age of complete genomic sequences www.nature.com

Genome sequence provides stats

DNA has distinctive, non-random base composition • In all DNA, regardless of species, the number of A’s equals # of T’s, and # G’s = # C’s, such that A + G = T + C • DNA specimens from different tissues of same organism have same base composition • Base composition of DNA can vary wildly among organisms (25% GC vs. 80% GC) • Non-randomness generates signals

Signals within nucleotide sequences (and structure?) • Promoter • Transcriptional terminators • Ribosome binding site • Genetic code (genes) • Splice sites • Restriction sites

Genes = Functional portions of DNA • Untranscribed genes • Replicator genes: sites for initiation and termination of DNA replication • Segregator genes sites of attachment to spindle machinery during meiosis and mitosis • Etc. • Pseudogenes • Looks similar to functional genes but contains mutations such as frameshift and nonsense • Protein-encoding genes • Transcribed and translated (and maybe modified) • RNA-specific genes • Transcribed (and maybe modified)

E. coli 1,146 889 18 239 1,129 1 10 H.influenzae M.genitalium The Minimal Genome

A consequence of mechanism: GC skew • Looking at microbial genomes, a plot G-C/(G+C) is an excellent predictor of origin of replication, where polarity switches

Cumulative GC skew can help

Why is there GC skew? • Consequence of asymmetry in replication or repair • Leading vs. lagging strand synthesis • Transcription-repair coupling • Removes most frequent types of DNA damage (deaminated cytosines and pyrimidine dimers, both of which lead to base substitution • This repair would occur on template strand (not coding), which then should become pyrimidine rich • Also, the template strand is significantly protected against DNA damage during transcription, while the coding strand is exposed. This should help increase the purine load of the coding strand.

Another example of base skew: CpG islands • CpG refers to the dinucleotide, while C-G to the base pair • In the human genome, the C of CpG is typically methylated, and there is a high chance of this methyl C mutating to a T (transition) • As a consequence, CpG are rarer in this genome than one would expect

CpG islands • However, for biologically important reasons methylation is suppressed in short stretches, such as around promoters • As a result, CpG islands have been used to define start sites for putative genes

Eukaryotes-vs-Prokaryotes

Many eukaryotic genes contain introns, few prokaryotic genes do • Fig1.1 patthy

Intron splicing and phases • Fig 1.2 and 1.3

Group I and II self-splicing introns

Genetic code(s)

Third codon position often synonymous

Types of substitution mutations

Frameshift mutations affect multiple codon positions

Genome assembly and an initial look at features

Genome assembly and an initial look at features

Presentation Transcript

Genome Assembly

Computational Genomics: Genome assembly

Genome sequence assembly

Bacterial Genome Assembly

Genome Assembly Stewardship (Ames)

Genome Assembly

Genome Assembly Final Results

Genome Assembly

Denovo genome assembly and analysis

Genome Assembly

On Genome Assembly

Let’s look at another initial meeting…

Sequencing techniques and genome assembly

Genome Assembly

Genome Sequencing and Assembly

Genome sequence assembly

Sequence Alignment and Genome Assembly

De novo genome assembly

Genome Sequencing and Assembly Progress

Genome Assembly and Annotation

Whole Genome Shotgun Assembly

Whole Genome Assembly