BioInformatics (2)

Physical Mapping - I
• Low resolution
  • Megabase-scale
• High resolution
  • Kilobase-scale or better
• Methods for low resolution mapping
  • Somatic cell hybrids (human and mouse or hamster)
    • Fast chromosomal localisation of genes
    • Subchromosomal mapping possible
  • Fluorescence in situ hybridisation (FISH)
    • Chromosome painting
  • Fractionation of chromosomes by flow cytometry

Physical Mapping - II
• Methods for high resolution mapping
  • Long-range restriction mapping
    • Pulsed-field gel electrophoresis (PFGE)
  • Assembly of clone contigs
  • The double digest problem
    • Ordering the fragments produced by digestion with two restriction enzymes
  • Sequence Tagged Sites (STSs)
    • Sequence fragments in the genome, each described uniquely by a pair of PCR primers
    • Usually 200-300 bases
    • Very useful as ‘landmarks’ on the physical map
    • Can be mapped to individual clones by FISH
  • Assembly of STS-content physical maps
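
The double digest problem named above can be illustrated with a small brute-force search. This is only a sketch under simplifying assumptions (exact integer fragment lengths, a linear molecule, no measurement error); the function names `cut_sites` and `double_digest` are made up for this example and do not come from any mapping package.

```python
from itertools import permutations

def cut_sites(fragments):
    """Positions of the internal cuts when the fragments are laid end to end."""
    sites, pos = [], 0
    for length in fragments[:-1]:
        pos += length
        sites.append(pos)
    return sites

def double_digest(a, b, ab):
    """Return every (order_a, order_b) pair whose combined cut sites
    reproduce the fragment lengths seen in the double digest `ab`."""
    if not (sum(a) == sum(b) == sum(ab)):
        raise ValueError("all three digests must sum to the same total length")
    total = sum(a)
    target = sorted(ab)
    solutions = []
    for order_a in sorted(set(permutations(a))):
        sites_a = set(cut_sites(order_a))
        for order_b in sorted(set(permutations(b))):
            # Merge the cuts of both enzymes and add the two molecule ends.
            sites = sorted(sites_a | set(cut_sites(order_b)) | {0, total})
            fragments = [hi - lo for lo, hi in zip(sites, sites[1:])]
            if sorted(fragments) == target:
                solutions.append((order_a, order_b))
    return solutions

# Toy data: enzyme A gives 3, 5, 2; enzyme B gives 4, 6; both together give 1, 2, 3, 4.
for order_a, order_b in double_digest([3, 5, 2], [4, 6], [1, 2, 3, 4]):
    print(order_a, order_b)
```

Even this toy input yields several consistent orderings (including mirror images), which is why double digest data alone often cannot fix a unique map. The search is exhaustive over permutations, so it is only practical for a handful of fragments.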

Physical Mapping - III
• Map units (human genome)
  • 1 cM (centimorgan, genetic map) ≈ 1 Mb
  • 1 cR (centiRay, radiation hybrid map) ≈ 30 kb
  • 1 centiRay = 1% chance of a radiation-induced break between 2 markers
• Major information resources
  • Stanford Human Genome Center (RH maps)
    • http://www-shgc.stanford.edu
  • Whitehead/MIT Genome Center (STS content maps)
    • http://www-genome.wi.mit.edu/
  • Centre d’Etude du Polymorphisme Humain - CEPH (YAC maps)
    • http://www.cephb.fr/bio/ceph-genethon-map.html
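
As a quick worked example of the map units above, the approximate conversions can be written as two small helpers. The conversion factors are the rough genome-wide averages given on the slide; real values vary by region (and, for cR, by the hybrid panel used), so treat the results as order-of-magnitude estimates. The names `cm_to_bp` and `cr_to_bp` are invented for this sketch.

```python
BP_PER_CM = 1_000_000  # slide value: 1 cM is roughly 1 Mb (human genome average)
BP_PER_CR = 30_000     # slide value: 1 cR is roughly 30 kb

def cm_to_bp(centimorgans):
    """Approximate physical distance in base pairs for a genetic distance in cM."""
    return centimorgans * BP_PER_CM

def cr_to_bp(centirays):
    """Approximate physical distance in base pairs for an RH distance in cR."""
    return centirays * BP_PER_CR

print(cm_to_bp(2.5))  # two markers 2.5 cM apart
print(cr_to_bp(10))   # two markers 10 cR apart
```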

Physical Mapping - IV
• Conclusions
• The value of physical mapping
  • Confirmation of chromosomal location of clones and genes
  • Correction of genetic map errors
  • Correlation to the genetic map reveals ‘hot’ and ‘cold’ regions of recombinational activity on chromosomes
  • Provides useful information for duplicated regions
• High resolution mapping provides the framework necessary for high quality sequencing of large genomic regions

DNA Sequencing
• Ordered clone library
  • Sequencing of overlapping clones of known order as determined by restriction analysis
  • Advantage
    • Easy ordering of resulting sequence reads
  • Disadvantage
    • Detailed mapping is time-consuming
• Shotgun sequencing
  • Partial digestion of DNA with a 4-cutter enzyme
  • Sequencing of randomly overlapping clones
  • Computer-aided assembly of reads
  • Advantage
    • Speed
  • Disadvantage
    • High data redundancy due to random sequencing
    • Not suitable for large genomes (>300 Mb)
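
The redundancy cost of random sequencing can be made concrete with a short calculation. The sketch below assumes the usual Poisson (Lander-Waterman style) model, in which reads land uniformly at random; that model is an assumption added here and is not stated on the slide, and the function names are hypothetical.

```python
import math

def fold_coverage(n_reads, read_len, genome_len):
    """Average number of times each base is sequenced (the redundancy)."""
    return n_reads * read_len / genome_len

def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read, Poisson model."""
    return 1.0 - math.exp(-coverage)

def reads_needed(genome_len, read_len, target_fraction):
    """Number of random reads needed to cover `target_fraction` of the target."""
    coverage = -math.log(1.0 - target_fraction)
    return math.ceil(coverage * genome_len / read_len)

# Toy example: a 100 kb clone sequenced with 500-base reads.
c = fold_coverage(n_reads=1600, read_len=500, genome_len=100_000)
print(c, expected_fraction_covered(c))
print(reads_needed(genome_len=100_000, read_len=500, target_fraction=0.99))
```

Under this model, closing the last few percent of a target requires several-fold redundancy, which is the trade-off against the mapping effort of the ordered clone approach.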

Assembly of Sequence Contigs
• The problem:
  • Semi-automated assembly of a contiguous DNA sequence from overlapping gel readings
• Steps
  • Base identification
  • Trimming of ends
  • Vector clipping
  • Assembly of fragments
• Major software packages
  • Sequencher™ from GeneCodes Inc., Ann Arbor, Michigan
    • Platforms: PowerMac, Windows NT
    • Up to 70 kb contigs
  • The Staden package by Staden et al., MRC, Cambridge
  • PHRED/PHRAP by Green et al., University of Washington, Seattle
    • Platforms: Unix
    • Megabase range contigs
    • Mutation detection capabilities
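
Two of the steps above, trimming of ends and assembly of fragments, can be sketched in a few lines. This is a toy greedy overlap merger assuming error-free reads on one strand; it is not how Sequencher, the Staden package or PHRAP work internally, and the names `trim_ends`, `overlap` and `greedy_assemble` as well as the quality threshold of 20 are illustrative choices.

```python
def trim_ends(seq, quals, min_q=20):
    """Drop low-quality bases from both ends of a read (min_q is an assumed cutoff)."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if too short)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # remove exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return reads  # no overlaps left: each remaining string is a contig
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)

print(trim_ends("NNACGTNN", [5, 5, 30, 30, 30, 30, 5, 5]))
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Real assemblers additionally handle sequencing errors, reverse-complement reads, vector sequence and repeats, all of which this sketch ignores.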

Quality Control of Sequence Data
Source: US DOE Joint Genome Institute
• Goals
  • Complete sequence continuity across a target region (both within and between clones)
    • No more than one gap in 200 kb
    • Size of all gaps no larger than 1% of the size of the total region
    • ‘Allowable gaps’ include
      • regions unclonable/unstable in conventional cloning vectors
      • repetitive regions
      • regions with significant secondary structure or abnormally high GC content
    • Gap size measured by PCR or restriction digest analysis
  • Accuracy of finished sequence: at most 1 error in 10,000 bases
  • At least 95% double-strand coverage
• Assembly Verification
  • a minimum of three independent restriction digests
  • reassembly with an independent algorithm
  • re-sequencing of random clones
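
The numeric goals above lend themselves to a simple checklist. The sketch below is one possible reading of them, with two interpretations flagged as assumptions: "one gap in 200 kb" is treated as an average rate over the region, and the 1% limit is applied to the combined size of all gaps. The function name and its inputs (for instance an estimated error count) are hypothetical.

```python
def check_finishing_goals(region_bp, gap_sizes_bp, error_count, double_strand_bp):
    """Return a dict of pass/fail flags for the numeric finishing goals.

    region_bp        -- size of the target region in base pairs
    gap_sizes_bp     -- list with the measured size of each remaining gap
    error_count      -- estimated number of base errors in the finished sequence
    double_strand_bp -- number of bases covered on both strands
    """
    return {
        # Assumption: an average of at most one gap per 200 kb of region.
        "gap_rate": len(gap_sizes_bp) * 200_000 <= region_bp,
        # Assumption: combined gap size must not exceed 1% of the region.
        "gap_size": sum(gap_sizes_bp) * 100 <= region_bp,
        # At most 1 error in 10,000 bases.
        "accuracy": error_count * 10_000 <= region_bp,
        # At least 95% of bases covered on both strands.
        "double_strand": double_strand_bp * 100 >= 95 * region_bp,
    }

report = check_finishing_goals(
    region_bp=400_000,
    gap_sizes_bp=[1_000, 2_000],
    error_count=30,
    double_strand_bp=390_000,
)
print(report, all(report.values()))
```

The checks use integer arithmetic only, so borderline cases (exactly 95%, exactly 1 error per 10,000 bases) pass rather than failing on rounding.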

Submission and Annotation of Sequence Data
Source: US DOE Joint Genome Institute
• Size of the starting clone is minimum size of submission to public databases
  • 95% of the sequence represented on both strands
  • all ambiguities resolved or annotated
  • missing data from the end of a clone allowed if sequence overlap is detected with the adjacent clone in the tiling path
• Level of annotation
  • all sequences annotated in a largely automated fashion
  • identification of putative or known genes, repetitive elements, EST matches and any other useful “miscellaneous features”
  • computationally-derived predictions must be indicated as such
• Immediate release of finished annotated sequence
• Global assembly of meta-contigs from previously submitted data will be performed periodically

International Strategy Meeting on Human Genome Sequencing
Bermuda, 25th-28th February 1996
Sponsored by the Wellcome Trust
• Summary of agreed principles
  • Primary genomic sequence should be in the public domain
  • Primary genomic sequence should be rapidly released
    • Assemblies of greater than 1 kb should be automatically released on a daily basis
    • Finished annotated sequence should be immediately submitted to the public databases
• Coordination
  • Large-scale sequencing centres should inform HUGO of their intention to sequence particular regions of the human genome

Annotating the Human Genome Sequence
• Identification of coding regions
  • Exon/intron prediction
• High throughput comparison of genomic sequence to protein information
  • Full-length protein sequences
  • Databases of protein domains
• How automated is automated annotation in reality?
  • Advantages
    • High speed
    • Good for tRNA genes, repetitive regions
    • Good for high-scoring matches in databases
  • Disadvantages
    • Error propagation can be detrimental
    • Domain ‘recycling’ in evolution causes misinterpretation, e.g. in the case of transcription factors similar to peptidases
    • Very computer-intensive task!