Part 12: Genome Analysis
Outline
• Overview
• Why do comparative genomic analysis?
• Assumptions/Limitations
• Genome Analysis and Annotation Standard Procedure
• General-Purpose Databases for Comparative Genomics
• Organism-Specific Databases
• Genome Analysis Environments
• Genome Sequence Alignment Programs
• Genomic Comparison Visualization Tools
Some of the prokaryotic genomes
Organism | Disease/niche | Status
Bacteroides fragilis | Opportunistic | In progress
Bordetella bronchiseptica | Veterinary | In progress
Bordetella parapertussis | Whooping cough | Complete
Bordetella pertussis | Whooping cough | Complete
Burkholderia cepacia | Lung infections in CF | In progress
Burkholderia pseudomallei | Melioidosis | In progress
Chlamydophila abortus | Veterinary | Funded
Clostridium botulinum | Botulism | Funded
Clostridium difficile | Colitis | In progress
Corynebacterium diphtheriae | Diphtheria | Complete
Erwinia carotovora | Plant pathogen | Funded
Escherichia/Shigella spp. (5) | Various | In progress
Mycobacterium bovis | Tuberculosis | In progress
Mycobacterium marinum | Various | In progress
Neisseria meningitidis (serogroup C) | Bacterial meningitis | In progress
Salmonella typhi | Typhoid fever | Complete
Salmonella spp. (5) | Various | In progress
Staphylococcus aureus (MRSA) | Various (nosocomial) | Complete
Staphylococcus aureus (MSSA) | Various (community acquired) | In progress
Streptococcus pneumoniae | Bacterial meningitis | In progress
Streptococcus pyogenes | Various (ARF-associated) | In progress
Streptococcus suis | Veterinary | In progress
Streptococcus uberis | Veterinary | In progress
Streptomyces coelicolor | Non-pathogenic | Complete
Tropheryma whipplei | Whipple’s disease | In progress
Wolbachia (Culex quinquefasciatus) | Vector (Bancroftian filariasis) | In progress
Wolbachia (Onchocerca volvulus) | River blindness | Funded
Yersinia enterocolitica | Food poisoning | In progress
Yersinia pestis | Plague | Complete
Some of the eukaryotic genomes
Organism | Disease/niche | Status
Aspergillus fumigatus | Farmer’s lung | In progress
Dictyostelium discoideum | Soil amoeba | In progress
Entamoeba histolytica | Amoebic dysentery | In progress
Leishmania major | Leishmaniasis | In progress
Plasmodium falciparum | Malaria | In progress
Schistosoma mansoni | Bilharzia | In progress
Schizosaccharomyces pombe | Fission yeast | Complete
Theileria annulata | Veterinary | In progress
Toxoplasma gondii | Toxoplasmosis | In progress
Trypanosoma brucei | Sleeping sickness | In progress
Bioinformatics Flow Chart
• 1a. Sequencing
• 1b. Analysis of nucleic acid sequences
• 2. Analysis of protein sequences
• 3. Molecular structure prediction
• 4. Molecular interaction
• 5. Metabolic and regulatory networks
• 6. Gene & protein expression data
• 7. Drug screening: ab initio drug design OR drug compound screening in a database of molecules
• 8. Genetic variability
Genomic DNA → Shearing/Sonication → Subclone and Sequence → Shotgun reads → Assembly → Contigs → Finishing reads → Finishing → Complete sequence
Genome Sequencing - Review
• Strategy: clone by clone vs whole genome shotgun
• Libraries: subcloning; generate small insert libraries
• Sequencing
• Assembly: process of turning raw single-pass reads into contiguous consensus sequence (Phred/Phrap)
• Closure: process of ordering and merging consensus sequences into a single contiguous sequence
• Annotation:
  • DNA features (repeats/similarities)
  • Gene finding
  • Peptide features
  • Initial role assignment
  • Others: regulatory regions
• Release: release data to the public, e.g. EMBL or GenBank
• Most genomes will be sequenced and can be sequenced; few problems are unsolvable.
• The problem lies in understanding what you have:
  • Gene prediction/gene finding
  • Annotation
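The assembly step can be illustrated with a toy example. This is a minimal sketch of greedy overlap merging, assuming error-free reads from a single strand; it is not how Phred/Phrap work internally, and the function names and the `min_len` parameter are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap; return the contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:  # nothing left to merge: remaining gaps need "closure"
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTA"]))
```

Reads that share no sufficient overlap stay as separate contigs, which is the situation the closure step has to resolve.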
Annotation of eukaryotic genomes
Genomic DNA → (transcription) → Unprocessed RNA → (RNA processing) → Mature mRNA (Gm3 ... AAAAAAA) → (translation) → Nascent polypeptide → (folding) → Active enzyme → Function (Reactant A → Product B)
Labels alongside the pathway: ab initio gene prediction, comparative gene prediction, functional identification
Why do comparative genomics?
• Many of the genes identified in each genome by the genome projects have no known or predictable function.
• Analysis of the protein sets from completely sequenced genomes shows uniform evolutionary conservation of proteins in microbial genomes: 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997).
• The function of many of these genes can be predicted by comparing genomes with known functional annotation and transferring the annotation of proteins from better-studied organisms to their orthologs in lesser-studied organisms.
• Cross-species comparison helps reveal conserved coding regions.
• No prior knowledge of the sequence motif is necessary.
• Complements algorithmic analysis.
Assumptions/Limitations
• Homologous genes are relatively well preserved, while noncoding regions show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression, maintaining the structural organization of the genome, and possibly other functions.
• Cross-species comparative genomics is influenced by the evolutionary distance between the compared species.
Genome Analysis and Annotation: General Procedure
• Basic procedure to determine the functional and structural annotation of uncharacterized proteins:
  • Use a sequence similarity search program such as BLAST or FASTA to identify the functional regions in the sequence. If greater sensitivity is required, programs based on the Smith-Waterman algorithm are preferred, at the cost of longer analysis time.
  • Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
  • Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity.
  • Generate a secondary and (if possible) tertiary structure prediction.
• Annotation:
  • Transfer functional information from a well-characterized organism to a lesser-studied organism and/or
  • Use phylogenetic patterns (or profiles) and/or
  • Use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).
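The logical operations on gene sets in the last bullet map directly onto set operations. The following is a minimal sketch, assuming a hypothetical presence/absence table; the group IDs, the genome assignments, and the function names are invented for illustration and do not come from the COG database.

```python
# Hypothetical data: orthologous group -> genomes that encode a member of it.
cog_to_genomes = {
    "COG_A": {"E. coli", "S. typhi", "Y. pestis"},
    "COG_B": {"S. typhi", "Y. pestis"},
    "COG_C": {"E. coli"},
    "COG_D": {"S. typhi"},
}


def groups_in(genome: str) -> set[str]:
    """All orthologous groups with at least one member in the given genome."""
    return {cog for cog, genomes in cog_to_genomes.items() if genome in genomes}


def differential_display(present_in: list[str], absent_from: list[str]) -> set[str]:
    """Groups found in ALL of present_in (AND) and in NONE of absent_from (NOT)."""
    result = set(cog_to_genomes)
    for genome in present_in:
        result &= groups_in(genome)   # AND
    for genome in absent_from:
        result -= groups_in(genome)   # NOT
    return result


# Groups shared by S. typhi and Y. pestis but missing from E. coli.
print(differential_display(["S. typhi", "Y. pestis"], ["E. coli"]))
# OR: groups present in either genome.
print(groups_in("E. coli") | groups_in("S. typhi"))
```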
Automated Genome Annotation
• GeneQuiz – limited number of searches per day
• MAGPIE – outside users cannot submit their own sequences
• PEDANT – commercial version allows full capacity
• SEALS – semi-automated
General Databases Useful for Comparative Genomics
• LocusLink/RefSeq: http://www.ncbi.nih.gov/LocusLink/
• PEDANT - Protein Extraction Description ANalysis Tool: http://pedant.gsf.de/
• MIPS: http://mips.gsf.de/
• COGs - Clusters of Orthologous Groups (of proteins): http://www.ncbi.nih.gov/COG/
• KEGG - Kyoto Encyclopedia of Genes and Genomes: http://www.genome.ad.jp/kegg/
• MBGD - Microbial Genome Database: http://mbgd.genome.ad.jp/
• GOLD - Genomes OnLine Database: http://wit.integratedgenomics.com/GOLD/
• TOGA: http://www.tigr.org/xxxxx
Problems with existing sequence alignment algorithms for genomic analysis
• Most algorithms were developed for comparing single protein sequences or DNA sequences containing a single gene.
• Most algorithms assign a score to every possible alignment (usually the sum of the similarity/identity values for each aligned residue minus a penalty for introducing gaps) and then find the optimal or near-optimal alignment under the chosen scoring scheme.
• Unfortunately, most of these programs cannot accurately handle long alignments.
• Linear-space Smith-Waterman variants are too computationally intensive: they either require specialized hardware (memory-limited) or are very time-consuming. The trade-off is higher speed vs. increased sensitivity.
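The scoring scheme described above (per-residue similarity minus gap penalties, then pick the best alignment) can be sketched as a Smith-Waterman local alignment score. This is a minimal illustration that keeps only two rows of the matrix and returns the score, not the alignment; the match, mismatch and gap values are arbitrary assumptions, and a linear gap penalty is used for simplicity.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = 2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment
                prev[j - 1] + s,       # align a[i-1] with b[j-1]
                prev[j] - gap,         # gap in b
                curr[j - 1] - gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "ACGT"))
```

Every cell of an len(a) × len(b) matrix is still visited, so the running time grows with the product of the two lengths. That is why this approach becomes very slow for genome-sized sequences even when memory use is kept low.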
Genome-size comparative alignment tools
• ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes
  ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998)
• BLAT
  http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx)
• DIALIGN - DIagonal ALIGNment
  http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999)
• DBA - DNA Block Aligner
  http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999)
• GLASS - GLobal Alignment SyStem
  http://plover.lcs.mit.edu/ (Batzoglou et al. 2000)
• LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL PAIRS
  Email: jbuhler@cs.washington.edu (Buhler 2001)
• MegaBlast
  http://www.ncbi.nih.gov/blast/ (Zhang 2000)
• MUMmer - Maximal Unique Match (mer)
  http://www.tigr.org/softlab/ (Delcher et al. 1999)
• PIPMaker - Percent Identity Plot MAKER
  http://biocse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• SSAHA - Sequence Search and Alignment by Hashing Algorithm
  http://www.sanger.ac.uk/Software/analysis/SSAHA/
• WABA - Wobble Aware Bulk Aligner
  http://www.cse.ucsc.edu/~kent/xenoAli/ (Kent & Zahler 2000)
SSAHA
• Sequence Search and Alignment by Hashing Algorithm
• Software tool for very fast matching and alignment of DNA sequences
• Achieves its search speed by converting sequence information into a hash table data structure, which can then be searched very rapidly for matches
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• Run from the Unix command line
• Needs more than 1 GB of RAM
• Best suited to applications requiring exact or “almost exact” matches between two sequences, e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs
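The hash table idea can be shown in a few lines. This is a minimal sketch of k-mer hashing in general, not the SSAHA implementation: it assumes every overlapping k-mer of the subject is stored, and the function names and the choice of k are illustrative only.

```python
from collections import defaultdict


def build_kmer_table(subject: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the subject sequence to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        table[subject[pos:pos + k]].append(pos)
    return table


def find_hits(query: str, table: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every k-mer shared by query and subject."""
    hits = []
    for qpos in range(len(query) - k + 1):
        # One dictionary lookup per query k-mer, regardless of subject length.
        for spos in table.get(query[qpos:qpos + k], ()):
            hits.append((qpos, spos))
    return hits


table = build_kmer_table("GATTACAGATTACA", k=4)
print(find_hits("TTAC", table, k=4))
```

The table is built once and costs memory proportional to the subject sequence, which is consistent with the large RAM requirement noted above. Only exact k-mer matches are found, so divergent sequences produce few hits.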
Genome Analysis Environments
• MAGPIE - Multipurpose Automated Genome Project Investigation Environment
• PEDANT
• SEALS
Problems with Visualizing Genomes
• Output from alignment programs was often viewed as text files, which are hard to interpret intuitively when comparing genomes.
• Visualization tools are needed to handle the complexity and volume of data and to present the information to a biologist in a comprehensive and comprehensible manner.
• Genome alignment visualization tools need to provide:
  • Interpretable alignments
  • Gene predictions and database homologies from different sources
  • Interactive features: real-time capabilities, zooming, searching for specific regions of homology
  • Representation of breaks in synteny
  • Display of multiple alignments
  • Display of contigs of unfinished genomes alongside finished genomes
  • Handling of various data formats
  • Software availability (no black box)
Genome Comparison Visualization Tools
• ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis, an annotation tool)
  http://www.sanger.ac.uk/Software/ACT/
• Alfresco (displays DBA alignments and ...)
  http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)
• PipMaker (displays BlastZ alignments)
  http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• Enteric/Menteric/Maj (displays BlastZ alignments)
  http://glovin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al. 2000)
• Intronerator (displays WABA alignments and ...)
  http://www.cse.ucsc.edu/~kent/intronerator/ (Kent & Zahler 2000b)
• VISTA - Visualization Tool for Alignment (displays GLASS alignments)
  http://www-gsd.lbl.gov/vista/
• SynPlot (displays DIALIGN and GLASS alignments)
  http://www.sanger.ac.uk/Users/igrg/SynPlot/
Artemis Comparison Tool (ACT)
• ACT is a DNA sequence comparison viewer based on Artemis
• Can read complete EMBL and GenBank entries, or sequences in FASTA or raw format
• Additional sequence features can be in EMBL, GenBank, or GFF format
• ACT is free software distributed under the GNU General Public License
• Java-based software
• The latest release (2.0) has better support for eukaryotic genome comparison
• http://www.sanger.ac.uk/Software/ACT/
Salmonella typhi vs. E. coli – SPI-2
[Figure: tracks labelled G+C, tRNA, phage/IS genes, pseudogenes, S. typhi, BLAST hits, E. coli]
Salmonella typhi and Yersinia pestis type III secretion systems
Salmonella typhi vs. E. coli - ACT
[Figure: S. typhi, DNA matches, and E. coli tracks; labelled regions SPI-10, SPI-2, SPI-1, SPI-9, SPI-7, Vi]
ASSIRC
• Accelerated Search for SImilarity Regions in Chromosomes
• ASSIRC finds regions of similarity in pair-wise genomic sequence alignments.
• The method involves three steps:
  • (i) identification of short exact chains of fixed size, called 'seeds', common to both sequences, using hashing functions;
  • (ii) extension of these seeds into putative regions of similarity by a 'random walk' procedure (i.e. the four bases are associated ...);
  • (iii) final selection of regions of similarity by assessing alignments of the putative sequences.
• The authors used simulations to estimate the proportion of regions of similarity not detected for particular region sizes, base identity proportions and seed sizes.
• The approach can be tailored to the user's specifications.
• The authors looked for regions of similarity between two yeast chromosomes (V and IX). The efficiency of the approach was compared with that of the conventional programs BLAST and FASTA by assessing the CPU time required and the regions of similarity found for the same data set.
• http://www.biologie.ens.fr/perso/vincens/assirc.html
• ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz
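The seed-then-extend structure of the three steps can be sketched as follows. This is an assumption-laden illustration, not the ASSIRC code: step (ii) is replaced by a plain exact-match extension along the diagonal rather than the 'random walk' procedure, step (iii) is reduced to a length cutoff, and all function names and parameters are hypothetical.

```python
from collections import defaultdict


def find_seeds(seq1: str, seq2: str, k: int) -> list[tuple[int, int]]:
    """Step (i): exact k-mers ('seeds') common to both sequences, found by hashing."""
    index = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        index[seq1[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(seq2) - k + 1)
            for i in index.get(seq2[j:j + k], ())]


def extend_seed(seq1: str, seq2: str, i: int, j: int, k: int) -> tuple[int, int, int]:
    """Step (ii), simplified: grow the seed in both directions while bases are identical.

    Returns (start in seq1, start in seq2, length)."""
    left = 0
    while i - left > 0 and j - left > 0 and seq1[i - left - 1] == seq2[j - left - 1]:
        left += 1
    right = k
    while i + right < len(seq1) and j + right < len(seq2) and seq1[i + right] == seq2[j + right]:
        right += 1
    return (i - left, j - left, left + right)


def similar_regions(seq1: str, seq2: str, k: int, min_len: int) -> list[tuple[int, int, int]]:
    """Step (iii), simplified: keep extended regions of at least min_len bases."""
    regions = {extend_seed(seq1, seq2, i, j, k) for i, j in find_seeds(seq1, seq2, k)}
    return sorted(r for r in regions if r[2] >= min_len)


print(similar_regions("TTTTACGTACGGCCCC", "GGGGACGTACGGAAAA", k=4, min_len=6))
```

Several seeds on the same diagonal extend to the same region, so collecting the results in a set removes the duplicates. A smaller seed size finds more divergent regions but produces more chance hits to extend.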
BLAT
• Only DNA sequences of 25,000 or fewer bases, and protein or translated sequences of 5,000 or fewer letters, will be processed. If multiple sequences are submitted at the same time, the total limit is 50,000 bases or 12,500 letters.
• BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice, DNA BLAT works well on primates, and protein BLAT on land vertebrates.
• BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. The index takes up a bit less than a gigabyte of RAM. The genome itself is not kept in memory, allowing BLAT to deliver high performance on a reasonably priced Linux box. The index is used to find areas of probable homology, which are then loaded into memory for a detailed alignment. Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers. The protein index takes a little more than 2 gigabytes.
• BLAT was written by Jim Kent. Like most of Jim's software, interactive use on this web server is free to all. Sources and executables to run batch jobs on your own server are available free for academic, personal, and non-profit purposes. Non-exclusive commercial licenses are also available. Contact Jim for details.
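The indexing scheme described above (non-overlapping k-mers, repeats left out, hits used to locate areas of probable homology) can be sketched as below. This is a minimal illustration, not BLAT's code: the repeat cutoff `max_occurrences`, the `min_hits` threshold, grouping hits by diagonal, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_tile_index(genome: str, k: int = 11, max_occurrences: int = 4) -> dict[str, list[int]]:
    """Index non-overlapping k-mers ('tiles'); drop tiles that occur too often (repeats)."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):   # step k: tiles do not overlap
        index[genome[start:start + k]].append(start)
    return {kmer: pos for kmer, pos in index.items() if len(pos) <= max_occurrences}


def candidate_diagonals(query: str, index: dict[str, list[int]],
                        k: int = 11, min_hits: int = 2) -> dict[int, int]:
    """Look up every overlapping query k-mer and count hits per diagonal.

    A diagonal is genome_pos - query_pos; several hits on one diagonal
    suggest a region worth loading for detailed alignment."""
    diagonals = Counter()
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            diagonals[g - q] += 1
    return {d: n for d, n in diagonals.items() if n >= min_hits}


genome = "AAAACCCCGGGGTTTTACGTTGCA"
index = build_tile_index(genome, k=4)
print(candidate_diagonals(genome[6:20], index, k=4))
```

Storing only every k-th k-mer makes the index roughly k times smaller than one that stores every position. Because the query is scanned at every offset, any exact match of at least 2k-1 bases still covers one whole tile and therefore produces a hit.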