
Genomic and comparative genomic analysis

Genomic and comparative genomic analysis. BIO520 Bioinformatics, Jim Lund. Comparative genomics delivers: clues as to human disease genes and evolutionary history; evidence of general trends in genome evolution; previously unknown regulatory strategies.


Presentation Transcript


  1. Genomic and comparative genomic analysis BIO520 Bioinformatics Jim Lund

  2. Comparative genomics delivers • Clues as to human disease genes and evolutionary history • Evidence of general trends in genome evolution • Previously unknown regulatory strategies • “Natural history” of species as apparent in genome records • Surprises

  3. Difference is in Scale and Direction • Other “omics”: One or several genes compared against all other known genes. Use information from many genomes to learn more about the individual genes. • Comparative: Entire genome compared to other entire genomes. Use genome to inform us about the entire organism.

  4. What are some questions that comparative genomics can address? How has the organism evolved? What differentiates species? Which non-coding regions are important? Which genes are required for organisms to survive in a certain environment? (prokaryotes)

  5. [Figure: tree of species A to F on a time axis, Time (My), with ticks at -200, -150, -80, -10, and 0] Genomic characteristics observed in recently diverged species • Organism-specific differences in gene regulation more apparent than differences in genome sequence or structure • Relatively small amount of neutral drift • Apparent positive selection • Some chromosomal rearrangement • Minimal species-specific gene innovation

  6. [Figure: tree of species A to F on a time axis, Time (My), with ticks at -200, -150, -80, -10, and 0] Genomic characteristics observed in species that have diverged ~80 MYA • Chromosomal rearrangements dominate organizational change. • Changes in chromosome number likely. • Conservation of syntenic regions within rearrangements. • Highly conserved features indicate purifying selection against a background of drift, and therefore mark important genomic features held in common. • Protein domain arrangements largely conserved among orthologs. • Species-specific gene duplication, divergence, and/or loss.

  7. [Figure: tree of species F, A, E, and G on a time axis, Time (My), with ticks at -1000, -500, and 0] Genomic characteristics observed between species that have diverged ~1 BYA • Genome structure has no resolvable large or small-scale homology. • Cis-regulatory regions do not correspond. • Greatest conservation at the functional level in some protein domains and functional RNA. • Different strategies in gene organization and regulation. • Apparent homology in shared-ancestral systems, such as energy processing and storage.

  8. Different Questions Require Different Comparisons From: Hardison. PLoS Biology. Vol 1 (2): 156-160

  9. What is compared? • Gene location • Gene structure • Exon number • Exon lengths • Intron lengths • Sequence similarity • Gene characteristics • Splice sites • Codon usage • Conserved synteny

  10. [Figure: time axis in millions of years] From: Miller et al. Annu. Rev. Genomics Hum. Genet. 2004. 5:15-56.

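  11. Reminder: Orthologues & Paralogues [Figure: globin gene tree over time t. An early globin gene splits at the first duplication event into an alpha chain and a beta chain; the second event (speciation) splits each chain into frog and human copies. Frog alpha and human alpha are orthologues; human alpha and human beta are paralogues.]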

  12. Figure 1 Regions of the human and mouse homologous genes: Coding exons (white), noncoding exons (gray), introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regions in mouse, are shown (white) between the genes and the alignment regions.

  13. Example Functional elements: Gene regulation? Chromatin structure?

  14. Terminologies (Cont’d) • Synteny: Two or more genes that are located on the same chromosome. Relevant within a species. • Conserved synteny: Orthologs of genes that are syntenic in one species are also located on a single chromosome in a second species. Gene order is irrelevant. • Conserved segments/linkages: In a segment of DNA, the order of multiple orthologous genes is the same in two species.
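The distinction between conserved synteny and a conserved segment can be sketched in a few lines of Python. This is only an illustration on hypothetical toy data: the gene names, the `(chromosome, position)` location tables, and the choice to accept a reversed gene order as "the same order" are assumptions, not part of the slide, and the sketch does not check that the genes are contiguous.

```python
def conserved_synteny(genes, loc_a, loc_b, orthologs):
    """True if the genes share one chromosome in species A and their
    orthologs share one chromosome in species B. Gene order is ignored."""
    chroms_a = {loc_a[g][0] for g in genes}
    chroms_b = {loc_b[orthologs[g]][0] for g in genes}
    return len(chroms_a) == 1 and len(chroms_b) == 1


def conserved_segment(genes, loc_a, loc_b, orthologs):
    """True if synteny is conserved and the gene order also matches.
    Assumption: an inverted segment (reversed order) still counts."""
    if not conserved_synteny(genes, loc_a, loc_b, orthologs):
        return False
    order_a = sorted(genes, key=lambda g: loc_a[g][1])
    order_b = sorted(genes, key=lambda g: loc_b[orthologs[g]][1])
    return order_a == order_b or order_a == order_b[::-1]


# Hypothetical toy data: gene -> (chromosome, position).
loc_a = {"a1": ("chr1", 1), "a2": ("chr1", 2), "a3": ("chr1", 3)}
loc_b = {"b1": ("chr5", 10), "b2": ("chr5", 30), "b3": ("chr5", 20)}
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}
genes = ["a1", "a2", "a3"]

synteny_ok = conserved_synteny(genes, loc_a, loc_b, orthologs)
segment_ok = conserved_segment(genes, loc_a, loc_b, orthologs)
print(synteny_ok, segment_ok)
```

In the toy data all three orthologs sit on one chromosome in each species, so synteny is conserved, but the second and third genes have swapped places in species B, so the region is not a conserved segment.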

  15. Image credit: U.S. Department of Energy Human Genome Program From: http://www.macdevcenter.com/pub/a/mac/2004/06/29/bioinformatics.html

  16. Q: Why do gene pairs in syntenic regions have more significant E scores?

  17. VISTA: a genomic alignment and visualization program http://genome.lbl.gov/vista/index.shtml • VISTA automatically finds an orthologue for your input sequence and performs a VISTA similarity plot • Example: Rat BAC: gj (AC097115) • For alignment, uses the AVID or LAGAN programs • Quickly aligns 100s of kb • Can handle sequence in draft format • Uses HMM-like algorithm to find strong anchors from a collection of maximal matches • Uses VISTA browser – sequence alignment visualization tool • Allows easy visualization of areas with high similarity. • Visualization is scalable – allows you to zoom in/out.
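A similarity plot of this kind can be approximated by sliding a window along a pairwise alignment and reporting percent identity in each window. The sketch below is not VISTA's own code: the function name `window_identity`, the window and step sizes, and the rule that gap columns count as mismatches are all assumptions made for illustration.

```python
def window_identity(aln_a, aln_b, window=100, step=50):
    """Percent identity in sliding windows over two aligned sequences.

    aln_a and aln_b are rows of a pairwise alignment (equal length,
    '-' for gaps). Returns a list of (window_start, percent_identity).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # A column matches when both rows carry the same non-gap character.
    matches = [a == b and a != "-"
               for a, b in zip(aln_a.upper(), aln_b.upper())]
    if not matches:
        return []
    profile = []
    last_start = max(len(matches) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = matches[start:start + window]
        profile.append((start, 100.0 * sum(chunk) / len(chunk)))
    return profile


# Tiny hypothetical alignment, with a small window so the result is readable.
profile = window_identity("ACGTACGT", "ACGTAC-A", window=4, step=4)
print(profile)
```

Plotting the second value of each pair against the first gives a curve whose peaks mark the highly similar regions that the slide describes.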

  18. Gene: CARP – cardiac ankyrin repeat protein

  19. There are many genomic alignment and visualization tools: • BLASTZ/PipMaker: http://bio.cse.psu.edu/ • AVID/VISTA: http://www-gsd.lbl.gov/vista/ • LAGAN/Multi-LAGAN: http://lagan.stanford.edu • MAVID: http://baboon.math.berkeley.edu/mAVID • BLAT: http://www.genome.ucsc.edu/ • SSAHA: http://bioinfo.sarang.net/wiki/SSAHA • CONREAL: http://conreal.niob.knaw.nl/ • MUMmer: http://www.tigr.org/software/mummer.

  20. Example output from PipMaker

  21. Genomic view of simple sequence categories Q: What general patterns can be seen? Q: Why do some of the factors correlate with gene density?

  22. Multi-species conservation

  23. Conserved Non-Coding Sequences

  24. What are those MCS (multi-species conserved sequences)? • Regulatory • Transcription factor binding sites • miRNAs or miRNA target sites • Chromosome structure • Insulator sequences • Structural • Replication • Recombination • Chromosome structure

  25. Between-proteome comparisons Used to identify orthologs. Protein alignments involve a search of one protein from species A against the proteome of species B. Several different bioinformatic approaches have been used to make the comparison. • High scoring reciprocal best hits. • COGs (and KOGs) • Genome-wide phylogenetic analysis

  26. Using High scoring reciprocal best hits • High scoring reciprocal best hits with the same domain structure are most likely orthologs • share common ancestry • likely to have the same function • Function likely to be more essential (replication, etc) • Genes are not unique to either organism. • E-value should be <0.01 and alignment should stretch over >60% of each protein • High scoring hits with slightly different domain structures may be orthologous, but it is difficult to tell due to common, conserved domains that have complicated histories • Cluster analysis can help sort this out
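A reciprocal best hit search can be sketched as follows. A lower E-value means a more significant hit, so this sketch keeps a hit only when its E-value is below the 0.01 cutoff and the alignment covers more than 60% of both proteins. The `Hit` record, its field names, and the use of bit score to rank hits are assumptions for illustration; the protein names are hypothetical, and the domain-structure check mentioned on the slide is not implemented.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float
    query_cov: float    # fraction of the query covered by the alignment
    subject_cov: float  # fraction of the subject covered by the alignment


def best_hits(hits, max_evalue=0.01, min_cov=0.60):
    """Map each query to its top-scoring subject among hits passing the filters."""
    best = {}
    for h in hits:
        if h.evalue >= max_evalue:
            continue  # not significant enough
        if min(h.query_cov, h.subject_cov) <= min_cov:
            continue  # alignment does not span enough of each protein
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **filters):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    ab = best_hits(a_vs_b, **filters)
    ba = best_hits(b_vs_a, **filters)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)


# Hypothetical search results in both directions.
a_vs_b = [
    Hit("a1", "b1", 1e-50, 200, 0.9, 0.9),
    Hit("a1", "b2", 1e-10, 80, 0.8, 0.8),
    Hit("a2", "b2", 1e-30, 150, 0.7, 0.7),
    Hit("a3", "b3", 1e-40, 180, 0.3, 0.9),  # fails the coverage filter
]
b_vs_a = [
    Hit("b1", "a1", 1e-50, 200, 0.9, 0.9),
    Hit("b2", "a1", 1e-10, 80, 0.8, 0.8),
    Hit("b2", "a2", 1e-30, 150, 0.7, 0.7),
    Hit("b3", "a3", 1e-40, 180, 0.9, 0.3),  # fails the coverage filter
]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

The pair a3/b3 is dropped even though its E-value is very low, because the alignment covers only 30% of one protein, which is the situation the coverage rule is meant to catch (a shared domain rather than a full-length ortholog).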

  27. Worm v. yeast sequences

  28. What is COG? • The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes. • Each COG group consists of individual orthologous proteins or orthologous sets of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. • http://www.ncbi.nlm.nih.gov/COG

  29. A shortcut for identifying orthologs: the genome-specific best hit (BeT) • Given a gene from one genome, the gene from another genome with the highest sequence similarity (the BeT) is the ortholog.

  30. Algorithm of clustering orthologous groups (overview) [Flowchart] Input protein sequences → All-against-all sequence comparison (gapped BLAST) → paralogs → Graph of BeTs → Ortholog triangle → Merge triangles → Quality control → COG database

  31. The ortholog triangle [Figure: triangle with vertices A(a), B(b), and C(c)] • Multiple alignment • Comparing pairwise alignments of AC and AB, we deduce the alignment of BC. • Comparing the calculated and deduced alignment of BC; if the two alignments are consistent, the BeTs triangle is a triangle of orthologs and can initiate a new COG group.

  32. Algorithm – merging triangles • Triangles that have a common side are merged until no new ones can be joined. • The result is the set of candidate orthologous groups. [Figure: a simple COG with two yeast paralogs, isoleucyl-tRNA synthetase]
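The triangle and merging steps can be sketched on a toy graph. This is a simplified illustration, not the COG pipeline itself: it assumes BeT relationships are already computed and symmetric, it skips the alignment-consistency check and the paralog handling, and the gene and genome labels are hypothetical.

```python
from itertools import combinations


def find_triangles(bet_edges, genome_of):
    """Triangles of mutual BeTs: three genes from three different
    genomes in which every pair is linked by a BeT edge."""
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[g] for g in trio}) < 3:
            continue  # a triangle needs three distinct genomes
        if all(frozenset(p) in bet_edges for p in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles):
    """Join triangles that share a side (two genes) until none can be joined."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:  # common side
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical genes labelled by genome, and symmetric BeT edges.
genome_of = {"A1": "A", "B1": "B", "C1": "C", "D1": "D",
             "A2": "A", "B2": "B", "C2": "C"}
bet_edges = {frozenset(e) for e in [
    ("A1", "B1"), ("B1", "C1"), ("A1", "C1"),  # triangle A1-B1-C1
    ("B1", "D1"), ("C1", "D1"),                # triangle B1-C1-D1
    ("A2", "B2"), ("B2", "C2"),                # open path, not a triangle
]}
triangles = find_triangles(bet_edges, genome_of)
groups = merge_triangles(triangles)
print(groups)
```

The two triangles share the side B1-C1, so they merge into one candidate group of four genes, while A2, B2, and C2 never form a closed triangle and are left out.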

  33. Functional and phylogenetic patterns E, E. coli; H, H. influenzae; G, M. genitalium; P, M. pneumoniae; C, Synechocystis sp.; M, M. jannaschii; Y, S. cerevisiae.

  34. Phyletic patterns of COGs (2003) ~500 COGs • 74% of COGs show scattered distribution, which reflects frequent lineage-specific gene loss and horizontal gene transfer in prokaryotic evolution.
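A phyletic pattern is simply the presence or absence of a COG across the genomes being compared. The sketch below builds such patterns using the one-letter species codes from slide 33 and tallies how often each pattern occurs. The COG identifiers and their member sets are hypothetical, and no attempt is made to define what counts as a "scattered" distribution, since the slide does not give a rule.

```python
from collections import Counter

# One-letter species codes as listed on slide 33.
SPECIES = "EHGPCMY"


def phyletic_pattern(present, species=SPECIES):
    """Return a string with the species letter where the COG is present
    and '-' where it is absent, in a fixed species order."""
    return "".join(s if s in present else "-" for s in species)


# Hypothetical COGs mapped to the set of species that carry a member.
cog_members = {
    "cog_x1": {"E", "H", "G", "P", "C", "M", "Y"},
    "cog_x2": {"E", "H", "Y"},
    "cog_x3": {"E", "H", "Y"},
    "cog_x4": {"M"},
}
patterns = {cog: phyletic_pattern(sp) for cog, sp in cog_members.items()}
pattern_counts = Counter(patterns.values())
print(patterns)
print(pattern_counts.most_common())
```

Grouping COGs by pattern in this way is what makes it possible to count how many are universal and how many are missing from particular lineages.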

  35. Representation of the 7 analyzed eukaryotic species in KOGs • KOG: eukaryotic orthologous groups
