
GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES

Shin-Han Shiu, Plant Biology / QBMI, Michigan State University


Presentation Transcript


  1. GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES Shin-Han Shiu Plant Biology / QBMI Michigan State University

  2. Genomes and gene contents (figure: gene counts per genome, labeled 17,000; 6,000; 45,000; 10,000; 30,000; 25,000)

  3. Duplicate genes in the genome • Arabidopsis gene families* *: Families are clusters from Markov clustering, using all-against-all BLAST E-values as the distance measure

  4. Gene function and duplication • What’s the consequence?

  5. Gene function and duplication • What’s the consequence?

  6. Focus I: Duplication Mechanism and Loss Rate (outline: Gene Duplications; Mechanisms; Preferential retention; Consequences)

  7. Duplication mechanisms • Whole genome duplication • Tandem duplication • Segmental duplication • Replicative transposition

  8. Lineage-specific gains in plants and animals • Substantially more recent duplicates in plants than in animals • Mostly due to frequent whole genome duplications in plants *: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively). **: Numbers in parentheses refer to the percentage of the total based on normalized gains.

  9. Gain vs. Loss • 3 rounds of whole-genome duplication in the Arabidopsis lineage • ~82% of duplicates from the last round were lost in the past 40 million years • Doubling series in the figure: 15,000* → 30,000 → 60,000 → 120,000 • Genome duplications + tandem duplications – gene losses = Arabidopsis gene content: 21,000** *: Number of orthologous groups in shared families between Arabidopsis and rice. **: Number of genes in shared families.
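
The arithmetic behind this slide can be sketched as follows. This is only an illustration using the numbers on the slide; it assumes the 15,000 orthologous groups stand in for the pre-duplication gene count and it ignores tandem duplications, so the retained fraction is a rough figure, not a value reported in the talk. The function name is hypothetical.

```python
def expected_after_wgd(ancestral: int, rounds: int) -> int:
    """Gene count if every copy were kept after each whole-genome duplication."""
    return ancestral * 2 ** rounds


# Numbers taken from the slide footnotes.
ancestral_groups = 15_000  # orthologous groups shared between Arabidopsis and rice
observed_genes = 21_000    # Arabidopsis genes in shared families

# Assumption: three doublings with no loss and no tandem duplication.
expected = expected_after_wgd(ancestral_groups, rounds=3)
retained_fraction = observed_genes / expected

print(expected)                    # 120000
print(f"{retained_fraction:.1%}")  # 17.5%
```

Under these assumptions, the present-day count is a small fraction of what three lossless doublings would give, which is the point of the "gain vs. loss" comparison.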

  10. “Age” distribution of animal duplicates • Steady decay in the number of duplicates • Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT) • Ks: synonymous substitution rate, i.e. nucleotide substitutions at codon sites that do not change the amino acid Shiu et al., 2006
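
An "age" distribution of this kind is a histogram of Ks values over duplicate pairs. The sketch below bins Ks values and reports the fullest bin. The Ks values are made up for illustration, and the bin width of 0.1 and the cutoff of 1.0 are assumptions, not parameters from the talk.

```python
from collections import Counter


def ks_histogram(ks_values, bin_width=0.1, max_ks=1.0):
    """Count duplicate pairs per Ks bin; bin i covers [i*bin_width, (i+1)*bin_width)."""
    counts = Counter()
    for ks in ks_values:
        if 0 <= ks < max_ks:
            counts[int(ks / bin_width)] += 1
    n_bins = round(max_ks / bin_width)
    return [counts.get(i, 0) for i in range(n_bins)]


def peak_bin(histogram):
    """Index of the bin holding the most duplicate pairs."""
    return max(range(len(histogram)), key=histogram.__getitem__)


# Hypothetical Ks values (chosen away from bin edges to avoid float rounding issues).
example_ks = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.12, 0.15,
              0.18, 0.25, 0.27, 0.35, 0.55, 0.75]

hist = ks_histogram(example_ks)
print(hist)            # [6, 3, 2, 1, 0, 1, 0, 1, 0, 0]
print(peak_bin(hist))  # 0
```

A steady decay shows up as the fullest bin sitting at the lowest Ks, as in this toy input; a burst of duplication such as a whole genome duplication would instead shift the fullest bin away from zero.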

  11. Plant duplicate “age” distribution • Apparent peak at Ks ~0.18 instead of zero • Frequent WGD, TD, SD (maybe), and RT (in some plants) Shiu et al., 2004

  12. Genome remodeling in polyploids • Natural and synthetic polyploids (figure labels: genome sizes ~314 Mb, ~203 Mb, ~257 Mb, ~348 Mb; 20,000 yr)

  13. Experimental approaches • Genome-wide polymorphism monitored by tiling array (~6 million features) (figure labels: genome, tiled probes, resolution, gap, array; 20,000 yr)

  14. Genome-wide Single Feature Polymorphism • Mid-parent (MP) vs. Arabidopsis suecica (As)

  15. Genome-wide Single Feature Polymorphism • Genome-wide polymorphism monitored by tiling array (feature classes: gene, pseudogene, transposon)

  16. Genome-wide Single Feature Polymorphism • Duplication or deletion: MP duplication or As deletion

  17. Genome Survey Sequencing • Sequence ~40-60Mb of the Arabidopsis suecica genome • 0.15-0.2 X coverage, will be done next week! • Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant • Ultra-high throughput • 20-30 Mb per run, each run 5 hours • Will be 100Mb per run early 2007 • Cost efficient • ~$0.3/kb • Read length rather limited • ~100bp per read now • Will be ~200bp early 2007 • For more information contact: • Andreas Weber (aweber@msu.edu) • David DeWitt (dewittd@msu.edu) • Or Shin-Han Shiu (shius@msu.edu) • Seminar on instrumentation: • 9/29, Friday, 1pm, 1415 BPS
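
The figures on this slide can be cross-checked with some simple arithmetic. The sketch below is an illustration only: pairing 40 Mb with 0.15X and 60 Mb with 0.2X is an assumption, the conversion of 1 Mb = 1,000 kb is applied to the quoted ~$0.3/kb, and the function names are hypothetical.

```python
import math

KB_PER_MB = 1000


def implied_genome_mb(sequenced_mb: float, coverage: float) -> float:
    """Genome size implied by coverage = sequenced bases / genome size."""
    return sequenced_mb / coverage


def runs_needed(sequenced_mb: float, mb_per_run: float) -> int:
    """Whole instrument runs required to reach the sequencing target."""
    return math.ceil(sequenced_mb / mb_per_run)


def cost_usd(sequenced_mb: float, usd_per_kb: float = 0.3) -> float:
    """Sequencing cost at a flat per-kb price."""
    return sequenced_mb * KB_PER_MB * usd_per_kb


# Assumed pairing of the low and high ends of the ranges quoted on the slide.
for sequenced_mb, coverage in [(40, 0.15), (60, 0.2)]:
    genome = round(implied_genome_mb(sequenced_mb, coverage))
    cost = round(cost_usd(sequenced_mb))
    print(f"{sequenced_mb} Mb at {coverage}X: genome ~{genome} Mb, cost ~${cost}")

# Worst case for run count: the larger target at the lower per-run yield (20 Mb).
print(runs_needed(60, 20), "runs of 5 hours each")
```

Under that pairing the implied genome size falls between roughly 267 and 300 Mb, and the survey needs only two or three runs at 2006 yields.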

  18. Summary: Gene duplication and polyploidy • Gene duplication occurred frequently in eukaryotes, but most duplicates are lost. • In plants, whole genome duplication is common, but gene loss also occurs frequently. • After 4 generations, only a very small number of SFPs are identified in synthetic polyploids. • After 20,000 generations, most coding genes do not have clustered sequence polymorphisms indicative of deletion. • Clustered polymorphisms are mostly located in pseudogenes and transposons. • Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.

  19. Focus II: Differential Retention of Duplicates (outline: Gene Duplications; Mechanisms; Preferential retention; Consequences)

  20. Duplicate genes in the genome • Arabidopsis gene families* *: Families are clusters from Markov clustering, using all-against-all BLAST E-values as the distance measure

  21. Large gene families in plants • One of the largest gene families

  22. Normalized gain: % expanded OGs • Large family sizes do not necessarily indicate higher expansion rates

  23. Ancestral family sizes and gene gains • Large ancestral families tend to have more lineage-specific gains, but with many exceptions

  24. Differential expansion of functional categories • GO: Gene Ontology • Protein ubiquitination • Polysaccharide biosynthesis • Cell wall modification • Transcriptional regulation • Biotic stress response • Secondary metabolism

  25. Differences in Duplicability • Duplicability • The propensity for the retention of a duplicate gene • Computational analysis of genome-wide trend

  26. Kinase superfamily sizes among eukaryotes Shiu & Bleecker, 2003

  27. Kinase families in rice and Arabidopsis • Gene count differences among families indicate differential expansion Shiu et al., 2004

  28. Estimation of ancestral RLK family size • Kinase phylogeny of Arabidopsis and rice RLKs • 440 speciation points (figure panels A and B; labels: WAK; LRR VIII, X, XII; rice; Arabidopsis) Shiu et al., 2004

  29. Development vs. resistance/defense RLKs Shiu et al., 2004

  30. Contradiction • Resistance/defense RLKs: high duplicability • Developmental RLKs: low duplicability • Animal tyrosine kinases: low duplicability • Transcription factors: high duplicability • Plant genes involved in development tend to have high duplicability

  31. Selection for expansion • Depends on the level of variation of the signals

  32. Summary: differential retention • Longevity and duplicability of plant genes • Duplicability high, longevity high: transcription factors • Duplicability high, longevity low: resistance genes • Duplicability low, longevity high: enzymes in central metabolic pathways • Duplicability low, longevity low: ??

  33. Focus III: Functional Consequences (outline: Gene Duplications; Mechanisms; Preferential retention; Consequences)

  34. Functional Consequences of Duplication • Functional divergence and conservation • Is it because of changes in cis-regulatory elements or in coding sequences? • How are duplicates retained: subfunctionalization or neofunctionalization?

  35. Divergence in gene expression • Develop pipelines for cis-element prediction and (flowchart labels: expression data; clusters of genes with similar expression profiles; over-represented sequence motifs in 5’ regions; cis-regulatory logic; machine learning; experimental validations; motif functional prediction)
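
One step named on this slide, finding over-represented sequence motifs in 5’ regions of co-expressed genes, can be sketched as a k-mer enrichment count. This is a minimal illustration and not the pipeline from the talk: the scoring (a smoothed ratio of presence fractions), the function names, and the toy sequences are all assumptions.

```python
from collections import Counter


def kmers(seq: str, k: int) -> set:
    """Distinct k-mers present in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_counts(seqs, k: int) -> Counter:
    """For each k-mer, the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update(kmers(seq.upper(), k))
    return counts


def enrichment(cluster, background, k=6, pseudocount=1.0):
    """Rank k-mers by (fraction of cluster seqs with k-mer) / (fraction of background seqs)."""
    fg = presence_counts(cluster, k)
    bg = presence_counts(background, k)
    scores = {}
    for kmer, n in fg.items():
        fg_frac = (n + pseudocount) / (len(cluster) + 2 * pseudocount)
        bg_frac = (bg.get(kmer, 0) + pseudocount) / (len(background) + 2 * pseudocount)
        scores[kmer] = fg_frac / bg_frac
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical 5' regions: every cluster sequence shares one 6-mer.
cluster_seqs = ["TTCACGTGTT", "AACACGTGAC", "GGCACGTGCC"]
background_seqs = ["TTTTTTTTTT", "AAAAAAAAAA", "GGGGGGGGGG", "CCCCCCCCCC"]

ranked = enrichment(cluster_seqs, background_seqs)
print(ranked[0][0])  # CACGTG
```

A real analysis would use a statistical test and a genome-wide background instead of a plain ratio; the ratio is used here only to keep the sketch dependency-free.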

  36. Divergence in post-translational modification • Conservation of phosphorylation sites across species • SACE: budding yeast • CAGL: Candida glabrata • CAAL: Candida albicans • CATR: Candida tropicalis • NECR: Neurospora crassa • DEHA: Debaryomyces hansenii

  37. Detailed Functional Studies of Duplicate Genes • Functional analyses of DDF1 and DDF2 transcription factors • Derived from a recent whole genome duplication in Arabidopsis • Related to the well-known CBF factors involved in cold and drought stress • Approaches for the DDFs in both Arabidopsis thaliana and Arabidopsis lyrata: promoter GFP, knockouts, over-expression studies, binding targets, interacting proteins

  38. Focus IV: Protein space (outline: Gene Duplications; Mechanisms; Preferential retention; Consequences)

  39. Tiling array analysis of transcriptome • Human Chr 21, 22 Kapranov et al., 2002

  40. Posterior probability p(F|coding)

  41. Performance of the CI measure • Known Arabidopsis exons and introns, 90-300 bp • Arabidopsis small proteins that are not annotated • Correctly predicted 19 out of 20 (95%) • Yeast sORFs with translation evidence • Correctly predicted 98 out of 114 (86%) • In “intergenic” sequences of the Arabidopsis genome • 3,274 sORFs identified

  42. Coupling with tiling array expression • Hybridization intensities for feature types

  43. Summary: Novel coding genes • Many unannotated regions in the genomes are expressed. • Using the CI measure, many yeast and Arabidopsis proteins that were not annotated but have evidence of expression are identified correctly. • Using the CI measure, we estimated that ~3,000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome. • Using tiling array data, we found that many of these novel coding regions are expressed.

  44. Acknowledgement • Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode • University of Chicago: Justin Borevitz, Xu Zhang • University of Wisconsin: Sara Patterson, Rick Vierstra • University of Missouri: Scott Peck • Michigan State University: Many…; Rong Jin, Comp Sci & Eng; Yue-Hua Cui, Stat & Prob • Startup fund

  45. Recent completion …

  46. Genome remodeling in polyploids • Genome duplications occur frequently in plants • What is the fate of duplicates? • How fast do gene losses occur? • Is there any preference in genes retained? (figure: genes A-E duplicated into copies A1-E1 and A2-E2, with copies lost over times t1 and t2; gene number Ng = 5, 10, 8, 5)

  47. Comparing degrees of expansion • Arabidopsis: ~25,000 proteins • Rice prediction: ~66,000 genes (workflow labels: combined set; gene/domain families, shared or unique; pairwise distance; putative orthologous groups) • Example category GO:0001: ui = 1, ei = 4 • Over all orthologous groups: total unexpanded = Σ ui, total expanded = Σ ei
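
The tally on this slide (unexpanded count ui and expanded count ei per category, summed over all orthologous groups) can be sketched as follows. The definition of "expanded" used here, more than one gene from the lineage in an orthologous group, is an assumption, as are the function name and the toy data; only the GO:0001 counts (ui = 1, ei = 4) mirror the slide.

```python
def expansion_summary(groups_by_category, lineage):
    """Tally unexpanded (u) and expanded (e) orthologous groups per category.

    groups_by_category maps a category to a list of orthologous groups, each a
    dict of {lineage name: number of genes from that lineage in the group}.
    """
    per_category = {}
    for category, groups in groups_by_category.items():
        # Assumption: a group is "expanded" if the lineage contributes >1 gene.
        expanded = sum(1 for group in groups if group.get(lineage, 0) > 1)
        per_category[category] = (len(groups) - expanded, expanded)

    total_u = sum(u for u, _ in per_category.values())
    total_e = sum(e for _, e in per_category.values())
    total = total_u + total_e
    pct_expanded = 100.0 * total_e / total if total else 0.0
    return per_category, total_u, total_e, pct_expanded


# Hypothetical orthologous groups; GO:0002 is an invented second category.
example_groups = {
    "GO:0001": [{"Arabidopsis": 1, "rice": 1},
                {"Arabidopsis": 2, "rice": 1},
                {"Arabidopsis": 3, "rice": 1},
                {"Arabidopsis": 2, "rice": 2},
                {"Arabidopsis": 5, "rice": 1}],
    "GO:0002": [{"Arabidopsis": 1, "rice": 1},
                {"Arabidopsis": 1, "rice": 3},
                {"Arabidopsis": 1, "rice": 1},
                {"Arabidopsis": 2, "rice": 1}],
}

per_category, total_u, total_e, pct = expansion_summary(example_groups, "Arabidopsis")
print(per_category["GO:0001"])  # (1, 4)
print(total_u, total_e)         # 4 5
```

Expressing the result as a percentage of expanded groups, rather than a raw gene count, is what lets families of very different sizes be compared.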

  48. Major questions on gene duplication • When: timing of gene duplications, e.g. N = 10

  49. Domain gains in rice and Arabidopsis • Gain in one lineage does not necessarily predict gain in the other

  50. Identify novel small coding genes • Determine base composition probabilities from coding sequences (CDS parameters) and non-coding sequences (NCDS parameters) • Pc(AAA) = (# of AAA) / (# of all NNN) • Pc(T|AAA) = Pc(AAAT) / Pc(AAA) • Feature tables: c1, c2, c3, c4, c5, c6, n • Calculate posterior probability
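
The idea on this slide, estimating conditional base probabilities such as Pc(T|AAA) from coding and non-coding training sequences and then computing a posterior, can be sketched as below. This is a simplified illustration and not the CI measure itself: it ignores the reading-frame feature tables (c1 to c6), uses add-one smoothing, assumes an equal prior by default, and trains on toy sequences. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def train(seqs, order=3, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases), e.g. P(T|AAA), with smoothing."""
    context_counts, window_counts = Counter(), Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            window = seq[i:i + order + 1]
            if set(window) <= set(ALPHABET):  # skip windows with ambiguous bases
                context_counts[window[:order]] += 1
                window_counts[window] += 1
    model = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        denom = context_counts[context] + pseudocount * len(ALPHABET)
        for base in ALPHABET:
            model[context + base] = (window_counts[context + base] + pseudocount) / denom
    return model


def log_likelihood(seq, model, order=3):
    """Sum of log conditional probabilities over the sequence (first `order` bases ignored)."""
    seq = seq.upper()
    windows = (seq[i:i + order + 1] for i in range(len(seq) - order))
    return sum(math.log(model[w]) for w in windows if w in model)


def posterior_coding(seq, coding_model, noncoding_model, prior_coding=0.5, order=3):
    """Bayes' rule: posterior probability that seq comes from the coding model."""
    log_c = log_likelihood(seq, coding_model, order) + math.log(prior_coding)
    log_n = log_likelihood(seq, noncoding_model, order) + math.log(1 - prior_coding)
    top = max(log_c, log_n)  # subtract the max to avoid underflow
    return math.exp(log_c - top) / (math.exp(log_c - top) + math.exp(log_n - top))


# Toy training sets (hypothetical, far too small for real use).
coding_model = train(["ATG" + "GCT" * 6])
noncoding_model = train(["AT" * 10])

print(posterior_coding("GCTGCTGCTGCT", coding_model, noncoding_model) > 0.5)  # True
print(posterior_coding("ATATATATATAT", coding_model, noncoding_model) < 0.5)  # True
```

Working in log space keeps the product of many small probabilities from underflowing on longer sequences.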
