GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES Shin-Han Shiu Plant Biology / QBMI Michigan State University
Genomes and gene contents [Figure: gene counts of the genomes shown: 17,000; 6,000; 45,000; 10,000; 30,000; 25,000]
Duplicate genes in the genome • Arabidopsis gene families* *: Families are clusters from Markov clustering, with all-against-all BLAST E values as the distance measure
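The footnote on family definition can be illustrated with a toy Markov clustering run. This is only a sketch: the slide does not say how E values were turned into edge weights or which parameters were used, so the `-log10(E)` transform, the cap, the inflation value, and the gene names below are all assumptions, and a real analysis would normally use the dedicated `mcl` program.

```python
import math

def evalue_to_weight(evalue, cap=200.0):
    """Assumed transform: similarity weight = -log10(E value), capped."""
    if evalue <= 0.0:
        return cap
    return max(0.0, min(cap, -math.log10(evalue)))

def markov_cluster(hits, inflation=2.0, iterations=50, eps=1e-6):
    """Toy Markov clustering. hits maps (gene_a, gene_b) -> BLAST E value."""
    genes = sorted({g for pair in hits for g in pair})
    idx = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), evalue in hits.items():
        w = evalue_to_weight(evalue)
        m[idx[a]][idx[b]] = m[idx[b]][idx[a]] = w
    for i in range(n):
        # Self-loop: strongest edge of the gene, or 1.0 if it has none.
        m[i][i] = max(m[i]) or 1.0

    def normalize(mat):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total

    normalize(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = [[x ** inflation for x in row] for row in m]
        normalize(m)

    # Rows that keep mass are attractors; their nonzero columns form a cluster.
    clusters = {tuple(genes[j] for j in range(n) if m[i][j] > eps)
                for i in range(n)}
    clusters.discard(())
    return sorted(clusters)

# Hypothetical E values for five made-up genes.
toy_hits = {("a", "b"): 1e-50, ("a", "c"): 1e-40,
            ("b", "c"): 1e-45, ("d", "e"): 1e-60}
families = markov_cluster(toy_hits)
```

Genes with no hit to each other can never share a family here, because expansion only spreads mass along existing edges.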
Gene function and duplication • What’s the consequence?
Focus I: Duplication Mechanism and Loss Rate Gene Duplications Mechanisms Preferential retention Consequences
Duplication mechanisms • Whole genome duplication • Tandem duplication • Segmental duplication • Replicative transposition
Lineage-specific gains in plants and animals • Substantially more recent duplicates in plants than in animals • Mostly due to frequent whole genome duplications in plants *: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively). **: Numbers in parentheses are percentages of the total based on normalized gains.
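A minimal sketch of the normalization described in the footnote, assuming it means that gains counted over the 150 My Arabidopsis-rice divergence are scaled down by 150/100 so they can be compared with gains over the 100 My human-mouse divergence. The function names and the raw count are hypothetical; the actual gain counts are in the slide's table, which is not reproduced here.

```python
def normalize_gain(raw_gain, divergence_mya, reference_mya=100):
    """Scale a lineage-specific gain count to a common divergence time."""
    return raw_gain * reference_mya / divergence_mya

def percent_of_total(normalized_gain, total_genes):
    """Normalized gain as a percentage of the total gene count."""
    return 100 * normalized_gain / total_genes

# Hypothetical raw count, for illustration only.
plant_gain = normalize_gain(3000, divergence_mya=150)   # scaled by 100/150
animal_gain = normalize_gain(3000, divergence_mya=100)  # unchanged
```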
Gain vs. Loss
• 3 rounds of whole-genome duplications in the Arabidopsis lineage
• ~82% of the duplicates from the last round were lost in the past 40 million years
• Gene content if nothing were lost: 15,000* → 30,000 → 60,000 → 120,000
• Genome duplications + tandem duplications – gene losses = Arabidopsis gene content: 21,000**
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
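The bookkeeping on this slide can be worked through as simple arithmetic. The sketch below uses only the numbers shown (15,000 ancestral orthologous groups, 3 rounds, 21,000 genes today). It ignores tandem duplications, so the loss figure is a lower bound, and it is not necessarily how the ~82% figure for the last round was derived.

```python
ANCESTRAL_GROUPS = 15_000   # orthologous groups in shared families
WGD_ROUNDS = 3              # whole-genome duplications in the lineage
OBSERVED_GENES = 21_000     # genes in shared families today

def content_without_loss(ancestral, rounds):
    """Gene content if every duplicate from each round were kept."""
    return ancestral * 2 ** rounds

series = [content_without_loss(ANCESTRAL_GROUPS, r)
          for r in range(WGD_ROUNDS + 1)]
expected = series[-1]

# Tandem duplications add genes on top of `expected`,
# so the true number of losses is at least this large.
minimum_losses = expected - OBSERVED_GENES
retained_fraction = OBSERVED_GENES / expected
```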
“Age” distribution of animal duplicates • Steady decay in the number of duplicates • Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT) Ks: rate of nucleotide substitutions at codon sites that do not affect amino acid identity (synonymous sites) Shiu et al., 2006
Plant duplicate “age” distribution • Apparent peak at Ks ~0.18 instead of zero • Frequent WGD, TD, SD (maybe), and RT (in some plants) Shiu et al., 2004
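An "age" distribution of this kind is a histogram of Ks values over duplicate pairs. The sketch below bins Ks values and reports the most populated bin. The bin width, the cutoff, and the Ks values are made up for illustration and are not data from Shiu et al.

```python
from collections import Counter

def ks_histogram(ks_values, bin_width=0.05, max_ks=1.0):
    """Count duplicate pairs per Ks bin; pairs at or above max_ks are skipped."""
    counts = Counter()
    for ks in ks_values:
        if 0 <= ks < max_ks:
            counts[int(ks / bin_width)] += 1
    return counts

def modal_bin(counts, bin_width=0.05):
    """Return the (lower, upper) Ks bounds of the most populated bin."""
    b = max(counts, key=counts.get)
    return (round(b * bin_width, 10), round((b + 1) * bin_width, 10))

# Hypothetical Ks values for duplicate pairs.
toy_ks = [0.02, 0.16, 0.17, 0.18, 0.19, 0.42]
peak = modal_bin(ks_histogram(toy_ks))
```

A distribution that decays steadily from zero would put the modal bin at the lowest Ks, whereas a burst of duplication such as a WGD shifts it to a higher bin.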
Genome remodeling in polyploids • Natural and synthetic polyploids [Figure: genome sizes of the species shown: ~314 Mb, ~203 Mb, ~257 Mb, ~348 Mb; time label: 20,000 yr]
Experimental approaches • Genome-wide polymorphism monitored by tiling array [Figure: tiling array design: genome, tiled probes, resolution, gap; array with ~6 million features; time label: 20,000 yr]
Genome-wide Single Feature Polymorphism • Mid-parent (MP) vs. Arabidopsis suecica (As)
Genome-wide Single Feature Polymorphism • Genome-wide polymorphism monitored by tiling array [Figure panels: Gene, Pseudogene, Transposon]
Genome-wide Single Feature Polymorphism • Duplication or deletion MP duplication or As deletion
Genome Survey Sequencing
• Sequence ~40-60 Mb of the Arabidopsis suecica genome
• 0.15-0.2X coverage, will be done next week!
• Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
  • Ultra-high throughput: 20-30 Mb per run, each run 5 hours; will be 100 Mb per run early 2007
  • Cost efficient: ~$0.3/kb
  • Read length rather limited: ~100 bp per read now; will be ~200 bp early 2007
• For more information contact:
  • Andreas Weber (aweber@msu.edu)
  • David DeWitt (dewittd@msu.edu)
  • Or Shin-Han Shiu (shius@msu.edu)
• Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS
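The throughput and cost figures above translate into a rough plan for the survey. The sketch below is plain arithmetic on the slide's numbers (Mb per run, 5 hours per run, ~$0.3/kb, ~100 bp reads). The function name is hypothetical, and it assumes 1 Mb = 1,000 kb and that the quoted cost applies per kb of sequence produced.

```python
import math

def survey_plan(target_mb, mb_per_run, hours_per_run=5,
                usd_per_kb=0.3, read_bp=100):
    """Estimate runs, instrument time, cost, and read count for a target yield."""
    runs = math.ceil(target_mb / mb_per_run)
    return {
        "runs": runs,
        "instrument_hours": runs * hours_per_run,
        "cost_usd": round(target_mb * 1000 * usd_per_kb, 2),
        "reads": target_mb * 1_000_000 // read_bp,
    }

best_case = survey_plan(target_mb=40, mb_per_run=30)   # small target, high yield
worst_case = survey_plan(target_mb=60, mb_per_run=20)  # large target, low yield
```

With these assumptions the survey needs between 2 and 3 runs at the stated 2006 throughput.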
Summary: Gene duplication and polyploidy
• Gene duplication has occurred frequently in eukaryotes, but most duplicates are lost.
• In plants, whole genome duplication is common, but gene loss also occurs frequently.
• After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.
• After 20,000 generations, most coding genes do not have the clustered sequence polymorphisms that are indicative of deletion.
• Clustered polymorphisms are mostly located in pseudogenes and transposons.
• Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.
Focus II: Differential Retention of Duplicates Gene Duplications Mechanisms Preferential retention Consequences
Duplicate genes in the genome • Arabidopsis gene families* *: Families are clusters from Markov clustering, with all-against-all BLAST E values as the distance measure
Large gene families in plants • One of the largest gene families
Normalized gain: % expanded OGs • Large family sizes do not necessarily indicate higher expansion rates
Ancestral family sizes and gene gains • Large ancestral families tend to have more lineage-specific gains, but with many exceptions
Differential expansion of functional categories • GO: Gene Ontology • Protein ubiquitination • Polysaccharide biosynthesis • Cell wall modification • Transcriptional regulation • Biotic stress response • Secondary metabolism
Differences in Duplicability • Duplicability • The propensity for the retention of a duplicate gene • Computational analysis of genome-wide trend
Kinase superfamily sizes among eukaryotes Shiu & Bleecker, 2003
Kinase families in rice and Arabidopsis • Gene count differences among families indicate differential expansion Shiu et al., 2004
Estimation of ancestral RLK family size • Kinase phylogeny of Arabidopsis and rice RLKs • 440 speciation points [Figure panels A and B: rice and Arabidopsis RLK phylogeny; labels: WAK, LRR VIII, X, XII] Shiu et al., 2004
Development vs. resistance/defense RLKs Shiu et al., 2004
Contradiction
• Resistance/defense RLKs: high duplicability
• Developmental RLKs: low duplicability
• Animal tyrosine kinases: low duplicability
• Transcription factors: high duplicability
• Plant genes involved in development tend to have high duplicability
Selection for expansion • Depends on the level of variation of the signals
Summary: differential retention
Longevity and duplicability of plant genes
• High duplicability, high longevity: transcription factors
• High duplicability, low longevity: resistance genes
• Low duplicability, high longevity: enzymes in central metabolic pathways
• Low duplicability, low longevity: ??
Focus III: Functional Consequences Gene Duplications Mechanisms Preferential retention Consequences
Functional Consequences of Duplication • Functional divergence and conservation • Is it because of changes in cis-regulatory elements or coding sequences? • How are duplicates retained: subfunctionalization or neofunctionalization?
Divergence in gene expression • Develop pipelines for cis-element prediction and [Workflow figure labels: Expression data; Clusters of genes with similar expression profiles; Over-represented sequence motifs in 5’ regions; Cis-regulatory logic; Machine learning; Motif functional prediction; Experimental validations]
Divergence in post-translational modification • Conservation of phosphorylation sites across species • SACE: budding yeast • CAGL: Candida glabrata • CAAL: Candida albicans • CATR: Candida tropicalis • NECR: Neurospora crassa • DEHA: Debaryomyces hansenii
Detailed Functional Studies of Duplicate Genes • Functional analyses of DDF1 and DDF2 transcription factors • Derived from a recent whole genome duplication in Arabidopsis • Related to the well-known CBF factors involved in cold and drought stress [Figure: analyses of the DDFs in Arabidopsis thaliana and Arabidopsis lyrata: promoter GFP, knockouts, overexpression studies, binding targets, interacting proteins]
Focus IV: Protein space Gene Duplications Mechanisms Preferential retention Consequences
Tiling array analysis of transcriptome • Human Chr 21, 22 Kapranov et al., 2002
Performance of the CI measure
• Known Arabidopsis exons and introns, 90-300 bp
• Arabidopsis small proteins that are not annotated
  • Correctly predicts 19 out of 20 (95%)
• Yeast sORFs with translation evidence
  • Correctly predicts 98 out of 114 (86%)
• In “intergenic” sequences of the Arabidopsis genome
  • 3,274 sORFs identified
Coupling with tiling array expression • Hybridization intensities for feature types
Summary: Novel coding genes
• Many unannotated regions in the genomes are expressed.
• Using the CI measure, many yeast and Arabidopsis proteins that were not annotated but have evidence of expression are identified correctly.
• Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.
• Using tiling array data, we found that many of these novel coding regions are expressed.
Acknowledgement
• Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode
• University of Chicago: Justin Borevitz, Xu Zhang
• University of Wisconsin: Sara Patterson, Rick Vierstra
• University of Missouri: Scott Peck
• Michigan State University: Many…; Rong Jin, Comp Sci & Eng; Yue-Hua Cui, Stat & Prob
• Startup fund
Genome remodeling in polyploids • Genome duplication occurs frequently in plants • What is the fate of duplicates? • How fast do gene losses occur? • Is there any preference in genes retained? [Figure: genes A to E are duplicated into copies A1 to E1 and A2 to E2, with later time points t1 and t2; gene number Ng = 5, 10, 8, 5]
Comparing degrees of expansion
• Arabidopsis: ~25,000 proteins; rice prediction: ~66,000 genes
• Combined set: gene/domain families, shared or unique
• Pairwise distance: putative orthologous groups
• Per GO category (e.g. GO:0001): unexpanded groups ui = 1, expanded groups ei = 4
• Over all orthologous groups: Total unexpanded = Σ ui, Total expanded = Σ ei
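The tally of unexpanded (ui) and expanded (ei) orthologous groups per GO category could be sketched as follows. The slide does not define "expanded", so this assumes an orthologous group counts as expanded when either lineage has more than one gene in it. The input layout and the gene counts are hypothetical, chosen only to reproduce the ui = 1, ei = 4 example.

```python
from collections import defaultdict

def tally_expansion(orthologous_groups):
    """orthologous_groups: iterable of (go_term, n_arabidopsis, n_rice)."""
    tally = defaultdict(lambda: {"unexpanded": 0, "expanded": 0})
    for go_term, n_ath, n_rice in orthologous_groups:
        # Assumed rule: more than one gene in either lineage means expansion.
        key = "expanded" if (n_ath > 1 or n_rice > 1) else "unexpanded"
        tally[go_term][key] += 1
    return dict(tally)

def percent_expanded(counts):
    """Percentage of expanded orthologous groups in one category."""
    total = counts["unexpanded"] + counts["expanded"]
    return 100 * counts["expanded"] / total

# Hypothetical gene counts per orthologous group (Arabidopsis, rice).
toy_groups = [
    ("GO:0001", 1, 1),
    ("GO:0001", 2, 1),
    ("GO:0001", 1, 3),
    ("GO:0001", 4, 2),
    ("GO:0001", 2, 2),
]
result = tally_expansion(toy_groups)
```

Summing the two counts over all categories gives the totals Σ ui and Σ ei.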
Major questions on gene duplication • When: timing of gene duplications, e.g. N = 10
Domain gains in rice and Arabidopsis • Gain in one lineage does not necessarily predict gain in the other
Identify novel small coding genes
• Determine base composition probabilities
  • Coding sequences: CDS parameters
  • Non-coding sequences: NCDS parameters
  • Pc(AAA) = (# of AAA) / (# of all NNN)
  • Pc(T|AAA) = Pc(AAAT) / Pc(AAA)
• Feature tables: c1, c2, c3, n, c4, c5, c6
• Calculate posterior probability
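The idea of comparing coding and non-coding base composition models can be sketched with a small Markov chain classifier. This is an illustration under stated assumptions, not the CI measure itself. It uses one frame-independent coding table instead of the per-frame tables c1 to c6, estimates P(base | preceding 3 bases) from counts with a pseudocount, and applies Bayes' rule with an assumed prior. Input is assumed to be uppercase A/C/G/T, and the training sequences are made up.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"
K = 3  # context length, as in Pc(T|AAA)

def train(seqs, pseudocount=1.0):
    """Estimate P(base | previous K bases) from (K+1)-mer counts."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - K):
            counts[s[i:i + K + 1]] += 1
    model = {}
    for ctx in map("".join, product(BASES, repeat=K)):
        total = sum(counts[ctx + b] for b in BASES) + 4 * pseudocount
        for b in BASES:
            model[ctx + b] = (counts[ctx + b] + pseudocount) / total
    return model

def log_likelihood(seq, model):
    """Sum of log conditional probabilities along the sequence."""
    return sum(math.log(model[seq[i:i + K + 1]])
               for i in range(len(seq) - K))

def posterior_coding(seq, coding, noncoding, prior=0.5):
    """P(coding | seq) by Bayes' rule, computed from the log odds."""
    log_odds = (log_likelihood(seq, coding)
                - log_likelihood(seq, noncoding)
                + math.log(prior / (1 - prior)))
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)  # avoids overflow for very negative log odds
    return e / (1 + e)

# Hypothetical training sets standing in for CDS and NCDS sequences.
cds_model = train(["GCC" * 10])
ncds_model = train(["AT" * 15])
p_gc = posterior_coding("GCCGCCGCCGCC", cds_model, ncds_model)
p_at = posterior_coding("ATATATATATAT", cds_model, ncds_model)
```

A frame-aware version would train one table per codon position and strand and take the best-scoring frame, which is presumably what the separate feature tables are for.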