GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES

Shin-Han Shiu, Plant Biology / QBMI, Michigan State University

Genomes and gene contents • Gene counts shown in the figure (species labels not recoverable): 17,000; 6,000; 45,000; 10,000; 30,000; 25,000

Duplicate genes in the genome • Arabidopsis gene families* *: Families are clusters from Markov clustering, using all-against-all BLAST E-values as the distance measure.

Gene function and duplication • What’s the consequence?

Focus I: Duplication Mechanism and Loss Rate • Gene duplications: mechanisms, preferential retention, consequences

Duplication mechanisms • Whole genome duplication (WGD) • Tandem duplication (TD) • Segmental duplication (SD) • Replicative transposition (RT)

Lineage-specific gains in plants and animals • Substantially more recent duplicates in plants than in animals • Mostly due to frequent whole genome duplications in plants *: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively). **: Numbers in parentheses are percentages of the total, based on normalized gains.
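
The footnote on normalization can be illustrated with a minimal sketch. It assumes that "normalized against the ratio" means dividing plant gain counts by 150/100, so that both comparisons cover the same time span; the slide does not spell out the direction of the scaling, and the function name and input value are hypothetical.

```python
# Divergence times quoted on the slide, in million years ago (Mya).
ARABIDOPSIS_RICE_MYA = 150
HUMAN_MOUSE_MYA = 100


def normalize_plant_gain(raw_gain: float) -> float:
    """Scale a plant lineage-specific gain count to the human-mouse time span.

    Assumption (not stated on the slide): plant gains are divided by the
    ratio 150/100 so they are comparable with animal gains.
    """
    ratio = ARABIDOPSIS_RICE_MYA / HUMAN_MOUSE_MYA
    return raw_gain / ratio


# Toy input value, not a number from the talk.
normalized = normalize_plant_gain(3000)
print(normalized)
```

Under this reading, a plant lineage needs 1.5 times as many raw gains as an animal lineage to reach the same normalized value.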

Gain vs. Loss • 3 rounds of whole-genome duplication in the Arabidopsis lineage • ~82% of duplicates from the last round were lost in the past 40 million years • Doubling series shown: 15,000* → 30,000 → 60,000 → 120,000 • Arabidopsis gene content: 21,000** = genome duplications + tandem duplications - gene losses *: Number of orthologous groups in shared families between Arabidopsis and rice. **: Number of genes in shared families.
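
The arithmetic behind the gain-versus-loss numbers can be sketched as follows. The 15,000 and 21,000 figures are taken from the slide; treating the 15,000 orthologous groups as the pre-duplication gene count is an assumption made here for illustration, and the function and variable names are hypothetical.

```python
def gene_count_after_wgd(ancestral: int, rounds: int) -> int:
    """Gene count if every gene survived each whole-genome duplication."""
    return ancestral * 2 ** rounds


ancestral_groups = 15_000  # orthologous groups shared with rice (slide footnote *)
observed_genes = 21_000    # Arabidopsis genes in shared families (slide footnote **)

# Expected counts after 0, 1, 2 and 3 rounds of duplication with no loss.
trajectory = [gene_count_after_wgd(ancestral_groups, r) for r in range(4)]

# Fraction of the no-loss expectation that is actually observed today.
net_retained = observed_genes / trajectory[-1]

print(trajectory)
print(round(net_retained, 3))
```

This is a net figure over all three rounds and it also absorbs tandem gains, so it is not the same quantity as the slide's loss estimate for the last round alone.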

“Age” distribution of animal duplicates • Steady decay in the number of duplicates • Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT) • Ks: rate of nucleotide substitutions at codon sites that do not affect amino acid identity (synonymous substitutions) Shiu et al., 2006

Plant duplicate “age” distribution • Apparent peak at Ks ~0.18 instead of zero • Frequent WGD, TD, SD (maybe), and RT (in some plants) Shiu et al., 2004

Genome remodeling in polyploids • Natural and synthetic polyploids • Figure labels: genome sizes ~314 Mb, ~203 Mb, ~257 Mb, ~348 Mb; 20,000 yr

Experimental approaches • Genome-wide polymorphism monitored by tiling array • Array with ~6 million features • Figure labels: genome, tiled probes, resolution, gap; 20,000 yr

Genome-wide Single Feature Polymorphism • Mid-parent (MP) vs. Arabidopsis suecica (As)

Genome-wide Single Feature Polymorphism • Genome-wide polymorphism monitored by tiling array • Feature types shown: gene, pseudogene, transposon

Genome-wide Single Feature Polymorphism • Duplication or deletion: MP duplication or As deletion

Genome Survey Sequencing • Sequence ~40-60 Mb of the Arabidopsis suecica genome • 0.15-0.2X coverage, will be done next week! • Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant • Ultra-high throughput: 20-30 Mb per run, 5 hours per run; will be 100 Mb per run in early 2007 • Cost efficient: ~$0.3/kb • Read length rather limited: ~100 bp per read now; will be ~200 bp in early 2007 • For more information contact: Andreas Weber (aweber@msu.edu), David DeWitt (dewittd@msu.edu), or Shin-Han Shiu (shius@msu.edu) • Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS

Summary: Gene duplication and polyploidy • Gene duplication occurs frequently in eukaryotes, but most duplicates are lost. • In plants, whole genome duplication is common, but gene loss is also frequent. • After 4 generations, only a very small number of SFPs are identified in synthetic polyploids. • After 20,000 generations, most coding genes do not have the clustered sequence polymorphisms that are indicative of deletion. • Clustered polymorphisms are mostly located in pseudogenes and transposons. • Survey sequencing is necessary to determine whether some coding genes have become pseudogenes without being deleted.

Focus II: Differential Retention of Duplicates • Gene duplications: mechanisms, preferential retention, consequences

Duplicate genes in the genome • Arabidopsis gene families* *: Families are clusters from Markov clustering, using all-against-all BLAST E-values as the distance measure.

Large gene families in plants • One of the largest gene families

Normalized gain: % expanded OGs • Large family sizes do not necessarily indicate higher expansion rates

Ancestral family sizes and gene gains • Large ancestral families tend to have more lineage-specific gains, but with many exceptions

Differential expansion of functional categories (GO: Gene Ontology) • Protein ubiquitination • Polysaccharide biosynthesis • Cell wall modification • Transcriptional regulation • Biotic stress response • Secondary metabolism

Differences in Duplicability • Duplicability: the propensity for a duplicate gene to be retained • Computational analysis of genome-wide trends

Kinase superfamily sizes among eukaryotes Shiu & Bleecker, 2003

Kinase families in rice and Arabidopsis • Gene count differences among families indicate differential expansion Shiu et al., 2004

Estimation of ancestral RLK family size • Kinase phylogeny of Arabidopsis and rice RLKs • 440 speciation points • Figure panels A and B (labels: WAK; LRR VIII, X, XII; rice; Arabidopsis) Shiu et al., 2004

Development vs. resistance/defense RLKs Shiu et al., 2004

Contradiction • Resistance/defense RLKs: high duplicability • Developmental RLKs: low duplicability • Animal tyrosine kinases: low duplicability • Transcription factors: high duplicability • Plant genes involved in development tend to have high duplicability

Selection for expansion • Depends on the level of variation in the signals

Longevity and duplicability of plant genes • Summary: differential retention • High duplicability, high longevity: transcription factors • High duplicability, low longevity: resistance genes • Low duplicability, high longevity: enzymes in central metabolic pathways • Low duplicability, low longevity: ??

Focus III: Functional Consequences • Gene duplications: mechanisms, preferential retention, consequences

Functional Consequences of Duplication • Functional divergence and conservation • Is it because of changes in cis-regulatory elements or in coding sequences? • How are duplicates retained: subfunctionalization or neofunctionalization?

Divergence in gene expression • Workflow components shown in the figure: expression data; clusters of genes with similar expression profiles; over-represented sequence motifs in 5’ regions; cis-regulatory logic; machine learning; experimental validations; motif functional prediction • Develop pipelines for cis-element prediction and
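
One step of such a workflow, testing whether a candidate motif is over-represented in the 5’ regions of a co-expression cluster, can be sketched with a hypergeometric test. This is a common generic approach and not necessarily the method used in the talk; the function names, gene IDs, and sequences below are all hypothetical toy values.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def motif_enrichment(motif: str, cluster: dict, background: dict) -> float:
    """P-value that `motif` occurs in at least as many cluster promoters as observed.

    `cluster` and `background` map gene IDs to 5' sequences; the cluster
    genes are assumed to be a subset of the background genes.
    """
    N = len(background)                                 # all promoters
    K = sum(motif in s for s in background.values())    # promoters with the motif
    n = len(cluster)                                    # promoters in the cluster
    k = sum(motif in s for s in cluster.values())       # cluster promoters with the motif
    return hypergeom_sf(k, N, K, n)


# Toy promoters, invented for illustration only.
background = {
    "g1": "TTACGTCATT", "g2": "GGACGTCAGG", "g3": "TTTTTTTTTT",
    "g4": "GGGGGGGGGG", "g5": "CCCCCCCCCC", "g6": "ATATATATAT",
}
cluster = {g: background[g] for g in ("g1", "g2")}
p_value = motif_enrichment("ACGTCA", cluster, background)
print(p_value)
```

A real pipeline would also scan the reverse strand, allow degenerate motifs, and correct for testing many motifs; those steps are left out of this sketch.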

Divergence in post-translational modification • Conservation of phosphorylation sites across species • SACE: budding yeast • CAGL: Candida glabrata • CAAL: Candida albicans • CATR: Candida tropicalis • NECR: Neurospora crassa • DEHA: Debaryomyces hansenii

Detailed Functional Studies of Duplicate Genes • Functional analyses of DDF1 and DDF2 transcription factors • Derived from a recent whole genome duplication in Arabidopsis • Related to the well-known CBF factors involved in cold and drought stress • Approaches for the DDFs in Arabidopsis thaliana and Arabidopsis lyrata: promoter GFP, knockouts, over-expression studies, binding targets, interacting proteins

Focus IV: Protein space • Gene duplications: mechanisms, preferential retention, consequences

Tiling array analysis of transcriptome • Human Chr 21, 22 Kapranov et al., 2002

Posterior probability p(F|coding)

Performance of the CI measure • Known Arabidopsis exons and introns, 90-300 bp • Arabidopsis small proteins that are not annotated • Correctly predicts 19 out of 20 (95%) • Yeast sORFs with translation evidence • Correctly predicts 98 out of 114 (86%) • In “intergenic” sequences of the Arabidopsis genome • 3,274 sORFs identified

Coupling with tiling array expression • Hybridization intensities for feature types

Summary: Novel coding genes • Many unannotated regions in the genomes are expressed. • Using the CI measure, many yeast and Arabidopsis proteins that were not annotated but have evidence of expression are identified correctly. • Using the CI measure, we estimated that ~3,000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome. • Using tiling array data, we found that many of these novel coding regions are expressed.

Acknowledgement • Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode • University of Chicago: Justin Borevitz, Xu Zhang • University of Wisconsin: Sara Patterson, Rick Vierstra • University of Missouri: Scott Peck • Michigan State University: Many…; Rong Jin, Comp Sci & Eng; Yue-Hua Cui, Stat & Prob • Startup fund

Recent completion …

Genome remodeling in polyploids • Genome duplication occurs frequently in plants • What is the fate of duplicates? • How fast do gene losses occur? • Is there any preference in genes retained? • Figure: genes A-E are duplicated into copies A1-E1 and A2-E2, and some copies are then lost by times t1 and t2; gene number Ng = 5, 10, 8, 5

Comparing degrees of expansion • Arabidopsis: ~25,000 proteins • Rice prediction: ~66,000 genes • Workflow components shown in the figure: combined set; gene/domain families (unique or shared; example label GO:0001); pairwise distance; putative orthologous groups; all orthologous groups • Example counts: ui = 1, ei = 4 • Total unexpanded = Σ ui • Total expanded = Σ ei
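
The tallies on this slide can be illustrated with a minimal sketch. It assumes that ui and ei are the numbers of unexpanded and expanded orthologous groups (OGs) in family i, and that an OG counts as expanded when the species has more than one gene in it; the slide does not define these terms explicitly, and the family names and OG sizes are toy values.

```python
def classify_groups(og_sizes):
    """Return (unexpanded, expanded) OG counts for one family in one species.

    `og_sizes` lists how many genes the species has in each OG of the family.
    Assumption: size 1 means unexpanded, size > 1 means expanded.
    """
    unexpanded = sum(1 for size in og_sizes if size == 1)
    expanded = sum(1 for size in og_sizes if size > 1)
    return unexpanded, expanded


# Toy families; family_A is built to reproduce the slide's ui = 1, ei = 4.
families = {
    "family_A": [1, 2, 3, 2, 5],
    "family_B": [1, 1, 1, 2],
}

per_family = {name: classify_groups(sizes) for name, sizes in families.items()}
total_unexpanded = sum(u for u, _ in per_family.values())  # sum of ui
total_expanded = sum(e for _, e in per_family.values())    # sum of ei
percent_expanded = {
    name: 100 * e / (u + e) for name, (u, e) in per_family.items()
}
print(per_family, total_unexpanded, total_expanded, percent_expanded)
```

Expressing expansion as a percentage of OGs, rather than as a raw gene count, is what keeps a large family from looking highly expanded merely because it started out large.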

Major questions on gene duplication • When: timing of gene duplications, e.g. N = 10

Domain gains in rice and Arabidopsis • Gain in one lineage does not necessarily predict gain in the other

Identify novel small coding genes • Determine base composition probabilities from coding sequences (CDS parameters) and non-coding sequences (NCDS parameters): Pc(AAA) = (# of AAA) / (# of all NNN); Pc(T|AAA) = Pc(AAAT) / Pc(AAA) • Feature tables: c1, c2, c3, c4, c5, c6, n • Calculate posterior probability
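
The idea of scoring a sequence under coding and non-coding composition models can be sketched as below. This is a simplified illustration, not the CI measure itself: it uses one coding table instead of the frame-specific tables (c1 to c6) listed on the slide, adds add-one smoothing, and assumes an equal prior. All function names and training sequences are hypothetical.

```python
from collections import Counter
from math import exp, log

BASES = "ACGT"


def train(seqs):
    """Estimate P(next base | preceding trinucleotide) with add-one smoothing."""
    tri, tetra = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - 3):
            tri[s[i:i + 3]] += 1     # trinucleotide that has a following base
            tetra[s[i:i + 4]] += 1   # the same trinucleotide plus that base

    def cond(context, base):
        return (tetra[context + base] + 1) / (tri[context] + len(BASES))

    return cond


def log_likelihood(seq, cond):
    """Sum of log P(base | previous three bases) along the sequence."""
    return sum(log(cond(seq[i:i + 3], seq[i + 3])) for i in range(len(seq) - 3))


def posterior_coding(seq, cond_coding, cond_noncoding, prior=0.5):
    """Posterior probability that `seq` is coding, by Bayes' rule."""
    lc = log_likelihood(seq, cond_coding) + log(prior)
    ln = log_likelihood(seq, cond_noncoding) + log(1 - prior)
    m = max(lc, ln)  # subtract the maximum to avoid underflow
    return exp(lc - m) / (exp(lc - m) + exp(ln - m))


# Toy training sequences, invented for illustration only.
coding_model = train(["ATGGCTGCTGCTGCTGCTTAA"])
noncoding_model = train(["ATATATATATATATATATATA"])

p_coding_like = posterior_coding("GCTGCTGCTGCT", coding_model, noncoding_model)
p_noncoding_like = posterior_coding("ATATATATATAT", coding_model, noncoding_model)
print(p_coding_like > 0.5, p_noncoding_like < 0.5)
```

With real data the two models would be trained on annotated coding and non-coding sequences, and the prior would reflect how rare coding sequence is in the regions being scanned.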
