The gene family play and the chromosomal theater. Todd Vision Department of Biology University of North Carolina at Chapel Hill. Outline. Large-scale duplication and loss of genes in the angiosperms Looking into the future of plant phylogenomics A case study in gene family demography
The gene family play and the chromosomal theater Todd Vision Department of Biology University of North Carolina at Chapel Hill
Outline • Large-scale duplication and loss of genes in the angiosperms • Looking into the future of plant phylogenomics • A case study in gene family demography • Duplication and functional divergence
Arabidopsis as a hub for plant comparative maps (data from Arumuganathan & Earle (1991) Plant Mol Biol Rep 9:208-218)
Tomato-Arabidopsis synteny Bancroft (2001) TIG 17, 89 after Ku et al (2000) PNAS 97, 9121
Modes of gene duplication
• Tandem (T)
  • unequal crossing-over
  • mostly young
• Dispersed (D)
  • transposition
  • all ages
• Segmental (S)
  • polyploidy
  • all old
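The three modes can be turned into a rule for labeling a pair of paralogs. The sketch below is one possible rule, not the procedure used in the talk: the `Gene` record, the `max_gap` threshold and the block identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    index: int                   # ordinal position of the gene along its chromosome
    block: Optional[str] = None  # ID of the duplicated block containing it, if any


def duplication_mode(a: Gene, b: Gene, max_gap: int = 10) -> str:
    """Return 'T' (tandem), 'S' (segmental) or 'D' (dispersed) for a paralog pair."""
    # Tandem: close neighbours on the same chromosome (threshold is an assumption).
    if a.chrom == b.chrom and abs(a.index - b.index) <= max_gap:
        return "T"
    # Segmental: both copies were assigned to the same duplicated block.
    if a.block is not None and a.block == b.block:
        return "S"
    # Everything else is treated as dispersed.
    return "D"
```

Checking the tandem rule first means that neighbouring copies inside a duplicated block are still called tandem; a different ordering of the tests would be equally defensible.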
Paleotetraploidy? The Arabidopsis Genome Initiative. 2000. Nature 408:796
Distribution of dA (figure: pairs in blocks vs. pairs not in blocks)
Problems
• proteins diverge at different rates
• high dA is difficult to estimate
Solution
• average dA within blocks
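Averaging dA within blocks amounts to grouping the duplicated pairs by block and taking the mean distance of each group. A minimal sketch, assuming each pair is supplied as a hypothetical `(block_id, dA)` tuple:

```python
from collections import defaultdict
from statistics import mean


def mean_dA_per_block(pairs):
    """Average dA over the duplicated gene pairs in each block.

    pairs: iterable of (block_id, dA) tuples (an assumed input layout).
    Returns a dict mapping block_id to the mean dA of its pairs.
    """
    by_block = defaultdict(list)
    for block_id, d in pairs:
        by_block[block_id].append(d)
    return {block_id: mean(values) for block_id, values in by_block.items()}
```

The block mean smooths over rate differences between individual proteins, so blocks with many pairs give a more stable age estimate than any single pair.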
Discrete duplication events (figure: block age classes A-F on a 0-200 Mya timeline; Rosids (Arabidopsis); divergence from Asterids (tomato) 110-160 Mya; divergence from monocots (rice) 160-240 Mya)
The 2-4 complex (one ancestral segment broken up by 4 large inversions)
coefficient of variation = 0.67 coefficient of variation = 0.53
Rice-Arabidopsis microsynteny Mayer et al. (2001) Genome Res. 11, 1167
(Figure: Arabidopsis and rice segments; labeled event: duplication)
Block 37: after Asterid-Rosid split. Block 57: before monocot-dicot divergence. Raes, Vandepoele, Saeys, Simillion, Van de Peer (2003) J. Struct. Func. Genomics 3, 117-129
Divergence among duplicated genes in rice Goff et al. (2002) Science 296: 92
Hidden syntenies Simillion, Vandepoele, Van Montagu, Zabeau, Van de Peer (2002) PNAS 99, 13627
Interspecies comparison can reveal hidden syntenies Vandepoele, Simillion, Van de Peer (2002) TIG 18, 606-608
Major plant genome datasets
Family        Genus / species                  genome / EST / map
Aizoaceae     Mesembryanthemum crystallinum    X
Brassicaceae  Arabidopsis thaliana             X X X
              Brassica spp.                    X
Fabaceae      Glycine max                      X X
              Medicago truncatula              X X
              Phaseolus spp.                   X
Malvaceae     Gossypium arboreum               X X
Solanaceae    Capsicum annuum                  X
              Lycopersicon esculentum          X X
              Solanum tuberosum                X X
Poaceae       Hordeum vulgare                  X X
              Oryza sativa                     X X X
              Sorghum bicolor/propinquum       X X
              Triticum aestivum                X X
              Zea mays                         X X
Other         Beta vulgaris                    X
              Chlamydomonas reinhardtii        X X
              Pinus taeda                      X X
              Populus spp.                     X
              Prunus spp.                      X
Plant unigene datasets
species         TIGR     PlantGDB
barley          49885    74621
beet            na       13565
chlamydomonas   30296    na
citrus          na       4266
coffee          na       392
cotton          24350    27854
grape           49885    74621
iceplant        8455     8945
lettuce         21960    na
lotus           11025    na
maize           55063    71655
marchantia      na       1059
medicago        36976    43384
oat             na       361
onion           11726    na
pine            26882    24668
poplar          na       20935
potato          24275    24839
rice            60778    52156
rye             5199     5384
sorghum         33273    34363
soybean         67826    73946
sunflower       20520    na
tomato          31012    35725
wheat           109509   95949
+ Arabidopsis   27170
Plant phylogenomics: Phytome
• The goal is to integrate
  • Organismal phylogeny
  • Gene family
    • sequence
    • alignment
    • phylogeny
  • Genetic and physical maps
Some uses for Phytome
• Starting with a chromosome segment
  • Identify homologous segments
  • Predict unobserved gene content (candidate QTL)
• Starting with a gene family
  • Resolve orthology/paralogy relationships
  • Identify coevolving families
• Starting with a species
  • Explore lineage-specific diversification
  • Guide comparative mapping wet-work
Current pipeline (flowchart boxes): Protein sequence prediction; Homolog identification; Unigene collections; Protein family clustering; Annotations; Multiple sequence alignment; Phytome; Phylogenetic inference
Lineage-specific diversification (figure labels and counts: Arabidopsis 1033, 173, 436; Cotton 334, 836, 696; Medicago 715; Tomato 919; Rice)
• 152 genes are “single copy” in all four species
A tale of two sisters: the ARF and the Aux/IAA gene families • Modulate whole plant response to auxin • Interact via dimerization • ARFs are transcription factors • Aux/IAAs bind and repress ARFs in the absence of auxin
Why the different patterns of diversification?
• 12% (ARF) vs 40% (Aux/IAA) segmental duplications
• Presumably reflects differential retention
• Possible explanations
  • Dosage requirements
  • Coevolution with other interacting genes
  • Regional transcriptional regulation
Divergence of duplicated genes (plot axes: divergence in expression profile; age of duplication)
Duplicate pairs in yeast and human (Gu et al. 2002, Makova and Li 2003)
• Approx. 50% of pairs diverge very rapidly
• Proportion of divergent pairs increases with Ks and Ka
  • Plateaus at Ka ~0.3 in human
• In humans,
  • Immune response genes are over-represented among young, divergent pairs
  • Distantly related pairs with conserved expression tend to be either ubiquitous or very tissue specific
Retention of duplicated genes
• Nonfunctionalization, or loss of one copy
  • The fate of most pairs
• Neofunctionalization (NF)
  • Positive selection on a new mutation can maintain the pair
• Subfunctionalization (SF)
  • Mutations that increase the specificity of duplicates can fix by drift, provided that the two copies together supply the functionality of the ancestral gene. Once SF happens, both copies are indispensable and are retained.
  • One prediction of the model is that SF is more likely for tandem than for dispersed pairs (due to linkage)
Digital expression profiling
• Massively Parallel Signature Sequencing (MPSS)
  • Count occurrence of 17-20 bp mRNA signatures
  • Cloning and sequencing are done on microbeads
  • Similar to Serial Analysis of Gene Expression (SAGE)
• “Bar-code” counting reduces concerns of
  • cross-hybridization
  • probe affinity
  • background hybridization
• Advantages
  • Accurate counts of low expression genes
  • Can distinguish expression profiles of duplicate genes
MPSS library construction (Brenner et al., PNAS 97:1665-70)
• Extract mRNA from tissue
• Convert to cDNA
• Add linker
• Cut w/ Sau3A (GATC)
• 3’ - Add unique 32 bp tag and standard primer; 5’ - Add standard primer (added by cloning)
• Anneal to beads coated with unique anti-tag (32 bp, complementary to tag on mRNA)
• PCR
• Remove 3’ primer and expose single stranded unique tag (digest, 3'→5' exonuclease)
MPSS library construction, continued (Brenner et al., PNAS 97:1665-70)
• Sort by FACS to remove ‘empty’ beads
• The result of the library construction is a set of microbeads. Each bead contains many DNA molecules, all derived from the 3’ end of a single transcript.
• Beads are loaded in a monolayer on a microscope slide for the sequencing of 17 – 20 bp from the 5’ end.
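Because Sau3A cuts at GATC and each bead carries a fragment from the 3’ end of a transcript, the signature expected from a known transcript can be predicted in silico. The sketch below assumes a fixed 17-base signature that begins at the 3’-most GATC site; the function name and the decision to return `None` for short fragments are illustrative choices, not part of the published protocol.

```python
def expected_signature(transcript, length=17, site="GATC"):
    """Predict the MPSS signature of a transcript (sense strand, 5' to 3').

    Assumes the signature is the `length` bases starting at the 3'-most
    occurrence of `site`. Returns None when the transcript has no site or
    when fewer than `length` bases remain after it.
    """
    seq = transcript.upper()
    pos = seq.rfind(site)  # index of the 3'-most site, or -1 if absent
    if pos == -1 or pos + length > len(seq):
        return None
    return seq[pos:pos + length]
```

For a toy transcript such as `"ATGCCGATCAAA" + "GATCAATCGGACTTGTC" + "TTTAAAA"`, the upstream GATC is ignored and the signature starting at the second site is returned.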
MPSS Sequencing (Brenner et al., Nat. Biotech. 18:630-4)
• Add adaptors: NNNX (CODEX1), NNXN (CODEX2), NXNN (CODEX3), XNNN (CODEX4), each with RS
• Sequence by hybridization: 16 cycles for 4 bp
• Digest with Type IIS enzyme to uncover next 4 bases
• Repeat cycle
• Steps of four bases; overhang is shifted by four bases in each round
MPSS Sequencing
Each bead provides a signature of 17-20 bp
Tag #     Signature Sequence    # of Beads (Frequency)
1         GATCAATCGGACTTGTC     2
2         GATCGTGCATCAGCAGT     53
3         GATCCGATACAGCTTTG     212
4         GATCTATGGGTATAGTC     349
5         GATCCATCGTTTGGTGC     417
6         GATCCCAGCAAGATAAC     561
7         GATCCTCCGTCTTCACA     672
8         GATCACTTCTCTCATTA     702
9         GATCTACCAGAACTCGG     814
. .       . .                   . .
30,285    GATCGGACCGATCGACT     2,935
Total # of tags: >1,000,000
Two sets of signatures are generated from each sample in different reading frames staggered by two bases
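The frequency column is simply a tally of beads per distinct signature. A minimal sketch of that tally, assuming the bead reads arrive as plain strings and that only full-length reads starting with GATC are kept (both filters are assumptions, not the vendor's processing rules):

```python
from collections import Counter


def count_signatures(bead_reads, length=17):
    """Tally how many beads yielded each signature.

    bead_reads: iterable of read strings, one per bead (assumed layout).
    Reads are upper-cased and truncated to `length`; reads that are too
    short or do not start with GATC are discarded.
    """
    counts = Counter()
    for read in bead_reads:
        sig = read.upper()[:length]
        if len(sig) == length and sig.startswith("GATC"):
            counts[sig] += 1
    return counts
```

`counts.most_common()` then lists signatures from most to least abundant, and `len(counts)` gives the number of distinct signatures in the library.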
Classifying signatures
Typical signatures (triangles refer to colors used on our web page):
• Class 0 - signatures found in the expression libraries but not the genome.
• Class 1 - in an exon, same strand as ORF.
• Class 2 - within 500 bp after stop codon, same strand as ORF.
• Class 3 - anti-sense of ORF (like Class 1, but on opposite strand).
• Class 4 - in genome but NOT class 1, 2, 3, 5 or 6.
• Class 5 - entirely within intron, same strand.
• Class 6 - entirely within intron, anti-sense.
• Grey = potential signature NOT expressed
Interpretations flagged in the figure:
• Duplicated: expression may be from other site in genome
• Potential alternative splicing or nested gene
• Potential alternative termination
• Anti-sense transcript or nested gene?
• Potential anti-sense transcript
• Potential un-annotated ORF
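The class definitions translate fairly directly into a decision rule for one signature relative to one gene model. The sketch below is an illustration only: the `GeneModel` and `Hit` records are hypothetical, coordinates are inclusive, any overlap with an exon counts as "in an exon", and the outermost exon boundary on the 3' side is treated as the stop codon (UTRs are ignored).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GeneModel:
    strand: str                   # '+' or '-'
    exons: List[Tuple[int, int]]  # inclusive (start, end), sorted by start


@dataclass
class Hit:
    start: int   # inclusive genome coordinates of the signature match
    end: int
    strand: str  # '+' or '-'


def classify(hit: Optional[Hit], gene: GeneModel, downstream: int = 500) -> int:
    """Return the signature class (0-6) of `hit` relative to `gene`."""
    if hit is None:
        return 0  # seen in the expression libraries but absent from the genome
    same = hit.strand == gene.strand
    gene_start, gene_end = gene.exons[0][0], gene.exons[-1][1]
    in_exon = any(hit.start <= e_end and hit.end >= e_start
                  for e_start, e_end in gene.exons)
    if in_exon:
        return 1 if same else 3
    if gene_start <= hit.start and hit.end <= gene_end:
        # Inside the gene span without touching an exon: entirely intronic.
        return 5 if same else 6
    if same:
        # Class 2: within `downstream` bp past the (assumed) stop position.
        if gene.strand == "+" and gene_end < hit.start <= gene_end + downstream:
            return 2
        if gene.strand == "-" and gene_start - downstream <= hit.end < gene_start:
            return 2
    return 4  # in the genome, but none of the other classes
```

A full annotation pass would run this against every nearby gene model and would also need the duplicated-signature check, which this sketch leaves out.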
Core Arabidopsis MPSS libraries sequenced by Lynx for Blake Meyers, U. of Delaware
Library    Signatures sequenced    Distinct signatures
Root       3,645,414               48,102
Shoot      2,885,229               53,396
Flower     1,791,460               37,754
Callus     1,963,474               40,903
Silique    2,018,785               38,503
TOTAL      12,304,362              133,377
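Since the five libraries differ in sequencing depth, raw bead counts are not directly comparable between tissues. A common remedy, assumed here rather than taken from the slides, is to rescale counts to transcripts per million (TPM) using the total number of signatures sequenced in each library.

```python
# Total signatures sequenced per library, copied from the core library table.
LIBRARY_SIZES = {
    "Root": 3_645_414,
    "Shoot": 2_885_229,
    "Flower": 1_791_460,
    "Callus": 1_963_474,
    "Silique": 2_018_785,
}


def to_tpm(counts, library):
    """Rescale raw signature counts from one library to transcripts per million.

    counts: dict mapping signature -> raw bead count (assumed layout).
    library: a key of LIBRARY_SIZES.
    """
    size = LIBRARY_SIZES[library]
    return {sig: n * 1_000_000 / size for sig, n in counts.items()}
```

After rescaling, a signature's values in Root and Flower can be compared even though Root was sequenced roughly twice as deeply.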
http://www.dbi.udel.edu/mpss • Query by • Sequence • Arabidopsis gene identifier • chromosomal position • BAC clone ID • MPSS signature • Library comparison • Site includes • Library and tissue information • FAQs and help pages
Genome-wide MPSS profile in Arabidopsis (figure: Chr. I to Chr. V)
• Of the 29,084 gene models, 17,849 match unambiguous, expressed class 1 and/or 2 signatures
Dataset of duplicate pairs
• Gene families of size two in Arabidopsis classified as
  • Dispersed (280)
  • Segmental (149)
  • Tandem (63)
• For each pair
  • Measure similarity/distance in expression profile
  • Estimate Ks and Ka
Expression distance (figure: expression profiles plotted on axes library 1, library 2, library 3)
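If each gene is a point whose coordinates are its abundances in the libraries, the expression distance between two duplicates is a distance between two such points. The slides do not state which metric was used; the sketch below assumes Euclidean distance between profiles rescaled to relative abundances, so that the comparison reflects the shape of the profile and not the overall expression level.

```python
import math


def relative_profile(abundances):
    """Rescale a vector of per-library abundances so that it sums to 1."""
    total = sum(abundances)
    if total == 0:
        raise ValueError("gene is not expressed in any library")
    return [x / total for x in abundances]


def expression_distance(p, q):
    """Euclidean distance between two relative expression profiles.

    p, q: per-library abundances (e.g. TPM) for the two genes, listed in the
    same library order. The choice of metric is an assumption.
    """
    if len(p) != len(q):
        raise ValueError("profiles must cover the same libraries")
    a, b = relative_profile(p), relative_profile(q)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Two duplicates with proportional profiles, such as `[10, 20, 30]` and `[1, 2, 3]`, are at distance zero, while genes expressed in entirely different libraries are at the maximum distance of the square root of 2.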