Slowly approaching grass-specific gene diversification OR We need to fix those phylogenies first!
gene family: a set of divergent but functionally related genes that descend from the same ancestral gene. species A: 5 copies; species B: 15 copies.
retention of duplicated gene copies
mechanisms increasing gene copy number:
• tandem duplication
• segmental duplication
• whole genome duplication
duplicated copies are retained when:
• large quantities of a gene product are needed
• copies specialize for functions, locations, times
lineage-specific diversification: species A: 5 copies; species B: 15 copies [figure label: species 3 has 5 gene copies]
NBS-LRR resistance gene families
• in Arabidopsis: ~150-200 gene copies
• in rice: ~500-700 gene copies
• CC: coiled-coil domain
• NBS: nucleotide-binding site domain
• LRR: leucine-rich repeats
grasses are agronomically very important [figure labels: monocots, dicots, gymnosperms, mosses, ferns]
Research objectives
• I will search plant gene families for grass-specific expansions
• I will identify those containing known resistance genes or their interacting partners
• I will test for co-evolution of known resistance genes with their interacting partners
• I will determine whether co-evolution with resistance genes is a new means to identify interacting partners of these genes
Phytome
• protein-coding sequence data from 39 plant species
• 26,393 families with ≥ 2 members
• 307,492 singleton families
• related families and subfamilies
• multiple alignments and phylogenies
• motif and domain structure information
identifying grass-specific expansions
• counting genes per taxon is not sufficient!
• identify gene family phylogenies that contain many successive grass-specific internal nodes
• identify duplication and speciation nodes for each gene family
• label duplication nodes with grass-specific nodes
identify successive grass-specific nodes
in practice: a Perl script
• accesses the Phytome database
• takes every tree stored in Phytome
• compares it to the species tree and labels each internal node with the common ancestor of all its descendant leaf nodes
[figure labels: species tree, gene tree]
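The labeling step can be sketched in a few lines. This is an illustrative Python sketch, not the Perl script from the slide: the nested-tuple tree encoding, the `species|gene` leaf format, the toy trees, and all function names are assumptions made for the example.

```python
# Trees are nested tuples; gene-tree leaves are "species|gene" strings (assumed format).
SPECIES_TREE = (("rice", "maize"), "arabidopsis")
# Hypothetical display names for the internal nodes of the toy species tree.
NODE_NAMES = {("rice", "maize"): "grasses", SPECIES_TREE: "angiosperms"}
GENE_TREE = ((("rice|g1", "maize|g1"), ("rice|g2", "maize|g2")), "arabidopsis|g1")


def parent_map(tree, parent=None, parents=None):
    """Map every node of the species tree to its parent (the root maps to None)."""
    if parents is None:
        parents = {}
    parents[tree] = parent
    if isinstance(tree, tuple):
        for child in tree:
            parent_map(child, tree, parents)
    return parents


def ancestors(node, parents):
    """Return the path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def lca(a, b, parents):
    """Return the lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parents))
    for node in ancestors(b, parents):
        if node in seen:
            return node


def label_nodes(gene_tree, parents, labels):
    """Label each internal gene-tree node with the common ancestor of its leaves."""
    if isinstance(gene_tree, str):
        return gene_tree.split("|")[0]  # a leaf maps to its own species
    mapped = [label_nodes(child, parents, labels) for child in gene_tree]
    node = mapped[0]
    for other in mapped[1:]:
        node = lca(node, other, parents)
    labels[gene_tree] = NODE_NAMES.get(node, node)
    return node


PARENTS = parent_map(SPECIES_TREE)
LABELS = {}
label_nodes(GENE_TREE, PARENTS, LABELS)
```

In this toy gene tree, the two rice/maize pairs and the node joining them all receive the label "grasses", which is the kind of run of successive grass-specific internal nodes the slide is looking for.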
identify duplication and speciation nodes [figure labels: speciation nodes, duplication nodes]
SDI: speciation duplication inference (Zmasek & Eddy 2001, Bioinformatics)
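A minimal sketch of the rule commonly used for this kind of inference: map each gene-tree node to the smallest species-tree clade that contains all of its species, and call the node a duplication if one of its children maps to the same clade. This is a simplified illustration on a toy tree, not the SDI program itself; the tree encoding, the `species|gene` leaf format, and the function names are assumptions.

```python
SPECIES_TREE = (("rice", "maize"), "arabidopsis")
GRASSES = frozenset({"rice", "maize"})
GENE_TREE = ((("rice|g1", "maize|g1"), ("rice|g2", "maize|g2")), "arabidopsis|g1")


def clades(tree):
    """Return (leaf set of tree, list of every clade in tree) for a nested-tuple tree."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found


def smallest_clade(species, all_clades):
    """Return the smallest species-tree clade that contains all given species."""
    return min((c for c in all_clades if species <= c), key=len)


def infer_events(gene_tree, all_clades, events):
    """Record (event, mapped clade) for every internal gene-tree node."""
    if isinstance(gene_tree, str):
        species = frozenset([gene_tree.split("|")[0]])
        return species, smallest_clade(species, all_clades)
    species, child_maps = frozenset(), []
    for child in gene_tree:
        child_species, child_map = infer_events(child, all_clades, events)
        species |= child_species
        child_maps.append(child_map)
    mapped = smallest_clade(species, all_clades)
    # A node that maps to the same clade as one of its children is a duplication.
    event = "duplication" if mapped in child_maps else "speciation"
    events[gene_tree] = (event, mapped)
    return species, mapped


_, ALL_CLADES = clades(SPECIES_TREE)
EVENTS = {}
infer_events(GENE_TREE, ALL_CLADES, EVENTS)

# Duplications whose mapped clade lies entirely within the grasses.
grass_duplications = [
    node for node, (event, clade) in EVENTS.items()
    if event == "duplication" and clade <= GRASSES
]
```

Here the node joining the two rice/maize pairs is inferred as a grass-specific duplication, while the root and the two rice/maize pairs themselves are speciation nodes.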
required: labeled duplication/speciation nodes. PROBLEM FOR SDI: UNRESOLVED GENE TREES!
required: accurate gene phylogenies. PROBLEM FOR DISTANCE METHODS: NO OVERLAP OF PARTIAL SEQUENCES!
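To see why missing overlap breaks distance methods, consider a simple p-distance (the fraction of differing residues over the columns where both sequences are present). The sketch below is illustrative only; the sequences and the function name are made up for the example.

```python
def p_distance(a, b, gap="-"):
    """Fraction of mismatches over columns where both sequences have a residue.

    Returns None when the two sequences share no column, because the
    distance is then undefined.
    """
    shared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


# Made-up example: one full-length sequence and two non-overlapping fragments.
full = "MKTAYIAKQRQIS"
left = "MKTAYIA------"
right = "-------KQRQIS"

d_full_left = p_distance(full, left)
d_left_right = p_distance(left, right)
```

The two fragments each have a well-defined distance to the full-length sequence, but no distance to each other, so a complete pairwise distance matrix cannot be filled in.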
digressing from grass-specific expansion: How can we generate phylogenies from these “partial sequence alignments”?
• required for the grass-specific expansion project
• important for Phytome
• necessary for anyone using EST data for phylogenetic analysis
How can we generate correct phylogenies from “partial sequence alignments”?
1. can’t directly compute a single distance matrix with all sequences
2. divide alignment into sub-sections, compute separate pairwise distance matrices: matrixA, matrixB
3. combine these into one single distance matrix, use it for phylogenetic reconstruction
GOAL: define columns and sequences for sub-matrices
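The slide does not say how the sub-matrices are combined. As one possible reading, the sketch below averages each pair's distance over all sub-matrices in which that pair occurs. The averaging rule, the dictionary representation, and the distance values are assumptions made purely for illustration.

```python
def combine(matrices):
    """Average each pair's distance over the sub-matrices that contain the pair."""
    sums, counts = {}, {}
    for matrix in matrices:
        for pair, distance in matrix.items():
            sums[pair] = sums.get(pair, 0.0) + distance
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def pair(a, b):
    """Order-independent key for a pair of sequence names."""
    return frozenset((a, b))


# Made-up distances for two sub-sections of an alignment.
matrixA = {pair("seqA", "seqB"): 0.25, pair("seqA", "seqE"): 0.5, pair("seqB", "seqE"): 0.5}
matrixB = {pair("seqA", "seqB"): 0.75, pair("seqA", "seqC"): 0.5, pair("seqB", "seqC"): 0.25}

combined = combine([matrixA, matrixB])
```

Under this rule a pair that never appears together in any sub-matrix (here seqC and seqE) still has no entry in the combined matrix, so the choice of columns and sequences for the sub-matrices matters.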
The OverlapGraph
1. Sequence alignment
seqA XXXXXXXXXXXXX
seqB XXXXXXXXXXXXX
seqC -------XXXXXX
seqD ------XXXXXXX
seqE XXXXXXX------
seqF XXXXXX-------
2. Overlap matrix
3. Overlap graph
4. Find largest cliques (complete subgraphs)
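The four steps can be sketched on the alignment shown on the slide. This is an illustrative sketch, not the implementation behind the slides: the `MIN_OVERLAP` threshold is a hypothetical parameter, and clique finding uses a plain Bron-Kerbosch enumeration of maximal cliques.

```python
ALIGNMENT = {
    "seqA": "XXXXXXXXXXXXX",
    "seqB": "XXXXXXXXXXXXX",
    "seqC": "-------XXXXXX",
    "seqD": "------XXXXXXX",
    "seqE": "XXXXXXX------",
    "seqF": "XXXXXX-------",
}
MIN_OVERLAP = 5  # hypothetical threshold: shared columns needed for an edge


def overlap(a, b, gap="-"):
    """Number of alignment columns in which both sequences have a residue."""
    return sum(x != gap and y != gap for x, y in zip(a, b))


# Overlap graph: an edge joins two sequences that share enough columns.
GRAPH = {
    name: {
        other for other in ALIGNMENT
        if other != name
        and overlap(ALIGNMENT[name], ALIGNMENT[other]) >= MIN_OVERLAP
    }
    for name in ALIGNMENT
}


def maximal_cliques(graph, r=frozenset(), p=None, x=frozenset()):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    if p is None:
        p = frozenset(graph)
    if not p and not x:
        yield r
    for v in list(p):
        yield from maximal_cliques(graph, r | {v}, p & graph[v], x & graph[v])
        p = p - {v}
        x = x | {v}


CLIQUES = set(maximal_cliques(GRAPH))
```

With this threshold the alignment yields two cliques, {seqA, seqB, seqC, seqD} and {seqA, seqB, seqE, seqF}. Lowering `MIN_OVERLAP` to 1 adds a third clique, because seqD and seqE share a single column.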
problem: clique overlap
Clique A: 1, 2, 3, 4, 5, 7
Clique B: 1, 3, 4, 5, 6, 7
Clique C: 4, 5, 8, 12, 13
Clique D: 4, 5, 8, 9, 10, 11, 12, 13
[figure labels: alignment, overlap graph]
new strategy includes merging cliques
1. partial sequence alignment
2. generate OverlapGraph, find cliques
3. merge overlapping cliques
4. find connected components
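The slides do not give the merging criterion, so the following sketch shows only one possible reading of steps 3 and 4: link two cliques when the shared members make up a large fraction of the smaller clique, then take the connected components of that clique graph as the merged groups. The 0.8 threshold and the function name are assumptions; the cliques are the four listed on the previous slide.

```python
CLIQUES = [
    {1, 2, 3, 4, 5, 7},               # Clique A
    {1, 3, 4, 5, 6, 7},               # Clique B
    {4, 5, 8, 12, 13},                # Clique C
    {4, 5, 8, 9, 10, 11, 12, 13},     # Clique D
]


def merge_cliques(cliques, threshold=0.8):
    """Merge cliques linked by strong overlap; return one member set per component."""
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            shared = len(cliques[i] & cliques[j])
            smaller = min(len(cliques[i]), len(cliques[j]))
            # Hypothetical rule: link cliques whose overlap covers most of the smaller one.
            if shared / smaller >= threshold:
                parent[find(i)] = find(j)

    merged = {}
    for i, clique in enumerate(cliques):
        merged.setdefault(find(i), set()).update(clique)
    return sorted(merged.values(), key=sorted)


MERGED = merge_cliques(CLIQUES)
```

Under this rule cliques A and B merge (5 of 6 members shared) and C merges into D (C is contained in D), while the two members common to all four cliques (4 and 5) are not enough to join the two groups.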
Validation
How can we test whether this method will really generate the best phylogeny possible? Use artificial data!
ROSE: Random model Of Sequence Evolution (Stoye et al. 1998, Bioinformatics)
input:
• root sequence
• tree topology
output:
• a family of related sequences, created from the root sequence by insertion, deletion and substitution (sequences with a known evolutionary history)
• a correct multiple alignment of these sequences
Validation
• vary numbers of sequences per alignment (e.g., two alternatives: 10 and 50 sequences)
• vary tree topologies (e.g., four alternatives: low resolution at deep nodes, low resolution at high nodes, no low resolution, imbalanced tree)
• vary alignment lengths (e.g., two alternatives: 50 and 200 aa)
• vary average branch lengths/distances (two different mutation probabilities)
• vary masks (e.g., five alternatives, based on deletion patterns of Phytome families)
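If every factor is crossed with every other, the example alternatives above define a full factorial design. The sketch below only enumerates the conditions; the labels for topologies, mutation probabilities, and masks are placeholders, since the slides do not give concrete values.

```python
from itertools import product

# Factor levels follow the example alternatives on the slide; labels marked
# "placeholder" are hypothetical names, not values from the study.
DESIGN = {
    "n_sequences": [10, 50],
    "topology": ["deep_unresolved", "high_unresolved", "resolved", "imbalanced"],  # placeholder labels
    "alignment_length_aa": [50, 200],
    "mutation_probability": ["low", "high"],  # placeholder labels
    "mask": ["mask_1", "mask_2", "mask_3", "mask_4", "mask_5"],  # placeholder labels
}

# One dictionary per simulated condition, e.g. {"n_sequences": 10, "topology": ...}.
CONDITIONS = [dict(zip(DESIGN, values)) for values in product(*DESIGN.values())]
```

With these example levels the design has 2 x 4 x 2 x 2 x 5 = 160 conditions, each of which would need its own simulated alignment and reconstructed tree.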
Actual results soon to follow!
gene tree vs. species tree [figure labels: species A, species B, species C]
gene tree vs. species tree [figure labels: species A, species B, species C; gene in species A, gene in species B, gene in species C; A, B, C]
what if gap boundaries aren’t so clear? what if some cliques are contained within others?
grass-specific diversification: patterns
• duplication event prior to diversification of the grass lineage
• duplication events after diversification of the grass lineage
• lineage-specific diversification of an orthologous ancestor
• lineage-specific genes