280 likes | 447 Views
www. PHYTOME .org a plant comparative genomics resource. Todd Vision, Jason Phillips, Dihui Lu, Stefanie Hartmann. Outline of today’s presentation. What kind of data is stored in Phytome - and how did we generate this data? How can you search Phytome?
E N D
www.PHYTOME.org a plant comparative genomics resource Todd Vision, Jason Phillips, Dihui Lu, Stefanie Hartmann
Outline of today’s presentation • What kind of data is stored in Phytome - and how did we generate this data? • How can you search Phytome? • What kind of results will Phytome give you?
Phytome integrates • organismal phylogeny • gene family information: sequences • alignments • phylogenies • genetic and physical maps
Phytome: applications • Starting with a gene family • resolve orthology/paralogy relationships • identify coevolving families • Starting with a species • explore lineage-specific diversification • guide comparative mapping bench-work • Starting with a chromosome segment • identify homologous segments • predict unobserved gene content (candidate QTL)
protein cDNA cDNA clone DNA pre-RNA mRNA data aquisition • EST - expressed sequence tags • are partial sequences of expressed genes • are error-prone, contain sequence or frame shift errors • are very useful for discovering new genes, provide data on gene expression, make up much of the sequence data • EST contig assemblies • contigs: continuous sequences of multiple overlapping ESTs • singletons: don’t match other ESTs in the dataset • sources • • TIGR, Plant GDB, NCBI, TAIR, Sputnik, Plant Genome Network; • • for each species, we used the source with the largest number of EST
protein sequence prediction • from EST contigs to peptide sequences: ESTwise • translate cDNA sequence (ESTs) in all reading frames • compare the translated DNA to a database of known proteins (Swiss-Prot, TrEMBL) • use this information for gene prediction/translation • correct frame shift errors based on the homology information protein TVKKAHFEKWGNIVDVDYFQHFGNIVDINIVIDKETGKKRGFAFVEFDDYDPVDKVVLQKQHQLNGKMVDV TVK++HF +WG + D DYF+ +G I I I+ D+ +GKKRGF FV FD +D VDK+V+QK H +NG +V TVKRSHFxQWGTLTDCDYFEQYGKIEVIEIMTDRGSGKKRGF!FVTFDGHDSVDKIVIQKYHTVNGHNxEV EST agaaactNctgacagtgttgctgaaggagaaagcgagaaagt2tgatggcgtggaagacatcagagcatgg ctaggataaggctcagaataaagatattattcaggggaaggt ttctagaactaatttaaaactagaaNat tgagcttgagagcgcttttagtaatagtacgtcactcgagct tactcctccgtgtctgacttgtccctat
protein family clustering(Tribe-MCL) input: • a set of proteins • BLAST-all vs. BLAST-all values method: • construct weighted graph • convert into Markov matrix • expansion repeat until matrix • inflation doesn’t change output: • clusters of related proteins: protein families
protein family clustering(Tribe-MCL) input: • a set of proteins • BLAST-all vs. BLAST-all values method: • construct weighted graph • convert into Markov matrix • expansion repeat until matrix • inflation doesn’t change output: • clusters of related proteins: protein families image taken from the MCL homepage: http://micans.org/mcl/
multiple sequence alignment tested program quality speed algorithm ClustalW + ++ progressive Mafft i ++ + iterative Mafft p ++ +++ progressive T-Coffee +++ memory! consistency-based/progressive Dialign +++ time! consistency based progressive sequence alignment: 1. generate pairwise distances from a multiple alignment 2. use distances to construct a guide tree 3. start by aligning the most similar sequences 4. progressively add more sequences to the existing alignment
multiple sequence alignment identification of homologous proteins, clustering these into a Phytome family, generation of a multiple sequence alignment identification of homologous sequence positions within the homologous proteins = of columns of amino acids that share a common ancestral amino acid
multiple sequence alignment 1. find columns that will be retained • remove columns with low average pairwise scores • remove columns with high percentage of gaps
multiple sequence alignment 1. find columns that will be retained • remove columns with low average pairwise scores • remove columns with high percentage of gaps 2. find sequences that will be retained • remove sequences with a high proportion of gaps within the retained columns • remove misaligned sequences (i.e., with a low overall score) 3. final check • are enough sequences left for a phylogeny?
phylogenetic inference generate distance matrix generate unrooted neighbor-joining tree midpoint-root the tree do molecular clock test PHYLIP ? TreePuzzle
1 2 3 4 5 6 1 2 3 4 5 6 1 2 1 2 3 4 5 6 1 2 1 2 3 1 2 3 4 1 2 3 4 5 6 7 8 9 10 defining subfamilies
webflow, overview search pages result pages
Lab meeting, Sept 13, 2004: Phytome demo Dihui - BLAST search ∑ a friend of mine is working with a plant called Lophopyrum elongatum (it's a weed, and it's salt-tolerant, and that's all I know about it). She just cloned a cDNA and want to find out more about it - what it does and which other genes in which other taxa it is related to. ∑ Though Lophoprum is not among the species represented in Phytome, I offered her to see if I can find out more about her gene. ∑ Best to use for this: the single BLAST search. ∑ Navigate to the single BLAST search and explain the page. Mention batch BLAST. ∑ paste the friend's sequence into the appropriate field ∑ MEYQGQQQHDQATTNRVDEYGNPVAGHGVGTGMGAHGGVGTGAAAGGHFQPTREEHKAGGILQRSGSSSSSSSSEDDGMGGRRKKGIKDKIKEKLPGGHGDQQQTAGTYGQQGHTGMAGTGGNYGQPGHTGMAGTDGTGEKKGIMDKIKEKLPGQH ∑ explain the results page ∑ view the best result: taes7111 from wheat ∑ go to the best scoring family: 1980 Stefanie - Unigene search ∑ http://www.ebi.ac.uk/interpro/IEntry?ac=IPR000167 ∑ search Phytome for InterproEntry 000167 ∑ look at the hvul1175 entry: ∑ The family and subfamily ID ∑ Interpro and Gene Ontology results, but only if the Unipeptide is an exemplar of its subfamily ∑ The species name ∑ A link to the primary source for this unigene sequence ∑ A list of related unigenes (from all sources) that contain common Genbank accession numbers in their assembly ∑ Predicted peptide sequence (available for download in FASTA format) Jason - "restrict by species" search ∑ You can search for families that do or do not contain members from particular species. Navigate to the "restrict by species" search and explain the page. ∑ The relationships among the species are displayed as a phylogenetic tree (NCBI taxonomy information) ∑ and you can select families to include or exclude using radio buttons to the right of each species name. ∑ If the default "either" is selected, Phytome will return a family regardless of whether there are members from that species. ∑ I'm interested in monocot gene families (Hordeum-barley to Allium-onion): want to exclude all other taxa, only use gene families with monocot members. NOTE: explain the difference between "include" monocots or "either" monocots: because species with small numbers of Unipeptides will necessarily lack members in most families, selecting "include" will return NO families! ∑ 119273 families were retrieved. Their family ID is shown ∑ click on family number 1980 Stefanie - family results page ∑ The "Family Information Page" includes o Related families if this family is part of a superfamily (?) o Hyperlinks to subfamilies (these will work if the "Subfamily" tab is selected). o A link to a list of family members excluded from the reduced alignment by REAP o A list of those species represented within the family (these will work if the with the default species tab) ∑ The tabs below allow one to view o A list of member Unipeptides, which can be sorted either by subfamily or by species, depending on which tab is selected. From these lists, you may select members to include in a multiple alignment and/or phylogeny. o InterPro and GO assignments for an examplar of each subfamily. o By selecting multiple Unipeptides and proceeding to the "Alignment Page", one can download a single filecontaining all the predicted peptide sequences (in FASTA format) as well as additional information such as the names used by the Unigene sources and the component Genbank accession numbers.
I = 53.6 2.82.01.2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 3 3 2 2 1 3 3 2 2 1 3 3 2 2 1 3 3 2 2 1 3 3 2 2 1 3 3 2 2 1 3 3 2 2 1 3 3 2 2 1 4 4 3 1 1 4 4 3 1 1 4 4 3 1 1 4 4 2 1 1 5 5 4 1 1 5 5 4 1 1 6 5 4 3 1 6 5 4 3 1 protein family clustering(Tribe-MCL)
...some numbers • almost 1 million EST contigs/singletons • ESTwise translation • 730,000 unigenes • BLAST all vs. BLAST all • 640,000 unigenes 110,000 singletons • to be clustered • into families
data aquisition • species tax_id common name NCBI PGDB PGN SPNK TIGR • Allium cepa 4679 onion X • Amborella trichopoda 13333 amborella X • Arabidopsis thaliana 3702 thale cress X • Avena sativa 4498 oat X • Beta vulgaris 161934 sugarbeet X • Brassica napus 3708 rape X • Capsicum annuum 4072 (orgnamental) pepper X • Ceratopteris richardii 49495 water sprite or indian fern X • Citrus sinensis 2711 orange X • Cryptomeria japonica 3369 Japanese cedar X • Cucumis sativus 3659 cucumber X • Cycas rumphii 58031 sago palm or seashore cycad X • Eschscholzia californica 3467 california poppy X • Glycine maxX 3847 soybean X • Gossypium hirsutum 3635 cotton (tetraploid) X • Helianthus annuus 4232 sunflower X • Hordeum vulgare 4513 barley X • Lactuca sativa 4236 lettuce X • Lotus corniculatus 47247 lotus X • Lycopersicon esculentum 4081 tomato X • Marchantia polymorpha 3197 marchantia X • Medicago truncatula 3880 barrel medic X • Mesembryanthemum crystallinum 3544 ice plant X • Nicotiana benthamiana 4100 wild tobacco X • Oryza sativa 4530 rice X • Physcomitrella patens 3218 Physcomitrella moss X • Pinus taeda 3352 loblolly pine X • Phaseolus coccineus 3886 scarlet runner bean X • Populus tremula x Populus tremuloides 47664 aspen X • Prunus persica 3760 peach X • Saccharum officinarum 4547 plume grass or sugar cane X • Secale cereale 4550 rye X • Solanum tuberosum 4113 potato X • Sorghum bicolor 4558 sorghum X • Stevia rebaudiana 55670 candyleaf X • Theobroma cacao 3641 cacao X • Triticum aestivum 4565 wheat X • Vitis vinifera 29760 wine grape X • Zea mays 4577 corn X • Zinnia elegans 34245 zinnia X
multiple sequence alignment tested program quality speed algorithm ClustalW + ++ progressive Mafft i ++ + iterative Mafft p ++ +++ progressive T-Coffee +++ memory! consistency-based/progressive Dialign +++ time! consistency based