Large scale genomes comparisons Practical sessions Fredj Tekaia Institut Pasteur tekaia@pasteur.fr EMBO Bioinformatic and Comparative Genome Analysis Course Stazione Zoologica Anton Dohrn, Naples, Italy May 7 - 19, 2012
Plan for the practical sessions
• Saccharomyces cerevisiae (SACE: 5863 protein sequences)
• Candida glabrata (CAGL: 5202 protein sequences)
• Zygosaccharomyces rouxii (ZYRO: 4991 protein sequences)
• data from ftp.ncbi.nlm.nih.gov/genomes/Fungi
For each proteome we will perform the following:

Data preparation:
- Transform the protein identifiers into simpler ones;
- Split the whole protein sequence database into single protein sequences;

Intra-species comparisons:
- Compare the proteome to itself, using blastp (with adequate options);
- Get for each protein its best significant match (presented in table form);
- Get for each protein all its significant matches (presented in table form);
- For each protein, calculate the number of its significant matches;

Interspecies comparisons:
- Perform all pair-wise proteome comparisons;
For each pair:
- Get for each protein its best significant hit in the other proteome;
- Get for each protein all its significant hits in the other proteome;
- For each protein, calculate the number of its significant matches;

Multiple comparisons:
- Extract all pairs of proteins that are Reciprocally Best Hits (Venn Diagram);

CIRCOS:
- Prepare a table describing the relationships between genomes, to be used with circos.
Plan for the practical sessions:
• Use 3 yeast species (SACE, CAGL, ZYRO).
• Data from ftp.ncbi.nlm.nih.gov/genomes/Fungi
• Prepare the protein sequence data in adequate fasta format (needs data transformation)
• Compare each proteome to itself (duplication - paralogs)
• Compare each proteome to another proteome (RBH - orthologs)
• Prepare a file for the visualization of protein similarities (circos).
This requires writing sh and perl (or other) scripts.
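The comparison steps of the plan can be listed as command lines before running anything. The sketch below only prints one blastp command per ordered pair of proteomes (self comparisons included). It assumes the BLAST+ blastp program, databases named G<CODE>.pep as in the notations of this course, tabular output (-outfmt 6), a placeholder e-value cutoff of 1e-9, and hypothetical output names such as SACE.CAGL.tab; adapt all of these to the options given in the PS document.

```shell
#!/bin/sh
# Print (do not run) one blastp command per ordered pair of proteomes.
# Assumptions: BLAST+ blastp, databases named G<CODE>.pep, and a
# placeholder e-value cutoff passed as the first argument.
list_comparisons() {
    evalue=$1
    shift
    for q in "$@"; do          # query proteome
        for d in "$@"; do      # database proteome (q = d is the self comparison)
            echo "blastp -query G$q.pep -db G$d.pep -evalue $evalue -outfmt 6 -out $q.$d.tab"
        done
    done
}

list_comparisons 1e-9 SACE CAGL ZYRO
```

With three proteomes this lists nine comparisons: three intra-species and six inter-species.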
Candida glabrata (CAGL): 13 chromosomes
Zygosaccharomyces rouxii (ZYRO): 7 chromosomes
SACE
-rw-r----- 1 tekaia staff  54600 Apr 25 11:54 NC_001133.faa
-rw-r----- 1 tekaia staff   5273 Apr 25 11:57 NC_001133.ptt
-rw-r----- 1 tekaia staff 233863 Apr 25 11:54 NC_001134.faa
-rw-r----- 1 tekaia staff  22362 Apr 25 11:57 NC_001134.ptt
-rw-r----- 1 tekaia staff  85632 Apr 25 11:54 NC_001135.faa
-rw-r----- 1 tekaia staff   9043 Apr 25 11:57 NC_001135.ptt
-rw-r----- 1 tekaia staff 436412 Apr 25 11:54 NC_001136.faa
-rw-r----- 1 tekaia staff  41636 Apr 25 11:57 NC_001136.ptt
-rw-r----- 1 tekaia staff 152613 Apr 25 11:54 NC_001137.faa
-rw-r----- 1 tekaia staff  15210 Apr 25 11:57 NC_001137.ptt
-rw-r----- 1 tekaia staff  71415 Apr 25 11:54 NC_001138.faa
-rw-r----- 1 tekaia staff   7115 Apr 25 11:57 NC_001138.ptt
-rw-r----- 1 tekaia staff 303249 Apr 25 11:54 NC_001139.faa
-rw-r----- 1 tekaia staff  28954 Apr 25 11:57 NC_001139.ptt
-rw-r----- 1 tekaia staff 156585 Apr 25 11:53 NC_001140.faa
-rw-r----- 1 tekaia staff  15544 Apr 25 11:57 NC_001140.ptt
-rw-r----- 1 tekaia staff 119694 Apr 25 11:53 NC_001141.faa
-rw-r----- 1 tekaia staff  11384 Apr 25 11:57 NC_001141.ptt
-rw-r----- 1 tekaia staff 213993 Apr 25 11:53 NC_001142.faa
-rw-r----- 1 tekaia staff  19732 Apr 25 11:57 NC_001142.ptt
-rw-r----- 1 tekaia staff 184175 Apr 25 11:53 NC_001143.faa
-rw-r----- 1 tekaia staff  17048 Apr 25 11:57 NC_001143.ptt
-rw-r----- 1 tekaia staff 302218 Apr 25 11:53 NC_001144.faa
-rw-r----- 1 tekaia staff  28180 Apr 25 11:57 NC_001144.ptt
-rw-r----- 1 tekaia staff 267545 Apr 25 11:53 NC_001145.faa
-rw-r----- 1 tekaia staff  25329 Apr 25 11:57 NC_001145.ptt
-rw-r----- 1 tekaia staff 223148 Apr 25 11:53 NC_001146.faa
-rw-r----- 1 tekaia staff  21558 Apr 25 11:57 NC_001146.ptt
-rw-r----- 1 tekaia staff 304338 Apr 25 11:53 NC_001147.faa
-rw-r----- 1 tekaia staff  29393 Apr 25 11:57 NC_001147.ptt
-rw-r----- 1 tekaia staff 266238 Apr 25 11:53 NC_001148.faa
-rw-r----- 1 tekaia staff  25450 Apr 25 11:57 NC_001148.ptt
A B C D …..
NC_001133.ptt
SACE S288c chromosome I, complete sequence. - 1..230218
94 proteins
Location      Strand  Length  PID        Gene  Synonym    Code  COG  Product
1807..2169    -       120     6319249    PAU8  YAL068C    -     -    Pau8p
2480..2707    +       75      33438754   -     YAL067W-A  -     -    hypothetical protein
7235..9016    -       593     6319250    SEO1  YAL067C    -     -    Seo1p
11565..11951  -       128     6319252    -     YAL065C    -     -    hypothetical protein
12046..12426  +       126     6319253    -     YAL064W-B  -     -    hypothetical protein
13363..13743  -       126     7839146    -     YAL064C-A  -     -    hypothetical protein
21566..21850  +       94      330443360  -     YAL064W    -     -    hypothetical protein
…..

NC_001133.faa
>gi|6319249|ref|NP_009332.1| Pau8p
MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAV
FNYGDFTTMLTGIAPDQVTRMITGVPWYSSRLKPAISSALSKDGIYTIAN
>gi|33438754|ref|NP_878038.1| hypothetical protein YAL067W-A
MPIIGVPRCLIKPFSVPVTFPFSVKKNIRILDLDPRTEAYCLSLNSVCFKRLPRRKYFHLLNSYNIKRVL
GVVYC
>gi|6319250|ref|NP_009333.1| Seo1p
MYSIVKEIIVDPYKRLKWGFIPVKRQVEDLPDDLNSTEIVTISNSIQSHETAENFITTTSEKDQLHFETS
SYSEHKDNVNVTRSYEYRDEADRPWWRFFDEQEYRINEKERSHNKWYSWFKQGTSFKEKKLLIKLDVLLA
FYSCIAYWVKYLDTVNINNAYVSGMKEDLGFQGNDLVHTQVMYTVGNIIFQLPFLIYLNKLPLNYVLPSL
DLCWSLLTVGAAYVNSVPHLKAIRFFIGAFEAPSYLAYQYLFGSFYKHDEMVRRSAFYYLGQYIGILSAG
GIQSAVYSSLNGVNGLEGWRWNFIIDAIVSVVVGLIGFYSLPGDPYNCYSIFLTDDEIRLARKRLKENQT
GKSDFETKVFDIKLWKTIFSDWKIYILTLWNIFCWNDSNVSSGAYLLWLKSLKRYSIPKLNQLSMITPGL
GMVYLMLTGIIADKLHSRWFAIIFTQVFNIIGNSILAAWDVAEGAKWFAFMLQCFGWAMAPVLYSWQNDI
CRRDAQTRAITLVTMNIMAQSSTAWISVLVWKTEEAPRYLKGFTFTACSAFCLSIWTFVVLYFYKRDERN
NAKKNGIVLYNSKHGVEKPTSKDVETLSVSDEK
>gi|6319252|ref|NP_009335.1| hypothetical protein YAL065C
MNSATSETTTNTGAAETTTSTGAAETKTVVTSSISRFNHAETQTASATDVIGHSSSVVSVSETGNTKSLI
TSGLSTMSQQPRSTPASSIIGSSTASLEISTYVGIANGLLTNNGISVFISTVLLAIVW
……….
Final sequence format

NC_001133.ptt
SACE S288c chromosome I, complete sequence. - 1..230218
94 proteins
Location      Strand  Length  PID        Gene  Synonym    Code  COG  Product
1807..2169    -       120     6319249    PAU8  YAL068C    -     -    Pau8p
2480..2707    +       75      33438754   -     YAL067W-A  -     -    hypothetical protein
7235..9016    -       593     6319250    SEO1  YAL067C    -     -    Seo1p
11565..11951  -       128     6319252    -     YAL065C    -     -    hypothetical protein
12046..12426  +       126     6319253    -     YAL064W-B  -     -    hypothetical protein
13363..13743  -       126     7839146    -     YAL064C-A  -     -    hypothetical protein
21566..21850  +       94      330443360  -     YAL064W    -     -    hypothetical protein
…..

NC_001133.faa
>YAL068C Pau8p
MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAV
FNYGDFTTMLTGIAPDQVTRMITGVPWYSSRLKPAISSALSKDGIYTIAN
>YAL067W-A hypothetical protein YAL067W-A
MPIIGVPRCLIKPFSVPVTFPFSVKKNIRILDLDPRTEAYCLSLNSVCFKRLPRRKYFHLLNSYNIKRVL
GVVYC
>YAL067C Seo1p
MYSIVKEIIVDPYKRLKWGFIPVKRQVEDLPDDLNSTEIVTISNSIQSHETAENFITTTSEKDQLHFETS
SYSEHKDNVNVTRSYEYRDEADRPWWRFFDEQEYRINEKERSHNKWYSWFKQGTSFKEKKLLIKLDVLLA
FYSCIAYWVKYLDTVNINNAYVSGMKEDLGFQGNDLVHTQVMYTVGNIIFQLPFLIYLNKLPLNYVLPSL
DLCWSLLTVGAAYVNSVPHLKAIRFFIGAFEAPSYLAYQYLFGSFYKHDEMVRRSAFYYLGQYIGILSAG
GIQSAVYSSLNGVNGLEGWRWNFIIDAIVSVVVGLIGFYSLPGDPYNCYSIFLTDDEIRLARKRLKENQT
GKSDFETKVFDIKLWKTIFSDWKIYILTLWNIFCWNDSNVSSGAYLLWLKSLKRYSIPKLNQLSMITPGL
GMVYLMLTGIIADKLHSRWFAIIFTQVFNIIGNSILAAWDVAEGAKWFAFMLQCFGWAMAPVLYSWQNDI
CRRDAQTRAITLVTMNIMAQSSTAWISVLVWKTEEAPRYLKGFTFTACSAFCLSIWTFVVLYFYKRDERN
NAKKNGIVLYNSKHGVEKPTSKDVETLSVSDEK
>YAL065C hypothetical protein YAL065C
MNSATSETTTNTGAAETTTSTGAAETKTVVTSSISRFNHAETQTASATDVIGHSSSVVSVSETGNTKSLI
TSGLSTMSQQPRSTPASSIIGSSTASLEISTYVGIANGLLTNNGISVFISTVLLAIVW
……….
Write a perl/sh script to systematically transform the sequence identifiers. Follow the instructions in the PS document.
Notations:

Sequence and genome files:
We consider sequences and databases in "fasta" format.
DB.pep (extension ".pep" for protein databases); Ex.: GSACE.pep, for the Saccharomyces cerevisiae protein db.
seq.prt (extension ".prt" for protein sequences); Ex.: YAL063C.prt

Scripts:
script.pl (extension ".pl" for perl scripts);
script.scr (extension ".scr" for unix shell scripts);
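With these notations, splitting a protein database (DB.pep) into single protein sequences (seq.prt) could be sketched as follows. This is only one possible approach, using awk from sh; the function name split_fasta and the output directory argument are hypothetical, and the identifier is assumed to be the first word after ">" (as in the final sequence format).

```shell
#!/bin/sh
# Split a multi-fasta protein database into one .prt file per sequence.
# Use: split_fasta DB.pep OUTDIR
# Assumption: identifiers are unique; a repeated identifier overwrites
# the file written for its earlier occurrence.
split_fasta() {
    mkdir -p "$2"
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)     # finish the previous sequence
            id  = substr($1, 2)           # first word of the header, without ">"
            out = dir "/" id ".prt"
        }
        out != "" { print > out }         # header and sequence lines
    ' "$1"
}

# Example (not run here): split_fasta GSACE.pep SACE_prt
```

Closing each file before opening the next one avoids running out of file handles on a proteome of several thousand sequences.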
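replaceid.pl

#!/usr/bin/perl
# Use: replaceid.pl NC_000962.ptt NC_000962.faa
# Output in NC_xx.pep
use strict;
use warnings;

my $PTT = $ARGV[0];    # ncbi ptt file
my $FAA = $ARGV[1];    # ncbi faa file
my $CHR = substr($PTT, 0, length($PTT) - 4);    # file name without ".ptt"

# Associative array: PID => Synonym (the simpler identifier)
my %PID;
open(IN, "<", $PTT) || die "can't find $PTT";
while (<IN>) {
    my @tab = split(/\s+/, $_);    # list of values, one per ptt column
    next unless @tab > 5;          # skip lines without a Synonym column
    $PID{$tab[3]} = $tab[5];
}
close(IN);

open(OUT, ">", "$CHR.pep") || die "can't write $CHR.pep";
open(IN2, "<", $FAA) || die "can't open $FAA";
while (<IN2>) {
    if (m/^>/) {
        # header: >gi|PID|ref|accession| description
        my @tab = split(/\|/, $_);
        my $id = exists $PID{$tab[1]} ? $PID{$tab[1]} : $tab[1];    # keep the PID if no synonym is known
        $tab[4] =~ s/^\s+//;       # avoid a double space before the description
        print OUT ">$id $tab[4]";
    } else {
        print OUT $_;              # sequence lines are copied unchanged
    }
}
close(IN2);
close(OUT);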
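#!/bin/sh
# Run replaceid.pl on every chromosome (.ptt/.faa pair) of the current directory.
# replaceid.pl must be executable and in the PATH.
for file in *.ptt
do
    NC=${file%.ptt}
    replaceid.pl "$NC.ptt" "$NC.faa"
done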
Comparing one proteome vs itself: All hits

YAL005C  YLL024C  98.19  607   11  0   …  0.0     1041
YAL005C  YER103W  83.58  609   97  2   …  0.0     889
YAL005C  YBL075C  81.94  609  107  2   …  0.0     888
YAL005C  YJL034W  64.74  604  209  3   …  0.0     702
YAL005C  YDL229W  64.02  567  198  4   …  2e-176  613
YAL005C  YNL209W  63.84  567  199  4   …  4e-176  613
YAL005C  YJR045C  51.06  611  281  9   …  1e-136  481
YAL005C  YEL030W  49.43  615  285  11  …  8e-130  459
YAL005C  YLR369W  49.27  548  254  8   …  1e-125  445
YAL005C  YPL106C  35.85  371  230  4   …  9e-63   236
YAL005C  YBR169C  35.85  371  230  4   …  1e-57   219
YAL005C  YHR064C  31.90  373  242  5   …  2e-48   188
YAL005C  YKL073W  24.55  501  343  10  …  2e-28   122
YAL007C  YOR016C  75.00  180   42  1   …  8e-61   227
YAL012W  YGL184C  32.69  413  240  13  …  2e-46   181
YAL012W  YLR303W  30.37  438  243  12  …  6e-34   139
YAL012W  YHR112C  29.90  398  236  14  …  2e-29   125
YAL012W  YFR055W  27.99  293  199  7   …  2e-27   117
YAL015C  YOL043C  50.88  285  140  0   …  9e-82   298
YAL017W  YOL045W  62.10  694  224  6   …  0.0     771
……….

Multiple matches, if any, are listed for each protein.
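From such an "all hits" table, the number of significant matches of each protein can be obtained by counting rows per query. A minimal sketch, assuming the query is in column 1 and the subject in column 2, that the table has already been filtered for significance, and that self-hits and repeated query/subject pairs should not be counted (the function name count_hits is hypothetical):

```shell
#!/bin/sh
# Count, for each query protein, the number of distinct subjects it matches.
# Use: count_hits hits.tab   (or pipe the table on standard input)
count_hits() {
    awk '$1 != $2 && !seen[$1, $2]++ { n[$1]++ }   # skip self-hits and repeated pairs
         END { for (q in n) print q, n[q] }' "$@" | sort
}
```

Applied to the rows shown above, YAL005C would get 13 matches, YAL012W 4, and YAL007C, YAL015C and YAL017W 1 each.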
Comparing one proteome vs itself: Best hits

YAL005C  YLL024C  98.19  607   11  0   …  0.0     1041
YAL005C  YER103W  83.58  609   97  2   …  0.0     889
YAL005C  YBL075C  81.94  609  107  2   …  0.0     888
YAL005C  YJL034W  64.74  604  209  3   …  0.0     702
YAL005C  YDL229W  64.02  567  198  4   …  2e-176  613
YAL005C  YNL209W  63.84  567  199  4   …  4e-176  613
YAL005C  YJR045C  51.06  611  281  9   …  1e-136  481
YAL005C  YEL030W  49.43  615  285  11  …  8e-130  459
YAL005C  YLR369W  49.27  548  254  8   …  1e-125  445
YAL005C  YPL106C  35.85  371  230  4   …  9e-63   236
YAL005C  YBR169C  35.85  371  230  4   …  1e-57   219
YAL005C  YHR064C  31.90  373  242  5   …  2e-48   188
YAL005C  YKL073W  24.55  501  343  10  …  2e-28   122
YAL007C  YOR016C  75.00  180   42  1   …  8e-61   227
YAL012W  YGL184C  32.69  413  240  13  …  2e-46   181
YAL012W  YLR303W  30.37  438  243  12  …  6e-34   139
YAL012W  YHR112C  29.90  398  236  14  …  2e-29   125
YAL012W  YFR055W  27.99  293  199  7   …  2e-27   117
YAL015C  YOL043C  50.88  285  140  0   …  9e-82   298
YAL017W  YOL045W  62.10  694  224  6   …  0.0     771
……….
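One way to reduce an "all hits" table to best hits is to keep, for each query, the row with the highest score. The sketch below assumes that the last column holds the score (as in the tables shown here), that self-hits must be ignored, and that ties are resolved by keeping the first row encountered; the function name best_hits is hypothetical.

```shell
#!/bin/sh
# Keep, for each query protein, the row with the highest score (last column).
# Use: best_hits hits.tab   (or pipe the table on standard input)
best_hits() {
    awk '$1 != $2 {                                 # ignore self-hits
             if (!($1 in best) || $NF + 0 > score[$1]) {
                 if (!($1 in best)) order[++n] = $1 # remember first-seen order
                 best[$1]  = $0
                 score[$1] = $NF + 0
             }
         }
         END { for (i = 1; i <= n; i++) print best[order[i]] }' "$@"
}
```

Comparing scores explicitly avoids relying on the order in which the hits of a query are listed.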
Comparing one proteome vs a different proteome: All hits

YAL001C  CAGL0A00803g  42.26  1188  623  20  …  0.0     823
YAL002W  CAGL0A00781g  31.31  1217  798  20  …  3e-167  584
YAL003W  CAGL0F08547g  74.52  208    50  2   …  2e-59   223
YAL005C  CAGL0G03795g  93.41  607    40  0   …  0.0     993
YAL005C  CAGL0G03289g  85.39  609    86  2   …  0.0     899
YAL005C  CAGL0D02948g  64.24  604   212  3   …  0.0     684
YAL005C  CAGL0K04741g  64.90  567   193  4   …  2e-179  624
YAL005C  CAGL0C05379g  64.90  567   193  4   …  2e-179  624
YAL005C  CAGL0I03322g  50.90  613   283  9   …  2e-135  477
YAL005C  CAGL0I01496g  50.08  613   288  9   …  1e-134  475
YAL005C  CAGL0G04917g  46.07  573   291  8   …  6e-121  429
YAL005C  CAGL0M06083g  35.31  371   232  4   …  4e-58   220
YAL005C  CAGL0L10560g  32.26  372   241  4   …  5e-51   197
YAL005C  CAGL0F06369g  22.37  599   406  16  …  4e-20   94.7
YAL007C  CAGL0C02761g  70.17  181    51  2   …  3e-58   219
YAL009W  CAGL0C02717g  70.10  204    61  0   …  1e-81   296
YAL010C  CAGL0C02695g  47.37  494   225  5   …  1e-111  398
YAL011W  CAGL0H06391g  38.42  596   318  9   …  1e-74   275
YAL012W  CAGL0H06369g  85.24  393    55  2   …  0.0     659
YAL012W  CAGL0L06094g  35.20  392   226  13  …  4e-54   206
……….

Multiple hits
Comparing one proteome vs a different proteome: Best hits

YAL001C  CAGL0A00803g  42.26  1188  623  20  …  0.0     823
YAL002W  CAGL0A00781g  31.31  1217  798  20  …  3e-167  584
YAL003W  CAGL0F08547g  74.52  208    50  2   …  2e-59   223
YAL005C  CAGL0G03795g  93.41  607    40  0   …  0.0     993
YAL005C  CAGL0G03289g  85.39  609    86  2   …  0.0     899
YAL005C  CAGL0D02948g  64.24  604   212  3   …  0.0     684
YAL005C  CAGL0K04741g  64.90  567   193  4   …  2e-179  624
YAL005C  CAGL0C05379g  64.90  567   193  4   …  2e-179  624
YAL005C  CAGL0I03322g  50.90  613   283  9   …  2e-135  477
YAL005C  CAGL0I01496g  50.08  613   288  9   …  1e-134  475
YAL005C  CAGL0G04917g  46.07  573   291  8   …  6e-121  429
YAL005C  CAGL0M06083g  35.31  371   232  4   …  4e-58   220
YAL005C  CAGL0L10560g  32.26  372   241  4   …  5e-51   197
YAL005C  CAGL0F06369g  22.37  599   406  16  …  4e-20   94.7
YAL007C  CAGL0C02761g  70.17  181    51  2   …  3e-58   219
YAL009W  CAGL0C02717g  70.10  204    61  0   …  1e-81   296
YAL010C  CAGL0C02695g  47.37  494   225  5   …  1e-111  398
YAL011W  CAGL0H06391g  38.42  596   318  9   …  1e-74   275
YAL012W  CAGL0H06369g  85.24  393    55  2   …  0.0     659
YAL012W  CAGL0L06094g  35.20  392   226  13  …  4e-54   206
……….
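Reciprocally Best Hits (RBH) can then be extracted from two best-hit tables, one for each direction of the comparison. A minimal sketch, assuming each table holds a single best hit per query, with the query in column 1 and the subject in column 2; the function name rbh and the file names in the usage line are hypothetical.

```shell
#!/bin/sh
# Print the pairs (a, b) such that b is the best hit of a in the first
# table and a is the best hit of b in the second table.
# Use: rbh A_vs_B.best B_vs_A.best
rbh() {
    awk 'NR == FNR { fwd[$1] = $2; next }           # first table: a -> best b
         ($2 in fwd) && fwd[$2] == $1 { print $2, $1 }' "$1" "$2"
}

# Example (not run here): rbh SACE.CAGL.best CAGL.SACE.best > SACE.CAGL.rbh
```

Each output line gives one protein of the first proteome and its reciprocal best hit in the second, a pair usually taken as putative orthologs.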