
Analysis and comparison of very large metagenomes with fast clustering and functional annotation

Weizhong Li, BMC Bioinformatics 2009. Presented by Chuan-Yih Yu.


Presentation Transcript


  1. Analysis and comparison of very large metagenomes with fast clustering and functional annotation. Weizhong Li, BMC Bioinformatics 2009. Presented by Chuan-Yih Yu.

  2. Outline • Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) • Goal • Methodology • Metagenome comparison • Conclusion • Discussion

  3. Goal • Reduce computation time • Global Ocean Sampling (GOS): 1 M CPU hours ≈ 114 years on a single CPU • Discover novel genes or protein families • Metagenomic Profiling of Nine Biomes (BIOME): ~90% of sequences unknown • GOS: doubled the number of protein families • Compare metagenome data • Clustering-based • Protein family-based

  4. RAMMCAP

  5. RNA

  6. RAMMCAP

  7. Meta_RNA & tRNAscan-SE • High sensitivity, low specificity (except 16S) • "Identification of ribosomal RNA genes in metagenomic fragments", Huang, Y., Gilna, P. & Li, W. Z., Bioinformatics • "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence", Lowe, T. M. and Eddy, S. R., Nucleic Acids Res

  8. CD-HIT Clustering

  9. RAMMCAP

  10. CD-HIT • Greedy incremental clustering algorithm • Avoids full pairwise alignment where possible • Short-word filter (word length 2 to 5) • Index table • "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, et al., Bioinformatics (2001) • "Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, et al., Bioinformatics (2002) • "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li, et al., Bioinformatics (2006)
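The greedy incremental scheme with a short-word filter can be sketched as follows. This is a minimal illustration, not CD-HIT's actual implementation: the function names are hypothetical, the `identity_ok` helper is a toy ungapped comparison standing in for a real alignment, and the index table is replaced by a plain word count per representative. The filter relies on the observation that a sequence of length L with at most m mismatches must still share at least L - k + 1 - m·k words of length k with its match.

```python
from collections import Counter

def word_counts(seq, k):
    """Count the overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def identity_ok(seq, rep, threshold_pct):
    """Toy stand-in for an alignment (assumption): ungapped, position-by-position
    identity measured over the shorter sequence, compared in integer percent."""
    n = min(len(seq), len(rep))
    matches = sum(1 for x, y in zip(seq, rep) if x == y)
    return n > 0 and matches * 100 >= threshold_pct * n

def greedy_cluster(seqs, threshold_pct=90, k=2):
    """Greedy incremental clustering: process sequences longest first; each one
    joins the first representative it matches, otherwise it founds a new cluster."""
    reps = []      # (representative, its word counts)
    clusters = {}  # representative -> list of members
    for seq in sorted(seqs, key=len, reverse=True):
        counts = word_counts(seq, k)
        length = len(seq)
        # Largest number of mismatches still compatible with the threshold.
        max_mismatch = length * (100 - threshold_pct) // 100
        # Each mismatch can destroy at most k words, so a true match must
        # share at least this many words with the representative.
        min_shared = (length - k + 1) - max_mismatch * k
        for rep, rep_counts in reps:
            shared = sum((counts & rep_counts).values())
            if shared < min_shared:
                continue  # cheap filter: skip the expensive comparison
            if identity_ok(seq, rep, threshold_pct):
                clusters[rep].append(seq)
                break
        else:
            reps.append((seq, counts))
            clusters[seq] = [seq]
    return clusters

example = greedy_cluster(["MKVLAAGIVGLLLA", "PPPPWWWWHHHH", "MKVLAAGIVGLLLAQ"])
```

In this toy run the two MKVL sequences end up in one cluster, while the unrelated sequence is rejected by the word filter alone and never reaches the identity check.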

  11. Limitation of CD-HIT • Evenly distributed mismatches leave few shared short words • Greedy issue • A sequence is grouped into the first cluster it matches, not necessarily the best one

  12. CD-HIT Performance

  13. ORF clustering

  14. RAMMCAP

  15. Why Cluster ORFs • Functional studies • Finding novel genes

  16. ORF Prediction • ORF_finder • MetaGene

  17. ORF Prediction Performance • MetaSim simulated reads • Average length 100, 200, 400, 800 bp, 1 million reads • True ORF (sensitivity) • Overlaps an NCBI-annotated ORF by 30 AA • Predicted ORF (specificity) • 50% overlap with a true ORF

  18. ORF Clustering • Run 1 clustering • 90-95% identity • Run 2 clustering • 60% identity over 80% of length (454) • 30% identity over 80% of length (Sanger) • Merge the results of runs 1 and 2
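Merging the two runs amounts to expanding each run-2 cluster, whose members are run-1 representatives, back into the original ORFs. A minimal sketch, assuming each run is available as a dictionary from representative to members (the data layout and the name `merge_runs` are hypothetical, not taken from the slides):

```python
def merge_runs(run1, run2):
    """Combine a two-step clustering.

    run1: representative -> ORFs clustered at high identity (run 1)
    run2: representative -> run-1 representatives clustered at lower identity (run 2)
    Returns: run-2 representative -> all original ORFs in the merged cluster.
    """
    merged = {}
    for rep2, run1_reps in run2.items():
        members = []
        for rep1 in run1_reps:
            # Every run-1 representative stands for its whole run-1 cluster.
            members.extend(run1[rep1])
        merged[rep2] = members
    return merged

run1 = {"orf1": ["orf1", "orf2"], "orf3": ["orf3"], "orf4": ["orf4", "orf5"]}
run2 = {"orf1": ["orf1", "orf3"], "orf4": ["orf4"]}
merged = merge_runs(run1, run2)
```

Here `orf3` was its own cluster after run 1 but joins the cluster of `orf1` after run 2, so the merged cluster contains `orf1`, `orf2`, and `orf3`.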

  19. Clustering Evaluation • Test sets • GOS-ORF (30%), BIOME (95%), BIOME-ORF (60%)

  20. BIOME Microbiomes & Viromes • Microbial sequences are more conserved than viral sequences.

  21. Clustering Quality • Need a conservative threshold • Use only Pfam sequences longer than 30 AA • Discard the shorter of overlapping Pfam sequences • Placement into different clusters • Sequences from the same Pfam family can be placed into different clusters

  22. Clustering Validation • Generate clusters whose sequences come from the same Pfam family • Minimize the number of clusters • Good clusters: >95% of members from the same Pfam family • >97% of sequences are in good clusters • ~30 times more than in bad clusters [Figure: cluster size vs. number of clusters and number of sequences]
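The "good cluster" criterion can be checked mechanically once every clustered sequence carries a Pfam label. A minimal sketch, assuming each cluster is given as a list of Pfam family identifiers, one per member (the function names and the toy labels are hypothetical):

```python
from collections import Counter

def is_good_cluster(pfam_labels, min_fraction=0.95):
    """A cluster is 'good' if more than min_fraction of its members
    belong to the same Pfam family."""
    if not pfam_labels:
        return False
    top_count = Counter(pfam_labels).most_common(1)[0][1]
    return top_count / len(pfam_labels) > min_fraction

def fraction_in_good_clusters(clusters):
    """Fraction of all sequences that sit in good clusters."""
    total = sum(len(c) for c in clusters)
    good = sum(len(c) for c in clusters if is_good_cluster(c))
    return good / total if total else 0.0

clusters = [
    ["PF00001"] * 99 + ["PF00002"],      # 99% from one family: good
    ["PF00003"] * 3 + ["PF00004"] * 2,   # 60% from one family: bad
]
share = fraction_in_good_clusters(clusters)
```

With these toy clusters, 100 of 105 sequences fall into the good cluster.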

  23. RAMMCAP

  24. Protein Family Annotation • Pfam (24.0, Oct. 2009, 11912 families) • Textual descriptions, other resources, and literature references • TIGRFAMs (9.0, Nov. 2009, 3808 models) • GO, Pfam, and InterPro models • COG (2003, 4873 clusters of orthologous groups) • 3 lineages and ancient conserved domains • RPS-BLAST (reverse PSI-BLAST) • E values ≤ 0.001

  25. Novel Protein Families Discovery • Large clusters of apparently spurious ORFs without homology matches may contain novel protein families. • In GOS, only 1.3% of clusters with cluster size ≥ 10 map to 93% of true ORFs • In BIOME, only 1.0% of clusters with cluster size ≥ 5 map to 28% of true ORFs

  26. Metagenome comparison

  27. Statistical Comparison of Metagenomes • Occurrence profile coefficient • z score, why? (not Rodriguez-Brito's method, which requires 105 simulated samples) • Criteria: 1. z > cutoff 2. PA ≥ f × PB • Low-occurrence cutoff: HA = 4 (0.95, z = 1.96); HA = 7 (0.99, z = 2.58)
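The two criteria can be combined into one test per cluster or protein family. The slide does not give the formula for z, so the sketch below assumes the standard pooled two-proportion z statistic, with HA and HB as occurrence counts in samples A and B and NA and NB as the total number of sequences in each sample (NA, NB, the function names, and the default f = 2.0 are assumptions; the slide only implies f > 1).

```python
import math

def z_score(h_a, n_a, h_b, n_b):
    """Pooled two-proportion z statistic (assumed form, not given on the slide)."""
    p_a = h_a / n_a
    p_b = h_b / n_b
    pooled = (h_a + h_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se if se > 0 else 0.0

def enriched_in_a(h_a, n_a, h_b, n_b, z_cutoff=1.96, min_occurrence=4, f=2.0):
    """True if the family is significantly more frequent in sample A:
    enough occurrences, z above the cutoff, and PA >= f * PB."""
    if h_a < min_occurrence:
        return False  # low-occurrence cutoff: too few hits to judge
    p_a = h_a / n_a
    p_b = h_b / n_b
    return z_score(h_a, n_a, h_b, n_b) > z_cutoff and p_a >= f * p_b

strong = enriched_in_a(40, 1000, 10, 1000)    # 4% vs 1%
weak = enriched_in_a(12, 1000, 10, 1000)      # 1.2% vs 1%
rare = enriched_in_a(3, 1000, 0, 1000)        # below the occurrence cutoff
ratio = enriched_in_a(150, 1000, 100, 1000)   # significant z, but PA < 2 * PB
```

The defaults pair z = 1.96 with a minimum occurrence of 4, matching the 0.95 setting on the slide; for the 0.99 setting, pass `z_cutoff=2.58` and `min_occurrence=7`.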

  28. Comparison between Rodriguez-Brito's method and z test method.

  29. Clustering-based Comparison [Figure: number of clusters and rAB for GOS ORF clusters]

  30. Clustering-based Comparison • BIOME samples are more diverse than GOS samples [Figure: BIOME clusters]
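The slides do not define the occurrence profile coefficient rAB. One plausible reading, sketched here as an assumption, is a Pearson correlation between the per-cluster occurrence counts of two samples; the function name and the toy profiles are hypothetical.

```python
import math

def occurrence_correlation(profile_a, profile_b):
    """Pearson correlation between two occurrence profiles, where each profile
    lists how many sequences a sample contributes to each cluster (assumed form)."""
    n = len(profile_a)
    if n == 0 or n != len(profile_b):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / math.sqrt(var_a * var_b)

similar = occurrence_correlation([1, 2, 3], [2, 4, 6])
opposite = occurrence_correlation([1, 2, 3], [3, 2, 1])
```

Under this reading, two samples that populate the same clusters in the same proportions score close to 1, and samples with opposite cluster usage score close to -1.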

  31. Protein Family-based Comparison • Merge Pfam, TIGRFAMs, and COG into superfamilies • Pfam clans, TIGRFAMs role categories, and COG functional classes • Compare with a specific superfamily

  32. Protein Family-based Comparison (a) GOS on COG Class F, (b) GOS on COG Class T, (c) BIOME on COG Class F, (d) BIOME on COG Class T

  33. Conclusion • RAMMCAP improves performance • CD-HIT • z test • Novel protein family discovery • ORF clustering • Metagenome comparison • Cluster-based • Protein family-based

  34. Discussion • How much improvement is gained by applying RNA prediction to the raw reads first? • How should the significance factor f be determined? • PA ≥ f × PB (f > 1)
