Plant Genomics & Bioinformatics
Jen Taylor : Bioinformatics Leader, CSIRO Plant Industry
EMBL Australia, April 2011
Overview
• Introductions
• Definitions - Bioinformatics
• Our scope
• Plant Genomics and Bioinformatics
• Research aims
• “Unique” challenges
• Case Study 1 : Solving QTLs with data integration
• Our challenges and activities
• Building the biology informatics interface
• Peopleware
• Infrastructure – software and systems
• Ideas for CSIRO-EMBL interaction
CSIRO. EMBL Australia - April 2011 - Jen Taylor
Plant Genomes – Research Aims
• Molecular drivers of phenotype
• Performance traits
• Food security
• Stress resistance
• Environment
• Disease
• Molecular profiles of diversity
• Genotypic diversity
• Robustness to climate change and human impact
Crop Bioinformatics
• Large, complex, repeat-rich genomes
• Lack of a reference genome – de novo analysis
• Ploidy
• Genome duplication, gene duplication – large gene families
• Pan genome / kingdom
• Large evolutionary distances
• Plants, microbial, viral, fungal
…and yet, large gains are being made
[Figure: Rice and Wheat panels. Source: Feuillet et al., Trends in Plant Science, February 2011, Vol. 16, No. 2]
Plant Genomes – “Unique Challenges”
• No / partially sequenced reference genome
• Ploidy
• Genome duplication, gene duplication – large gene families
• Genome size
• Large range of genome sizes
Plant Genomes – Haploid Size
[Figure: haploid genome sizes. Labels: Human, Arabidopsis, Rice, Potato, Sugarcane, Cotton, Barley]
Plant Genomes – Total Size
[Figure: total genome sizes. Labels: Human, Cotton, Barley, Sugarcane, Wheat]
Plant Genomes – “Unique Challenges”
• Pan genome / kingdom
• Large evolutionary distances
• Plants, microbial, viral, fungal, insect
[Figure labels: 75 MYA, 155 MYA]
Plant Bioinformatics – “Unique Challenges”
1. Data deluge
• Rapid capacity increases
• Large, complex genomes
• High-throughput potential
2. Heterogeneous and asynchronous data release
• Many large public sequence data sets and annotations emerging
• PlantEnsembl, 1001 Arabidopsis genomes
• Genome efforts : Barley, Cotton, Wheat, Sugarcane, Lupin, Eucalyptus, Compositae, Brachypodium
3. Deep and universal customisation
• Lack of universally applicable analysis strategies
• Need flexible modular structures
4. Diverse analytical needs - different architectures
• Parallel processing – many processors (100s to 1000s)
• Moderate to high RAM needs (16 - 250 GB, up to 1 TB RAM), i.e. “fat node”
Crop Bioinformatics @ CSIRO
• Rice / Cotton
• Transcriptome, SNP detection
• Methylome - MeDIP
• Small RNA profiling
• Sugarcane
• SNP detection
• Barley
• Transcriptome
• Small RNA and PARE projects
• Wheat
• Genome sequencing
• Transcriptome
• Pathogenomics
High-throughput platforms: RNA-Sequencing, Small RNA sequencing, PARE, RNA-IP, ChIP-Seq, Soil Metagenomics, Illumina, SOLiD, 454, Genome Sequencing, Bisulfite Sequencing, MeDIP Sequencing, Expression arrays, Tiling arrays, Custom Arrays
Case Study – Solving QTLs
• Quantitative Trait Loci (QTLs)
• A region of the genome thought to contribute significantly to the control of a phenotypic trait
• Key questions:
• What are the functional elements within a QTL?
• Which of these control the trait?
Case Study – Solving QTLs
• Aim : Deeply annotate the region to find the causative gene
1. Compare with public sequence collections from related species
• Hieracium – 1,080,343 expressed sequences, 25,743 proteins
• Wheat – 4 sequenced genomes and 2 incomplete genomes
2. Integrate computational predictions and evidence
• > 20 different annotation algorithms
• > 12 different sequence comparisons x parameter optimisations
Current activities – Wheat CR genome
Wheat : Colin Cavanagh, Matt Hayden, Darren Cullerne
• Complexity reduced (CR) genome in Yitpi, Baxter, Westonia and Chara
[Figure labels: PstI, Random Shear, 475 ± 50 bp, PE1, PE2. Source: Baird et al. (2008)]
Current activities – Wheat CR genome
• Build consensus stacks
• Map reads to consensus stacks
• Map PE reads between stacks
• Exclude “foreign” reads from stacks
[Figure labels: PstI, Random Shear, PE1, PE2. Source: Baird et al. (2008)]
k-mer counts and frequencies
GCGAGATCCAACGGTGAACAGCTGCCCAAAAGAAAAaCCGCCTGGAAGTCCGAGGACCTTTAGTACTGTACTCTACCCCCGAACCAGCAGCCTTCGtGCCAaGCAAGACCGCCCTTGTCCCTTTCCTTTATCCATTCCGCcTCCTTCTTTGCTTTGTTCCAATAGAGTCTAAGGCAAAGCTAAAGTGGTTCGTaTGCCTACTTTACCTACTTGACGAAAGGGAACGAACTTCGTTTCGTTTCCGGGTTTATGGATTGGATTCAGTCAGCCTCACTCCTTCCTTTTTATGTTGTCGTGATGGTTACCGGCGAACGCTCCCAAAGGCGACCCTCTCGAGTTTCCGGCTGTTTTCTAGATTGAAGTAGCCTTTCGTCGCCCCGAAAGAAGTCACTATCAAAGAGCTCGCCCTACTGAAGTACCAAAGGTGCGCTCAGCCCGGTGACTAAGAAATGGGTTTGCGCTTGAATTGAAGTGATGAGGTTTTTCGAGGGAAGTAGGGCTCTTATTGACTAAAAGTGGGTTCTTCGCTTTCCTTTAGAATGAAAGTTGCTATGAAGCCCCTACTACTTACTTTGTTTGATTCAAAAGGCGAACGGCCCCCCAACAAGTCGTATGGGGTGGGGTGCTTGTGATAAGCTGCCTTGGATATGAGGAATTCTCAAATTGGGAAAGCATTTCTTGATTTGAAGAAACAAGAAAGTTAGGGTTTTTGGAATTGGATTCGGATAATGTTTGTTGTTTTTtGTAAGTGTGAGATTAGAGGTTCACGAAATTTTGATGGG
• k = 8
• Total sequence length n = 782
• Number of 8-mers = 775 (n - k + 1)
• Unique = 98%
• Occurring more than once = 10 k-mers
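The counts on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not the tooling used in the talk: the function names `kmer_counts` and `summarise` and the toy sequence are illustrative assumptions. It counts every overlapping k-mer and reports the total (n - k + 1), the number of distinct k-mers, and how many occur more than once.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def summarise(seq: str, k: int) -> dict:
    """Return total, distinct and repeated k-mer numbers for seq."""
    counts = kmer_counts(seq, k)
    return {
        "total": sum(counts.values()),  # equals n - k + 1
        "distinct": len(counts),
        "repeated": sum(1 for c in counts.values() if c > 1),
    }


# Hypothetical toy sequence, not the one shown on the slide.
toy = "ACGTACGTGA"
print(summarise(toy, 4))
```

Applied to the 782-base sequence above with k = 8, `total` would be 782 - 8 + 1 = 775, matching the slide.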
Knomes – Genomes as k-mers
1. The majority of k-mers are at low frequencies within genomes
[Figure: Maize genome, proportion of unique k-mers against k-mer size in bases. Source: Kurtz et al., 2008]
Knomes – Genomes as k-mers
2. k-mers are lonely
K-mer neighbourhoods can be defined by minimum distances and numbers of matching neighbours.
[Figure: example words MARKER, MARKET, MARKED, MASKER, DARKER, VIOLIN, BIOTIN]
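One way to make the neighbourhood idea concrete is to measure distance as the number of mismatching positions (Hamming distance). The slide does not name the distance measure, so that choice is an assumption, as are the function names `hamming` and `neighbours`. The sketch uses the example words from the slide in place of k-mers.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbours(word: str, words: list, max_dist: int = 1) -> list:
    """Words within max_dist mismatches of word, excluding word itself."""
    return sorted(w for w in words if w != word and hamming(word, w) <= max_dist)


words = ["MARKER", "MARKET", "MARKED", "MASKER", "DARKER", "VIOLIN", "BIOTIN"]
for w in words:
    print(w, neighbours(w, words, max_dist=1))
```

With a maximum distance of 1, MARKER has four neighbours while VIOLIN and BIOTIN have none; they only become neighbours of each other at distance 2.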
Knomes – transcriptomes as k-mers
• Transcriptome k-mer profiles have been less well studied
• Barley RNASeq 50-mer [Barrero-Sanchez]
• The majority of k-mers are at low counts
• 98% of k-mers have counts below 50
Why do we kare? k-mers are useful
k-mers used in analysis :
• Sequence alignment
• k-mer overlaps used as seeds
• k-mer pair-wise distances used to robustify alignments
• Genome Assembly
• Look for overlaps between k-mers to generate k-mer graphs
• Repeat annotation [e.g. Campagna et al., 2005]
• Looking for small groups of highly frequent k-mers
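The assembly bullet above, overlaps between k-mers forming a k-mer graph, can be sketched as a de Bruijn-style graph in which each k-mer links its (k-1)-base prefix to its (k-1)-base suffix. This is an illustrative toy, not any particular assembler; `kmer_overlap_graph`, `walk` and the example reads are assumptions made for the sketch.

```python
from collections import defaultdict


def kmer_overlap_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk(graph, start):
    """Extend a contig from start while exactly one unvisited successor exists."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads.
reads = ["ACGTC", "GTCA"]
graph = kmer_overlap_graph(reads, k=3)
print(walk(graph, "AC"))
```

Branch points (nodes with more than one successor) are where repeats and sequence variants complicate real assemblies; this sketch simply stops there.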
Why do we kare? k-mers are useful
k-mers used in analysis :
• Error correction in NGS [Kelley et al., 2010]
• Removing abnormal k-mer frequencies
• Genome size and coverage [Marcais and Kingsford, 2011]
• If a large fraction of k-mers occur C times, then coverage ~ C
• Genome size can be inferred from C and total read length
• Clustering of mixtures of sequences
• Metagenomics
• Haplotypes and SNPs
• RNA sequencing
• “annotation free” detection of differential expression
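The genome size bullet above can be illustrated with simple arithmetic: take the most common k-mer count as the coverage C, then divide the total read length by C. The sketch below follows that statement literally and assumes reads are much longer than k, so that k-mer depth approximates base coverage. The names `peak_depth`, `estimate_genome_size`, the `min_depth` filter and all numbers are hypothetical.

```python
from collections import Counter


def peak_depth(kmer_counts: dict, min_depth: int = 2) -> int:
    """Most common k-mer count, ignoring counts below min_depth (likely errors)."""
    spectrum = Counter(c for c in kmer_counts.values() if c >= min_depth)
    if not spectrum:
        raise ValueError("no k-mers at or above min_depth")
    return max(spectrum.items(), key=lambda item: item[1])[0]


def estimate_genome_size(total_read_length: int, coverage: float) -> float:
    """Genome size ~ total sequenced bases / coverage."""
    return total_read_length / coverage


# Hypothetical k-mer counts: most k-mers seen 10 times, two seen once.
counts = {"AAA": 1, "AAC": 10, "ACG": 10, "CGT": 10, "GTT": 9, "TTT": 1}
c = peak_depth(counts)
print(c, estimate_genome_size(5_000_000, c))
```

In practice the count spectrum comes from a dedicated k-mer counter run over all reads, and repeats or heterozygosity add further peaks that this sketch ignores.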
Error correction using k-mers
High coverage = trusted; low coverage = non-trusted
• Minimizing edit distances
• Assembly
• Alignments
• QUAKE – weights edit sites
• Incorporates quality scores
• Error properties of Illumina sequencing
[Figure labels: Poisson, Gaussian, edit. Source: Kelley et al., 2010]
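The trusted / non-trusted split can be sketched with a fixed coverage cutoff followed by a search for single-base edits that turn a non-trusted k-mer into a trusted one. This is a simplified illustration only: it does not reproduce QUAKE, which per the slide also weights edit sites using quality scores and Illumina error properties. The cutoff value and the names `split_trusted` and `single_base_corrections` are assumptions.

```python
from collections import Counter


def split_trusted(counts: dict, cutoff: int):
    """Partition k-mers into trusted (count >= cutoff) and non-trusted sets."""
    trusted = {kmer for kmer, c in counts.items() if c >= cutoff}
    untrusted = set(counts) - trusted
    return trusted, untrusted


def single_base_corrections(kmer: str, trusted: set, alphabet: str = "ACGT") -> list:
    """Trusted k-mers reachable from kmer by substituting exactly one base."""
    fixes = []
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                candidate = kmer[:i] + alt + kmer[i + 1:]
                if candidate in trusted:
                    fixes.append(candidate)
    return fixes


# Hypothetical counts: two well-covered k-mers and one likely error.
counts = Counter({"ACGT": 30, "CGTA": 28, "ACGA": 1})
trusted, untrusted = split_trusted(counts, cutoff=5)
for kmer in sorted(untrusted):
    print(kmer, "->", single_base_corrections(kmer, trusted))
```

When a non-trusted k-mer has several candidate corrections, some extra evidence is needed to choose between them, which is where quality scores become useful.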
Kelley error correction
[Figure. Source: Kelley et al., 2010]
Transcriptome Analysis – Differential Expression
[Workflow diagram with three routes from sequence reads: align to genomic loci; assemble contigs, then align reads; profile k-mer spectra. Outputs: differential expression, SNPs]
k-mers in de novo differential expression
Barley : Barrero-Sanchez, Stephen, Gubler, Helliwell
• RNA sequencing to compare transcriptomes with different dormancy phenotypes
• 9,458 (0.3%) significant 50-mers
Next steps:
• Look for 1 base differences between k-mers that might be SNPs
• Map DE k-mers to public databases
• Compare k-mer measures against contig assemblies
[Figure: Genotype 35 and Genotype 36, each at 0hrs, 3hrs, 6hrs]
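The first next step, finding k-mers that differ by a single base, can be sketched without comparing every pair. Each k-mer is filed under one key per position with that position masked out; two distinct k-mers sharing a key differ at exactly that position. This is an illustrative approach, not the method used in the project; `one_base_pairs` and the example k-mers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def one_base_pairs(kmers):
    """Return sorted (kmer_a, kmer_b, position) for k-mers differing at one base."""
    buckets = defaultdict(list)
    for kmer in set(kmers):
        for i in range(len(kmer)):
            # Key = position plus the sequence on either side of it.
            buckets[(i, kmer[:i], kmer[i + 1:])].append(kmer)
    pairs = set()
    for (i, _prefix, _suffix), group in buckets.items():
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b, i))
    return sorted(pairs)


# Hypothetical differentially expressed k-mers (5-mers for brevity).
de_kmers = ["ACGTA", "ACCTA", "TTTTT", "ACGTA"]
print(one_base_pairs(de_kmers))
```

A pair found this way is only a SNP candidate; a sequencing error produces the same pattern, so count information is still needed to separate the two.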
Acknowledgements
CSIRO Plant Industry Bioinformatics
Darren Cullerne, Jose Robles, Andrew Spriggs, Stuart Stephen, Hua Ying, Paul Greenfield, David Lovell
CSIRO Transformational Biology Capability Platform
Projects
Iain Wilson (Cotton), Jose Barrero Sanchez (Barley), Frank Gubler (Barley)
PIBioinformatics@csiro.au
Bio - Jen Taylor
Current appointments
• CSIRO Plant Industry Bioinformatics Leader
• Adjunct Fellow, Mathematical Sciences Institute, ANU
Brief CV
University of Queensland & QIMR
• PhD (Biochemistry, Genetics)
University of Oxford (Department of Statistics)
• Functional genomics
Wellcome Trust Centre for Human Genetics
• Head of Functional Analysis, Bioinformatics Core