Genomic diversity and population structure in switchgrass, Panicum virgatum: Genotyping-by-sequencing and population genomics. Geoff Morris*, Paul Grabowski, Justin Borevitz, Dept. of Ecology and Evolution, University of Chicago.
Genomic diversity and population structure • Geographic patterns of genomic diversity reflect: drift, migration, and adaptation • Genomic diversity: nucleotide variation and insertions/deletions across many loci in the nuclear and organellar genomes. • Leads to design of mapping populations for quantitative genetics and molecular breeding
Genomic diversity and natural history • Example: Pitcher plant mosquito (Wyeomyia smithii), Emerson et al. PNAS 2010
Ecotypic diversity in switchgrass • Switchgrass and other wide-ranging grassland species have many ecotypes • Great variability in size, shape, color, and habitat preference • Example: Upland/lowland divergence • Lowland (Oklahoma): adapted to long growing season, wet climates • Upland (Michigan): adapted to shorter growing season, drier climates
Effects of ecotype diversity on productivity • Three-year plot (6 m²) experiment at Fermilab • ~20% overyield in switchgrass mixtures compared to monocultures
“Genomic diversity and population structure in switchgrass, Panicum virgatum: from the continental scale to a dune landscape” Morris, Grabowski, and Borevitz Accepted, Molecular Ecology
Biogeography of Indiana Dunes flora • Boreal flora: e.g. Jack Pine, Bearberry • Great Plains flora: e.g. Sandreed, Little Bluestem • Coastal Plain flora: e.g. Seaside spurge, Marram grass • Eastern deciduous flora: e.g. Tulip tree • Recolonized post-glaciation: ~10,000 years ago
Switchgrass gene pools ? Zhang et al. 2011
Landscapes in Indiana Dunes • Landscape features are dynamic and can be dated: • 100s – 1000s of years for dunes • 10s – 100s of years for blowouts • Big Blowout: ~150 years old
Study questions • Can switchgrass population structure be confirmed with a genome-wide sample of non-ascertained markers? • In a hierarchical sample of switchgrass, how much diversity is there on a landscape, regional, and continental scale? • Did multiple switchgrass gene pools contribute to the Indiana Dunes populations? • Is there genomic diversity in a single landscape feature (blowout)? • Is there local (private) genetic diversity in the Indiana Dunes?
Switchgrass plant samples • Switchgrass cultivated varieties (cultivars) • Kanlow (Oklahoma - lowland) • Blackwell (Oklahoma - upland) • High Tide (Maryland - Coastal) • Forestburg and Sunburst (South Dakota) • Dacotah (North Dakota) • Cave-in-Rock (Illinois) • Southlow (Southern Michigan “ecopool”) • Indiana Dunes switchgrass • Big Blowout • Jack pine savanna • Interdune
Problems with traditional marker systems • Locus sampling: • Typically only a few kb are sequenced in a few loci (rDNA, cp introns) • Large stochastic error and locus-specific bias • e.g. Plant chloroplast has 100X lower rate of evolution than animal mitochondria • Ascertainment bias: • Occurs whenever markers are discovered and typed separately • Worst when the ascertainment panel is a geographically restricted subpopulation • e.g. Inferred genetic diversity in Africans is spuriously low when European markers are used
Genomic diversity from de novo sequencing [Figure legend: symbol = restriction site] 1) PstI digest of genomic DNA 2) End-polish, blunt-end ligation; Illumina barcodes 3) PCR amplify and pool fragments from multiple samples 4) Assemble and map reads to “stacks” and call SNPs • Reduced representation + multiplexing = more samples • 10,000+ candidate SNPs • No reference genome needed • Data here from 76 or 100 bp paired-end reads • 40 billion base pair data set
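Step 1 of the workflow above can be illustrated with an in silico digest. This is a minimal sketch, not the authors' pipeline: it assumes a linear sequence, a complete digest, and no star activity. The function name `psti_fragments` and the toy sequence are made up for illustration; the PstI recognition site (CTGCAG, cut as CTGCA^G) is standard.

```python
import re

PSTI_SITE = "CTGCAG"  # PstI recognition sequence
CUT_OFFSET = 5        # PstI cuts after the fifth base of the site: CTGCA^G

def psti_fragments(sequence):
    """Return top-strand fragments from a complete PstI digest of a linear sequence."""
    seq = sequence.upper()
    # Position of every cut, measured from the start of the sequence.
    cuts = [m.start() + CUT_OFFSET for m in re.finditer(PSTI_SITE, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

# Toy sequence (hypothetical) containing two PstI sites.
toy = "AAACTGCAGTTTTCTGCAGGG"
fragments = psti_fragments(toy)
print(fragments)  # ['AAACTGCA', 'GTTTTCTGCA', 'GGG']
```

Only sequence next to a cut site ends up at the start of a read, which is why the same loci recur across samples and can be stacked without a reference genome.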
Plastome sequence in reduced representation libraries (RRLs) 1) PstI digest of genomic DNA, with star activity and random shearing • Nuclear whole genome shotgun (WGS) coverage is too low (<<1X) for assembly • Plastome WGS coverage is very high (>>1X) 2) End-polish, blunt-end ligation
Analysis of chloroplast data • Chloroplast genome sequence (plastome) included in data • Random (shotgun) sequence + 20 PstI sites • Switchgrass chloroplast reference available (Upland and Lowland) • Mapped reads to both ~140,000 base pair chloroplast genomes • Coverage (# of times each position is read): 1X – 786X
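The coverage figure quoted above (number of times each position is read) can be computed from mapped read intervals. The sketch below assumes alignments are given as 0-based, half-open `(start, end)` pairs on a linear reference; the function name and the toy intervals are hypothetical, and a real analysis would read these intervals from the mapper's output.

```python
def per_position_coverage(genome_length, alignments):
    """Count how many alignments overlap each position of a linear reference.

    alignments: iterable of (start, end) pairs, 0-based and half-open.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    depth = 0
    for i in range(genome_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage

# Toy example (hypothetical): three reads on a 10 bp reference.
cov = per_position_coverage(10, [(0, 4), (2, 6), (3, 10)])
print(cov)                 # [1, 1, 2, 3, 2, 2, 1, 1, 1, 1]
print(min(cov), max(cov))  # 1 3
```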
Chloroplast coverage and polymorphisms [Figure: chloroplast genome coverage plotted against position (kb)]
Chloroplast phylogeny • Neighbor joining tree based on 140kb • Named haplogroups have >50% bootstrap • Unfilled lines indicate low-coverage sample
Population analysis of nuclear loci • Create “pseudoreference” of RRL loci with de novo assembly • Map reads to pseudoreference to create stacks (150-1500 reads) • Map reads to switchgrass chloroplast and sorghum mitochondria, and drop stacks that match organelles • Select single-nucleotide variants that: • Have high sequence quality (base-call error probability < 0.001, i.e. PHRED quality > 30, for both alleles) • Vary in frequency across samples (chi-square test, p < 0.01) • Are nearest to restriction site, closest to beginning of read • Randomly select one allele per sample (weighted by observed frequency)
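Two of the filtering steps above lend themselves to a short sketch: the quality cutoff and the weighted random allele draw. This is an illustration under assumptions, not the authors' code. It assumes the 0.001 cutoff is a base-call error probability (equivalent to PHRED Q30 via Q = -10·log10(P)) and that "observed frequency" means read counts per allele in a sample's stack. All function names and the read counts are hypothetical; the chi-square step is omitted.

```python
import random

def phred_to_error_prob(q):
    """Convert a PHRED quality score to a base-call error probability."""
    return 10 ** (-q / 10)

def passes_quality(q_allele_1, q_allele_2, max_error=0.001):
    """True if both alleles have an error probability below the cutoff."""
    return all(phred_to_error_prob(q) < max_error for q in (q_allele_1, q_allele_2))

def sample_one_allele(allele_counts, rng):
    """Draw one allele for a sample, weighted by its observed read counts."""
    alleles = list(allele_counts)
    weights = [allele_counts[a] for a in alleles]
    return rng.choices(alleles, weights=weights, k=1)[0]

# Hypothetical stack for one sample at one locus: 90 reads of A, 30 of G.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
stack = {"A": 90, "G": 30}
draws = [sample_one_allele(stack, rng) for _ in range(1000)]
print(draws.count("A") / len(draws))  # close to 0.75 on average
```

Drawing a single allele per sample sidesteps genotype calling, which is difficult when ploidy varies and read depth is uneven.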
Coding sequence variation in the chloroplast • 77 coding genes in chloroplast (including Rubisco, ribosome, etc.) • 60 kb of coding sequence • Constraints on non-synonymous (NS) vs. synonymous (S) variation provide biological validation for SNPs • Upland vs. Lowland (~1 million years): • 23 NS : 16 S (ratio = 1.4) • Within upland (< 0.5 million years): • 16 NS : 3 S (ratio = 5.3)
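The two ratios on this slide are plain count ratios and can be reproduced directly. The sketch below assumes nothing beyond the counts given; note that a raw NS:S count ratio is not the same as a per-site dN/dS estimate, which would also need the number of synonymous and non-synonymous sites. The function name is hypothetical.

```python
def ns_to_s_ratio(nonsynonymous, synonymous):
    """Raw ratio of non-synonymous to synonymous variant counts."""
    return nonsynonymous / synonymous

between_ecotypes = ns_to_s_ratio(23, 16)  # upland vs. lowland
within_upland = ns_to_s_ratio(16, 3)      # within upland

print(round(between_ecotypes, 1), round(within_upland, 1))  # 1.4 5.3
```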
Nuclear genome: Multidimensional scaling ~11000 nuclear loci, mean of 100 random allele samples
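Multidimensional scaling takes a matrix of pairwise distances between samples as input. The slide does not say which distance was used, so the sketch below assumes a simple one: the proportion of loci at which two samples' randomly drawn alleles differ, skipping loci missing in either sample. Sample names, allele calls, and function names are all hypothetical.

```python
def mismatch_distance(calls_a, calls_b):
    """Proportion of jointly observed loci where the two samples differ.

    Each argument is a list with one allele per locus; None marks missing data.
    """
    compared = 0
    mismatched = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue
        compared += 1
        if a != b:
            mismatched += 1
    return mismatched / compared if compared else float("nan")

def distance_matrix(samples):
    """Return sorted sample names and the square matrix of pairwise distances."""
    names = sorted(samples)
    matrix = [[mismatch_distance(samples[i], samples[j]) for j in names] for i in names]
    return names, matrix

# Toy data (hypothetical): one random allele per sample at four loci.
samples = {
    "s1": ["A", "C", "G", None],
    "s2": ["A", "T", "G", "T"],
    "s3": ["G", "T", None, "T"],
}
names, matrix = distance_matrix(samples)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

One plausible reading of "mean of 100 random allele samples" is that such a matrix, or the resulting coordinates, is averaged over 100 independent allele draws; the source does not specify which.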
Nuclear loci: Structure analysis Bayesian clustering algorithm ~11000 nuclear loci, random allele sample, Burn-in 10K, Run 10K
Conclusions • Confirmed upland vs. lowland differentiation and differentiated a local population using non-ascertained markers • Lake Michigan switchgrass is distinct from the broader upland population in the Midwest and Great Plains • Post-glacial gene flow into the Indiana Dunes included genotypes from across the Great Plains and Midwest • The chloroplast diversity in the Indiana Dunes did not evolve in the current Midwestern population, but originated one or more glacial cycles ago • A single blowout in the dunes can have as much chloroplast diversity as the Midwest
New GBS methods for population genomics • For true population analysis we need 10+ individuals in multiple populations • Illumina multiplexing is too expensive: separate prep cost for each library adds $100s/sample • Read count overdispersion (up to ~200X more than Poisson) requires technical replicates to even out counts • Sticky-end ligation increases specificity and removes random sequence (including plastome)
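Overdispersion relative to Poisson can be summarized with the variance-to-mean ratio (index of dispersion), which equals 1 for Poisson counts. The slide does not state how its ~200X figure was measured, so this is a sketch of one common measure with made-up read counts; `dispersion_index` is a hypothetical name.

```python
from statistics import mean, variance

def dispersion_index(read_counts):
    """Sample variance divided by mean; about 1 for Poisson-distributed counts."""
    return variance(read_counts) / mean(read_counts)

# Hypothetical read counts per sample at one locus in a multiplexed lane.
even_counts = [9, 11, 10, 12, 8, 10]
uneven_counts = [2, 50, 7, 300, 12, 1]

print(dispersion_index(even_counts))    # well below 1: more even than Poisson
print(dispersion_index(uneven_counts))  # far above 1: strongly overdispersed
```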
Genotyping-by-sequencing (GBS) • Based on Elshire et al. 2011, PLoS ONE
New population genomic studies with GBS • Continental population structure (126 individuals) • 50/50 deep diversity and shallow diversity based on chloroplast markers and SSRs • Tetraploid cultivars (24 each for TX, OK, NE, ND cultivars) • Ploidy differences may be confounded with genetic diversity • High sample size should allow traditional pop gen analyses (Fst etc...) • Dune half-sibs (4 mothers and 10 offspring each) • True SNPs will segregate in the offspring while homeologous substitutions will not
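As an example of the traditional statistics mentioned above, Fst at a single biallelic locus can be written as (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within populations and H_T is the expected heterozygosity of the pooled population. The sketch below is a textbook form under simplifying assumptions: equal population weights, no sample-size correction, and diploid-style heterozygosity, which may not suit tetraploid cultivars. The function name and frequencies are hypothetical.

```python
def fst_biallelic(freqs):
    """Fst for one biallelic locus from per-population allele frequencies.

    freqs: frequency of the same allele in each population (equal weights).
    """
    n = len(freqs)
    h_s = sum(2 * p * (1 - p) for p in freqs) / n  # mean within-population heterozygosity
    p_bar = sum(freqs) / n                         # pooled allele frequency
    h_t = 2 * p_bar * (1 - p_bar)                  # total expected heterozygosity
    if h_t == 0:
        return float("nan")                        # locus is monomorphic overall
    return (h_t - h_s) / h_t

print(fst_biallelic([0.5, 0.5]))            # 0.0: no differentiation
print(fst_biallelic([0.0, 1.0]))            # 1.0: fixed difference
print(round(fst_biallelic([0.2, 0.8]), 2))  # 0.36
```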
Bioinformatics overview • No software package for population genomic analysis on GBS • Stacks (U. Oregon) comes closest, but its multinomial sampling model expects high-frequency SNPs (e.g. mapping population) • Buckler lab TASSEL package (Java) may be appropriate • We’ve been using a custom pipeline (CLC, MySQL, R) for analysis • http://create.ly/gefxsub43