CEB - ESD - LBNL Todd DeSantis, Sonya Murray, Jordan Moberg, Gary Andersen Carol Stone (DSTL, U.K.)

Rapid quantification and taxonomic classification of a complex consortium of rDNA amplicons from both prokaryotic and eukaryotic origins using a microarray. What bugs are in my sample?

The ponderings of a toddler Why must Mom confiscate my “Hello Kitty” blanket on laundry day? Will the swings be wet at the park? How will this sausage impact the diversity in my lower G.I. bacterial community? Will I inhale any archaeal microorganisms when I visit the hot springs? Gianna DeSantis

Every discarded water sample, geological core, or spent air filter is lost data. • But who wants to do all the work? • Culture? Anaerobes? non-cultivable? Safety? • Analysis of nucleic acids isolated from environment • Must classify or sort heterogeneous nucleic acids into bins. • Restriction Fragment Length Polymorphisms (RFLP) • Single Stranded Conformation Polymorphisms (SSCP) • Temp/Denat Gradient Gel Electrophoresis (T/DGGE) • Sequencing • Provides taxonomic nomenclature • estimates the relative abundance • Need to create, clone, & process hundreds of samples • Can we create a simple, quantitative, comprehensive microbial test?

Outline • Goals • Experimental Approach • Organization of rDNA sequences into taxa (CASCADE-P) • Assigning sets of probes for each taxon • Using 16S GeneChip for quantitative aerosol analysis

Project Overview • Goal • Create a single microarray capable of detecting and quantifying bacterial and/or archaeal organisms in a complex sample. • Approach • Combinatorial power of multiple probes for sequence-specific hybridization

16S rRNA gene (16S rDNA) • Used to identify and classify organisms by gene sequence variations. • Variations have been used in design of DNA probes for the detection of: • taxonomic domains, divisions, groups … • specific organisms

The Ribosome [Figure labels: rDNA; rRNA (functional molecule); LSU; SSU (16S or 18S)]

The Ribosome • Folded secondary structure • Essential functional component • Conserved spans • structure must be retained for viability • targeted for universal/group-specific PCR primers and probes • Variable regions • spans not fundamental to the folded structure • receive less pressure from natural selection • probed for genus and species level discrimination

What could be amplified? • Universal 16S PCR primers → complex population of amplicons. • Must define the targets to consider as the Potential Amplicon Set.

First generation rDNA Array uses an 85-base highly variable region of ribosomal DNA. • Sample reacts only with complementary signature sequences on chip. • 20-base DNA signature segments on chip = probe set. [Figure: SSU rDNA drawn 5’ to 3’ with primers pA, Ccomp and 1492R; region interrogated on chip marked at positions 1390 and 1507]

http://greengenes.llnl.gov/16S • Comprehensive Aligned Sequence Construction for Automated Design of Effective Probes • Igor Dubosarskiy • Java implementations • Tim Harsch • RDBMS consultations • Lisa Corsetti • Apache module management • Kevin Melissare • Graphics

Hierarchical Phylocodes

Phylocode 2.28.3.27.2
1st Level: BACTERIA
2nd Level: PROTEOBACTERIA
3rd Level: GAMMA_SUBDIVISION
4th Level: ENTERICS_AND_RELATIVES (Group)
5th Level: ESCHERICHIA_SUBGROUP
Member sequences:
• U85138 clone ACK-SA7
• AE000452 Escherichia coli str. K-12
• Er.trachep Erwinia tracheiphila LMG 2906 (T)
• E.coliK12 Escherichia coli [gene=rrnG gene]
• Haf.alvei3 Hafnia alvei
• S.tymuriu3 Salmonella typhimurium str. Stm1
• Shi.boydii Shigella boydii
• AF084835 str. KN4
• S.enterit4 Salmonella enteritidis str. SE22
• S.ptyphi6 Salmonella paratyphi
• S.typhi3 Salmonella typhi str. St111
• S.bovismrb Salmonella bovis morbificans Sbm1
• Alt.agrlyt Alterococcus agarolyticus str. ADT3
• Shi.flxne2 Shigella flexneri ATCC 29903 (T)

Phylocode 2.30.9.2.10
1st Level: BACTERIA
2nd Level: GRAM_POSITIVE_BACTERIA
3rd Level: CLOSTRIDIUM_AND_RELATIVES
4th Level: C.BOTULINUM_GROUP
5th Level: C.ACETOBUTYLICUM_SUBGROUP
Member sequences:
• Clostridium collagenovorans DSM 3089 (T)
• Clostridium sardiniensis ATCC 33455 (T)
• Clostridium acetobutylicum ATCC 824 (T)
• Clostridium acetobutylicum DSM 792 (T)
• Clostridium acetobutylicum ATCC 824 (T)
• Clostridium acetobutylicum NCDO 1712
• Clostridium acetobutylicum DSM 1731
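A dotted phylocode reads as one number per hierarchy level, so two codes can be compared level by level. The sketch below is only an illustration of that idea; the function names `parse_phylocode` and `shared_depth` are hypothetical and are not part of CASCADE-P.

```python
def parse_phylocode(code: str) -> list[int]:
    """Split a dotted phylocode such as '2.28.3.27.2' into integer ranks."""
    return [int(part) for part in code.split(".")]


def shared_depth(code_a: str, code_b: str) -> int:
    """Count how many leading hierarchy levels two phylocodes have in common."""
    depth = 0
    for a, b in zip(parse_phylocode(code_a), parse_phylocode(code_b)):
        if a != b:
            break
        depth += 1
    return depth


# The two phylocodes shown on the slide agree only at the 1st level (BACTERIA).
escherichia_subgroup = "2.28.3.27.2"
acetobutylicum_subgroup = "2.30.9.2.10"
print(shared_depth(escherichia_subgroup, acetobutylicum_subgroup))  # 1
```

Under this reading, a larger shared depth means the two taxa diverge lower in the hierarchy.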

Chip Taxa • Avoid groupings based on historical nomenclature. • Sequence-dependent classification by transitive similarity clustering. • Each sequence must end up in exactly 1 taxon. • Transitivity: if x R y and y R z, then x R z.
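Transitive similarity clustering can be sketched as finding connected components under the relation R, which guarantees that each sequence lands in exactly one cluster. This is a minimal union-find illustration, not the CASCADE-P implementation; the relation used here (two sequences are related if they share at least two probes) and the probe sets are invented for the example.

```python
def transitive_clusters(items, related):
    """Group items so that any chain of related pairs ends up in one cluster."""
    items = list(items)
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if related(a, b):
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())


# Hypothetical probe hits per sequence (invented for illustration).
probes = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 4, 5}, "D": {10, 11}, "E": {10, 11, 12}}


def share_two_probes(a, b):
    return len(probes[a] & probes[b]) >= 2


print(transitive_clusters(probes, share_two_probes))  # [['A', 'B', 'C'], ['D', 'E']]
```

A and C share only one probe, yet they end up together because each is related to B; that is the transitive step.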

Assigning Probes for GeneChip Microarray • Select probe sets for each taxon • Ideal Probe • Present in all sequences of the taxon • Not present outside the taxon • Unable to X-hybe with seqs in other taxa • Ideal Mis-match Control Probe • Unable to X-hybe to any sequence

Finding groupings • Consider A – O to be 16S sequences. • Consider 1 – 24 to be probes already embedded on the chip. • First, associate all available probes with all available sequences. • Let probe similarities drive sequence groupings. [Figure: probes linked to sequences]

Progressive Transitive Clustering

DEFINE:
• upp (useful probe pair): a PM,MM pair where the 20-mer PM complements all intra-cluster sequences AND the central 16-mer of the PM does not complement any extra-cluster sequence AND the central 16-mer of the MM does not complement any sequence. Probe pairs are reassessed whenever the sequence clusters are altered.
• nGBupp: number of upps for a cluster; these probe pairs globally differentiate a cluster from all other sequences.
• L: the value of nGBupp which must be met for a cluster to be locked.
• nPW uppA: number of useful probe pairs which pair-wise differentiate clustA from clustB.
• nPW uppB: number of useful probe pairs which pair-wise differentiate clustB from clustA.
• m: the value of nPW upp which must be met to inhibit two clusters from merging.

FOR L (11 .. 4) DO
    FOR m (1 .. 10) DO
        Determine nGBupp for each cluster;
        Lock all clusters where nGBupp ≥ L;
        Pair-wise compare non-locked clusters (clustA, clustB);
        UNLESS (nPW uppA ≥ m AND nPW uppB ≥ m)
            Merge sequences of clustA and clustB into one cluster;
        END UNLESS
    END FOR
    Uncluster non-locked clusters;
END FOR

650 clusters found
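The upp definition can be read as three checks on one PM,MM pair. The sketch below is a simplified illustration: it assumes probes and targets are written in the same sense, so "complements" is modelled as an exact substring match, and it ignores real hybridization behaviour. The helper names and the toy sequences are invented.

```python
def central(probe: str, width: int = 16) -> str:
    """Return the central `width` bases of a probe (16 of 20 in the definition)."""
    start = (len(probe) - width) // 2
    return probe[start:start + width]


def is_useful_probe_pair(pm, mm, intra, extra):
    """Apply the three upp conditions, using substring match for 'complements'."""
    pm_hits_all_intra = all(pm in seq for seq in intra)
    pm_core_absent_outside = not any(central(pm) in seq for seq in extra)
    mm_core_absent_everywhere = not any(
        central(mm) in seq for seq in list(intra) + list(extra)
    )
    return pm_hits_all_intra and pm_core_absent_outside and mm_core_absent_everywhere


def count_global_upps(probe_pairs, intra, extra):
    """nGBupp for one cluster: how many (PM, MM) pairs pass the upp test."""
    return sum(is_useful_probe_pair(pm, mm, intra, extra) for pm, mm in probe_pairs)


# Toy data (invented for illustration): one 20-mer PM and its central-base mismatch.
pm = "ACGTACGTAGCTAGCTAGGA"
mm = "ACGTACGTACCTAGCTAGGA"
cluster = ["TT" + pm + "GG", "AA" + pm + "TT"]
others = ["G" * 40]
print(is_useful_probe_pair(pm, mm, cluster, others))  # True
```

Because the check depends on which sequences are inside and outside the cluster, it has to be re-run after every merge, which matches the note that probe pairs are reassessed whenever clusters change.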

Approach: Custom Affymetrix GeneChip • Massive parallelism – Up to 500,000 probes in a 1.28 cm2 array • Identification of multiple species in a mixed population • Single nucleotide mismatch resolution cctagcatgCattctgcata cctagcatgGattctgcata MATCH MISMATCH

General Protocol • Sample types: air, soil, feces, blood, water • gDNA or rRNA • Universal 16S rDNA PCR • Chip contains probes adhered to glass surface in grid pattern.

Locating Hybridization Events • PCR amplify DNA • Fractionate DNA • Biotin end-label • Hybridize [Figure: probe sequence A C G G T C G A repeated across the array surface; two scale labels of 50 µ]

Parameter            Frankia    Clostridium
Positive fraction    1.00       0.64
Average difference   3720       625

[Figure: PM and MM probe rows for Frankia sp. str. G48 and Clostridium butyricum]
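The two parameters in the table can be sketched from paired PM and MM intensities. The slide does not give the scoring rule, so this example assumes a pair counts as positive when PM exceeds MM by more than a threshold (`min_diff`, hypothetical) and that the average difference is the mean of PM minus MM. The intensities are invented and are not the Frankia or Clostridium data.

```python
def probe_set_summary(pm, mm, min_diff=0.0):
    """Return (positive fraction, average difference) for one probe set.

    pm, mm: equal-length lists of perfect-match and mismatch intensities.
    A pair is scored positive when PM - MM > min_diff (assumed rule).
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    positive = sum(1 for d in diffs if d > min_diff)
    return positive / len(diffs), sum(diffs) / len(diffs)


# Invented intensities for a four-pair probe set.
pm_signal = [4000, 3500, 3900, 4100]
mm_signal = [300, 200, 250, 350]
print(probe_set_summary(pm_signal, mm_signal))  # (1.0, 3600.0)
```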

Can the chip detect more than one analyte?

Combinatorial scoring of “Probe Sets” is able to categorize mixed samples.

OTU                 % pos pairs
2.30.7.12.1.013*    100
2.30.7.12.1.014     46 – 57
2.30.7.12.1.015     54 – 61
2.30.7.12.1.016     39 – 54
2.30.7.12.1.017     18
2.30.7.12.2.002     11
2.30.7.12.2.003     14
2.30.7.12.2.005     14 – 32
2.30.7.12.2.006     18 – 32
2.30.7.12.2.007     21 – 25
2.30.7.12.2.008     14 – 29
2.30.7.12.3.001     7 – 25
2.30.7.12.3.002     8
2.30.7.12.3.003     4
2.30.7.12.3.004     7 – 11
2.30.7.12.3.005     4 – 14
2.30.7.12.3.006     11
2.30.7.12.3.007     14 – 29
2.30.7.12.3.008     7
2.30.7.12.3.009     4 – 11
2.30.7.12.3.010     0 – 4
2.30.7.12.4.001     21 – 36
2.30.7.12.4.004*    100
2.30.7.12.4.005     0 – 11
2.30.7.12.4.006     29 – 54
2.30.7.12.4.007     11 – 14
2.30.7.12.4.008     11

[Figure labels: S. aureus spike; B. anthracis spike]

Hybridization results from spike-in experiment done in triplicate. Percent of probe-pairs scored positive for each probe set in the Staphylococcus Group. Sonya Murray, Aubree Hubbel

Application Example • Does air filter sample processing affect detection? • Method 1 • Wash particles from filter with SDS • Digest particles with lysozyme • Purify DNA using Qiagen kit • Method 2 • Pulverize filter and particles with bead mill, SDS, P:C:ISA • Purify DNA using MoBio kit and Sephacryl column

Bead beating allowed greater diversity to be detected.

Quantitative Analysis • Could the concentration of each amplicon in a sample be measured by fluorescence intensity? • Experimental setup for 20-point Latin Square calibration: spike concentration (pM in hybridization solution)* • Sonya Murray, Carol Stone
* 18 µL of products from 30-cycle universal 16S PCR of gDNA extracted from U.K. air sample.

No.   Oo              Fn              Sg              Mn
1     5 (5474)        13 (16069)      31 (31805)      74 (124732)
2     13 (7885)       31 (61185)      74 (81107)      143 (115237)
3     31 (58912)      74 (70317)      143 (98235)     5 (8759)
4     74 (101803)     143 (69529)     5 (7789)        13 (11530)
5     143 (149869)    5 (4534)        13 (16228)      31 (56103)
6     n.a.            n.a.            n.a.            n.a.

Final concentration of spike in hybridization in pM. Values in parentheses are the resulting hybridization signal in arbitrary units (a.u.) obtained from the Latin Square experiments. All spikes were added to 18 µL of products of 30-cycle universal SSU PCR of gDNA extracted from air samples using Method 2.

• Log2-transformed linear least squares regression • Pearson’s correlation coefficient was significant (df = 18) • 95% confidence intervals calculated according to the National Measurement System’s Valid Analytical Measurement Programme (VAM)
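The calibration step can be sketched as an ordinary least squares fit of log2 signal against log2 spike concentration. The 20 value pairs are taken from the Latin Square table; pooling all four spikes into one line is an assumption suggested by the "20 point" calibration and df = 18, and `fit_line` is a hypothetical helper rather than the software used for the original analysis.

```python
import math

# (spike concentration in pM, hybridization signal in a.u.) from the Latin Square table.
CALIBRATION = [
    (5, 5474), (13, 16069), (31, 31805), (74, 124732),
    (13, 7885), (31, 61185), (74, 81107), (143, 115237),
    (31, 58912), (74, 70317), (143, 98235), (5, 8759),
    (74, 101803), (143, 69529), (5, 7789), (13, 11530),
    (143, 149869), (5, 4534), (13, 16228), (31, 56103),
]


def fit_line(xs, ys):
    """Least squares fit y = intercept + slope * x; also returns the residual standard error."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(residual_ss / (n - 2))  # n - 2 degrees of freedom
    return slope, intercept, rse


log_conc = [math.log2(conc) for conc, _ in CALIBRATION]
log_signal = [math.log2(signal) for _, signal in CALIBRATION]
slope, intercept, rse = fit_line(log_conc, log_signal)
print(f"slope={slope:.3f} intercept={intercept:.3f} RSE={rse:.3f} df={len(CALIBRATION) - 2}")
```

Working on the log2 scale keeps the roughly 30-fold concentration range from being dominated by the highest spikes.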

Conf Interval:

$$\mathrm{Conc} \pm t \cdot \frac{RSE}{b} \sqrt{\frac{1}{m} + \frac{1}{n} + \frac{(Y - y)^2}{b^2 (n-1)\, s_x^2}}$$

• b = slope from regression
• Y = mean of 6 replicate measurements
• m = number of repeat measurements = 6
• n = number of calibration points = 20
• y = mean of the HybScores for the 20 points used for calibration
• t = critical value obtained from t-table for 18 d.f. for 95% = 1.734 (the one-tailed 95% value)
• RSE = residual standard error of calibration points = 0.56
• s_x = standard deviation of the conc. for the 20 points used for calibration

Environmental community is measured with confidence intervals.
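The interval half-width can be computed directly from the quantities defined above. This sketch assumes the slide's expression is the usual calibration-curve interval, with a ± after the concentration and a square root around the bracketed sum (both appear to have been lost from the slide text). The values of t, RSE, m and n come from the slide; b, Y, y and s_x in the example are placeholders, and all quantities are on the log2 scale used for the regression.

```python
import math


def concentration_half_width(t, rse, b, m, n, Y, y_bar, s_x):
    """Half-width of the confidence interval on a concentration read off the calibration line.

    t: critical t value, rse: residual standard error, b: regression slope,
    m: replicate measurements of the unknown, n: calibration points,
    Y: mean signal of the unknown, y_bar: mean signal of the calibration points,
    s_x: standard deviation of the calibration concentrations.
    """
    distance_term = (Y - y_bar) ** 2 / (b ** 2 * (n - 1) * s_x ** 2)
    return t * (rse / b) * math.sqrt(1 / m + 1 / n + distance_term)


# t, rse, m, n from the slide; b, Y, y_bar and s_x are placeholder values.
half_width = concentration_half_width(
    t=1.734, rse=0.56, b=1.0, m=6, n=20, Y=15.0, y_bar=15.0, s_x=1.8
)
print(round(half_width, 3))
```

The interval is narrowest when the unknown's signal Y equals the calibration mean and widens as Y moves away from it.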

Summary The SSU microarray was able to rapidly quantify and taxonomically classify a complex consortium of rDNA amplicons from both prokaryotic and eukaryotic origins.

Acknowledgements • Gary Andersen – group leader • Carol Stone – sample collection, hybridization • Sonya Murray – hybridizations
