DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers

Bogdan Paşaniuc, Sotirios Kentros and Ion Mândoiu, Computer Science & Engineering Department, University of Connecticut

Outline • Motivation & Problem Definition • Methods used • Hamming distance (MIN-HD and AVG-HD) • Amino acid similarity (MAX-AA-SIM and AVG-AA-SIM) • Convex-score similarity (MAX-CS-SIM) • Trinucleotide frequency (MIN-3FREQ) • Positional weight matrix (MAX-PWM) • Character-based pairwise species discrimination (k-BEST) • Combining the Methods • Results • Species Classification • New Species Recognition • Future Work & Conclusions

Motivation • “DNA barcoding” was proposed as a tool for differentiating species • Goal: to create a “fingerprint” for each species from a short DNA sequence • Assumption: mitochondrial DNA evolves at a higher rate than nuclear DNA • Mitochondrial DNA: high interspecies variability while retaining low intraspecies sequence variability • The chosen marker is the mitochondrial cytochrome c oxidase subunit 1 region (“COI”, 648 base pairs long).

Problem definition The goal of our project was to explore whether combining simple classification methods can increase classification accuracy. We address two problems: • Classification of barcodes given a training set of species. • Identification of barcodes that belong to new species. Assumption: all barcode DNA sequences are aligned

Problem definition (1) • Species Differentiation: • INPUT: a set S of barcodes whose species are known, and a new barcode x • OUTPUT: the species of x, given that S contains barcodes of the same species as x

Problem definition (2) • Species Differentiation & New Species Detection: • INPUT: a set S of barcodes whose species are known, and a new barcode x • OUTPUT: the species of x if S contains at least one barcode of the same species; otherwise, report that x belongs to a new species.

Methods • Find a “distance” between barcodes that is “able to distinguish between species”: • Low intraspecies variability • High interspecies variability • Hamming distance • Amino acid similarity • Convex-score similarity • Trinucleotide frequency • Closer barcodes tend to have similar trinucleotide frequencies • Positional weight matrix • Compute the probability that barcode x belongs to a given species • Character-based pairwise species discrimination • Find the k most informative characters that are able to distinguish between two species.

Methods [Figure: a new barcode x connected to species S1, S2, …, Sn by the distances d(x,S1), d(x,S2), …, d(x,Sn)] • Minimum “Method” Classifier: d(x,Si) = Minimum{ d(x,y) | sequence y belongs to species Si } • Average “Method” Classifier: d(x,Si) = Average{ d(x,y) | sequence y belongs to species Si }
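The two aggregation rules above can be sketched generically. This is a minimal illustration, not the authors' implementation: the names `classify`, `training`, and `mismatches` are hypothetical, and the sketch assumes a distance (smaller is closer); for similarity-based methods the comparison would be reversed.

```python
from statistics import mean

def classify(x, training, dist, aggregate=min):
    """Return the species Si with the smallest aggregate of d(x, y) over y in Si.

    training maps a species name to a list of aligned barcodes (assumed layout).
    aggregate=min gives the Minimum classifier, aggregate=mean the Average one.
    """
    best_species, best_value = None, float("inf")
    for species, barcodes in training.items():
        value = aggregate(dist(x, y) for y in barcodes)
        if value < best_value:
            best_species, best_value = species, value
    return best_species

def mismatches(a, b):
    """Illustrative distance: number of positions where two aligned barcodes differ."""
    return sum(1 for p, q in zip(a, b) if p != q)
```

Calling `classify(x, training, mismatches)` applies the Minimum rule, and passing `aggregate=mean` applies the Average rule.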

Hamming Distance • Percentage of base-pair differences • Average (AVG-HD): • Given barcode x, find the species S that minimizes the average Hamming distance from x to the barcodes y in S • species(x) = S • Minimum (MIN-HD): • Given barcode x, find the barcode y that minimizes the Hamming distance from x to y • species(x) = species(y)
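As a small sketch of the distance itself, assuming both barcodes are already aligned to the same length and that every position (including any gap symbol) counts equally. The function name `hamming_percent` is hypothetical.

```python
def hamming_percent(x, y):
    """Percentage of aligned positions at which barcodes x and y differ."""
    if len(x) != len(y):
        raise ValueError("barcodes must be aligned to the same length")
    if not x:
        return 0.0
    diffs = sum(1 for a, b in zip(x, y) if a != b)
    return 100.0 * diffs / len(x)
```

For example, two barcodes of length 4 that differ at one position are 25% apart.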

Amino Acid Similarity • Genetic code: • rules that map DNA sequences to proteins • Codon: trinucleotide unit that encodes one amino acid • Divide the DNA sequence into codons and replace each one with its corresponding amino acid • BLOSUM62 (BLOcks SUbstitution Matrix) • 20x20 matrix that gives a score for each pair of amino acids, derived from substitutions observed in conserved protein blocks • The higher the score, the less likely the substitution is to change the function of the protein

Amino Acid Similarity • Measures how similar the two amino acid sequences encoded by the barcodes are • Similarity(x,y) • barcodes x, y -> amino acid sequences x’, y’ (using the genetic code) • Score of the amino acid alignment using BLOSUM62 • Average (AVG-AA-SIM): • Find the species with maximum average similarity • Maximum (MAX-AA-SIM): • Find the barcode with maximum similarity
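The translate-then-score pipeline can be sketched as follows. Several things here are assumptions for illustration only: the standard genetic code is used (a mitochondrial code table would be the natural choice for COI), the reading frame is assumed to start at position 0, and `toy_score` is a placeholder match/mismatch score standing in for BLOSUM62, whose 20x20 entries are not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# slides do not say which code table was used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(barcode):
    """Split the barcode into codons (frame 0) and map each to an amino acid.

    Codons containing gaps or ambiguous bases are mapped to 'X'.
    """
    codons = (barcode[i:i + 3] for i in range(0, len(barcode) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def toy_score(a, b):
    """Placeholder substitution score, NOT BLOSUM62."""
    return 4 if a == b else -1

def aa_similarity(x, y, score=toy_score):
    """Sum of per-position substitution scores of the two translated barcodes."""
    return sum(score(a, b) for a, b in zip(translate(x), translate(y)))
```

A real substitution matrix can be plugged in through the `score` parameter, for example as a function that looks up the pair in a loaded BLOSUM62 table.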

Convex-score Similarity • “Long runs of consecutive base-pair matches” indicate that the encoded amino acid sequence plays an important role -> the two barcodes are “close” in evolutionary distance • The longer the run of base-pair matches, the higher the score • The contribution of a run increases convexly with its length • The new sequence is assigned to the species containing the highest-scoring sequence
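A minimal sketch of such a score. The slides do not give the convex function, so the quadratic `run * run` used below is purely an assumption; any convex weight can be passed in instead.

```python
def convex_score(x, y, weight=lambda run: run * run):
    """Sum a convex weight over every maximal run of matching positions."""
    score, run = 0, 0
    for a, b in zip(x, y):
        if a == b:
            run += 1
        else:
            score += weight(run)  # close the current run at a mismatch
            run = 0
    return score + weight(run)    # account for a run that reaches the end
```

With a convex weight, four matches in one run score higher than four matches split into two runs of two, which is the behavior described above.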

Trinucleotide Distance • For each species, compute the vector of trinucleotide frequencies • For the new sequence x, compute its vector of trinucleotide frequencies • Find the closest species • The distance between two frequency vectors is the mean square distance; x is assigned to the species at minimum distance
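A sketch of this method under stated assumptions: trinucleotides are counted over overlapping windows, the counts of all barcodes in a species are pooled, and windows containing gaps or ambiguous bases are ignored. None of these details is specified in the slides, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]  # 64 trinucleotides

def trinucleotide_freq(sequences):
    """Frequency vector over the 64 trinucleotides, pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[t] for t in TRIPLETS)  # ignores windows with gaps/N
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]

def mean_square_distance(u, v):
    """Mean of squared differences between two frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def min_3freq(x, training):
    """Assign x to the species whose frequency vector is closest to that of x."""
    fx = trinucleotide_freq([x])
    return min(training,
               key=lambda s: mean_square_distance(fx, trinucleotide_freq(training[s])))
```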

Positional weight matrix • For each species we compute a positional weight matrix (PWM) • For each locus, the PWM gives the probability of seeing each nucleotide at that locus in that species • We assume independence of loci • For a barcode x, we compute the probability that x belongs to species S as the product, over all loci, of the probability of observing the nucleotide that x has at that locus • Assign x to the species that gives the highest probability
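The PWM classifier can be sketched as below. Two choices are assumptions added for numerical robustness and are not stated in the slides: a pseudocount (so that an unseen nucleotide does not force the product to zero) and summing log-probabilities instead of multiplying probabilities. Positions where x holds a gap or ambiguous symbol are skipped.

```python
import math

ALPHABET = "ACGT"

def build_pwm(barcodes, pseudocount=1.0):
    """One dict per locus mapping each nucleotide to its estimated probability."""
    pwm = []
    for i in range(len(barcodes[0])):
        column = [b[i] for b in barcodes]
        counts = {n: column.count(n) for n in ALPHABET}
        denom = sum(counts.values()) + pseudocount * len(ALPHABET)
        pwm.append({n: (counts[n] + pseudocount) / denom for n in ALPHABET})
    return pwm

def log_probability(x, pwm):
    """Log of the product of per-locus probabilities (loci assumed independent)."""
    return sum(math.log(col[n]) for n, col in zip(x, pwm) if n in col)

def max_pwm(x, training):
    """Assign x to the species whose PWM gives the highest probability."""
    pwms = {s: build_pwm(seqs) for s, seqs in training.items()}
    return max(pwms, key=lambda s: log_probability(x, pwms[s]))
```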

Character-based pairwise species discrimination • Given species S1, S2 and new barcode x we find the k most discriminating characters • A locus -> character • Nucleotides -> possible values for character • Idea: If at a given locus, there is a nucleotide that appears in S1 and not in S2, then if x contains that nucleotide at that locus -> x is more likely to belong to S1 and not to S2

Character-based pairwise species discrimination • Finding the k most discriminative characters • The discriminative power w(i) of character i is computed from the following quantities: • Cnt(i,X,S1) - the number of times we see nucleotide X at position i in species S1 • Size(S1) - the number of barcodes in species S1

Character-based pairwise species discrimination • The two species (red, blue) are discriminated by character i with 100% accuracy • The nucleotide present at position i in the new barcode x safely tells us to which species x is more likely to belong • i is a “pure” character [Figure: column i of the aligned barcodes of the two species reads A, A, C, C, C, T, T, T, G, G; w(i) = 1]

Character-based pairwise species discrimination • The two species (red, blue) are discriminated by character i with 90% accuracy • If the new barcode x has a C, T or G at i, we guess the species of x correctly • If the new barcode x has an A at i, we choose the species containing the highest number of A’s at i (the red species) [Figure: column i of the aligned barcodes of the two species reads A, A, C, C, C, A, T, T, G, G; w(i) = 0.9]

Character-based pairwise species discrimination • Step 1: given species S1, S2 and new barcode x, we find the k most discriminating characters • Step 2: we compute how many times species S1 is favored over S2 and output the more favored species • We repeat steps 1 and 2 for all pairs of species and the new barcode x • The species S that is favored most often across all these pairwise discriminations is assigned to barcode x
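The pairwise procedure can be sketched as follows. The weight used here is an assumption chosen only because it reproduces the two examples (w(i) = 1 and w(i) = 0.9); the formula on the original slide is not preserved in this text and may differ, for instance by normalizing with Size(S1) and Size(S2) separately. Ties cast no vote, which is also an assumption.

```python
from itertools import combinations

def cnt(i, nucleotide, barcodes):
    """Cnt(i, X, S): number of barcodes in S with nucleotide X at position i."""
    return sum(1 for b in barcodes if b[i] == nucleotide)

def weight(i, s1, s2):
    """ASSUMED weight: fraction of training barcodes classified correctly
    when each nucleotide at position i is attributed to its majority species."""
    best = sum(max(cnt(i, n, s1), cnt(i, n, s2)) for n in "ACGT")
    return best / (len(s1) + len(s2))

def favored(x, s1, s2, k):
    """Net votes over the k best characters: > 0 favors s1, < 0 favors s2."""
    positions = sorted(range(len(x)), key=lambda i: weight(i, s1, s2), reverse=True)[:k]
    votes = 0
    for i in positions:
        c1, c2 = cnt(i, x[i], s1), cnt(i, x[i], s2)
        votes += (c1 > c2) - (c1 < c2)
    return votes

def k_best(x, training, k=3):
    """Assign x to the species that wins the most pairwise discriminations."""
    wins = {s: 0 for s in training}
    for a, b in combinations(training, 2):
        v = favored(x, training[a], training[b], k)
        if v > 0:
            wins[a] += 1
        elif v < 0:
            wins[b] += 1
    return max(wins, key=wins.get)
```

Under this assumed weight, a column reading A, A, C, C, C in one species and A, T, T, G, G in the other gives 9/10 = 0.9, matching the second example.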

Combining the Methods • Every classifier outputs the species to which the new barcode most likely belongs • Simple Voting: • Each classifier’s returned species receives a weight of 1 • Output the species with the most votes
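Simple voting amounts to a majority count over the classifiers' outputs. A minimal sketch follows; the slides do not say how ties are broken, so the rule below (the tied species that was voted for first wins) is an assumption.

```python
from collections import Counter

def simple_vote(predictions):
    """predictions: one species name per classifier; each vote has weight 1."""
    # Counter.most_common keeps first-seen order among equal counts.
    return Counter(predictions).most_common(1)[0][0]
```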

Datasets (1) • We used the dataset provided at http://dimacs.rutgers.edu/workshops/BarcodeResearchchallanges, consisting of 1623 aligned sequences classified into 150 species, with each sequence consisting of 590 nucleotides on average. • We randomly deleted 10 to 50 percent of the sequences from each species • Deleted sequences -> test • Remaining sequences -> train • We made sure that every species retains at least one sequence in the training set
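The split procedure can be sketched as below. How the deletion percentage was chosen is not stated, so drawing it uniformly per species is an assumption, as are the function names and the dictionary layout (species name mapped to a list of barcodes).

```python
import random

def split_species(barcodes, rng, low=0.10, high=0.50):
    """Move a random 10-50% of one species' barcodes to the test set,
    always keeping at least one barcode for training."""
    fraction = rng.uniform(low, high)
    n_test = int(round(fraction * len(barcodes)))
    n_test = max(0, min(n_test, len(barcodes) - 1))
    shuffled = barcodes[:]
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def make_split(dataset, seed=0):
    """Apply the per-species split to a whole dataset."""
    rng = random.Random(seed)
    train, test = {}, {}
    for species, barcodes in dataset.items():
        train[species], test[species] = split_species(barcodes, rng)
    return train, test
```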

Species Recovery Accuracy (in %), no new species, DAWG train dataset

Datasets (2) • We used the cowries dataset provided at xxx • We removed the species containing fewer than 4 barcodes • We randomly deleted 10 to 50 percent of the sequences from each species • Deleted sequences -> test • Remaining sequences -> train • We made sure that every species retains at least one sequence in the training set

Species Recovery Accuracy (in %), no new species

Datasets (3) • To test the accuracy of new species detection and classification, we devised a regular leave-one-out procedure: • Delete a whole species • Randomly delete 0 to 50 percent of the sequences from each remaining species • Deleted sequences -> test • Remaining sequences -> train • The following table gives average accuracy results over 150x6 different test cases

Leave-one-out Accuracy (in %), DAWG train dataset

Leave-one-out Accuracy (in %), Cowries dataset

Conclusions (1) • Every method shows a tradeoff between new species detection and classification accuracy • Hamming distance performs very well when no new species are present, but its accuracy is low for new species detection • The combined method yields better accuracy on both new species detection and sequence classification • The runtimes of all methods are within the same order of magnitude

Future Work • New species clustering: determining the different new species present • Further investigate threshold selection and weighting schemes • Ignoring parts of the given sequences could possibly improve accuracy. Are there redundant or noisy regions? • Use independent weighting schemes for new species detection and for classification into known species.
