Classification of Mitochondrial DNA SNPs into Haplogroups
Yuran Li, Department of Chemistry and Biochemistry, University of Delaware, Newark, DE 19717
Carol Wong, Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104
National Science Foundation – BioGRID REU Fellows, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269
Outline • Mitochondrial DNA & Haplogroups • The Genographic Project • 1-Nearest Neighbor • Support Vector Machines • Random Forest • RF + PCA • Results • Discussion • Extensions
Mitochondrial DNA • Found in mitochondria • 2 to 10 copies per mitochondrion • Hundreds to thousands of copies per cell • Circular • Bacterial origin • Uniparental • Non-recombining • High mutation rate • Maternal inheritance • Egg vs. sperm & the ubiquitin marker
Haplogroups | Sequencing • Haplogroup typing is based on the D-loop • Mutation hotspots • Hypervariable Region I (HVR-I) • Nucleotides 16024–16569 • Each sample is tagged with a haplogroup label representing its genetic content • A haplogroup contains similar haplotypes that share a common ancestor, defined by SNPs www.ncbi.nlm.nih.gov/bookshelf/br.fcg
SNPs • SNPs (single nucleotide polymorphisms) • Insertion • Deletion • Transversion • Transition • Heteroplasmy • Variables = SNPs • 545 HVS-I SNPs
Haplotypes/Haplogroups • Haplotype = a combination of SNPs • Haplotype here: HVS-I variants • 21164 samples • Dataset carries coarse Hg labels • Labels derived from coding-region SNPs / HVS-I motifs • Considered the 'gold standard'
Cladistics • Classification based on shared ancestry • Back Mutations • Homoplasy http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.0030104
Cladistics cont. http://upload.wikimedia.org/wikipedia/en/d/dd/Migration_map4.png
The Genographic Project • The National Geographic Society • Anthropological and forensic questions • 78,590 samples • 21,141 samples in the consented database • Hg labeling is done with both HVR-I motifs and the 22-SNP panel results • Utilizes the Nearest Neighbor algorithm (1-NN)
Nearest Neighbor • Pattern recognition | Instance-based learning • Simple and powerful • High accuracy with large data sets • A data point is classified by a vote of its k nearest neighbors • Training data partitions the space into regions • A point is assigned the class with the most votes among its neighbors (R sketch below)
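A minimal sketch of 1-NN classification in R with the class package; the toy hap_matrix, hg_labels, and the 80/20 hold-out split are illustrative placeholders, not the Genographic pipeline itself.

```r
library(class)  # provides knn()

# Toy stand-ins: rows = samples, columns = binary SNP indicators.
# In the real dataset these would be the 545 HVS-I SNP variables.
set.seed(1)
hap_matrix <- matrix(rbinom(200 * 20, 1, 0.3), nrow = 200)
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 200, replace = TRUE))

# Hold out 20% of the samples as 'unseen' test data
test_idx <- sample(200, 40)
train_x  <- hap_matrix[-test_idx, ]; train_y <- hg_labels[-test_idx]
test_x   <- hap_matrix[ test_idx, ]; test_y  <- hg_labels[ test_idx]

# k = 1: each test haplotype takes the label of its single nearest neighbor
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 1)
mean(pred == test_y)  # fraction classified correctly
```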
Support Vector Machines • Training and testing • Data vectors • Model production • Mapping into a higher-dimensional space • Maximum separating margin
Data Processing (SVM) • Nucleotide states numbered in the detailed data • Each state in {x, y, z} encoded as a binary vector: (0,0,1), (0,1,0), (1,0,0) • Radial Basis Function (RBF) kernel • Maps data into a higher-dimensional space • Simple to apply • Optimal parameters (C, γ) • Found by grid search (sketch below)
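A hedged sketch of this step in R with the e1071 interface to LIBSVM; the tuning grid and the one-hot encoded snp_matrix are assumptions for illustration, not the exact parameters used in the study.

```r
library(e1071)  # svm() and tune() wrap LIBSVM

set.seed(1)
# Illustrative one-hot encoded SNP matrix and Hg labels
snp_matrix <- matrix(rbinom(300 * 30, 1, 0.3), nrow = 300)
hg_labels  <- factor(sample(c("H", "U", "K"), 300, replace = TRUE))

# Grid search over cost C and RBF width gamma,
# scored by tune()'s internal cross validation
tuned <- tune(svm, train.x = snp_matrix, train.y = hg_labels,
              kernel = "radial",
              ranges = list(cost = 2^(-1:5), gamma = 2^(-7:-1)))

best <- tuned$best.model           # SVM refit at the best (C, gamma) pair
pred <- predict(best, snp_matrix)  # predicted haplogroup labels
table(pred, hg_labels)             # confusion matrix
```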
Random Forest • Tree-based classification algorithm • Developed by Leo Breiman and Adele Cutler • Original package written in Fortran (a computationally oriented programming language) • Ensemble learning algorithm • Implemented through the R environment
RF • Classification by voting across trees • Random inputs • Variables • Samples • ntree single decision trees • mtry variables tried at each split • Random sampling • Training set for each tree obtained through bootstrap sampling • OOB data / error estimate • Each bootstrap sample excludes certain cases • About 1/3 of input cases left out per tree (R sketch below)
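A minimal sketch with the randomForest package; the ntree and mtry values shown are placeholders, since the study tuned them by OOB error as described on a later slide.

```r
library(randomForest)

set.seed(1)
# Illustrative binary SNP matrix (stand-in for the 545 HVS-I SNPs)
snp_matrix <- matrix(rbinom(500 * 50, 1, 0.3), nrow = 500)
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 500, replace = TRUE))

rf <- randomForest(x = snp_matrix, y = hg_labels,
                   ntree = 500,        # number of trees in the forest
                   mtry  = 7,          # variables tried at each split
                   importance = TRUE)  # track variable importance

# OOB error: estimated from the ~1/3 of cases each tree never saw
rf$err.rate[rf$ntree, "OOB"]
```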
Random Forest • Figures: an ensemble of decision trees voting on a classification • http://proteomics.bioengr.uic.edu/malibu/docs/images/random_forest_thumb.png • http://cg.scs.carleton.ca/~luc/bst.gif
5-Fold Cross Validation • Tests the predictive model • Divide the data into 5 subsets (5 folds) • Training set (4 folds) • Test set (1 fold) of 'unseen' data • Repeat five times so each fold serves once as the test set • Random Forest trained on the training set • Evaluated on the test set (fold-construction sketch after the figure below)
5F CV https://esus.genome.tugraz.at/ProClassify/help/contents/pages/images/xv_folds.gif
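A small sketch of how the 5 folds might be built in base R; the fold assignment shown is a generic construction, an assumption rather than the study's exact code.

```r
set.seed(1)
n     <- 21164                             # samples in the dataset
folds <- sample(rep(1:5, length.out = n))  # random fold label per sample

for (k in 1:5) {
  test_idx  <- which(folds == k)  # one fold held out as 'unseen' test data
  train_idx <- which(folds != k)  # remaining four folds train the model
  # ... fit Random Forest on train_idx, predict on test_idx ...
}
```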
RF Model • Genotyped mtDNA dataset • 545 SNPs in HVS-I • HVS-I haplotypes • 21164 samples • Hg classification groups similar haplotypes • SNPs dictate Hg classifications • SNPs = variables • Coarse Hg classifications supplied with the dataset • Treated as the 'gold standard'
Model • Optimal mtry and ntree values for the entire dataset • Pair of parameters with the lowest OOB error (training set) • mtry SNPs tried at each split when constructing a tree • ntree decision trees constructed • 5-fold cross validation • Random Forest fit on the training set • Training set resampled by bootstrap sampling • Random sampling with replacement of cases • Cases left out form the OOB data • Model = the Random Forest object output by training • Apply the Random Forest model to the test set • Output = predicted Hg classifications • Compare against the 'observed' Hg classifications (pipeline sketch below)
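Putting the pieces together, a hedged sketch of the tune-then-cross-validate pipeline described above; the grid values, toy data, and variable names are illustrative assumptions.

```r
library(randomForest)

set.seed(1)
snp_matrix <- matrix(rbinom(500 * 50, 1, 0.3), nrow = 500)  # toy SNP data
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 500, replace = TRUE))

# 1) Pick (mtry, ntree) with the lowest OOB error over a small grid
grid <- expand.grid(mtry = c(5, 7, 10), ntree = c(250, 500))
oob  <- apply(grid, 1, function(p) {
  rf <- randomForest(snp_matrix, hg_labels,
                     mtry = p["mtry"], ntree = p["ntree"])
  rf$err.rate[rf$ntree, "OOB"]
})
best <- grid[which.min(oob), ]

# 2) 5-fold CV: train on four folds, predict the held-out fold
folds <- sample(rep(1:5, length.out = nrow(snp_matrix)))
acc <- sapply(1:5, function(k) {
  rf <- randomForest(snp_matrix[folds != k, ], hg_labels[folds != k],
                     mtry = best$mtry, ntree = best$ntree)
  pred <- predict(rf, snp_matrix[folds == k, ])
  mean(pred == hg_labels[folds == k])  # predicted vs. observed Hg
})
mean(acc)  # cross-validated accuracy
```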
R Environment • Bill Venables and David M. Smith • Based on the 'S' (statistical) programming language • A coherent system integrating data manipulation, calculation, and graphical display
PCA • PCA = Principal Component Analysis • A feature selection tool • Which variables are more informative than others? • High-dimensional, hard-to-interpret dataset • Too many variables: 545
PCA • Re-express the dataset in another basis, the principal components (PCs) • Change of basis • Possible dimensionality reduction • Reveals hidden structure and underlying relationships • Which basis best represents the dynamics of interest? • Maximize variance • Minimize covariance (redundancy) • Find the PCs, the new basis vectors
PCA on the Dataset • Eigendecomposition of C_X • Original dataset = X • C_X = (1/n) X Xᵀ, with n = number of samples • Transformed dataset Y = PX • P = an orthonormal matrix • Rows of P = principal components of X • Rows of P = eigenvectors of C_X • C_Y = diagonal covariance matrix of the transformed data Y • Diagonal entries = eigenvalues = variances
PCA • Eigenvalues represent variance • Rank-order the PCs (eigenvectors of the original covariance matrix, i.e., the new variables) by their corresponding eigenvalues • Sub-select k new variables (PCs) from the available pool • k = 64, 100, 200, 300, 400, 545 • Select the first k rank-ordered PCs as input to RF (R sketch below) • Transformed dataset = 21164 × k dimensions
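A sketch of the PCA-then-RF step in R using prcomp; the choice of k and the toy matrix are placeholders for the 21164 × 545 SNP dataset.

```r
library(randomForest)

set.seed(1)
snp_matrix <- matrix(rbinom(500 * 100, 1, 0.3), nrow = 500)  # toy SNP data
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 500, replace = TRUE))

# prcomp rotates the data onto its principal components,
# ordered by decreasing variance (eigenvalue of the covariance matrix)
pca <- prcomp(snp_matrix, center = TRUE)
summary(pca)$importance[2, 1:5]  # variance proportion of the first 5 PCs

# Keep the first k rank-ordered PCs and feed the reduced data to RF
k  <- 64
rf <- randomForest(pca$x[, 1:k], hg_labels, ntree = 500)
rf$err.rate[rf$ntree, "OOB"]
```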
SVM Findings • Macro-averaged accuracy: 88.06% • Micro-averaged accuracy: 96.59% (definitions below)
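These two averages have standard definitions; stating them here is an assumption about how the figures were computed, since the slide gives only the numbers. With G haplogroups, n_g test samples in haplogroup g, and correct_g of them classified correctly:

```latex
\text{micro accuracy} = \frac{\sum_{g=1}^{G} \text{correct}_g}{\sum_{g=1}^{G} n_g}
\qquad
\text{macro accuracy} = \frac{1}{G} \sum_{g=1}^{G} \frac{\text{correct}_g}{n_g}
```

Micro accuracy pools all samples, so well-represented haplogroups dominate it; macro accuracy weights every haplogroup equally, so rare, harder-to-classify Hgs pull it down. This is consistent with the unbalanced-dataset point raised in the Discussion slide.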
Discussion • Unbalanced dataset • Underrepresented haplogroups • Overrepresented haplogroups • Possible remedy: adjust class weights / use coarser Hgs • Bootstrap sampling in RF • Cross validation • RF vs. SVM
Conclusions • RF is stochastic • Random sampling of variables • Random sampling of training cases (samples) • Hence repeated trials are needed • SVM vs. 1-NN • Both are deterministic models • SVM (5FCV) outperforms 1-NN (5FCV)
Acknowledgements • Advisor: Chih Lee • Dr. Chun-Hsi Huang • National Science Foundation REU grant CCF-0755373 • Univ. of Connecticut
Thank you! • Any questions?