Classification of Mitochondrial DNA SNPs into Haplogroups
Yuran Li, Department of Chemistry and Biochemistry, University of Delaware, Newark, DE 19717
Carol Wong, Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104
National Science Foundation – BioGRID REU Fellows, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269
Outline • Mitochondrial DNA & Haplogroups • The Genographic Project • 1-Nearest Neighbor • Support Vector Machines • Random Forest • RF + PCA • Results • Discussion • Extensions
Mitochondrial DNA • Found in mitochondria • 2 to 10 copies per mitochondrion • Hundreds to thousands of copies per cell • Circular • Bacterial origin • Uniparental • Non-recombining • High mutation rate • Maternal inheritance • Egg vs. sperm & the ubiquitin marker
Haplogroups | Sequencing • Haplogroup typing is based on the D-loop • Mutation hotspots • Hypervariable Region I (HVR-I) • Nucleotides 16024–16569 • Each sample is tagged with a haplogroup label representing its genetic content • A haplogroup contains similar haplotypes that share a common ancestor, defined by SNPs www.ncbi.nlm.nih.gov/bookshelf/br.fcg
SNPs • SNPs (single nucleotide polymorphisms) • Insertion • Deletion • Transversion • Transition • Heteroplasmy • Variables = SNPs • 545 HVS-I SNPs
Haplotypes/Haplogroups • Haplotype = a combination of SNPs • Haplotype here: HVS-I variants • 21164 samples • Dataset carries coarse Hg labels • Labels derived from coding-region SNPs / HVS-I motifs • Considered the 'gold standard'
Cladistics • Classification based on shared ancestry • Back Mutations • Homoplasy http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.0030104
Cladistics cont. http://upload.wikimedia.org/wikipedia/en/d/dd/Migration_map4.png
The Genographic Project • The National Geographic Society • Anthropological and forensic questions • 78,590 samples • 21,141 samples in the consented database • Hg labeling is done with both HVR-I motifs and the 22-SNP panel results • Utilizes the Nearest Neighbor algorithm (1-NN)
Nearest Neighbor • Pattern recognition | Instance-based learning • Simple and powerful • High accuracy with large data sets • A data point is classified by a vote of its k nearest neighbors • Training data partitions the space into regions • A point is assigned the class with the most votes among its neighbors (R sketch below)
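A minimal sketch of 1-NN classification in R with the class package; the toy hap_matrix, hg_labels, and the 80/20 hold-out split are illustrative placeholders, not the Genographic pipeline itself.

```r
library(class)  # provides knn()

# Toy stand-ins: rows = samples, columns = binary SNP indicators.
# In the real dataset these would be the 545 HVS-I SNP variables.
set.seed(1)
hap_matrix <- matrix(rbinom(200 * 20, 1, 0.3), nrow = 200)
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 200, replace = TRUE))

# Hold out 20% of the samples as 'unseen' test data
test_idx <- sample(200, 40)
train_x  <- hap_matrix[-test_idx, ]; train_y <- hg_labels[-test_idx]
test_x   <- hap_matrix[ test_idx, ]; test_y  <- hg_labels[ test_idx]

# k = 1: each test haplotype takes the label of its single nearest neighbor
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 1)
mean(pred == test_y)  # fraction classified correctly
```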
Support Vector Machines • Training and testing • Data vectors • Model production • Mapping into a higher-dimensional space • Maximum separating margin
Data Processing (SVM) • Nucleotide states numbered in the detailed data • Each state in {x, y, z} encoded as a binary vector: (0,0,1), (0,1,0), (1,0,0) • Radial Basis Function (RBF) kernel • Maps data into a higher-dimensional space • Simple to apply • Optimal parameters (C, γ) • Found by grid search (sketch below)
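A hedged sketch of this step in R with the e1071 interface to LIBSVM; the tuning grid and the one-hot encoded snp_matrix are assumptions for illustration, not the exact parameters used in the study.

```r
library(e1071)  # svm() and tune() wrap LIBSVM

set.seed(1)
# Illustrative one-hot encoded SNP matrix and Hg labels
snp_matrix <- matrix(rbinom(300 * 30, 1, 0.3), nrow = 300)
hg_labels  <- factor(sample(c("H", "U", "K"), 300, replace = TRUE))

# Grid search over cost C and RBF width gamma,
# scored by tune()'s internal cross validation
tuned <- tune(svm, train.x = snp_matrix, train.y = hg_labels,
              kernel = "radial",
              ranges = list(cost = 2^(-1:5), gamma = 2^(-7:-1)))

best <- tuned$best.model           # SVM refit at the best (C, gamma) pair
pred <- predict(best, snp_matrix)  # predicted haplogroup labels
table(pred, hg_labels)             # confusion matrix
```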
Random Forest • Tree-based classification algorithm • Developed by Leo Breiman and Adele Cutler • Original package written in Fortran (a computationally oriented programming language) • Ensemble learning algorithm • Implemented through the R environment
RF • Classification by voting across trees • Random inputs • Variables • Samples • ntree single decision trees • mtry variables tried at each split • Random sampling • Training set for each tree obtained through bootstrap sampling • OOB data / error estimate • Each bootstrap sample excludes certain cases • About 1/3 of input cases left out per tree (R sketch below)
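A minimal sketch with the randomForest package; the ntree and mtry values shown are placeholders, since the study tuned them by OOB error as described on a later slide.

```r
library(randomForest)

set.seed(1)
# Illustrative binary SNP matrix (stand-in for the 545 HVS-I SNPs)
snp_matrix <- matrix(rbinom(500 * 50, 1, 0.3), nrow = 500)
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 500, replace = TRUE))

rf <- randomForest(x = snp_matrix, y = hg_labels,
                   ntree = 500,        # number of trees in the forest
                   mtry  = 7,          # variables tried at each split
                   importance = TRUE)  # track variable importance

# OOB error: estimated from the ~1/3 of cases each tree never saw
rf$err.rate[rf$ntree, "OOB"]
```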
Random Forest • Figures: an ensemble of decision trees voting on a classification • http://proteomics.bioengr.uic.edu/malibu/docs/images/random_forest_thumb.png • http://cg.scs.carleton.ca/~luc/bst.gif
5-Fold Cross Validation • Tests the predictive model • Divide the data into 5 subsets (5 folds) • Training set (4 folds) • Test set (1 fold) of 'unseen' data • Repeat five times so each fold serves once as the test set • Random Forest trained on the training set • Evaluated on the test set (fold-construction sketch after the figure below)
5F CV https://esus.genome.tugraz.at/ProClassify/help/contents/pages/images/xv_folds.gif
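A small sketch of how the 5 folds might be built in base R; the fold assignment shown is a generic construction, an assumption rather than the study's exact code.

```r
set.seed(1)
n     <- 21164                             # samples in the dataset
folds <- sample(rep(1:5, length.out = n))  # random fold label per sample

for (k in 1:5) {
  test_idx  <- which(folds == k)  # one fold held out as 'unseen' test data
  train_idx <- which(folds != k)  # remaining four folds train the model
  # ... fit Random Forest on train_idx, predict on test_idx ...
}
```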
RF Model • Genotyped mtDNA dataset • 545 SNPs in HVS-I • HVS-I haplotypes • 21164 samples • Hg classification groups similar haplotypes • SNPs dictate Hg classifications • SNPs = variables • Coarse Hg classifications supplied with the dataset • Treated as the 'gold standard'
Model • Optimal mtry and ntree values for the entire dataset • Pair of parameters with the lowest OOB error (training set) • mtry SNPs tried at each split when constructing a tree • ntree decision trees constructed • 5-fold cross validation • Random Forest fit on the training set • Training set resampled by bootstrap sampling • Random sampling with replacement of cases • Cases left out form the OOB data • Model = the Random Forest object output by training • Apply the Random Forest model to the test set • Output = predicted Hg classifications • Compare against the 'observed' Hg classifications (pipeline sketch below)
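Putting the pieces together, a hedged sketch of the tune-then-cross-validate pipeline described above; the grid values, toy data, and variable names are illustrative assumptions.

```r
library(randomForest)

set.seed(1)
snp_matrix <- matrix(rbinom(500 * 50, 1, 0.3), nrow = 500)  # toy SNP data
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 500, replace = TRUE))

# 1) Pick (mtry, ntree) with the lowest OOB error over a small grid
grid <- expand.grid(mtry = c(5, 7, 10), ntree = c(250, 500))
oob  <- apply(grid, 1, function(p) {
  rf <- randomForest(snp_matrix, hg_labels,
                     mtry = p["mtry"], ntree = p["ntree"])
  rf$err.rate[rf$ntree, "OOB"]
})
best <- grid[which.min(oob), ]

# 2) 5-fold CV: train on four folds, predict the held-out fold
folds <- sample(rep(1:5, length.out = nrow(snp_matrix)))
acc <- sapply(1:5, function(k) {
  rf <- randomForest(snp_matrix[folds != k, ], hg_labels[folds != k],
                     mtry = best$mtry, ntree = best$ntree)
  pred <- predict(rf, snp_matrix[folds == k, ])
  mean(pred == hg_labels[folds == k])  # predicted vs. observed Hg
})
mean(acc)  # cross-validated accuracy
```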
R Environment • Bill Venables and David M. Smith • Based on the 'S' (statistical) programming language • A coherent system integrating data manipulation, calculation, and graphical display
PCA • PCA = Principal Component Analysis • A feature selection tool • Which variables are more informative than others? • High-dimensional, hard-to-interpret dataset • Too many variables: 545
PCA • Re-express the dataset in another basis, the principal components (PCs) • Change of basis • Possible dimensionality reduction • Reveals hidden structure and underlying relationships • Which basis best represents the dynamics of interest? • Maximize variance • Minimize covariance (redundancy) • Find the PCs, the new basis vectors
PCA on the Dataset • Eigendecomposition of C_X • Original dataset = X • C_X = (1/n) X Xᵀ, with n = number of samples • Transformed dataset Y = PX • P = an orthonormal matrix • Rows of P = principal components of X • Rows of P = eigenvectors of C_X • C_Y = diagonal covariance matrix of the transformed data Y • Diagonal entries = eigenvalues = variances
PCA • Eigenvalues represent variance • Rank-order the PCs (eigenvectors of the original covariance matrix, i.e., the new variables) by their corresponding eigenvalues • Sub-select k new variables (PCs) from the available pool • k = 64, 100, 200, 300, 400, 545 • Select the first k rank-ordered PCs as input to RF (R sketch below) • Transformed dataset = 21164 × k dimensions
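A sketch of the PCA-then-RF step in R using prcomp; the choice of k and the toy matrix are placeholders for the 21164 × 545 SNP dataset.

```r
library(randomForest)

set.seed(1)
snp_matrix <- matrix(rbinom(500 * 100, 1, 0.3), nrow = 500)  # toy SNP data
hg_labels  <- factor(sample(c("H", "U", "K", "J"), 500, replace = TRUE))

# prcomp rotates the data onto its principal components,
# ordered by decreasing variance (eigenvalue of the covariance matrix)
pca <- prcomp(snp_matrix, center = TRUE)
summary(pca)$importance[2, 1:5]  # variance proportion of the first 5 PCs

# Keep the first k rank-ordered PCs and feed the reduced data to RF
k  <- 64
rf <- randomForest(pca$x[, 1:k], hg_labels, ntree = 500)
rf$err.rate[rf$ntree, "OOB"]
```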
SVM Findings • Macro-averaged accuracy: 88.06% • Micro-averaged accuracy: 96.59% (definitions below)
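These two averages have standard definitions; stating them here is an assumption about how the figures were computed, since the slide gives only the numbers. With G haplogroups, n_g test samples in haplogroup g, and correct_g of them classified correctly:

```latex
\text{micro accuracy} = \frac{\sum_{g=1}^{G} \text{correct}_g}{\sum_{g=1}^{G} n_g}
\qquad
\text{macro accuracy} = \frac{1}{G} \sum_{g=1}^{G} \frac{\text{correct}_g}{n_g}
```

Micro accuracy pools all samples, so well-represented haplogroups dominate it; macro accuracy weights every haplogroup equally, so rare, harder-to-classify Hgs pull it down. This is consistent with the unbalanced-dataset point raised in the Discussion slide.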
Discussion • Unbalanced dataset • Underrepresented haplogroups • Overrepresented haplogroups • Possible remedy: adjust class weights / use coarser Hgs • Bootstrap sampling in RF • Cross validation • RF vs. SVM
Conclusions • RF is stochastic • Random sampling of variables • Random sampling of training cases (samples) • Hence repeated trials are needed • SVM vs. 1-NN • Both are deterministic models • SVM (5FCV) outperforms 1-NN (5FCV)
Acknowledgements • Advisor: Chih Lee • Dr. Chun-Hsi Huang • National Science Foundation REU grant CCF-0755373 • Univ. of Connecticut
Thank you! • Any questions?