AI class by Dr. Peter Molnar
Presented by Omar ElTayeby & Jonathan Lutu
Midterm Project (Bioinformatics)
• Dataset
• Preprocess
• Clustering techniques
• Visualized Results
• Weka’s report
Data set
The data set is in GROMOS format. GROMOS is a general-purpose molecular dynamics computer simulation package for the study of biomolecular systems. It also incorporates its own force field covering proteins, nucleotides, sugars, etc., and can be applied to chemical and physical systems ranging from glasses and liquid crystals to polymers, crystals, and solutions of biomolecules.
GROMOS fields
• Molecule name
• Atom name
• Index
• X-position
• Y-position
• Z-position
• Velocity in X
• Velocity in Y
• Velocity in Z
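To make the nine fields above concrete, here is a minimal Python sketch that reads one atom record into a typed structure. The `AtomRecord` name, the whitespace-separated layout, and the sample line are assumptions for illustration only; actual coordinate files may use fixed-width columns, in which case a reader would slice by column position instead.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    """One atom record with the nine fields listed on the slide."""
    mol_name: str
    atom_name: str
    index: int
    x: float
    y: float
    z: float
    vx: float
    vy: float
    vz: float


def parse_atom_line(line: str) -> AtomRecord:
    """Parse a whitespace-separated record (assumed layout, see lead-in)."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    x, y, z, vx, vy, vz = (float(p) for p in parts[3:])
    return AtomRecord(parts[0], parts[1], int(parts[2]), x, y, z, vx, vy, vz)


# Hypothetical sample record; the values are invented for illustration.
record = parse_atom_line("SOL OW 1 0.126 1.624 1.679 0.1227 -0.0580 0.0434")
```

For clustering, only the index and the three position fields would be passed on, since the slides list idx, x, y and z as the attributes used.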
Preprocess
The class attribute MolName is ignored using the Ignore attributes panel, so that it is not used for clustering and remains available later for classes-to-clusters evaluation.
Cobweb
Cobweb generates a hierarchical clustering, where clusters are described probabilistically.

Clustered Instances
Cluster   Instances
5          1 (  1%)
6          1 (  1%)
10         3 (  2%)
26         2 (  1%)
32         1 (  1%)
192        2 (  1%)
193        2 (  1%)
194        3 (  2%)
198        2 (  1%)
209        3 (  2%)
210        1 (  1%)
211        3 (  2%)
212        1 (  1%)
213        3 (  2%)
218        1 (  1%)
220       51 ( 26%)

Time taken to build model (percentage split): 1.08 sec

Model and evaluation on training set
Number of merges: 264
Number of splits: 242
Number of clusters: 159
DBSCAN
DBSCAN (density-based spatial clustering of applications with noise) finds a number of clusters starting from the estimated density distribution of the corresponding nodes. OPTICS is a generalized version of it.

Epsilon: 0.9; minimum points: 6
Distance-type: weka.clusterers.forOPTICSAndDBScan.DataObjects.EuclideanDataObject
Number of generated clusters: 1
Elapsed time: .28
Time taken to build model (percentage split): 0.27 seconds

Clustered Instances
0  200 (100%)
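The role of the two parameters reported above (epsilon and minimum points) can be illustrated with a minimal pure-Python DBSCAN sketch. This is not Weka's implementation: the function names, the toy 2-D points, and the use of 3 minimum points instead of 6 are assumptions chosen to keep the example small.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
    return labels


# Toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5),
          (5, 5), (5, 5.5), (5.5, 5), (5.5, 5.5),
          (10, 0)]
tight = dbscan(points, eps=0.9, min_pts=3)   # two clusters plus noise
loose = dbscan(points, eps=20.0, min_pts=3)  # radius covers everything
```

With the larger radius every point lands in cluster 0, which shows how an epsilon that is large relative to the spread of the data can yield a single cluster, one possible reading of the single-cluster result in the report.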
EM
Expectation Maximization is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
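As a rough illustration of the iteration described above, the following sketch fits a one-dimensional mixture of Gaussians by alternating an expectation step (soft cluster memberships) and a maximization step (re-estimated weights, means and standard deviations). It is a simplified stand-in, not Weka's EM: the number of clusters is fixed by the caller rather than chosen by cross-validation, and the function names, toy data, and starting values are assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def em_gmm_1d(data, means, stds, weights, iterations=50):
    """Run a fixed number of EM iterations for a 1-D Gaussian mixture."""
    k = len(means)
    means, stds, weights = list(means), list(stds), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each cluster for each data point.
        resp = []
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], stds[c]) for c in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the weighted points.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            stds[c] = max(math.sqrt(var), 1e-3)  # floor avoids a zero-width cluster
    return means, stds, weights


# Toy data: two well-separated groups around 2 and 8.
data = [1.8, 2.0, 2.2, 2.1, 1.9, 7.9, 8.0, 8.2, 8.1, 7.8]
means, stds, weights = em_gmm_1d(data, means=[0.0, 10.0], stds=[1.0, 1.0],
                                 weights=[0.5, 0.5])
```

The returned weights play the role of the per-cluster priors shown in parentheses in Weka's EM output, and the means and standard deviations correspond to the per-attribute rows of that output.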
Model and evaluation on training set
Number of clusters selected by cross validation: 16

Cluster  Prior   idx mean   idx std. dev.   x mean    x std. dev.   y mean    y std. dev.   z mean   z std. dev.
0        (0.07)  246.6178   19.9752          7.9852   0.3271         8.4068   0.27          3.3292   0.18
1        (0.09)  804.886    26.7839          7.7311   0.3958         9.2372   0.2147        2.3716   0.2703
2        (0.04)   20.4617   11.5775         13.8272   0.2021         7.5027   0.1521        5.7055   0.2598
3        (0.1)   612.7931   30.4438          7.4465   0.2575         8.968    0.4129        4.6546   0.2958
4        (0.04)  922.75     49.9435          2.0155   0.7737        11.6244   0.7935        1.332    0.4
5        (0.03)  927.3333   38.9558         12.3704   0.5815        12.3234   0.7521        1.969    0.8912
6        (0.07)  315.8491   20.7131          8.5341   0.2816         8.7849   0.3153        3.9916   0.2006
7        (0.08)   81.3924   23.7571         14.4801   0.2197         7.3115   0.3351        4.662    0.4319
8        (0)     897         0.8165          7.604    0.0221        17.418    0.0434        8.1537   0.0531
9        (0.09)  167.6901   26.2003          7.138    0.2857         9.2184   0.4128        3.1098   0.2129
10       (0.1)   712.7986   28.5792          8.0863   0.2578         9.5915   0.2687        3.5038   0.3972
11       (0.15)  484.9131   44.3361          8.0573   0.2557         8.8124   0.495         5.4616   0.277
12       (0.02)  930.4286   48.0184          8.6474   3.4636         7.1217   1.2159        7.8138   0.3167
13       (0.03)  910.6667   24.0601          1.9581   0.7386         1.1956   0.5532        0.7634   0.3674
14       (0.04)  937.5      44.7148         11.8996   0.4452         2.0832   1.2309        1.481    0.6436
15       (0.06)  380.1199   17.3318          8.2963   0.1815         9.3408   0.3056        4.6643   0.2155

Time taken to build model (full training data): 253.03 seconds
Model and evaluation on test split
Number of clusters selected by cross validation: 21

Cluster  Prior   idx mean   idx std. dev.   x mean    x std. dev.   y mean    y std. dev.   z mean   z std. dev.
0        (0.06)  340.2617   17.147           8.497    0.2884         9.0012   0.2471        4.2321   0.1705
1        (0.05)  747.7811   15.2513          8.2507   0.2511         9.4721   0.2078        2.9967   0.2349
2        (0.04)  192.434    12.1503          7.2938   0.2536         8.861    0.23          3.0684   0.2255
3        (0)     897         0.8165          7.604    0.0221        17.418    0.0434        8.1537   0.0531
4        (0.02)  933.1538   52.9322          9.569    4.1074         6.8537   0.9836        7.91     0.2977
5        (0.06)  636.9508   17.4918          7.4795   0.2279         9.2455   0.2965        4.423    0.195
6        (0.1)   480.8214   31.268           8.0972   0.2877         8.8418   0.4287        5.5719   0.236
7        (0.05)   27.4526   14.1443         13.9794   0.302          7.512    0.1569        5.6006   0.2864
8        (0.03)  935.2857   37.2663         12.4413   0.6191        12.4283   0.7757        1.9575   0.9295
9        (0.05)  235.4715   12.8201          7.8499   0.2739         8.3597   0.2937        3.223    0.1207
10       (0.04)  149.1512   14.7659          7.0417   0.2514         9.4832   0.2574        3.091    0.2089
11       (0.06)  286.6632   16.2191          8.4184   0.2665         8.5554   0.2856        3.7014   0.1553
12       (0.07)  812.753    21.6405          7.6669   0.3613         9.2178   0.2311        2.2863   0.2364
13       (0.07)  400.2187   18.0807          8.1934   0.1803         9.4361   0.2947        4.9377   0.2728
14       (0.04)  925.3793   49.4141          2.0321   0.7911        11.5595   0.802         1.4421   0.3462
15       (0.04)  106.2973    9.096          14.5495   0.1834         7.5845   0.1449        4.1988   0.2254
16       (0.05)  698.0267   15.9375          8.0057   0.1929         9.6291   0.2832        3.7211   0.2091
17       (0.04)   70.8656   11.0441         14.4446   0.2498         7.0234   0.231         4.8602   0.141
18       (0.04)  940.4333   43.6621         11.8423   0.4449         2.0213   1.246         1.5025   0.6132
19       (0.03)  910.65     24.6521          1.9335   0.8158         1.1985   0.61          0.671    0.2932
20       (0.07)  568.1085   20.9748          7.6212   0.3249         8.5245   0.3228        5.0285   0.1529

Time taken to build model (percentage split): 306.58 seconds
Clustered Instances
Cluster   Instances
0          9 (  5%)
1         11 (  6%)
2          5 (  3%)
4          8 (  4%)
5         14 (  7%)
6         22 ( 11%)
7         10 (  5%)
8          6 (  3%)
9          7 (  4%)
10        18 (  9%)
11        11 (  6%)
12        15 (  8%)
13         6 (  3%)
14         7 (  4%)
15         3 (  2%)
16        12 (  6%)
17         6 (  3%)
18         6 (  3%)
19         7 (  4%)
20        17 (  9%)

Log likelihood: -8.08141
FarthestFirst

Model and evaluation on test split
Cluster centroids (idx, x, y, z):
Cluster 0: 913.0, 12.556, 12.62, 2.038
Cluster 1: 10.0, 13.625, 7.267, 5.991
Time taken to build model (percentage split): 0 seconds

Model and evaluation on training set
Cluster centroids (idx, x, y, z):
Cluster 0: 986.0, 2.631, 10.303, 0.704
Cluster 1: 18.0, 13.987, 7.58, 6.04
Time taken to build model (full training data): 0 seconds

Clustered Instances
0  100 ( 50%)
1  100 ( 50%)
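The slide gives only the output for FarthestFirst, so the following is a minimal sketch of the farthest-first traversal idea the clusterer is named after: start from one data point, then repeatedly add the point that is farthest from its nearest already-chosen center, and finally assign every point to its closest center. The function name, the seed handling, and the toy data are assumptions, not Weka's code.

```python
import math
import random


def farthest_first(points, k, seed=1):
    """Pick k centers by farthest-first traversal and label each point."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]  # first center: random point
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        far_idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[far_idx])
        nearest = [min(d, math.dist(p, centers[-1]))
                   for d, p in zip(nearest, points)]
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
              for p in points]
    return centers, labels


# Toy data: two groups far apart.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, labels = farthest_first(points, k=2)
```

Because the centers are always actual data points, a single pass over the data per center is enough, which fits the near-zero build times reported above. The whole-number idx values of the reported centroids are also consistent with centers being real instances, although the slides do not state this.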
Filtered Clusterer

Clusterer Model
kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 117.60638481179521
Missing values globally replaced with mean/mode

Cluster centroids:
                          Cluster#
Attribute   Full Data     0          1
            (1000)        (122)      (878)
============================================
idx         500.5         61.5       561.5
x           8.5681        14.2669    7.7762
y           8.5418        7.3739     8.7041
z           3.8938        5.0026     3.7397

Time taken to build model (full training data): 0.01 seconds

Model and evaluation on test split
=== FilteredClusterer using weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10 on data filtered through weka.filters.AllFilter

Filtered Header
@relation bio-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.AllFilter

Clusterer Model
kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 92.56130760800468
Missing values globally replaced with mean/mode

Cluster centroids:
                          Cluster#
Attribute   Full Data     0          1
            (800)         (103)      (697)
============================================
idx         495.4513      63.1068    559.3415
x           8.6563        14.2843    7.8246
y           8.5432        7.3803     8.7151
z           3.892         4.9765     3.7317

Time taken to build model (percentage split): 0.01 seconds

Clustered Instances
0   19 ( 10%)
1  181 ( 91%)
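The quantities in the kMeans report above (number of iterations, cluster centroids, within-cluster sum of squared errors) can be related to the algorithm with a small pure-Python sketch. It is an illustration under assumptions, not Weka's SimpleKMeans: the function name, the toy points, and the hand-picked starting centroids are invented, and Weka may compute its error on normalized attributes, so the magnitudes are not comparable.

```python
import math


def kmeans(points, centroids, max_iter=500):
    """Lloyd's algorithm: returns centroids, labels, squared error, iterations."""
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    sse = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, sse, iteration


# Toy data: two groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels, sse, iterations = kmeans(
    points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

The `max_iter=500` default mirrors the `-I 500` option in the scheme line, and passing two starting centroids corresponds to `-N 2`. On well-separated data the loop stops after very few passes, as in the two iterations reported.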
Hierarchical Clusterer
Spatial analysis technique

Time taken to build model (full training data): 1.19 seconds
Time taken to build model (percentage split): 0.78 seconds

Clustered Instances
0  200 (100%)
Relation: bio
Instances: 1000
Attributes: 5
idx
x
y
z