Midterm Project (Bioinformatics)

AI class by Dr. Peter Molnar. Presented by Omar ElTayeby & Jonathan Lutu. Contents: Dataset, Preprocess, Clustering techniques, Visualized Results, Weka’s report.

Presentation Transcript


  1. AI class by Dr. Peter Molnar. Presented by Omar ElTayeby & Jonathan Lutu. Midterm Project (Bioinformatics). Outline: Dataset, Preprocess, Clustering techniques, Visualized Results, Weka’s report.

  2. Data set. The data is in GROMOS format. GROMOS is a general-purpose molecular dynamics computer simulation package for the study of biomolecular systems. It also incorporates its own force field covering proteins, nucleotides, sugars, etc., and can be applied to chemical and physical systems ranging from glasses and liquid crystals to polymers, crystals, and solutions of biomolecules.

  3. GROMOS fields
     • Molecule name
     • Atom name
     • Index
     • X-position
     • Y-position
     • Z-position
     • Velocity in X
     • Velocity in Y
     • Velocity in Z
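
To make the field list concrete, here is a minimal sketch of reading one record into a typed structure. It assumes the nine fields appear in the order listed and are separated by whitespace; the real file layout may be fixed-width, and the names `AtomRecord`, `parse_record`, and the sample line are hypothetical, invented only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    mol_name: str
    atom_name: str
    index: int
    position: tuple  # (x, y, z)
    velocity: tuple  # (vx, vy, vz)


def parse_record(line: str) -> AtomRecord:
    """Parse one whitespace-separated record with the nine fields listed above."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    x, y, z, vx, vy, vz = (float(p) for p in parts[3:])
    return AtomRecord(parts[0], parts[1], int(parts[2]), (x, y, z), (vx, vy, vz))


# Hypothetical record, not taken from the project's data set.
rec = parse_record("1SOL OW 1 0.126 1.624 1.679 0.1227 -0.0580 0.0434")
```

Only the index and the three positions (`idx`, `x`, `y`, `z`) appear as attributes in the Weka relation shown in the later slides.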

  4. Preprocessing

  5. Visualized Results

  6. The class attribute MolName is ignored using the ignore attributes panel, so that it can be used later for classes-to-clusters evaluation.
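
The idea behind evaluating clusters against a held-out class attribute can be sketched as follows. This is a simplified majority-vote version written for illustration, not Weka's own procedure, whose cluster-to-class mapping may differ; the function name and the `SOL`/`PRO` labels are hypothetical.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, class_labels):
    """Map each cluster to its most frequent class and count mismatches."""
    by_cluster = defaultdict(Counter)
    for c, label in zip(cluster_ids, class_labels):
        by_cluster[c][label] += 1
    # Majority class per cluster (an assumption of this sketch).
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    wrong = sum(1 for c, label in zip(cluster_ids, class_labels) if mapping[c] != label)
    return mapping, wrong


# Hypothetical cluster assignments and molecule names.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["SOL", "SOL", "PRO", "PRO", "PRO", "PRO"]
mapping, wrong = classes_to_clusters(clusters, labels)
```

The class attribute has to be hidden from the clusterer itself, which is why it is ignored during model building.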

  7. Cobweb
     Cobweb generates hierarchical clustering, where clusters are described probabilistically.

     Clustered Instances
       5        1 (  1%)
       6        1 (  1%)
      10        3 (  2%)
      26        2 (  1%)
      32        1 (  1%)
     192        2 (  1%)
     193        2 (  1%)
     194        3 (  2%)
     198        2 (  1%)
     209        3 (  2%)
     210        1 (  1%)
     211        3 (  2%)
     212        1 (  1%)
     213        3 (  2%)
     218        1 (  1%)
     220       51 ( 26%)

     Time taken to build model (percentage split) : 1.08 sec

     Model and evaluation on training set
     Number of merges: 264
     Number of splits: 242
     Number of clusters: 159

  8. DBScan
     DBSCAN (density-based spatial clustering of applications with noise) finds a number of clusters starting from the estimated density distribution of the corresponding nodes. OPTICS is a generalized version of it.

     Epsilon: 0.9; minimum points: 6
     Distance-type: weka.clusterers.forOPTICSAndDBScan.DataObjects.EuclideanDataObject
     Number of generated clusters: 1
     Elapsed time: .28
     Time taken to build model (percentage split) : 0.27 seconds

     Clustered Instances
     0      200 (100%)
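
The roles of epsilon and the minimum number of points can be illustrated with a small pure-Python sketch of the textbook DBSCAN procedure. It is not Weka's implementation; the sample points are invented, and `min_pts` is set to 3 (not the 6 used in the run above) so that the tiny example forms clusters.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Invented sample: two dense groups and one isolated point.
points = [(0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5),
          (5, 5), (5, 5.5), (5.5, 5), (5.5, 5.5),
          (20, 20)]
labels = dbscan(points, eps=0.9, min_pts=3)
```

When epsilon is large relative to the spacing of the data, every point becomes density-reachable from every other, which is one way a run can end with a single cluster holding all instances.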

  9. EM
     Expectation Maximization is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
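
As a rough illustration of the iteration, the sketch below runs EM on a one-dimensional mixture of Gaussians, where the latent variable is the unknown cluster membership of each point. It is a simplified stand-in, not Weka's EM clusterer: the data, the starting values, and the fixed iteration count are assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_gmm_1d(data, means, stds, weights, iterations=50, min_std=1e-3):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    k = len(means)
    means, stds, weights = list(means), list(stds), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [weights[c] * gaussian_pdf(x, means[c], stds[c]) for c in range(k)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate prior, mean, and std. dev. of each component.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            stds[c] = max(math.sqrt(var), min_std)  # keep std. dev. away from zero
    return means, stds, weights


# Invented data: two well-separated groups around 1 and 8.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 7.9, 8.1, 8.0, 8.2, 7.8]
means, stds, weights = em_gmm_1d(data, means=[0.0, 10.0], stds=[1.0, 1.0], weights=[0.5, 0.5])
```

The per-cluster prior, mean, and standard deviation estimated here play the same role as the values Weka reports for each attribute in the EM output.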

  10. Model and evaluation on training set
      Number of clusters selected by cross validation: 16

      One row per cluster; the prior probability is in parentheses; sd = std. dev.

      Cluster  (prior)   idx mean    idx sd    x mean     x sd    y mean     y sd    z mean     z sd
            0   (0.07)   246.6178   19.9752    7.9852   0.3271    8.4068     0.27    3.3292     0.18
            1   (0.09)    804.886   26.7839    7.7311   0.3958    9.2372   0.2147    2.3716   0.2703
            2   (0.04)    20.4617   11.5775   13.8272   0.2021    7.5027   0.1521    5.7055   0.2598
            3    (0.1)   612.7931   30.4438    7.4465   0.2575     8.968   0.4129    4.6546   0.2958
            4   (0.04)     922.75   49.9435    2.0155   0.7737   11.6244   0.7935     1.332      0.4
            5   (0.03)   927.3333   38.9558   12.3704   0.5815   12.3234   0.7521     1.969   0.8912
            6   (0.07)   315.8491   20.7131    8.5341   0.2816    8.7849   0.3153    3.9916   0.2006
            7   (0.08)    81.3924   23.7571   14.4801   0.2197    7.3115   0.3351     4.662   0.4319
            8      (0)        897    0.8165     7.604   0.0221    17.418   0.0434    8.1537   0.0531
            9   (0.09)   167.6901   26.2003     7.138   0.2857    9.2184   0.4128    3.1098   0.2129
           10    (0.1)   712.7986   28.5792    8.0863   0.2578    9.5915   0.2687    3.5038   0.3972
           11   (0.15)   484.9131   44.3361    8.0573   0.2557    8.8124    0.495    5.4616    0.277
           12   (0.02)   930.4286   48.0184    8.6474   3.4636    7.1217   1.2159    7.8138   0.3167
           13   (0.03)   910.6667   24.0601    1.9581   0.7386    1.1956   0.5532    0.7634   0.3674
           14   (0.04)      937.5   44.7148   11.8996   0.4452    2.0832   1.2309     1.481   0.6436
           15   (0.06)   380.1199   17.3318    8.2963   0.1815    9.3408   0.3056    4.6643   0.2155

      Time taken to build model (full training data) : 253.03 seconds

  11. Model and evaluation on test split
      Number of clusters selected by cross validation: 21

      One row per cluster; the prior probability is in parentheses; sd = std. dev.

      Cluster  (prior)   idx mean    idx sd    x mean     x sd    y mean     y sd    z mean     z sd
            0   (0.06)   340.2617    17.147     8.497   0.2884    9.0012   0.2471    4.2321   0.1705
            1   (0.05)   747.7811   15.2513    8.2507   0.2511    9.4721   0.2078    2.9967   0.2349
            2   (0.04)    192.434   12.1503    7.2938   0.2536     8.861     0.23    3.0684   0.2255
            3      (0)        897    0.8165     7.604   0.0221    17.418   0.0434    8.1537   0.0531
            4   (0.02)   933.1538   52.9322     9.569   4.1074    6.8537   0.9836      7.91   0.2977
            5   (0.06)   636.9508   17.4918    7.4795   0.2279    9.2455   0.2965     4.423    0.195
            6    (0.1)   480.8214    31.268    8.0972   0.2877    8.8418   0.4287    5.5719    0.236
            7   (0.05)    27.4526   14.1443   13.9794    0.302     7.512   0.1569    5.6006   0.2864
            8   (0.03)   935.2857   37.2663   12.4413   0.6191   12.4283   0.7757    1.9575   0.9295
            9   (0.05)   235.4715   12.8201    7.8499   0.2739    8.3597   0.2937     3.223   0.1207
           10   (0.04)   149.1512   14.7659    7.0417   0.2514    9.4832   0.2574     3.091   0.2089
           11   (0.06)   286.6632   16.2191    8.4184   0.2665    8.5554   0.2856    3.7014   0.1553
           12   (0.07)    812.753   21.6405    7.6669   0.3613    9.2178   0.2311    2.2863   0.2364
           13   (0.07)   400.2187   18.0807    8.1934   0.1803    9.4361   0.2947    4.9377   0.2728
           14   (0.04)   925.3793   49.4141    2.0321   0.7911   11.5595    0.802    1.4421   0.3462
           15   (0.04)   106.2973     9.096   14.5495   0.1834    7.5845   0.1449    4.1988   0.2254
           16   (0.05)   698.0267   15.9375    8.0057   0.1929    9.6291   0.2832    3.7211   0.2091
           17   (0.04)    70.8656   11.0441   14.4446   0.2498    7.0234    0.231    4.8602    0.141
           18   (0.04)   940.4333   43.6621   11.8423   0.4449    2.0213    1.246    1.5025   0.6132
           19   (0.03)     910.65   24.6521    1.9335   0.8158    1.1985     0.61     0.671   0.2932
           20   (0.07)   568.1085   20.9748    7.6212   0.3249    8.5245   0.3228    5.0285   0.1529

      Time taken to build model (percentage split) : 306.58 seconds

  12. Clustered Instances
       0        9 (  5%)
       1       11 (  6%)
       2        5 (  3%)
       4        8 (  4%)
       5       14 (  7%)
       6       22 ( 11%)
       7       10 (  5%)
       8        6 (  3%)
       9        7 (  4%)
      10       18 (  9%)
      11       11 (  6%)
      12       15 (  8%)
      13        6 (  3%)
      14        7 (  4%)
      15        3 (  2%)
      16       12 (  6%)
      17        6 (  3%)
      18        6 (  3%)
      19        7 (  4%)
      20       17 (  9%)

      Log likelihood: -8.08141

  13. FarthestFirst

      Model and evaluation on test split
      Cluster centroids:
      Cluster 0     913.0   12.556    12.62    2.038
      Cluster 1      10.0   13.625    7.267    5.991
      Time taken to build model (percentage split) : 0 seconds

      Model and evaluation on training set
      Cluster centroids:
      Cluster 0     986.0    2.631   10.303    0.704
      Cluster 1      18.0   13.987     7.58     6.04
      Time taken to build model (full training data) : 0 seconds

      Clustered Instances
      0      100 ( 50%)
      1      100 ( 50%)
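
The farthest-first idea can be sketched in a few lines: start from one point, then repeatedly pick the point farthest from all centers chosen so far, and finally assign every point to its nearest center. This is a generic illustration, not Weka's code; the fixed starting index and the sample points are assumptions (an implementation may choose the first center at random).

```python
import math


def farthest_first(points, k, first=0):
    """Pick k centers by farthest-first traversal, then assign points to them."""
    centers = [points[first]]
    # Distance from each point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[i])
        nearest = [min(d, math.dist(p, points[i])) for d, p in zip(nearest, points)]
    assignment = [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]
    return centers, assignment


# Invented sample points.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
centers, assignment = farthest_first(points, k=2)
```

Because the centers are actual data points chosen in a single pass with no refinement, the build time is very small.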

  14. Filtered Clusterer

      Clusterer Model

      kMeans
      ======
      Number of iterations: 2
      Within cluster sum of squared errors: 117.60638481179521
      Missing values globally replaced with mean/mode

      Cluster centroids:
                                Cluster#
      Attribute    Full Data          0          1
                      (1000)      (122)      (878)
      ============================================
      idx              500.5       61.5      561.5
      x               8.5681    14.2669     7.7762
      y               8.5418     7.3739     8.7041
      z               3.8938     5.0026     3.7397

      Time taken to build model (full training data) : 0.01 seconds

      Model and evaluation on test split
      === FilteredClusterer using weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10 on data filtered through weka.filters.AllFilter

      Filtered Header
      @relation bio-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.AllFilter

      Clusterer Model

      kMeans
      ======
      Number of iterations: 2
      Within cluster sum of squared errors: 92.56130760800468
      Missing values globally replaced with mean/mode

      Cluster centroids:
                                Cluster#
      Attribute    Full Data          0          1
                       (800)      (103)      (697)
      ============================================
      idx           495.4513    63.1068   559.3415
      x               8.6563    14.2843     7.8246
      y               8.5432     7.3803     8.7151
      z                3.892     4.9765     3.7317

      Time taken to build model (percentage split) : 0.01 seconds

      Clustered Instances
      0       19 ( 10%)
      1      181 ( 91%)
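
For readers unfamiliar with the quantities in this report, the sketch below shows a plain k-means loop and how a within-cluster sum of squared errors is computed. It is a generic illustration with two clusters and a 500-iteration cap mirroring the `-N 2` and `-I 500` options above; it is not Weka's SimpleKMeans, which may preprocess attributes differently, and the sample points and starting centroids are invented.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_points(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, centroids, max_iter=500):
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        assign = assign_points(points, centroids)
        new = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                # New centroid is the per-attribute mean of the cluster members.
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(centroids[c])  # keep an empty cluster's centroid
        if new == centroids:  # converged: centroids stopped moving
            break
        centroids = new
    assign = assign_points(points, centroids)
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))
    return centroids, assign, sse


# Invented sample points; the initial centroids are two of the points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assign, sse = kmeans(points, centroids=[points[0], points[3]])
```

A lower sum of squared errors means tighter clusters, but the value depends on the number of instances, so the full-data and test-split figures are not directly comparable.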

  15. Hierarchical Clusterer: a spatial analysis technique

  16. Time taken to build model (full training data) : 1.19 seconds
      Time taken to build model (percentage split) : 0.78 seconds

      Clustered Instances
      0      200 (100%)

  17. Relation: bio
      Instances: 1000
      Attributes: 5
        idx
        x
        y
        z
