140 likes | 639 Views
BIOLOGICAL Data Mining. John P. Russo Associate Professor Department of Computer Science and Systems Wentworth Institute of Technology. Overview. Part of Bioinformatics “Concentration” Follow-on to other Bioinformatics courses in curriculum Required Background Database Management Systems
E N D
BIOLOGICAL Data Mining John P. Russo Associate Professor Department of Computer Science and Systems Wentworth Institute of Technology
Overview • Part of Bioinformatics “Concentration” • Follow-on to other Bioinformatics courses in curriculum • Required Background • Database Management Systems • Introduction to Bioinformatics • Probability and Statistics for Engineers • (Optional) Biostatistics
Course Structure • Three 50 minute lectures per week • One two hour lab per week • Additional homework assignments • Semester long team project • Textbook: • Witten, Frank and Hall: Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufman. ISBN: 978-0-12-374856-0 • Additional selected readings
Topics • Classification • OneR • Naïve Bayes • Decision Trees • Rules • Association • Item Sets • Association Rules • Linear Models • Linear Regression • Logistic Regression • Instance-Based Learning • Nearest Neighbor • Clustering • K-Means • Hierarchical • Visualization
Software • Scripting Language (Perl) • Used for data cleaning and preparation • Weka • Open source • Machine learning algorithms • Regression • Association Rules • Classification • Clustering • Decision tree tools
Sample Lab Preparing and Mining Data • Convert cystic-fibrosis-genes.csv to a Weka file: cystic.arff • Use CLASS as a target field and run J48 on the data set with “Use training set” • Create two subsets: cystic-train.arff with the first 90 samples of the data and cystic-data.arffwith the remaining samples • Train J48 on the training data set. What decision tree do you get and what accuracy? • Next, use cystic-data.arff as the test set. How does the decision tree compare to step 4? • Remove the field Source from the classifier. Repeat steps 4 and 5. What do you observe? Does the accuracy on the test set improve?
Sample Project • Predict disease classes using genetic microarray data • Data • Gene data is in genes-in-rows format, comma-separated values. Take final_project_data.zip file from Data directory, and unzip to extract 3 files: pp5i_train.gr.csv (training data, 1.7 MB) • pp5i_train_class.txt (training data classes) • pp5i_test.gr.csv (test data, 0.6MB) • Instructions • Training data: file pp5i_train.gr.csv, with with 7070 genes (no Affy controls) for 69 samples. A separate file pp5i_train_class.txt has classes for each sample, in the order corresponding to the order of samples in pp5i_train.gr.csv. There are 5 classes, labelled EPD, JPA, MED, MGL, RHB. • Test data: file pp5i_test.gr.csv, with 23 unlabelled samples and same genes. You can assume that the class distribution is similar. • Your goal is to learn the best model from the training data and use it to predict the label (class) for each sample in test data. You will also need to write a paper describing your effort. • Randomization experiments showed that one can get about 10-12 (from 23) correct answers with random guessing. • The final grade will be a combination of effort (40%), presentation (30%), and the accuracy of prediction, measured as 3*(Number_correct_answers - 11). The maximum grade is 106. • Below are suggested steps for doing this experiment, but you can vary and improve on the suggested approach, as long as you produce a prediction for the test set and describe your results. Source: KDNuggets.com data mining course: http://www.kdnuggets.com/data_mining_course/assignments/final-project.html
Conclusion • Course provides students with an introduction to data mining with biological data sets • Student projects could use toolset introduced in class or other programming languages • Possible for students to build upon knowledge gained in Bioinformatics Algorithms