Tutorial Overview & Learning Python COMP 4332 Tutorial 1 Feb 11 Zhao Chen zchenah@ust.hk
Course Work • Three or four assignments (20%): • Progress report on the first project • Add cross-validation to the first project • 3-4 quick questions about data mining • Two projects (60%): • KDD Cup 2009 • The first task of KDD Cup 2012 • One term paper (10%) • One presentation (10%)
Project-oriented tutorials • Projects and assignments count for 80% of your grade. • You will write code in several languages and tools. • More importantly, you will run experiments! • Very different from COMP 4331: light on concepts and math, heavy on hands-on work. COMP 4332 = COMP 4331 + COMP 4331
A data mining project requires ... • 1. Explore and preprocess the data. • 2. Try algorithms (SVM, logistic regression, decision trees, dimensionality reduction, etc.) and vary the parameters of each algorithm. • Labor intensive! • Sometimes frustrating. • 3. Summarize the findings, design new methods, and go back to step 2. Repeatedly go back to step 1 to re-process the data so it can be fed into different tools. • The creative part!
1. Explore data/look at the data • Visualization: • 1D data summary: mean, variance, median, skewness; density estimation (pdf), cdf; outliers, etc. • 2D data summary: scatter plot, QQ-plot, correlation scores, etc. • High-dimensional data summary: apply dimensionality reduction and plot in 2D or 3D. • Store the data and extract the parts you need. • Organized: SQL-like queries... • Quick and dirty: write a script for each operation...
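As a rough illustration of the 1D data summary listed on this slide, here is a minimal sketch that uses only the Python 3 standard library (an assumption; the tutorial itself targets Python 2.6 with numpy). The function name `summarize_1d`, the sample values, and the "more than two standard deviations from the mean" outlier rule are all hypothetical choices made for this example.

```python
import statistics


def summarize_1d(values):
    """Return basic 1D summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.pvariance(values)  # population variance
    std = variance ** 0.5
    median = statistics.median(values)
    # Skewness as the third standardized moment (0 for a symmetric sample).
    if std > 0:
        skewness = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    else:
        skewness = 0.0
    # Rule-of-thumb outlier check (an assumption): more than 2 std from the mean.
    outliers = [v for v in values if std > 0 and abs(v - mean) > 2 * std]
    return {
        "mean": mean,
        "variance": variance,
        "median": median,
        "skewness": skewness,
        "outliers": outliers,
    }


# Hypothetical sample with one obviously large value.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]
summary = summarize_1d(sample)
for name in sorted(summary):
    print(name, summary[name])
```

The large value pulls the mean well above the median and gives a positive skewness, which is the kind of signal this first look at the data is meant to surface.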
2. Run experiments using tools • Most of the time, tools are available: Weka, libsvm, etc. Good news :) • Sometimes you need to implement a variant of an existing algorithm: • A different decision tree • A classifier that handles unbalanced data • Run the methods, vary the parameters, and plot the results and trends. • Numerical code is generally hard to write correctly (and hard to DEBUG!). You will do this in this course!
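The "run the methods and vary parameters" step can be sketched as a simple parameter sweep. The toy threshold classifier, the data, and the list of thresholds below are all hypothetical; a real experiment would call a tool such as libsvm instead and would measure accuracy on held-out data rather than on the data used to pick the parameter.

```python
def threshold_classifier(x, threshold):
    """Toy classifier: predict +1 when x reaches the threshold, else -1."""
    return 1 if x >= threshold else -1


def accuracy(data, threshold):
    """Fraction of (x, label) pairs that the toy classifier gets right."""
    correct = sum(1 for x, y in data if threshold_classifier(x, threshold) == y)
    return correct / float(len(data))


# Hypothetical 1D data set: (feature value, label).
toy_data = [(0.1, -1), (0.4, -1), (0.35, -1),
            (0.8, 1), (0.7, 1), (0.45, 1), (0.9, 1)]

# Sweep the parameter and record one result per setting.
results = []
for threshold in [0.2, 0.3, 0.42, 0.5, 0.6]:
    results.append((threshold, accuracy(toy_data, threshold)))

# Pick the setting with the highest recorded accuracy.
best_threshold, best_accuracy = max(results, key=lambda r: r[1])

for threshold, acc in results:
    print("threshold=%.2f accuracy=%.3f" % (threshold, acc))
print("best:", best_threshold, best_accuracy)
```

Keeping every (parameter, score) pair in `results` makes it easy to plot the trend afterwards instead of reporting only the best setting.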
3. Summarize findings and design new methods • After each iteration of steps 1 and 2, you know more about the data and may have new ideas to try by going back to steps 1 and 2. • But before that, document your findings first.
A cloud of tools ... • Data preprocessing: Python, Java/C++, SQL, Excel, text editors, ... • Visualization: Excel, Matlab, R, matplotlib • SVM: libsvm, svmlight, liblinear packages • Logistic regression: liblinear • Decision trees & tree ensembles: Weka, FEST • Matrix factorization: libfm, GraphLab *Bolded tools are the ones we will teach in the tutorials.
Teaching all of them is impossible! • You have to take the time to read the manuals of these tools, and sometimes their source code! • Throughout this course, we will use Python to illustrate: • Data preprocessing (mostly string processing) • Algorithm implementation (numpy/scipy) • Running experiments automatically • Simple plotting (matplotlib) • Sometimes we use R's plotting packages (core, ggplot2) if matplotlib does not fit the requirement.
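Since data preprocessing in Python is mostly string processing, a small sketch may help. It assumes a hypothetical comma-separated input in which missing values appear as empty fields or as `NA`; the function name `parse_record` and the choice to fill missing values with 0.0 are illustrative only.

```python
def parse_record(line, missing=0.0):
    """Turn one comma-separated text line into a list of floats."""
    fields = [field.strip() for field in line.strip().split(",")]
    values = []
    for field in fields:
        # Treat empty fields and "NA" as missing (an assumption for this sketch).
        if field == "" or field.upper() == "NA":
            values.append(missing)
        else:
            values.append(float(field))
    return values


# Hypothetical raw lines with inconsistent spacing and missing entries.
raw_lines = ["1.5, 2,  NA,4\n", "0.5,,3, 7.25\n"]
parsed = [parse_record(line) for line in raw_lines]
print(parsed)
```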
Why Python • Easy to learn and easy to use. • A good tool for illustrating the three steps of a data mining project. • A concise and powerful language. • A glue language: it easily integrates components written in other languages. • Widely used in the IT industry. • Organizations using Python
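One way to read "glue language" is that a Python script can launch an external program and consume its output. The sketch below assumes Python 3 and uses the standard `subprocess` module; to stay self-contained, the "external tool" is just a second Python interpreter, and the helper name `run_tool` is hypothetical. In a real project the command would be something like a compiled training program.

```python
import subprocess
import sys


def run_tool(args):
    """Run an external command and return its standard output as text."""
    output = subprocess.check_output(args)
    return output.decode("utf-8").strip()


# Stand-in for an external tool: another interpreter that prints a number.
command = [sys.executable, "-c", "print(6 * 7)"]
result = run_tool(command)
print("tool said:", result)
```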
Set up Python • Install Python 2.6 (not 2.7, not 3.x) and three packages: numpy, scipy, and matplotlib. • [Recommended] For numpy/scipy on Windows, use http://www.lfd.uci.edu/~gohlke/pythonlibs/ • The default IDE is weak. Recommended IDEs: • PyScripter (Windows only) • ulipad (cross-platform) • Eclipse + pydev (cross-platform) • Or simply the Notepad++ editor with syntax highlighting (Windows only)
Learn Python • The official Python tutorial. Written for experienced programmers. • Read it twice and try every code snippet in the tutorial. • Code Like a Pythonista: Idiomatic Python • Python HOWTOs: sorting, logging, functional programming, etc. • MIT 6.00 course material. • Liang Huang's Python Short Course. • numpy examples and the scipy tutorial. • Best place to ask a Python-related question: http://stackoverflow.com/. It is better to post your Python question on Stack Overflow than to send it to our mailing list.
Play with Python data structures • basic types: bool, integer, float, complex • tuple: (x, y, ...) • list: [x, y, ...] • string: 'hello', "world" • dictionary: { x: a, y: b, ... } • set: set([a, b, c, d]) • iterable/sequence: a unified view of data structures • tuple/list/dictionary/set/string are all iterable.
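The structures listed on this slide can be tried out in a few lines. This is only a small sketch with made-up values; the helper `count_items` is hypothetical and exists to show that one loop works on every iterable.

```python
# tuple: fixed-size, immutable; can be unpacked into names
point = (3, 4)
x, y = point

# list: ordered and mutable
scores = [70, 85, 92]
scores.append(60)

# string: single or double quotes both work
greeting = 'hello' + " " + "world"

# dictionary: maps keys to values
ages = {"alice": 30, "bob": 25}
ages["carol"] = 41

# set: keeps only distinct elements
unique = set([1, 2, 2, 3, 3, 3])


def count_items(iterable):
    """Count elements using nothing but iteration."""
    count = 0
    for _ in iterable:
        count += 1
    return count


# The same function handles a tuple, list, string, dictionary, and set.
sizes = [count_items(c) for c in (point, scores, greeting, ages, unique)]
print(sizes)
```

Iterating over a dictionary visits its keys, which is why `ages` counts as three items here.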
DEMO • 1. Go through basic Python data structures and their operations. • 2. Show Python’s functions and control structures (if-then-else/for/while).
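For readers without access to the live demo, here is a minimal sketch of functions and the three control structures named above. It is not the demo's actual content; the function names and inputs are invented for illustration.

```python
def classify(n):
    """if / elif / else: describe the sign of a number."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"


def sum_of_squares(numbers):
    """for loop: accumulate a result over a sequence."""
    total = 0
    for value in numbers:
        total += value * value
    return total


def count_halvings(n):
    """while loop: count integer halvings needed to bring n down to 1."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps


print(classify(-5), classify(0), classify(7))
print(sum_of_squares([1, 2, 3]))
print(count_halvings(40))
```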
A complete example • Convert the sparse matrix format to libsvm’s format.
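The slide does not say which sparse matrix format is the input, so the sketch below assumes one: a list of zero-based `(row, column, value)` triples plus one label per row. The output follows libsvm's text format, where each line is `label index:value ...` with one-based feature indices in ascending order. The function name `to_libsvm` and the sample data are hypothetical.

```python
def to_libsvm(labels, triples):
    """Convert (row, col, value) triples and per-row labels to libsvm lines."""
    # One {feature_index: value} dictionary per row.
    rows = [dict() for _ in labels]
    for row, col, value in triples:
        if value != 0:
            rows[row][col + 1] = value  # libsvm feature indices start at 1

    lines = []
    for label, features in zip(labels, rows):
        parts = [str(label)]
        # libsvm expects indices in ascending order within a line.
        for index in sorted(features):
            parts.append("%d:%g" % (index, features[index]))
        lines.append(" ".join(parts))
    return lines


# Hypothetical input: two rows, three columns, triples in arbitrary order.
labels = [1, -1]
triples = [(0, 2, 0.5), (0, 0, 1.0), (1, 1, 3.0)]
lines = to_libsvm(labels, triples)
for line in lines:
    print(line)
```

Zero entries are skipped because the libsvm format only stores non-zero features.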