
COMP 4332 Tutorial 1 Feb 11 Zhao Chen zchenah@ust.hk


Presentation Transcript


  1. Tutorial Overview & Learning Python COMP 4332 Tutorial 1 Feb 11 Zhao Chen zchenah@ust.hk

  2. Course Work • Three or four assignments (20%) • Progress report of the first project • Add cross-validation to the first project • 3-4 quick questions about data mining • Two projects (60%) • KDD Cup 2009 • The first task of KDD Cup 2012 • One term paper (10%) • One presentation (10%)

  3. Project-oriented tutorials • Projects and assignments count for 80% of your grade. • You will write code in a few languages/tools. • More importantly, you will do experiments! • Very different from COMP4331. Light on concepts/math. A heavily hands-on course. COMP 4332 = COMP 4331+ COMP 4331

  4. A data mining project requires ... • 1. Explore and preprocess the data. • 2. Try algorithms (SVM, Logistic Regression, Decision Trees, Dimensionality Reduction, etc.) and vary the parameters of each algorithm. • Labor intensive! • Sometimes frustrating. • 3. Summarize findings, design new methods, and go back to step 2. Repeatedly return to step 1 to reprocess the data so it can be fed into different tools. The creative part!

  5. 1. Explore data/look at the data • Visualization: • 1D data summary: mean, variance, median, skewness; density estimation (pdf), cdf; outliers, etc. • 2D data summary: scatter plot, QQ-plot, correlation scores, etc. • High-dimensional data summary: dimensionality reduction, then plot in 2D or 3D • Store data and extract the parts you want. • Organized: SQL-like queries... • Quick and dirty: write a script for each operation...
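The 1D summary statistics listed on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function name `summarize` and the sample values are made up here, the variance is the population variance, and in practice numpy/scipy would likely be used instead.

```python
import math

def summarize(values):
    """Return mean, population variance, median and skewness of a list of numbers."""
    n = len(values)
    mean = sum(values) / float(n)
    variance = sum((v - mean) ** 2 for v in values) / n

    # Median: middle element of the sorted data (average of the two middle ones if n is even).
    ordered = sorted(values)
    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0

    # Skewness: third standardized moment. Returning 0.0 for constant data is a convention chosen here.
    std = math.sqrt(variance)
    if std == 0:
        skewness = 0.0
    else:
        skewness = sum((v - mean) ** 3 for v in values) / (n * std ** 3)

    return {"mean": mean, "variance": variance, "median": median, "skewness": skewness}

# Hypothetical sample with one large value, which pulls the mean above the median.
sample = [1.0, 2.0, 2.0, 3.0, 12.0]
summary = summarize(sample)
```

A mean well above the median together with a positive skewness is a quick hint that the data has a long right tail or an outlier worth looking at.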

  6. 2. Run experiments using tools • Most of the time, tools are available. • Weka, libsvm, etc. • Sometimes, you need to implement a variant of an existing algorithm. • A different decision tree • A classifier that handles unbalanced data • Run the methods, vary the parameters, and plot the results and trends. Good news :) Numerical code is generally hard to write correctly (hard to DEBUG!). You will do this in this course!
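"Run the methods and vary parameters" usually ends up as a loop over a parameter grid. The sketch below assumes a hypothetical `run_experiment` function that stands in for training and scoring a real model (it just returns a made-up score), and the parameter names `c` and `gamma` are placeholders.

```python
import itertools

def run_experiment(c, gamma):
    """Hypothetical stand-in for 'train a model and return its accuracy'."""
    return 1.0 / (1.0 + abs(c - 10) / 10.0 + abs(gamma - 0.1))

def sweep(grid):
    """Run one experiment per combination of parameter values in `grid`."""
    names = sorted(grid)
    results = []
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        results.append((run_experiment(**params), params))
    return results

# Hypothetical grid: 3 x 3 = 9 experiments.
grid = {"c": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
results = sweep(grid)
best_score, best_params = max(results, key=lambda r: r[0])
```

Keeping every (score, parameters) pair, rather than only the best one, makes it easy to plot trends afterwards.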

  7. 3. Summarize findings and design new methods • After each iteration of steps 1 and 2, you know more about the data. You may have new ideas and go back to steps 1 and 2. • But before that, first document your findings.

  8. A cloud of tools ... • Data preprocessing: Python, Java/C++, SQL, Excel, text editors, ... • Visualization: Excel, Matlab, R, matplotlib • SVM: libsvm, svmlight, liblinear packages • Logistic regression: liblinear • Decision Trees & tree ensemble: Weka, FEST • Matrix factorization: libfm, GraphLab *Bolded tools are the ones we will teach in the tutorials.

  9. Teaching all of them is impossible! • You have to take time to read the manuals of these tools, and sometimes their source code! • Throughout this course, we will use Python to illustrate: • Data preprocessing (mostly string processing) • Algorithm implementation (numpy/scipy) • Automatically performing experiments • Simple plotting (matplotlib) • Sometimes, we use R’s plotting packages (core, ggplot2) if matplotlib does not fit the requirement.
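As a rough idea of what "mostly string processing" means for data preprocessing, the sketch below parses one line of a tab-separated file. The file layout, the function name `parse_record`, and the rule that empty fields count as missing are all assumptions made for illustration.

```python
def parse_record(line, missing=None):
    """Split one tab-separated line; convert numeric fields to float,
    keep other text as strings, and map empty fields to `missing`."""
    fields = []
    for raw in line.rstrip("\n").split("\t"):
        text = raw.strip()
        if text == "":
            fields.append(missing)
            continue
        try:
            fields.append(float(text))
        except ValueError:
            # Not a number: keep the cleaned-up string (e.g. a categorical value).
            fields.append(text)
    return fields

# Hypothetical input line: a number, a missing value, a category, a number.
record = parse_record("42\t\tred \t3.5\n")
```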

  10. Why Python • Easy to learn and easy to use. • A good tool for us to illustrate the three steps of doing a data mining project. • A concise and powerful language. • A glue language: it easily integrates components written in other languages. • Widely used in the IT industry. Organizations using Python

  11. Set up Python • Install Python 2.6 (not 2.7, not 3.x) and three packages: numpy, scipy and matplotlib. • [Recommended] For numpy/scipy on Windows, use http://www.lfd.uci.edu/~gohlke/pythonlibs/ • The default IDE is weak. Recommended IDEs: • PyScripter (Windows only) • ulipad (cross-platform) • Eclipse + pydev (cross-platform) • Or simply the Notepad++ editor with syntax highlighting (Windows only)

  12. Learn Python • The official Python tutorial. Written for experienced programmers. • Read it twice and try every code snippet in the tutorial. • Code Like a Pythonista: Idiomatic Python • Python HOWTOs: sorting, logging, functional programming, etc. • MIT 6.00 course material. • Liang Huang’s Python Short Course. • numpy examples and scipy tutorial. • Best place to ask a Python-related question: http://stackoverflow.com/. It is better to send your Python question to Stack Overflow than to our mailing list.

  13. Play with Python data structures • basic types: bool, integer, float, complex • tuple: (x, y, ...) • list: [x, y, ...] • string: ‘hello’, “world” • dictionary: { x: a, y: b, ... } • set: set([a, b, c, d]) • iterable/sequence: a unified view for data structures • tuple/list/dictionary/set/string are all iterable.
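A small sketch of the types listed on this slide, with made-up variable names and values. The last few lines show the "unified view": the same built-ins and loops work on every container.

```python
flag = True                      # bool
count = 3                        # integer
ratio = 0.5                      # float
z = 1 + 2j                       # complex
point = (1, 2)                   # tuple: immutable sequence
items = [3, 1, 2]                # list: mutable sequence
name = 'hello'                   # string
ages = {'ann': 30, 'bob': 25}    # dictionary: key -> value
unique = set([1, 2, 2, 3])       # set: duplicates are dropped

# All five containers are iterable, so len(), sorted(), sum(), list() and for-loops apply to each.
lengths = [len(c) for c in (point, items, name, ages, unique)]
sorted_items = sorted(items)     # returns a new sorted list; `items` is unchanged
letters = list(name)             # iterating a string yields its characters
keys = sorted(ages)              # iterating a dictionary yields its keys
total = sum(unique)
```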

  14. DEMO • 1. Go through basic Python data structures and their operations. • 2. Show Python’s functions and control structures (if-then-else/for/while).
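The demo itself is not part of this transcript. As a stand-in, here is a minimal sketch of functions and the three control structures named above; all function names and inputs are invented for illustration.

```python
def classify(n):
    """if / elif / else: return a label describing the sign of n."""
    if n < 0:
        return 'negative'
    elif n == 0:
        return 'zero'
    else:
        return 'positive'

def sum_of_squares(numbers):
    """for: accumulate over any iterable of numbers."""
    total = 0
    for x in numbers:
        total += x * x
    return total

def count_down(start):
    """while: repeat until the condition becomes false."""
    steps = []
    while start > 0:
        steps.append(start)
        start -= 1
    return steps

labels = [classify(n) for n in (-2, 0, 5)]
squares = sum_of_squares([1, 2, 3])
steps = count_down(3)
```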

  15. A complete example • Convert the sparse matrix format to libsvm’s format.
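The slide does not say which sparse matrix format is the input, so the sketch below assumes one: a list of zero-based `(row, column, value)` triplets plus one label per row. The output follows libsvm's text format, where each line is `<label> <index>:<value> ...` with 1-based feature indices in ascending order and zero entries omitted. The function name `triplets_to_libsvm` and the sample data are hypothetical.

```python
from collections import defaultdict

def triplets_to_libsvm(labels, triplets):
    """Convert (row, col, value) triplets with zero-based indices into libsvm-format lines."""
    rows = defaultdict(dict)
    for row, col, value in triplets:
        if value != 0:
            rows[row][col + 1] = value   # libsvm feature indices start at 1

    lines = []
    for row, label in enumerate(labels):
        # Feature indices must appear in ascending order within a line.
        features = " ".join("%d:%g" % (idx, rows[row][idx]) for idx in sorted(rows[row]))
        lines.append(("%s %s" % (label, features)).strip())
    return lines

# Hypothetical 3 x 4 matrix; the explicit zero in row 2 is dropped.
labels = [1, -1, 1]
triplets = [(0, 3, 2.0), (0, 0, 0.5), (1, 1, 1.0), (2, 2, 0.0)]
lines = triplets_to_libsvm(labels, triplets)
```

Writing `"\n".join(lines)` to a file would then give something the libsvm tools can read, assuming the labels and indices match what the chosen tool expects.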
