
Data Mining with Weka: Putting it all together






Presentation Transcript


  1. Data Mining with Weka: Putting it all together. Team members: Hao Zhou, Yujia Tian, Pengfei Geng, Yanqing Zhou, Ya Liu, Kexin Liu. Directors: Prof. Elke A. Rundensteiner, Prof. Ian H. Witten

  2. Outline • 5.1 The data mining process (Hao) • 5.2 Pitfalls and pratfalls (Yujia, Pengfei) • 5.3 Data mining and ethics (Yanqing) • 5.4 Summary (Ya, Kexin)

  3. 5.1 The data mining process By Hao

  4. 5.1 The data mining process • Feeling lucky: Weka is not everything I need to talk about in my part (the point is knowing how, rather than why, to use Weka) • Maybe not so lucky: talking about Weka is time-consuming. =)

  5. From Weka to real life • When we use Weka for the MOOC, we never worry about the dataset, as it has already been collected for us.

  6. Procedures in real life • Why do we do data mining in real life? • For course projects (this is my current situation) • For solving real-life problems • For fun • For … • Now that we have specified our “question [1]”, the next thing we do is gather the data [2] we need.

  7. Real life project • This summer vacation, I worked as an unpaid volunteer programmer for a start-up whose objective is to provide article recommendations for developers [1]. • We maintain a database (MongoDB) that indexes all the up-to-date articles we gather from across the Internet. • We use many ways to gather articles; I focused on one of them: getting article links from influencers’ tweets through APIs.

  8. Procedures in real life • Do all the links I gathered work? Never, even though I wish they did. • 1. Due to an algorithm issue, some of the links I got are badly formatted. • 2. Even when a link is correctly formatted, it may not point to an article. • [3. More problems arise after fetching the articles from the links.] • So after gathering our data we must clean it up [3] before we can make good use of it (a sketch follows below).
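A minimal Java sketch of this clean-up step. The class name LinkCleaner, the URL regex, and the validation rule are illustrative assumptions, not the start-up's actual code:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the link clean-up step described above. */
public class LinkCleaner {

    // Rough pattern for http(s) URLs embedded in tweet text.
    private static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");

    /** Extract candidate links from raw tweet text. */
    public static List<String> extractLinks(String tweetText) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(tweetText);
        while (m.find()) {
            // Strip trailing punctuation that the regex tends to pick up.
            links.add(m.group().replaceAll("[.,;:!?)\\]]+$", ""));
        }
        return links;
    }

    /** Keep only links that are well-formed URIs. A real pipeline would
     *  also fetch the page and check that it actually is an article. */
    public static boolean isWellFormed(String link) {
        try {
            URI uri = URI.create(link);
            return uri.getHost() != null;
        } catch (IllegalArgumentException e) {
            return false;   // badly formatted link, as mentioned above
        }
    }

    public static void main(String[] args) {
        String tweet = "Great read: https://example.com/article-42! "
                     + "And a broken one: http://exa%ZZmple.com";
        for (String link : extractLinks(tweet)) {
            System.out.println(link + " -> " + (isWellFormed(link) ? "keep" : "drop"));
        }
    }
}
```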

  9. Procedures in real life • OK, assume that we now have all the [raw] data (articles, in our case) that we need. • The most important jobs come next; one of them is how to rank articles for different keywords [and how to define the keyword collection]. • (This is more a mathematics problem than a computer science one, and I did not participate in this part.) • This corresponds to the step “define new features”.

  10. Procedures in real life • After the new features are defined, the last step is to build a web app so that users can enjoy “our” work. • This last step is still under construction, which means “we” still need more time to “deploy the result”. • We will go to section 5.2 now -->

  11. 5.2 Pitfalls and pratfalls By Yujia, Pengfei

  12. 5.2 Pitfalls and pratfalls • Pitfall: a hidden or unsuspected danger or difficulty • Pratfall: a stupid and humiliating action • This section covers the tricky parts and how to deal with them

  13. Be skeptical • In data mining, it’s very easy to cheat • whether consciously or unconsciously • For reliable tests, use a completely fresh sample of data that has never been seen before

  14. Overfitting has many faces • Don’t test on the training set (of course!) • Data that has been used for development (in any way) is tainted • Leave some evaluation data aside for the very end • Key: always test on completely fresh data (a sketch of this follows below)
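A minimal Weka sketch of keeping a final hold-out set that is never touched during development. The file name mydata.arff, the 2/3 split, and the choice of J48 are assumptions:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: set aside an untouched hold-out set for the very end. */
public class FreshTestData {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int devSize = (int) (data.numInstances() * 0.67);
        Instances dev = new Instances(data, 0, devSize);   // develop here
        Instances holdout = new Instances(data, devSize,
                data.numInstances() - devSize);            // touch only once

        // All model selection and tuning happens on 'dev'
        // (e.g. via cross-validation) ...
        J48 tree = new J48();
        tree.buildClassifier(dev);

        // ... and only the very last evaluation uses the fresh hold-out set.
        Evaluation eval = new Evaluation(dev);
        eval.evaluateModel(tree, holdout);
        System.out.println("Accuracy on fresh data: " + eval.pctCorrect() + "%");
    }
}
```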

  15. Missing values • What does “missing” mean: unknown, unrecorded, or irrelevant? • Should we omit instances where the attribute value is missing, or treat “missing” as a separate possible value? • Is there significance in the fact that a value is missing? • Most learning algorithms deal with missing values, but they may make different assumptions about them

  16. An Example • OneR and J48 deal with missing values in different ways • Load weather.nominal.arff • OneR gets 43%, J48 gets 50% (using 10-fold cross-validation) • Change the outlook value to unknown on the first four “no” instances • OneR gets 93%, J48 still gets 50% • Look at OneR’s rules: it uses “?” as a fourth value for outlook (a sketch reproducing this experiment follows below)
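A sketch of this experiment using Weka's Java API. It assumes weather.nominal.arff (shipped with Weka) is in the working directory; the exact percentages may vary with the cross-validation seed:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Compare OneR and J48 before and after introducing missing values. */
public class MissingValuesDemo {

    static double cvAccuracy(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("OneR: " + cvAccuracy(new OneR(), data));
        System.out.println("J48:  " + cvAccuracy(new J48(), data));

        // Set outlook to missing on the first four "no" instances.
        int changed = 0;
        for (int i = 0; i < data.numInstances() && changed < 4; i++) {
            if (data.instance(i).stringValue(data.classIndex()).equals("no")) {
                data.instance(i).setMissing(data.attribute("outlook"));
                changed++;
            }
        }

        System.out.println("OneR: " + cvAccuracy(new OneR(), data));
        System.out.println("J48:  " + cvAccuracy(new J48(), data));
    }
}
```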

  17. An Example [screenshot: OneR’s rules, with “?” appearing as a fourth value for outlook]

  18. 5.2 Pitfalls and pratfalls, Part 2. By Pengfei

  19. No “universal” best algorithm, no free lunch • Consider a 2-class problem with 100 binary attributes: there are 2^100 ≈ 1.3 × 10^30 possible instances • Say you know a million instances, and their classes (the training set) • A million instances is a vanishingly small fraction, so you still don’t know the classes of 99.9999…% of the possible instances • How could you possibly figure them out?

  20. No “universal” best algorithm, no free lunch • In order to generalize, every learner must embody some knowledge or assumptions beyond the data it is given • Practical consequences: delete less useful attributes, and look for better filters • Data mining is an experimental science

  21. 5.3 Data mining and ethics By Yanqing

  22. 5.3 Data mining and ethics • Information privacy laws • Anonymization • The purpose of data mining • Correlation and causation

  23. Information privacy laws • In Europe: personal data must be collected for a stated purpose, kept secret, and kept accurately up to date; the provider can review it; it must be deleted as soon as possible; it must not be transmitted to places with weaker protection; no sensitive data (sexual orientation, religion) • In the US: privacy is not highly legislated or regulated; it falls under computer security, privacy, and criminal law • But it is hard to remain anonymous... be aware of ethical issues and the law • Example: AOL (2006) published search data of 650,000 users on the public web for 3 days, at a reported cost of at least $5,000 per identifiable person

  24. Anonymization • It is much harder than you think • Story: Massachusetts released medical records (mid-1990s) with no names, addresses, or social security numbers, yet re-identification techniques using public records still worked • City, birth date, and gender can identify 50% of the US population; add one more attribute, ZIP code, and identification rises to 85% • Netflix: movie ratings can identify people; 99% can be identified from ratings of 6 movies, 70% from just 2

  25. The purpose of data mining • The purpose of data mining is to discriminate … • who gets the loan • who gets the special offer • Certain kinds of discrimination are unethical, and illegal • racial, sexual, religious, … • But it depends on the context • sexual discrimination is usually illegal • … except for doctors, who are expected to take gender into account • … and information that appears innocuous may not be • ZIP code correlates with race • membership of certain organizations correlates with gender

  26. Correlation and Causation • Correlation does not imply causation • As ice cream sales increase, so does the rate of drownings. Therefore ice cream consumption causes drowning??? • Data mining reveals correlation, not causation • But really, we want to predict the effects of our actions

  27. 5.3 Summary • Privacy of personal information • Anonymization is harder than you think • Re-identification from supposedly anonymized data • Data mining and discrimination • Correlation does not imply causation

  28. 5.4 Summary By Ya, Kexin

  29. 5.4 SUMMARY • There’s no magic in data mining; instead, there is a huge array of alternative techniques • There’s no single universal “best method”: it’s an experimental science! What works best on your problem?

  30. 5.4 SUMMARY

  31. 5.4 SUMMARY • Weka makes it easy... maybe too easy? • There are many pitfalls: you need to understand what you’re doing!

  32. 5.4 SUMMARY • Focus on evaluation... and significance • Different algorithms differ in performance, but is the difference significant? (a sketch of one way to check follows below)
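One hedged way to eyeball significance from code: repeat cross-validation with several seeds and compare the spread of accuracies. The Weka Experimenter does this properly with a corrected paired t-test; the file mydata.arff is hypothetical:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Repeat 10-fold CV with different seeds before declaring a winner. */
public class SignificanceCheck {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        for (int seed = 1; seed <= 10; seed++) {
            Evaluation evalJ48 = new Evaluation(data);
            evalJ48.crossValidateModel(new J48(), data, 10, new Random(seed));
            Evaluation evalOneR = new Evaluation(data);
            evalOneR.crossValidateModel(new OneR(), data, 10, new Random(seed));
            System.out.printf("seed %2d: J48 %.2f%%  OneR %.2f%%%n",
                    seed, evalJ48.pctCorrect(), evalOneR.pctCorrect());
        }
        // If the two columns overlap heavily across seeds, the difference
        // is probably not significant.
    }
}
```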

  33. Advanced Data Mining with Weka • Some missing parts in the lectures: • Filtered classifier • Cost-sensitive evaluation and classification • Attribute selection • Clustering • Association rules • Text classification • Weka Experimenter

  34. Filtered Classifier • Filter the training data, not the test data • Why do we need FilteredClassifier? Because when the filter is learned from the training folds alone, no information about the test data leaks into the preprocessing (see the sketch below)
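A minimal FilteredClassifier sketch. The choice of Discretize as the filter and the file mydata.arff are assumptions:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.Discretize;

/** The filter is learned from the training folds only inside each CV fold. */
public class FilteredClassifierDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new Discretize());   // applied to training data only
        fc.setClassifier(new J48());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```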

  35. Cost-sensitive evaluation and classification • Different decisions and different kinds of errors carry different costs • Costs in data mining include: misclassification costs, test costs, costs of cases, and computation costs (see the sketch below)
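A sketch of cost-sensitive classification and evaluation for a hypothetical 2-class problem. The 5-to-1 cost ratio and the file mydata.arff are assumptions:

```java
import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** One kind of misclassification is assumed here to cost 5x the other. */
public class CostSensitiveDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Rows = actual class, columns = predicted class; diagonal = correct.
        CostMatrix costs = CostMatrix.parseMatlab("[0 5; 1 0]");

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);
        csc.setMinimizeExpectedCost(true);  // reweight predictions, not data

        Evaluation eval = new Evaluation(data, costs);  // cost-sensitive eval
        eval.crossValidateModel(csc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}
```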

  36. Attribute Selection • Useful parts of attribute selection: selecting relevant attributes and removing irrelevant ones • Reasons for attribute selection: a simpler model, more transparent and easier to understand; shorter training time; and knowing which attributes are important (see the sketch below)
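A sketch of attribute selection with one common evaluator/search combination (CFS plus greedy stepwise, an assumed choice); mydata.arff is hypothetical:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Find a relevant attribute subset and report what was kept. */
public class AttributeSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CfsSubsetEval());  // scores attribute subsets
        sel.setSearch(new GreedyStepwise());    // searches the subset space
        sel.SelectAttributes(data);             // note Weka's capitalised name

        // selectedAttributes() also includes the class index at the end.
        for (int idx : sel.selectedAttributes()) {
            System.out.println("Keep: " + data.attribute(idx).name());
        }
    }
}
```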

  37. Clustering • Cluster the instances according to their attribute values • Clustering methods: k-means and k-means++ (a sketch follows below)
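A minimal k-means sketch in Weka. The number of clusters (3) and the input file are assumptions; the class attribute is removed first, since clustering uses attribute values only:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

/** Cluster instances by attribute values with SimpleKMeans. */
public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // hypothetical file

        // Drop the last attribute, assumed here to be the class label.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);
        km.setSeed(1);
        // Recent Weka versions also offer k-means++ initialization
        // via setInitializationMethod(...).
        km.buildClusterer(noClass);

        System.out.println(km);   // cluster centroids and sizes
    }
}
```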

  38. Experimenter [slides 38-40 show only screenshots of the Weka Experimenter interface]

  41. Acknowledgement • We thank Prof. Ian H. Witten for his direction of the Weka MOOC.

  42. Thank you for your kind attention
