Data Mining Applications

Data Mining Applications. Data Mining Seminar: Business Trouble and Industrial Applications, Data Mining Lab, Industrial Engineering, Universitas Islam Indonesia, 10 May 2008. Contents: Introduction, Data, Association rules, Classification, Clustering, Data mining applications, Commercial tools, Conclusion.

Presentation Transcript


  1. Data Mining Applications • Data Mining Seminar: Business Trouble and Industrial Applications • Data Mining Lab, Industrial Engineering, Universitas Islam Indonesia • 10 May 2008 • Budi Santosa

  2. Contents • Introduction • Data • Association rules • Classification • Clustering • Data mining applications • Commercial tools • Conclusion

  3. Introduction • What is data mining? • Why do we need to mine data? • What kinds of data can we mine?

  4. Definition of data mining • Data mining combines statistical data analysis methods with algorithms for processing large data sets. It is the process of discovering important information or patterns in large databases. • It is part of the Knowledge Discovery in Databases (KDD) process. • Exploration and analysis of large quantities of data • Using automatic or semi-automatic tools • To discover meaningful patterns and rules. These patterns allow a company to • better understand its customers • improve its marketing, sales, and customer support operations

  6. Why data mining? Explosive growth in data collection • Data are stored in data warehouses • Data are increasingly accessible from the Web and intranets • We need more effective ways to use these data for decision support than traditional query languages alone

  7. What kinds of data? Structure: 3D anatomy; Function: 1D signal; Metadata: annotation • Data warehouses • Transactional databases • Advanced database systems • Spatial and temporal • Time series • Multimedia, text • WWW • …

  8. Working with data • Most data mining algorithms work only on numeric data • All data should therefore be represented as numbers so that the algorithms can be applied • Whether the data are sales figures, crime rates, text, or images, we must find a suitable way to transform them into numbers
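
One common way to turn a categorical attribute into numbers is one-hot encoding. The sketch below is only an illustration of the idea on slide 8; the function name `one_hot_encode` and the sample values are assumptions, not part of the original talk.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with a single 1.

    Returns the sorted list of categories and one vector per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the position of this value's category
        vectors.append(row)
    return categories, vectors


# Hypothetical sample: a categorical column with three distinct values.
categories, vectors = one_hot_encode(["soft", "hard", "none", "soft"])
```

Each distinct category gets its own column, so the encoding does not impose an artificial order on the values.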

  9. Knowledge Discovery and Data Mining • Non-trivial extraction of implicit, unknown, and potentially useful information from databases. • The knowledge discovery process consists of the following phases:

  10. Data mining tasks • Prediction: how will a given attribute of the data behave in the future? (predictive) • Time series • Pattern sequence • Independent-dependent relation • Classification: assigning data to categories based on existing samples (discrete labels) • Feature selection • Clustering: grouping objects without any labeled samples as examples (descriptive) • Association: object association

  11. Association Rules • Goal • Provide rules that relate the presence of one set of items to the presence of another set of items • Example:

  12. Association Rules • Market-basket model • Find combinations of products • Place SHOES near SOCKS so that a customer who buys one will also buy the other • Transaction: a person buys several items, forming an itemset, in a supermarket
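
The strength of a market-basket rule such as SHOES → SOCKS is usually measured by support and confidence. The following is a minimal sketch under stated assumptions: the transactions are made up for illustration, and the helper names `support` and `confidence` are not from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical supermarket baskets (illustration only).
baskets = [
    {"shoes", "socks", "belt"},
    {"shoes", "socks"},
    {"shoes"},
    {"socks", "hat"},
    {"belt"},
]
rule_support = support(baskets, {"shoes", "socks"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"shoes"}, {"socks"})  # 2 of 3 shoe baskets
```

A rule is typically kept only if both values exceed user-chosen thresholds.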

  13. Classification [Figure: decision tree. Tests: married (yes/no); salary (<20k, >=20k and <50k, >=50); account balance (<5k, >5k); age (<25, >=25). Leaf classes: poor risk, fair risk, good risk.]

  14. Entropy and information gain • Expected information for the class attribute: I(3,3) = 1 • E(Married) = 0.92, Gain(Married) = 0.08 • E(Salary) = 0.33, Gain(Salary) = 0.67 • E(A.balance) = 0.82, Gain(A.balance) = 0.18 • E(Age) = 0.81, Gain(Age) = 0.19 • Gain(A) = I - E(A) • [Figure: tree split on Salary. Salary >= 50k: class is “yes” {1,2}; Salary < 20k: class is “no” {4,5}; Salary 20k..50k: split on age, with age < 25: class is “no” {3} and age >= 25: class is “yes” {6}.]
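
The Salary figures on slide 14 can be checked with a few lines of code. This sketch uses only what the slide shows: six records, three in each class, and the three Salary branches with their record sets. The function names are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Information I(...) of a list of class labels, in bits."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def expected_information(partitions):
    """E(A): size-weighted entropy of the partitions induced by attribute A."""
    total = sum(len(part) for part in partitions)
    return sum(len(part) / total * entropy(part) for part in partitions)


def information_gain(labels, partitions):
    """Gain(A) = I - E(A)."""
    return entropy(labels) - expected_information(partitions)


# Class labels of records 1..6, read off the tree on the slide.
labels = ["yes", "yes", "no", "no", "no", "yes"]
# Salary branches: >=50k -> {1,2}, <20k -> {4,5}, 20k..50k -> {3,6}.
salary_partitions = [["yes", "yes"], ["no", "no"], ["no", "yes"]]

gain_salary = information_gain(labels, salary_partitions)
```

Two of the three Salary branches are pure, so only the 20k..50k branch contributes to E(Salary): 2/6 × 1 ≈ 0.33, which gives Gain(Salary) ≈ 0.67 as stated on the slide.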

  15. Classification [Figure: a training set with two categorical attributes, one continuous attribute, and a class column is used to learn a classifier; the resulting model is applied to a test set.]

  16. Text Classification [Figure: a training set with a text column and a class column is used to learn a classifier; the resulting model is applied to a test set.]

  17. Clustering • Clustering is the process of grouping similar objects into the same cluster. • The objects may come from databases of customers, products, genes, students, and so on.

  18. Clustering • Some concepts • One of the most important choices is the similarity measure • For numeric data, distance-based similarity functions are commonly used • Euclidean metric (Euclidean distance), Minkowski metric, Manhattan metric • Correlation, cosine, covariance • Methods: hierarchical, K-means, fuzzy, SOM, Support Vector Clustering
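
The distance measures named on slide 18 are closely related: the Minkowski metric with p = 2 is the Euclidean distance, and with p = 1 it is the Manhattan distance. A minimal sketch, with function names chosen here for illustration:

```python
from math import sqrt


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(a, b, 2)


def manhattan(a, b):
    """Manhattan distance: Minkowski with p = 1."""
    return minkowski(a, b, 1)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Distances grow as objects become less alike, whereas cosine similarity grows as they become more alike, so the two are used with opposite sign conventions in clustering algorithms.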

  19. Clusters

  20. Data mining applications • Weather • Business • Microbiology • Market analysis • Manufacturing and production • Fraud detection and detection of unusual patterns (outliers) • Telecommunication • Financial transactions

  21. Data mining applications • Text mining (newsgroups, email, documents) and Web mining • DNA and bio-data analysis • Disease outcomes • Effectiveness of treatments • Identifying new drugs

  22. Weather [Figure labels: elevation, Chandler, 54 km, 180 km.]

  23. [Figure labels: north, azimuth angle, Chandler, 54 km.] The WSR-88D records a digital database containing three variables: velocity (V), reflectivity (Z), and spectrum width (W).

  24. MDA Algorithm for Tornado Detection • The current Mesocyclone Detection Algorithm (MDA) was created at the National Severe Storms Laboratory (NSSL), Oklahoma, to work with native variables derived from the WSR-88D • In order to detect circulations associated with vortices that spin up into tornadoes, the velocity data are exploited • The data are measured for circulation depth, height above the ground, strength of the circulation, shear (change in wind speed or direction with distance), etc. • By relaxing previous threshold values, the MDA is capable of detecting weaker circulations that may eventually spin up into mesocyclones (thereby enhancing the probability of detection)

  25. MDA Attributes 1. base (m) [0-12000] 2. depth (m) [0-13000] 3. strength rank [0-25] 4. low-level diameter (m) [0-15000] 5. maximum diameter (m) [0-15000] 6. height of maximum diameter (m) [0-12000] 7. low-level rotational velocity (m/s) [0-65] 8. maximum rotational velocity (m/s) [0-65] 9. height of maximum rotational velocity (m) [0-12000] 10. low-level shear (m/s/km) [0-175] 11. maximum shear (m/s/km) [0-175] 12. height of maximum shear (m) [0-12000] 13. low-level gate-to-gate velocity difference (m/s) [0-130] 14. maximum gate-to-gate velocity difference (m/s) [0-130] 15. height of maximum gate-to-gate velocity difference (m) [0-12000] 16. core base (m) [0-12000] 17. core depth (m) [0-9000] 18. age (min) [0-200] 19. strength index (MSI) weighted by average density of integrated layer [0-13000] 20. strength index (MSIr) "rank" [0-25] 21. relative depth (%) [0-100] 22. low-level convergence (m/s) [0-70] 23. mid-level convergence (m/s) [0-70]

  26. Medical • Can I wear contact lenses? • Possible outputs: none, soft, hard. • The decision is based on: • age • spectacle prescription • astigmatism • tear production rate

  27. Example

  28. Classification procedures • A set of “if-then” rules • A decision tree • A neural network • SVM, LSVM, LS-SVM • LDA • KNN • Minimax Probability Machine • Analytic Center Machine • Relevance Vector Machine

  29. If-then procedure • If age = young and astigmatic = no and tear production rate = normal then recommendation = soft • If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft • If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none • If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft • If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard • If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard • If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none • If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
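
The rule set on slide 29 maps directly onto a small rule table that is scanned from top to bottom. The sketch below is one possible encoding, not the author's implementation: the attribute keys (`age`, `prescription`, `astigmatic`, `tear_rate`) are names chosen here, and returning `None` when no rule fires is an assumption, since the slide does not say what happens in that case.

```python
# Each rule: (conditions that must all hold, recommendation).
RULES = [
    ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(patient):
    """Return the recommendation of the first rule whose conditions all match.

    Returns None if no rule on the slide covers the patient (assumption).
    """
    for conditions, recommendation in RULES:
        if all(patient.get(key) == value for key, value in conditions.items()):
            return recommendation
    return None


# Hypothetical patient record for illustration.
patient = {"age": "young", "prescription": "myope",
           "astigmatic": "no", "tear_rate": "normal"}
result = recommend(patient)
```

Keeping the rules as data rather than nested `if` statements makes it easy to add, remove, or reorder rules without touching the matching logic.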

  30. Decision tree

  31. Regression • Regression is similar to classification • First, construct a model • Second, use the model to predict an unknown value • Methods • Linear and multiple regression • Non-linear regression, neural networks, SVR • Regression is different from classification • Classification predicts a categorical class label • Regression models continuous-valued functions
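
The two steps on slide 31 (construct a model, then use it to predict) can be illustrated with simple linear regression fitted by ordinary least squares. This is a sketch with made-up data points; the function names are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the fitted (slope, intercept) pair to predict a continuous value."""
    slope, intercept = model
    return slope * x + intercept


# Step 1: construct the model from hypothetical observations.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# Step 2: use the model to predict an unknown value.
estimate = predict(model, 5)
```

Unlike a classifier, the prediction here is a real number rather than a class label.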

  32. Business • Example: credit card users can be clustered by how they use their cards, such as how often the card is used: • frequent/seldom usage • domestic/foreign transactions • high/low amounts of money • transactions of specific type • … • A fraud detection system can be developed for each cluster, or a set of other products can be offered to it

  33. Credit • Attribute 1: (qualitative) Status of existing checking account. A11: ... < 0 DM; A12: 0 <= ... < 200 DM; A13: ... >= 200 DM / salary assignments for at least 1 year; A14: no checking account • Attribute 2: (numerical) Duration in months • Attribute 3: (qualitative) Credit history. A30: no credits taken / all credits paid back duly; A31: all credits at this bank paid back duly; A32: existing credits paid back duly till now; A33: delay in paying off in the past; A34: critical account / other credits existing (not at this bank)

  34. Attribute 4: (qualitative) Purpose. A40: car (new); A41: car (used); A42: furniture/equipment; A43: radio/television; A44: domestic appliances; A45: repairs; A46: education; A47: (vacation - does not exist?); A48: retraining; A49: business; A410: others

  35. Attribute 15: (qualitative) Housing. A151: rent; A152: own; A153: for free • Attribute 16: (numerical) Number of existing credits at this bank • Attribute 17: (qualitative) Job. A171: unemployed / unskilled - non-resident; A172: unskilled - resident; A173: skilled employee / official; A174: management / self-employed / highly qualified employee / officer

  36. Cross Selling • Cross selling is another important application of data mining • What is the best additional or best next offer (BNO) for each customer? • For example, a bank may want to sell automobile insurance when a customer takes out a car loan • The bank might decide to acquire a full-service insurance agency

  37. Paying Claims • A major manufacturer of diesel engines must also service engines under warranty • Warranty claims come in from all around the world • Data mining is used to determine rules for routing claims • some are automatically approved • others require further research • Result: the manufacturer saves millions of dollars • Data mining also enables insurance companies and the federal government to save millions of dollars by not paying fraudulent medical insurance claims

  38. Finding Prospects • A cellular phone company wanted to introduce a new service • They wanted to know which customers were the most likely prospects • Data mining identified “sphere of influence” as a key indicator of likely prospects • Sphere of influence is the number of different telephone numbers that someone calls
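
Given the definition on slide 38, the sphere of influence can be derived from raw call records by counting distinct called numbers per caller. The record layout (caller, callee pairs) and the sample numbers below are assumptions for illustration only.

```python
from collections import defaultdict


def sphere_of_influence(call_records):
    """Count the distinct numbers each caller has called.

    call_records is an iterable of (caller, callee) pairs; repeated calls
    to the same number are counted once.
    """
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
    return {caller: len(callees) for caller, callees in contacts.items()}


# Hypothetical call log.
calls = [
    ("555-0101", "555-0202"),
    ("555-0101", "555-0303"),
    ("555-0101", "555-0202"),  # repeat call, not counted twice
    ("555-0404", "555-0101"),
]
influence = sphere_of_influence(calls)
```

The resulting count can then be used as one input feature when ranking customers as prospects.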

  39. Anticipating Customer Needs • Clustering is an undirected data mining technique that finds groups of similar items • Based on previous purchase patterns, customers are placed into groups • Customers in each group are assumed to have an affinity for the same types of products • New product recommendations can be generated automatically based on new purchases made by the group • This is sometimes called collaborative filtering

  40. Microbiology

  41. Microarray Problem [Figure labels: biology application domain, validation, data analysis, microarray experiment, image analysis, data mining, experiment design and hypothesis, data warehouse, knowledge discovery in databases (KDD).]

  42. Data Mining for Manufacturing • Enterprise Resource Planning (ERP) systems generate large volumes of data. • Examples of data sources in manufacturing include: • Schedules. • Production capacity, efficiency, failures, etc. • Manufacturing parameters. • Process quality. • Process plans.

  43. Data Generation in an ERP System

  45. Methodology for the Selection of Manufacturing Processes with Data Mining • The learning stage focuses on discovering knowledge from manufacturing processes: Step 1: Similar parts and processes are grouped into clusters. Step 2: Relevant processes are associated with each cluster. • The exploitation stage takes advantage of the clusters to improve the efficiency of generating process plans for new parts: Step 3: A new part to be manufactured is matched with a suitable cluster. Step 4: The new part is assigned the relevant process plan. • The specialization stage adapts the relevant process for the new part: Step 5: The relevant process is adapted to the new part. Step 6: The new process plan data is incorporated into the database.
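
Steps 3 and 4 of the exploitation stage can be sketched as a nearest-centroid lookup. The slide does not say how a part is matched to a cluster, so the Euclidean matching, the feature vectors, the cluster names, and the process plans below are all hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def assign_process_plan(part_features, clusters):
    """Match a new part to the closest cluster and return that cluster's plan.

    clusters maps a cluster name to a dict with a 'centroid' (feature tuple)
    and a 'process_plan' (list of operations) learned in Steps 1 and 2.
    """
    best = min(clusters,
               key=lambda name: dist(part_features, clusters[name]["centroid"]))
    return best, clusters[best]["process_plan"]


# Hypothetical clusters; features are (length in mm, width in mm).
clusters = {
    "shafts": {"centroid": (200.0, 25.0), "process_plan": ["turning", "grinding"]},
    "plates": {"centroid": (300.0, 300.0), "process_plan": ["milling", "drilling"]},
}
cluster_name, plan = assign_process_plan((210.0, 30.0), clusters)
```

The returned plan would then be adapted to the new part (Step 5) and stored back in the database (Step 6).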

  47. Data Mining for Supplier Selection • Input feature set: performance measures for suppliers

  48. Sampoerna Factory • Planning starts from demand forecasting • Demand forecasting provides guidance on: • Which materials are needed? How much of each material is required? • Labor allocation • Which variables are needed? Price, promotion value, competitor promotions, customer age, past demand • Hybrid of time series forecasting and causal relations

  49. Sequential Pattern Analysis • Given a set of sequences, find the complete set of frequent subsequences • Applications of sequential patterns • Customer shopping sequences: • First buy a computer, then a CD-ROM, and then a digital camera, within 3 months. • Weblog click streams • Telephone calling patterns • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
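
Whether a pattern such as <(ab)c> is frequent depends on how many sequences contain it as a subsequence. The sequence database from the slide's figure is not in the transcript, so the sketch below uses an illustrative database chosen so that the pattern reaches the stated threshold; the function names are assumptions.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence.

    Both are lists of itemsets (sets). Each pattern itemset must be a subset
    of some sequence element, and the matched elements must appear in order.
    """
    position = 0
    for itemset in pattern:
        # Greedily advance to the earliest element that covers this itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


# Illustrative database (not taken from the slide).
database = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
pattern = [{"a", "b"}, {"c"}]  # the pattern <(ab)c>
is_frequent = support(database, pattern) >= 2  # min_sup = 2
```

In this database the first and third sequences contain <(ab)c>, so its support is 2 and it meets min_sup = 2.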

  50. Other examples • Direct mailing: who should be offered a particular product? • Remote sensing: determining water pollution from spectral images • Load forecasting: predicting demand for electric power • Intelligent ATMs: how much cash will be there tomorrow? • City planning: identifying groups of houses according to their house type, value, and geographical location
