Research Bytes 2004
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute
Need for Data Mining
• Data are being gathered and stored extremely fast
• Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman & Varian at UC-Berkeley in 2000).
• Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data
What is Data Mining? Or more generally, Knowledge Discovery in Databases (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]
[Diagram: Raw Data → Data Mining → Patterns, where the patterns are either analytical and statistical patterns (rules, decision trees, …) or visual patterns]
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54, Fall 1996.
Data Analysis (KDD) Process
[Diagram: data sources → data management → data → data "pre"-processing → clean data → data analysis / data mining → models → model/pattern evaluation → "good" model → model/patterns deployment, which is applied to new data]
• data management
  • databases
  • data warehouses
• data "pre"-processing
  • noisy/missing data
  • dim. reduction
• data analysis / data mining
  • analytical
  • statistical
  • visual
• model/pattern evaluation
  • quantitative
  • qualitative
• model/patterns deployment
  • prediction
  • decision support
KDD is Interdisciplinary: techniques come from multiple fields
• Machine Learning (AI): contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics: contributes language, framework, and techniques
• Pattern Recognition: contributes pattern extraction and pattern matching techniques
• Databases: contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization: contributes visual data displays and data exploration
• High Performance Comp.: contributes techniques for efficiently handling complexity
• Application Domain: contributes domain knowledge
What do you want to learn from your data? KDD approaches
• regression
• classification
• clustering
• change/deviation detection
• summarization
• dependency/assoc. analysis
[Figure: "Data" at the center, surrounded by the six approaches, each with a small illustration, e.g. rules (IF a & b & c THEN d & k; IF k & a THEN e) and associations (A, B -> C 80%; C, D -> A 22%)]
Some Current Analytical Data Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
  • Systems performance data
  • Sleep data
  • Financial data
  • Web data
• Data Mining for Genetic Analysis
  • Correlating genetic information with diseases
  • Predicting gene expression patterns
• Data Mining for Electronic Commerce
  • Collaborative and Content-Based Filtering
  • Using Association Rules and using Neural Networks
Analyzing Sleep Data
• Purpose:
  • Find associations between sleep patterns and health/pathology
  • Obtain patterns of the different sleep stages (4 sleep stages + REM + Wake)
• Data set:
  • Clinical (sequential)
    • Electro-encephalogram (EEG)
    • Electro-oculogram (EOG)
    • Electro-myogram (EMG)
    • Probe measuring flow of oxygen in blood, etc.
  • Diagnostic (tabular)
    • Questionnaire responses
    • Patient’s demographic info.
    • Patient’s medical history
  (Source: http://www.blsc.com)
• Potential rules:
  • (A) Association rules
    • (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy, confidence = 92%, support = 13%
  • (B) Classification rules
    • (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian), confidence = 70%, support = 8%
• *AHI = Apnea-Hypopnea index, **OSA = Obstructive Sleep Apnea
WPI, UMassMedical, BC
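The confidence and support figures quoted with the rules above follow the usual association-rule definitions: support is the fraction of all records that satisfy both sides of the rule, and confidence is the fraction of records satisfying the left-hand side that also satisfy the right-hand side. The sketch below only illustrates those two definitions. The patient records and item names (`short_latency`, `hereditary_disorder`, and so on) are made up for the example and are not taken from the sleep study.

```python
from typing import FrozenSet, List


def support(records: List[FrozenSet[str]], items: FrozenSet[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    if not records:
        return 0.0
    hits = sum(1 for record in records if items <= record)
    return hits / len(records)


def confidence(
    records: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    antecedent_support = support(records, antecedent)
    if antecedent_support == 0:
        return 0.0
    return support(records, antecedent | consequent) / antecedent_support


# Hypothetical toy data: one set of observed properties per patient.
records = [
    frozenset({"short_latency", "hereditary_disorder", "narcolepsy"}),
    frozenset({"short_latency", "hereditary_disorder", "narcolepsy"}),
    frozenset({"short_latency", "hereditary_disorder", "narcolepsy"}),
    frozenset({"short_latency", "hereditary_disorder"}),
    frozenset({"short_latency"}),
    frozenset({"heavy_snoring"}),
    frozenset({"heavy_snoring", "narcolepsy"}),
    frozenset(),
]

antecedent = frozenset({"short_latency", "hereditary_disorder"})
consequent = frozenset({"narcolepsy"})

rule_support = support(records, antecedent | consequent)
rule_confidence = confidence(records, antecedent, consequent)
print(f"support={rule_support:.3f} confidence={rule_confidence:.3f}")
```

On this toy data, 3 of the 8 records satisfy both sides of the rule and 4 satisfy the left-hand side, so the rule has support 3/8 and confidence 3/4.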
Input Data
• Each instance: [tabular | set | sequential]* attributes

      attr1      attr2       attr3  attr4   attr5   [class]
      illnesses  heart rate  age    oxygen  gender  Epworth
P1
P2
P3
…
Analyzing Financial Data
• Sequential data: daily stock values
• “Normal” (tabular/relational) data
  • sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
  • If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases
Events – Financial Data
• Basic events: 16 or so financial templates [Little&Rhodes78]
• Difficult pattern matching: alignments and time warping
[Figure: four example templates: Panic Reversal, Head & Shoulders Reversal, Rounding Top Reversal, Descending Triangle Reversal]
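To give a feel for why matching a price series against a template involves alignment and time warping, here is a minimal sketch of the textbook dynamic time warping (DTW) distance. It is a generic illustration and is not claimed to be the matching method used in the WPI work. The `template` and `stretched` series are invented numbers, not real stock data or one of the templates named above.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook DTW with |x - y| as the local cost, O(len(a) * len(b))."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays on the same point
                cost[i][j - 1],      # b advances, a stays on the same point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical "rise then fall" template and the same shape at half speed.
template = [0.0, 1.0, 2.0, 1.0, 0.0]
stretched = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
steady_rise = [0.0, 1.0, 2.0, 3.0, 4.0]

print(dtw_distance(template, stretched))    # same shape, different speed
print(dtw_distance(template, steady_rise))  # different shape
```

Because DTW lets one point of a series align with several points of the other, the slowed-down copy of the template matches it at zero cost, while a series of a different shape does not. A point-by-point comparison could not even be applied to the first pair, since the two series have different lengths.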
Closer Look: WPI Weka
Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
• SNP analysis
  • discovering correlations between sequence variations and diseases
• Gene expression
  • discovering patterns that cause a gene to be expressed in a particular cell
Correlating Genetics with Diseases
• Utilize data mining techniques with actual genetic data sampled from research
• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.
Genomic Data Resources
Wirth, B. et al. Journal of Human Molecular Genetics
Our System: CAGE
To predict gene expression based on DNA sequences.
[Figure: for each cell type (muscle cell, neural cell, seam cells), CAGE predicts whether Gene 1, Gene 2, and Gene 3 are on or off]
Grad. & Undergrad. Students
Ali Benamara
Dharmesh Thakkar
Senthil K Palanisamy
Zachary Stoecker-Sylvia
Keith A. Pray
Jonathan Freyberger
Maged El-Sayed
Parameshvyas Laxminarayan
Aleksandar Icev
Wendy Kogel
Michael Sao Pedro
Christopher Shoemaker
Weiyang Lin
Jonathan Rudolph
Eduardo Paredes
Iavor N. Trifonov
Takeshi Kawato
Cindy Leung and Sam Holmes
John Baird, Jay Farmer, Rebecca Gougian, Ken Monterio, Paul Young
Zachary Stoecker-Sylvia
Kristin Blitsch, Ben Lucas, Sarah Towey
Wendy Kogel, Brooke LeClair, Christopher St. Yves
Brian Murphy, David Phu, Ian Pushee, Frederick Tan
Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano
Christopher Cole
Michael Ciman and John Gulbrandsen
Tara Halwes
Christopher Martino
Matthew Berube
Anna Novikov
Amy Kao and Dana Rock