KDDRG Research Projects
Prof. Carolina Ruiz
ruiz@cs.wpi.edu
Department of Computer Science
Worcester Polytechnic Institute
Some Current Analytical Data Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
  • Systems Performance Data
  • Sleep Data
  • Financial Data
  • Web Data
• Data Mining for Genetic Analysis
  • Correlating genetic information with diseases
  • Predicting gene expression patterns
• Data Mining for Electronic Commerce
  • Collaborative and Content-Based Filtering
  • Using Association Rules and using Neural Networks
Analyzing Sleep Data
• Purpose:
  • Find associations between sleep patterns and health/pathology
  • Obtain patterns of different sleep stages (4 sleep + REM + Wake)
• Data set:
  • Clinical (sequential)
    • Electro-encephalogram (EEG)
    • Electro-oculogram (EOG)
    • Electro-myogram (EMG)
    • Probe measuring flow of oxygen in blood, etc.
  • Diagnostic (tabular)
    • Questionnaire responses
    • Patient’s demographic info
    • Patient’s medical history
  (Source: http://www.blsc.com)
• Potential rules:
  • (A) Association rules
    (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy; confidence = 92%, support = 13%
  • (B) Classification rules
    (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian); confidence = 70%, support = 8%
*AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea
WPI, UMass Medical, BC
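The support and confidence figures attached to rules like the ones on this slide can be computed as in the minimal sketch below. It assumes the usual definitions (support = fraction of all records matching both sides of the rule; confidence = fraction of records matching the left side that also match the right side). The function name `rule_stats`, the field names, and the toy records are hypothetical and are not taken from the sleep data set.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def rule_stats(
    records: List[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    matches_left = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_left if consequent(r)]
    # Support: share of all records that satisfy both sides of the rule.
    support = len(matches_both) / len(records) if records else 0.0
    # Confidence: share of antecedent matches that also satisfy the consequent.
    confidence = len(matches_both) / len(matches_left) if matches_left else 0.0
    return support, confidence


# Hypothetical toy records, invented for illustration only.
patients = [
    {"sleep_latency_min": 2, "hereditary_disorder": True, "diagnosis": "narcolepsy"},
    {"sleep_latency_min": 1, "hereditary_disorder": True, "diagnosis": "narcolepsy"},
    {"sleep_latency_min": 2, "hereditary_disorder": True, "diagnosis": "none"},
    {"sleep_latency_min": 12, "hereditary_disorder": False, "diagnosis": "none"},
    {"sleep_latency_min": 9, "hereditary_disorder": True, "diagnosis": "OSA"},
]

support, confidence = rule_stats(
    patients,
    antecedent=lambda r: r["sleep_latency_min"] < 3 and r["hereditary_disorder"],
    consequent=lambda r: r["diagnosis"] == "narcolepsy",
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Passing the two sides of the rule as predicates keeps the sketch independent of how attributes are encoded, which is convenient when the antecedent mixes numeric thresholds and categorical tests.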
Input Data
• Each instance: [tabular | set | sequential]* attributes

      attr1      attr2       attr3  attr4   attr5   [class]
      illnesses  heart rate  age    oxygen  gender  Epworth
P1
P2
P3
…
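One possible way to hold an instance that mixes tabular, set, and sequential attributes is sketched below. The class `Instance`, its field names, the assignment of each attribute to a kind (for example, illnesses as a set and heart rate as a sequence), and all values are assumptions made for illustration; the slide only gives the attribute names.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Instance:
    """One row of the input: attributes grouped by kind, plus an optional class."""

    tabular: Dict[str, object] = field(default_factory=dict)
    sets: Dict[str, FrozenSet[str]] = field(default_factory=dict)
    sequences: Dict[str, List[float]] = field(default_factory=dict)
    label: Optional[str] = None

    def attribute_kinds(self) -> Dict[str, str]:
        """Map each attribute name to 'tabular', 'set', or 'sequential'."""
        kinds = {name: "tabular" for name in self.tabular}
        kinds.update({name: "set" for name in self.sets})
        kinds.update({name: "sequential" for name in self.sequences})
        return kinds


# Hypothetical patient P1; every value here is invented.
p1 = Instance(
    tabular={"age": 54, "gender": "F"},
    sets={"illnesses": frozenset({"hypertension", "asthma"})},
    sequences={"heart rate": [72.0, 75.0, 71.0], "oxygen": [97.0, 96.0, 94.0]},
    label="Epworth=high",
)
print(p1.attribute_kinds())
```

Keeping the three kinds in separate dictionaries lets a mining algorithm dispatch on the attribute kind without inspecting each value's type.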
Analyzing Financial Data
• Sequential data: daily stock values
• “Normal” (tabular/relational) data
  • sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
  • If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases
Events: Financial Data
• Basic events: 16 or so financial templates [Little&Rhodes78]
• Difficult pattern matching: alignments and time warping
• Example templates: Panic Reversal, Head & Shoulders Reversal, Rounding Top Reversal, Descending Triangle Reversal
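The slide mentions alignments and time warping without saying which matching procedure is used. As an illustration only, the sketch below implements classic dynamic time warping (DTW), a standard way to compare a price series against a template when the two are stretched differently in time. The function name `dtw_distance` and the toy shapes are hypothetical; the peak shape is not one of the [Little&Rhodes78] templates.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b.
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the best of: repeat b's point, repeat a's point, or advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical toy shapes: a peak, the same peak stretched in time, and a trough.
peak = [0.0, 1.0, 2.0, 1.0, 0.0]
stretched_peak = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
trough = [2.0, 1.0, 0.0, 1.0, 2.0]

print(dtw_distance(peak, stretched_peak))  # same shape at a different speed
print(dtw_distance(peak, trough))          # different shape
```

Because the warping path may repeat points of either series, the stretched peak aligns with the template at zero cost, while the trough cannot.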
WPI Weka
Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
• SNP analysis
  • discovering correlations between sequence variations and diseases
• Gene expression
  • discovering patterns that cause a gene to be expressed in a particular cell
Correlating Genetics with Diseases
• Utilize data mining techniques with actual genetic data sampled from research
• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.
Genomic Data Resources
Wirth, B. et al. Journal of Human Molecular Genetics
Our System: CAGE
• To predict gene expression based on DNA sequences.
[Figure: CAGE, with Gene 1, Gene 2, and Gene 3 shown for each of three cell types (Muscle Cell, Neural Cell, Seam Cells); legend: On / Off.]
Gene Expression Analysis

Gene   | Promoter | Cell type | Motifs      | Promoter sequence
Gene 1 | PR1      | neural    | M1 M4 M2 M5 | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT
Gene 2 | PR2      | neural    | M1 M4 M5    | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA
Gene 3 | PR3      | muscle    | M4 M1       | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA
Gene 4 | PR4      | neural    | M1 M2 M5    | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA
Gene 5 | PR5      | muscle    | M1 M4       | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA
Gene 6 | PR6      | neural    | M3 M4 M5    | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC
Gene 7 | PR7      | neural    | M5 M2 M3 M1 | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA
Gene 8 | PR8      | neural    | M2 M4 M5    | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC
Gene 9 | PR9      | muscle    | M4 M3       | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA
Gene Expression
• Transcription of DNA into RNA
[Figure labels: transcriptional proteins TF 1, TF 2, TF 3; promoter region ..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA; motifs M1, M4, M2, with distances 240 and 100 between consecutive motifs; gene; muscle cell.]
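Coefficient of variation of distances (cvd) between two motifs: “well-clustered” motifs
Motif layout of each promoter, with the distance between consecutive motifs:
PR1: M1 240 M4 100 M2 150 M5
PR2: M1 260 M4 210 M5
PR3: M4 360 M1
PR4: M1 100 M2 350 M5
PR5: M1 190 M4
PR6: M3 120 M4 150 M5
PR7: M5 210 M2 100 M3 110 M1
PR8: M2 18 M4 21 M5
PR9: M4 60 M3
IR1 = {M1, M2, M5}
$\sigma(M1,M2) = 120.1$, $\mu(M1,M2) = 216.6$, $cvd(M1,M2) = \sigma(M1,M2) / \mu(M1,M2) = 0.55$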
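The three figures quoted on this slide for the pair (M1, M2), namely 120.1, 216.6, and 0.55, can be reproduced with the short sketch below. It rests on two assumptions that are not stated on the slide but are consistent with those numbers: the distance between two non-adjacent motifs is the sum of the gaps between them, and the spread is the sample standard deviation (divisor n - 1). The names `PROMOTERS`, `pair_distances`, and `cvd` are hypothetical.

```python
from itertools import accumulate
from statistics import mean, stdev
from typing import List, Tuple

# Motif layouts from the slide: motifs in order, then the gaps between neighbours.
Layout = Tuple[List[str], List[int]]
PROMOTERS: List[Layout] = [
    (["M1", "M4", "M2", "M5"], [240, 100, 150]),
    (["M1", "M4", "M5"], [260, 210]),
    (["M4", "M1"], [360]),
    (["M1", "M2", "M5"], [100, 350]),
    (["M1", "M4"], [190]),
    (["M3", "M4", "M5"], [120, 150]),
    (["M5", "M2", "M3", "M1"], [210, 100, 110]),
    (["M2", "M4", "M5"], [18, 21]),
    (["M4", "M3"], [60]),
]


def pair_distances(a: str, b: str, promoters: List[Layout] = PROMOTERS) -> List[int]:
    """Distance between motifs a and b in every promoter that contains both."""
    distances = []
    for motifs, gaps in promoters:
        if a in motifs and b in motifs:
            # Position of each motif relative to the first one (assumed additive gaps).
            positions = [0] + list(accumulate(gaps))
            distances.append(abs(positions[motifs.index(a)] - positions[motifs.index(b)]))
    return distances


def cvd(a: str, b: str, promoters: List[Layout] = PROMOTERS) -> float:
    """Coefficient of variation of distances: sample std. deviation over mean."""
    d = pair_distances(a, b, promoters)
    return stdev(d) / mean(d)  # stdev raises StatisticsError with fewer than 2 distances


d12 = pair_distances("M1", "M2")
print(d12, round(stdev(d12), 1), round(mean(d12), 1), round(cvd("M1", "M2"), 2))
```

A small cvd means the two motifs sit at a nearly constant distance from each other wherever they co-occur, which is presumably the sense of “well-clustered” here.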
Distance-based Association Rules
[Figure: sample distance-based association rule]
• Given:
  • min-support, min-confidence, and max-cvd thresholds
• Mine:
  • all distance-based association rules
Grad. & Undergrad. Students
Ali Benamara. Dharmesh Thakkar. Senthil K Palanisamy. Zachary Stoecker-Sylvia. Keith A. Pray. Jonathan Freyberger. Maged El-Sayed. Parameshvyas Laxminarayan. Aleksandar Icev. Wendy Kogel. Michael Sao Pedro. Christopher Shoemaker. Weiyang Lin. Jonathan Rudolph Eduardo Paredes Iavor N. Trifonov. Takeshi Kawato Cindy Leung and Sam Holmes. John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young. Zachary Stoecker-Sylvia. Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB) Wendy Kogel, Brooke LeClair, Christopher St. Yves. Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB). Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB). Christopher Cole. Michael Ciman and John Gulbrandsen. Tara Halwes Christopher Martino. Matthew Berube. Anna Novikov. Amy Kao and Dana Rock.