ADVENTURES IN DATA MINING. Margaret H. Dunham Southern Methodist University Dallas, Texas 75275 mhd@engr.smu.edu This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841
Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside; eamonn@cs.ucr.edu
The 2000 ozone hole over the Antarctic as seen by EPTOMS http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
Data Mining Outline • Introduction • Techniques • Classification • Clustering • Association Rules • Examples Explore some interesting data mining applications
Introduction • Data is growing at a phenomenal rate • Users expect more sophisticated information • How? UNCOVER HIDDEN INFORMATION DATA MINING
But it isn’t Magic • You must know what you are looking for • You must know how to look for it Suppose you knew that a specific cave had gold: • What would you look for? • How would you look for it? • Might need an expert miner
Description Behavior Associations “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” “If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.” Classification Clustering Link Analysis (Profiling) (Similarity)
CLASSIFICATION Assign data into predefined groups or classes.
Classification Ex: Grading (decision tree on score x) • x >= 90: A • 80 <= x < 90: B • 70 <= x < 80: C • 60 <= x < 70: D • x < 60: F
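The grading example is a one-attribute decision tree: each test on the score x either assigns a class or passes the record to the next test. A minimal Python sketch follows. The function name `grade` is hypothetical, and the last cutoff is assumed to be 60, since the slide's final branch labels do not agree with each other.

```python
def grade(x: float) -> str:
    """Walk the grading decision tree from the root test down to a leaf."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumed cutoff; the slide's last branch is ambiguous
        return "D"
    return "F"


if __name__ == "__main__":
    for score in (95, 85, 72, 60, 42):
        print(score, grade(score))
```

The classes (A through F) are fixed in advance, which is what makes this classification rather than clustering.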
Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. [Figure labels: Katydids, Grasshoppers] (c) Eamonn Keogh, eamonn@cs.ucr.edu
The classification problem can now be expressed as: • Given a training database, predict the class label of a previously unseen instance (c) Eamonn Keogh, eamonn@cs.ucr.edu
[Figure: scatter plot of Katydids and Grasshoppers; axes Antenna Length and Abdomen Length, each from 1 to 10] (c) Eamonn Keogh, eamonn@cs.ucr.edu
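One simple way to predict the class of an unseen insect from two measurements is a nearest-neighbor rule: return the label of the closest training instance. The sketch below is only an illustration. The slides do not name a classifier, the measurements are made up, and `TRAINING` and `nearest_neighbor` are hypothetical names.

```python
import math

# Made-up (abdomen_length, antenna_length) pairs, for illustration only.
TRAINING = [
    ((2.0, 7.0), "Katydid"),
    ((4.0, 8.5), "Katydid"),
    ((5.5, 9.0), "Katydid"),
    ((7.0, 8.0), "Katydid"),
    ((8.5, 9.5), "Katydid"),
    ((1.5, 2.0), "Grasshopper"),
    ((3.0, 3.5), "Grasshopper"),
    ((5.0, 2.5), "Grasshopper"),
    ((6.5, 4.0), "Grasshopper"),
    ((8.0, 3.0), "Grasshopper"),
]


def nearest_neighbor(instance, training=TRAINING):
    """Return the label of the training point closest to `instance` (Euclidean distance)."""
    _, label = min(training, key=lambda item: math.dist(item[0], instance))
    return label


if __name__ == "__main__":
    print(nearest_neighbor((3.0, 8.0)))
    print(nearest_neighbor((6.0, 2.0)))
```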
Facial Recognition (c) Eamonn Keogh, eamonn@cs.ucr.edu
Handwriting Recognition: George Washington Manuscript [Figure: plot with horizontal axis 0 to 450 and vertical axis 0 to 1] (c) Eamonn Keogh, eamonn@cs.ucr.edu
Dallas Morning News October 7, 2005
CLUSTERING Partition data into previously undefined groups.
What is Similarity? (c) Eamonn Keogh, eamonn@cs.ucr.edu
Two Types of Clustering Partitional Hierarchical (c) Eamonn Keogh, eamonn@cs.ucr.edu
Hierarchical Clustering Example: Iris Data Set (Versicolor, Setosa, Virginica) The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188. Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster.
ASSOCIATION RULES/ LINK ANALYSIS Find relationships between data
ASSOCIATION RULES EXAMPLES • People who buy diapers also buy beer • If gene A is highly expressed in this disease then gene B is also expressed • Relationships between people • Book Stores • Department Stores • Advertising • Product Placement http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
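A rule such as "diapers => beer" is usually scored by its support (the fraction of transactions containing every item in the rule) and its confidence (among transactions containing the antecedent, the fraction that also contain the consequent). The slide does not give these measures, so the sketch below uses the standard definitions. The baskets are made up, and `support` and `confidence` are hypothetical helper names.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires in this data
    return support(transactions, set(antecedent) | set(consequent)) / base


if __name__ == "__main__":
    # Made-up market baskets, for illustration only.
    baskets = [
        {"diapers", "beer", "milk"},
        {"diapers", "beer"},
        {"diapers", "bread"},
        {"beer", "chips"},
        {"milk", "bread"},
    ]
    print(support(baskets, {"diapers", "beer"}))
    print(confidence(baskets, {"diapers"}, {"beer"}))
```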
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003. DILBERT reprinted by permission of United Feature Syndicate, Inc.
Data Mining Outline • Introduction • Techniques • Examples • Vision Mining • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior,…) • Bioinformatics
Vision Mining • License Plate Recognition • Red Light Cameras • Toll Booths • http://www.licenseplaterecognition.com/ • Computer Vision • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
No/Little Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Rampant Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495, 2005, p. 287.
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
DNA http://www.visionlearning.com/library/module_viewer.php?mid=63 Basic building blocks of organisms Located in nucleus of cells Composed of 4 nucleotides Two strands bound together
DNA transcription RNA translation Protein Central Dogma: DNA -> RNA -> Protein CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA Amino Acid www.bioalgorithms.info; chapter 6; Gene Prediction
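The two sequences on the slide differ only in T versus U, so the transcription step can be modeled as a character substitution on the coding strand, and translation as a lookup of consecutive three-letter codons. A minimal sketch under those simplifications follows. `CODON_TABLE`, `transcribe`, and `translate` are hypothetical names, and the table is deliberately partial: it holds only the standard genetic code entries needed for the slide's sequence.

```python
# Partial codon table (standard genetic code); only the codons that occur
# in the slide's example sequence are listed.
CODON_TABLE = {
    "CCU": "Pro", "CCA": "Pro",
    "GAG": "Glu", "GAA": "Glu",
    "ACU": "Thr",
    "AUU": "Ile",
    "GAU": "Asp",
}


def transcribe(dna: str) -> str:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> list:
    """Read the RNA three bases at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    # Raises KeyError for any codon missing from the partial table above.
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3)]


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)
    print(translate(rna))
```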
Human Genome Scientists originally thought there would be about 100,000 genes Appear to be about 20,000 WHY? Almost identical to that of Chimps. What makes the difference? Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
RNAi – Nobel Prize in Medicine 2006 siRNA may be artificially added to cell! Double stranded RNA Short Interfering RNA (~20-25 nt) RNA-Induced Silencing Complex Binds to mRNA Cuts RNA Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
miRNA • Short (20-25nt) sequence of noncoding RNA • Known since 1993 but significance not widely appreciated until 2001 • Impact / Prevent translation of mRNA • Generally reduce protein levels without impacting mRNA levels (animal cells) • Functions • Causes some cancers • Guide embryo development • Regulate cell Differentiation • Associated with HIV • …
TCGR – Mature miRNA (Window=5; Pattern=3) [Figure panels: C. elegans, Homo sapiens, Mus musculus, All Mature; pattern labels: ACG, CGC, GCG, UCG]
TCGRs for Xue Training Data C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
Affymetrix GeneChip® Array http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
Microarray Data Analysis • Each probe location associated with gene • Measure the amount of mRNA • Color indicates degree of gene expression • Compare different samples (normal/disease) • Track same sample over time • Questions • Which genes are related to this disease? • Which genes behave in a similar manner? • What is the function of a gene? • Clustering • Hierarchical • K-means
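To find genes that behave in a similar manner, each gene can be treated as a vector of expression levels across samples and grouped with k-means. The sketch below is a plain-Python illustration, not the analysis behind the slides. The expression values are made up, `kmeans` is a hypothetical helper, and the first k points are used as initial centroids so the run is deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Basic k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(p) for p in points[:k]]  # naive deterministic start
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignment, centroids


if __name__ == "__main__":
    # Made-up expression levels for four genes across three samples.
    profiles = [(0.1, 0.2, 0.1), (5.0, 5.2, 4.9), (0.0, 0.3, 0.2), (5.1, 4.8, 5.3)]
    labels, centers = kmeans(profiles, k=2)
    print(labels)
```

Unlike the hierarchical approach, k-means needs the number of clusters up front and its result can depend on the starting centroids.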
Microarray Data - Clustering "Gene expression profiling identifies clinically relevant subtypes of prostate cancer" Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004
BIG BROTHER ? • Total Information Awareness • http://infowar.net/tia/www.darpa.mil/iao/index.htm • http://www.govtech.net/magazine/story.php?id=45918 • http://en.wikipedia.org/wiki/Information_Awareness_Office • Terror Watch List • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/ • http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html • CAPPS • http://www.theregister.co.uk/2004/04/26/airport_security_failures/ • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/ • http://en.wikipedia.org/wiki/CAPPS
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236