Mining Large Data at SDSC Natasha Balac, Ph.D.
A Deluge of Data • Today, data comes from everywhere: scientific instruments, experiments, sensors and sensor nets, new devices • And is used by everyone: scientists, consumers, educators, the general public • IT environments must support unprecedented diversity, globalization, integration, scale, and use • Turning the deluge of data into usable information requires an unprecedented level of integration, globalization, scale, and access • [Figure labels: Geosciences; Data Management and Mining; Modeling and Simulation; Life Sciences; Preservation and Archiving; Astronomy]
Why DATA MINING? • Necessity is the mother of invention • Huge amounts of data • Electronic records of our decisions: choices in the supermarket, financial records, our comings and goings • We swipe our way through the world; every swipe is a record in a database • Data rich, but information poor • Lying hidden in all this data is information!
What is DATA MINING? • Extracting or “mining” knowledge from large amounts of data • Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data • Fundamental idea: learn rules/patterns/relationships automatically from the data
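The fundamental idea on this slide, learning rules automatically from the data, can be illustrated with a very small rule learner. The sketch below uses the 1R (OneR) approach: for each attribute, predict the majority class for each of its values, then keep the attribute whose rule makes the fewest errors. This is a swapped-in textbook technique chosen for brevity, not a method named in the slides; the function name `one_r` and the purchase records are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Learn a one-attribute rule (1R).

    Returns (attribute, rule, errors), where rule maps each value of the
    chosen attribute to the majority class seen with that value.
    """
    best = None
    for attr in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Predict the majority class per value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Every record outside the majority class of its value is an error.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy records, invented purely for illustration.
rows = [
    {"coupon": "yes", "day": "weekend", "buys": "yes"},
    {"coupon": "yes", "day": "weekday", "buys": "yes"},
    {"coupon": "yes", "day": "weekend", "buys": "yes"},
    {"coupon": "no", "day": "weekend", "buys": "no"},
    {"coupon": "no", "day": "weekday", "buys": "no"},
    {"coupon": "no", "day": "weekday", "buys": "yes"},
]

attr, rule, errors = one_r(rows, ["coupon", "day"], "buys")
print(attr, rule, errors)
```

On these toy records the learner picks `coupon` over `day`, because the rule "coupon = yes implies buys = yes, coupon = no implies buys = no" misclassifies one record while any rule on `day` misclassifies two. Nobody wrote that rule by hand; it was derived from the records.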
Terminology • Gold Mining vs. Sand Mining • Knowledge mining from databases • Knowledge extraction • Data/pattern analysis • Knowledge Discovery in Databases (KDD) • Predictive Modeling • Machine Learning • Business Intelligence
CRISP-DM (Cross-Industry Standard Process for Data Mining) • [Figure: CRISP-DM process model]
Data Mining Driven Engineering Product Design • Incorporate parallel computing and data mining capabilities into engineering and optimizing product design models • Complex challenges in new product design: accurate acquisition and interpretation of raw customer data; integrating newly found knowledge into the engineering design process; developing analytical techniques that help reduce the computational time required to generate product portfolios • Mining paid search online customer preference data
A Java-based Data Driven Product Design (DDPD) Platform • The platform integrates the supercomputing resources at SDSC with complex engineering design simulation platforms such as Matlab, in an effort to streamline the product design and development process
Tools in the GUI • Data mining algorithms: Weka, Parallel Weka, Parallel C4.5, and Parallel K-means • The Data Driven Product Design Platform uses Matlab’s powerful computation engine directly from the GUI • Optimization choices available from the user interface include Matlab, Tomlab, Excel Solver, Star-P, Parallel Matlab, Parallel CPLEX, etc.
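For readers unfamiliar with K-means, one of the algorithms listed on this slide, the following is a minimal serial sketch of the standard Lloyd iteration: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until nothing changes. It is not the platform's Parallel K-means, whose implementation the slides do not describe. The function name `kmeans`, the choice of the first `k` points as starting centroids, and the sample points are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd-style K-means on tuples of floats.

    Assumption for this sketch: the first k points serve as the initial
    centroids, which keeps the example deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D sample with two visually obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

In the assignment step each point is handled independently of the others, so that loop is the natural candidate for distribution across processors in a parallel variant. Whether the Parallel K-means named on the slide is organized that way is not stated in the source.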
Visual representation of data mining results, linked with serial optimization models