Scientific Data Mining in ESP2Net Silvia Nittel University of California, Los Angeles
Overview
• Motivation
• What is scientific data mining?
• Examples of scientific data mining at UCLA
• CS interests in scientific data mining
  • Tools
  • Collaboration paradigms
  • Interoperability
GeoSKI 2000, 24 February 2000
Motivation
• The advent of the computer has brought the ability to generate and store huge amounts of data.
  • Business data (databases)
  • Scientific data
• What is it? The process of extracting useful information from such data has become more formalized, and the term "data mining" has been coined for it.
What is data mining?
• Definition: Data mining is the process of extracting previously unknown, comprehensible, valid, and actionable information from large data stores (and using it to make crucial business decisions).
• There are two approaches:
  • Verification driven: the aim is to validate a hypothesis postulated by a user.
  • Discovery driven: information is discovered automatically by the use of appropriate tools.
• The discovery driven approach depends on a more sophisticated and structured search of the data for associations, patterns, rules, or functions, which the analyst then reviews for value.
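The difference between the two approaches can be sketched in a few lines. This is only an illustration under stated assumptions: the toy dataset, the variable names, the 0.8 threshold, and the choice of Pearson correlation as the only "pattern" being searched for are all hypothetical and not taken from the talk. `verify` tests one hypothesis stated by the user; `discover` scans every pair of variables and returns candidates for the analyst to review.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (sx * sy)


def verify(data, a, b, threshold=0.8):
    """Verification driven: test a single hypothesis supplied by the user."""
    return abs(pearson(data[a], data[b])) >= threshold


def discover(data, threshold=0.8):
    """Discovery driven: scan all variable pairs and report strong ones."""
    found = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            found.append((a, b, r))
    return found


# Hypothetical measurements, invented for illustration only.
data = {
    "rainfall": [1.0, 2.0, 3.0, 4.0, 5.0],
    "yield": [2.1, 3.9, 6.2, 7.8, 10.1],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}

hypothesis_holds = verify(data, "rainfall", "yield")
candidates = discover(data)
```

In practice a discovery driven tool would search for far richer structures (rules, clusters, functions) than pairwise correlation; the point here is only that the analyst supplies the question in one case and reviews machine-proposed findings in the other.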
Process
What is scientific data mining?
• Data mining started with "simple information" (business data) of the kind held in a DBMS; this is called OLAP (online analytical processing).
• Scientific data mining:
  • The data is more complex.
  • The data is much larger.
  • A discovery driven approach is often used.
  • Application areas: medicine, biology, physics, weather, …
• Principle of the scientific method:
  • the observation-hypothesis-experiment cycle
• Data mining for science:
  • "observation-hypothesis" is supported by discovery driven mining
  • "hypothesis-experiment" is supported by verification driven mining
Example: Farming Environment
• Goal: optimize crop yield while minimizing the resources supplied.
• How: identify which factors affect the crop yield.
  • One analysis looked at over 64 separate items, measured over a number of years, to extract the items that were significant.
• Initial analysis: discovery driven mining
  • Attempt to find which parameters are significant, either by themselves or in conjunction with others.
  • Use statistical methods to determine the significant parameters and their relative influence.
• Result: derive an equation of interdependence.
• Later on: verify the equation against new datasets via verification driven mining.
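The two phases of the farming example can be sketched as a fit followed by a check on fresh data. This is a deliberately reduced illustration: it assumes a single factor and a straight-line relationship, whereas the analysis described above covered many parameters, and the fertilizer and yield numbers are invented. `fit_line` stands in for deriving the "equation of interdependence"; `r_squared` stands in for verifying that equation against a new dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Fraction of the variance in ys explained by the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: fertilizer applied (kg/ha) and crop yield (t/ha).
train_x = [0, 50, 100, 150, 200]
train_y = [2.0, 3.1, 3.9, 5.1, 6.0]

# Discovery phase: derive the equation from the initial dataset.
intercept, slope = fit_line(train_x, train_y)

# Verification phase: check the equation against a new dataset.
new_x = [25, 75, 125, 175]
new_y = [2.6, 3.4, 4.6, 5.5]
score = r_squared(new_x, new_y, intercept, slope)
```

A score close to 1 on the new dataset supports the derived equation; a low score would send the analyst back to the discovery phase.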
Example: Global Climate Change
• Often a verification driven mining approach.
• Climate data has been collected for many centuries.
  • It is extended into the more distant past through activities such as the analysis of ice core samples from the Antarctic.
• At the same time, a number of different predictive models have been proposed for future climatic conditions.
• Using the predictive models:
  • Take sample data from the past.
  • Verify the predictive models by running them on historical data and then comparing the results with the sample data.
• From this, the models can be refined further and used for another round of verification driven mining.
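The verification loop described above amounts to running each candidate model over past years and scoring it against what was actually observed. The sketch below assumes root-mean-square error as the score and uses two toy models and invented temperature values; none of these choices come from the talk.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root-mean-square error between model output and observations."""
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return sqrt(total / len(observed))


def hindcast_scores(models, years, observed):
    """Run every model on the historical years and score it."""
    return {
        name: rmse([model(year) for year in years], observed)
        for name, model in models.items()
    }


# Hypothetical historical sample: mean temperature (deg C) per year.
years = [1950, 1960, 1970, 1980, 1990]
observed = [14.0, 14.1, 14.1, 14.3, 14.4]

# Two hypothetical predictive models to be verified.
models = {
    "constant": lambda year: 14.0,
    "linear_trend": lambda year: 14.0 + 0.01 * (year - 1950),
}

scores = hindcast_scores(models, years, observed)
best = min(scores, key=scores.get)
```

The model with the lowest error is the natural candidate for refinement, after which the same scoring can be repeated for the next round.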
Scientific Data Mining at UCLA
• Project scope:
  • ESP2Net: Earth Science Partners' Private Network
  • Computer science: UCLA, HRL
  • Earth science: JPL, Scripps, U Arizona
• Scientific data mining:
  • Verification driven approach
  • Large amounts of raster satellite data
Scientific Data Mining at UCLA
• Hypothesis: Coastal rainfall is correlated with remote convective events in the tropical Pacific.
  1. A "warm pool" develops in the tropical Pacific Ocean.
  2. Warm, moist air rises rapidly.
  3. Vigorous convection produces very high, cold clouds.
  4. Storm systems push a "moisture flare" eastward.
  5. Heavy rainfall falls over the Southwest U.S.
[Figure: partner sites JPL, Scripps, and UA connected by a VPN. Operator labels: statistical, correlation, tracking, GLINT, cluster, and matching operators. Dataset labels: TOVS, NVAP, MLS, precipitation, ISCCP DX, ISCCP CL.]
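One simple way to test a hypothesis of this kind is a lagged correlation: shift the rainfall series against the convection series and see at which delay the two line up best. The sketch below is an assumption-laden illustration, not the project's correlation operator; the function names and the daily index values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_lag(driver, response, max_lag):
    """Find the delay at which response[t + lag] best tracks driver[t]."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        xs = driver[: len(driver) - lag]  # driver values that have a later response
        ys = response[lag:]               # response shifted back by `lag` steps
        r = pearson(xs, ys)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Hypothetical daily indices: tropical convection and coastal rainfall.
# The rainfall series is constructed to echo convection two days later.
convection = [0, 1, 5, 2, 0, 0, 4, 1, 0, 0, 3, 0]
rainfall = [0, 0, 0, 1, 5, 2, 0, 0, 4, 1, 0, 0]

lag, r = best_lag(convection, rainfall, max_lag=4)
```

A strong correlation at a non-zero lag would be consistent with the chain of events listed above, although correlation alone does not establish the physical mechanism.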
Visualization
• Convective cloud cluster motion
  • ISCCP CL, March 8-21, 1993 (UA)
• Water vapor motion in the atmosphere
  • NVAP, March 1-31, 1993 (Scripps)
• A different perspective reveals new information
  • NVAP stacking and slicing (JPL)
[Movies: cloud movie, water vapor movie]
Challenges of Scientific Data Mining
• Challenges:
  • Distributed collaboration
    • share results (passive)
    • share analysis processes (active)
  • Leverage partners' expertise and efforts
  • Re-use core analysis tools (operators)
  • Large datasets, decadal time spans (> ½ TB of data)
• Project goal: Build a flexible and extensible framework for scientific investigations that
  • is distributed and internet-based,
  • provides reusable, extensible, efficient tools,
  • addresses interoperability and collaboration.
UCLA Support of Scientific Data Mining
• Re-usable tools:
  • Conquest (CONcurrent Queries in Space and Time)
• Collaboration support:
  • Scientific Markup Language (SEML): XML-based Scientific Experiment Logbook
  • Conquest (distributed queries)
  • Secure collaboration (Virtual Private Networks)
• Interoperability:
  • OpenGIS standard to represent data
  • CORBA
  • Java
Summary
• Scientific data mining is a relatively new research area (first conference in 1994, KDD).
• It brings together:
  • Science (hypothesis)
  • Statistics (methods)
  • Computer science (visualization, animation)