1. MD 240 - Data Management: Warehousing, Analyzing, Mining and Visualization
2. Agenda
Background
Data Management
Data Collection
Data Cleaning, Preparation & Warehousing
Data Analysis
Visual Methods for Discovery & Presentation
Marketing Transaction Databases
3. Background
Until recently, it was difficult for analysts and managers to perform analyses related to their business activities
With the spread of PCs and networked devices …
it has become easier than ever to collect data about activities in an organization
it has become more feasible to transform analysis from a task of the statistician in the back office to salespeople, managers, and analysts closer to the front office
4. Background
Difficulties with data analysis for business intelligence
The amount of data is increasing exponentially
Multiple sources of data … increasing all the time
Only a small portion of the total data collected is usually useful for making a decision
Increasing need for external data
Differing legal requirements about data collection in different countries
Selection of data management tool from the many available tools
Data security, quality, integrity, etc.
5. Data Management
6. Data Management - Data Management Process
Data Life Cycle Process
Data collection
Data stored in databases
Pre-process databases
Clean out junk
Get data close to what decision-makers need
Transformation of data
Make it ready for analysis
Store in data warehouse
Use data mining tools to discover patterns
Create knowledge
Presentation of results
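The life cycle above can be read as a small pipeline. The following is a minimal sketch, not part of the original slides: the records, field names, and helper functions are hypothetical, an in-memory list stands in for the data warehouse, and a simple per-customer total stands in for real data mining.

```python
from collections import Counter

# Hypothetical raw records, as they might arrive from data collection.
raw_records = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "bob", "amount": "n/a"},   # junk: amount is not a number
    {"customer": "ALICE", "amount": "30"},
    {"customer": "", "amount": "15"},       # junk: customer is missing
    {"customer": "Bob", "amount": "45.25"},
]


def is_clean(record):
    """Pre-processing: keep records with a customer name and a numeric amount."""
    if not record["customer"].strip():
        return False
    try:
        float(record["amount"])
    except ValueError:
        return False
    return True


def transform(record):
    """Transformation: normalize the name and convert the amount to a number."""
    return {
        "customer": record["customer"].strip().title(),
        "amount": float(record["amount"]),
    }


# "Store in data warehouse": an in-memory list stands in for a real warehouse.
warehouse = [transform(r) for r in raw_records if is_clean(r)]

# "Discover patterns": a per-customer total stands in for a mining tool.
totals = Counter()
for row in warehouse:
    totals[row["customer"]] += row["amount"]

# Presentation of results.
for customer, total in sorted(totals.items()):
    print(f"{customer}: {total:.2f}")
```

Two of the five raw records are dropped as junk, and the two spellings of "Alice" are merged during transformation before anything is analyzed.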
7. Data Management - Data Management Process
8. Data Management - Data Management Process
9. Data Collection
10. Step #1: Data Collection - Data Sources
11. Step #1: Data Collection - Data Strategy
Fundamental philosophy guiding data collection
GIGO: “garbage in, garbage out”
12. Step #1: Data Collection - Data Sources
Internal data
data/info. about organizational activities
Personal data
data/info. documenting employees’ activities
External data
government, competitors, suppliers
The Internet
“screen scraping” data out of the browser
Commercial database services
Online databases
13. Step #1: Data Collection - Data Capture and Input
Past
Type in by hand
time consuming
costly
many typing errors
Now
Objective is to automate
save the cost of leasing warehouses for paper storage
faster access to documents and information in documents
Document Management Systems
scanners for digitizing archived paper documents
databases for archiving, search, retrieval
14. Step #1: Data Collection - Data Quality (DQ)
Intrinsic DQ:
Accuracy, objectivity, believability, and reputation
Accessibility DQ:
Accessibility and access security
Contextual DQ:
Relevance, value added, timeliness, completeness
Representation DQ:
Interpretability, ease of understanding, concise representation, and consistent representation
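Some of these dimensions can be checked mechanically. The sketch below is illustrative only and not from the slides: the records and both helper functions are hypothetical. It measures completeness (a contextual DQ dimension) and lists the distinct spellings of a field as a rough signal of inconsistent representation.

```python
# Hypothetical customer records with typical quality problems.
records = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": None, "country": "us"},
    {"id": 3, "email": "c@example.com", "country": "USA"},
    {"id": 4, "email": "d@example.com", "country": None},
]


def completeness(rows, field):
    """Share of rows in which `field` is filled in (contextual DQ)."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def distinct_representations(rows, field):
    """Sorted distinct raw values of `field`; several spellings of one
    real-world value suggest inconsistent representation."""
    return sorted({row[field] for row in rows if row.get(field) is not None})


print(completeness(records, "email"))                 # 0.75
print(distinct_representations(records, "country"))   # ['US', 'USA', 'us']
```

Dimensions such as believability or reputation are judgments about the source and are not captured by checks like these.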
15. Data Cleaning, Preparation & Warehousing
16. Steps #2-#5: Data Warehousing - Transactional Processing
Store data in databases
Objectives of TPS
Standardized transactions
Simple computations
non-complex
not very mathematical or statistically oriented
High volume
Low cost
17. Steps #2-#5: Data Warehousing - Transaction vs. Analytical Processing
Task objectives for a useful analytical data delivery system
Easy data access by end users
Quicker decision making
Accurate and effective decision making
Flexible decision making
18. Steps #2-#5: Data Warehousing - Transaction vs. Analytical Processing
Characteristics of a useful analytical data delivery system
Business representation of data for end users
Client-server or Web-based environment that provides end users with query and reporting capability
Server-based repository (data warehouse)
19. Steps #2-#5: Data Warehousing - Data Warehouse and Data Marts
Data Warehouse
establishes a data repository, that ...
makes operational data accessible in a form readily acceptable for analytical processing activities
“Metadata”:
data summaries for faster indexing and searching within data warehouse
data summaries
information on how the data have been organized
Data Mart
dedicated to a functional area, or ...
dedicated to a regional area
20. Steps #2-#5: Data Warehousing - Data Warehouse and Data Marts
21. Steps #2-#5: Data Warehousing - Characteristics of Data Warehousing
Desirable Characteristics for a Data Warehouse
Organization
organized by subject; extraneous items removed
Consistency
identical measurement and representation of same data
Time variant
varies over time; “time-series” data
Nonvolatile
data are not updated once entered
Relational
table-based structure (RDBMS)
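Three of these characteristics (time variant, nonvolatile, relational) can be illustrated with a small table. This is a sketch under stated assumptions, not a warehouse design from the slides: the `sales_fact` table, its columns, and the `load` helper are hypothetical, and an in-memory SQLite database stands in for a real RDBMS.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_fact (
        product TEXT NOT NULL,
        period  TEXT NOT NULL,   -- time variant: one row per period
        revenue REAL NOT NULL,
        PRIMARY KEY (product, period)
    )
    """
)


def load(product, period, revenue):
    """Nonvolatile loading: rows are only appended, never updated.
    Re-loading an existing (product, period) raises sqlite3.IntegrityError."""
    conn.execute(
        "INSERT INTO sales_fact VALUES (?, ?, ?)", (product, period, revenue)
    )


load("widget", "2002-01", 1000.0)
load("widget", "2002-02", 1250.0)
load("widget", "2002-03", 1100.0)

# The stored history can be read back as a time series.
series = conn.execute(
    "SELECT period, revenue FROM sales_fact WHERE product = ? ORDER BY period",
    ("widget",),
).fetchall()
print(series)
```

The primary key on `(product, period)` is one simple way to make an accidental overwrite of history fail loudly instead of silently changing past data.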
22. Steps #2-#5: Data Warehousing - Characteristics of Data Warehousing
Data Warehousing is most suitable for organizations in which …
End users need to access large amounts of data
Operational data are stored in several different systems
Different systems represent the same data in different formats
Management relies on information for decision making
There is a large, diverse customer base
Extensive end-user computing is performed
23. Data Analysis
24. Step #6: Data Analysis - Knowledge Discovery in Databases (KDD)
Foundations of KDD
Massive data collection
Powerful multiprocessor computers
“Intelligent” data mining algorithms
Analyst/manager activities
Ad-Hoc Queries
OLAP Queries
Data Mining
25. Step #6: Data Analysis - Ad Hoc Queries
Ad Hoc Queries
Let users access, navigate, and explore data in real time to make business decisions
Ad hoc query tool requirements
Query creation is easy
Customized query creation
Easy to use interfaces for performing queries
Many data sources are supported
Seamless integration between analysis and reporting
26. Step #6: Data Analysis - OLAP Queries
OLAP
An approach by which important queries and calculations are turned into online tools that managers can use over and over again
Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies
MOLAP … multidimensional OLAP
ROLAP … OLAP using relational databases
WOLAP … web-based OLAP
27. Step #6: Data Analysis - OLAP Queries
Capabilities of Online Analytical Processing (OLAP)
Access very large amounts of data
Analyze the relationships between many types of business elements
Involve aggregated data
Compare aggregated data over hierarchical time periods
Present data in different perspectives
Involve complex calculations between data elements
Able to respond quickly to user requests
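Two of these capabilities, comparing aggregated data over hierarchical time periods and presenting data from different perspectives, can be sketched in a few lines. The example below is illustrative only: the sales rows, the `roll_up` function, and the time-level helpers are hypothetical and do not represent any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical transactions: (date, region, product, units sold).
sales = [
    ("2002-01-15", "East", "Widgets", 100),
    ("2002-02-03", "East", "Gadgets", 150),
    ("2002-02-20", "West", "Widgets", 200),
    ("2002-04-11", "West", "Gadgets", 50),
    ("2002-05-30", "East", "Widgets", 75),
]


# Levels of the time hierarchy: month -> quarter -> year.
def month(date):
    return date[:7]


def quarter(date):
    year_part, month_part, _ = date.split("-")
    return f"{year_part}-Q{(int(month_part) - 1) // 3 + 1}"


def year(date):
    return date[:4]


def roll_up(rows, time_level, dimension=None):
    """Aggregate units by a time level and, optionally, one more dimension."""
    column = {"region": 1, "product": 2}
    totals = defaultdict(int)
    for row in rows:
        key = (time_level(row[0]),)
        if dimension is not None:
            key += (row[column[dimension]],)
        totals[key] += row[3]
    return dict(totals)


print(roll_up(sales, quarter))            # quarterly totals
print(roll_up(sales, year))               # rolled up one level further
print(roll_up(sales, quarter, "region"))  # same data, another perspective
```

Moving from `year` back to `quarter` or `month` corresponds to the drill-down direction of the same hierarchy.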
28. Step #6: Data Analysis - OLAP Queries
OLAP Advantages
Adapt existing decision making tools to the WWW, integrate them with distributed data stores
facilitates “drill-down”
OLAP Shortcomings
Retrospective in nature
More of a reporting-oriented tool
A discovery-oriented tool for flexible data analysis of data already known to have importance
Less of a prediction-oriented tool
29. Step #6: Data Analysis - Data Mining
Objectives of Data Mining
Automate discovery of previously unknown patterns
Automate prediction of
trends
behaviors
events
30. Step #6: Data Analysis - Data Mining
Nature and Characteristics
Data often buried deep within large databases
“Data wants to be Free!”
Data may be consolidated in data warehouse or kept in internet and intranet servers
Usually client-server architecture
31. Step #6: Data Analysis - Data Mining
Nature and Characteristics (cont’d)
Data mining tools extract information buried in corporate files or archived public records
The “miner” is often an end user
“Striking it rich” usually involves finding unexpected, valuable results
Parallel processing computers are often needed to make this analysis fast enough to be useful to managers
32. Step #6: Data Analysis - Data Mining
Common types of data mining
Mining of numerical data
Text mining … group documents or identify themes or information within documents
Documents
Web pages
Web site clickstream/event mining
33. Step #6: Data Analysis - Data Mining
Data Mining yields five types of information
Association
e.g., correlation = 0.5; slope between X and Y = 0.73
Sequences
e.g., biggest, second biggest, etc.
Classifications
e.g., There are 3 types of competitors, use data mining to classify Firm X as a “Type 1” competitor
Clusters
e.g., We don’t know how many types of customers there are … let’s try to discover if we can identify some similar customer groups
Forecasting
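The association example above quotes a correlation and a slope between X and Y. As a minimal sketch of how such figures are obtained, the code below computes the Pearson correlation and the least-squares slope from their textbook formulas. The data points are made up for illustration and do not reproduce the numbers on the slide.

```python
from math import sqrt

# Hypothetical paired observations, e.g. X = ad spend, Y = units sold.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def mean(values):
    return sum(values) / len(values)


def covariance_sum(a, b):
    """Sum of products of deviations from the means of a and b."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


def correlation(a, b):
    """Pearson correlation: strength of the linear association, in [-1, 1]."""
    return covariance_sum(a, b) / sqrt(covariance_sum(a, a) * covariance_sum(b, b))


def slope(a, b):
    """Least-squares slope of b regressed on a: change in b per unit of a."""
    return covariance_sum(a, b) / covariance_sum(a, a)


print(round(correlation(xs, ys), 4))  # 0.7746
print(round(slope(xs, ys), 4))        # 0.6
```

The correlation says how tightly the points follow a line, while the slope says how steep that line is; the two answer different questions about the same association.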
34. Step #6: Data Analysis - Data Mining Techniques/Tools
Computer Science
Case-based reasoning
Neural computing
Intelligent agents
Others: decision trees, genetic algorithms, nearest neighbor method, and rule induction
Statistics
Cluster analysis
Most standard statistical tools (SAS, SPSS)
Optimization
35. Step #6: Data Analysis - Data Mining Techniques/Tools
36. Step #6: Data Analysis - Data Mining Vendors
Vendors
SAS Enterprise Miner
SPSS Business Intelligence
Insightful (www.insightful.com)
Microsoft Research
IBM
Blue Martini
Amdocs
DBMiner (www.dbminer.com)
PrudSys (www.prudsys.de)
Boston Area … Torrent (www.torrent.com), ThinkAnalytics (www.thinkanalytics.com)
Learning Resources
Association for Computing Machinery (ACM) SIGKDD
KDD2002 conference (July 2002)
37. Visual Methods for Discovery & Presentation
38. Steps #6&7: Data Visualization - Multidimensionality
Multidimensionality
“real-world” data typically have more than 2 or 3 dimensions
managerial analyses may require presentation of up to 7 or 8 dimensions to fully communicate discoveries
Three factors
dimensions
measures
time
Solution:
technology that is flexible enough so that data can be organized the way managers prefer to see the data
40. Steps #6&7: Data Visualization - Presenting Multidimensional Data
Data visualization involves presentation of data by digital technology
graphical user interfaces
digital images
geographical information systems
multidimensional tables and graphs
virtual reality
three-dimensional presentations
animation
41. Steps #6&7: Data Visualization - Presenting Multidimensional Data
Low Tech Solutions … for a few dimensions
Multidimensional Tables
reduce many dimensions down to 2D table format
“Slicing and Dicing”
Data “rotation”
ability to easily switch the 3 variables being analyzed and rotate 3D graphs on a computer screen
High Tech Solutions … for many dimensions
See Edward Tufte’s books
The Visual Display of Quantitative Information
Envisioning Information
Visual Explanations
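The "slicing" and "rotation" ideas from the low-tech solutions on this slide can be sketched with a tiny three-dimensional data set. The example is illustrative only: the `cube` contents, the dimension names, and the `slice_cube` and `print_table` helpers are all hypothetical.

```python
# Hypothetical cube: (region, product, quarter) -> units sold.
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("East", "Widgets", "Q1"): 100,
    ("East", "Gadgets", "Q1"): 150,
    ("West", "Widgets", "Q1"): 200,
    ("East", "Widgets", "Q2"): 75,
    ("West", "Gadgets", "Q2"): 50,
}


def slice_cube(data, dimension, value):
    """Fix one dimension at `value`; return a 2D table keyed by the other two."""
    axis = DIMENSIONS.index(dimension)
    return {
        tuple(part for i, part in enumerate(key) if i != axis): measure
        for key, measure in data.items()
        if key[axis] == value
    }


def print_table(table):
    """Print a 2D table; combinations with no data are shown as 0."""
    rows = sorted({row for row, _ in table})
    cols = sorted({col for _, col in table})
    print(" " * 10 + "".join(f"{col:>10}" for col in cols))
    for row in rows:
        cells = "".join(f"{table.get((row, col), 0):>10}" for col in cols)
        print(f"{row:<10}" + cells)


# Slice: region x product for the first quarter.
print_table(slice_cube(cube, "quarter", "Q1"))

# "Rotate": fix a different dimension to see product x quarter for one region.
print_table(slice_cube(cube, "region", "East"))
```

Each call reduces the three dimensions to a two-dimensional table, which is the format a manager can read on a screen or a printed page.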
42. Steps #6&7: Data Visualization - Geographical Information Systems (GIS)
GIS
A computer-based system for capturing, storing, checking, integrating, manipulating, and displaying data using digitized maps.
Plot data or present data analysis findings by …
latitude and longitude
cities, major metropolitan areas
counties
states
nations
43. Steps #6&7: Data Visualization - Geographical Information Systems (GIS)
Emerging GIS Applications
Sophisticated user interfaces
Multimedia, 3D graphics, animated and interactive maps
Integration of GIS and GPS
Reengineer aviation and shipping industries
Intelligent GIS (integration of GIS and ES)
Hand-held applications
Deploy mapping tools to PDAs and Java-based cell phones
Web applications
ESRI’s ArcData GIS
44. Steps #6&7: Data Visualization - Geographical Information Systems (GIS)
Vendors
ESRI (www.esri.com)
Arc/Info
ArcData Online (www.esri.com/data/online/index.html)
Resources
www.gis.com
www.gisday.com
www.state.ma.us/mgis/
www.northeastarc.org
45. Steps #6&7: Data Visualization - Other Visualization Tools
Visual Interactive Modeling
visual modeling of a system
Visual Interactive Simulation
a visual front end to a simulation program
presents animation of system activities and statistical results during a simulation run
Real-time simulation … users can interact with the simulation model (prototyping, training, entertainment, video games)
Virtual Reality
Fake environments that attempt to fool the viewer into perceiving that they are within a 3D world
Usually involves a headset, gloves, and other forms of sensory input/output devices
46. Marketing Transaction Databases
47. Application Area: Marketing - Marketing Transaction Database (MTD)
… a new kind of database, oriented toward targeting and personalizing marketing messages in real time.
48. Application Area: Marketing - Marketing Transaction Database (MTD)
Purpose: targeting and personalization
Structure: liquid - driven by real-time marketing
Updates: real-time
Data level: individual detail
Data type: demographic (descriptive), behavioral, derivative
Advantages: allows real-time analysis and decision-making, CRM
Issues: emerging, no standards, not integrated with other systems