400 likes | 424 Views
M. D. Metadata Solutions. Introduction to Dimensional Analysis Session 2. 5/20/2005. Dan McCreary President Dan McCreary & Associates dan@danmccreary.com (952) 931-9198. Agenda. General introduction to Data Dictionaries that drive Business Intelligence (BI) concepts and terminology
E N D
M D Metadata Solutions Introduction to Dimensional AnalysisSession 2 5/20/2005 Dan McCreary President Dan McCreary & Associates dan@danmccreary.com (952) 931-9198
Agenda • General introduction to Data Dictionaries that drive Business Intelligence (BI) concepts and terminology • Understand why Data Dictionaries are so critical in accurate BI • Understand how BI looks at the world in different ways • Understand how data warehouse concepts and data dictionary impact analysis and research
What is a Data Warehouse? • Fast Retrieval • Internally Consistent • Slice and Dice Capability • Easy to “Browse” • Complete and Reliable • Data Quality Controls • GI-GO (Garbage-In, Garbage-Out) Source: Ralph Kimball
Factors Driving Business Intelligence • Computer process and store twice as much data per dollar every 18 months (Moore’s Law) • People can make better decisions if they have tools to quickly see only the data they are interested in seeing • People frequently want to analyze data in new ways that was unanticipated by people creating "canned reports" • Tools can be designed to allow non-technical (non-SQL programmers) to generate their own reports • People have an incredible ability to categorize things base on their properties and attributes but if they don't have consistent definitions of these properties they will not be generating consistent results
The BI Iterative Process Access Data Warehouse • The BI process in an on-going iterative process where the structure of the data warehouse changes based on what data is critical to an organizations business objectives. Analysis BI ProjectManagement Publishing, Change, DataGap Analysis, New DataGathered Insights, Conclusions andFindings
BI Evolution • Shorten the time-to-report interval • Allow users to "browse" data sets interactively • Remove programmers with "backlogs" of reports • Users frequently waited days, weeks for months to get a custom report created Increasing Responsiveness Monthly Green Bar Reports BrowseableGraphical Interface
Dimensions of BI Technical SophisticationRequired Low (analysts) Highly Responsive to "What If" Scenarios Degree ofEnd User Control High (programmers) Few Dimensions few parameters, few filters Many Dimensions many variables
Overlapping Terminology Data Mining Statistical Analysis Data Dictionaries Data Modeling Semantics PatternDiscovery Business Intelligence Data Warehousing Transaction Processing (OLTP) Indexing Dimensional Analysis Aggregates Data Storage (RDBMS)
Key Terms Covered in This Class • Properties • Dimension • Aggregation and Levels • Enumerations of Categorical Data • Labeling Categories • Giving precise definitions to Labels • Dimension Hierarchies and Levels • Cubes • Measures • Filters • Data Warehouse Presentation
Things Have Many "Properties" People are very good at recognizing and sorting things by their properties.
Sorting by an Object's Property • Sort objects by their color
Sorting by A Property • Sort objects by their shape
Sorting by Color AND Shape Shape “Dimension” Color “Dimension”
Dimensional Analysis • The science of figuring out intuitive ways that people want to categorize information using independent variables to graphically filter and browse their data
Dimension • List of categories used to partition the information based on a property of the objects • Dimension Names: Color, Shape
Labels • A name given to a non-overlapping category within a dimensions Labels "red" "blue" "green"
Enumeration • Whenever we decide to break the continuous observable world into a predefined list of categories when each category has a label we call this an "enumerated value domain". These will then become the "dimensions" of our cube. "green" "red" "blue" Note: NO OVERLAP! Statisticians call this type of "categorical data" and it requires the categories to be non-overlapping.
The Challenge of Semantic Classification • People are good at sorting based on a property they see • People are good at assigning names to a property type • People usually come up with different names for properties • Some dimensions people easily agree on • Some are very difficult to classifyand even more difficult go get peopleto agree on a non-overlapping classification system "Red Circle" "Polygon" "Square" "Blue" "Green" "Blue-Green" What happens with a small percentage of data does not quite fit into a discrete category?
Level • A layer of "aggregation" within a single dimension – categorization of properties All Shapes Shapes With Curves Shapes Without Curves Levels Circle Heart Moon Square Trapezoid Star Diamond
Measures (example weight) 5.7 7.4 3.5 6.6 10 1.1 8.2 6.1 5.5 3.8 9.1 9.3 2.6 8.4 A measure is any property that you can perform math on (sums, averages).
Measures • Something that you can do math on. - X + % / sum average
Sample Object "Fact Table" Measures tend to have data types of integers and floating point numbers. Note that categorical data can not beadded together. But we can count thefrequencies of items with a category!
Shape Dimension Note that there is no reference to "Has Curves" in the prior table. "HasCurves" is a property of the shape value domain because it can be "inferred"from the shape of the object. Some categorical definitions use "exclusionary" language.Note that "Has Curves" also must have a precise definition in the data dictionary.
Facts and Dimension Shape Facts Color Dim Color_FK Color Name Shape_FK Shape Dim Weight Has Curves Shape Name Note that "Has curves" does not need to be in the central fact table.It is a property of the shape!
Adding Dimensions 5.7 7.4 3.5 6.6 10 1.1 8.2 6.1 5.5 3.8 9.1 9.3 2.6 8.4 We have now added a 3rd dimension – "Dash Style"
Each New Property is Another Dimension Shape Facts Color Dim Color_FK Shape Dim Color Code Shape_FK Has Curves DashStyle_FK Shape Code Weight DashStype Dim Shape
Filters 5.7 7.4 3.5 6.6 10 1.1 8.2 6.1 5.5 3.8 9.1 9.3 2.6 8.4 A filter will exclude all objects with a specified property. For example we can exclude all shapes with a property of "Circle"
Example: Discarding Invalid Scores This example filter removes all scores EXCEPT the valid scores.
Selecting Only Scale Scores This filter removes all scores EXCEPT the assessments Scale Scoreusing the Test Score Type dimension.
Dim1 Dim2 Dim4 Dim3 Dim5 PK PK PK PK PK Cat1 Cat1 Cat1 Cat1 Cat1 Facts Primary Key Cat2 Cat2 Cat2 Cat2 Cat2 Foreign Key Cat3 Cat3 Cat3 Cat3 Cat3 Foreign Key Foreign Key Foreign Key Foreign Key Measure1 Measure1 The Star Schema
Adding Measures Shape Facts Color Dim Color_FK ColorCode Shape_FK Shape Dim DashStyle_FK ShapeCode WeightValue DashStype Dim Measures are Integers or floats that you can perform math on. HeightValue ShapeCode PriceAmount DensityValue Measures can be easily be added to the fact table without changing any of the dimensions.
5.7 7.4 3.5 6.6 10 1.1 8.2 6.1 5.5 3.8 9.1 9.3 2.6 8.4 Cube • A Cube is a pre-built structure that has facts and many dimensions (not necessarily just three) • Designed to have averages and sums for most levels "pre-calculated" to make analysis fast Color Dimension Dash-Style Dimension Shape Dimension
5.7 7.4 3.5 6.6 10 1.1 8.2 6.1 5.5 3.8 9.1 9.3 2.6 8.4 Build a Mental Model (aka "Page Fields") Filter Funnel Measure = count Vertical Dimension (rows) Presentation Horizontal Dimension (columns)
Using Cubes in Excel Filter Dropped Here Measures Dropped Here Row and Column Dropped Here
Count of Year vs. Assessment Name The measure is the count of records in the cube. The Column Dimension is the "Fiscal Year" There are around 25 million test results The Row Dimension is the "Test Name".
Conformed Dimensions • When building many cubes, there is a large benefit to "reusing" dimensions • Commonly reused dimensions • Time (Fiscal Year, Quarter) • Organization (School, District) • Expense Category • Student
Each bar represents the sum of all the expendituresin the category (Expenditures On girls athletics for the fiscalyear 1991)
Sample of National Conformed Dimensions School Incident Data School Technology Assessment Organization Financial Student Teacher Claims Process Date Student Attendance Student Assessment District Financial Reporting School and District Status Teacher Licensing School Food and Nutrition Student Disciplinary Reporting Student Safety Reporting District Technology Planning
Role of Data Architecture • Facilitate how business users want to identify and categorize data • Assist in the creation and documentation of categorical value domains and measures • Creation of machine-readable data dictionaries for use in building data warehouse structures
Summary • We found a way for non-SQL programmers to analyze complex data by looking at one dimension at a time • Users don't have to memorize "codes" • Users do need to understand how continuous data is mapped into categories and what the labels on these categories mean