ISQS 6339, Business Intelligence: Supplemental Notes on the Term Project
Zhangxi Lin, Texas Tech University
Term project • 3-6 students form a team to complete a data mart development project. • Stage 1 (10%): SQL Server project proposal. Due March 4 • Stage 2 (25%): Data mart implementation. Due March 27 • Stage 3 (10%): Hadoop project proposal. Due April 15 • Stage 4 (25%): Hadoop project completed. Due April 29 • Stage 5 (10%): Presentation. May 1 & 6 • Stage 6 (30%): Final report. Due May 13 • Detailed instructions: http://zlin.ba.ttu.edu/6339/Projects14.html
Merits of the project outcomes • A carefully developed project proposal demonstrating an understanding of the business requirements, attractive analytics themes, and a clearly defined project goal and objectives • A comprehensive data mart design, for example with multiple fact tables, supporting the analytic themes • Application of advanced ETL models or techniques, such as slowly changing dimensions, the use of containers, etc. • Advanced OLAP cube design, and/or optional self-taught MDX scripting • Rich data analysis outcomes • A well-presented final report • Demonstration of creative ideas and skillful data warehousing ability
Project Datasets ISQS 6339 2014-01
Qiyi (奇易网) Data of Online Purchases • Website: http://www.6695.com • Data format: MySQL • Data size: about 3 GB • Data structure: 11 tables • The data are to be installed in MySQL and can be used for two purposes • Data warehousing with Hadoop/HBase, MySQL, or SQL Server 2008 • Data mining for credit assessment or purchase preference (ISQS 6347) • Note: some contents are in Chinese
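The block below is a minimal sketch of pulling one table of the MySQL copy into Python for a first look before deciding how to stage it; the connection string, database name ("qiyi"), and table name ("orders") are placeholders, not part of the dataset documentation.

```python
# Sketch: pull one Qiyi table from MySQL into pandas for initial inspection.
# Credentials, database name, and table name are placeholders -- substitute
# whatever your own MySQL installation uses.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/qiyi")

orders = pd.read_sql("SELECT * FROM orders LIMIT 1000", engine)
print(orders.dtypes)   # column types are a first input to source-to-target mapping
print(orders.head())
```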
Vehicle locations provided by Beijing 1039 traffic radio • Website: http://www.fm1039.com/index/index.htm • Data format: MongoDB • Data size: about 3 GB of traffic status for one week • Data structure: (x, y) coordinates of vehicles' locations • The data are to be installed in MongoDB • They can then be loaded into either Hadoop or SQL Server 2008 for traffic condition analyses
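A comparable sketch for the MongoDB copy, assuming pymongo is available; the database name, collection name, and field names are invented placeholders, since the schema is not described beyond (x, y) locations.

```python
# Sketch: inspect the vehicle-location documents stored in MongoDB.
# The database ("traffic1039"), collection ("vehicle_locations"), and fields
# shown are placeholders for whatever the loaded data actually uses.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
locations = client["traffic1039"]["vehicle_locations"]

print(locations.estimated_document_count())   # rough size of the week's data
print(locations.find_one())                   # one document, e.g. {"x": ..., "y": ..., "time": ...}
```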
Lending Club Data • Website: https://www.lendingclub.com/ • Lending Club is a well-known P2P lending company planning to go public in 2014; Google bought a 7% share in 2013 for $125 million • Data size: 5 compressed tables, about 80 MB in total
Prosper.com dataset • Website: http://www.prosper.com • Prosper.com is another well-known P2P lending company, with a business model that differs from Lending Club's • Data size: to be determined
Helsinki cell phone user data • Activity data from 15 users • Data structure: 20-22 tables • Data size: about 30 GB
Components (technology stack) • Load Balancer • Oozie • Solr, SolrCloud, SolrJ, HA • NewSQL • Kafka, Storm, Impala • REST • ZooKeeper (ZK) • MySQL • Nginx/HAProxy • Flume • Sqoop • Ganglia • Tomcat, Jetty • Avro
Data Warehousing Methodology: Implementing a data warehouse systematically
Dimensional Modeling Process • Preparation • Identifying roles and participants • Understanding the data architecture strategy • Setting up the modeling environment • Establishing naming conventions • Data profiling and research • Data profiling and source system exploration • Interacting with source system experts • Identifying core business users • Studying existing reporting systems • Building dimensional models • High-level dimensional model design (see the sketch below) • Identifying dimension and fact attributes • Developing the detailed dimensional model • Testing the model • Reviewing and validating the model
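As a concrete illustration of the high-level dimensional model design step, the sketch below records a star schema as plain Python structures; the fact table, dimensions, and attribute names belong to a hypothetical purchase data mart and are not taken from any of the project datasets.

```python
# Sketch: a high-level dimensional model for a hypothetical purchase data mart.
# Table, attribute, and measure names are illustrative only.
star_schema = {
    "fact_purchase": {
        "grain": "one row per order line",
        "measures": ["quantity", "unit_price", "discount_amount"],
        "dimension_keys": ["date_key", "customer_key", "product_key"],
    },
    "dim_date":     {"attributes": ["date_key", "calendar_date", "month", "quarter", "year"]},
    "dim_customer": {"attributes": ["customer_key", "customer_id", "region", "segment"]},
    "dim_product":  {"attributes": ["product_key", "product_id", "category", "brand"]},
}

# The same structure can drive later steps: generating DDL, documenting the model
# for review, or checking that every fact foreign key has a matching dimension.
for key in star_schema["fact_purchase"]["dimension_keys"]:
    assert any(key in t.get("attributes", [])
               for name, t in star_schema.items() if name.startswith("dim_"))
```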
Business Dimensional Lifecycle (Kimball): Project Planning, then Business Requirements Definition, then three parallel tracks • Technology track: Technical Architecture Design, Product Selection & Installation • Data track: Dimensional Modeling, Physical Design, ETL Design & Development • BI application track: BI Application Specification, BI Application Development • The tracks converge in Deployment, followed by Maintenance and Growth, with Project Management running throughout
Data Profiling • Data profiling is a methodology for learning about the characteristics of the data • It is a hierarchical process that attempts to build an assessment of the metadata associated with a collection of data sets • Three levels • Bottom level – characterizing the values associated with individual attributes (columns) • Middle level – assessing relationships between multiple columns within a single table • Highest level – describing relationships that exist between data attributes across different tables • A program can be run against the sandbox source system to obtain the needed information (a minimal sketch follows)
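The block below is a minimal sketch of bottom-level profiling, assuming the data have already been extracted into a pandas DataFrame; the toy table and column names are placeholders.

```python
# Sketch: bottom-level data profiling -- characterize the values of each column.
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column metadata: type, null rate, distinct count, example values."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "null_pct": round(s.isna().mean() * 100, 2),
            "distinct": s.nunique(dropna=True),
            "examples": s.dropna().unique()[:3].tolist(),
        })
    return pd.DataFrame(rows)

# Toy frame standing in for a sandbox extract.
df = pd.DataFrame({"customer_id": [1, 2, 2, None], "region": ["N", "S", "S", "S"]})
print(profile_columns(df))
```

The middle and highest levels extend the same idea to column pairs within a table (e.g., functional dependencies) and to key relationships across tables.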
ETL Methodology • Develop a high-level map • Build a sandbox source system (optional) • Perform detailed data profiling • Make decisions • The source-to-target mapping (see the sketch below) • How often to load each table • The strategy for partitioning the relational and Analysis Services fact tables • The strategy for extracting data from each source system • De-duplicate key data from each source system (optional) • Develop a strategy for distributing dimension tables across multiple database servers (optional)
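One way to make the source-to-target mapping decision concrete is to record it as data rather than prose, so it can be reviewed and later drive the ETL code. The sketch below is a minimal example; every table, column, and rule named in it is invented.

```python
# Sketch: a source-to-target mapping captured as data.
# All table and column names are placeholders.
SOURCE_TO_TARGET = [
    # (source table.column,  target table.column,          transformation rule)
    ("sales.order_id",       "fact_purchase.order_id",     "copy"),
    ("sales.order_date",     "fact_purchase.date_key",     "lookup in dim_date"),
    ("sales.cust_no",        "fact_purchase.customer_key", "lookup in dim_customer"),
    ("sales.amount",         "fact_purchase.unit_price",   "cast to DECIMAL(10,2)"),
]

for src, tgt, rule in SOURCE_TO_TARGET:
    print(f"{src:18s} -> {tgt:28s} [{rule}]")
```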
Sandbox Source System • Sandbox • A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system • Also a term for the R&D department at many software and computer companies; the term is half-derisive, but reflects the truth that research is a form of creative play • In the DW/BI context, a sandbox source system is a subset of the source database used for analytic exploration tasks • How to create one • Set up a static snapshot of the database • By sampling (see the sketch below)
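A minimal sketch of the sampling approach, assuming a SQL Server source reachable through SQLAlchemy/pyodbc and a local SQLite file as the sandbox; the connection string, driver, and table names are placeholders.

```python
# Sketch: build a sandbox by sampling roughly 1% of a large source table into
# a local SQLite file.  Connection details and table names are placeholders.
import sqlite3
import pandas as pd
from sqlalchemy import create_engine

source = create_engine(
    "mssql+pyodbc://user:password@SOURCE_HOST/SalesDB?driver=ODBC+Driver+17+for+SQL+Server"
)

# TABLESAMPLE gives an approximate page-level 1% sample on SQL Server.
sample = pd.read_sql("SELECT * FROM dbo.Orders TABLESAMPLE (1 PERCENT)", source)

sandbox = sqlite3.connect("sandbox.db")          # static snapshot for exploration
sample.to_sql("orders", sandbox, if_exists="replace", index=False)
```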
Decision Issues in ETL System Design • Source-to-target mapping • Load frequency • How much history is needed
Strategies for Extracting Data • Extracting data from packaged source systems (self-contained data sources) • It may not be good to use their APIs • It may not be good to use their add-on analytic systems • Extracting directly from the source databases • Strategies vary depending on the nature of the source database • Extracting data for incremental loads (see the sketch below) • Depends on how the source database records changes to rows • Extracting historical data
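When the source does record row changes, a common incremental-load pattern is a watermark on a last-modified column. The block below is a minimal sketch of that pattern only; the connection string, table, and column names are placeholders, and it assumes such a column exists in the source.

```python
# Sketch: incremental extraction driven by a "last_modified" watermark column.
# Works only if the source records row changes in such a column; connection
# string, table, and column names are placeholders.
import json
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@source_host/sales")
WATERMARK_FILE = "watermark.json"

def load_watermark():
    try:
        with open(WATERMARK_FILE) as f:
            return json.load(f)["last_modified"]
    except FileNotFoundError:
        return "1900-01-01 00:00:00"   # first run: pull full history

def save_watermark(value):
    with open(WATERMARK_FILE, "w") as f:
        json.dump({"last_modified": str(value)}, f)

since = load_watermark()
# For brevity the watermark is interpolated directly; real ETL code should use
# bound parameters instead of string formatting.
changed = pd.read_sql(f"SELECT * FROM orders WHERE last_modified > '{since}'", engine)
if not changed.empty:
    save_watermark(changed["last_modified"].max())
# 'changed' now holds only the new or updated rows to push into staging.
```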