160 likes | 168 Views
Learn about BigEcomm's tables, access data in SQLite browser, and handle dirty data in SQL. Understand SQL basics, types, permissions, and servers.
E N D
Module 1: Quick Notes on SQL & Getting Started With This Course
Our Data BigEcomm – a fictitious online retailer. As you’ll see in the browser walkthrough, we have 4 tables. These are all joined through customer_id.
Our Tables Customer Contains many features about the customer. How much they paid in commission in August 2018, location, social media usage, ever visited the company blog, etc. Webscrapingdata Page load time, word count, form count, language on their website, has a shopping cart, etc. Salescall Dates that calls with sales were opened and closed, type of call resolution, and call type. Billedservices Currency of the billed service, how much was paid, status, etc. And much more….
SQLite Browser The SQL browser with data already loaded can be found here: https://kristenkehrer.github.io/datamovesme-sqlcourse/ You’ll write all SQL queries for this course in this browser, and it’ll return your results. You can then use the results of those queries to answer the quiz questions throughout the course. Let’s do a walkthrough of the browser……..
Type of Dirty Data Incomplete Data NULLs. Incorrect Data US States shouldn’t include anything other than the 50 states. Page load time shouldn’t include entries that are not numeric. Inconsistent Data If data is redundant (same information in multiple tables) it is possible that someone could have 2 different addresses in the same database. This is a typical use case in healthcare data. Inaccurate Data If you had rates > 1, you would know that the underlying calculation is causing incorrect data. Because rates are between [0,1]. Business Rule Violation Someone can’t covert before they’ve ever visited the website. If a purchase date was before a first website visit, this would be a violation of the business rules. Duplicate Data In our data, the customer table should be unique (one row per customer). We’ll see this isn’t the case. We’ll cover these scenarios in this course….
What is SQL Structured Query Language – This is a language for relational databases. Data is persistent – data sits on a server that is local or remote. The data is in 2 dimensional tables (rows and columns) and they’re logically connected. NoSQL is not a relational database and is typically used for very large datasets sometimes stored in JSON format.
More on SQL SQL is not case sensitive Queries end in semicolons There is much more to learn about SQL than is covered in this course, but we’re looking to make you an effective data scientist. The aim is not to become a DBA (database admin).
Data Types Numeric (int, float) String (varchar) Dates (datetime)
STAR Schema Can make queries for modeling purposes quite long. Picture taken from Wikipedia.org
Database Permissions Typically as a data scientist we are not working with the production tables. You can create new tables, delete tables in a “sandbox”. Set your queries up to a “scheduler” to get your data regularly refreshed in the database or work with IT or BI to productionalize your table.
SQL Servers MySQL Oracle DB2 Postgres Microsoft SQL Server SQLite – what we’re using in this course.
This is the end of Module 1 Congrats! Now let’s have a quick quiz and then dive in to some SQL.