340 likes | 619 Views
CSE 482: Big Data Analysis. Lecture 4: SQL and Database Querying. Outline of Today’s Lecture. Previous lecture talks about how to collect online data from the Web Today’s lecture focuses on how to store the data into a database What is a database? What is a database management system?
E N D
CSE 482: Big Data Analysis Lecture 4: SQL and Database Querying
Outline of Today’s Lecture • Previous lecture talks about how to collect online data from the Web • Today’s lecture focuses on how to store the data into a database • What is a database? • What is a database management system? • What is SQL (structured query language)? • Using SQL with python
What is a Database? • Collection of related data central to a given enterprise • Examples: • Banking – savings/checking accounts, mortgage, etc • Vehicle registration – car registration, year, make, etc • Student registration – name, PID, GPA, last semester enrolled, etc • Electronic Medical Records – name, SSN, date of birth, address, symptoms, diseases, medication, test results, etc • Sales – customer purchase transactions, customer complaints, etc
Database Management System (DBMS) • A collection of programs that enables users to create and maintain a database • Examples of DBMS • MS Access, MS SQL Server, IBM DB2, Oracle, Sybase, Postgres, mySQL, and many more • These are relational DBMS (which mean, these systems would model the data as a collection of relations/tables) • We will use mySQL as our DBMS for this class
MySQL Database Account • Every registered student will have access to a MySQL account on mysql-user.cse.msu.edu • To log in, go to: • http://www.cse.msu.edu/facility/phpMyAdmin/index.php • Username is your CSE usernamePassword is your PID (if you’re using it the first time)Server Choice: mysql-user • Send an email to manager@cse.msu.edu if you have problems logging in • Note that your database password is not the same as your CSE account password (unless you set them to be the same)
SQL (Structured Query Language) • A language provided by relational DBMS to • Create a table • Alter the schema (structure) of a table • Insert data records into a table • Delete data records from a table • Query a table • etc
Edit is a relationship between user and article Example: Wikipedia Database
CREATE TABLE CREATE TABLE User(Username VARCHAR(50) PRIMARY KEY,IS_Admin BOOLEAN,Content VARCHAR(500)); CREATE TABLE Article (Title VARCHAR(50) PRIMARY KEY,Content VARCHAR(500) ); A primary key is used to uniquely identify the records in a table
CREATE TABLE CREATE TABLE Edit ( Editor VARCHAR(50), Edited_Title VARCHAR(50), Modified_Time DATETIME, PRIMARY KEY(Editor, Edited_Title, Modified_Time); FOREIGN KEY (Editor) REFERENCES User(Username),FOREIGN KEY (Edited_Title) REFERENCES Article(Title) ); A foreign key is used to enforce the constraint that rows which were referenced in the Edit table exists in the referenced table
Example of Foreign Key Inserting a row (XYZ, MySQL, 2013-01-01 10:10:10) is a violation of foreign key constraint because there is no Article titled MySQL
MySQL Example You can also use phpMyAdmin interface to directly access the database:https://www.cse.msu.edu/phpMyAdmin/ Option to create new table
MySQL Example Enter the columns of the table and their constraints
Summary of useful MySQL commands • set password=password(‘new password’); • show databases; -- show the list of databases available to you • use dbname; -- use the database called dbname • show tables; -- show tables available • describe User; • truncate table User; • drop table User; • source script-file; -- executing SQL commands from a script-file • load data infile /path/file.txt into table User;
DROP TABLE • Delete a table from the database • After deletion, the table can no longer be used in queries, updates, or any other commands since its description no longer exists • Example:DROP TABLE USER;
ALTER TABLE • Add a new column to an existing table: ALTER TABLE USER ADD BIRTHDATE DATETIME; • Remove a column from an existing table: ALTER TABLE USER DROP COLUMN BIRTHDATE; • Modify an existing column in a table: ALTER TABLE USER MODIFY BIRTHDATE INTEGER DEFAULT 1900; • Modify constraints of a table: ALTER TABLE EDIT DROP PRIMARY KEY;
INSERT or DELETE Operation INSERT INTO Article VALUES (‘Database’, ‘$@%@23’); INSERT INTO Article VALUES (‘Michigan’, ‘#525s36’); DELETE FROM Article WHERE Title=‘Database’; DELETE FROM Article WHERE Content=‘#525s35’;
Retrieval Queries in SQL • Basic form of the SQL retrieval queries: SELECT <attribute list> FROM <table list> WHERE <condition> • <attribute list> is a list of column names whose values are to be retrieved by the query • <table list> is a list of table names required to process the query • <condition> is a conditional (Boolean) expression that identifies the rows to be retrieved by the query
SQL Query (for 1 Table) Query: Find the names of users who have administrative privileges SELECT Username FROM User WHERE isAdmin = true;
Query: Show all the records in the User table SQL Query (for 1 Table) A wildcard (*) in the SELECT clause means retrieve all the columns SELECT * FROM User; Query has no WHERE clause (which means all rows will satisfy the condition)
Query: Count the number of edits made by XYZ SQL Query (for 1 Table) Count is an aggregate function; used to aggregate the rows SELECT Count(*) FROM Edit WHERE Editor=‘XYZ’;
Query: Count the number of edits made by each user SQL Query (for 1 Table) SELECT Editor, Count(*) FROM Edit GROUP BY Editor; GROUP BY clause means group the results based on the Editor column before summarizing them
Query: Count the number of distinct articles edited by each user SQL Query (for 1 Table) SELECT Editor, Count(DISTINCT Edited_Title) FROM Edit GROUP BY Editor;
Query: Find the top 2 editors who made the most edits SQL Query (for 1 Table) SELECT Editor, Count(*) FROM Edit GROUP BY Editor ORDER BY Count(*) DESC LIMIT 2;
SQL Query for 2 Tables • Query: Find the titles of Wikipedia articles edited by administrators SELECT Edited_Title FROM User, Edit WHERE IsAdmin = True; SQL is wrong. Missing the join condition.
SQL Query for 2 Tables • Query: Find the titles of Wikipedia articles edited by administrators SELECT Edited_Title FROM User, Edit WHERE Editor = Username AND IsAdmin = True; Join condition
SQL Query for 2 Tables • Query: For each non-administrator, count the number of distinct articles they have edited SELECT Editor, COUNT(DISTINCT Edited_Title) FROM User, Edit WHERE Editor = Username AND IsAdmin = False GROUP BY Editor;
Using MySQL with Python Install mysql-connector-python library conda install mysql-connector-python Steps Import mysql-connector module Create database connection cnx = mysql.connector.connect(username, password, host, database) Create a cursor cursor = cnx.cursor() Pass query to cursor for execution cursor.execute(query) If it is a database update query: cnx.commit() Close cursor and database connection
Using MySQL with Python Example: load wiki_edit.txt file into MySQL database Schema: RevisionId, Article_name, EditDate, Editor
Using MySQL with Python • Create a new table (and drop the old one) in MySQL
Using MySQL with Python • Insert tuples from DataFrame into MySQL (after removing the unicode characters in article name) Make sure you commit the database changes
Using MySQL with Python • Query table and print the results
Accessing CSE MySQL from Home • You may not be able to access the database directly from your code outside of MSU network (firewall) • You may need to login to MSU VPN or EGR VPN