210 likes | 315 Views
SQL for Crime ANALYSTS. BACIAA Session Thursday, 22 March 2012 James G. Beldock. Today’s Agenda. Introductions Preliminaries Databases, Structured Data, and Tables Demo 1: Exploring Tables How Databases Are Structured (& Why) Demo 2: Lots of Tables Break A Sample CAD Database
E N D
SQLfor Crime ANALYSTS BACIAA Session Thursday, 22 March 2012 James G. Beldock
Today’s Agenda • Introductions • Preliminaries • Databases, Structured Data, and Tables • Demo 1: Exploring Tables • How Databases Are Structured (& Why) • Demo 2: Lots of Tables • Break • A Sample CAD Database • SQL SELECT, part 1 • Using database data in Excel • Lunch • Joins • SQL Select, part 2 • Joining • “Saving” Joins to a View • Break • Views • Other SQL Commands
Databases, Database Varieties, and SQL How Databases Are Structured (& Why) Preliminaries
Databases • Store data permanently • Sometimes called “persistent storage” • Data can be • Structured data • A Personhas: First Name; Last Name; Social Security Number; Photo.JPG • Unstructured data • examples: Moby Dick; an entire website; email messages (sometimes) • Sizes • Databases can be small (100K, 1MB, etc.) or • Quite Large (UK Land Registry is 23TB; that’s ~1.1 Libraries of Congress)1 • RIDICULOUSLY LARGE (Google’s index of the web; Facebook’s profiles database) that’s 1.84 x 100,000,000,000,000 bits!2 1 DB2 - the secret database (http://www.theregister.co.uk/2006/01/18/db2_neglected/) 2 Wolfram Alpha is great for this sort of thing: http://www.wolframalpha.com/input/?i=23+terabytes
[silicon valley moment] • Recently, SQL-running databases have fallen somewhat out of fashion • SQL was never cool • Now it’s officially “uncool” for some purposes, like building NetFlix • Highly scalable (thousands of servers?) • Very flexible data structures • Today’s session is all about SQL, and SQL is (usually) used with relational databases, which are, if you ask the cool people, not as cool as they used to be. • SQL is still the world’s most prolific database language, and certainly stores more structured data than any other environment ever built.
Structured Data • SQL deals with structured data3 • Structured Data • Keeps track of one or more types ofthings, called Entities (or TABLEs in SQL) • Knows certain, specific, structured piecesof information about those entities,called Attributes (or COLUMNs in SQL) Sample Structured Data:a TABLEof Customers4 Note: SQL keywords will be in blue.They are traditionally written in ALL CAPS. and names of Tables or Columns willappear in Brownor Orange, respectively. Theyare traditionally Capitalized (but not ALL CAPS). 3 Well, nearly always. But not always always: Storing Unstructured Data in SQL Server 2008 – Microsoft, 4 source: SqlCourse2.com, http://www.sqlcourse2.com/index.html
Database TABLES Names of COLUMNS Name of TABLE Question: What’s the name of a ROW? ROWs COLUMNs
a ROW • 5 COLUMNs (also called Fields): • customeridsome type of number probably a Unique Identifier • firstnametext (called a String) probably not unique • lastname string probably not unique • city string probably not unique • state string probably not unique Unique IDs are calledKEYs The KEY used to name a ROWis called the PRIMARY KEY
Before We Go Further:SQL • That is why you’re here, right? • Structured Query Language (SQL) is: • A language for asking a database for information (“querying”) • A language for changing information in a database • Changing the structure of a database • Adjusting security, performance, and deployment of databases • Destroying everything in the database…but don’t worry :-) Database Manipulation Language, DML CreateRead Update Delete Often called:DANGEROUS (seriously, calledadmin functionality, or Database Definition Language, DDL)
SQL’s SELECT Statement • The single most important SQL statement. Period. • “Selects” data out of a database, or performs a calculation on a column, value, table, etc. • Really simple examples: • SELECT 'hello' → hello • SELECT 1 + 3→ 4
SELECTStatement, continued • More commonly, the basic SELECTstatement returns ROWs from a TABLE: • SELECT firstnameFROM customers → John Leroy Elroy Lisa • SELECT firstname, city FROM customers → John LyndenLeroy PinetopElroy SnoqualmieLisa Oshkosh • SELECT * FROM customers → 10101JohnGray Lynden Washington10298 LeroyBrown Pinetop Arizona10299 ElroyKeller SnoqualmieWashington10315 LisaJones Oshkosh Washington * A special COLUMNname: Means “all COLUMNs”
SELECTStatement: the Important optioNS5 (for one table) SELECTlist of columns, functions on columns, or *FROMname of tableWHERElist of conditions to include (called “predicates”)ORDERBYlist of columns and direction of sort (ascending/descending)GROUPBYlist of columns 5 The full definition of the SQL SELECT statement syntax is much longer and, to some extent, specific to the database software. See the definition of Microsoft SQL Server 2008 R2’s SELECT statement at http://msdn.microsoft.com/en-us/library/ms189499.aspx
SELECT…ORDER BY • Use Order By to sort by one or more columns, in ascending or descending order Effect of ORDER BY clause
SELECT…WHERE • Use WHERE to filter • based on one criterion: • or more than one: Why the [square brackets]? The word state is a reserved SQL keyword. When it is used as a column name, it must be [bracketed] to avoid confusion.
FUNCTIONS • You can add functions to a SQL SELECT statement to perform various analyses. • The most common6 are • Aggregate functions • count(), which returns the number of somethings, and • sum(),which adds up the somethings • Also: min(), max(), avg(), stdev(), var() • Math, Date and String (text) Manipulation functions • Math: abs(), ceiling(), power(), sqrt(), others • String: len(), substring(), replace(), upper(), lower(), left(), right(), others • Date: dateadd(), datediff(), datepart(), getdate(), day(), month(), year(), others 6 The full list is quite long. For SQL Server, see http://msdn.microsoft.com/en-us/library/aa258899(v=sql.80).aspx.
[DEMO]Using FunctionS, WHERE, and ORDER BY • Summary: • count(*) gives you the count of rows resulting from your query • You can SELECT any combination of columns • Unless you GROUP BY, in which case you are limited to the GROUPedBYcolumns and aggregate functions applied to other columns • Gotchas • sum(*) doesn’t make sense, but sum(columnname) does—for columns of numbers • GROUP BY is finicky: the list of columns you select is limited • Some things aren’t easy: for example, finding the percent of total
Terminology: DBMS • “Database” is a generic term; it can refer to: • A specific set of data running on a Database Server • A Database Server itself (not really the right term) • A large body of information kept by a human being (“my recipe database”) • Databases generally run on a Database Server • A computer running Database Management System (DBMS) • Accepts connections (“queries”) from many client computers • Returns a response (“result set”) to each client in response to each query • Can be distributed onto lots of servers (Facebook: 1,800+ MySQL servers) • DBMS handle multiple databases • Each Database is stored in one or more “database files” • Database Files can sometimes be loaded/viewed/edited by other software
Names You Might Encounter(in the Database World) • SQL Server, from Microsoft (also “Microsoft SQL Server”) • Oracle • DB2, from IBM • Less common: • Microsoft Access, dBase, Sybase
Database Structures • Most databases have many TABLEs • 10 would be “few”; 50 would be normal; 150 would be many • There is a method to this madness • Different TABLEs contain different categories of information • Example: • Customers: contains lots of customers • Products: contains lots of products • Orders: combines customers and products (and quantities, etc.)
Why So Many Tables? • Imagine a world with just 1 table • The problem of duplicate data: • Adding a new order is easy: • But what happens when James changes his address? • Answer: need to update every ROWwhere 'James' is the CustomerName(ugh!)
Solution: Divide and Conquer • Divide data into Entities (TABLEs), specific to a given purpose: