Mike Carey Information Systems Group Computer Science Department UC Irvine. Database Systems: A Vertical Slice of Computer Science … or … It’s All About the Data!. Wait … Who Is This Guy?. Carnegie-Mellon University, 1975-80 B.S. and M.S. Student, EE/ECE UC Berkeley, 1980-83
Mike Carey Information Systems Group Computer Science Department UC Irvine Database Systems: A Vertical Slice of Computer Science … or … It’s All About the Data!
Wait … Who Is This Guy? • Carnegie-Mellon University, 1975-80 • B.S. and M.S. Student, EE/ECE • UC Berkeley, 1980-83 • Ph.D. Student, CS • University of Wisconsin, 1983-95 • Assistant/Associate/Full Professor, CS • IBM, 1995-2000 • Industrial Researcher & Software R&D Manager • Propel Software, 2000-01 • Startup Company Fellow/CTO/VP of Software • BEA Systems, Inc., 2001-08 (acquired by Oracle) • Industrial Software Architect & Sr. Engineering Director • And now I’m here… Trivia tidbit: Here’s a photo of my first (ever) CS TA
Plan For Today’s Talk • Okay, so just what is a database system? • Based on lecture notes from the UW-Madison database curriculum, as immortalized in Database Management Systems (Ramakrishnan & Gehrke, a.k.a. “the Cow book”) • The database field is a vertical slice of all of CS! • You’ll see what I mean (and why)… • What’s exciting in “database systems” today? • UCI Information Systems Group (ISG) and beyond!
What is a Database System? • So what’s a database? • A very large, integrated collection of data • Usually a model of a real-world enterprise or a history of real-world events • Entities (e.g., students, courses, Facebook users, …) • Relationships (e.g., Susan is taking CS 234, Susan is a friend of Lynn, Mike filed a grade change for Lynn, …) • What’s a database management system (DBMS)? • A software system designed to store, manage, and provide access to one or more such databases
Evolution of DBMS (Figure labels: Files, CODASYL/IMS, Relational, each marked “New Data”)
Why Use a DBMS? • Reduced application development time • Efficient (and automatic!) data access • Data independence • Data integrity and security • Uniform data administration • Concurrent access and recovery from crashes
Why Study Databases? • Shift from computation to information • At the “low end”: explosion of the web (a mess!) • At the “high end”: scientific applications • Datasets increasing in diversity and volume • Digital libraries, interactive video, social media, genomic data, big science data, … • ... need for DBMS exploding! • DBMS field encompasses most of CS • OS, languages, theory, AI, multimedia, logic, … ?!
Data Models • A data model is a collection of concepts for describing data (to one another or to a DBMS) • A schema is a description of a particular collection of data, using a given data model • The relational model is the most widely used data model today • Relation – basically a table with rows and (named) columns • Schema – describes the tables and their columns
Levels of Abstraction Lies! • Many views of one conceptual (logical) schema and an underlying physical schema • Views describe how different users or groups see the data • Conceptual schema defines the logical structure of the database • Physical schema describes the files and indexes used “under the covers” (Figure labels: View 1, View 2, View 3; Conceptual Schema (Logical Model); Physical Schema (On-Disk Data Structures); Bits)
Example: University DB • Conceptual schema: • Students(sid: string, name: string, login: string, age: integer, gpa: real) • Courses(cid: string, cname: string, credits: integer) • Enrolled(sid: string, cid: string, grade: string) • Physical schema: • Relations each stored as unordered files • Have indexes on first and third columns of Students • External schema (a.k.a. view): • CourseInfo(cid: string, cname: string, enrollment: integer)
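A minimal sketch of the three schema levels above, using Python's built-in sqlite3 module. The slides do not give the view definition, the index names, or any data, so the `CourseInfo` definition, the names `students_sid_idx` and `students_login_idx`, and all rows below are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
-- Conceptual schema (SQLite's TEXT/INTEGER/REAL stand in for string/integer/real)
CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

-- Physical schema: indexes on the first and third columns of Students
-- (index names are hypothetical)
CREATE INDEX students_sid_idx   ON Students(sid);
CREATE INDEX students_login_idx ON Students(login);

-- External schema: one plausible definition of the CourseInfo view (assumed)
CREATE VIEW CourseInfo AS
    SELECT c.cid AS cid, c.cname AS cname, COUNT(e.sid) AS enrollment
    FROM Courses c LEFT JOIN Enrolled e ON e.cid = c.cid
    GROUP BY c.cid, c.cname;
"""


def build_university_db():
    """Create the example schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
        [("s1", "Susan", "susan", 20, 3.8), ("s2", "Lynn", "lynn", 21, 3.5)],
    )
    conn.executemany(
        "INSERT INTO Courses VALUES (?, ?, ?)",
        [("C1", "Computer Game Design", 4), ("C2", "Made-Up Course", 4)],
    )
    conn.executemany(
        "INSERT INTO Enrolled VALUES (?, ?, ?)",
        [("s1", "C1", "A"), ("s2", "C1", "B"), ("s1", "C2", "A")],
    )
    return conn


conn = build_university_db()
rows = conn.execute(
    "SELECT cid, cname, enrollment FROM CourseInfo ORDER BY cid"
).fetchall()
print(rows)
```

Users of `CourseInfo` never see `Enrolled` or the indexes; they only see one row per course with its enrollment count.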
Data Independence • Applications are insulated from how data is actually structured and stored! • Logical data independence: Protection from changes in the logical structure of data • Physical data independence: Protection from changes in the physical structure of data • One of the most important benefits of using a DBMS! • Allows changes to be made w/o application rewrites
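Physical data independence can be seen in a small experiment, sketched here with Python's sqlite3 module. The application's query text stays fixed while an index is added underneath it; the answer does not change, only the plan the system picks. The index name `courses_cname_idx` and the rows are made up for this sketch, and the plan text is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Courses (cid TEXT, cname TEXT, credits INTEGER)")
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?, ?)",
    [("C1", "Computer Game Design", 4), ("C2", "Made-Up Course", 4)],  # made-up rows
)

# The "application": this query text never changes.
QUERY = "SELECT cid FROM Courses WHERE cname = 'Computer Game Design'"


def plan(conn, sql):
    """Return the human-readable steps SQLite reports for a query."""
    # The fourth column of each EXPLAIN QUERY PLAN row is the description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before_rows = conn.execute(QUERY).fetchall()
before_plan = plan(conn, QUERY)

# A change to the physical schema only: add an index (hypothetical name).
conn.execute("CREATE INDEX courses_cname_idx ON Courses(cname)")

after_rows = conn.execute(QUERY).fetchall()
after_plan = plan(conn, QUERY)

print(before_rows, before_plan)
print(after_rows, after_plan)
```

Logical data independence works the same way one level up: if applications query a view, the tables beneath the view can be restructured and only the view definition needs to change.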
Example: University DB (cont.) • User query (in SQL, against the external schema): • SELECT c.cid, c.enrollment FROM CourseInfo c WHERE c.cname = 'Computer Game Design' • Equivalent query (against the conceptual schema): • SELECT c.cid, count(*) FROM Enrolled e, Courses c WHERE e.cid = c.cid AND c.cname = 'Computer Game Design' GROUP BY c.cid • Under the hood (against the physical schema): • Access Courses – use index on cname to find associated cid • Access Enrolled – use index on cid to count the enrollments
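The equivalence of the two queries can be checked directly. This is a sketch using Python's sqlite3 module; the view definition, the index names, and the rows are assumptions, since the slides only name the view's columns. The view here uses an inner join so that it matches the conceptual query exactly, which also means a course with no enrollments would not appear in it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
    CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

    -- Indexes matching the "under the hood" steps (names are hypothetical)
    CREATE INDEX courses_cname_idx ON Courses(cname);
    CREATE INDEX enrolled_cid_idx  ON Enrolled(cid);

    -- Assumed definition of the external-schema view
    CREATE VIEW CourseInfo AS
        SELECT c.cid AS cid, c.cname AS cname, COUNT(*) AS enrollment
        FROM Enrolled e JOIN Courses c ON e.cid = c.cid
        GROUP BY c.cid, c.cname;

    -- Made-up rows
    INSERT INTO Courses  VALUES ('C1', 'Computer Game Design', 4),
                                ('C2', 'Made-Up Course', 4);
    INSERT INTO Enrolled VALUES ('s1', 'C1', 'A'), ('s2', 'C1', 'B'),
                                ('s1', 'C2', 'A');
""")

# User query, written against the external schema (the view).
view_rows = conn.execute("""
    SELECT c.cid, c.enrollment
    FROM CourseInfo c
    WHERE c.cname = 'Computer Game Design'
""").fetchall()

# Equivalent query, written against the conceptual schema (the base tables).
concept_rows = conn.execute("""
    SELECT c.cid, count(*)
    FROM Enrolled e, Courses c
    WHERE e.cid = c.cid AND c.cname = 'Computer Game Design'
    GROUP BY c.cid
""").fetchall()

print(view_rows, concept_rows)
```

Both queries return the same single row; which indexes actually get used is left to the system's optimizer.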
Architecture of a DBMS • A typical DBMS has a layered architecture • The figure doesn’t show the concurrency control and recovery components • This is one of several possible architectures; each actual system has its own variations (Figure, layers from top to bottom: Queries; Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB) Note: These layers must consider concurrency control and recovery
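To make the layering concrete, here is a toy sketch in Python of the bottom four layers, each calling only the layer beneath it. It is an illustration under heavy simplifying assumptions, not how any real DBMS is built: all class and function names are invented, a dict stands in for the disk, eviction is plain LRU, and there is no pinning, dirty-page handling, concurrency control, or recovery. The student rows are made up.

```python
from collections import OrderedDict


class DiskSpaceManager:
    """Lowest layer: stores pages by id (a dict stands in for the disk)."""

    def __init__(self):
        self.pages = {}
        self.reads = 0  # counts simulated disk reads

    def write_page(self, page_id, records):
        self.pages[page_id] = list(records)

    def read_page(self, page_id):
        self.reads += 1
        return self.pages[page_id]


class BufferManager:
    """Keeps up to `capacity` pages in memory, evicting the least recently used."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.frames = OrderedDict()

    def get_page(self, page_id):
        if page_id in self.frames:  # hit: no disk read needed
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        records = self.disk.read_page(page_id)  # miss: go to the layer below
        self.frames[page_id] = records
        if len(self.frames) > self.capacity:
            self.frames.popitem(last=False)  # drop the least recently used page
        return records


class HeapFile:
    """Files and access methods: an unordered file of records across pages."""

    def __init__(self, buffer_manager, page_ids):
        self.buffer_manager = buffer_manager
        self.page_ids = page_ids

    def scan(self):
        for page_id in self.page_ids:
            yield from self.buffer_manager.get_page(page_id)


def select(records, predicate):
    """Relational operator: keep the records that satisfy the predicate."""
    return (r for r in records if predicate(r))


def project(records, column):
    """Relational operator: keep one column of each record."""
    return [r[column] for r in records]


disk = DiskSpaceManager()
disk.write_page(0, [("s1", "Susan", 3.8), ("s2", "Lynn", 3.5)])
disk.write_page(1, [("s3", "Mike", 3.1)])
students = HeapFile(BufferManager(disk, capacity=2), page_ids=[0, 1])

# Roughly "SELECT name FROM Students WHERE gpa > 3.4", run twice.
honors = project(select(students.scan(), lambda r: r[2] > 3.4), column=1)
again = project(select(students.scan(), lambda r: r[2] > 3.4), column=1)
print(honors, disk.reads)
```

The second scan is served entirely from the buffer pool, so the simulated disk is read only once per page. A query optimizer would sit above these operators and choose among alternative plans.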
DB Field is a Vertical Slice of CS • “I like programming languages and compilers” • Consider high-level, declarative languages like SQL • “I like low-level operating systems issues” • DBMSs manage records, memory, locks, logs, … • “I really want to work on distributed systems” • Distributed and parallel database systems are rife with distributed algorithms and systems issues (!) • “Data structure and algorithm design is really cool” • Database indexes are data structures on disk (or flash) (And so on!)
What’s Exciting in DB Land Today? • The Web is full of database challenges (“Big Data”!) • A box for keywords only goes so far… • How can I query the web, e.g., “Find me 5-string Fender bass guitars for sale in the $1000-1500 price range” • Click streams and social networks generate lots of data • How can I query and analyze all that data (e.g., to act on it)? • Ubiquitous computing is data-rich, too (IoT) • Build, deploy, and use location-based data services • Query and aggregate streams of sensor or video data • There’s data everywhere, and of all shapes and sizes • How do we integrate it, e.g., for rapid crisis response? • And when we do, how do we ensure privacy/security?
Ex: DB Challenges at Facebook • Data store for low-latency, high-traffic Web sites • Only have a few hundred milliseconds to generate an entire page • Data heavily cached outside the DBMS today, which is “far from ideal” • Data systems for offline/batch-oriented processing • I mentioned this before: clickstream analysis, graph analysis, etc. • Potentially interested in faster, approximate answers • Would like to do this in real time as well, as data arrives • Hardware trends (always) present new opportunities • Flash storage, for example • Multicore CPUs (nobody uses them super well yet) • Some open source work from Facebook related to DBs • Hive: Open source SQL on top of Hadoop • Cassandra: Large-scale distributed storage for semistructured data
AsterixDB System (UCI / UCR) (https://asterixdb.apache.org/) • ASTERIX Goal: To ingest, digest, persist, index, manage, query, analyze, and publish massive quantities of semi-structured information… • (ADM = ASTERIX Data Model, AQL = ASTERIX Query Language) (Figure labels: Data loads & feeds from external sources (XML, JSON, …); AQL queries & scripting requests and programs; Data publishing to external sources and apps; Hi-Speed Interconnect; three nodes, each with CPU(s), Main Memory, and Disk holding ADM Data)
Summary • A DBMS is for storing and querying big datasets • Benefits of using one are many: rapid development of new applications (“what, not how”), recovery after crashes, support for (safe) concurrent access, help in ensuring data integrity and security, … • Levels of schema abstraction provide data independence • DB research is a vertical slice of all of CS (“for data”) • Big Data experts are in high industrial demand! • Data is what it’s all about today! So, consider taking our three classes: CS 122A/B/C (and occasionally offered special topics classes)