Toto, We’re Not in Kansas Anymore…
On Transitioning from Research to the Real World
Mike Carey
Fellow, Platform Engineering
carey@propel.com

Today’s Talk
• Background information
• Lessons from the "Road to Propel"
  • The UW-Madison years
  • The IBM Almaden years
  • The Propel (web) years
• Database research in the new millennium
  • Maturity brings its own challenges
  • Research opportunities in e-commerce
• Some operational recommendations

Part One: Background information
Background Info
• UW-Madison CS Professor (1983-1995)
  • Concurrency control algorithms
  • Query processing performance
  • Main memory databases
  • Extensible database systems (Exodus)
  • Real-time database systems
  • Client-server O-O database systems (Shore)
  • Online algorithms, DBMS performance

Background Info (cont.)
• IBM Almaden Research Staff Member and Manager (1995-2000)
  • Heterogeneous database systems (Garlic)
  • Object middleware (Component Broker)
  • Object-relational databases (DB2 UDB)
• Propel Platform Engineering Fellow (2000-?)
  • Scalable e-commerce infrastructure software

Part Two: Lessons from the "Road to Propel"
UW-Madison Years
Lesson #1: Awareness is key
• Be “plugged in” to current technologies & issues
• Hardware and OS characteristics
  • CPU, memory, disk, and network performance
  • Path lengths (e.g., TCP/IP messages)
• DBMS software characteristics
  • DBMS internal components
  • Layers/calls: SQL, records, pages, …
  • Interactions, e.g., concurrency & recovery
• Application characteristics
  • “Typical” workload characteristics
  • What systems can or cannot know (when/how)

UW-Madison Years
Lesson #2: Students are the product
• Having industrial impact is a laudable goal, but
  • It’s hard (in general) to be fully plugged in
    • Details of systems and workloads
  • The algorithms may not be the hard part
    • More about this shortly
• Students are our biggest accomplishment
  • Well-trained students are incredibly valuable
    • Systems sense; ability to think, learn, adapt
  • I’m extremely proud of my former students!
  • That’s what I miss the most in industry

UW-Madison Years
The wake-up call: A house of cards?
• [ACL85]: Blindly following colleagues
  • Ten years later, some papers still using the same hardware and software parameters
• RTDBS: The blind following the blind?
  • We basically stated and then solved these research problems ourselves
• SIGMOD-94: The SIGMOD chair’s lunchtime analysis of SIGMOD paper production
  • Not clear to me that “most SIGMOD papers in the last ten years” was such a good thing

The First Transition
From UW-Madison to IBM Almaden
• Intellectual reasons
  • Weary of inventing and then solving problems
  • Wanted access to real problems and systems
  • Also just needed a change after 12 years
• IBM Almaden reasons
  • Terrific environment & colleagues for DB research
  • “Development from the safety of a research lab”
• Personal reasons
  • Wanted to “have a life” again outside work
  • Wanted to live in the Bay Area (Silicon Valley)

IBM Almaden Years
Context: Extending DB2 UDB
• From 1996 to 2000, I worked on adding object extensions to SQL and DB2 UDB (V5.2-V7.1)
  • Object-relational data model extensions
    • Types, OIDs, references, subtables, object views
  • Corresponding query language extensions
    • Substitutability, path expressions, constraints and triggers, type predicates, sub-table access rules
  • System extensions
    • Storage & query processing for all of the above
• DB2 UDB work is geographically distributed
  • IBM Toronto, Santa Teresa, and Almaden labs

IBM Almaden Years
Lesson #1: Products are hard to build
• Products are very different than prototypes
  • Someone else wrote the first 1M+ lines of code
  • System has many nooks and crannies
  • No one person understands the whole thing
  • 100 or so people are working on it with you
• You have to do the other 80-90% of the work
  • Testing, code reviews, testing, docs, testing, …
  • System catalogs: no big deal, right…?
• The engine is just one aspect of a product
  • Import/export, bulk load, control center, visual explain, query tools, design tools, replication, …

IBM Almaden Years
Lesson #1: Products are hard (cont.)
• It’s difficult to make some kinds of changes
  • Customers already have terabytes of data
  • Data migration is a no-no (at least at IBM)
  • Catalog migration is a pain and a time sink
• It’s not just your own product that’s affected
  • 3rd-party vendors may also be a factor
    • Ex. 1: Physical load utilities (table hierarchies)
    • Ex. 2: Logical & physical database design tools
  • Market share & standards come into play here

IBM Almaden Years
Lesson #2: Adding to a language is hard
• SQL is a 25-year-old language that was never intended to do everything we want it to today
  • World was simple tables, basic retrievals
  • Various assumptions made for “convenience”
    • Ex. 1: Sub-queries – scalar- or table-valued?
    • Ex. 2: Nulls – inconsistent (e.g., where vs. max)
• SQL changes must be monotonic in nature
  • Can’t change meaning of existing queries (!)
  • Extensions must all peacefully co-exist
  • Language is getting “full” (> 1000 pages)

IBM Almaden Years
Lesson #2: Adding is hard (cont.)
• “Cool new SQL features” are a double-edged sword
  • Can add real value for advanced applications
    • Consider OLAP, O-R, and temporal extensions
  • “Different” or “proprietary” = “bad”?
    • To 3rd-party vendors, also to nervous customers
  • And, tools may hide them anyway
    • Query builders, EJB programming model, …
• SQL standardization is an interesting world
  • Serious extensions must someday fly with ANSI & ISO
  • SQL standard is in some ways a corporate battleground
  • Vendors only want the extensions on their radar screen

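
Note on Ex. 2 of Lesson #2 (nulls, where vs. max): the slide does not spell out the inconsistency, so the following is only one plausible reading, sketched with Python's built-in sqlite3 module. The `item` table and its rows are made up for illustration. A NULL makes a WHERE predicate evaluate to "unknown" and the row quietly drops out of both a condition and its complement, while MAX simply skips the NULL and returns a definite answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this illustration.
conn.execute("CREATE TABLE item (name TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [("a", 10), ("b", None), ("c", 30)],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


total = scalar("SELECT COUNT(*) FROM item")

# In WHERE, a comparison against NULL is "unknown", so row b matches
# neither the predicate nor its complement.
cheap = scalar("SELECT COUNT(*) FROM item WHERE price <= 20")
pricey = scalar("SELECT COUNT(*) FROM item WHERE price > 20")

# MAX ignores the NULL rather than reporting "unknown".
max_price = scalar("SELECT MAX(price) FROM item")

# Over an empty input, MAX yields NULL although no NULL was stored.
empty_max = scalar("SELECT MAX(price) FROM item WHERE name = 'zzz'")

print(total, cheap + pricey, max_price, empty_max)
```

Here three rows exist but the two complementary predicates account for only two of them, and NULL means "skip me" in one place and "no answer" in another. Any extension to the language has to live with both behaviors unchanged.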
IBM Almaden Years
Lesson #3: Listen to users’ needs
• So many features, so little time…!
• Potential users help you prioritize your work
  • Ex: Sub-table triggers & constraints in DB2
• They also help you make “safe” initial decisions
  • Ex: Internal storage for DB2 table hierarchies
• Potential users can help you see things you might otherwise miss (at least initially)
  • Ex 1: Advantages of DB2 user-defined OIDs
    • Customers already “simulate” objects today
    • Access to system-generated OID values?
    • Object caching and efficient write-back
  • Ex 2: DB2 object view functionality
    • Virtual table hierarchies, same authorization model

The Second Transition
From IBM Almaden to Propel
• Some triggering events
  • Working on XML middleware layer for DB2 UDB
    • After spending nearly 20 years “under the hood”
  • Almaden management discussions: connecting to Valley
  • Personal belief that this was a unique period for CS
  • Call (out of the blue) from Steve Kirsch, CEO
• Given a 4-year paid scholarship to “e-school”
  • Chance to learn about
    • Using database system technology
    • Web and e-commerce applications
    • The startup company experience
  • Excellent senior team to learn from at Propel
  • Unemployment risk “low” in Silicon Valley

Propel (Web) Years
Context: E-commerce infrastructure
• Propel is developing two software products
  • E-Commerce Suite
    • “Amazon-in-a-box” product
  • Distributed Services Platform
    • Infrastructure product for the above (and other data-centric, mission-critical internet applications)
• Platform = Scalable 24x7 “e-commerce OS”
  • Online data management, caching, search, messaging, live deployment, monitoring, …

Propel (Web) Years
Context: E-C infrastructure (cont.)
[Figure: architecture diagram. Tiers shown, front to back: Firewall, Load Balancer, Web Servers, App Servers, Propel Platform. Platform components shown: Message Service, Admin & Monitoring Service, Caching Service, ERP Service, Data Management & Search Service, Order Mgmt Service, Payment Service, …]

Propel (Web) Years
Lesson #1: Standards vs. innovation
• What a marketing person will likely tell you after asking a customer for their input
  • Customers want standards-based solutions
    • “We want DB access via SQL and JDBC”
    • “We want our programmers to use EJBs (J2EE)”
    • “We want to use JSPs for our dynamic pages”
  • I.e., a typical customer dictionary entry says
    • Proprietary: see “bad”
• This poses obvious challenges for innovation!
• Luckily…
  • XML is also considered “standards-based”
  • Performance, ease of use are still compelling in web-land

Propel (Web) Years
Lesson #2: Oracle is a de facto standard
• Talking to dot-coms with Oracle DBAs is an interesting experience for the academic-minded
  • Academic point of view
    • Whatever; it’s just a database system…
  • Oracle DBA point of view
    • Do my Oracle utilities work with your solution?
    • Do my Oracle sequences work with your solution?
    • You mean it’s not Oracle? (said with a whine)
• Again, this poses obvious challenges for innovation (not to mention other DB vendors!)
• Luckily…
  • Saying “Oracle inside” seems to help
  • Oracle is not a cheap, perfect, or limitless solution

Propel (Web) Years
Lesson #3: VCs, dot-coms, and ASPs
• Oracle+Sun+Solaris are to web sites what IBM was to corporate IS departments 15+ years ago
  • Some VC firms prescribe(d) them to dot-coms
  • Some IS departments pre-approve (just) them
  • They are a favorite managed stack for ASPs
• Thus, today’s “technology brakes” include
  • Corporate and VC comfort zones
  • ASP system management expertise
  • Developer and DBA skill set availability

Part Three: Database research in the new millennium
The DB Field Has Matured
Bringing a new set of challenges
• SQL DB systems are becoming a commodity
  • ISVs produce DBMS-independent packages
    • Ex: ERP systems (SAP, PeopleSoft, Baan, …)
  • SQL + ODBC/JDBC is just a “given”
• New features face a huge uphill battle
  • Witness the rate of object-relational adoption
  • Hopefully SQL99 will help, but…?
• A SQL DBMS has truly become a component
  • Transactional storage for ERP
  • On-line data repository for e-commerce
  • I.e., just a place to put your data
• So where does that leave our community…?

The DB Field Has Matured
Bringing new challenges (cont.)
• Interesting questions remain! For example:
  • A good component is easy to manage
    • DB systems have way too many knobs
    • They’re virtually impossible to hide as a result
  • A good component plugs in well with others
    • Better, faster interfaces would be nice
    • Cache interaction hooks would be nice
    • Workflow hooks would be nice
    • (Your application hooks go here)
  • XML appears poised for interoperation success
    • W3C XML Schema, Query, & Protocol efforts
    • Our community should keep playing a big role

The DB Field Has Matured
Bringing new challenges (cont.)
• Interesting questions remain (cont.)
  • Major applications are worth studying
    • Ex: Kemper, Kossmann, et al. SAP study
    • Sources of “typical” workload info, database characteristics, and feature use (or disuse) info
  • Bottom line from a component perspective
    • We need to understand how our technologies are being utilized (or not) and respond accordingly
    • Ex. 1: Queries with parameter markers
    • Ex. 2: SQL’s approach to authorization
    • Ex. 3: Actual usage-driven interoperation hooks
• And, of course, we must continue to innovate!
  • Somehow…?!?

E-Commerce DB Research
A Propel Perspective
• The Propel Distributed Services Platform
  • Scalable, 24x7 e-business infrastructure
  • Array of inexpensive Sun or Intel boxes
  • Exploitation of low main memory cost
  • High-performance and highly available
  • Data management and search capabilities
  • Transparent data replication & partitioning
  • Caching of page fragments, objects, and data
  • Scalable messaging & queuing infrastructure
  • Built from best-of-breed components
  • XML-enabled (for the future of e-business)
  • Unified administration and on-line deployment

E-Commerce DB Research
Problem #1: Caching
• What to cache and where to cache it?
  • Fragments of dynamic HTML pages
    • Personalization ruins basic page caching
    • Commonly used fragments assured, though
  • XML objects used to create HTML fragments
    • If applicable, probably less bulky
  • Java objects materialized on app servers
    • Avoids database re-access cost
    • Issues: load balancing, memory duplication
  • Database objects accessed from DB server(s)
    • Lowers database access cost
    • Where – app servers, DB server(s), or both?

E-Commerce DB Research
Problem #1: Caching (cont.)
• How to keep caches consistent
  • Multiple web servers and app servers
  • DB rows -> Java objects -> XML -> HTML
  • How to uniquely identify objects?
  • How to keep track of what’s where?
  • How to keep track of data dependencies?
  • How/when to propagate updates?
  • How to maintain consistency?
    • In fact, how to define consistency…?
  • What about queries and query results?
• And, just to up the ante a bit further
  • Want all this to work across continents…!

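
Note on the dependency-tracking question: a minimal single-process sketch in Python of one way to record the "DB rows -> Java objects -> XML -> HTML" chain and invalidate along it. This is not Propel's design. The class name `DependencyCache`, its methods, and the key naming scheme (`row:`, `obj:`, `xml:`, `html:`) are all assumptions made for illustration.

```python
from collections import defaultdict


class DependencyCache:
    """Toy cache that remembers which source keys each entry was derived from."""

    def __init__(self):
        self._entries = {}                    # cache key -> cached value
        self._dependents = defaultdict(set)   # source key -> keys derived from it

    def put(self, key, value, depends_on=()):
        self._entries[key] = value
        for source in depends_on:
            self._dependents[source].add(key)

    def get(self, key):
        return self._entries.get(key)

    def invalidate(self, source):
        """Drop every entry derived, directly or transitively, from `source`."""
        dropped = set()
        pending = [source]
        while pending:
            current = pending.pop()
            for key in self._dependents.pop(current, ()):
                if key not in dropped:
                    dropped.add(key)
                    self._entries.pop(key, None)
                    pending.append(key)  # a derived item can itself be a source
        return dropped


cache = DependencyCache()
cache.put("obj:product:42", {"name": "Ruby slippers"}, depends_on=["row:product:42"])
cache.put("xml:product:42", "<product id='42'/>", depends_on=["obj:product:42"])
cache.put("html:product:42", "<div>Ruby slippers</div>", depends_on=["xml:product:42"])
cache.put("html:logo", "<img src='logo.gif'>")  # depends on no database row

# An update to the underlying row invalidates the whole derived chain.
dropped = cache.invalidate("row:product:42")
```

The sketch answers only "how to keep track of data dependencies" within one process. What's where across multiple web and app servers, when to propagate updates, and what consistency should even mean remain the open questions listed on the slide.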
E-Commerce DB Research
Problem #2: Consistency & transactions
• Not all e-business data is equally “valuable”
  • Want to trade off reliability & performance
    • Products: hot, may be read-only once deployed
    • Shopping carts: read/write, “best effort” durability
    • Orders: also read/write, require full durability
• Similar considerations arise w.r.t. consistency
  • Would like well-defined choices available
    • Auctions: okay to bid using slightly outdated info
    • Orders: real-time inventory requires transactions
• Need good, architecturally appropriate solutions
  • Caching, replication, failover, smart load balancing, …

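
Note on the reliability/performance trade-off: a toy Python sketch of what "well-defined choices" per data class might look like, covering the shopping-cart and order cases from the slide (read-only product data is omitted). Everything here is hypothetical: the `Durability` levels, the `POLICY` table, and `TieredStore` are illustrative names, and two in-memory dicts stand in for main memory and durable storage.

```python
from enum import Enum


class Durability(Enum):
    BEST_EFFORT = "best_effort"  # acknowledged from memory, persisted later
    FULL = "full"                # persisted before the write is acknowledged


# Hypothetical policy table; the slide names the data classes, not an API.
POLICY = {
    "cart": Durability.BEST_EFFORT,
    "order": Durability.FULL,
}


class TieredStore:
    def __init__(self):
        self.durable = {}   # stands in for disk or a replicated log
        self.volatile = {}  # stands in for main memory, lost on a crash

    def write(self, kind, key, value):
        if POLICY[kind] is Durability.FULL:
            self.durable[(kind, key)] = value
        else:
            self.volatile[(kind, key)] = value

    def flush(self):
        """Background step that makes best-effort writes durable."""
        self.durable.update(self.volatile)
        self.volatile.clear()

    def crash(self):
        """Simulate a failure: everything not yet durable is gone."""
        self.volatile.clear()

    def read(self, kind, key):
        k = (kind, key)
        return self.volatile.get(k, self.durable.get(k))


store = TieredStore()
store.write("cart", "c1", ["book"])
store.write("order", "o1", {"total": 30})
store.crash()
cart_after_crash = store.read("cart", "c1")     # None: the cart write was lost
order_after_crash = store.read("order", "o1")   # the order survived
```

The point of the sketch is only that the choice is made per data class and is visible to the application, so losing a cart on failure is a declared property and not a surprise.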
E-Commerce DB Research
Problem #3: Queries and search
• W3C’s XML Schema recommendation
  • How to store richly typed XML data?
    • Sparse/variant data, repeating elements, subtyping, text, …
  • Would like to map it into (object-?) relational databases
• W3C’s XML Query recommendation
  • How to process XML queries efficiently?
    • SQL-appropriate processing model
    • Pushdown and other optimizations
• How to handle search-oriented queries?
  • Want transaction-consistent text indexing
  • Also want relevance ranking and various IR “goodies”

E-Commerce DB Research
Problem #4: Content management
• E-business web sites are rich in content
  • HTML fragments (e.g., logos and other goodies)
  • Images (e.g., pictures of products)
  • Text (e.g., descriptions of products)
  • Database data (e.g., product attributes, pricing)
  • JSP pages (e.g., a product page)
  • Personalization rules (i.e., what to show me)
  • Business logic (i.e., Java code)
  • Data -> object mappings (e.g., Java classes)
  • And the list goes on…

E-Commerce DB Research
Problem #4: Content mgmt. (cont.)
• This poses a number of problems
  • Versioning of file-based artifacts
    • Not unlike CAD or document versioning
    • Multiple editors working on the content base
    • Several companies do this (e.g., Interwoven)
  • Versioning of DB-based artifacts
    • Not clear how to handle & integrate this part
    • No winning solutions out there yet (that I know of)
  • Versioning of code-based artifacts
  • How to keep all this stuff mutually consistent?
  • And, how to deploy online in a 24x7 world…?

E-Commerce DB Research
Problem #5: The sun never sets anymore
• The web brings a clear need for 24x7 solutions
  • Asynchronous replication techniques
  • Online schema evolution (w/replication)
  • Online data loading and deployment
  • Online management of rolling history data
• Design for administration/monitoring is also key
  • Online backup/restore
  • Failure & performance monitoring
• Would like system to be self-tuning & self-scaling
  • Reassign boxes between services as needed
  • Even give and take boxes from ASP infrastructure

The Propel Platform
We’re attacking all of these issues
• Programming model
  • Objects with (truly!) universal OIDs
  • Java classes, derived from XML Schema objects
• Caching
  • Multilevel cache hierarchy (w/partitioning)
  • Mini-caches, global cache, MM-DBMS, DB-DBMS
• Consistency and transactions
  • Can trade off ACID-ity vs. performance
• Queries and search
  • XML-influenced query language, integrated search
  • Transparency for cached, partitioned, & replicated data

The Propel Platform
We’re attacking all of these issues (cont.)
• Platform messaging support
  • Clustered IPC for Platform components
    • Load balancing & failover
    • System monitoring
  • Persistent queues as database objects
    • Think “active tables” (enqueue/dequeue, queries)
    • Good foundation for transactional workflows
• Content management
  • Currently focused on deployment problems
  • Partnering for content management today
• System monitoring and administration
  • Separate software stack with agents everywhere
  • JSP-based console to oversee & integrate activities

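
Note on "persistent queues as database objects": a minimal sketch of the "active table" idea using Python's built-in sqlite3 module, not the Propel implementation. The `work_queue` table and the `enqueue`/`dequeue` helpers are hypothetical names. Because the queue is an ordinary table, enqueue and dequeue are transactional and the pending messages can be queried like any other data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical queue table; a real system would add timestamps, owners, etc.
conn.execute(
    "CREATE TABLE work_queue ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " payload TEXT NOT NULL)"
)


def enqueue(conn, payload):
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO work_queue (payload) VALUES (?)", (payload,))


def dequeue(conn):
    """Remove and return the oldest payload, or None if the queue is empty."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM work_queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM work_queue WHERE id = ?", (row[0],))
        return row[1]


enqueue(conn, "order:1001")
enqueue(conn, "email:welcome")
enqueue(conn, "order:1002")

# The queue is just a table, so it can be queried without consuming anything.
pending_orders = conn.execute(
    "SELECT COUNT(*) FROM work_queue WHERE payload LIKE 'order:%'"
).fetchone()[0]

first = dequeue(conn)
remaining = conn.execute("SELECT COUNT(*) FROM work_queue").fetchone()[0]
```

The sketch uses a single connection. With several competing consumers, the select-then-delete step would need locking or an atomic claim, which is left out here.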
Conclusion
Lessons from the "Road to Propel"
• UW-Madison lessons: Know what matters!
  • Awareness is key
  • Students are the product
• IBM Almaden lessons: What’s really hard?
  • Products are hard to build
  • Adding to a language is hard
  • Listen to users’ needs
• Propel lessons: Commoditization brings roadblocks.
  • Standards vs. innovation
  • Oracle is a de facto standard
  • Dot-coms, VCs, and ASPs

Conclusion
DB research in the new millennium
• SQL databases are becoming commodity parts
  • ISVs strive for DBMS vendor-independence
  • This makes (visible) innovation hard
• Lots of interesting research questions, though
  • Component hooks, usage scenarios, XML, …
• E-commerce problems are ripe for the picking
  • Examples that have arisen at Propel include
    • Caching, transactions & consistency
    • Queries and search
    • Content management
    • Online everything for a 24x7 world

Conclusion
Some operational recommendations
• Understand the real problems out there
  • Industrial friends can be very helpful
  • Your students will benefit tremendously
  • So will the companies who hire them
• Recognize that commoditization is happening
  • Consider working within the constraints that it brings
  • Many important open problems remain
  • E-commerce is one fun/interesting example here
• Also keep in mind what really matters
  • It’s actually not any of this stuff, in the end…!