420 likes | 560 Views
C20.0046: Database Management Systems Lecture #25. M.P. Johnson Stern School of Business, NYU Spring, 2005. Agenda. Querying XML Data Warehousing Next week: Data Mining Websearch Etc. Goals after today:. Be aware of some of the important XML standards
E N D
C20.0046: Database Management SystemsLecture #25 M.P. Johnson Stern School of Business, NYU Spring, 2005 M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Agenda • Querying XML • Data Warehousing • Next week: • Data Mining • Websearch • Etc. M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Goals after today: • Be aware of some of the important XML standards • Know how to write some DW queries in Oracle M.P. Johnson, DBMS, Stern/NYU, Spring 2005
XML: Semi-structured data • Not too random • Data organized into entities • Similar/related grouped to form other entities • Not too structured • Some attributes may be missing • Size of attributes may vary • Support of lists/sets • Juuust Right • Data is self-describing M.P. Johnson, DBMS, Stern/NYU, Spring 2005
<movieinfo> <movieid="o111"> <title>Lost in Translation</title> <year>2003</year> <starsidref="o333 o444"/> </movie> <movieid="o222"> <title>Hamlet</title> <year>1999</year> <starsidref="o333"/> </movie> <personid="o111"> <name>Bill Murray</name> <moviesidref="o111 o222"/> </person> </movieinfo> M.P. Johnson, DBMS, Stern/NYU, Spring 2005
New topic: Querying XML • XPath • Simple protocol for accessing node • Will use in XQuery and conversion from relations • XQuery • SQL : relations :: XQuery : XML • XSLT • sophisticated transformations • Sometimes for presentation M.P. Johnson, DBMS, Stern/NYU, Spring 2005
XQuery • Queries are FLWR expressions • Based on Quilt and XML-QL FOR/LET... WHERE... RETURN... FOR $b IN document("bib.xml")//book WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998" RETURN $b/title M.P. Johnson, DBMS, Stern/NYU, Spring 2005
XQuery • Find all book titles published after 1995: FOR $x IN document("bib.xml")/bib/book WHERE $x/year > 1995 RETURN { $x/title } Result: <title>abc</title> <title>def</title> <title>ghi</title> M.P. Johnson, DBMS, Stern/NYU, Spring 2005
SQL v. XQuery Product(pid, name, maker)Company(cid, name, city) Find all products made in NYC SELECT x.name FROM Product x, Company y WHERE x.maker=y.cid and y.city="NYC" SQL FOR $r in document("db.xml")/db, $x in $r/Product/row, $y in $r/Company/row WHERE $x/maker/text()=$y/cid/text() and $y/city/text() = "NYC" RETURN { $x/name } XQuery M.P. Johnson, DBMS, Stern/NYU, Spring 2005
SQL v. XQuery For each company with revenues < 1M count the products over $100 SELECT y.name, count(*) FROM Product x, Company y WHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000GROUP BY y.cid, y.name FOR $r in document("db.xml")/db, $y in $r/Company/row[revenue/text()<1000000] RETURN <Company> <companyName>{ $y/name/text() }</companyName> <numberOfExpensiveProducts> { count( $r/Product/row[maker/text()=$y/cid/text()][price/text()>100]) } </numberOfExpensiveProducts> </Company> M.P. Johnson, DBMS, Stern/NYU, Spring 2005
XSLT: XSL Transformations • Converts XML docs to other XML docs • Or to HTML, PDF, etc. • E.g.: Have data in XML, want to display to all users • Users view web with IE, Firefox, Treo… • Have XSLT convert to HTML that looks good on each • XSLT processor takes XML doc and XSL template for view M.P. Johnson, DBMS, Stern/NYU, Spring 2005
XSLT v. XQuery • FLWR expressions: • Often much simpler than XSLT • XSLT v. XQuery: • http://www.xmlportfolio.com/xquery.html <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <xsl:for-each select="document('bib.xml')//book"> <xsl:if test="publisher='Morgan Kaufmann' and year='1998'"> <xsl:copy-of select="title"/> </xsl:if> </xsl:for-each> </xsl:template> </xsl:transform> FOR $b IN document("bib.xml")//book WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998“ RETURN $b/title M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Displaying XML with XSL/XSLT • XSL: style sheet language for XML • XSL : XML :: CSS : HTML • Menu in XML: • http://www.w3schools.com/xml/simple.xml • XSL file for displaying it: • http://www.w3schools.com/xml/simple.xsl • XSL applied to the XML: • http://www.w3schools.com/xml/simplexsl.xml • More info on Java with XSLT and XPath: • http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html M.P. Johnson, DBMS, Stern/NYU, Spring 2005
From XML to relations (Oracle) • To move single values from XML to tables, can simply use extractvalue in UPDATE statements: SQL> UPDATE purchase_order SET order_nbr = 7101, customer_po_nbr = extractvalue(purchase_order_doc, '/purchase_order/po_number'), customer_inception_date = to_date(extractvalue(purchase_order_doc, '/purchase_order/po_date'), 'yyyy-mm-dd'); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
From relations to XML (Oracle) • Saw how to put XML in a table • Conversely, can convert ordinary rel data to XML • XMLElement() generates an XML node • Now can call XMLElement ftn to wrap vals in tags: • And can build it up recursively: SELECT XMLElement("supplier_id", s.supplier_id) || XMLElement("name", s.name) xml_fragment FROM supplier s; SELECT XMLElement("supplier", XMLElement("supplier_id", s.supplier_id), XMLElement("name", s.name)) FROM supplier s; M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Why XML matters • Hugely popular • To past few years what Java was to mid-90s • Buzzword-compliant • XML databases won’t likely replace RDBMSs (remember OODBMSs?), but: • Allows for comm. between DBMSs disparate architectures, tools, languages, etc. • Basis for Web Services • DBMS vendors are adding XML support • MS, Oracle, et al. M.P. Johnson, DBMS, Stern/NYU, Spring 2005
For more info • APIs: SAX, JAXP • Editors: XML Spy, MS XML Notepad: http://www.webattack.com/get/xmlnotepad.shtml • Parsers: Saxon, Xalan, MS XML Parser • Lectures drew on resources from: • Nine-week course on XML: • http://www.cs.rpi.edu/~puninj/XMLJ/classes.html • W3C XML Tutorial: • http://www.w3schools.com/xml/default.asp • http://www.cs.cornell.edu/courses/cs433/2001fa/Slides/Xml,%20XPath,%20&%20Xslt.ppt M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Recent XML news/etc. • Group at Sun planning “binary XML” • http://developers.slashdot.org/article.pl?sid=05/01/14/1650206&tid=156 • XML is “simple and sloppy” • http://www.adambosworth.net/archives/000031.html • RDF: Resource Definition Framework • Metadata for the web “Semantic web” • Content, authors, relations to other content • http://www.w3.org/DesignIssues/RDFnot.html • Web + XML = the “global mind” • http://novaspivack.typepad.com/nova_spivacks_weblog/2004/06/minding_the_pla.html M.P. Johnson, DBMS, Stern/NYU, Spring 2005
New topic: Data Warehousing • Physical warehouse: stores different kinds of items • combined from different sources in supply chain • access items as a combined package • “Synergy” • DW is the sys containing the data from many DBs • OLAP is the system for easily querying the DW • Online analytical processing • front-end to DW & stats M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Integrating Data • Ad hoc combination of DBs from different sources can be problematic • Data may be spread across many systems • geographically • by division • different systems from before mergers… M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Conversion/scrubbing/merging • Lots of issues… • different types of data • Varchar(255) v. char(30) • Different values for data • ‘GREEN’/’GR/’2 • Semantic differences • Cars v. Automobiles • Missing values • Handle with nulls or XML M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Federated DBs • Situ: n different DBs must work together • One idea: write programs for each to talk to each other one • How many programs required? • Like ambassadors for each country M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Federated DBs • Better idea: introduce another DB • write programs for it to talk to each other DB • Now how many programs? • English in business, French in diplomacy • Warehousing • Refreshed nightly M.P. Johnson, DBMS, Stern/NYU, Spring 2005
OLTP v. OLAP • DWs usually not updated in real-time • data is usually not live • but care about higher-level, longer-term patterns • For “knowledge workers”/decision-makers • Live data is in system used by OLTP • online transaction processing • E.g., airline reservations • OLTP data loaded into DW periodically, say nightly M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Utilizing Data • Situ: each time manager has hunch • requests custom reports • direct programmers to write/modify SQL app to produce these results • on higher or lower levels, for different specifics • Problem: too difficult/expensive/slow • too great a time lag M.P. Johnson, DBMS, Stern/NYU, Spring 2005
EISs • Could just write queries at command-prompt • But decision makes aren’t (all) SQL programmers • Soln: create an executive information system • provides friendly front-end to common, important queries • basically a simple DB front-end • your project part 5 • GROUP BY queries are particularly applicable… M.P. Johnson, DBMS, Stern/NYU, Spring 2005
EISs v. OLAP • Okay for fixed set of queries • But what if queries are open-ended? • Q: What’s driving sales in the Northeast? • What’s the source cause? • Result from one query influences next query tried • OLAP systems are interactive: • run query • analyze results • think of new query • repeat M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Star Schemas • Popular schema for DW data • One central DB surrounded by specific DBs • Center: fact table • Extremities: data tables • Fields in fact table are foreign keys to data tables • Normalization Snowflake Schema • May not be worthwhile… M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Dates and star schemas • OLAP behaves as though you had a Days table, with every possible row • Dates(day, week, month, year, DID) • (5, 27, 7, 2000) • Can join on Days like any other table M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Dates and star schemas • E.g.: products x salesperson x region x date • Products sold by salespeople in regions on dates • Regular dim tables: • Product(PID, name, color) • Emp(name, SSN, sal) • Region(name, RID) • Fact table: • Sales(PID, DID, SSN, RID) • Interpret as a cube (cross product of all dimensions) • Can have both data and stats M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Drill-down & roll-up • Imagine: notice some region’s sales way up • Why? Good salesperson? Some popular product there? • Maybe need to search by month, or month and product, abstract back up to just product… • “slicing & dicing” M.P. Johnson, DBMS, Stern/NYU, Spring 2005
OLAP and data warehousing • Could write GROUP BY queries for each • OLAP systems provide simpler, non-SQL interface for this sort of thing • Vendors: MicroStrategy, SAP, etc. • Otoh: DW-style operators have been added to SQL and some DBMSs… M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: ROLLUP (Oracle) • Suppose have orders table (from two years), with region and date info: • Can select total sales: • Examples derived/from Mastering Oracle SQL, 2e (O’Reilly) • Get data here: http://examples.oreilly.com/mastorasql2/mosql2_data.sql SQL> column month format a10 SQL> @mosql2_data SQL> describe all_orders; SELECT sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id; M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: ROLLUP (Oracle) • Can write GROUP BY queries for year or region or both: SELECT r.name region, o.year, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY (r.name, o.year); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: ROLLUP (Oracle) • ROLLUP operator • Extension of GROUP BY • Does GROUP BY on several levels, simultaneously • Order matters • Get sales totals for each region/year pair each region, and the grand total: SELECT r.name region, o.year, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP (r.name, o.year); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: ROLLUP (Oracle) • Change the order of the group fields to get a different sequence of groups • To get totals for each year/region pair, each year, and the grand total, and just reverse group-by order: SELECT o.year, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP (o.year, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: ROLLUP (Oracle) • Adding more dimensions, like month, is easy (apart from formatting): • NB: summing happens on each level SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP (o.year, o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: ROLLUP (Oracle) • If desired, can combine fields for the sake of grouping: SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP ((o.year, o.month), r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: CUBE (Oracle) • Another GROUP BY extension: CUBE • Subtotals all possible combins of group-by fields (powerset) • Syntax: “ROLLUP” “CUBE” • Order of fields doesn’t matter (apart from ordering) • To get subtotals for each region/month pair, each region, each month, and the grand total: SELECT to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY CUBE (o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW extensions in SQL: CUBE (Oracle) • Again, can easily add more dimensions: SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY CUBE (o.year, o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
DW SQL exts: GROUPING SETS (Oracle) • That’s a lot of rows • Instead of a cube of all combinations, maybe we just want the totals for each individual field: SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY GROUPING SETS (o.year, o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005
Next time • Overview of data mining • Some other odds & ends… M.P. Johnson, DBMS, Stern/NYU, Spring 2005