
Love the Data



Presentation Transcript


  1. Love the Data Three Stories About Data Management By Neil Hepburn (Dir. of Education, IRMAC) Love the Data. Three Stories About Data Management by IRMAC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Canada License. Based on a work at wikipedia.org.

  2. Speaker Bio and Relevant Experience • Bio • Data Architect for Empathica Inc. • Education: • Honours Bachelor of Mathematics in Computer Science from the University of Waterloo • Certified Data Management Professional (Mastery Level) • PMI Certified • 18 years in IS/IT, in both full-time and external consulting capacities, with a focus on Data Management over the past 7 years • GM of Marketing for an innovative iPhone app for Internet radio discovery • Director of Education for IRMAC (Toronto chapter of DAMA-I) • Relevant Experience • Consultant to Bell Mobility assisting in a reboot of their Market Analytics and Intelligence programme • Developed and implemented a Predictive Analytics model for Call Genie, directly advising their CEO and SVP of Marketing • Technical lead on Business Intelligence Modernization project at Empathica

  3. Presentation Roadmap • Why am I giving this presentation? • The Story of the Whiz Kids • The Story of the Relational Model • The Story of Twitter Analytics

  4. Why am I giving this presentation? • Data Management is an important discipline as we move to an increasingly data-driven society that relies on quality data to make fact-based decisions • Data Management exists at the intersection between technology and business • Requires understanding the underlying meaning of the data and how it relates to the business • Requires mastery of the technology used in the production, transformation, and consumption of information • Most IT personnel have a Computer Science degree or similar educational background • Computer Science and IT programs don’t generally teach data management • Data is regarded as little more than a “stage prop” • Databases are regarded as “bit buckets” • “Garbage in garbage out” is the prevailing attitude in IT departments • Data management is seen as a techno-bureaucracy

  5. Story of The Whiz Kids: The World Today • The current wave of “Cultures of Analytics” has begun to capture the popular imagination. In the last three years we have seen the following books released: • The Numerati (by Stephen Baker) • Competing on Analytics (by Thomas Davenport & Jeanne Harris) • Super Crunchers (by Ian Ayres) • Data Driven: Profiting from your most Important Asset (by Thomas C. Redman) • The Information (by James Gleick) • Much of the inspiration behind these books originates from “Moneyball: The Art of Winning an Unfair Game” (by Michael M. Lewis), which documents the success of the Oakland A’s through “Sabermetrics” – taking an analytical approach to team picks and real-time game strategy • It’s all good stuff, but really nothing new…

  6. Where did Evidence Based Management Begin? • Some companies were using data analytics to gain a competitive advantage • The very use of analytics was regarded as a secret weapon, and those employed in statistical analysis were warned not to discuss their work • William Sealy Gosset, employed by the Guinness brewery from 1899, applied statistics to both farm and brewery to determine the best-yielding varieties of barley • In 1908 Gosset published the “Student’s t-distribution”, which got its name from Gosset’s pseudonym “Student”

  7. Who were The Whiz Kids? (Pt. I) • The Whiz Kids trace their roots back to the US Army Air Forces under the command of Robert A. Lovett (Assistant Secretary of War for Air) • In 1939 Tex Thornton (who was the first Whiz Kid) hired nine other Whiz Kids from the Harvard Business School, including Robert McNamara and Edward Lundy • The team called themselves “Statistical Control” and committed themselves to a new managerial discipline, basing all decisions on numbers • Statistical Control saved $3.6 billion for the Army Air Forces in 1943 alone, while at the same time improving pilot and troop morale • After WWII, Tex Thornton sold all 10 Whiz Kids as a team to the Ford Motor Co., reporting directly to then-president Henry Ford II • Upon arrival, The Whiz Kids discovered the finance department was designed solely for IRS tax purposes, and was not a tool of management

  8. Who were The Whiz Kids? (Pt. II) • The Whiz Kids got off to a rocky start when two layers of management were inserted between them and Henry Ford II • Tex Thornton left the company, going on to head Litton Industries • Were ridiculed as “The Quiz Kids” (after a popular game show) • Nevertheless, through the techniques and discipline learned from Statistical Control, The Whiz Kids were able to provide substantial cost savings, while at the same time growing market share • After turning Ford around, they were relabelled “The Whiz Kids” • McNamara was the first to recognize safety as a feature and attempted to introduce seat belts as a standard feature (tragically, this decision was collectively snubbed by the entire auto industry, delaying their introduction) • Ed Lundy transformed finance from an IRS compliance cost centre into a reporting powerhouse, establishing the CFO as the right-hand-man of the CEO • By 1960, Robert McNamara had been promoted to president of the company, the first non-family member ever to run it • McNamara left the company shortly after to become JFK’s Secretary of Defence

  9. A Tale of Two Whiz Kids • Jack Reith was “a car guy” • Robert McNamara saw automobiles as consumer appliances, like a washing machine or refrigerator: simply a means of transportation • Jack Reith took it upon himself to get involved in design decisions with the Ford Edsel, and conceived the Mercury Comet • The Mercury Comet reflected Reith’s own convictions about driving as a romantic pastime • Both cars bombed, leading to Reith’s departure • McNamara learned that Volkswagens were gaining market share • It was common wisdom among auto execs that only “beatniks” were purchasing Volkswagens • McNamara commissioned a market research study, discovering that customers were often doctors and lawyers • He also learned that buyers purchased Volkswagens because their design made them easy to repair in one’s own driveway • McNamara commissioned the Ford Falcon, which went on to be a top-selling car • McNamara continued to rise at Ford, soon becoming president

  10. Lessons Learned From The Whiz Kids • They had the buy-in and full support of president Henry Ford II • They were disciplined and forced themselves to adhere to their own principles • As measured by IQ, they were the most intelligent persons Ford had ever hired. Robert McNamara in particular was “off the charts” • They acted as specialized generalists (i.e. versatilists): • Were as adept at data collection and statistical analysis as they were at leading and negotiating • Could perform each other’s tasks, but were focussed on a particular role • Continued to learn and seek out best practices • E.g. they implemented some of Peter Drucker’s teachings, such as divisional structuring • Their experience in the Air Force infused them with a humility and maturity allowing them to operate effectively within a large organization • In spite of their nickname “The Whiz Kids”, they were not prima donnas • They were competitive amongst themselves and were fiercely driven to demonstrate “bottom line” measurable improvements

  11. Smart People will Always Make Bad Decisions • Jonah Lehrer’s book “How We Decide” should be required reading for all analysts • The book explains why even the best of us are prone to make bad decisions • All too often, good information is wilfully ignored • Even McNamara made some famously bad decisions after he left Ford • As Secretary of Defence during the Vietnam War, Robert McNamara continued to order the use of Agent Orange, in spite of a report from the RAND Corporation showing that it did not help • McNamara disagreed with Edward Lansdale (a general who successfully led a counter-insurgency campaign in the Philippines), and ignored all his “unconventional wisdom” • McNamara (under Lyndon B. Johnson) arguably rationalized poor decisions he had already made on poor information, and refused to consider any new information • Therefore, if we are to truly act in a rational manner we must above all else embrace humility

  12. The Story of the Relational Model • Relational databases have been around since the 1970s; familiar examples include Oracle, DB2, SQLServer, PostgreSQL, MySQL, and Access • What came before relational databases? • Who invented the relational model and why? • Why is there a “holy war” between the relational purists and object oriented purists? • What are NOSQL (Not Only SQL) databases? • Why were they invented?

  13. Punched Card Era – pre magnetic storage • In 1725 punched cards were used in France by Basile Bouchon and Jean-Baptiste Falcon to control textile looms • The technique was improved by Joseph Marie Jacquard in 1801 • In 1832 Semyon Korsakov (a Ukrainian working for the Russian government) invented a search system using punched cards • In 1890 Herman Hollerith invented a punch card and tabulating machine for the United States Census • Size was the same as an 1887 dollar bill • Enough for 80 columns and 12 rows (the 80-column width still exists in terminals – e.g. the 80x25 Windows command prompt) • Hollerith left the US government and founded the Tabulating Machine Company in 1896 • This company later became IBM

  14. 1930s and 1940s: The Information Age Begins • In 1936 Alan Turing introduced the “Universal Turing Machine” as a thought experiment • Demonstrated that all computers are fundamentally the same • Divorcing the concept of computing from all physical implementations • In 1947 at AT&T’s Bell Labs, the first working transistor was created • In 1948 Claude E. Shannon, working at Bell Labs, published the seminal paper “A Mathematical Theory of Communication” • Shannon introduced the concept of the “bit” and showed how all information could be reduced to a stream of bits • Shannon’s paper sparked new thinking in practically every domain, and in particular led to huge paradigm shifts in physics, chemistry, biology, psychology, and anthropology • Randomness = Complexity = Information

  15. Early 1950s – pre generalization era • Scientific applications dominated the early 1950s, with a shift to business administrative systems by the end of the decade • Standard application packages were rare; most software was written custom for the customer (the money was in the hardware) • Payroll was the first “killer app” • General Electric set the standard for payroll processing in 1954 running on a Univac • Difficult to deal with special-case handling • Was more complicated than missile control systems • Essential Complexity! • Programmers spent much of their time writing low level data access and manipulation routines • A need to hide the complexity of data manipulation and retrieval from application programmers was well recognized

  16. Late 1950s – pre DBMS era (Pt. 2) • Software abstraction (known as “generalization”) began to take hold • Sorting was one of the first things to be “generalized” into re-usable code across customer installations • The first Report Generation Program was developed in 1957 by GE’s team at the Hanford Nuclear Reservation on its IBM 702 • Consumed as input a data dictionary and a file containing the desired report format (including calculated fields) • SHARE Data Processing Committee (like today’s Open Source communities) • First met October 2nd 1957, chaired by Charles Bachman

  17. 1960s – General Electric’s IDS • In 1961 Charles Bachman first developed IDS (Integrated Data Store) at General Electric • Was made possible by new “random access” disk technology – as opposed to sequential tapes • Developed as the DBMS for a Manufacturing Information and Control System (MIACS) used for GE’s High Voltage Switchgear (HVSG) Department • Later sold externally to Mack Trucks and Weyerhaeuser • World’s first true transaction-oriented DBMS • Followed a Network Model • Data element relationships were explicitly encoded and had to be explicitly traversed • Application programs had to be modified to take advantage of new indexes • Was later sold to B.F. Goodrich • Was modernized to behave more like an RDBMS and was rebranded IDMS (Integrated Database Management System) • Currently being sold by CA, running on IBM mainframes

  18. 1960s – IBM’s IMS/360 • In 1963 IBM was asked to build a “data base” for the Apollo space mission, to manage parts inventory • IMS (Information Management System) was originally built in collaboration with Rockwell Space Division and released in 1965 for IBM 7000 series hardware • Utilized a hierarchical data model • In 1966 IMS was moved under the development of OS/360 (under the leadership of Fred “Mythical Man Month” Brooks), and IMS was rebranded as IMS/360 • Available for routine use at Rockwell on August 14th 1968 • IMS/360 led to many changes to OS/360 itself to provide nonstop operation and recovery • IBM also developed an alternative DBMS called GIS (Generalized Information System). GIS supported more flexible querying, but never achieved the success of IMS • IMS 11 currently runs on IBM’s System z mainframes, and continues to sell well in telecom, airlines, and finance

  19. 1965–1973 DBTG and System/360 years • In 1965 Codasyl (Conference on Data Systems Languages) forms the DBTG (Data Base Task Group) • Was led by Charles Bachman (inventor of IDS) • DBTG’s mission was to create a DBMS standard • Standardized terms such as “record”, “set” and “database”, and added the term “schema” to describe the logical format of data • Some terms would later change (e.g. “Data Structure Class” is now referred to as a “Data Model”) • In 1964 IBM’s System/360 was designed to support software compatibility between varying hardware platforms • In 1968 IBM began unbundling software, consulting services, and training services • In 1969, the DBTG published a language specification for a “Network Database Model” known as the “Codasyl Data Model” • ANSI and ISO adopted the “Codasyl Data Model”, calling it “Network Database Language” (NDL), ISO 8907:1987 • The standard was eventually withdrawn in 1998, having been superseded by SQL standardization • The confluence of the DBTG recommendations, System/360, and IBM’s unbundling of software led to an explosion of DBMS vendors • In 1972 there were 82 vendors offering 275 packages for the life insurance industry • Major DBMSs were: IDMS; IMS; Cincom Total; System 2000; Adabas; and Datacom/DB • Fun fact: Tom Nies, CEO of Cincom, is the longest serving CEO of any IT company • Fun fact: In 1971, the Data Base Users Group was formed in Toronto (later renamed IRMAC [Information Resource Management Association of Canada]), which went on to become part of DAMA-I, and is still recognized as the first operating chapter of DAMA-I

  20. 1969: Enter the Relational Model • In 1969, Edgar F. Codd, working out of IBM’s San Jose Research Laboratory, internally published a paper titled “A Relational Model of Data for Large Shared Data Banks” • The paper was published externally in 1970 in Communications of the ACM • The Relational Model was grounded in pure mathematics • Set Theory (relational algebra) and First Order Logic (relational calculus) • The Relational Model proved to be better aligned with how the business viewed data • Perspective Neutral: shifted responsibility for specifying relationships between tables from the person designing them to the person querying them • Necessary for establishing large, general purpose databases shared between different departments and computer systems • Non-procedural (i.e. declarative): tell the RDBMS WHAT you want, not HOW to get the data • IBM initially passed on implementing Codd’s recommendations for fear of cannibalizing sales of IMS • In 1973 IBM began working on System R, based on Codd’s relational model, but the software architects were cut off from Codd and did not entirely understand the relational model • IBM eventually released a relational database, DB2, which to this day remains their flagship database offering • Fun fact: Codd was born in England, and moved to the US in 1948 to work for IBM as a programmer. In 1953, fed up with McCarthyism, he moved to Ottawa, Ontario and lived there for a decade before moving back to the US
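
To make the declarative point concrete, here is a minimal sketch in modern SQL (table and column names are hypothetical, not from Codd’s paper): the query states what result is wanted, and the relationship between the two tables is supplied by the person querying, in the join condition, rather than being hard-wired into the database as navigational links.

    -- Hypothetical tables: employees(emp_id, name, dept_id) and departments(dept_id, dept_name).
    -- The join condition is declared at query time; no predefined access path is navigated.
    SELECT d.dept_name, COUNT(*) AS headcount
    FROM employees e
    JOIN departments d ON d.dept_id = e.dept_id
    GROUP BY d.dept_name;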

  21. 1970s Commercial Implementations (RDMS and INGRES) • The first relational database was RDMS (Relational Data Management System), developed at MIT by L.A. Kraning and A.I. Fillat • Written in PL/I for the Multics OS • The relation concept was implemented as a matrix of reference numbers which refer to character-string datums stored elsewhere in distinct dataclass files • In 1973 two scientists at Berkeley – Michael Stonebraker and Eugene Wong – learned of the System R project and sought funding to create a relational database of their own • Stonebraker and Wong already had funding for a geographic database called Ingres (INteractive Graphics REtrieval System). They decided to abandon this project and pursue an RDBMS • Additional funding came from the National Science Foundation, Air Force Office of Scientific Research, the Army Research Office, and the Navy Electronic Systems Command • INGRES was developed at UC Berkeley by a rotating group of students and staff. An initial prototype was released in 1974 • Ran on DEC machines under UNIX • INGRES was quasi open source: you could purchase the source code for a fee, and build on it • Used a query language called Quel (as opposed to SQL), e.g.: range of e is employee retrieve (comp = e.salary / (e.age - 18)) where e.name = "Jones" • Many companies released products based on the INGRES source code • The most successful was Relational Technology Inc (RTI) • Robert Epstein was one of the lead developers; he went on to found Sybase, whose flagship RDBMS was eventually licensed to Microsoft and lives on as MS SQLServer

  22. 1980s Commercial Implementations (Oracle and Informix) • Fun fact: The name Oracle comes from the code name of a CIA project which the Oracle founders had all worked on while at the Ampex Corporation • Oracle was founded in 1977 by Larry Ellison, Bob Miner, and Ed Oates • The original name of the company was Software Development Laboratories (SDL), which became Relational Software Inc (RSI), and eventually was named after their flagship product Oracle • Ellison wanted to make a product that was compatible with IBM’s System R, but this was not possible since IBM kept the error codes secret • Oracle derived early success because it was written in C, and was easier to port to other hardware platforms • Oracle beat out Ingres by 1985 because it had standardized on SQL (as opposed to Ingres’ Quel), which was more popular • SQL was in fact based on IBM System R’s non-relational “SEQUEL” (Structured English Query Language) • Oracle also out-marketed Ingres • Informix (INFORMation on unIX) was founded in 1981 by Roger Sippl and Laura King • In 1985 Informix introduced a new product, ISQL, which moved database access code into the query engine (as opposed to requiring the client to perform direct C-ISAM manipulations) • This pioneered and set the stage for client-server computing, which came to dominate the 1990s

  23. The 90s: Object Oriented Databases (OODBMS) • In 1988 Versant became the first company to introduce an OODBMS (object oriented database management system) • The Object Data Management Group was formed in 1991, and ratified the Object Definition Language (ODL) and Object Query Language (OQL) • Sybase took a different approach and introduced Stored Procedures • Coupling code and data into the RDBMS – a key OOP principle • ANSI SQL (and RDBMS vendors) continue to add complex datatypes and operators to their offerings: • Geometric datatypes and operators • Spatial datatypes and operators • Hierarchy datatypes and operators • Oracle added Binary Large Objects (BLOBs) and recently Microsoft has added FILESTREAM support • OODBMSs have come back in cloud computing • Salesforce.com / force.com / database.com • Access 2010 Web DB (running on SharePoint 2010) • Example ODL class definitions and OQL query (from the slide):

    Type Date Tuple {year, day, month}
    Type year, day, month integer
    Class Manager attributes(
        id : string unique
        name : string
        phone : string
        set employees : Tuple { [Employee], Start_Date : Date })
    Class Employee attributes(
        id : string unique
        name : string
        Start_Date : Date
        manager : [Manager])

    SELECT manufacturer, AVG(SELECT part.pc.ram FROM partition part)
    FROM PCs pc
    GROUP BY manufacturer: pc.manufacturer

  24. Codd’s 12 Rules and Date’s Third Manifesto • Codd observed that no vendor had correctly implemented the relational model. To clarify his intent he published 13 (0 to 12) basic conditions that must be met in order for a DBMS to be considered “relational” • To this day, no vendor can satisfy all 13 rules. E.g.: • Updatable VIEWs are nigh impossible to implement • The “completeness constraint” cannot easily be implemented • In spite of the popularity of RDBMSs, starting in the 1980s and continuing through to the present, Christopher Date (who worked with Codd on the relational model) believed that commercial RDBMSs were not truly relational • In 1995 Christopher Date and Hugh Darwen published the Third Manifesto • The major theme of the Third Manifesto is that the relational model is not flawed; rather, RDBMS vendors have not correctly implemented it • In particular, SQL is flawed • Describes a new language called “D” to address SQL’s shortcomings • Dataphor is a DBMS implemented with D4 (a later version of “D”) • “Rel” is implemented in Java as an interpretation of Date’s manifesto • SQL continues to evolve in order to address its deficiencies, e.g.:

    D4:
        T group by { A } add { Concat(B, C order by { A, B }) Bs }
    Oracle 11.2 SQL:
        select A, listagg(B, C) within group (order by B) as Bs from T group by A

  25. The Object Relational Impedance Mismatch “Holy War” • A philosophical, never-ending, and somewhat imagined debate exists between the “relationalists” and the “object-orientists” • Set-oriented vs. graph-oriented • Thinking in sets vs. thinking in fine-grained, discrete objects • Data models within object oriented programs (e.g. Java, C#) don’t align with relational data models • Much time is spent interfacing with relational databases • ORM (Object Relational Mapping) layers like Hibernate and ADO.NET Entity Framework allow OOP developers to persist data from their own object models within an RDBMS • Creates a “virtual object database” • Some limitations still exist • Performance issues can arise (especially in joins and batch deletions) • Often leads to application-centric data models • Key data elements required for downstream reporting are often left out • ORM-centric languages also exist (e.g. Groovy) • RDBMS-centric people prefer accessing data via stored procedures (see the sketch below) • Creates clean separation between RDBMS and application • Some RDBMSs support extensions in non-RDBMS languages (e.g. SQLServer allows functions and stored procs to be written in C# or VB.NET, as well as custom built-in scalar and aggregate functions) • Fun fact: The term object-relational impedance mismatch is derived from the electrical engineering term impedance matching.
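
As a minimal illustration of the stored-procedure approach (MySQL dialect, hypothetical table and column names), the application calls the procedure and never touches the underlying tables directly, so the schema can evolve behind it:

    -- Hypothetical sketch: encapsulate data access behind a procedure.
    DELIMITER //
    CREATE PROCEDURE get_customer_orders(IN p_customer_id INT)
    BEGIN
        SELECT o.order_id, o.order_date, o.total_amount
        FROM orders o
        WHERE o.customer_id = p_customer_id
        ORDER BY o.order_date DESC;
    END //
    DELIMITER ;

    -- Application code then simply executes: CALL get_customer_orders(42);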

  26. The Semantic Web and Linked Data • The Relational Model generally operates under the “Closed World Assumption”: what is not known to be true is assumed to be false • NULL values are the exception that proves the rule • The Semantic Web is based on the opposite, the “Open World Assumption” • Because relational databases are centralized they guarantee data integrity, and users can safely apply First Order Logic to derive new facts • The Semantic Web, which is decentralized, cannot provide the same guarantees of integrity. However, it more closely resembles the organic (warts and all) nature of the Internet, and in turn brings the benefits that come with decentralization • The Semantic Web is a set of technologies under the purview of the W3C. They include: • RDF (Resource Description Framework): metamodel based on a subject, predicate, object pattern • SPARQL (SPARQL Protocol and RDF Query Language): SQL-like language for querying RDF data • Triplestore: database for storing RDF data • Semantic Web projects: • DBPedia (converting Wikipedia into RDF) • FOAF (Friend of a Friend) • Linking Open Data (one project to rule them all) • Ad hoc integration through web APIs seems to be more popular • Example SPARQL query (from the slide):

    PREFIX abc: <nul://sparql/exampleOntology#>
    SELECT ?capital ?country
    WHERE {
        ?x abc:cityname ?capital ;
           abc:isCapitalOf ?y .
        ?y abc:countryname ?country ;
           abc:isInContinent abc:Africa .
    }

  27. The Big Data Challenge and NOSQL • Big Data represents a class of problems which were hitherto seen as unrelated, but can in fact be solved with the same tools • Tracking (geolocation and proximity, ads, RFID, you name it) • Causal Factor Discovery • Smart Utility Meters • Genomics Analysis • Data bag (Entity Attribute Value, on-the-fly data modeling) • Two basic approaches: • Extended RDBMS (e.g. Columnar MPP RDBMS) • Leverages existing data warehouse tools, skills, and data models • Slower load times • Does not work well with unstructured data • NOSQL, Hadoop/MapReduce • Evolving set of tools, both low-level and high-level • Can deal with any kind of data, including BLOBs • Still cannot solve the problem of joining 1 billion dimension rows to 1 trillion facts • Other NOSQL DBs • MongoDB, CouchDB: document oriented (JSON); supports ad hoc data models and flexible querying • Redis, HBase: key-value; real-time analytics, Complex Event Processing • Cassandra, Riak: work well under heavy writes; Cassandra started at Facebook

  28. The Real Challenge of Data Management • Consider the challenge of managing your own personal data and optimizing your own life; everything here is related: • Finances • Courses • Home property (and all your possessions) • Telephone, Television, Internet • Personal Computer • Automobile, expenses, maintenance • Groceries • Dependents • Is an ultra-powerful, ultra-flexible database the solution? • Maintaining quality data requires tremendous discipline and sacrifices • Most companies can barely manage their Customer Master Data • Duplication of data is still commonplace • The real solutions are unglamorous, but separate the winners from the losers: • Master Data Management • Metadata Management • Data Governance • Enterprise Frameworks and Data Models • Cloud-based RDBMS: a good Swiss Army knife. Even MS Access will do.

  29. The Story of TopGun Twitter Analytics • Or... How To Build a Twitter Data Warehouse from public APIs and open source RDBMS and ETL tools • ...and keep the Open Source Code and run your own Twitter monitoring program

  30. Step 1: Choose your Subjects • Subjects are the most important WHATs • Always nouns • Our subjects? • Tweets • Twitterers

  31. The Art of Analytics: Deciding on Facts • In general, it’s difficult to know what questions to ask of our subjects – that is the art of Analytics • KPIs (Key Performance Indicators) help us determine which facts (quantitative data) to track • They also help us think about how we would like to pivot around these facts, i.e. what qualitative (dimension) data we also wish to capture • Altimeter Group has some fancy-sounding ones: • share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocate impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact • Let’s start simple: • follower count, following count, number of URL click-throughs • Decide on a partition key • Tweet Date (UTC) is an obvious one • For now this is not a priority

  32. The Art of Analytics: Deciding on Dimensions • Dimensions represent the qualitative attributes pertaining to the subject • If our subject is a tweet, the following dimensions are useful: • Keyword searched for to find the tweet • Time of the tweet (both GMT and local time) • Text of the tweet • Twitterer who sent the tweet • Location of the Twitterer • Software client used to send the tweet (e.g. TweetDeck) • Web sites referenced by the tweet • We can continue to add dimensions as we see necessary • Once we have our facts and dimensions, we can create a data model • A denormalized “Star Schema” is a tried-and-true approach to data warehouse modeling

  33. The Science of Analytics: Build out our Schema in an RDBMS
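
The original slide showed the schema as a diagram. A rough, hypothetical sketch of what such a star schema might look like in MySQL (illustrative names only, not the actual TopGun model):

    CREATE TABLE dim_twitterer (
        twitterer_key   INT AUTO_INCREMENT PRIMARY KEY,
        screen_name     VARCHAR(50)  NOT NULL COMMENT 'Twitter handle at the time of capture',
        location        VARCHAR(100) COMMENT 'Free-text location from the user profile',
        follower_count  INT,
        following_count INT
    );

    CREATE TABLE dim_date (
        date_key  INT PRIMARY KEY COMMENT 'Surrogate key, e.g. 20110315',
        full_date DATE NOT NULL,
        year      SMALLINT,
        month     TINYINT,
        day       TINYINT
    );

    CREATE TABLE fact_tweets (
        tweet_id       BIGINT PRIMARY KEY COMMENT 'Twitter-assigned tweet ID',
        date_key       INT NOT NULL,
        twitterer_key  INT NOT NULL,
        tweet_text     VARCHAR(140),
        url_clickthrus INT DEFAULT 0,
        FOREIGN KEY (date_key)      REFERENCES dim_date (date_key),
        FOREIGN KEY (twitterer_key) REFERENCES dim_twitterer (twitterer_key)
    );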

  34. The Science of Analytics: Data Definitions • Always include a plain English definition for every data element • Ideally the data definition is unambiguous, accurate, states what it is (as opposed to what it isn’t), and means the same to everybody
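
In MySQL, for instance, these definitions can live right in the schema as column comments (continuing the hypothetical tables sketched above):

    -- Attach or revise a plain-English definition on an existing column
    -- (MySQL requires the full column definition to be restated).
    ALTER TABLE fact_tweets
        MODIFY tweet_text VARCHAR(140)
        COMMENT 'Raw text of the tweet as returned by the Twitter search API';

    -- Definitions are then queryable from the data dictionary:
    SELECT column_name, column_comment
    FROM information_schema.columns
    WHERE table_name = 'fact_tweets';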

  35. The Science of Analytics: Use Staging Tables • Staging tables are mirrors of your fact tables • (e.g. staging_fact_tweets mirrors fact_tweets) • Staging tables allow you to prepare your fact table data without incurring the performance hits that normally occur when manipulating massive tables
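
A minimal sketch of the pattern (MySQL, hypothetical names): rows are loaded and cleaned in the staging table, then moved into the fact table in one set-based statement.

    -- Staging table mirrors the structure of the fact table.
    CREATE TABLE staging_fact_tweets LIKE fact_tweets;

    -- After the ETL loads raw rows into staging, the warehouse table is updated
    -- in bulk, skipping any tweets that were already loaded on a previous run.
    INSERT INTO fact_tweets
    SELECT s.*
    FROM staging_fact_tweets s
    LEFT JOIN fact_tweets f ON f.tweet_id = s.tweet_id
    WHERE f.tweet_id IS NULL;

    TRUNCATE TABLE staging_fact_tweets;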

  36. The Science of Analytics: Use an ETL tool to load data • ETL (Extract Transform Load) tools are purpose-built for loading data warehouses. Advantages include: • Easy to write code that runs safely in parallel • Configuration-oriented: safer to change in live production environments • Visual metaphor: self-documenting code, easier for others to understand and support

  37. Instrument ETL with job and data checkpointing • Data checkpointing ensures we only retrieve data that is new or has changed • All tweets have an auto-incrementing ID, and we can restrict our searches by this, using it as a data checkpoint • Since pulling data from web APIs is inherently unreliable, our jobs must be designed so that they keep track of the last successful operation, and can recover from that point • We also need to wait for Twitter’s rate limit (of 150) to reset
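
A rough sketch of the checkpointing idea (MySQL, hypothetical names): the highest tweet ID already loaded is read back and passed to the Twitter search API as its since_id parameter, and a small job table records the last successful step so a failed run can resume.

    -- Data checkpoint: the highest tweet ID already in the warehouse.
    SELECT COALESCE(MAX(tweet_id), 0) AS since_id
    FROM fact_tweets;

    -- Job checkpoint: one row per job, updated after each successful step.
    CREATE TABLE job_checkpoint (
        job_name     VARCHAR(50) PRIMARY KEY,
        last_step    VARCHAR(50),
        last_success TIMESTAMP
    );

    UPDATE job_checkpoint
    SET last_step = 'fact_load', last_success = NOW()
    WHERE job_name = 'tweet_warehouse_load';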

  38. Ensure history is accurate with Type II SCDs • An SCD (Slowly Changing Dimension) is a dimension (e.g. a tweet’s Twitterer) whose attributes change gradually over time • By adhering to a Type II SCD policy, we archive all changes to a Twitterer and match each tweet to the version of the Twitterer whose attributes were valid at the time of the tweet
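
A minimal Type II sketch (MySQL, hypothetical names, assuming dim_twitterer also carries valid_from, valid_to, and is_current columns): when a tracked attribute changes, the current row is closed off and a new version is inserted, so each fact keeps pointing at the version that was valid when the tweet happened.

    -- 1. Expire the current row for a Twitterer whose attributes have changed.
    UPDATE dim_twitterer
    SET valid_to = NOW(), is_current = 0
    WHERE screen_name = 'some_user'
      AND is_current = 1;

    -- 2. Insert the new version of the row.
    INSERT INTO dim_twitterer
        (screen_name, location, follower_count, following_count,
         valid_from, valid_to, is_current)
    VALUES
        ('some_user', 'Toronto', 1234, 200, NOW(), '9999-12-31', 1);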

  39. Start your engines • Set up some topics (e.g. NoSQL) • Enter some keywords for the topic • Begin running TopGun Twitter Analytics to commence data collection

  40. Load Data into BI Tool (or just query using SQL) • Some BI tools may require you to build an OLAP data model • OLAP tools build “cubes” which contain the aggregation of every fact, for every combination of dimension value • MOLAP tools handle sparsity well, and can achieve excellent compression, even for billions of dimension tuples
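
Even without a BI tool, the same aggregations an OLAP cube would pre-compute can be run directly against the star schema; a hypothetical sketch over the model above:

    -- Daily tweet volume and distinct twitterers.
    SELECT d.full_date,
           COUNT(*) AS tweet_count,
           COUNT(DISTINCT f.twitterer_key) AS distinct_twitterers
    FROM fact_tweets f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.full_date
    ORDER BY d.full_date;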

  41. Presentation Over: Download the source code • Includes Pentaho DI ETL source code, MySQL data model, and QlikView v10 load script • Licensed as Open Source under the GNU General Public License v3.0 • Can be downloaded from SourceForge • https://sourceforge.net/projects/topgun/ • NB: Requires a bit.ly developer key (free) • IP address is used to rate-limit Twitter user lookups (be aware if you’re sharing an IP, or using Twitter for other purposes) • Questions can be e-mailed to: education@irmac.ca
