Infobright Meetup Host: Avner Algom, May 28, 2012
Agenda
• Infobright
• Paul Desjardins, VP Business Development, Infobright Inc.: What part of the Big Data problem does Infobright solve? Where does Infobright fit in the database landscape?
• Joon Kim, Sr. Sales Engineer, Infobright Inc.: Technical overview and use cases
• WebCollage
• Lior Haham, WebCollage: Case study: using analytics for hosted web applications
• Zaponet
• Asaf Birenzvieg, CEO, Zaponet: Introduction / experience with Infobright
• Q/A
Growing Customer Base across Use Cases and Verticals
• 1,000 direct and OEM installations across North America, EMEA, and Asia
• 8 of the top 10 global telecom carriers use Infobright via OEMs/ISVs
The Machine-Generated Data Problem
"Machine-generated data is the future of data management." Curt Monash, DBMS2
• Machine-generated/hybrid data: weblogs; computer and network events; CDRs; financial trades; sensors, RFID, etc.; online game data
• Human-generated data: input from most conventional kinds of transactions, such as purchase/sale, inventory, manufacturing, and employment status change
(Chart: machine-generated data far outpaces human-generated data in rate of growth)
The Value in the Data
"Analytics drives insights; insights lead to greater understanding of customers and markets; that understanding yields innovative products, better customer targeting, improved pricing, and superior growth in both revenue and profits." Accenture Technology Vision, 2011
Current Technology: Hitting the Wall
Today's database technology requires huge effort and massive hardware.
(Chart: how performance issues are typically addressed, by pace of data growth)
Source: "Keeping Up with Ever-Expanding Enterprise Data," Joseph McKendrick, Research Analyst, Unisphere Research, October 2010
Infobright Customer Performance Statistics
Fast query response with no tuning or indexes:
• Mobile data (15MM events): 43 min with SQL Server → 23 seconds with Infobright
• Analytic queries: 2+ hours with MySQL → <10 seconds
• Oracle query set: 10 seconds–15 minutes → 0.43–22 seconds
• BI report: 7 hrs in Informix → 17 seconds
• Data load: 11 hours in MySQL ISAM → 11 minutes
Save Time, Save Cost
• Fastest time to value: download in minutes, install in minutes; no indexes, no partitions, no projections; no complex hardware to install
• Minimal administration: self-tuning and self-managing; eliminates or reduces aggregate table creation
• Outstanding performance: fast query response against large data volumes; load speeds over 2 TB/hour with DLP; high data compression, 10:1 to 40:1+
• Economical: low subscription cost, less data storage, industry-standard servers
Why use Infobright to deal with large volumes of machine-generated data?
• EASY: to install, to use
• AFFORDABLE: less hardware, low software cost
• FAST: fast queries, fast loads
Technical Overview of Infobright
Joon Kim, Senior Sales Engineer
joon.kim@infobright.com
Key Components of Infobright
Column-oriented, with a smarter architecture: load data and go.
• No indices or partitions to build and maintain
• Knowledge Grid automatically updated as data packs are created or updated
• Super-compact data footprint can leverage off-the-shelf hardware
• Knowledge Grid: statistics and metadata "describing" the super-compressed data
• Data Packs: data stored in manageably sized, highly compressed data packs
• Data compressed using algorithms tailored to data type
1. Column Orientation
Incoming rows are stored in a column-oriented layout: (1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;)
• Works well with aggregate results (sum, count, avg)
• Only the columns relevant to a query need to be touched (see the sketch after this list)
• Consistent performance with any database design
• Allows for very efficient compression
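To make the single-column point concrete, here is a minimal sketch (the table and column names are assumed for illustration, not taken from the deck; ENGINE=BRIGHTHOUSE is the Infobright engine used in the sample script later in this deck):

CREATE TABLE salaries (
  id        int,
  firstname varchar(30),
  lastname  varchar(30),
  salary    int
) ENGINE=BRIGHTHOUSE;

-- Aggregates over one column touch only that column's data packs;
-- id, firstname, and lastname are never read from disk.
SELECT SUM(salary), AVG(salary), COUNT(*) FROM salaries;

A row store would read entire rows to answer the same query; the columnar layout reads roughly a quarter of the data here, and compression shrinks it further.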
2. Data Packs and Compression
Data Packs
• Each data pack contains 65,536 data values; a 1M-row column, for example, is split into ⌈1,000,000 / 65,536⌉ = 16 data packs
• Compression is applied to each individual data pack
• The compression algorithm varies depending on data type and distribution (patent-pending compression algorithms)
Compression
• Results vary depending on the distribution of data among data packs
• A typical overall compression ratio seen in the field is 10:1
• Some customers have seen results of 40:1 and higher
• For example, 1TB of raw data compressed 10:1 requires only 100GB of disk capacity
3. The Knowledge Grid
• Knowledge Grid: applies to the whole table; Knowledge Nodes: built for each data pack
• Static information about the data (basic statistics, numerical ranges, character maps) is calculated during load
• Dynamic information is calculated during query
(Diagram: knowledge nodes per column and per data pack, DP1 through DP6)
Knowledge Grid Internals
• Data Pack Nodes (DPNs): a separate DPN is created for every data pack in the database to store basic statistical information
• Character Maps (CMAPs): every data pack that contains text gets a matrix recording the occurrence of every possible ASCII character
• Histograms: created for every data pack that contains numeric data; each divides the pack's value range into 1,024 MIN-MAX intervals
• Pack-to-Pack Nodes (PPNs): track relationships between data packs when tables are joined, so query performance gets better as the database is used
This metadata layer is about 1% of the compressed data volume. A sketch of the kinds of queries it can prune follows below.
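As a hedged illustration of what each node type makes possible (the table and columns below are hypothetical, not from the deck):

CREATE TABLE calls (
  duration int,
  city     varchar(30)
) ENGINE=BRIGHTHOUSE;

-- DPNs alone can answer this: per-pack statistics are already
-- stored, so no data pack needs to be decompressed.
SELECT MIN(duration), MAX(duration), COUNT(*) FROM calls;

-- Histograms prune numeric predicates: any pack whose 1,024
-- MIN-MAX intervals contain no value above 3600 is skipped.
SELECT COUNT(*) FROM calls WHERE duration > 3600;

-- CMAPs prune text predicates: any pack whose character map
-- records no occurrence of 'Z' cannot match and is skipped.
SELECT COUNT(*) FROM calls WHERE city LIKE '%Zurich%';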
Optimizer / Granular Computing Engine
• Query received
• Engine iterates on the Knowledge Grid; each pass eliminates data packs
• If any data packs are needed to resolve the query, only those are decompressed
(Diagram: a query such as "How are my sales doing this year?" is answered from the Knowledge Grid, roughly 1% of the data volume, plus only the relevant compressed data packs)
How the Optimizer Works

SELECT count(*) FROM employees
WHERE salary > 50000
  AND age < 65
  AND job = 'Shipping'
  AND city = 'Toronto';

• The Knowledge Grid classifies each data pack of each column (rows 1 to 65,536; 65,537 to 131,072; 131,073 onward; ...) as completely irrelevant, all values match, or suspect
• Find the data packs with salary > 50000
• Find the data packs that contain age < 65
• Find the data packs that have job = 'Shipping'
• Find the data packs that have city = 'Toronto'
• All rows flagged as irrelevant on any condition are then eliminated; in this example, entire row ranges have all their packs ignored
• Finally, only the one suspect pack that might hold matching rows is decompressed to finish the count
Infobright Architected on MySQL • “The world’s most popular open source database”
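Because Infobright is architected on MySQL, the standard MySQL client, connectors, and BI tools work unchanged. A small hedged illustration (the expectation that BRIGHTHOUSE is listed is inferred from the engine name used in the sample script below, not from captured output):

-- On an Infobright server, the BRIGHTHOUSE storage engine would be
-- expected to appear alongside the standard MySQL engines.
SHOW ENGINES;

-- Any MySQL driver or tool can connect and query as usual.
SELECT VERSION();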
Sample Script (Create Table, Import, Export)

USE Northwind;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
  CustomerID   varchar(5),
  CompanyName  varchar(40),
  ContactName  varchar(30),
  ContactTitle varchar(30),
  Address      varchar(60),
  City         varchar(15),
  Region       char(15),
  PostalCode   char(10),
  Country      char(15),
  Phone        char(24),
  Fax          varchar(24),
  CreditCard   float(17,1),
  FederalTaxes decimal(4,2)
) ENGINE=BRIGHTHOUSE;

-- Import the text file.
SET AUTOCOMMIT=0;
SET @bh_dataformat = 'txt_variable';
LOAD DATA INFILE "/tmp/Input/customers.txt" INTO TABLE customers
  FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
  LINES TERMINATED BY '\r\n';
COMMIT;

-- Export the data into BINARY format.
SET @bh_dataformat = 'binary';
SELECT * INTO OUTFILE "/tmp/output/customers.dat" FROM customers;

-- Export the data into TEXT format.
SET @bh_dataformat = 'txt_variable';
SELECT * INTO OUTFILE "/tmp/output/customers.text"
  FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
  LINES TERMINATED BY '\r\n'
  FROM customers;
Infobright 4.0: Additional Features
Built-in intelligence for machine-generated data: find the "needle in the haystack" faster
Work with Data Even Faster
DomainExpert: Breakthrough Analytics
• Intelligence to automatically optimize the database
• Enables users to add intelligence into the Knowledge Grid directly, with no schema changes
• Pre-defined/optimized for web data analysis: IP addresses, email addresses, URL/URI
• Can cut query time in half when using these data definitions
DomainExpert: Prebuilt plus DIY Options
• Pattern recognition enables faster queries: patterns are defined and stored, complex fields are decomposed into more homogeneous parts, and the database uses this information when processing a query
• Users can also easily add their own data patterns, identifying strings, numerics, or constants
• Financial trading example: the ticker feed "AAPL-350,354,347,349" is encoded as the pattern "%s-%d,%d,%d,%d" (see the sketch after this list)
• Will enable higher compression
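To show what the decomposition buys, here is a hedged sketch (the tables are hypothetical, and DomainExpert's actual configuration syntax is not shown in the deck; the point of DomainExpert is that it achieves this inside the engine, with no schema changes like the one below):

-- Raw form: one heterogeneous string per row, which compresses poorly.
CREATE TABLE ticks_raw (
  feed varchar(64)                 -- e.g. 'AAPL-350,354,347,349'
) ENGINE=BRIGHTHOUSE;

-- Decomposed per the pattern '%s-%d,%d,%d,%d': each part is now
-- homogeneous, so type-tailored compression and numeric histograms apply.
CREATE TABLE ticks_decomposed (
  symbol varchar(8),               -- '%s'     -> 'AAPL'
  p1 int, p2 int, p3 int, p4 int   -- '%d' x 4 -> 350, 354, 347, 349
) ENGINE=BRIGHTHOUSE;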
Get Data In Faster: DLP
Distributed Load Processor (DLP): near-real-time ad-hoc analysis
• Add-on product to IEE that linearly scales load performance
• Remote servers compress data and build Knowledge Grid elements on the fly, then append them to the data server running the main Infobright database
• It's all about speed: faster loads and queries, with linear scalability of data load for very high performance
Get Data In Faster: Hadoop
Big Data Hadoop support: near-real-time ad-hoc analysis
• DLP Hadoop connector extracts data from HDFS and loads it into Infobright at high speed
• Load hundreds of TBs or petabytes into Hadoop for bulk storage and batch processing
• Then load TBs into Infobright for near-real-time analytics, using the Hadoop connector and DLP (a loading sketch follows below)
• Use the right tool for the job: Infobright and Hadoop are a perfect complement for analyzing Big Data
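The deck does not show the connector's invocation, but the same flow can be hand-rolled; here is a hedged sketch (paths, table, and delimiters assumed) that streams data out of HDFS into an Infobright load through a named pipe, without landing a flat file:

-- Assumed shell setup before running the SQL:
--   mkfifo /tmp/hdfs_pipe
--   hadoop fs -cat /logs/2012-05-28/part-* > /tmp/hdfs_pipe &

CREATE TABLE weblogs (
  ts     datetime,
  url    varchar(255),
  status int
) ENGINE=BRIGHTHOUSE;

-- Infobright reads the pipe as if it were a file.
SET @bh_dataformat = 'txt_variable';
LOAD DATA INFILE '/tmp/hdfs_pipe' INTO TABLE weblogs
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';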
Rough Query: Speed Up Data Mining by 20x
Another Infobright breakthrough: near-real-time ad-hoc analysis
• Enables very fast iterative queries to quickly drill down into large volumes of data
• "Select roughly" instantaneously returns the interval range for the relevant data, using only the in-memory Knowledge Grid information
• Filtering can narrow results; need more detail? Drill down further with another rough query, or query for the exact answer
• Rough Query: data mining "drill down" at RAM speed (a syntax sketch follows below)
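As a hedged sketch of the iterative drill-down (the ROUGHLY keyword placement is inferred from the slide's "select roughly" phrasing, and the table is hypothetical):

CREATE TABLE events (
  user_id    int,
  event_time datetime
) ENGINE=BRIGHTHOUSE;

-- Step 1: rough query, answered from the in-memory Knowledge Grid
-- alone; returns the interval range holding the relevant data.
SELECT ROUGHLY MIN(event_time), MAX(event_time)
FROM events WHERE user_id = 12345;

-- Step 2: with the range narrowed, ask for the exact answer.
SELECT COUNT(*) FROM events
WHERE user_id = 12345
  AND event_time BETWEEN '2012-05-01' AND '2012-05-28';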
The Value Infobright Delivers
High performance with much less work and at lower cost
Infobright and Hadoop in Video Advertising: LiveRail
"Infobright and Hadoop are complementary technologies that help us manage large amounts of data while meeting diverse customers' needs to analyze the performance of video advertising investments." Andrei Dunca, CTO of LiveRail
Case Study: JDSU
• Annual revenues exceeded $1.3B in 2010
• 4,700 employees based in over 80 locations worldwide
• Its communications sector offers instruments, systems, software, services, and integrated solutions that help communications service providers, equipment manufacturers, and major communications users maintain their competitive advantage
JDSU Service Assurance Solutions • Ensure high quality of experience (QoE) for wireless voice, data, messaging, and billing. • Used by many of the world’s largest network operators
JDSU Project Goals
A new version of the Session Trace solution that would:
• Support very fast load speeds to keep up with increasing call volume and the need for near-real-time data access
• Reduce the amount of storage by 5x, while also keeping a much longer data history
• Reduce overall database licensing costs by 3x
• Eliminate customers' "DBA tax": the product should require zero maintenance or tuning while enabling flexible analysis
• Continue delivering the fast query response needed by Network Operations Center (NOC) personnel when troubleshooting issues, supporting up to 200 simultaneous users
Session Trace Application
For deployment at Tier 1 network operators, each site will store between 6 and 45TB of data, and the total data volume will range from 700TB to 1PB.
What Our Customers Say
• "Using Infobright allows us to do pricing analyses that would not have been possible before."
• "With Infobright, [this customer] has access to data within minutes of transactions occurring, and can run ad-hoc queries with amazing performance."
• "Infobright offered the only solution that could handle our current data load and scale to accommodate a projected growth rate of 70 percent, without incurring prohibitive hardware and licensing costs."
• "Using Infobright allowed JDSU to meet the aggressive goals we set for our new product release: reducing storage and increasing data history retention by 5x, significantly reducing costs, and meeting the fast data load rate and query performance needed by the world's largest network operators."
Where does Infobright fit in the database landscape?
• One size DOESN'T fit all
• Specialized databases are widely deployed, and are excellent at what they were designed for
• There are more open source specialized databases than commercial ones
• Cloud/SaaS use of specialty DBMSs is becoming popular
• Database virtualization has significantly lowered DBA costs
NoSQL: Unstructured Data Kings
• Schema-less designs
• Extreme transaction rates
• Massive horizontal scaling
• Heavy data redundancy
• Niche players
• Tame the unstructured: store anything, keep everything
(Graphic: top NoSQL offerings)
NoSQL Breakout: 120+ variants. Find more at nosql-databases.org
Lest We Forget Hadoop
• Scalable, fault-tolerant distributed system for data storage and processing
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: fault-tolerant distributed processing
Value add:
• Flexible: store schema-less data and add schema as needed
• Affordable: low cost per terabyte
• Broadly adopted: Apache project with a large, active ecosystem
• Proven at scale: petabyte+ implementations in production today
NewSQL: Operational, Relational Powerhouses • Overclock Relational Performance • Scale-Out • Scale “Smart” • New, Scalable SQL • Extreme Transaction Rates • Diverse Technologies • ACID Compliance