Learning Objectives for Big Data
• Define big data and understand how it is differentiated from “regular old” data.
• Recognize examples and applications of big data.
• Understand the key problems we are trying to solve when coping with big data.
• Become aware of the “solutions” that people are using to cope with big data.
Big Data
• Definitions differ depending on perspective.
• Data that is difficult to process using traditional database and software techniques (abbreviated from Wikipedia/Webopedia).
• “Big” is relative to the organization.
• “Big” is relative in time.
Characteristics of big data
• Volume
• Variety
• Variability
• Velocity
• Veracity
What are the problems with big data?
• Dealing with different types of data.
  • Data that doesn’t have a clear data type.
  • Data that changes data type.
  • Unstructured data: does not have a pre-defined data model; usually text.
• Storing and accessing incredibly large quantities of data.
• Transforming and loading data immediately.
• Performing analytics immediately.
• Using “big data” to create “real information”.
Solutions for storing unstructured data
• Rows and columns don’t work.
• Need a “file” or “document” type of management system.
• Examples:
  • MongoDB
  • VelocityDB
  • Apache Hadoop (HDFS)
  • Oracle NoSQL
  • CouchDB
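To illustrate why a “document” type of system suits data without a fixed set of columns, here is a minimal sketch in Python. The TinyDocumentStore class, its methods, and the sample records are hypothetical and exist only for illustration; this is not the API of MongoDB, CouchDB, or any other product listed above.

```python
import json


class TinyDocumentStore:
    """Hypothetical in-memory stand-in for a document database."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        """Store a document as serialized JSON and return its id."""
        doc_id = self._next_id
        self._next_id += 1
        # Keep a serialized copy so later changes to `doc` do not leak in.
        self._docs[doc_id] = json.dumps(doc)
        return doc_id

    def find(self, **criteria):
        """Return documents whose top-level fields match all criteria."""
        results = []
        for raw in self._docs.values():
            doc = json.loads(raw)
            if all(doc.get(key) == value for key, value in criteria.items()):
                results.append(doc)
        return results


store = TinyDocumentStore()
# Each document carries its own fields; no shared table schema is required.
store.insert({"type": "tweet", "user": "ana", "text": "big data is big"})
store.insert({"type": "sensor", "device": 17, "readings": [20.1, 20.4]})
store.insert({"type": "tweet", "user": "bo", "text": "hello", "tags": ["intro"]})

tweets = store.find(type="tweet")
print(tweets)
```

The two tweet documents do not share the same fields, which is the situation where fixed rows and columns become awkward.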
Solutions for storing and accessing big data (1)
• Distribute processing of very large multi-structured data files across a large cluster of ordinary machines/processors.
• MapReduce
• Sharding/Horizontal partitioning
• Break the data into parts, which are then loaded into a file system on multiple nodes.
• Each part may be replicated multiple times.
• The results are collected and aggregated using a MapReduce algorithm, or another type of partitioning algorithm.
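The partition-then-aggregate idea can be sketched in a few lines of Python. This is a toy that runs in a single process: the "nodes" are plain lists, the function names and sample records are hypothetical, and replication of parts is left out. A real cluster framework would run the map step on each node in parallel.

```python
from collections import defaultdict
from zlib import crc32


def shard(records, num_nodes):
    """Horizontal partitioning: assign each record to a node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, text in records:
        nodes[crc32(key.encode("utf-8")) % num_nodes].append((key, text))
    return nodes


def map_phase(node_records):
    """Map: emit (word, 1) for every word stored on one node."""
    return [(word.lower(), 1) for _, text in node_records for word in text.split()]


def shuffle(mapped_outputs):
    """Group the emitted counts by word across all nodes."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for word, count in output:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the grouped counts into one total per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


records = [
    ("doc1", "big data is big"),
    ("doc2", "data moves fast"),
    ("doc3", "big clusters of ordinary machines"),
]
nodes = shard(records, num_nodes=3)
word_counts = reduce_phase(shuffle(map_phase(node) for node in nodes))
print(word_counts)
```

The totals do not depend on which node a record lands on, which is what lets the map step be spread across many machines.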
Solutions for storing and accessing big data (2)
• Lots of memory; really fast disk
• In-memory computing
  • HANA (SAP)
  • DB2 BLU (IBM)
  • Informix (IBM)
  • ActiveSpaces (TIBCO Software)
  • Oracle
• Database appliance: marketing term for an integrated set of servers, storage, operating system, and DBMS specifically pre-installed and pre-optimized for data warehousing (Wikipedia rules!!)
Solutions for TL immediacy
• Transform after loading data. Perform data loading and transformation continuously.
• Problems:
  • Most data transformation tools are not designed to work well with unstructured data.
  • Few frameworks are currently focusing on ETL, because the data is not “mission critical.”
• Opportunities!!!!
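A minimal sketch of "load first, transform afterwards, and do both continuously", written in Python. The two in-memory lists stand in for a raw landing area and a cleaned store, and every name and sample record here is hypothetical; a production pipeline would use durable storage and a scheduler or stream processor.

```python
import json

raw_store = []          # landing area: records are stored exactly as received
clean_store = []        # transformed records, ready for analysis
_transformed_upto = 0   # index of the first raw record not yet transformed


def load(raw_line):
    """Load step: append the raw text without parsing or validating it."""
    raw_store.append(raw_line)


def transform_pending():
    """Transform step: process only the records loaded since the last call."""
    global _transformed_upto
    for raw_line in raw_store[_transformed_upto:]:
        try:
            record = json.loads(raw_line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Text that is not a JSON object is kept as unstructured, not dropped.
            record = {"unstructured_text": raw_line}
        clean_store.append(record)
    _transformed_upto = len(raw_store)


# Loading never waits on transformation; transformation catches up in small batches.
for line in ['{"user": "ana", "amount": 12.5}', "free-form note from a call center"]:
    load(line)
transform_pending()

load('{"user": "bo", "amount": 3}')
transform_pending()
print(clean_store)
```

Because the raw text is kept untouched, the transformation rules can be changed later and re-run over the same loaded data.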
Solutions for analytics immediacy
• Define need for immediacy. Real-time or close??
• Streaming analytics: process data as it arrives; usually does not compare against all existing data; usually has a pre-defined “window” of time/data used for analytical processing. May or may not store the results of the analytical processes.
• Perpetual analytics: process data as it arrives, comparing it against existing data and then storing the results of the analytics.
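The difference between the two approaches can be sketched in Python. Both classes and the sample readings are hypothetical, and a running average is used only as a stand-in for whatever analysis is actually needed.

```python
from collections import deque


class StreamingAverage:
    """Streaming analytics: average over a fixed-size window of recent values."""

    def __init__(self, window_size):
        # A deque with maxlen discards the oldest value once the window is full.
        self.window = deque(maxlen=window_size)

    def process(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


class PerpetualAverage:
    """Perpetual analytics: analyze each value against all data seen so far
    (including the new value) and store the result."""

    def __init__(self):
        self.history = []   # every value seen so far
        self.results = []   # stored result of each analysis

    def process(self, value):
        self.history.append(value)
        overall = sum(self.history) / len(self.history)
        result = {"value": value, "overall_avg": overall, "above_avg": value > overall}
        self.results.append(result)
        return result


streaming = StreamingAverage(window_size=3)
perpetual = PerpetualAverage()
stream_out = []
for reading in [10, 20, 30, 40]:
    stream_out.append(streaming.process(reading))
    perpetual.process(reading)

print(stream_out)
print(perpetual.results[-1])
```

The streaming version keeps only the last three readings in memory, while the perpetual version grows with the data, which is the storage and processing trade-off between the two.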
Solutions for creating information from big data
• Culture of data-driven decision making.
• Data scientist.
• Information visualization techniques.