Wrangling Data With Oracle Connectors for Hadoop Gwen Shapira, Solutions Architect gshapira@cloudera.com @gwenshap
Data Has Changed in the Last 30 Years • END-USER APPLICATIONS • THE INTERNET • MOBILE DEVICES • SOPHISTICATED MACHINES [Chart: DATA GROWTH, 1980–2013 – UNSTRUCTURED DATA ~90%, STRUCTURED DATA ~10%]
Data Science is OSEMN (pronounced AWESOME) Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. Mostly Scrubbing. Actually.
Data Wrangling (n): The process of converting “raw” data into a format that allows convenient consumption
Hadoop rocks Data Wrangling • Cheap storage for messy data • Tools to play with data: • Acquire • Clean • Transform • Especially unstructured data • Flexibility where you need it most
Got unstructured data? • Data Warehouse: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, Protobuf, ORC, Parquet • Compression • Office, OpenDocument, iWork • PDF, EPUB, RTF • MIDI, MP3 • JPEG, TIFF • Java Classes • Mbox, RFC822 • AutoCAD • TrueType • HDF / NetCDF
But eventually, you need your data in your DWH. Oracle Connectors for Hadoop rock data loading.
Data Sources • Internal • OLTP • Log files • Documents • Sensors / network events • External: • Geo-location • Demographics • Public data sets • Websites
Getting Data into Hadoop • Sqoop • Flume • Copy • Write • Scraping • Data APIs
Sqoop Import Examples • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'" • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 • The split-by column must be indexed or partitioned to avoid 16 full table scans
Or… • hadoop fs -put myfile.txt /bigdata/project/myfile.txt • curl -i list_of_urls.txt • curl https://api.twitter.com/1/users/show.json?screen_name=cloudera { "id":16134540, "name":"Cloudera", "screen_name":"cloudera", "location":"Palo Alto, CA", "url":"http://www.cloudera.com", "followers_count":11359 }
And even… $ cat scraper.py
import urllib
from BeautifulSoup import BeautifulSoup

txt = urllib.urlopen("http://www.example.com/")
soup = BeautifulSoup(txt)
headings = soup.findAll("h2")
for heading in headings:
    print heading.string
Data Quality Issues • Given enough data – quality issues are inevitable • Main issues: • Inconsistent – “99” instead of “1999” • Invalid – last_update: 2036 • Corrupt - #$%&@*%@
Inconsistencies are endless • Upper vs. lower case • Date formats • Times, timezones, 24h… • Missing values • NULL vs. empty string vs. NA… • Variation in free format input • 1 PATCH EVERY 24 HOURS • Replace patches on skin daily
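To make the cleanup concrete, here is a minimal normalization sketch in Python; the date formats, the missing-value spellings, and the lower-casing rule are assumptions for illustration, not the presenter's actual code:

import re
from datetime import datetime

# Hypothetical date formats expected in the raw feed
DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%y", "%d %b %Y")

def normalize_date(value):
    # Try each known format; return ISO-8601, or None if nothing matches
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalize_missing(value):
    # Map the many spellings of "no value" to a single empty marker
    if value is None or value.strip().upper() in ("", "NULL", "NA", "N/A", "NONE"):
        return ""
    return value.strip()

def normalize_text(value):
    # Lower-case and collapse whitespace so free-format strings compare equal
    return re.sub(r"\s+", " ", value).strip().lower()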
Hadoop Strategies • Hive does not validate data on LOAD • A validation script is ALWAYS the first step • But not always enough • We have known unknowns • And unknown unknowns
Known Unknowns • Map-only script to: • Check number of columns per row • Validate not-null • Validate data type (“is number”) • Date constraints • Other business logic
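A minimal sketch of such a map-only check as a Hadoop Streaming mapper; the tab-delimited layout and the column names (id, name, amount, start_date) are assumed for illustration:

#!/usr/bin/env python
# Map-only validation: rows that pass every check are emitted unchanged
import sys
from datetime import datetime

EXPECTED_COLUMNS = 4   # assumed schema: id, name, amount, start_date

def is_valid(fields):
    if len(fields) != EXPECTED_COLUMNS:          # column count per row
        return False
    record_id, name, amount, start_date = fields
    if not record_id or not name:                # not-null checks
        return False
    try:
        float(amount)                            # "is number"
    except ValueError:
        return False
    try:                                         # date constraint: no future dates
        if datetime.strptime(start_date, "%Y-%m-%d") > datetime.now():
            return False
    except ValueError:
        return False
    return True

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if is_valid(fields):
        print("\t".join(fields))

Run it as a map-only job (reduce tasks set to 0) so the check scales with the input.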
Unknown Unknowns • Bad records will happen • Your job should move on • Use counters in Hadoop job to count bad records • Log errors • Write bad records to re-loadable file
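A sketch of that pattern with Hadoop Streaming: the reporter:counter line on stderr is the standard Streaming way to bump a job counter, while the trivial validation rule and logging bad rows to the task log (rather than a dedicated re-loadable HDFS file) are simplifications for this example:

#!/usr/bin/env python
# Keep going on bad records: good rows to stdout, bad rows counted and logged
import sys

def is_valid(fields):
    return len(fields) == 4 and all(fields)      # placeholder check

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if is_valid(fields):
        print("\t".join(fields))
    else:
        # Increment a counter that shows up in the job's status page
        sys.stderr.write("reporter:counter:DataQuality,BadRecords,1\n")
        # Log the offending row so it can be fixed and re-loaded later
        sys.stderr.write("BAD RECORD: %s\n" % line.rstrip("\n"))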
Solving Bad Data • Can be done at many levels: • Fix at source • Improve acquisition process • Pre-process before analysis • Fix during analysis • How many times will you analyze this data? • 0, 1, many, lots
Anonymization • Remove PII data • Names, addresses, possibly more • Remove columns • Remove IDs *after* joins • Hash • Use partial data • Create statistically similar fake data
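A minimal sketch of those ideas in Python; the salt, the field names, and the input formats are assumptions, and in practice the salted hash would be applied only after the joins that need the real IDs:

import hashlib

SALT = "keep-this-off-the-cluster"   # hypothetical secret salt

def pseudonymize(record_id):
    # Replace a real identifier with a stable salted hash: joins on the
    # hashed value still work, but the original ID is not exposed
    return hashlib.sha256((SALT + str(record_id)).encode("utf-8")).hexdigest()

def partial_zip(zip_code):
    # Keep only the 3-digit ZIP prefix instead of the full code
    return str(zip_code)[:3]

def partial_birth_date(date_of_birth):
    # Keep only the year of birth (assumes YYYY-MM-DD input)
    return str(date_of_birth)[:4]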
87% of the US population can be identified from gender, zip code and date of birth
Joins • Do at source if possible • Can be done with MapReduce (see the sketch below) • Or with Hive (Hadoop SQL) • Joins are expensive: • Do once and store results • De-aggregate aggressively • Everything a hospital knows about a patient
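One way to sketch such a join without Hive is a map-side (broadcast) join in a Streaming mapper; the file name shops.tsv, the delimiter, and the column order are assumptions for this example, and the small table is expected to fit in memory on every mapper:

#!/usr/bin/env python
# Map-side join: load the small "shops" table into memory on each mapper,
# then join the large sales stream against it row by row
import sys

# shops.tsv (shop_id <TAB> shop_name), shipped to every task
shops = {}
with open("shops.tsv") as f:
    for line in f:
        shop_id, shop_name = line.rstrip("\n").split("\t")
        shops[shop_id] = shop_name

# stdin: sale_id <TAB> shop_id <TAB> amount
for line in sys.stdin:
    sale_id, shop_id, amount = line.rstrip("\n").split("\t")
    print("\t".join([sale_id, shop_id, shops.get(shop_id, "UNKNOWN"), amount]))

Doing the join once and storing the de-normalized result, as the slide suggests, avoids repeating this work for every analysis.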
Process Tips • Keep track of data lineage • Keep track of all changes to data • Use source control for code
Sqoop sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /results/bar_data
FUSE-DFS • Mount HDFS on the Oracle server: • sudo yum install hadoop-0.20-fuse • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load data into Oracle
That’s nice. • But can you load data FAST?
Oracle Connectors • SQL Connector for Hadoop • Oracle Loader for Hadoop • ODI with Hadoop • OBIEE with Hadoop • R connector for Hadoop No, you don’t need BDA
Oracle Loader for Hadoop • Kinda like SQL Loader • Data is on HDFS • Runs as Map-Reduce job • Partitions, sorts, converts format to Oracle Blocks • Appended to database tables • Or written to Data Pump files for later load
Oracle SQL Connector for HDFS • Data is in HDFS • Connector creates external table • That automatically matches Hadoop data • Control degree of parallelism • You know External Tables, right?
Data Types Supported • Data Pump • Delimited text • Avro • Regular expressions • Custom formats
Main Benefit: • Processing is done in Hadoop
Benefits • High performance • Reduce CPU usage on Database • Automatic optimizations: • Partitions • Sort • Load balance
Measuring Data Load • Disks: ~300 MB/s each • SSD: ~1.6 GB/s each • Network: • ~100 MB/s (1GbE) • ~1 GB/s (10GbE) • ~4 GB/s (IB) • CPU: 1 CPU second per second per core • Need to know: CPU seconds per GB
Let's walk through this… We have 5 TB to load. Each core delivers 3600 CPU-seconds per hour. Loading 5000 GB will take: With FUSE: 5000 * 150 cpu-sec = 750,000 cpu-sec / 3600 = ~208 cpu-hours. With SQL Connector: 5000 * 40 cpu-sec = 200,000 cpu-sec / 3600 = ~55 cpu-hours. Our X2-3 half rack has 84 cores, so around 40 minutes to load 5 TB at 100% CPU. Assuming you use Exadata (InfiniBand + SSD = 8 TB/h load rate) and use all CPUs for loading.
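A quick sanity check of that arithmetic, sketched in Python; the per-GB CPU costs and core count are the figures quoted on the slide, not measurements:

# Back-of-the-envelope check of the load-time estimate above
data_gb = 5000                       # 5 TB to load
cores = 84                           # cores available for loading

fuse_cpu_hours = data_gb * 150 / 3600.0   # FUSE-DFS path: ~208 cpu-hours
osch_cpu_hours = data_gb * 40 / 3600.0    # SQL Connector path: ~55 cpu-hours

wall_clock_minutes = osch_cpu_hours / cores * 60   # ~40 minutes at 100% CPU
print(fuse_cpu_hours, osch_cpu_hours, wall_clock_minutes)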
Given fast enough network and disks, data loading will take all available CPU • This is a good thing