280 likes | 356 Views
eHarmony in Cloud. Subtitle. Brian Ko. eHarmony. Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom. On average, 236 members in US marry every day. More than 20 million registered users. 1. Why Cloud?.
E N D
eHarmony in Cloud Subtitle Brian Ko
eHarmony • Online subscription-based matchmaking service • Available in United States, Canada, Australia and United Kingdom. • On average, 236 members in US marry every day. • More than 20 million registered users. 1
Why Cloud? • Problem exceeds the limits of the data center and data warehouse environment. • Leverage EC2 and Hadoop to scale data 2
Finding match • Model Creation 3
Find matching • Matching 4
Find Matching • Predicative Model Scores 5
Requirement • All the matches, scores, and user information should be archived daily • Ready for 10X growth • Possible O(n2) problem • Need to support set of models becoming more complex 6
Challenge • Current architecture is multi-tiered with a relational back-end • Scoring is DB join intensive • Data need constant archiving • Matches, match scores, user attributes at time of match creation • Model validation is done at a later time across many days • Need a non-DB solution 7
Solution • Open Source Java implementation of Google’s MapReduce framework • Distributes work across vast amounts of data • Hadoop Distributed File System (HDFS) provides reliability through replication • Automatic re-execution on failure/distribution • Scale horizontally on commodity hardware 8
Slide 9 • Simple Storage Service (S3) provides cheap unlimited storage. • Elastic Cloud Computing (EC2) enables horizontal scaling by adding servers on demand. 9
MapReduce • A large server farm can use MapReduce to process huge dataset. • Map step • Master node takes the input • Chops it up into smaller sub-problems • Distributes those to worker nodes. • Reduce step • Master node takes the answers to all the sub-problems • Combines them in a way to get the output 10
Why Hadoop • Mapper and Reducer are written by you • Hadoop provides • Parallelization • Shuffle and sort 11
Actual Process • Upload to S3 and start EC2 Cluster 13
Actual Process • Process and archive 14
Amazon Elastic MapReduce • It is a web service • EC2 cluster is managed for you behind the scenes • Starts Hadoop implementation of the MapReduce framework on Amazon EC2 • Each step can read and write data directly from and to S3 • Based on Hadoop 0.18.3 15
Elastic MapReduce • No need to explicitly allocate, start and shutdown EC2 instances • Individual jobs were managed by a remote script running on master node (no longer required) • Jobs are arranged into a job flow, created with a single command • Status of a job flow and all its steps are accessible by a REST service 16
Before Elastic Map Reduce • Allocate/Verify cluster • Push application to cluster • Run a control script on the master • Kick off each job step on the master • Create and detect a job completion token • Shut the cluster down 17
After Elastic MapReduce • With Elastic MapReduce we can do all this with a single local command • Uses jar and conf files stored on S3 • Various monitoring tools for EC2 and S3 are provided 18
Development Environment • Cheap to set up on Amazon • Quick setup - Number of servers is controlled by a config variable • Identical to production • Separate development account recommended 19
Cost comparison • Average EC2 and S3 Cost • Each run is 2 to 3 hours • $1200/month for EC2 • $100/month for S3 • Projected in-house cost • $5000/month for a local cluster of 50 nodes running 24/7 • A new company needs to add data center and operation personnel expense 20
Summary • Dev tools really easy to work with and just work right out of the box • Standard Hadoop AMI worked great • Easy to write unit tests for MapReduce • Hadoop community support is great. • EC2/S3/EMR are cost effective
The End 5 minutes of question time starts now!
Questions 4 minutes left!
Questions 3 minutes left!
Questions 2 minutes left!
Questions 1 minute left!
Questions 30 seconds left!
Questions TIME IS UP!