Applications on Spark. Prof. Harold Liu, Beijing Institute of Technology, October 2015.
Who Is Using Spark These Days?
https://spark-summit.org/
According to the figure above, more than 1,000 companies have put Spark into production, including well-known traditional manufacturers such as Toyota and online-to-offline (O2O) companies such as Uber and Airbnb. This indicates that Spark's user base has expanded beyond the Internet industry into traditional industries. Many big data distribution vendors, including established Hadoop distributors such as Hortonworks and Cloudera, have begun to deploy Spark, which will further accelerate its spread.
Open Source Spark Community
The figure shows that the number of contributors grew rapidly from 2010 to 2014. Many Chinese organizations and developers are among these enthusiastic contributors. The largest Spark cluster, with over 8,000 nodes, runs at Tencent, and the largest amount of data processed in a single job is 1 PB, a record held by Alibaba and Databricks.
Entertainment: Tencent
Company Background: The biggest social service provider in China.
Data Background: By the end of 2015, QQ had more than 800 million monthly active users and WeChat had more than 600 million. Together they generate over 200 TB of data every day.
Business Requirement: Over 90% of the data needs to be processed online.
Tencent Distributed Data Warehouse
The Tencent Distributed Data Warehouse (TDW) collects all product-level data and provides data storage and analysis services. TDW supports PB-level data storage and computing. It has two parts: offline computing with M/R and online computing with Storm.
Hadoop vs. Spark on M/R
Spark runs much faster than Hadoop: its running time is only a quarter of Hadoop's, and it decreases further as more executors are added. Overall, the traditional Hadoop M/R framework has serious performance problems on data mining workloads, while Spark handles them well thanks to its iterative, in-memory computing.
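
The benefit of in-memory computing for iterative jobs can be illustrated with a toy sketch. It uses neither Hadoop nor Spark: `CountingSource` and both loop functions are hypothetical names invented here, and the "disk read" is only a counter. The point is that a chain of independent M/R-style passes re-reads the input every iteration, while a cached dataset is read once.

```python
# Toy illustration only: it counts simulated "disk reads" and does not use
# Hadoop or Spark. All names here are hypothetical.

class CountingSource:
    """Data source that counts how often it is read from 'disk'."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_reloading(source, iterations):
    """Re-read the input on every pass, like a chain of independent M/R jobs."""
    estimate = 0.0
    for _ in range(iterations):
        data = source.load()
        mean = sum(data) / len(data)
        estimate = (estimate + mean) / 2  # refine the estimate a little each pass
    return estimate


def iterate_cached(source, iterations):
    """Read the input once and keep it in memory for all later passes."""
    data = source.load()
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(data) / len(data)
        estimate = (estimate + mean) / 2
    return estimate


reloading_source = CountingSource(range(1, 101))
cached_source = CountingSource(range(1, 101))

reloading_result = iterate_reloading(reloading_source, iterations=10)
cached_result = iterate_cached(cached_source, iterations=10)

# Same answer, but 10 reads versus 1 read of the input.
print(reloading_source.reads, cached_source.reads)
```

Both variants compute the same value; only the number of input reads differs, and that difference grows with the number of iterations.
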
E-commerce: Taobao
• Company Background
  • The biggest C2C e-commerce company in China and a pioneering Spark user (since 2012).
• Data Background
  • As of 2014, Taobao had over 500 million registered members and 120 million active members.
  • Taobao recorded a turnover of over 90 billion on November 11, 2014.
  • Its various businesses generate TB-level data every day.
• Business Requirement
  • For the past few years, Taobao has used Yun Ti, its Hadoop-based platform. Hadoop, however, runs into many problems with iterative computing, which is why Taobao turned to Spark.
Spark in Taobao
[Figure: timeline of Spark use at Taobao; labels include a 10-node cluster, YARN version 0.23.7, and a 200-node YARN cluster.]
The figure shows the history of Spark use at Taobao, which adopted Spark in 2012, when the project was still very young.
Spark Development Process in Taobao
Before a job is deployed to the production servers, it is tested on test servers. The code is then merged into the local repository or pushed to the open source community.
Recommender System in Taobao
The recommender system combines Spark, Spark MLlib, and Spark Streaming. It performs both offline and online analysis, covering most of Taobao's business requirements.
Test of K-Means Algorithm
Increasing each worker's memory cuts the running time, and increasing the number of workers improves performance further.
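
To see why K-Means responds to both memory and worker count, it helps to look at the algorithm itself. The following is a minimal single-machine sketch of Lloyd's K-Means in plain Python; it is neither Taobao's test code nor the Spark MLlib implementation, and the sample points and starting centroids are made up for illustration. Every iteration re-scans all points (so keeping them in memory pays off), and the assignment step treats each point independently (so it can be split across workers).

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Plain Lloyd's algorithm: alternate assignment and update steps."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        # Points are independent here, which is what makes the step parallelizable.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Illustrative data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (9.0, 8.5)])
print(centroids)
```
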
Telecom: Telefonica
• Company Background
  • Telefonica is a Spanish telecommunications company that provides comprehensive services, including mobile phone, Internet, data, and wired television services.
• Data Background
  • Telefonica is the biggest multinational enterprise in Spain and serves customers in over 40 countries. Its various businesses generate huge volumes of data.
• Business Requirement
  • As the volume of data grows rapidly, network security problems such as DDoS attacks, SQL injection attacks, and account theft have become a concern.
  • Using big data analysis to prevent cybercrime has become urgent for the company.
Why Spark?
• Spark provides a full stack of libraries (SQL, Streaming, MLlib, GraphX).
• It is easy to use Spark to analyze both historical data and streaming data.
• It supports a variety of applications and data sources, so it can handle complex application scenarios.
• The SQL language can be used to tap the power of Spark.
• Spark has far fewer components than Hadoop.
Components of Spark and Hadoop
According to the figure above, Spark has about half as many components as Hadoop. With fewer components, a Spark deployment potentially has fewer sources of error.
Spark Production Architecture in Telefonica
• Data collection: Kafka
• Data pre-processing: Storm
• Batch processing: Cassandra + Spark
The architecture uses Kafka, a distributed message queue system, to collect data from various sources. Storm then consumes the data for pre-processing. Finally, the data is processed by Spark or saved in Cassandra.
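
The three stages can be sketched with in-process stand-ins to show how data flows between them. This is only an illustration under loud assumptions: an in-memory queue stands in for Kafka, a plain function for Storm, and a Python list plus a counter for Cassandra and Spark. The event fields (`src_ip`, `path`) and the request-count threshold are hypothetical and are not taken from Telefonica's system.

```python
from collections import Counter
from queue import Empty, Queue


def collect(raw_events):
    """Stand-in for Kafka: buffer events from various sources in a queue."""
    queue = Queue()
    for event in raw_events:
        queue.put(event)
    return queue


def preprocess(queue):
    """Stand-in for Storm: consume the queue, drop malformed events, normalize fields."""
    cleaned = []
    while True:
        try:
            event = queue.get_nowait()
        except Empty:
            break
        if "src_ip" in event and "path" in event:
            cleaned.append(
                {"src_ip": event["src_ip"].strip(), "path": event["path"].lower()}
            )
    return cleaned


def batch_analyze(events, store, threshold):
    """Stand-in for Cassandra + Spark: persist events, then count requests per source."""
    store.extend(events)  # "save" step
    counts = Counter(event["src_ip"] for event in store)  # "batch" step
    # Flag sources with unusually many requests (hypothetical rule).
    return sorted(ip for ip, n in counts.items() if n >= threshold)


raw_events = (
    [{"src_ip": "10.0.0.1 ", "path": "/Login"}] * 5
    + [{"src_ip": "10.0.0.2", "path": "/home"}]
    + [{"unexpected": "malformed record"}]
)

store = []
suspicious = batch_analyze(preprocess(collect(raw_events)), store, threshold=3)
print(suspicious)
```

Keeping the stages behind separate functions mirrors the decoupling that the message queue provides: each stage only depends on the shape of the data handed to it.
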
Retail: Euclid
• Company Background
  • Euclid Analytics is a geo-data analysis company that provides solutions to customers based on offline location information.
• Data Background
  • Euclid relies mainly on WiFi devices to collect data from the physical world.
• Business Requirement
  • Euclid's main job is to provide location-based analysis services for its customers.
  • By collecting shopper behavior data, it tries to understand shoppers' behavior and shopping characteristics and to suggest likely future behavior.
Retail Customer Features
• Based on the data collected from WiFi devices, shoppers can be divided into three groups: frequent customers, pass-by customers, and quick-leave customers.
• Some of them like to buy products, some spend a lot of time in the store, and some like to wander around a zone.
Analysis Procedure with Spark
First, WiFi devices collect mobile data from ping signals, which include the device MAC address, the signal strength, and other information. The data is then sent to the cloud and processed on a Spark cluster. Finally, customers can view the analysis results on the web.
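
As a rough illustration of how ping records could be turned into the three shopper groups, here is a small plain-Python sketch. It is not Euclid's method: the record layout `(mac, day, minute, rssi_dbm)`, the thresholds, and the extra "other" bucket are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
IN_STORE_RSSI_DBM = -70      # signal at least this strong counts as "inside the store"
FREQUENT_MIN_DAYS = 3        # in-store on at least this many days
QUICK_LEAVE_MAX_MINUTES = 5  # longest in-store stay no longer than this


def segment(pings):
    """Group pings by device MAC and assign each device to a segment."""
    by_device = defaultdict(list)
    for mac, day, minute, rssi in pings:
        by_device[mac].append((day, minute, rssi))

    segments = {}
    for mac, observations in by_device.items():
        inside = [(d, m) for d, m, rssi in observations if rssi >= IN_STORE_RSSI_DBM]
        if not inside:
            segments[mac] = "pass-by"  # seen only with a weak signal
            continue

        minutes_by_day = defaultdict(list)
        for day, minute in inside:
            minutes_by_day[day].append(minute)
        longest_stay = max(max(m) - min(m) for m in minutes_by_day.values())

        if len(minutes_by_day) >= FREQUENT_MIN_DAYS:
            segments[mac] = "frequent"
        elif longest_stay <= QUICK_LEAVE_MAX_MINUTES:
            segments[mac] = "quick-leave"
        else:
            segments[mac] = "other"
    return segments


# Made-up sample pings: (mac, day, minute, rssi_dbm)
pings = [
    ("aa:aa", 1, 0, -50), ("aa:aa", 1, 30, -50),
    ("aa:aa", 2, 0, -50), ("aa:aa", 2, 30, -50),
    ("aa:aa", 3, 0, -50), ("aa:aa", 3, 30, -50),
    ("bb:bb", 1, 0, -85), ("bb:bb", 1, 1, -80),
    ("cc:cc", 1, 10, -60), ("cc:cc", 1, 12, -60),
    ("dd:dd", 2, 0, -55), ("dd:dd", 2, 40, -55),
]

segments = segment(pings)
print(segments)
```

The per-device grouping is the step a cluster would distribute; the classification rule itself is a simple function of each device's own observations.
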
Other Area: PubMatic
• Company Background
  • PubMatic is an advertising company.
  • It developed the first real-time advertisement analysis system in the marketing field worldwide.
• Data Background
  • PubMatic has 6 geographically distributed data centers with 6 PB of data to manage.
  • Every day it serves 12 billion ads and handles 1,000 billion (1 trillion) bids.
  • Its system currently produces 22 TB of data.
• Business Requirement
  • Because its ad data is complex and varied, PubMatic needs to process the data in real time.
System Architecture in PubMatic
As the figure above shows, various streaming data flows are fed into memory and processed by Spark. Finally, the data is saved in HDFS and Amazon S3.
Spark vs. Hive on Query Performance
When the data volume is 192 GB, the query takes 550 seconds on Spark, while Hive needs 850 seconds for the same query. Across increasing data volumes, Spark's running time is on average 40% less than Hive's.
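
The 192 GB data point can be checked with a few lines of arithmetic. The function name `relative_saving` is introduced here for illustration; the two timings are the ones quoted above.

```python
def relative_saving(spark_seconds, hive_seconds):
    """Fraction of Hive's running time that Spark saves on the same query."""
    return (hive_seconds - spark_seconds) / hive_seconds


saving = relative_saving(spark_seconds=550, hive_seconds=850)
speedup = 850 / 550

print(f"Spark takes {saving:.1%} less time than Hive")  # 35.3%
print(f"Speedup: {speedup:.2f}x")                        # 1.55x
```

At this volume the saving is about 35%, somewhat below the 40% average reported across data volumes.
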
Effect of Using Spark in PubMatic
• Spark supports both offline and online data processing.
• It has active community support and is compatible with the Hadoop ecosystem.
• By using Spark Streaming, Spark SQL, and Spark MLlib together, PubMatic can provide real-time ad services and business analysis reports to customers faster than ever before.