
Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark


Presentation Transcript


  1. Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark Gavin Li, Jaebong Kim, Andy Feng Yahoo

  2. Agenda • Audience Expansion Spark Application • Spark scalability: problems and our solutions • Performance tuning

  3. How we built audience expansion on Spark

  4. Audience Expansion • Train a model to find users who behave similarly to the sample users • Find more potential “converters”

  5. System • Large-scale machine learning system • Logistic regression • TBs of input data, up to TBs of intermediate data • Hadoop pipeline uses 30,000+ mappers, 2,000 reducers, 16-hour run time • All Hadoop Streaming, ~20 jobs • Use Spark to reduce latency and cost

  6. Pipeline

  7. How to adopt Spark efficiently? • Very complicated system • 20+ Hadoop Streaming MapReduce jobs • 20k+ lines of code • TBs of data, person-months just to validate the data • 6+ people and 3 quarters to rewrite the system from scratch in Scala

  8. Our migration solution • Build a transition layer that automatically converts Hadoop Streaming jobs to Spark jobs • No need to change any Hadoop Streaming code • 2 person-quarters • Private Spark

  9. ZIPPO: Hadoop Streaming over Spark [architecture diagram: Audience Expansion Pipeline (20+ Hadoop Streaming jobs) → Hadoop Streaming → ZIPPO: Hadoop Streaming over Spark → Spark → HDFS]

  10. ZIPPO • A layer (ZIPPO) between Spark and the application • Implements all Hadoop Streaming interfaces • Migrate the pipeline without rewriting code • Can then focus rewriting effort on perf bottlenecks • Plan to open source [same stack diagram: Audience Expansion Pipeline → Hadoop Streaming → ZIPPO: Hadoop Streaming over Spark → Spark → HDFS]

  11. ZIPPO - Supported Features • Partition related • Hadoop Partitioner class (-partitioner) • num.map.key.fields, num.map.partition.fields • Distributed cache • -cacheArchive, -file, -cacheFile • Independent working directory for each task instead of each executor • Hadoop Streaming aggregation • Input data combination (to mitigate many small files) • Customized OutputFormat, InputFormat
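  For context, a typical Hadoop Streaming invocation exercising these features might look like the sketch below; the script names, paths, and field counts are illustrative, not from the deck, and the deck's "num.map.key.fields / num.map.partition.fields" presumably refer to the standard stream.num.map.output.key.fields and num.key.fields.for.partition properties:

      hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
          -D stream.num.map.output.key.fields=2 \
          -D num.key.fields.for.partition=1 \
          -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
          -file mapper.py -file reducer.py \
          -cacheArchive hdfs:///apps/ae/model.tgz#model \
          -mapper mapper.py -reducer reducer.py \
          -input /data/ae/events -output /data/ae/scored

  ZIPPO's goal was for commands like this to run unchanged, with Spark executing the map and reduce scripts.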

  12. Performance Comparison (1 TB data) • ZIPPO Hadoop Streaming • Spark cluster • 1 hard drive per host • 40 hosts • Perf data: 1 hr 25 min • Original Hadoop Streaming • Hadoop cluster • 1 hard drive per host • 40 hosts • Perf data: 3 hrs 5 min

  13. Spark Scalability

  14. Spark Shuffle • The map side of the shuffle writes all of its output to disk (shuffle files) • The data can be large scale, so it cannot all be held in memory • Reducers transfer all the shuffle files for their partition, then process them

  15. Spark Shuffle [diagram: each mapper (Mapper 1 … Mapper m) writes one shuffle file per reducer partition (Shuffle 1 … Shuffle n); each reducer partition (1 … n) fetches its shuffle file from every mapper]

  16. On each Reducer • Every partition needs to hold all the data from all the mappers • In a hash map • In memory • Uncompressed [diagram: a reducer with 4 cores holding partitions 1–4, each partition collecting shuffle output from mappers 1 … n]

  17. How many partitions? • Partitions need to be small enough that each one fits entirely in memory [diagram: hosts with 4 cores each, processing partitions 1 … n]

  18. Spark needs many partitions • So a common pattern when using Spark is to use a large number of partitions

  19. On each Reducer • For a 64 GB memory host • 16-core CPU • With a 30:1 compression ratio and 2x in-memory overhead • To process 3 TB of data, needs 46,080 partitions • To process 3 PB of data, needs ~46 million partitions
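  The 46,080 figure follows directly from those assumptions; a quick back-of-the-envelope check in Scala, using only the numbers on the slide:

      // 64 GB host, 16 cores => ~4 GB of memory per concurrently running reduce task
      val memPerTaskGb = 64.0 / 16
      // 3 TB compressed input, 30:1 compression ratio, 2x in-memory overhead
      val inMemoryGb   = 3.0 * 1024 * 30 * 2                   // 184,320 GB once expanded
      val partitions   = math.ceil(inMemoryGb / memPerTaskGb)  // = 46,080 partitions
      // The same arithmetic for 3 PB of input gives the slide's ~46 million partitions.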

  20. Non-scalable • Not linearly scalable • No matter how many hosts we have in total, we always need 46k partitions

  21. Issues with a huge number of partitions • Issue 1: OOM on the mapper side • Each mapper core needs to write to 46k shuffle files simultaneously • 1 shuffle file = OutputStream + FastBufferStream + CompressionStream • Memory overhead: • FD and related kernel overhead • FastBufferStream (to turn random IO into sequential IO), default 100 KB buffer per stream • CompressionStream, default 64 KB buffer per stream • So by default the total buffer size is 164 KB * 46k * 16 = 100+ GB

  22. Issues with a huge number of partitions • Our solution to the mapper OOM • Set spark.shuffle.file.buffer.kb to 4 KB for FastBufferStream (the kernel block size) • Based on our contributed patch https://github.com/mesos/spark/pull/685 • Set spark.storage.compression.codec to spark.storage.SnappyCompressionCodec to enable snappy and reduce the footprint • Set spark.snappy.block.size to 8192 to reduce the buffer size (while snappy still gets a good compression ratio) • Total buffer size after this: 12 KB * 46k * 16 = 10 GB
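  A sketch of those settings as they would be applied before creating the SparkContext. The property names are the ones on the slide (from the team's private Spark and patches of that era); current Spark releases use different names (e.g. spark.shuffle.file.buffer, spark.io.compression.codec), and the 4 KB value is our reading of the slide's "4k":

      // Shrink the per-shuffle-file write buffer to the kernel block size (4 KB).
      System.setProperty("spark.shuffle.file.buffer.kb", "4")
      // Use snappy compression to cut the buffer and data footprint.
      System.setProperty("spark.storage.compression.codec", "spark.storage.SnappyCompressionCodec")
      // Smaller snappy block buffer (8 KB) while keeping a good compression ratio.
      System.setProperty("spark.snappy.block.size", "8192")
      // Per-host buffer space drops from ~164 KB to ~12 KB per open shuffle file:
      // 12 KB * 46,080 files * 16 cores ≈ 10 GB instead of 100+ GB.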

  23. Issues with a huge number of partitions • Issue 2: large number of small files • Each input split on the mapper side is broken down into at least 46K partitions • A large number of small files causes lots of random R/W IO • When each shuffle file is less than 4 KB (the kernel block size), the overhead becomes significant • Significant metadata overhead in the FS layer • Example: merely deleting the whole tmp directory by hand can take 2 hours because there are so many small files • Especially bad when splits are not balanced • 5x slower than Hadoop [diagram: input splits 1 … n, each fanning out into shuffle files 1 … 46080]

  24. Reduce-side compression • In the current shuffle, reducer-side data in memory is not compressed • It can take 10-100 times more memory • With our patch https://github.com/mesos/spark/pull/686, we reduced memory consumption by 30x, while the compression overhead is less than 3% • Without this patch it doesn’t work for our case • 5x-10x performance improvement

  25. Reduce-side compression • Reducer side • With compression – 1.6k shuffle files • Without compression – 46k shuffle files

  26. Reducer-Side Spilling [diagram: the reducer aggregates into compression buckets 1 … n and spills them to disk as Spill 1, Spill 2, … Spill n]

  27. Reducer-Side Spilling • Spills the over-size data in the aggregation hash table to disk • Spilling – more IO overall, but more sequential IO and fewer seeks • All in memory – less IO, but more random IO and more seeks • Fundamentally resolves Spark’s scalability issue
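  A minimal sketch of the idea in Scala (not Spark's actual implementation, and assuming serializable keys and values): aggregate into an in-memory map, write the map out as one sequential spill file whenever it grows too large, and merge the spills at the end. A production version would sort the spills and merge them in a streaming fashion rather than re-loading everything.

      import java.io._
      import scala.collection.mutable

      class SpillableAggregator[K, V](mergeValue: (V, V) => V,
                                      maxInMemoryEntries: Int = 1000000) {
        private val inMemory = mutable.HashMap.empty[K, V]
        private val spills   = mutable.ArrayBuffer.empty[File]

        def insert(key: K, value: V): Unit = {
          inMemory(key) = inMemory.get(key).map(mergeValue(_, value)).getOrElse(value)
          if (inMemory.size >= maxInMemoryEntries) spill()
        }

        // Write the current hash table to disk as one large sequential file, then clear it.
        private def spill(): Unit = {
          val file = File.createTempFile("spill", ".bin")
          val out  = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
          try inMemory.foreach { case (k, v) => out.writeObject((k, v)) } finally out.close()
          spills += file
          inMemory.clear()
        }

        // Merge the in-memory table with all spill files to produce the final aggregates.
        def result(): Iterator[(K, V)] = {
          val merged = mutable.HashMap.empty[K, V] ++= inMemory
          for (file <- spills) {
            val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)))
            try {
              while (true) {
                val (k, v) = in.readObject().asInstanceOf[(K, V)]
                merged(k) = merged.get(k).map(mergeValue(_, v)).getOrElse(v)
              }
            } catch { case _: EOFException => () } finally in.close()
          }
          merged.iterator
        }
      }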

  28. Align with the previous partition function • Our input data comes from another MapReduce job • We use exactly the same hash function to reduce the number of shuffle files

  29. Align with the previous partition function • A new hash function may give a more even distribution, but it multiplies the shuffle files [diagram: the previous job that generates the input data partitions keys with mod 4, while the Spark job partitions with mod 5, so each input split fans out into 5 shuffle files]

  30. Align with the previous partition function • Use the same hash function [diagram: the previous job that generates the input data partitions keys with mod 4, and the Spark job also partitions with mod 4, so each input split maps to exactly 1 shuffle file]
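  A minimal sketch of what "use the same hash function" means on the Spark side, assuming the upstream MapReduce job used Hadoop's default hash-mod partitioning and that keys hash the same way in both jobs; the class name, partition count, and RDD name are illustrative, and the package name is the modern one (their private Spark predates it):

      import org.apache.spark.Partitioner

      // Mirrors Hadoop's HashPartitioner: partition = (hashCode & MAX_VALUE) % numReduces.
      class UpstreamAlignedPartitioner(override val numPartitions: Int) extends Partitioner {
        override def getPartition(key: Any): Int =
          (key.hashCode & Integer.MAX_VALUE) % numPartitions
      }

      // Because each input file was written by upstream reducer i, all of its keys land in
      // partition i again, so each map split contributes to one shuffle file instead of n.
      // Usage (hypothetical pair RDD): pairs.reduceByKey(new UpstreamAlignedPartitioner(8192), _ + _)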

  31. Align with the previous hash function • Our case: • 16M shuffle files, 62 KB each on average (5-10x slower) • vs. 8K shuffle files, 125 MB each on average • Several different input data sources • We take the partition function from the major one

  32. Performance tuning

  33. All About Resource Utilization • Maximize resource utilization • Use as much CPU, memory, disk, and network as possible • Monitor with vmstat, iostat, sar
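  As a concrete illustration (standard Linux tools; the 5-second interval is just an example), utilization during a run can be watched with:

      vmstat 5         # CPU, run queue, swap and memory pressure
      iostat -x 5      # per-device IO utilization and wait times
      sar -n DEV 5     # network throughput per interface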

  34. Resource Utilization • (This is an old diagram, to be updated)

  35. Resource Utilization • Ideally CPU/IO should be fully utilized • Mapper phase – IO bound • Final reducer phase – CPU bound

  36. Shuffle file transfer • Spark transfers all shuffle files into reducer memory before it starts processing • Non-streaming (very hard to change to streaming) • This leads to poor resource utilization • So make sure maxBytesInFlight is set big enough • Consider allocating 2x more threads than the number of physical cores
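  A sketch of the two knobs mentioned above. The property name below (spark.reducer.maxMbInFlight) is the one exposed by Spark releases of that era for the internal maxBytesInFlight limit; recent versions call it spark.reducer.maxSizeInFlight. The 96 MB and 32-slot figures are illustrative for a 16-core host, not values from the deck:

      // Allow more shuffle data to be in flight per reduce task (the old default was 48 MB).
      System.setProperty("spark.reducer.maxMbInFlight", "96")
      // Oversubscribe task slots (e.g. 32 on a 16-core host) so fetch waits overlap with
      // computation; in standalone mode this is SPARK_WORKER_CORES=32 on each worker.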

  37. Thanks. Gavin Li liyu@yahoo-inc.com Jaebong Kim pitecus@yahoo-inc.com Andrew Feng afeng@yahoo-inc.com
