Parallel-lazy performance Java 8 vs Scala vs GS Collections. Craig Motlin June 2014. Goals. Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs
Parallel-lazy performance: Java 8 vs Scala vs GS Collections. Craig Motlin, June 2014
Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Lots of claims and opinions
Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Lots of evidence
Intro • Solve the same problem in all three libraries • Java (1.8.0_05) • GS Collections (5.1.0) • Scala (2.11.0) • Count how many even numbers are in a list of numbers • Then accomplish the same thing in parallel • Data-level parallelism • Batch the data • Use all the cores
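The "batch the data, use all the cores" idea can be sketched with plain JDK classes. This is a minimal illustration, not the benchmark code from the talk: the class name BatchedEvenCount, the helper names, and the batch size are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: count even numbers by splitting the list into batches.
final class BatchedEvenCount {
    // Serial count over one batch.
    static int countEvensSerial(List<Integer> batch) {
        int count = 0;
        for (Integer each : batch) {
            if (each % 2 == 0) {
                count++;
            }
        }
        return count;
    }

    // Submit one task per batch, then add up the partial counts.
    static int countEvens(List<Integer> numbers, ExecutorService executor, int batchSize) {
        List<Future<Integer>> partials = new ArrayList<>();
        for (int start = 0; start < numbers.size(); start += batchSize) {
            int end = Math.min(start + batchSize, numbers.size());
            List<Integer> batch = numbers.subList(start, end);
            Callable<Integer> task = () -> countEvensSerial(batch);
            partials.add(executor.submit(task));
        }
        int total = 0;
        try {
            for (Future<Integer> partial : partials) {
                total += partial.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 1_000_000; i++) {
            numbers.add(i);
        }
        // One thread per available core.
        ExecutorService executor =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            System.out.println(countEvens(numbers, executor, 10_000));
        } finally {
            executor.shutdown();
        }
    }
}
```

The only shared state is the list of futures, so combining the partial results is a simple sum on the calling thread.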
Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation • Market value stats aggregated by product or category
Count: Serial
Java 8:
    long evens = arrayList.stream()
        .filter(each -> each % 2 == 0)
        .count();
GS Collections:
    int evens = fastList.count(each -> each % 2 == 0);
Scala:
    val evens = arrayBuffer.count(_ % 2 == 0)

Count: Serial Lazy
Java 8:
    long evens = arrayList.stream()
        .filter(each -> each % 2 == 0)
        .count();
GS Collections:
    int evens = fastList.asLazy()
        .count(each -> each % 2 == 0);
Scala:
    val evens = arrayBuffer.view
        .count(_ % 2 == 0)

Count: Parallel Lazy
Java 8:
    long evens = arrayList.parallelStream()
        .filter(each -> each % 2 == 0)
        .count();
GS Collections:
    int evens = fastList
        .asParallel(executorService, BATCH_SIZE)
        .count(each -> each % 2 == 0);
Scala:
    val evens = arrayBuffer.par.count(_ % 2 == 0)
Parallel Lazy: Batch -> Filter and Count -> Reduce
Parallel Eager: Batch -> Filter -> Count -> Reduce
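One reading of the two diagrams is that the lazy pipeline fuses filtering and counting inside each batch, while the eager pipeline materializes the filtered batch before counting it. The sketch below shows that difference for a single batch. The class and method names are hypothetical, and it is not the GS Collections implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the per-batch work in the two diagrams.
final class LazyVsEagerBatch {
    // Lazy: filter and count in one pass, with no intermediate collection.
    static int filterAndCount(List<Integer> batch, Predicate<Integer> predicate) {
        int count = 0;
        for (Integer each : batch) {
            if (predicate.test(each)) {
                count++;
            }
        }
        return count;
    }

    // Eager: filter into a temporary list first, then count that list.
    static int filterThenCount(List<Integer> batch, Predicate<Integer> predicate) {
        List<Integer> filtered = new ArrayList<>();
        for (Integer each : batch) {
            if (predicate.test(each)) {
                filtered.add(each);
            }
        }
        return filtered.size();
    }

    public static void main(String[] args) {
        List<Integer> batch = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(filterAndCount(batch, each -> each % 2 == 0));
        System.out.println(filterThenCount(batch, each -> each % 2 == 0));
    }
}
```

Both methods return the same count; the eager version additionally allocates the temporary list for every batch.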
Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Time for some numbers!
Parallel Count ops/s (higher is better) Measured on an 8 core Linux VM Intel Xeon E5-2697 v2 8x
Java Microbenchmark Harness “JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.” • 5 forked JVMs per test • 100 warmup iterations per JVM • 50 measurement iterations per JVM • 1 second of looping per iteration http://openjdk.java.net/projects/code-tools/jmh/
Java Microbenchmark Harness
    @GenerateMicroBenchmark
    public void parallel_lazy_jdk() {
        long evens = this.integersJDK
            .parallelStream()
            .filter(each -> each % 2 == 0)
            .count();
        Assert.assertEquals(SIZE / 2, evens);
    }
Java Microbenchmark Harness • @Setup includes megamorphic warmup • More info on megamorphic in the appendix • This is something that JMH does not handle for you!
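A call site is megamorphic when it sees three or more receiver types, which limits how aggressively HotSpot can inline it. A warmup along the following lines forces that state before measurement, so the benchmark does not measure an unrealistically optimized single-lambda case. This is a minimal sketch assuming a hand-written count loop; the class name, the helper oneTo, and the three predicates are illustrative and are not taken from the actual @Setup method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a megamorphic warmup.
final class MegamorphicWarmupSketch {
    // The call to predicate.test() below is the call site being warmed up.
    static int count(List<Integer> list, Predicate<Integer> predicate) {
        int count = 0;
        for (Integer each : list) {
            if (predicate.test(each)) {
                count++;
            }
        }
        return count;
    }

    // Each lambda compiles to its own class, so the call site in count()
    // observes three different receiver types.
    static int warmUp(List<Integer> list, int rounds) {
        int sink = 0;
        for (int i = 0; i < rounds; i++) {
            sink += count(list, each -> each % 2 == 0);
            sink += count(list, each -> each % 3 == 0);
            sink += count(list, each -> each % 5 == 0);
        }
        return sink; // returned so the work is not dead code
    }

    // Builds the list 1..n.
    static List<Integer> oneTo(int n) {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            list.add(i);
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(warmUp(oneTo(1_000), 10_000));
    }
}
```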
Java Microbenchmark Harness • Throughput: higher is better • Enough warmup iterations so that standard deviation is low
Benchmark                        Mode   Samples  Mean     Mean error  Units
CountTest.parallel_eager_gsc     thrpt  250      629.961  8.305       ops/s
CountTest.parallel_lazy_gsc      thrpt  250      595.023  7.153       ops/s
CountTest.parallel_lazy_jdk      thrpt  250      415.382  7.766       ops/s
CountTest.parallel_lazy_scala    thrpt  250      331.938  2.141       ops/s
CountTest.serial_eager_gsc       thrpt  250      115.197  0.328       ops/s
CountTest.serial_eager_scala     thrpt  250      91.167   0.864       ops/s
CountTest.serial_lazy_gsc        thrpt  250      73.625   3.619       ops/s
CountTest.serial_lazy_jdk        thrpt  250      58.182   0.477       ops/s
CountTest.serial_lazy_scala      thrpt  250      84.200   1.033       ops/s
...
Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation • Market value stats aggregated by product or category
Java Microbenchmark Harness • Performance tests are open sourced • Read them and run them on your hardware https://github.com/goldmansachs/gs-collections/
Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns
Performance Factors Factors that may affect performance • Underlying container implementation: isolated by using array-backed lists (ArrayList, FastList, and ArrayBuffer) • Combine strategy: isolated because combining intermediate results is simple addition • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Let’s look at reasons for the differences in count().
Count: Java 8 implementation
    @GenerateMicroBenchmark
    public void serial_lazy_jdk() {
        long evens = this.integersJDK
            .stream()
            .filter(each -> each % 2 == 0)
            .count();
        Assert.assertEquals(SIZE / 2, evens);
    }
Note filter(Predicate).count() instead of count(Predicate). Is count() just incrementing a counter?
Count: Java 8 implementation
    public final long count() {
        return mapToLong(e -> 1L).sum();
    }

    public final long sum() {
        return reduce(0, Long::sum);
    }

    /** @since 1.8 */
    public static long sum(long a, long b) {
        return a + b;
    }
Count: Java 8 implementation
Java 8, with count() expanded:
    this.integersJDK
        .stream()
        .filter(each -> each % 2 == 0)
        .mapToLong(e -> 1L)
        .reduce(0, Long::sum);
GS Collections:
    this.integersGSC
        .asLazy()
        .count(each -> each % 2 == 0);
Seems like extra work
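The expansion can be checked directly: the sketch below runs filter().count() next to the hand-expanded mapToLong().reduce() pipeline and a plain loop. The class and method names are made up for the example, and it only demonstrates that the results agree, not how the two forms perform.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch comparing count() with its expanded form.
final class StreamCountExpansion {
    // The form used in the benchmark.
    static long viaCount(List<Integer> list) {
        return list.stream()
            .filter(each -> each % 2 == 0)
            .count();
    }

    // The same pipeline with count() written out as map-to-1 plus a sum reduction.
    static long viaMapReduce(List<Integer> list) {
        return list.stream()
            .filter(each -> each % 2 == 0)
            .mapToLong(e -> 1L)
            .reduce(0, Long::sum);
    }

    // A plain loop that just increments a counter.
    static long viaLoop(List<Integer> list) {
        long count = 0;
        for (Integer each : list) {
            if (each % 2 == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> list = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
        System.out.println(viaCount(list) + " " + viaMapReduce(list) + " " + viaLoop(list));
    }
}
```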
Count: GS Collections
    @GenerateMicroBenchmark
    public void serial_lazy_gsc() {
        int evens = this.integersGSC
            .asLazy()
            .count(each -> each % 2 == 0);
        Assert.assertEquals(SIZE / 2, evens);
    }

Count: GS Collections
AbstractLazyIterable.java
    public int count(Predicate<? super T> predicate) {
        CountProcedure<T> procedure = new CountProcedure<T>(predicate);
        this.forEach(procedure);
        return procedure.getCount();
    }

Count: GS Collections
FastList.java
    public void forEach(Procedure<? super T> procedure) {
        for (int i = 0; i < this.size; i++) {
            procedure.value(this.items[i]);
        }
    }

Count: GS Collections
    public class CountProcedure<T> implements Procedure<T> {
        private final Predicate<? super T> predicate;
        private int count;
        ...
        public void value(T object) {
            if (this.predicate.accept(object)) {
                this.count++;
            }
        }

        public int getCount() {
            return this.count;
        }
    }
Predicate from the test: each -> each % 2 == 0
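Put together, the three pieces form a push-style count: the container drives the loop and pushes each element into the procedure. The following self-contained sketch mirrors that shape with JDK types only. It uses java.util.function.Predicate (test) in place of the GS Collections Predicate (accept), and the enclosing class PushCountSketch and its static helpers are assumptions made so the example compiles on its own.

```java
import java.util.function.Predicate;

// Hypothetical stand-alone sketch of the forEach + CountProcedure pattern.
final class PushCountSketch {
    // Stand-in for the GS Collections Procedure interface.
    interface Procedure<T> {
        void value(T each);
    }

    // Counts the elements that satisfy the predicate as they are pushed in.
    static final class CountProcedure<T> implements Procedure<T> {
        private final Predicate<? super T> predicate;
        private int count;

        CountProcedure(Predicate<? super T> predicate) {
            this.predicate = predicate;
        }

        @Override
        public void value(T each) {
            if (this.predicate.test(each)) {
                this.count++;
            }
        }

        int getCount() {
            return this.count;
        }
    }

    // Array-backed iteration, as in the FastList.forEach slide.
    static <T> void forEach(T[] items, int size, Procedure<? super T> procedure) {
        for (int i = 0; i < size; i++) {
            procedure.value(items[i]);
        }
    }

    static <T> int count(T[] items, int size, Predicate<? super T> predicate) {
        CountProcedure<T> procedure = new CountProcedure<>(predicate);
        forEach(items, size, procedure);
        return procedure.getCount();
    }

    public static void main(String[] args) {
        Integer[] items = {1, 2, 3, 4, 5, 6};
        System.out.println(count(items, items.length, each -> each % 2 == 0));
    }
}
```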
Count: Scala implementation
TraversableOnce.scala
    def count(p: A => Boolean): Int = {
      var cnt = 0
      for (x <- this)
        if (p(x)) cnt += 1
      cnt
    }
The for-comprehension becomes a call to foreach(). The lambda closes over cnt; it executes the predicate and increments cnt, just like CountProcedure.
Count: Scala implementation
    public final java.lang.Object apply(java.lang.Object);
       0: aload_0
       1: aload_1
       2: invokestatic  #32  // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
       5: invokevirtual #34  // Method apply:(I)Z
       8: invokestatic  #38  // Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
      11: areturn

    public final boolean apply(int);
       0: aload_0
       1: iload_1
       2: invokevirtual #21  // Method apply$mcZI$sp:(I)Z
       5: ireturn

    public boolean apply$mcZI$sp(int);
       0: iload_1
       1: iconst_2
       2: irem
       3: iconst_0
       4: if_icmpne 11
       7: iconst_1
       8: goto 12
      11: iconst_0
      12: ireturn
Count: Scala implementation • Integer -> int via Integer.intValue() • Lambda: _ % 2 == 0 (bytecode: irem) • boolean -> Boolean via Boolean.valueOf(boolean)
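The same unbox, compute, box round trip can be written out in Java to make the cost visible. This is an illustrative analogue, not the code scalac generates: the class BoxingSketch and its field and method names are assumptions for the example.

```java
import java.util.function.Function;
import java.util.function.IntPredicate;

// Hypothetical Java analogue of the generic apply(Object) bridge.
final class BoxingSketch {
    // Primitive version: the irem and the comparison, nothing else.
    static final IntPredicate IS_EVEN = value -> value % 2 == 0;

    // Generic version: unbox the argument, call the primitive version, box the result.
    static final Function<Integer, Boolean> BOXED_IS_EVEN =
        boxed -> Boolean.valueOf(IS_EVEN.test(boxed.intValue()));

    static int countBoxed(Integer[] items) {
        int count = 0;
        for (Integer each : items) {
            if (BOXED_IS_EVEN.apply(each)) { // unboxes the Boolean result again
                count++;
            }
        }
        return count;
    }

    static int countPrimitive(int[] items) {
        int count = 0;
        for (int each : items) {
            if (IS_EVEN.test(each)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBoxed(new Integer[] {1, 2, 3, 4, 5, 6}));
        System.out.println(countPrimitive(new int[] {1, 2, 3, 4, 5, 6}));
    }
}
```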
Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Findings for count(): Java’s pull lazy evaluation and Scala’s auto-boxing
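For readers unfamiliar with the push vs pull distinction, here is a generic sketch of both styles for a filtered count. In the pull style the consumer asks a filtering iterator for the next matching element; in the push style the source drives the loop and hands each element to a callback. It is a textbook illustration with made-up names and does not reproduce the internals of any of the three libraries.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical sketch contrasting pull-style and push-style evaluation.
final class PushVsPullSketch {
    // Pull: wraps a source iterator and skips ahead to the next match on demand.
    static <T> Iterator<T> filterPull(Iterator<T> source, Predicate<? super T> predicate) {
        return new Iterator<T>() {
            private T next;
            private boolean ready;

            @Override
            public boolean hasNext() {
                while (!ready && source.hasNext()) {
                    T candidate = source.next();
                    if (predicate.test(candidate)) {
                        next = candidate;
                        ready = true;
                    }
                }
                return ready;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                ready = false;
                return next;
            }
        };
    }

    static <T> int countPull(List<T> list, Predicate<? super T> predicate) {
        int count = 0;
        Iterator<T> matches = filterPull(list.iterator(), predicate);
        while (matches.hasNext()) {
            matches.next();
            count++;
        }
        return count;
    }

    // Push: the list iterates and calls back once per element.
    static <T> int countPush(List<T> list, Predicate<? super T> predicate) {
        int[] count = {0};
        list.forEach(each -> {
            if (predicate.test(each)) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(countPull(list, each -> each % 2 == 0));
        System.out.println(countPush(list, each -> each % 2 == 0));
    }
}
```

The pull version pays for two method calls and some state bookkeeping per element, which is the kind of overhead the push style avoids.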
Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation • Market value stats aggregated by product or category
Parallel / Lazy / JDK
    List<Integer> list = this.integersJDK
        .parallelStream()
        .filter(each -> each % 10_000 != 0)
        .map(String::valueOf)
        .map(Integer::valueOf)
        .filter(each -> (each + 1) % 10_000 != 0)
        .collect(Collectors.toList());
    Verify.assertSize(999_800, list);

Parallel / Lazy / GSC
    MutableList<Integer> list = this.integersGSC
        .asParallel(this.executorService, BATCH_SIZE)
        .select(each -> each % 10_000 != 0)
        .collect(String::valueOf)
        .collect(Integer::valueOf)
        .select(each -> (each + 1) % 10_000 != 0)
        .toList();
    Verify.assertSize(999_800, list);

Parallel / Lazy / Scala
    val list = this.integers
        .par
        .filter(each => each % 10000 != 0)
        .map(String.valueOf)
        .map(Integer.valueOf)
        .filter(each => (each + 1) % 10000 != 0)
        .toBuffer
    Assert.assertEquals(999800, list.size)
Parallel / Lazy / JDK
    List<Integer> list = this.integersJDK
        .parallelStream()
        .filter(each -> each % 10_000 != 0)
        .map(String::valueOf)
        .map(Integer::valueOf)
        .filter(each -> (each + 1) % 10_000 != 0)
        .collect(Collectors.toList());
    Verify.assertSize(999_800, list);
Collectors.toList() is built from three parts:
    supplier:    ArrayList::new
    accumulator: List::add
    combiner:    (left, right) -> { left.addAll(right); return left; }
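Those three functions can be assembled with Collector.of to get a collector that behaves like Collectors.toList() for this purpose. The sketch below is an approximation for illustration: the names ToListCollectorSketch, toListSketch, and run are made up, and the real JDK collector also declares characteristics that are omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.IntStream;

// Hypothetical sketch: a toList-like collector assembled from its three parts.
final class ToListCollectorSketch {
    static <T> Collector<T, List<T>, List<T>> toListSketch() {
        return Collector.<T, List<T>>of(
            ArrayList::new,                 // supplier: one new list per chunk of work
            List::add,                      // accumulator: append one element
            (left, right) -> {              // combiner: copy the right list into the left
                left.addAll(right);
                return left;
            });
    }

    // Runs a small parallel pipeline through the sketch collector.
    static List<Integer> run(int size) {
        return IntStream.range(0, size)
            .boxed()
            .parallel()
            .filter(each -> each % 10 != 0)
            .collect(toListSketch());
    }

    public static void main(String[] args) {
        System.out.println(run(100).size());
    }
}
```

In a parallel run the combiner is called once for every pair of partial lists, which is where the merge cost comes from.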
Fork-Join Merge • Intermediate results are merged in a tree • Merging is O(n log n) work and garbage
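The merge cost can be made concrete by counting how many elements addAll copies when leaf lists are merged pairwise in a tree. This is a simplified model with assumed names (TreeMergeCost, copiedElements); it assumes the leaf count is a power of two that divides the total size, and it ignores the extra copying caused by ArrayList growth.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: count elements copied by a pairwise tree merge.
final class TreeMergeCost {
    static long copiedElements(int totalSize, int leafCount) {
        // Build the leaves: equal-sized partial results.
        int leafSize = totalSize / leafCount;
        List<List<Integer>> level = new ArrayList<>();
        for (int i = 0; i < leafCount; i++) {
            List<Integer> leaf = new ArrayList<>();
            for (int j = 0; j < leafSize; j++) {
                leaf.add(i * leafSize + j);
            }
            level.add(leaf);
        }

        long copied = 0;
        // Merge neighbours until a single list remains.
        while (level.size() > 1) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2) {
                List<Integer> left = level.get(i);
                List<Integer> right = level.get(i + 1);
                copied += right.size(); // addAll copies every element of right
                left.addAll(right);
                next.add(left);
            }
            if (level.size() % 2 == 1) {
                next.add(level.get(level.size() - 1)); // odd list carried up unchanged
            }
            level = next;
        }
        return copied;
    }

    public static void main(String[] args) {
        System.out.println(copiedElements(1024, 8));
    }
}
```

With 1024 elements in 8 leaves, each of the 3 tree levels copies 512 elements, 1536 in total. Every level copies half of the data, so the total grows with the number of levels, and each right-hand list becomes garbage after its merge.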