Parallel-lazy performance Java 8 vs Scala vs GS Collections

Craig Motlin, June 2014

Goals
• Compare Java Streams, Scala parallel Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid

Lots of claims and opinions

Lots of evidence

Intro
• Solve the same problem in all three libraries
    • Java (1.8.0_05)
    • GS Collections (5.1.0)
    • Scala (2.11.0)
• Count how many even numbers are in a list of numbers
• Then accomplish the same thing in parallel
    • Data-level parallelism
    • Batch the data
    • Use all the cores

Performance Factors
Tests that isolate individual performance factors
• Count
• Filter, Transform, Transform, Filter, convert to List
• Aggregation
    • Market value stats aggregated by product or category

Count: Serial
Java 8:
    long evens = arrayList.stream()
        .filter(each -> each % 2 == 0).count();
GS Collections:
    int evens = fastList.count(each -> each % 2 == 0);
Scala:
    val evens = arrayBuffer.count(_ % 2 == 0)

Count: Serial Lazy
Java 8:
    long evens = arrayList.stream()
        .filter(each -> each % 2 == 0).count();
GS Collections:
    int evens = fastList.asLazy()
        .count(each -> each % 2 == 0);
Scala:
    val evens = arrayBuffer.view
        .count(_ % 2 == 0)

Count: Parallel Lazy
Java 8:
    long evens = arrayList.parallelStream()
        .filter(each -> each % 2 == 0).count();
GS Collections:
    int evens = fastList
        .asParallel(executorService, BATCH_SIZE)
        .count(each -> each % 2 == 0);
Scala:
    val evens = arrayBuffer.par.count(_ % 2 == 0)

Parallel Lazy (diagram): Batch → Filter and Count → Reduce

Parallel Eager (diagram): Batch → Filter → Count → Reduce
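The batch, filter-and-count, reduce pipeline from the diagrams can be sketched in plain Java. This is only an illustration of data-level parallelism, not how any of the three libraries implement it; the class name BatchedCount, the helper countEvens, and the batch size of 10_000 are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchedCount {
    // Hypothetical helper: batch the list, filter-and-count each batch on the
    // executor, then reduce the partial counts by simple addition.
    static int countEvens(List<Integer> list, ExecutorService executor, int batchSize) {
        List<Future<Integer>> partials = new ArrayList<>();
        for (int start = 0; start < list.size(); start += batchSize) {
            int end = Math.min(start + batchSize, list.size());
            List<Integer> batch = list.subList(start, end);
            partials.add(executor.submit(() -> {
                int count = 0;
                for (Integer each : batch) {
                    if (each % 2 == 0) {
                        count++;
                    }
                }
                return count;
            }));
        }
        int total = 0;
        try {
            for (Future<Integer> partial : partials) {
                total += partial.get(); // reduce step
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 1_000_000; i++) {
            numbers.add(i);
        }
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(cores);
        try {
            System.out.println(countEvens(numbers, executor, 10_000)); // prints 500000
        } finally {
            executor.shutdown();
        }
    }
}
```

Each batch filters and counts in a single pass, which corresponds to the lazy diagram; an eager variant would first materialize the filtered batch and count it in a second step.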

Time for some numbers!

[Chart] Serial Count, ops/s (higher is better)

[Chart] Parallel Count, ops/s (higher is better)
Measured on an 8-core Linux VM, Intel Xeon E5-2697 v2. Chart annotation: 8x

Java Microbenchmark Harness
“JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.”
• 5 forked JVMs per test
• 100 warmup iterations per JVM
• 50 measurement iterations per JVM
• 1 second of looping per iteration
http://openjdk.java.net/projects/code-tools/jmh/

Java Microbenchmark Harness
    @GenerateMicroBenchmark
    public void parallel_lazy_jdk() {
        long evens = this.integersJDK
            .parallelStream()
            .filter(each -> each % 2 == 0)
            .count();
        Assert.assertEquals(SIZE / 2, evens);
    }

Java Microbenchmark Harness
• @Setup includes megamorphic warmup
• More info on megamorphic in the appendix
• This is something that JMH does not handle for you!

Java Microbenchmark Harness
• Throughput: higher is better
• Enough warmup iterations so that standard deviation is low

Benchmark                         Mode   Samples     Mean  Mean error  Units
CountTest.parallel_eager_gsc      thrpt      250  629.961       8.305  ops/s
CountTest.parallel_lazy_gsc       thrpt      250  595.023       7.153  ops/s
CountTest.parallel_lazy_jdk       thrpt      250  415.382       7.766  ops/s
CountTest.parallel_lazy_scala     thrpt      250  331.938       2.141  ops/s
CountTest.serial_eager_gsc        thrpt      250  115.197       0.328  ops/s
CountTest.serial_eager_scala      thrpt      250   91.167       0.864  ops/s
CountTest.serial_lazy_gsc         thrpt      250   73.625       3.619  ops/s
CountTest.serial_lazy_jdk         thrpt      250   58.182       0.477  ops/s
CountTest.serial_lazy_scala       thrpt      250   84.200       1.033  ops/s
...

Java Microbenchmark Harness
• Performance tests are open sourced
• Read them and run them on your hardware
https://github.com/goldmansachs/gs-collections/

Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns

Let’s look at reasons for the differences in count()
• Underlying container implementation: isolated by using array-backed lists (ArrayList, FastList, and ArrayBuffer).
• Combine strategy: isolated because combination of intermediate results is simple addition.

Count: Java 8

Count: Java 8 implementation
    @GenerateMicroBenchmark
    public void serial_lazy_jdk() {
        long evens = this.integersJDK
            .stream()
            .filter(each -> each % 2 == 0)
            .count();
        Assert.assertEquals(SIZE / 2, evens);
    }

The test calls filter(Predicate).count() instead of count(Predicate).

Is count() just incrementing a counter?

Count: Java 8 implementation
    public final long count() {
        return mapToLong(e -> 1L).sum();
    }

    public final long sum() {
        return reduce(0, Long::sum);
    }

    /** @since 1.8 */
    public static long sum(long a, long b) {
        return a + b;
    }

Count: Java 8 implementation
    this.integersJDK
        .stream()
        .filter(each -> each % 2 == 0)
        .mapToLong(e -> 1L)
        .reduce(0, Long::sum);

    this.integersGSC
        .asLazy()
        .count(each -> each % 2 == 0);

Seems like extra work
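To see that the expanded pipeline really is what count() does in the JDK 8 source quoted in these slides, the two forms can be run side by side against a plain loop. This is a minimal sketch; the class and method names are made up for the example, and it checks equivalence of results only, not performance.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class StreamCountExpansion {
    // The form used in the benchmark.
    static long viaCount(List<Integer> list) {
        return list.stream()
            .filter(each -> each % 2 == 0)
            .count();
    }

    // The same pipeline with count() expanded by hand, following the JDK 8 source.
    static long viaMapAndReduce(List<Integer> list) {
        return list.stream()
            .filter(each -> each % 2 == 0)
            .mapToLong(e -> 1L)
            .reduce(0, Long::sum);
    }

    // What "just incrementing a counter" would look like.
    static int viaPlainLoop(List<Integer> list) {
        int count = 0;
        for (Integer each : list) {
            if (each % 2 == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> numbers = IntStream.rangeClosed(1, 1000)
            .boxed()
            .collect(Collectors.toList());
        // prints "500 500 500"
        System.out.println(viaCount(numbers) + " " + viaMapAndReduce(numbers) + " " + viaPlainLoop(numbers));
    }
}
```

Later JDK releases may implement count() differently, so the expansion should be read as specific to the version measured here.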

Count: GS Collections

Count: GS Collections
    @GenerateMicroBenchmark
    public void serial_lazy_gsc() {
        int evens = this.integersGSC
            .asLazy()
            .count(each -> each % 2 == 0);
        Assert.assertEquals(SIZE / 2, evens);
    }

Count: GS Collections
AbstractLazyIterable.java
    public int count(Predicate<? super T> predicate) {
        CountProcedure<T> procedure = new CountProcedure<T>(predicate);
        this.forEach(procedure);
        return procedure.getCount();
    }

Count: GS Collections
FastList.java
    public void forEach(Procedure<? super T> procedure) {
        for (int i = 0; i < this.size; i++) {
            procedure.value(this.items[i]);
        }
    }

Count: GS Collections
    public class CountProcedure<T> implements Procedure<T> {
        private final Predicate<? super T> predicate;
        private int count;
        ...
        public void value(T object) {
            if (this.predicate.accept(object)) {
                this.count++;
            }
        }

        public int getCount() {
            return this.count;
        }
    }

Predicate from the test: each -> each % 2 == 0
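The push-style counting shown in these slides can be tried without the library on the classpath. The sketch below uses stand-in types declared inside the example (Predicate, Procedure, CountProcedure, and a hypothetical ArrayBackedList); they mirror the shape of the quoted code but are not the GS Collections classes themselves.

```java
class PushCountSketch {
    // Stand-in functional interfaces, declared here so the example is self-contained.
    interface Predicate<T> {
        boolean accept(T each);
    }

    interface Procedure<T> {
        void value(T each);
    }

    // Evaluates the predicate on every element pushed to it and keeps a running count.
    static final class CountProcedure<T> implements Procedure<T> {
        private final Predicate<? super T> predicate;
        private int count;

        CountProcedure(Predicate<? super T> predicate) {
            this.predicate = predicate;
        }

        @Override
        public void value(T each) {
            if (this.predicate.accept(each)) {
                this.count++;
            }
        }

        int getCount() {
            return this.count;
        }
    }

    // Hypothetical array-backed container: forEach is a plain indexed loop.
    static final class ArrayBackedList<T> {
        private final T[] items;

        @SafeVarargs
        ArrayBackedList(T... items) {
            this.items = items;
        }

        void forEach(Procedure<? super T> procedure) {
            for (int i = 0; i < this.items.length; i++) {
                procedure.value(this.items[i]);
            }
        }

        int count(Predicate<? super T> predicate) {
            CountProcedure<T> procedure = new CountProcedure<>(predicate);
            this.forEach(procedure);
            return procedure.getCount();
        }
    }

    public static void main(String[] args) {
        ArrayBackedList<Integer> list = new ArrayBackedList<Integer>(1, 2, 3, 4, 5, 6);
        System.out.println(list.count(each -> each % 2 == 0)); // prints 3
    }
}
```

The container drives the loop and pushes each element into the procedure, so the only per-element work is one predicate call and one increment.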

Count: Scala

Count: Scala implementation
TraversableOnce.scala
    def count(p: A => Boolean): Int = {
      var cnt = 0
      for (x <- this)
        if (p(x)) cnt += 1
      cnt
    }

• The for-comprehension becomes a call to foreach()
• The lambda closes over cnt
• It executes the predicate and increments cnt, just like CountProcedure

Count: Scala implementation
    public final java.lang.Object apply(java.lang.Object);
       0: aload_0
       1: aload_1
       2: invokestatic  #32  // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
       5: invokevirtual #34  // Method apply:(I)Z
       8: invokestatic  #38  // Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
      11: areturn

    public final boolean apply(int);
       0: aload_0
       1: iload_1
       2: invokevirtual #21  // Method apply$mcZI$sp:(I)Z
       5: ireturn

    public boolean apply$mcZI$sp(int);
       0: iload_1
       1: iconst_2
       2: irem
       3: iconst_0
       4: if_icmpne 11
       7: iconst_1
       8: goto 12
      11: iconst_0
      12: ireturn

Count: Scala implementation
• Integer → int: Integer.intValue()
• Lambda: _ % 2 == 0 (bytecode: irem)
• boolean → Boolean: Boolean.valueOf(boolean)
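The same unbox, compute, re-box round trip can be reproduced in Java by pushing a predicate through a generic function type. This is a sketch for illustration only, with made-up class and method names; it shows where the conversions happen rather than measuring their cost.

```java
import java.util.function.Function;
import java.util.function.IntPredicate;

class BoxingSketch {
    // Generic path: every call unboxes the Integer argument (Integer.intValue())
    // inside the lambda and boxes the result (Boolean.valueOf(boolean)) on return.
    static int countBoxed(Integer[] values, Function<Integer, Boolean> predicate) {
        int count = 0;
        for (Integer each : values) {
            if (predicate.apply(each)) { // the Boolean result is unboxed here
                count++;
            }
        }
        return count;
    }

    // Primitive path: int in, boolean out, no wrapper objects involved.
    static int countPrimitive(int[] values, IntPredicate predicate) {
        int count = 0;
        for (int each : values) {
            if (predicate.test(each)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Integer[] boxed = {1, 2, 3, 4, 5, 6};
        int[] primitive = {1, 2, 3, 4, 5, 6};
        System.out.println(countBoxed(boxed, each -> each % 2 == 0));         // prints 3
        System.out.println(countPrimitive(primitive, each -> each % 2 == 0)); // prints 3
    }
}
```

Both methods return the same answer; the difference is only in how many conversions surround the single irem instruction.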

Performance Factors
Factors that stood out in count():
• Java’s pull lazy evaluation
• Scala’s auto-boxing

Parallel / Lazy / JDK
    List<Integer> list = this.integersJDK
        .parallelStream()
        .filter(each -> each % 10_000 != 0)
        .map(String::valueOf)
        .map(Integer::valueOf)
        .filter(each -> (each + 1) % 10_000 != 0)
        .collect(Collectors.toList());
    Verify.assertSize(999_800, list);

Parallel / Lazy / GSC
    MutableList<Integer> list = this.integersGSC
        .asParallel(this.executorService, BATCH_SIZE)
        .select(each -> each % 10_000 != 0)
        .collect(String::valueOf)
        .collect(Integer::valueOf)
        .select(each -> (each + 1) % 10_000 != 0)
        .toList();
    Verify.assertSize(999_800, list);

Parallel / Lazy / Scala
    val list = this.integers
      .par
      .filter(each => each % 10000 != 0)
      .map(String.valueOf)
      .map(Integer.valueOf)
      .filter(each => (each + 1) % 10000 != 0)
      .toBuffer
    Assert.assertEquals(999800, list.size)

[Chart] Stacked computation, ops/s (higher is better). Chart annotation: 8x


Parallel / Lazy / JDK
    List<Integer> list = this.integersJDK
        .parallelStream()
        .filter(each -> each % 10_000 != 0)
        .map(String::valueOf)
        .map(Integer::valueOf)
        .filter(each -> (each + 1) % 10_000 != 0)
        .collect(Collectors.toList());
    Verify.assertSize(999_800, list);
Collectors.toList() is assembled from three functions:
• Supplier: ArrayList::new
• Accumulator: List::add
• Combiner: (left, right) -> { left.addAll(right); return left; }
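Those three functions can be wired together by hand with Collector.of, which makes the combine step visible. A minimal sketch: the class name and the toListByHand helper are invented for the example, and the input is a small range rather than the benchmark data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ToListCollectorSketch {
    // Hypothetical helper built from the same supplier, accumulator, and combiner.
    static <T> Collector<T, List<T>, List<T>> toListByHand() {
        return Collector.<T, List<T>>of(
            ArrayList::new,                    // one new list per chunk of work
            List::add,                         // accumulate elements into that list
            (left, right) -> {                 // merge two partial results
                left.addAll(right);
                return left;
            });
    }

    public static void main(String[] args) {
        List<Integer> byHand = IntStream.rangeClosed(1, 1000)
            .boxed()
            .parallel()
            .filter(each -> each % 2 == 0)
            .collect(toListByHand());

        List<Integer> builtIn = IntStream.rangeClosed(1, 1000)
            .boxed()
            .parallel()
            .filter(each -> each % 2 == 0)
            .collect(Collectors.toList());

        System.out.println(byHand.equals(builtIn)); // prints true
    }
}
```

In a parallel stream the combiner runs once per pair of partial results, so every merge copies the right-hand list into the left-hand one.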

Fork-Join Merge
• Intermediate results are merged in a tree
• Merging is O(n log n) work and garbage
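The O(n log n) figure can be made concrete with a simplified model of the merge tree. The sketch below is not the JDK's fork-join splitting logic; it assumes a perfectly balanced binary split down to a chosen leaf size, merges with left.addAll(right), and counts only the elements copied from right-hand lists (ArrayList's internal resizing is ignored). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class MergeTreeCost {
    // Builds the list for [from, to) by recursive halving and merging,
    // adding the number of elements copied during merges to copied[0].
    static List<Integer> build(int from, int to, int leafSize, long[] copied) {
        if (to - from <= leafSize) {
            List<Integer> leaf = new ArrayList<>();
            for (int i = from; i < to; i++) {
                leaf.add(i);
            }
            return leaf;
        }
        int mid = (from + to) >>> 1;
        List<Integer> left = build(from, mid, leafSize, copied);
        List<Integer> right = build(mid, to, leafSize, copied);
        copied[0] += right.size(); // every element of right is copied into left
        left.addAll(right);        // right becomes garbage after this merge
        return left;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        int leafSize = 1 << 10;
        long[] copied = new long[1];
        List<Integer> result = build(0, n, leafSize, copied);
        System.out.println(result.size() + " elements, " + copied[0] + " copied during merges");
    }
}
```

With these sizes there are 10 merge levels and each level copies half of the elements, so the copy count grows as (n / 2) times the number of levels, and each discarded right-hand list is garbage.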
