Benchmark: concatenate lists

Benchmark: concatenate lists

Recently, I came across this piece of code that concatenates two lists using Guava’s Lists utility class and the Stream API:

Lists.newArrayList(collectionOriginal,collectionValue)
        .stream()
        .flatMap(Collection::stream)
        .toList();

As this piece of code is in the hot path of the application and therefore called very frequently, I wondered if this was the best way to concatenate two lists.

Two lists can be concatenated in many different ways:

  • With the above code
  • With Stream.concat()
  • With Stream.of()
  • With ListUtils from Apache Commons Collection
  • With ArrayList.addAll()

There are certainly other ways, but I wanted to compare these five!

To compare these five solutions, there’s nothing better than JMH – The Java Microbenchmark Harness. This tool enables us to write relevant microbenchmarks, taking into account the internal characteristics of the JVM. It also offers integrated tools for analyzing the performance of our tests (profiling, disassembling, etc.). If you’re not familiar with JMH, please refer to this article: INTRODUCTION TO JMH – JAVA MICROBENCHMARK HARNESS.

The benchmark

To be able to compare different list sizes, I use @Param to parameterize the size of the lists to be concatenated. Here’s the setup I used:

@Param({"10", "1000", "10000"})
int size;

List<String> list1;
List<String> list2;

@Setup
public void setup() {
    list1 = new ArrayList<>(size);
    list2 = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
        list1.add("list1-" + i);
        list2.add("list2-" + i);
    }
}

Voici ensuite l'implémentation des cinq benchmarks

@Benchmark
public List<String> concatWithStreamConcat() {
    return Stream.concat(list1.stream(), list2.stream())
            .toList();
}

@Benchmark
public List<String> concatWithStreamOf() {
    return Stream.of(list1, list2)
            .flatMap(List::stream)
            .toList();
}

@Benchmark
public List<String> concatWithGuava() {
    return Lists.newArrayList(list1, list2)
            .stream()
            .flatMap(Collection::stream)
            .toList();
}

@Benchmark
public List<String> concatWithCommonsCollection() {
    return ListUtils.union(list1, list2);
}

@Benchmark
public List<String> concatWithNewArrayList() {
    List<String> newList = new ArrayList<>(list1.size() + list2.size());
    newList.addAll(list1);
    newList.addAll(list2);
    return newList;
}

The results

All tests were run on my laptop: Intel(R) Core(TM) i7-1260P 8 coeurs (16 avec hyperthreading) – Ubuntu 24.10.

The Java version used is openjdk version "21.0.5" 2024-10-15 LTS.

 Benchmark                               (size)  Mode  Cnt       Score        Error  Units
 ConcatList.concatWithCommonsCollection      10  avgt    5      57,706 ±     27,376  ns/op
 ConcatList.concatWithCommonsCollection    1000  avgt    5    1631,110 ±    472,323  ns/op
 ConcatList.concatWithCommonsCollection   10000  avgt    5   15434,192 ±   2164,191  ns/op
 ConcatList.concatWithGuava                  10  avgt    5     225,249 ±     83,069  ns/op
 ConcatList.concatWithGuava                1000  avgt    5    7967,360 ±   8693,234  ns/op
 ConcatList.concatWithGuava               10000  avgt    5  108601,706 ± 104439,847  ns/op
 ConcatList.concatWithNewArrayList           10  avgt    5      54,327 ±     10,894  ns/op
 ConcatList.concatWithNewArrayList         1000  avgt    5    1515,667 ±    595,135  ns/op
 ConcatList.concatWithNewArrayList        10000  avgt    5   14827,026 ±   1253,715  ns/op
 ConcatList.concatWithStreamConcat           10  avgt    5     111,837 ±     26,451  ns/op
 ConcatList.concatWithStreamConcat         1000  avgt    5    7104,415 ±   4030,866  ns/op
 ConcatList.concatWithStreamConcat        10000  avgt    5   57587,144 ±  13994,551  ns/op
 ConcatList.concatWithStreamOf               10  avgt    5     157,634 ±     55,757  ns/op
 ConcatList.concatWithStreamOf             1000  avgt    5    5915,797 ±   1692,437  ns/op
 ConcatList.concatWithStreamOf            10000  avgt    5   58156,102 ±   1655,090  ns/op

What's immediately obvious is that the version used today is the worst :(.
If we leave it aside, we can see that concatWithStreamOf and concatWithStreamConcat perform almost identically, as do concatWithNewArrayList and concatWithCommonsCollection. A look at the code of ListUtils.union(list1, list2) shows that it does exactly what concatWithNewArrayList does.

Further analysis

To take the analysis a step further, let's re-run the benchmark with JMH's built-in GC profiler via the command-line option -prof gc, which is often a good start ;). I'm particularly interested in the memory allocation rate.

ConcatList.concatWithCommonsCollection:·gc.alloc.rate.norm                   10  avgt    5     232,012 ±     0,002    B/op
ConcatList.concatWithCommonsCollection:·gc.alloc.rate.norm                 1000  avgt    5   16072,479 ±     0,036    B/op
ConcatList.concatWithCommonsCollection:·gc.alloc.rate.norm                10000  avgt    5  160077,107 ±     0,434    B/op
ConcatList.concatWithGuava:·gc.alloc.rate.norm                               10  avgt    5     888,058 ±     0,009    B/op
ConcatList.concatWithGuava:·gc.alloc.rate.norm                             1000  avgt    5   16969,104 ±     0,221    B/op
ConcatList.concatWithGuava:·gc.alloc.rate.norm                            10000  avgt    5  212148,835 ±     2,822    B/op
ConcatList.concatWithNewArrayList:·gc.alloc.rate.norm                        10  avgt    5     232,011 ±     0,003    B/op
ConcatList.concatWithNewArrayList:·gc.alloc.rate.norm                      1000  avgt    5   16072,511 ±     0,040    B/op
ConcatList.concatWithNewArrayList:·gc.alloc.rate.norm                     10000  avgt    5  160076,834 ±     0,366    B/op
ConcatList.concatWithStreamConcat:·gc.alloc.rate.norm                        10  avgt    5     424,023 ±     0,006    B/op
ConcatList.concatWithStreamConcat:·gc.alloc.rate.norm                      1000  avgt    5    8344,716 ±     0,173    B/op
ConcatList.concatWithStreamConcat:·gc.alloc.rate.norm                     10000  avgt    5   80350,980 ±     1,306    B/op
ConcatList.concatWithStreamOf:·gc.alloc.rate.norm                            10  avgt    5     824,037 ±     0,007    B/op
ConcatList.concatWithStreamOf:·gc.alloc.rate.norm                          1000  avgt    5   16905,076 ±     0,208    B/op
ConcatList.concatWithStreamOf:·gc.alloc.rate.norm                         10000  avgt    5  212083,533 ±     2,497    B/op

Here, we're interested in the value of gc.alloc.rate.norm, which is the normalized memory allocation rate per operation, and we can see a big difference right away.

If we take ConcatList.concatWithNewArrayList as our baseline:

  • concatWithCommonsCollection: same allocation.
  • concatWithGuava: 4x more allocation.
  • concatWithStreamConcat: 2x more d'allocation.
  • concatWithStreamOf: 4x more allocation.

This explains some of the differences in performance, but not all.

Let's run the test again with the stack profiler, a simple, naive profiler, integrated into JMH.

# Benchmark: fr.loicmathieu.jmh.ConcatList.concatWithCommonsCollection
# Parameters: (size = 10)

....[Thread state: RUNNABLE]........................................................................
 33,3%  50,0% $lt;stack is empty, everything is filtered?>
 31,6%  47,4% fr.loicmathieu.jmh.ConcatList.concatWithCommonsCollection
  1,5%   2,3% fr.loicmathieu.jmh.jmh_generated.ConcatList_concatWithCommonsCollection_jmhTest.concatWithCommonsCollection_avgt_jmhStub
  0,1%   0,1% java.util.ArrayList.<init>
  0,1%   0,1% java.util.Arrays.copyOf
  0,0%   0,0% org.apache.commons.collections4.ListUtils.union
  0,0%   0,0% java.lang.invoke.MethodHandle.maybeCustomize

# Benchmark: fr.loicmathieu.jmh.ConcatList.concatWithGuava
# Parameters: (size = 10)

....[Thread state: RUNNABLE]........................................................................
 33,3%  50,0% $lt;stack is empty, everything is filtered?>
 19,2%  28,8% java.util.stream.SpinedBuffer.ensureCapacity
  5,3%   8,0% java.util.stream.AbstractPipeline.wrapSink
  3,4%   5,1% java.util.stream.ReferencePipeline.toArray
  3,2%   4,8% java.util.stream.SpinedBuffer.copyInto
  2,1%   3,1% fr.loicmathieu.jmh.jmh_generated.ConcatList_concatWithGuava_jmhTest.concatWithGuava_avgt_jmhStub
  0,1%   0,1% java.util.stream.StreamSupport.stream
  0,0%   0,0% java.util.stream.ReferencePipeline.lambda$toArray$0
  0,0%   0,0% java.util.ArrayList.spliterator
  0,0%   0,0% java.util.stream.SpinedBuffer.inflateSpine

# Benchmark: fr.loicmathieu.jmh.ConcatList.concatWithNewArrayList
# Parameters: (size = 10)

....[Thread state: RUNNABLE]........................................................................
 33,3%  50,0% $lt;stack is empty, everything is filtered?>
 24,4%  36,6% java.util.ArrayList.addAll
  8,7%  13,1% fr.loicmathieu.jmh.jmh_generated.ConcatList_concatWithNewArrayList_jmhTest.concatWithNewArrayList_avgt_jmhStub
  0,1%   0,2% java.util.Arrays.copyOf
  0,0%   0,1% java.util.ArrayList.<init>
  0,0%   0,0% fr.loicmathieu.jmh.ConcatList.concatWithNewArrayList


# Benchmark: fr.loicmathieu.jmh.ConcatList.concatWithStreamConcat
# Parameters: (size = 10)

....[Thread state: RUNNABLE]........................................................................
 33,3%  50,0% $lt;stack is empty, everything is filtered?>
 22,5%  33,8% java.util.ArrayList$ArrayListSpliterator.forEachRemaining
  5,8%   8,7% fr.loicmathieu.jmh.ConcatList.concatWithStreamConcat
  3,7%   5,6% java.util.stream.ReferencePipeline.toArray
  1,1%   1,7% fr.loicmathieu.jmh.jmh_generated.ConcatList_concatWithStreamConcat_jmhTest.concatWithStreamConcat_avgt_jmhStub
  0,0%   0,1% java.util.ImmutableCollections.listFromTrustedArrayNullsAllowed
  0,0%   0,1% java.util.stream.ReferencePipeline.lambda$toArray$0
  0,0%   0,1% java.util.stream.StreamSupport.stream
  0,0%   0,0% java.util.ArrayList.spliterator



# Benchmark: fr.loicmathieu.jmh.ConcatList.concatWithStreamOf
# Parameters: (size = 10)

....[Thread state: RUNNABLE]........................................................................
 33,3%  50,0% $lt;stack is empty, everything is filtered?>
 16,9%  25,4% java.util.ArrayList$ArrayListSpliterator.forEachRemaining
  4,5%   6,8% java.util.stream.SpinedBuffer.copyInto
  4,5%   6,7% java.util.stream.AbstractPipeline.wrapSink
  3,4%   5,2% java.util.stream.SpinedBuffer.ensureCapacity
  3,4%   5,0% fr.loicmathieu.jmh.jmh_generated.ConcatList_concatWithStreamOf_jmhTest.concatWithStreamOf_avgt_jmhStub
  0,3%   0,5% java.util.stream.ReferencePipeline.toArray
  0,1%   0,2% java.util.stream.ReferencePipeline$Head.forEach
  0,0%   0,0% java.util.ArrayList.spliterator
  0,0%   0,0% java.util.stream.SpinedBuffer.inflateSpine

concatWithCommonsCollection and concatWithNewArrayList both use java.util.Arrays.copyOf, this method is annotated with @IntrinsicCandidate which means that there is an intrinsic that allows it to be optimized directly within the JVM. To my knowledge, even if the annotation contains Candidate in the name, within Oracle's OpenJDK JVM there is always at least one intrinsic for one of the supported architectures (which may not be the case for other JVM vendors). This intrinsic enables memory-efficient copying of arrays of list elements without the need to traverse the list and copy elements one by one.

concatWithStreamConcat and concatWithStreamOf both run throught all list elements via java.util.ArrayList$ArrayListSpliterator.forEachRemaining, which is less efficient.

concatWithGuava will also run through all the elements in the list, but in addition will create a list with the first two lists, doubling the number of times the elements are copied!

Conclusion

The simplest code is often the most efficient, even if it's not necessarily the most elegant.

The JVM contains a number of intrinsics that enable array copy operations in memory, which are essential when working with large data lists or performing frequent copy operations on them.

Of course, this optimization only makes sense if the code called is in the application's hot path. Here, concatenation is potentially performed thousands of times per second in Kestra, so it's important that it performs in the most efficient way.

As a reminder, this article contains numbers, which are just that: numbers. They can be very different from one context to another. It's always best to make your own benchmarks, in your own context, and then validate them in a real environment to deduce their relevance.

For the curious, here's the PR in the repo Kestra: https://github.com/kestra-io/kestra/pull/7160 and here is the complete benchmark: ConcatList.java.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.