Benchmark: conversion from long to byte[]
I’ve been using Kafka a lot lately, and in Kafka a lot of things are byte arrays, even headers! As I have many components that exchange messages, I added headers to help with message tracking, including a timestamp header whose value is System.currentTimeMillis().

So I had to convert a long into a byte array. In a very naive way, I coded this: String.valueOf(System.currentTimeMillis()).getBytes(). But instantiating a String every time a header is created did not seem very efficient to me!
Looking a bit further, Guava offers a solution based on bitwise operations via its Longs class, and Kafka offers one via its LongSerializer. You can also use a ByteBuffer to perform the conversion.
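To give an idea of what such a bitwise conversion looks like, here is a minimal sketch of a big-endian long-to-byte[] conversion. It is similar in spirit to what Longs.toByteArray() does, but it is not Guava’s actual source code:

```java
// Minimal sketch of a big-endian long -> byte[] conversion,
// similar in spirit to Longs.toByteArray(); not Guava's actual source.
public final class LongBytes {

    public static byte[] toByteArray(long value) {
        byte[] result = new byte[Long.BYTES];
        for (int i = Long.BYTES - 1; i >= 0; i--) {
            result[i] = (byte) (value & 0xFF); // keep the lowest 8 bits
            value >>>= 8;                      // shift the next byte into position
        }
        return result;
    }
}
```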
To compare these approaches, nothing beats JMH – the Java Microbenchmark Harness. This tool lets you write relevant micro-benchmarks that take the internal characteristics of the JVM into account. It also offers integrated tools to analyze the performance of our tests (profiling, disassembling, …). If you don’t know JMH, you can refer to this article: INTRODUCTION TO JMH – JAVA MICROBENCHMARK HARNESS.
The benchmark
First of all, I configured the benchmark with a thread-scoped state so that the setup is run for each thread. Among other things, I created a ByteBuffer per thread to compare implementations with and without reuse of the buffer.
```java
@State(Scope.Thread)
// other JMH annotations ...
public class LongToByteArray {

    private static final LongSerializer LONG_SERIALIZER = new LongSerializer();

    long timestamp;
    ByteBuffer perThreadBuffer;

    @Setup
    public void setup() {
        timestamp = System.currentTimeMillis();
        perThreadBuffer = ByteBuffer.allocate(Long.BYTES);
    }

    // benchmark methods
}
```
Then, I implemented a benchmark method for each way of converting a long to a byte[]. For ByteBuffer I implemented two variants: one that instantiates a new buffer on each conversion, and one that recycles the ByteBuffer created in the benchmark setup phase.
```java
@Benchmark
public byte[] testStringValueOf() {
    return String.valueOf(timestamp).getBytes();
}

@Benchmark
public byte[] testGuava() {
    return Longs.toByteArray(timestamp);
}

@Benchmark
public byte[] testKafkaSerde() {
    return LONG_SERIALIZER.serialize(null, timestamp);
}

@Benchmark
public byte[] testByteBuffer() {
    ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES);
    buffer.putLong(timestamp);
    return buffer.array();
}

@Benchmark
public byte[] testByteBuffer_reuse() {
    perThreadBuffer.putLong(timestamp);
    byte[] result = perThreadBuffer.array();
    perThreadBuffer.clear();
    return result;
}
```
The full benchmark is accessible here.
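For reference, a JMH benchmark like this one is typically launched through the jar produced by the JMH Maven plugin, or programmatically via the Runner API. Below is a minimal sketch of the programmatic way; the launcher class name and the fork/iteration settings are placeholders, not necessarily those used for the results below:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class LongToByteArrayLauncher {

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LongToByteArray.class.getSimpleName()) // run only this benchmark class
                .forks(1)                                       // placeholder settings
                .warmupIterations(5)
                .measurementIterations(5)
                .build();
        new Runner(options).run();
    }
}
```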
The results
All the tests were run on my laptop: Intel(R) Core(TM) i7-8750H, 6 cores (12 with hyper-threading), Ubuntu 19.10. The Java version used was openjdk version "11.0.7" 2020-04-14.
```
Benchmark                             Mode  Cnt   Score   Error  Units
LongToByteArray.testByteBuffer        avgt    5   4,429 ± 0,204  ns/op
LongToByteArray.testByteBuffer_reuse  avgt    5   5,655 ± 0,793  ns/op
LongToByteArray.testGuava             avgt    5   6,422 ± 0,428  ns/op
LongToByteArray.testKafkaSerde        avgt    5   9,103 ± 1,515  ns/op
LongToByteArray.testStringValueOf     avgt    5  39,660 ± 4,372  ns/op
```
First observation: my intuition was good, instantiating a String for each conversion is very bad, 4 to 10 times slower than all the other implementations. When we look at the result of the conversion, we understand why. By using a String we no longer convert a 64-bit number but a character string where each character (each digit of the number) is encoded on one byte. So we are not comparing exactly the same thing: the conversion via a String gives an array of 13 bytes, while a long can be encoded in 8 bytes, which is what the conversions via Guava, Kafka or a ByteBuffer give us.
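A quick way to see this is to print the size of the arrays produced by the two kinds of conversion (a small check, assuming a 13-digit timestamp as System.currentTimeMillis() currently returns):

```java
import com.google.common.primitives.Longs;

public class ArraySizeCheck {

    public static void main(String[] args) {
        long timestamp = System.currentTimeMillis();
        // One byte per decimal digit: 13 bytes for a current timestamp
        System.out.println(String.valueOf(timestamp).getBytes().length);
        // Fixed-size binary encoding: always Long.BYTES = 8 bytes
        System.out.println(Longs.toByteArray(timestamp).length);
    }
}
```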
Surprisingly, Kafka, which is known for its performance, has a slower implementation than Guava’s or the one based on a ByteBuffer.
The results obtained via ByteBuffer are also surprising: instantiating a new ByteBuffer for each conversion is more efficient than reusing an existing one (which requires clearing the buffer).
A slightly more detailed analysis
Let’s put aside the implementation via a String and try to better understand the differences between the other implementations. For this I will use the profiling capabilities of JMH via the -prof option.
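On the command line this is just an extra flag, but profilers can also be attached programmatically. Here is a sketch based on the launcher shown earlier (the class name is a placeholder; the perf profiler only works on Linux with the perf tool installed):

```java
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ProfiledLauncher {

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LongToByteArray.class.getSimpleName())
                .addProfiler(GCProfiler.class) // same as -prof gc: allocation rate per benchmark
                .addProfiler("perf")           // same as -prof perf: Linux perf counters
                .build();
        new Runner(options).run();
    }
}
```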
If we profile the memory allocations via -prof gc, we get the following results:
```
LongToByteArray.testByteBuffer                       avgt    5     4,492 ±   0,708   ns/op
LongToByteArray.testByteBuffer:·gc.alloc.rate        avgt    5  4635,903 ± 712,889  MB/sec
LongToByteArray.testByteBuffer_reuse                 avgt    5     5,798 ±   1,139   ns/op
LongToByteArray.testByteBuffer_reuse:·gc.alloc.rate  avgt    5    ≈ 10⁻⁴             MB/sec
LongToByteArray.testGuava                            avgt    5     6,939 ±   0,899   ns/op
LongToByteArray.testGuava:·gc.alloc.rate             avgt    5  3000,818 ± 376,613  MB/sec
LongToByteArray.testKafkaSerde                       avgt    5     9,317 ±   0,842   ns/op
LongToByteArray.testKafkaSerde:·gc.alloc.rate        avgt    5  4467,791 ± 405,897  MB/sec
```
We can clearly see the advantage of reusing the ByteBuffer: there is no memory allocation at all, whereas creating a new buffer for each conversion allocates more than 4 GB/s of memory! On the other hand, the allocation rates of the three other implementations are in the same range, so this does not give us much more information.
Now let’s try to profile the CPU with -prof perf, which uses the Linux perf tool to profile the application. The results are not easy to interpret (you can see them here); a few observations:
- Reusing a ByteBuffer seems to involve many more CPU branches; maybe this is the cause of the performance difference.
- The Kafka implementation seems to involve more CPU branches than Guava’s, despite executing fewer instructions. Because of these branches, fewer instructions can be executed per CPU cycle, which is probably why the Guava implementation is more efficient.
Finally, out of curiosity, I looked at the code of HeapByteBuffer.putLong(), which is the implementation used via ByteBuffer since I don’t do any direct allocation. It relies on the Unsafe.putLongUnaligned() method. Unsafe is known for its high-performance implementations (but should not be used by everyone); here, this method is annotated with @HotSpotIntrinsicCandidate, which means that an intrinsic may exist for it and could explain the difference in performance with the other implementations. An intrinsic can be seen as a piece of native code, optimized for your OS / CPU architecture, which the JVM will substitute for the Java implementation of the method under certain conditions.
Conclusion
Be careful what you measure: the implementation via a String does not produce the same byte array as the others, and it is much less efficient.
Reusing a ByteBuffer is not always the best solution, as the cost of recycling can be significant. Allocations are not very expensive within the JVM, and sometimes it is better to allocate than to execute more instructions.
Follow the force, read the code ;)
Although JMH is a great tool, it requires technical skills and a lot of time to fully analyze its results. Even if the observed differences are not fully explained, I’m still happy with my little experiment ;)