Primitives VS Boxed types performance
We have already looked at performance implications depending on data types: integer VS float and int32 VS int64. In Java and some other languages you have primitive and Boxed types available. Let’s see if choosing between these types has noticable performance implications.
Boxed (reference) types - primitive data types wrapped in an object. Values of such types are accessed through a pointer.
As usual, let’s start with numbers. Comparing basic arithmetic operations performance between int and Integer.
:
AdditionError Units
Benchmark Mode Cnt Score .sumBoxed avgt 10 2374.809 ± 99.090 us/op
Int.sumPrimitive avgt 10 219.264 ± 0.150 us/op
Int
:
SubtractionError Units
Benchmark Mode Cnt Score .minusBoxed avgt 10 2257.685 ± 6.309 us/op
Int.minusPrimitive avgt 10 219.102 ± 0.404 us/op
Int
:
MultiplicationError Units
Benchmark Mode Cnt Score .multiplyBoxed avgt 10 1296.610 ± 15.609 us/op
Int.multiplyPrimitive avgt 10 642.457 ± 1.045 us/op
Int
:
DivisionError Units
Benchmark Mode Cnt Score .divideBoxed avgt 10 2057.563 ± 3.316 us/op
Int.dividePrimitive avgt 10 1957.143 ± 1.380 us/op Int
Benchmarks made with JMH. Code below.
CPU: AMD Ryzen 9 5900x
OS: Ubuntu 21.10 64-bit
JVM: OpenJDK 17 64-bit
JMH: version 1.34
Score: how long does it take to execute an operation on 1 mln values
As we can see, there is drastic performance difference for add and subtract operations, which result in 10 times performance penalty for using boxed types.
There is also significant difference when performing multiplication, resulting in x2 runtime increase.
When performing division there is no significant performance difference.
Why does using boxed data types negatively affects performance?
There are number of reasons for that.
1. Stack VS heap
Primitive types are located on stack, while boxed types live in heap. Stack memory is way faster than heap to access.
2. Data size
Boxing overhead is significant. While regular int takes only 4 bytes of space (32-bits) Integer type takes 16 bytes (128-bits).
3. Fetch time & cache misses
When CPU fetches data from memory it never fetches just one value. Instead it fetches what is called a cache line. The size of cache line is usually 64 Bytes, at least for Intel and AMD. However, there are some CPUs that have 128-bit (IBM PowerPC) and even 256-bit. What that means, is that CPU fetches an entire block from RAM to its cache and doesn’t have to do expensive round trips to RAM as long as it has required data in the cache.
In our test we were performing a set of operations on an array of ints and Integers. This is very common scenario in real life, you would probably perform operations on collections more often than on individual variables. In this case it played a significant role, because our variables were co-allocated in RAM, so each fetch cycle retrieved anywhere from 1 to 16 variables.
Contrary to that, array of Integers is an array of pointers. Although the pointers were fetched from the RAM, on each access CPU has to do one more trip to the RAM to fetch actual data. There are optimization techniques, that modern CPUs use to alleviate this issue and we will talk about it in the future. Stay tuned.
Note: since native collections as of today operate over boxed types, you may want to check out one of many libraries offering primitive collections. FastUtils is a good option.
Other types
To complete this test, let’s also take a look at other types arithmetic operations performance difference:
Long:
:
AdditionError Units
Benchmark Mode Cnt Score Long.sumBoxed avgt 10 3638.636 ± 13.105 us/op
Long.sumPrimitive avgt 10 222.600 ± 1.021 us/op
:
SubtractionError Units
Benchmark Mode Cnt Score Long.minusBoxed avgt 10 3643.976 ± 27.786 us/op
Long.minusPrimitive avgt 10 224.145 ± 1.520 us/op
:
MultiplicationError Units
Benchmark Mode Cnt Score Long.multiplyBoxed avgt 10 1912.212 ± 55.481 us/op
Long.multiplyPrimitive avgt 10 649.564 ± 1.241 us/op
:
DivisionError Units
Benchmark Mode Cnt Score Long.divideBoxed avgt 10 2855.358 ± 28.921 us/op
Long.dividePrimitive avgt 10 1943.306 ± 1.871 us/op
Float:
:
AdditionError Units
Benchmark Mode Cnt Score Float.sumBoxed avgt 10 2217.923 ± 8.242 us/op
Float.sumPrimitive avgt 10 653.798 ± 3.041 us/op
:
SubtractionError Units
Benchmark Mode Cnt Score Float.minusBoxed avgt 10 2233.400 ± 25.552 us/op
Float.minusPrimitive avgt 10 646.899 ± 2.525 us/op
:
MultiplicationError Units
Benchmark Mode Cnt Score Float.multiplyBoxed avgt 10 2364.269 ± 70.336 us/op
Float.multiplyPrimitive avgt 10 646.119 ± 0.852 us/op
:
DivisionError Units
Benchmark Mode Cnt Score Float.divideBoxed avgt 10 2792.920 ± 200.376 us/op
Float.dividePrimitive avgt 10 2254.131 ± 13.576 us/op
Double:
:
AdditionError Units
Benchmark Mode Cnt Score Double.sumBoxed avgt 10 3743.926 ± 48.198 us/op
Double.sumPrimitive avgt 10 650.057 ± 0.874 us/op
:
SubtractionError Units
Benchmark Mode Cnt Score Double.minusBoxed avgt 10 3962.240 ± 24.792 us/op
Double.minusPrimitive avgt 10 647.330 ± 0.971 us/op
:
MultiplicationError Units
Benchmark Mode Cnt Score Double.multiplyBoxed avgt 10 3686.790 ± 37.230 us/op
Double.multiplyPrimitive avgt 10 678.465 ± 55.601 us/op
:
DivisionError Units
Benchmark Mode Cnt Score Double.divideBoxed avgt 10 4329.624 ± 106.395 us/op
Double.dividePrimitive avgt 10 2914.918 ± 29.782 us/op
Code:
Sources and further reading:
x86 latency, throughput, and
port usage information
Intel®
Xeon® Processor Scalable Family Technical Overview
Intel®
Advisor User Guide: cachesim-cacheline-size
Java Object Layout
(JOL)
JOL
Samples
Java
Objects Inside Out by Aleksey Shipilev
10 Apr 2022 - Hasan Al-Ammori