Primitives VS Boxed types performance

We have already looked at performance implications depending on data types: integer VS float and int32 VS int64. In Java and some other languages you have primitive and Boxed types available. Let’s see if choosing between these types has noticable performance implications.

Boxed (reference) types - primitive data types wrapped in an object. Values of such types are accessed through a pointer.

As usual, let’s start with numbers. Comparing basic arithmetic operations performance between int and Integer.

Addition:
Benchmark              Mode  Cnt     Score    Error  Units
Int.sumBoxed           avgt   10  2374.809 ± 99.090  us/op
Int.sumPrimitive       avgt   10   219.264 ±  0.150  us/op

Subtraction:
Benchmark              Mode  Cnt     Score    Error  Units
Int.minusBoxed         avgt   10  2257.685 ±  6.309  us/op
Int.minusPrimitive     avgt   10   219.102 ±  0.404  us/op

Multiplication:
Benchmark              Mode  Cnt     Score    Error  Units
Int.multiplyBoxed      avgt   10  1296.610 ± 15.609  us/op
Int.multiplyPrimitive  avgt   10   642.457 ±  1.045  us/op

Division:
Benchmark              Mode  Cnt     Score    Error  Units
Int.divideBoxed        avgt   10  2057.563 ±  3.316  us/op
Int.dividePrimitive    avgt   10  1957.143 ±  1.380  us/op

Benchmarks made with JMH. Code below.
CPU: AMD Ryzen 9 5900x
OS: Ubuntu 21.10 64-bit
JVM: OpenJDK 17 64-bit
JMH: version 1.34
Score: how long does it take to execute an operation on 1 mln values

Int VS Integer

As we can see, there is drastic performance difference for add and subtract operations, which result in 10 times performance penalty for using boxed types.

There is also significant difference when performing multiplication, resulting in x2 runtime increase.

When performing division there is no significant performance difference.

Why does using boxed data types negatively affects performance?

There are number of reasons for that.

1. Stack VS heap

Primitive types are located on stack, while boxed types live in heap. Stack memory is way faster than heap to access.

2. Data size

Boxing overhead is significant. While regular int takes only 4 bytes of space (32-bits) Integer type takes 16 bytes (128-bits).

3. Fetch time & cache misses

When CPU fetches data from memory it never fetches just one value. Instead it fetches what is called a cache line. The size of cache line is usually 64 Bytes, at least for Intel and AMD. However, there are some CPUs that have 128-bit (IBM PowerPC) and even 256-bit. What that means, is that CPU fetches an entire block from RAM to its cache and doesn’t have to do expensive round trips to RAM as long as it has required data in the cache.

In our test we were performing a set of operations on an array of ints and Integers. This is very common scenario in real life, you would probably perform operations on collections more often than on individual variables. In this case it played a significant role, because our variables were co-allocated in RAM, so each fetch cycle retrieved anywhere from 1 to 16 variables.

Contrary to that, array of Integers is an array of pointers. Although the pointers were fetched from the RAM, on each access CPU has to do one more trip to the RAM to fetch actual data. There are optimization techniques, that modern CPUs use to alleviate this issue and we will talk about it in the future. Stay tuned.

Note: since native collections as of today operate over boxed types, you may want to check out one of many libraries offering primitive collections. FastUtils is a good option.

Other types

To complete this test, let’s also take a look at other types arithmetic operations performance difference:

Long:

Addition:
Benchmark               Mode  Cnt     Score    Error  Units
Long.sumBoxed           avgt   10  3638.636 ± 13.105  us/op
Long.sumPrimitive       avgt   10   222.600 ±  1.021  us/op

Subtraction:
Benchmark               Mode  Cnt     Score    Error  Units
Long.minusBoxed         avgt   10  3643.976 ± 27.786  us/op
Long.minusPrimitive     avgt   10   224.145 ±  1.520  us/op

Multiplication:
Benchmark               Mode  Cnt     Score    Error  Units
Long.multiplyBoxed      avgt   10  1912.212 ± 55.481  us/op
Long.multiplyPrimitive  avgt   10   649.564 ±  1.241  us/op

Division:
Benchmark               Mode  Cnt     Score    Error  Units
Long.divideBoxed        avgt   10  2855.358 ± 28.921  us/op
Long.dividePrimitive    avgt   10  1943.306 ±  1.871  us/op

Float:

Addition:
Benchmark                Mode  Cnt     Score     Error  Units
Float.sumBoxed           avgt   10  2217.923 ±   8.242  us/op
Float.sumPrimitive       avgt   10   653.798 ±   3.041  us/op

Subtraction:
Benchmark                Mode  Cnt     Score     Error  Units
Float.minusBoxed         avgt   10  2233.400 ±  25.552  us/op
Float.minusPrimitive     avgt   10   646.899 ±   2.525  us/op

Multiplication:
Benchmark                Mode  Cnt     Score     Error  Units
Float.multiplyBoxed      avgt   10  2364.269 ±  70.336  us/op
Float.multiplyPrimitive  avgt   10   646.119 ±   0.852  us/op

Division:
Benchmark                Mode  Cnt     Score     Error  Units
Float.divideBoxed        avgt   10  2792.920 ± 200.376  us/op
Float.dividePrimitive    avgt   10  2254.131 ±  13.576  us/op

Double:

Addition:
Benchmark                 Mode  Cnt     Score     Error  Units
Double.sumBoxed           avgt   10  3743.926 ±  48.198  us/op
Double.sumPrimitive       avgt   10   650.057 ±   0.874  us/op

Subtraction:
Benchmark                 Mode  Cnt     Score     Error  Units
Double.minusBoxed         avgt   10  3962.240 ±  24.792  us/op
Double.minusPrimitive     avgt   10   647.330 ±   0.971  us/op

Multiplication:
Benchmark                 Mode  Cnt     Score     Error  Units
Double.multiplyBoxed      avgt   10  3686.790 ±  37.230  us/op
Double.multiplyPrimitive  avgt   10   678.465 ±  55.601  us/op

Division:
Benchmark                 Mode  Cnt     Score     Error  Units
Double.divideBoxed        avgt   10  4329.624 ± 106.395  us/op
Double.dividePrimitive    avgt   10  2914.918 ±  29.782  us/op

Code:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

public class IntPrimitiveVsBoxed {

    @State(Scope.Thread)
    public static class ExecutionPlan {

        int len = 1_000_000;

        int[] primitiveArr = new int[len];
        Integer[] boxedArr = new Integer[len];

        Random rnd = new Random();

        @Setup(Level.Trial)
        public void setUp() {
            for (int i = 0; i < len; i++) {
                int val = rnd.nextInt();
                // do not generate 0 to ensure no division by zero
                if (val == 0) val = 1;
                primitiveArr[i] = val;
                boxedArr[i] = val;
            }
        }
    }

    //<editor-fold desc="Division">
    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int dividePrimitive(ExecutionPlan plan) {
        int result = Integer.MAX_VALUE;
        for (int i = 0; i < plan.len; i++){
            result /= plan.primitiveArr[i];
        }
        return result;
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public Integer divideBoxed(ExecutionPlan plan) {
        Integer result = Integer.MAX_VALUE;
        for (int i = 0; i < plan.len; i++){
            result /= plan.boxedArr[i];
        }
        return result;
    }
    //</editor-fold>

    //<editor-fold desc="Multiply">
    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int multiplyPrimitive(ExecutionPlan plan) {
        int result = 1;
        for (int i = 0; i < plan.len; i++){
            result *= plan.primitiveArr[i];
        }
        return result;
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public Integer multiplyBoxed(ExecutionPlan plan) {
        Integer result = 1;
        for (int i = 0; i < plan.len; i++){
            result *= plan.boxedArr[i];
        }
        return result;
    }
    //</editor-fold>

    //<editor-fold desc="Minus">
    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int minusPrimitive(ExecutionPlan plan) {
        int result = 0;
        for (int i = 0; i < plan.len; i++){
            result -= plan.primitiveArr[i];
        }
        return result;
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public Integer minusBoxed(ExecutionPlan plan) {
        Integer result = 0;
        for (int i = 0; i < plan.len; i++){
            result -= plan.boxedArr[i];
        }
        return result;
    }
    //</editor-fold>

    //<editor-fold desc="Sum">
    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int sumPrimitive(ExecutionPlan plan) {
        int result = 0;
        for (int i = 0; i < plan.len; i++){
            result += plan.primitiveArr[i];
        }
        return result;
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public Integer sumBoxed(ExecutionPlan plan) {
        Integer result = 0;
        for (int i = 0; i < plan.len; i++){
            result += plan.boxedArr[i];
        }
        return result;
    }
    //</editor-fold>

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(IntPrimitiveVsBoxed.class.getSimpleName())
                .warmupIterations(1)
                .measurementIterations(10)
                .threads(2)
                .forks(1)
                .verbosity(VerboseMode.EXTRA)
                .build();

        new Runner(opt).run();
    }
    
}

Sources and further reading:

x86 latency, throughput, and port usage information
Intel® Xeon® Processor Scalable Family Technical Overview
Intel® Advisor User Guide: cachesim-cacheline-size
Java Object Layout (JOL)
JOL Samples
Java Objects Inside Out by Aleksey Shipilev

10 Apr 2022 - Hasan Al-Ammori