Floating point numbers - IEEE-754

We all use floating point numbers daily, but the way they work is a mystery to many. Let’s take a closer look.

TL;DR

Floating point number representation (32-bit):

$$\begin{align}(-1)^{sign} \cdot (1 + fraction) \cdot 2^{exponent - bias}\end{align}$$

sign - stored in the first bit
exponent - next 8 bits
fraction - last 23 bits
bias - a constant equal to 127

sign	exponent								fraction
0	1	0	0	0	0	0	0	1	0	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

For 64-bit representation:
sign - 1 bit
exponent - 11 bits
fraction - 52 bits
bias - 1023
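
As a quick check of the 64-bit layout, here is a minimal sketch that splits a double into the three fields listed above. The class name DoubleFields and the method split are made up for illustration; Double.doubleToLongBits and Long.toBinaryString are standard Java calls.

```java
class DoubleFields {

    // Returns "sign exponent fraction" for the 64-bit pattern of d,
    // using the 1 / 11 / 52 bit split.
    static String split(double d) {
        String bits = Long.toBinaryString(Double.doubleToLongBits(d));
        // toBinaryString drops leading zeroes, so pad back to 64 bits
        bits = "0".repeat(64 - bits.length()) + bits;
        return bits.substring(0, 1) + " " + bits.substring(1, 12) + " " + bits.substring(12);
    }

    public static void main(String[] args) {
        System.out.println(split(5.75));
    }
}
```

For 5.75 this prints a sign of 0, an exponent of 10000000001 (2 + 1023 = 1025) and a fraction that starts with 0111 followed by zeroes.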

Full version:

Integer representation

Integer representation is easy, but in case you forgot, here is an example. The number 5 converted to binary is 101: 1 ⋅ 2² + 0 ⋅ 2¹ + 1 ⋅ 2⁰ = 5.

In case you forgot how to convert to binary, check out my article “From Binary to Hexadecimal Intuition”.

Hence, in memory the int32 number 5 looks like this:

0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	1
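
If you want to reproduce this bit pattern yourself, a small sketch like the one below should do. The class name IntBits and the method toBits32 are hypothetical helpers; the only library call is the standard Integer.toBinaryString.

```java
class IntBits {

    // Returns the 32-bit pattern of value, zero-padded on the left.
    static String toBits32(int value) {
        String bits = Integer.toBinaryString(value);
        // toBinaryString drops leading zeroes, so pad back to 32 bits
        return "0".repeat(32 - bits.length()) + bits;
    }

    public static void main(String[] args) {
        System.out.println(toBits32(5));
    }
}
```

For 5 this prints 29 zeroes followed by 101, matching the row above.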

As simple as it gets. But what if we wanted to represent 0.5?

Floating point numbers representation

Floating point number representation is slightly trickier, but not by much. It all boils down to standard scientific notation in binary. In case you need a reminder, check out my article “Scientific Notation”.

So, to represent 0.5, we could write: $$ \frac{5}{10}= 5 \cdot 10^{-1} = 0.5_{10} $$ Or in binary: $$ \frac{1}{2} = 1 \cdot 2^{-1} = 0.1_2 $$

How about 5.75? First we need to “split it” into powers of 2:

5.75 = 4 + 1 + 0.5 + 0.25

Now we can encode it in binary:

2² + 2⁰ + 2⁻¹ + 2⁻² → 101.11
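
The same split can be done mechanically: take the integer part as is, then keep doubling the remainder and peel off one binary digit at a time. Below is a minimal sketch of that idea; BinaryExpansion, toBinary and maxFractionDigits are names invented for this example, and it only handles non-negative values.

```java
class BinaryExpansion {

    // Converts a non-negative value to a binary string such as "101.11",
    // emitting at most maxFractionDigits digits after the point.
    static String toBinary(double value, int maxFractionDigits) {
        long whole = (long) value;
        double rest = value - whole;
        StringBuilder sb = new StringBuilder(Long.toBinaryString(whole));
        if (rest > 0) {
            sb.append('.');
        }
        for (int i = 0; i < maxFractionDigits && rest > 0; i++) {
            rest *= 2; // moves the next binary digit in front of the point
            if (rest >= 1) {
                sb.append('1');
                rest -= 1;
            } else {
                sb.append('0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(5.75, 8)); // 101.11
        System.out.println(toBinary(0.5, 8));  // 0.1
    }
}
```

The digit limit is there because many decimal fractions (0.1, for example) never terminate in binary.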

However, since we want to encode it into computer memory, we should use standard scientific notation, so:

101.11 ⋅ 2⁰ → 1.0111 ⋅ 2²

Now, let’s take a close look at the number above and figure out the shortest way to encode it. A couple of observations:

The first digit of a normalized binary number is always 1, unless the number is zero, in which case all digits are 0. This means we don’t need to encode the first digit;
The base 2 is constant, so there is no point in encoding it, hence we just need to store the exponent, just like in e-notation;
We also need to store a sign.

If we remove the constants: $$ \cancel{1}.0111 \cdot \cancel{2}^2 $$ All we need to store are the fraction (.0111) and the exponent 2 (10 in base 2).

Fraction VS Mantissa

Let’s talk about the fraction for a bit. Previously we used the term mantissa. The two terms are confusing and often used interchangeably, but here we are going to agree that mantissa = 1 + fraction.

Example:

1.0111 - mantissa

0111 - fraction

Fractions in binary

How do fractions work in binary? A more practical question would be: how do we convert binary fractions to decimal ones? Pretty much the same way you convert a binary integer to decimal. The only difference is direction: the powers of 2 become negative and decrease as you move right of the point.

$$ \begin{align} .0111_{2} = 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} + 1 \cdot 2^{-4} \to \end{align} $$ $$ 0 \cdot 1/2 + 1 \cdot 1/4 + 1 \cdot 1/8 + 1 \cdot 1/16 = \frac{4 + 2 + 1}{16} = .4375_{10} $$ Hence: $$ \begin{align} 1.0111_2 = 1.4375_{10} \end{align} $$

And just to make sure we haven’t made a mistake so far, we can check that 1.4375 ⋅ 2² = 1.4375 ⋅ 4 = 5.75. Looks like our representation works so far.
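
The conversion above is easy to sketch in code: walk the digits after the point and add 1/2, 1/4, 1/8 and so on for every 1. BinaryFraction and fractionValue are hypothetical names used only for this illustration.

```java
class BinaryFraction {

    // Sums 2^-(i+1) for every '1' at position i in the digits after the binary point.
    static double fractionValue(String digits) {
        double value = 0;
        double weight = 0.5; // the first digit after the point is worth 2^-1
        for (char c : digits.toCharArray()) {
            if (c == '1') {
                value += weight;
            }
            weight /= 2; // each next digit is worth half of the previous one
        }
        return value;
    }

    public static void main(String[] args) {
        double fraction = fractionValue("0111");
        System.out.println(fraction);           // 0.4375
        System.out.println((1 + fraction) * 4); // 5.75
    }
}
```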

Floating point numbers binary representation

For 32-bit numbers (single precision) we use:
1. The first bit denotes the sign: 0 means positive, 1 means negative
2. The next 8 bits are used for the exponent
3. The last 23 bits are used for the fraction

The general formula for encoding looks like this: $$ (-1)^{sign} \cdot (1 + fraction) \cdot 2^{exponent - bias} $$

The bias is needed so that we can represent negative exponents in an unsigned 8-bit field. If we wanted to represent an exponent of -10, we would encode it as -10 + 127 = 117. We have to remember the bias when encoding positive exponents as well. Thus, an exponent of 10 is encoded as 10 + 127 = 137.
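
To see the bias at work, here is a small sketch that computes the stored exponent both ways: by adding 127 by hand and by reading the 8 exponent bits of an actual float. ExponentBias, encodeExponent and storedExponent are names made up for this example; Float.floatToIntBits is the standard Java call.

```java
class ExponentBias {

    static final int BIAS = 127;

    // Encodes a real exponent the way it is stored in the exponent field.
    static int encodeExponent(int exponent) {
        return exponent + BIAS;
    }

    // Reads the 8 exponent bits (bits 23..30) straight from a float.
    static int storedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(encodeExponent(-10));   // 117
        System.out.println(encodeExponent(10));    // 137
        System.out.println(storedExponent(1024f)); // 137, since 1024 = 2^10
        System.out.println(storedExponent(5.75f)); // 129, since 5.75 = 1.0111 * 2^2
    }
}
```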

Let’s apply what we know so far to represent 5.75:
sign = 0
mantissa = 1.0111
fraction = 0111
exponent = 2 + 127 = 129₁₀ = 10000001₂

sign	exponent								fraction
0	1	0	0	0	0	0	0	1	0	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

And boom, just like that we have encoded a floating point number into IEEE 754 format!
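
Going the other way is a nice sanity check: pull the three fields out of the 32 bits and plug them into the formula. The sketch below does exactly that. FloatFields and decode are hypothetical names, and the code assumes a normal number (stored exponent between 1 and 254), so zero, subnormals, infinities and NaN are out of scope here.

```java
class FloatFields {

    // Applies (-1)^sign * (1 + fraction) * 2^(exponent - 127) to a 32-bit pattern.
    // Assumption: bits encode a normal number.
    static float decode(int bits) {
        int sign = bits >>> 31;                  // first bit
        int exponent = (bits >>> 23) & 0xFF;     // next 8 bits
        int fractionBits = bits & 0x7FFFFF;      // last 23 bits
        double fraction = fractionBits / (double) (1 << 23);
        double value = Math.pow(-1, sign) * (1 + fraction) * Math.pow(2, exponent - 127);
        return (float) value;
    }

    public static void main(String[] args) {
        int bits = Integer.parseUnsignedInt("01000000101110000000000000000000", 2);
        System.out.println(decode(bits)); // 5.75
    }
}
```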

To validate our result, you may use the following class:

public class FloatBinaryConverter {

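    public float fromBinary(String bits) {
        // parseUnsignedInt accepts every 32-bit pattern, including those with the sign bit set;
        // parseInt would throw NumberFormatException for negative numbers
        int bitsInt32 = Integer.parseUnsignedInt(bits, 2);
        return Float.intBitsToFloat(bitsInt32);
    }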

    public String toBinary(float f) {
        int bitsInt32 = Float.floatToIntBits(f);
        String bits = Integer.toBinaryString(bitsInt32);
        // toBinaryString drops leading zeroes, so pad the result back to 32 bits
        String prefix = "0".repeat(32 - bits.length());
        return prefix + bits;
    }

}

You may also use the following test:

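import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class FloatBinaryConverterTest {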

    FloatBinaryConverter converter;

    @BeforeEach
    void setUp() {
        converter = new FloatBinaryConverter();
    }

    @Test
    void fromBinary() {
        String bits = "01000000101110000000000000000000";
        float result = converter.fromBinary(bits);
        float expectedResult = 5.75f;
        assertEquals(expectedResult, result);
    }

    @Test
    void toBinary() {
        float number = 5.75f;
        String result = converter.toBinary(number);
        String expectedResult = "01000000101110000000000000000000";
        assertEquals(expectedResult, result);
    }

}

Or simply run:

public static void main(String[] args) {
    FloatBinaryConverter converter = new FloatBinaryConverter();
    float number = 5.75f;
    String bits = converter.toBinary(number);
    System.out.println(number + " -> " + bits);
    bits = "01000000101110000000000000000000";
    number = converter.fromBinary(bits);
    System.out.println(bits + " -> " + number);
}

Output:

5.75 -> 01000000101110000000000000000000
01000000101110000000000000000000 -> 5.75

$$ \begin{align} Fin \end{align}$$

20 Mar 2022 - Hasan Al-Ammori