# Floating point numbers - IEEE-754

We all use floating point numbers daily, but the way they work is a mystery to many. Let’s take a closer look.

## TL;DR

Floating point number representation (32-bit):

$$\begin{align}(-1)^{sign} \cdot (1 + fraction) \cdot 2^{exponent - bias}\end{align}$$

*sign* - is stored in the first bit

*exponent* - next 8 bits

*fraction* - last 23 bits

*bias* - a constant equal to 127

sign | exponent | fraction |
---|---|---|
0 | 10000001 | 01110000000000000000000 |

For 64-bit representation:

*sign* - 1 bit

*exponent* - 11 bits

*fraction* - 52 bits

*bias* - 1023

## Full version:

## Integer representation

Integer number representation is easy, but in case you forgot, here
is an example. Number 5 converted to binary would be 101: 1 ⋅ 2^{2} + 0 ⋅ 2^{1} + 1 ⋅ 2^{0}

If you need a refresher on converting to binary, check out my article “From Binary to Hexadecimal Intuition”.

Hence, in memory int32 number 5 would look like:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
---|
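To reproduce this bit pattern yourself, here is a minimal Python sketch. The helper name `to_bits32` is made up for this illustration, and it only handles non-negative values (negative int32 values use two's complement, which this article does not cover).

```python
def to_bits32(n: int) -> str:
    """Return the 32-bit binary string of a non-negative integer, grouped by byte."""
    if not 0 <= n < 2 ** 32:
        raise ValueError("expected a value that fits in 32 unsigned bits")
    bits = format(n, "032b")  # zero-padded to 32 binary digits
    return " ".join(bits[i:i + 8] for i in range(0, 32, 8))


print(to_bits32(5))  # 00000000 00000000 00000000 00000101
```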

As simple as it gets. But what if we wanted to represent 0.5?

## Floating point numbers representation

Floating point number representation is slightly trickier, but not
by much. It all boils down to **standard scientific
notation** in binary. In case you need a reminder, check out my
article “Scientific Notation”.

So, to represent 0.5, we could write:

$$ \frac{5}{10} = 5 \cdot 10^{-1} = 0.5 $$

Or in binary:

$$ \frac{1}{2} = 1 \cdot 2^{-1} = 0.1_{2} $$

How about 5.75? First we need to “split it” into powers of 2:

5.75 = 4 + 1 + 0.5 + 0.25

Now we can encode it in binary:

2^{2} + 2^{0} + 2^{−1} + 2^{−2} → 101.11

However, since we want to encode it into computer memory, we should
use **standard scientific notation**, so:

101.11 ⋅ 2^{0} → 1.0111 ⋅ 2^{2}
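The two steps above (splitting into powers of 2, then shifting the binary point) can be sketched in Python. This is only an illustration under stated assumptions: `to_binary` and `normalize` are hypothetical helper names, the input is non-negative, and `normalize` assumes the value is at least 1 so that the integer part starts with a 1.

```python
def to_binary(x: float, max_fraction_bits: int = 23) -> str:
    """Convert a non-negative number to a binary string such as '101.11'."""
    whole = int(x)
    frac = x - whole
    digits = []
    # Repeatedly double the fractional part; each carry into the
    # integer position is the next binary digit after the point.
    while frac > 0 and len(digits) < max_fraction_bits:
        frac *= 2
        bit = int(frac)
        digits.append(str(bit))
        frac -= bit
    integer_part = format(whole, "b")
    return integer_part + ("." + "".join(digits) if digits else "")


def normalize(binary: str) -> tuple:
    """Move the binary point right after the leading 1 (assumes value >= 1)."""
    integer_part, _, fraction_part = binary.partition(".")
    exponent = len(integer_part) - 1  # how many places the point moved left
    mantissa = integer_part[0] + "." + integer_part[1:] + fraction_part
    return mantissa, exponent


print(to_binary(5.75))      # 101.11
print(normalize("101.11"))  # ('1.0111', 2)
```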

Now, let’s take a close look at the number above and figure out the shortest way to encode it. A couple of observations:

- The first digit of a normalized binary number is always 1, unless the number is zero, in which case all digits are 0. This means we don’t need to encode the first digit;
- The base 2 is constant, so there is no point in encoding it; we just need to store the exponent, just like in e-notation;
- We also need to store the sign.

If we remove the constants:

$$ \cancel{1}.0111 \cdot \cancel{2}^2 $$

Besides the sign, all we need to store are the fraction (.0111) and the exponent 2 (10 in base 2).

## Fraction VS Mantissa

Let’s talk about the fraction for a bit. Previously we were using the term
mantissa. These terms are confusing and often used interchangeably, but
we are going to agree that *1 + fraction = mantissa.*

Example:

1.0111 - mantissa

0111 - fraction

## Fractions in binary

How do fractions work in binary? A more reasonable question would be:
**how do we convert binary fractions to decimal ones?** Pretty
much the same way you convert a binary integer to decimal, really. The
only difference is the direction: the powers of 2 are negative and decrease as we move to the right of the binary point.

$$ \begin{align} .0111_{2} = 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} + 1 \cdot 2^{-4} \to \end{align} $$

$$ 0 \cdot 1/2 + 1 \cdot 1/4 + 1 \cdot 1/8 + 1 \cdot 1/16 = \frac{4 + 2 + 1}{16} = .4375_{10} $$

Hence:

$$ \begin{align} 1.0111_2 = 1.4375_{10} \end{align} $$

And just to make sure we haven’t made a mistake so far, we can check
that 1.4375 ⋅ 2^{2} = 1.4375 ⋅ 4 = 5.75.
Looks like our representation works so far.
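The same check can be done in a few lines of Python. This is a small sketch; the function name `fraction_to_decimal` is made up for the example and takes only the digits after the binary point.

```python
def fraction_to_decimal(bits: str) -> float:
    """Sum bit * 2^-position for the digits after the binary point."""
    return sum(int(b) * 2 ** -i for i, b in enumerate(bits, start=1))


fraction = fraction_to_decimal("0111")
print(fraction)                 # 0.4375
print((1 + fraction) * 2 ** 2)  # 5.75
```

All the values involved are sums of powers of 2, so the results here are exact rather than rounded.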

## Floating point numbers binary representation

For 32-bit numbers (single precision) we use:

1. The first bit denotes the sign: 0 means positive, 1 means negative
2. The next 8 bits are used for the exponent
3. The last 23 bits are used for the fraction

The general formula for encoding looks like this:

$$ (-1)^{sign} \cdot (1 + fraction) \cdot 2^{exponent - bias} $$

*bias* is needed so that we can represent negative exponents. If we
wanted to represent an exponent of *-10*, we would encode it as
*-10 + 127 = 117*. We have to remember the bias when encoding
positive exponents as well. Thus, an exponent of *10* should be encoded as
*10 + 127 = 137*.
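The bias arithmetic is easy to sketch in Python. The names `BIAS` and `encode_exponent` are hypothetical, and the range check reflects the fact that in single precision the stored values 0 and 255 are reserved (for zero/subnormals and for infinity/NaN), a detail this article does not go into.

```python
BIAS = 127  # single-precision bias


def encode_exponent(exponent: int) -> str:
    """Return the 8-bit stored form of an exponent for a normal number."""
    stored = exponent + BIAS
    # Stored values 0 and 255 are reserved, so normal numbers use 1..254.
    if not 1 <= stored <= 254:
        raise ValueError("exponent out of range for a normal 32-bit float")
    return format(stored, "08b")


print(encode_exponent(-10))  # 01110101 (117)
print(encode_exponent(10))   # 10001001 (137)
```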

Let’s apply what we know so far to represent 5.75:

*sign = 0*

*mantissa = 1.0111*

*fraction = 0111*

*exponent = 2 + 127 = 129 (decimal) = 10000001 (binary)*

sign | exponent | fraction |
---|---|---|
0 | 10000001 | 01110000000000000000000 |

And boom, just like that we have encoded a floating point number into IEEE 754 format!
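As a quick sanity check of the bit pattern, here is a minimal Python sketch that relies on the standard `struct` module to obtain the machine's own single-precision encoding and then decodes it again with the formula from this article. The helper names `float_to_fields` and `fields_to_float` are made up for this illustration, and the decoder only handles normal numbers (no zero, subnormals, infinity or NaN).

```python
import struct


def float_to_fields(x: float):
    """Split the IEEE 754 single-precision encoding of x into (sign, exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


def fields_to_float(sign: str, exponent: str, fraction: str) -> float:
    """Apply (-1)^sign * (1 + fraction) * 2^(exponent - 127) to the three bit strings."""
    frac = int(fraction, 2) / 2 ** 23  # 23 fraction bits scaled into [0, 1)
    return (-1) ** int(sign) * (1 + frac) * 2.0 ** (int(exponent, 2) - 127)


sign, exponent, fraction = float_to_fields(5.75)
print(sign, exponent, fraction)  # 0 10000001 01110000000000000000000
print(fields_to_float(sign, exponent, fraction))  # 5.75
```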

To validate our result, you may use the following class:

You may also use the following test:

Or simply run:

Output:

$$ \begin{align} Fin \end{align}$$

20 Mar 2022 - Hasan Al-Ammori