Floating point numbers - IEEE-754
We all use floating point numbers daily, but the way they work is a mystery to many. Let’s take a closer look.
TL;DR
Floating point number representation (32-bit):
$$\begin{align}(-1)^{sign} \cdot (1 + fraction) \cdot 2^{exponent - bias}\end{align}$$
sign - stored in the first bit
exponent - stored in the next 8 bits
fraction - stored in the last 23 bits
bias - a constant equal to 127
Example: 5.75 as a 32-bit float

sign | exponent | fraction
---|---|---
0 | 10000001 | 01110000000000000000000

For comparison, here is the integer 5 as a plain 32-bit binary number:

00000000 00000000 00000000 00000101
As simple as it gets. But what if we wanted to represent 0.5?
Floating point numbers representation
Floating point number representation is slightly trickier, but not by much. It all boils down to standard scientific notation in binary. In case you need a reminder, check out my article Scientific Notation.
So, to represent 0.5, we could write: $$ \frac{5}{10} = 5 \cdot 10^{-1} = 0.5_{10} $$ Or in binary: $$ \frac{1}{2} = 1 \cdot 2^{-1} = 0.1_{2} $$
How about 5.75? First we need to “split it” into powers of 2:
5.75 = 4 + 1 + 0.5 + 0.25
Now we can encode it in binary:
$$ 2^{2} + 2^{0} + 2^{-1} + 2^{-2} \to 101.11_{2} $$
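The split into powers of 2 can also be done mechanically. Below is a minimal sketch, not a library routine: the function name `to_binary` and the `max_fraction_bits` cut-off are assumptions made for illustration. It converts the integer part directly and then doubles the remainder over and over, reading off one binary digit per step.

```python
def to_binary(value: float, max_fraction_bits: int = 16) -> str:
    """Return a non-negative number as a binary string such as '101.11'.

    Hypothetical helper for illustration; it stops after max_fraction_bits
    digits because many decimal fractions never terminate in binary.
    """
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")

    integer_part = int(value)
    remainder = value - integer_part

    digits = []
    # Doubling shifts the binary point one place to the right;
    # whatever lands in the ones place is the next fraction digit.
    while remainder > 0 and len(digits) < max_fraction_bits:
        remainder *= 2
        bit = int(remainder)
        digits.append(str(bit))
        remainder -= bit

    integer_bits = format(integer_part, "b")
    return integer_bits + ("." + "".join(digits) if digits else "")


print(to_binary(5.75))  # 101.11
print(to_binary(0.5))   # 0.1
```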
However, since we want to encode it into computer memory, we should use standard scientific notation, so:
$$ 101.11 \cdot 2^{0} \to 1.0111 \cdot 2^{2} $$
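The same normalization can be checked in Python. This is a small sketch; the name `normalize` is an assumption, while `math.frexp` is the standard library call that splits a float into a mantissa in the range [0.5, 1) and a power-of-2 exponent. Shifting that result by one position gives the form with a single 1 before the point. The mantissa is printed in decimal, where 1.0111 in binary equals 1.4375.

```python
import math


def normalize(value: float):
    """Return (mantissa, exponent) with 1 <= mantissa < 2
    and value == mantissa * 2**exponent (positive values only)."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    # frexp gives m in [0.5, 1) and e such that value == m * 2**e
    m, e = math.frexp(value)
    # Move the binary point one place to the right: m * 2 is in [1, 2)
    return m * 2, e - 1


print(normalize(5.75))  # (1.4375, 2)
print(normalize(0.5))   # (1.0, -1)
```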
Now, let’s take a close look at the number above and figure out the shortest way to encode it. A couple of observations:
- The first digit of a normalized binary number is always 1, unless the number is zero, in which case all digits are 0. Meaning - we don’t need to encode the first digit;
- The base 2 is constant, so there is no point in encoding it, hence we just need to store the exponent, just like in e-notation;
- We also need to store the sign;
If we remove the constants: $$ \cancel{1}.0111 \cdot \cancel{2}^2 $$ all we need to store, besides the sign, is the fraction (.0111) and the exponent 2 (10 in base 2).
Fraction VS Mantissa
Let’s talk about the fraction for a bit. Previously we used the term mantissa. The two terms are often used interchangeably, which is confusing, so we are going to agree that mantissa = 1 + fraction.
Example:
1.0111 - mantissa
0111 - fraction
Fractions in binary
How do fractions work in binary? A more practical question would be: how do we convert binary fractions to decimal ones? Pretty much the same way you convert a binary integer to decimal. The only difference is the direction: to the right of the point, the powers of 2 become negative.
$$ \begin{align} .0111_{2} = 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} + 1 \cdot 2^{-4} \to \end{align} $$ $$ 0 \cdot 1/2 + 1 \cdot 1/4 + 1 \cdot 1/8 + 1 \cdot 1/16 = \frac{4 + 2 + 1}{16} = .4375_{10} $$ Hence: $$ \begin{align} 1.0111_2 = 1.4375_{10} \end{align} $$
And just to make sure we haven’t made a mistake so far, we can check that $$ 1.4375 \cdot 2^{2} = 1.4375 \cdot 4 = 5.75 $$ Looks like our representation works so far.
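The conversion above is easy to turn into a few lines of Python. This is only an illustrative sketch; the function name `binary_fraction_to_decimal` is made up for this example. It walks over the digits after the binary point and adds the matching negative power of 2 for every digit.

```python
def binary_fraction_to_decimal(bits: str) -> float:
    """Convert the digits after the binary point, e.g. '0111', to decimal."""
    total = 0.0
    # The first digit after the point has weight 2**-1, the second 2**-2, ...
    for position, bit in enumerate(bits, start=1):
        if bit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {bit!r}")
        total += int(bit) * 2 ** -position
    return total


fraction = binary_fraction_to_decimal("0111")
print(fraction)                 # 0.4375
print((1 + fraction) * 2 ** 2)  # 5.75
```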
Floating point numbers binary representation
For 32-bit numbers (single precision) we use:
1. The first bit denotes the sign: 0 means positive, 1 means negative
2. The next 8 bits are used for the exponent
3. The last 23 bits are used for the fraction
General formula for encoding looks like this: $$ (-1)^{sign} \cdot (1 + fraction) \cdot 2^{exponent - bias} $$
The bias is needed so we can represent negative exponents. If we wanted to represent an exponent of -10, we would encode it as -10 + 127 = 117. We have to remember the bias when encoding positive exponents as well. Thus, an exponent of 10 should be encoded as 10 + 127 = 137.
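Here is the bias arithmetic as a short sketch. The names `BIAS` and `encode_exponent` are assumptions for illustration. The range check reflects that in IEEE-754 single precision the all-zeros and all-ones exponent patterns are reserved, so ordinary numbers use the stored values 1 to 254.

```python
BIAS = 127  # bias for 32-bit (single precision) floats


def encode_exponent(exponent: int) -> str:
    """Return the 8-bit stored form of an exponent (hypothetical helper)."""
    biased = exponent + BIAS
    # Stored values 0 and 255 are reserved in IEEE-754,
    # so this sketch only accepts 1..254.
    if not 1 <= biased <= 254:
        raise ValueError(f"exponent {exponent} is out of range for this sketch")
    return format(biased, "08b")


print(encode_exponent(-10))  # 01110101  (117)
print(encode_exponent(10))   # 10001001  (137)
print(encode_exponent(2))    # 10000001  (129)
```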
Let’s apply what we know so far to represent 5.75:
sign = 0
mantissa = 1.0111
fraction = 0111
exponent = 2 + 127 = 129 (base 10) = 10000001 (base 2)
sign | exponent | fraction
---|---|---
0 | 10000001 | 01110000000000000000000
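We can verify this layout against what the machine actually stores. The sketch below uses Python's standard `struct` module to pack 5.75 as a big-endian single precision float and read the same four bytes back as an integer. The helper names `float32_fields` and `decode_float32` are made up for this example, and the decoder only covers ordinary (normalized) numbers, which is all the formula above describes.

```python
import struct

BIAS = 127


def float32_fields(value: float):
    """Split a number into the (sign, exponent, fraction) bit strings
    of its 32-bit float representation."""
    # '>f' packs as big-endian single precision;
    # '>I' reinterprets the same 4 bytes as an unsigned integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


def decode_float32(sign: str, exponent: str, fraction: str) -> float:
    """Apply (-1)^sign * (1 + fraction) * 2^(exponent - bias)."""
    fraction_value = int(fraction, 2) / 2 ** 23  # 23 fraction bits
    return (-1) ** int(sign) * (1 + fraction_value) * 2 ** (int(exponent, 2) - BIAS)


sign, exponent, fraction = float32_fields(5.75)
print(sign, exponent, fraction)
# 0 10000001 01110000000000000000000
print(decode_float32(sign, exponent, fraction))  # 5.75
```

The printed fields match the table, and feeding them back through the formula returns 5.75.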