Sat. Jan 21st, 2023

Use of binary to represent negative and floating point numbers

Representing negative numbers

Binary values which can only be positive are referred to as unsigned: no sign (positive or negative) is required.

Signed binary numbers can be either positive or negative. The signed representation used here is known as ‘2s complement’.

In a signed number, the MSB (Most Significant Bit) is assigned a negative value. For example, if using 8 bits, the column values become:

-128  64    32    16    8     4     2     1

An unsigned 8-bit value can be any of the 256 values from 0 to 255. A signed 8-bit value can be any of the 256 values from -128 to +127. Note that the number of values that can be represented does not change; by changing the MSB from 128 to -128, the range of values is altered.
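
The ranges quoted above follow directly from the column values. As a rough sketch (the function names here are illustrative, not from any standard library), the limits for any number of bits could be worked out like this:

```python
def unsigned_range(bits):
    """Smallest and largest values an unsigned field of `bits` bits can hold."""
    return 0, 2**bits - 1


def signed_range(bits):
    """Smallest and largest values a 2s complement field of `bits` bits can hold.

    The MSB column is worth -(2**(bits - 1)); every other column is positive.
    """
    return -(2**(bits - 1)), 2**(bits - 1) - 1


print(unsigned_range(8))  # (0, 255)
print(signed_range(8))    # (-128, 127)
```

In both cases there are 2**bits distinct values; only the position of the range changes.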

Working out negative binary values

If you need to represent a negative value in 2s complement binary, you should follow these steps:

  1. Work out the binary representation for the positive value
  2. Flip the bits (all 0s become 1s and vice versa)
  3. Add 1

So, to represent -83 in binary, we do this:

Calculate 83 in binary:

128   64    32    16    8     4     2     1
0     1     0     1     0     0     1     1

Now, we flip the bits:

          128   64    32    16    8     4     2     1
Before:   0     1     0     1     0     0     1     1
After:    1     0     1     0     1     1     0     0

And finally, we add ‘1’, and remember to make the MSB a negative value:

          -128  64    32    16    8     4     2     1
Value     1     0     1     0     1     1     0     0
Add 1     0     0     0     0     0     0     0     1
(Carry)
Result    1     0     1     0     1     1     0     1

We can check this: using the result row, we get: -128 + 32 + 8 + 4 + 1, which is -83.
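
The three steps, and the check that follows them, can also be tried out in code. This is only a minimal sketch: the helper names `to_twos_complement` and `from_twos_complement` are made up for illustration, and the functions work on strings of 0s and 1s so that each step stays visible.

```python
def to_twos_complement(value, bits=8):
    """Return the 2s complement bit string for a negative integer `value`."""
    if not -(2**(bits - 1)) <= value < 0:
        raise ValueError("value must be negative and fit in the given bits")
    # Step 1: binary representation of the positive value
    positive = format(-value, f"0{bits}b")
    # Step 2: flip the bits
    flipped = "".join("1" if b == "0" else "0" for b in positive)
    # Step 3: add 1 (discarding any carry out of the MSB)
    result = (int(flipped, 2) + 1) % 2**bits
    return format(result, f"0{bits}b")


def from_twos_complement(bit_string):
    """Add up the column values, treating the MSB column as negative."""
    bits = len(bit_string)
    total = 0
    for position, bit in enumerate(bit_string):
        weight = 2**(bits - 1 - position)
        if position == 0:
            weight = -weight  # the MSB has a negative column value
        total += weight * int(bit)
    return total


print(to_twos_complement(-83))           # 10101101
print(from_twos_complement("10101101"))  # -83
```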

Representing fixed point values

The column values in binary are all powers of 2. In the grid above, from right to left, the columns are 2^0, 2^1, 2^2, 2^3 and so on. There is an implied binary point to the right of the 1s column. This is no different to saying in denary that ‘24’ is the same as ‘24.’: both technically have a decimal point after the 4, but as it’s superfluous in this context, it is left out.

If we continued beyond the right-most column, the pattern of powers of two continues, and we have 2^-1, 2^-2, 2^-3, 2^-4 and so on. Another way of looking at this is that as we move from right to left, the column value doubles. Therefore, moving from left to right, the column value halves.

This means that the following is perfectly legitimate:

8     4     2     1     1/2   1/4   1/8   1/16
1     0     0     1     0     0     1     1

There is an implied binary point between the ‘1’ and the ‘1/2’. To work out the value, as before, simply add the column values where there is a ‘1’:

8 + 1 + 1/8 + 1/16 = 9 and 3/16

The process to work these out is exactly the same as for integer binary, and so is not covered here.

Negative values are calculated in the same way: the MSB becomes a negative value (-8 in this example).

COMMON ERROR ALERT! If you have -8 and you add 1/4 to this, the answer is -7 and 3/4, NOT -8 and 1/4!!
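
To check fixed point answers such as these, the column values can be added up programmatically. The sketch below is illustrative only: `fixed_point_value` is a hypothetical helper, and it assumes the bit pattern is written as a string with an explicit point. It uses Python’s `fractions` module so that the halves, quarters and eighths stay exact.

```python
from fractions import Fraction


def fixed_point_value(bit_string, signed=False):
    """Add up the column values of a fixed point pattern such as '1001.0011'.

    If `signed` is True, the MSB column is treated as negative (2s complement).
    """
    whole, _, fraction = bit_string.partition(".")
    total = Fraction(0)
    # Columns to the left of the point: ..., 8, 4, 2, 1
    for position, bit in enumerate(whole):
        weight = Fraction(2) ** (len(whole) - 1 - position)
        if signed and position == 0:
            weight = -weight
        total += weight * int(bit)
    # Columns to the right of the point: 1/2, 1/4, 1/8, ...
    for position, bit in enumerate(fraction, start=1):
        total += Fraction(1, 2**position) * int(bit)
    return total


print(fixed_point_value("1001.0011"))               # 147/16, i.e. 9 and 3/16
print(fixed_point_value("1000.0100", signed=True))  # -31/4, i.e. -7 and 3/4
```

The second call is the ‘common error’ case: -8 plus 1/4 comes out as -7 and 3/4.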

The problem with fixed point binary numbers is that you are pre-determining the range and precision of the number that can be stored. For this reason, it is common to use floating point values.

Floating point values

A floating point value does not have a fixed binary point – as the name implies. This means that it is possible to make a choice about the range and precision of numbers when they are stored.

A floating point value is represented as a mantissa and an exponent; in base 10, it might look like this:

Avogadro’s constant (approximately): 6 * 10^23

The ‘6’ is the mantissa, and the exponent is ‘23’. Because we are using base 10, the exponent is applied as a power of 10, giving ‘10^23’.

When representing a number in floating point binary, you need to know how many bits are reserved for the mantissa and the exponent. For example, if you have 12 bits in total, then you may use 7 bits for the mantissa, and 5 for the exponent. Of course, you could also use 10 bits for the mantissa and 2 for the exponent – but we will look at the implications of this later.

Important points

Before going into detail, the following concepts are important to understand:

  • Normalised – we represent floating point using a normalised system. This means we don’t waste space in the mantissa (which would reduce the precision of the value). For example, you would expect to use 1.55 * 10^5 instead of 0.00155 * 10^8. Although as written these two are identical, what would happen if you were only allowed to use 3 digits for the mantissa? The second value would effectively become zero (0.00 * 10^8). In binary, any positive normalised number will begin with 0.1, and any negative normalised number will begin with 1.0 (in other words, the first two bits of the mantissa always differ).
  • As we are working in binary, the exponent represents * 2^n instead of * 10^n
  • Expect both the mantissa and the exponent to be represented using 2s complement (i.e. both the mantissa and exponent are signed values, with their respective MSBs being negative)
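
The normalisation rule in the first point (positive mantissas begin 0.1, negative ones begin 1.0) amounts to saying that the first two bits are different. A small sketch of that check, using a made-up function name and taking the mantissa as a string of bits without the point:

```python
def is_normalised(mantissa_bits):
    """True if a 2s complement mantissa starts with 01 (positive) or 10 (negative)."""
    return len(mantissa_bits) >= 2 and mantissa_bits[0] != mantissa_bits[1]


print(is_normalised("0101000"))  # True:  0.101000 begins with 0.1
print(is_normalised("1000111"))  # True:  1.000111 begins with 1.0
print(is_normalised("0010100"))  # False: 0.010100 wastes a bit after the point
```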

Converting floating point binary to denary

Here is a worked example showing the steps required to convert a floating point value from binary to denary.

“Convert the value 010100000111 into denary, where the value is represented with a 7-bit mantissa and a 5-bit exponent”

Separate out the two components

Separate out the two sections and add a binary point to the mantissa, so that you have:

Mantissa = 0.101000
Exponent = 00111

Process the exponent

The exponent is a 5-bit 2s complement value, and therefore its value can be calculated as follows:

-16   8     4     2     1
0     0     1     1     1
Note that the MSB is negative (2s complement)

This gives us an exponent of +7.

Apply the exponent to the mantissa

This means moving the binary point. A positive exponent means the point moves to the right (making the result larger), whereas a negative exponent means the point moves to the left, resulting in a smaller value.

In our example the exponent is +7, so the point is moved 7 places to the right. If there are not enough bits, we just add 0s as we go. The same is true when moving in the opposite direction with the following caveat:

If the exponent is negative and you run out of bits when moving the point to the left, pad with copies of the mantissa’s first bit: pad with 0s if the mantissa begins with a 0, and with 1s if it begins with a 1.

Applying this to the mantissa, we get:

        128   64    32    16    8     4     2     1     .     1/2   1/4   1/8   1/16  1/32  1/64
Before                                            0     .     1     0     1     0     0     0
After   0     1     0     1     0     0     0     0     .
Note the right-most 0 on the ‘After’ row: this is padding added, because there were no additional bits available when moving the point 7 places to the right (there were only six bits after the point)

This tells us that the value represented was 80.

We can double check: The original mantissa was 1/2 + 1/8 = 5/8. The exponent was 7; this means 5/8 * 2^7, which is (dividing top and bottom by 8) 5 * 2^4 = 5 * 16 = 80.
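
The whole conversion can be sketched in a few lines of code. This is not a standard routine: `decode_float` and `twos_complement_int` are illustrative names, and the sketch assumes the layout used in this article (mantissa first, then exponent, both in 2s complement, with the binary point after the mantissa’s first bit).

```python
from fractions import Fraction


def twos_complement_int(bit_string):
    """Read a bit string as a signed (2s complement) integer."""
    value = int(bit_string, 2)
    if bit_string[0] == "1":
        value -= 2**len(bit_string)  # the MSB column is negative
    return value


def decode_float(bit_string, mantissa_bits, exponent_bits):
    """Convert a floating point bit pattern to an exact denary value."""
    if len(bit_string) != mantissa_bits + exponent_bits:
        raise ValueError("bit string length does not match the stated split")
    # Separate out the two components
    mantissa_part = bit_string[:mantissa_bits]
    exponent_part = bit_string[mantissa_bits:]
    # The point sits after the first mantissa bit, so scale by 2^(mantissa_bits - 1)
    mantissa = Fraction(twos_complement_int(mantissa_part), 2**(mantissa_bits - 1))
    exponent = twos_complement_int(exponent_part)
    # Applying the exponent moves the point, i.e. multiplies by 2^exponent
    return mantissa * Fraction(2) ** exponent


print(decode_float("010100000111", 7, 5))  # 80
```

Multiplying by 2^exponent is the same operation as moving the binary point, so the padding rules described above are handled automatically.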

Converting denary to floating point binary

Let’s assume you are tasked with representing -14.125 using floating point binary, with 7 bits used for the mantissa, and 5 bits used for the exponent.

Work out 14.125 in fixed point binary

8     4     2     1     .     1/2   1/4   1/8
1     1     1     0     .     0     0     1

Convert the value into a negative value (two’s complement)

You will need to add an additional bit to this (as it stands, the MSB would become -8, which is not large enough to represent -14). Note that when performing 2s complement on a fixed point value, the ‘add 1’ is taken to mean ‘add 1 at the position of the LSB’, which in this example is 1/8. So, adding an extra bit and performing the steps (flip bits, add 1), gives us:

            -16   8     4     2     1     .     1/2   1/4   1/8
Before      0     1     1     1     0     .     0     0     1
Flip bits   1     0     0     0     1     .     1     1     0
Add ‘1’     0     0     0     0     0     .     0     0     1
(Carry)
Result      1     0     0     0     1     .     1     1     1

As always, at this point we should double check our working: -16 + 1 is -15. -15 + 7/8 is -14 and 1/8, so it looks good.
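
The ‘add 1 at the position of the LSB’ rule is easy to get wrong by hand, so here is a small sketch of it. The helper name `negate_fixed_point` is hypothetical, and it assumes the pattern is given as a string that already includes the extra leading bit.

```python
def negate_fixed_point(bit_string):
    """Apply 2s complement (flip bits, add 1 at the LSB) to a pattern like '01110.001'."""
    point = bit_string.index(".")
    digits = bit_string.replace(".", "")
    # Flip the bits
    flipped = "".join("1" if b == "0" else "0" for b in digits)
    # Add 1 at the least significant bit, discarding any carry out of the MSB
    total = (int(flipped, 2) + 1) % 2**len(digits)
    result = format(total, f"0{len(digits)}b")
    # Put the binary point back where it was
    return result[:point] + "." + result[point:]


print(negate_fixed_point("01110.001"))  # 10001.111
```

Removing the point, treating the digits as a whole number and then restoring the point works because the LSB of that whole number is the 1/8 column.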

Calculate the exponent

In order to be a normalised number, the mantissa needs to be written as 1.0001111 rather than 10001.111 – the exponent will tell us how to move the point.

Looking at the above, we need the point to move 4 places to the right when we convert back to denary, which means the exponent is +4.

Using our 5 bits for the exponent, and 2s complement, that makes the exponent:

-16   8     4     2     1
0     0     1     0     0

Combining the mantissa and exponent

Now we have a normalised mantissa and an exponent, we can write out our answer – this is the first 7 bits of the mantissa, and the 5 bits of the exponent. The answer is:

1.000111 00100

Did you spot the error?

Always double-check your work. So, let’s make sure what we just wrote is actually correct:

Mantissa: 1.000111
Exponent: 00100

So, move the point four places to the right…

Result: 10001.11

Being a signed value, we can now check its value:

-16   8     4     2     1     .     1/2   1/4
1     0     0     0     1     .     1     1

This is: -16 + 1 = -15 (so far so good…), plus 1/2 + 1/4 = 3/4, giving -14.25

This is not the original value; what has gone wrong?

Precision error

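We have encountered a precision error. The normalised mantissa, 1.0001111, needs 8 bits, but only 7 are available, so its final bit (worth 1/8 once the exponent is applied) has been lost. There are not enough bits available in the mantissa to accurately represent the initial value if the 12 bits are split such that 7 are assigned to the mantissa and 5 are assigned to the exponent.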
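
The effect can be reproduced with a short sketch of the encoding process. The function `encode_float` is illustrative only. It assumes that mantissa bits which do not fit are simply cut off, as in the worked example above (other schemes might round instead), and it returns both the bit pattern and the value that pattern actually stores.

```python
import math
from fractions import Fraction


def encode_float(value, mantissa_bits, exponent_bits):
    """Return (bit_string, stored_value) for a normalised 2s complement float."""
    value = Fraction(value)
    if value == 0:
        raise ValueError("zero has no normalised form")
    mantissa, exponent = value, 0
    # Normalise: positive mantissas lie in [1/2, 1), negative ones in [-1, -1/2)
    while mantissa >= 1 or mantissa < -1:
        mantissa /= 2
        exponent += 1
    while Fraction(-1, 2) <= mantissa < Fraction(1, 2):
        mantissa *= 2
        exponent -= 1
    if not -(2**(exponent_bits - 1)) <= exponent <= 2**(exponent_bits - 1) - 1:
        raise OverflowError("exponent does not fit in the available bits")
    # Keep only the bits that fit; dropping the rest of a 2s complement
    # pattern is the same as rounding down (floor)
    scale = 2**(mantissa_bits - 1)
    kept = math.floor(mantissa * scale)
    bits = format(kept % 2**mantissa_bits, f"0{mantissa_bits}b")
    bits += format(exponent % 2**exponent_bits, f"0{exponent_bits}b")
    stored = Fraction(kept, scale) * Fraction(2) ** exponent
    return bits, stored


bits, stored = encode_float(Fraction("-14.125"), 7, 5)
print(bits, float(stored))  # 100011100100 -14.25
```

The stored value differs from the requested -14.125 by exactly the 1/8 carried by the dropped mantissa bit.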

Distributing bits between mantissa and exponent

With the example above, we were not able to accurately represent the original value of -14.125, because there were not enough bits available in the mantissa.

A general rule is:

  • The more bits that are assigned to the mantissa, the greater the precision, but the smaller the range
  • The more bits that are assigned to the exponent, the greater the range, but the lower the precision

To see this in action, what would happen if we tried the above using 8 bits for the mantissa and only 4 for the exponent?

We would need to calculate our exponent using only 4 bits now:

-8    4     2     1
0     1     0     0

Using the mantissa from before, which was 1.0001111, putting the mantissa and exponent together now gives us:

1.0001111 0100

If you work this out, you will see that it does indeed represent -14.125 – the precision has increased, allowing us to represent the 1/8 correctly. However, the highest and lowest numbers that could be represented are now smaller, as the exponent can range from -8 to +7 instead of -16 to +15.
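
To put numbers on the trade-off, the sketch below compares the two ways of splitting 12 bits. The function name `value_limits` is made up for this example, and it assumes the same 2s complement mantissa and exponent layout used throughout this article.

```python
from fractions import Fraction


def value_limits(mantissa_bits, exponent_bits):
    """Return (most_negative, largest_positive) for a given split of bits."""
    max_exponent = 2**(exponent_bits - 1) - 1
    scale = 2**(mantissa_bits - 1)
    # Largest mantissa is 0.111...1; most negative mantissa is 1.000...0 (-1)
    largest_mantissa = Fraction(scale - 1, scale)
    most_negative = Fraction(-1) * 2**max_exponent
    largest_positive = largest_mantissa * 2**max_exponent
    return most_negative, largest_positive


print(value_limits(7, 5))  # from -32768 up to 32256
print(value_limits(8, 4))  # from -128 up to 127
```

Moving a single bit from the exponent to the mantissa bought the extra 1/8 of precision needed here, but shrank the largest representable value from 32256 to 127.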

Note that you would need to know how many bits are being assigned to both the mantissa and the exponent in order to perform these conversions.