http://www.perlmonks.org?node_id=407094

in reply to Re: Converting decimals to binary
in thread Converting decimals to binary

How does multiplying by 4294967296 not include 1 but multiplying by 2147483648 does?

In response to the first reply, if I just multiply the decimal by the "accuracy" per say then take the integer portion am I losing more accuracy, than the fixed point answer provided by number 2? Are these two solutions the same?

Replies are listed 'Best First'.
Fixed Point Numbers
by ikegami (Pope) on Nov 11, 2004 at 18:12 UTC

```I don't know of any sources, but here's some background.

Let's use smaller numbers to demonstrate. Let's say we have 4 bit numb
+ers.

+---+---+---+---+
|   |   |   |   |
+---+---+---+---+

As a little refresher, let's forget decimal numbers for a second.

+---+---+---+---+
| 0 | 1 | 1 | 1 |  7 = 0*2^3 + 1*2^2 + 1*2^1 + 0*2^0
+---+---+---+---+

I'm going to call that B4, since 4 bits are used by the integer portio
+n of the number.

+---+---+---+---+
| 0 | 1 | 1 | 1 |  7 B4
+---+---+---+---+

So how would you store a decimal number? Let's take 1.75, for example.

+---+---+---+---+   +   +
| 0 | 0 | 0 | 1 | 1   1    1.75 = 1*2^0 + 1*2^(-1) + 1*2^(-2)
+---+---+---+---+   +   +

Unfortunately, I can't just add bits to my register.
What if we shifted the bits?

+---+---+---+---+
| 0 | 1 | 1 | 1 |  1.75 * 2^2 = [ 1*2^0 + 1*2^(-1) + 1*2^(-2) ] * 2
+^2
+---+---+---+---+

Another way of phrasing that is in terms of number of integer bits lef
+t.

+---+---+---+---+
| 0 | 1 | 1 | 1 |  1.75 B2 = [ 1*2^0 + 1*2^(-1) + 1*2^(-2) ] * 2^(4
+-2)
+---+---+---+---+

In other words, using smaller B-scalings increases precision:
The following show the precision for a given scaling.

+---+---+---+---+
| 0 | 0 | 0 | 1 |  [ 1*2^0    ] * 2^(4-4) = 1 B4
+---+---+---+---+

+---+---+---+---+
| 0 | 0 | 0 | 1 |  [ 1*2^(-1) ] * 2^(4-3) = 0.5 B3
+---+---+---+---+

+---+---+---+---+
| 0 | 0 | 0 | 1 |  [ 1*2^(-2) ] * 2^(4-2) = 0.25 B2
+---+---+---+---+

+---+---+---+---+
| 0 | 0 | 0 | 1 |  [ 1*2^(-3) ] * 2^(4-1) = 0.125 B1
+---+---+---+---+

+---+---+---+---+
| 0 | 0 | 0 | 1 |  [ 1*2^(-4) ] * 2^(4-0) = 0.0625 B0
+---+---+---+---+

Nothing says we have to stop at 0.

+---+---+---+---+
| 0 | 0 | 0 | 1 |  [ 1*2^(-5) ] * 2^(4--1) = 0.03125 B-1
+---+---+---+---+

The downside is that biggest number we can represent goes down
as the precision increases.

+---+---+---+---+
| 1 | 1 | 1 | 1 |  [ 1*2^3    + 1*2^2    + 1*2^1    + 1*2^0    ] *
+2^(4-4) = 15 B4
+---+---+---+---+

+---+---+---+---+
| 1 | 1 | 1 | 1 |  [ 1*2^2    + 1*2^1    + 1*2^0    + 1*2^(-1) ] *
+2^(4-3) = 7.5 B3
+---+---+---+---+

+---+---+---+---+
| 1 | 1 | 1 | 1 |  [ 1*2^1    + 1*2^0    + 1*2^(-1) + 1*2^(-2) ] *
+2^(4-2) = 3.75 B2
+---+---+---+---+

+---+---+---+---+
| 1 | 1 | 1 | 1 |  [ 1*2^0    + 1*2^(-1) + 1*2^(-2) + 1*2^(-3) ] *
+2^(4-1) = 1.875 B1
+---+---+---+---+

+---+---+---+---+
| 1 | 1 | 1 | 1 |  [ 1*2^(-1) + 1*2^(-2) + 1*2^(-3) + 1*2^(-4) ] *
+2^(4-0) = 0.9375 B0
+---+---+---+---+

+---+---+---+---+
| 1 | 1 | 1 | 1 |  [ 1*2^(-2) + 1*2^(-3) + 1*2^(-4) + 1*2^(-5) ] *
+2^(4--1) = 0.46875 B-1
+---+---+---+---+

We can go the other way, if need be.

+---+---+---+---+
| 1 | 1 | 1 | 1 |  [ 1*2^4    + 1*2^3    + 1*2^2    + 1*2^1    ] *
+2^(4-5) = 30 B5
+---+---+---+---+

+---+---+---+---+
| 0 | 0 | 0 | 1 |  [ 1*2^1    ] * 2^(4-5) = 2 B5
+---+---+---+---+

<blockquote><i>How does multiplying by 4294967296 not include 1 but mu
+ltiplying by 2147483648 does?</i></blockquote>

Multiplying by 4294967296 (2^32) gives 32 bits for the decimal portion
+,
leaving no bits (32-32) for the integer portion. In other words, B0.
1 does not fit into no bits.

+   +---+-----
1 | 0 | ...         1 B0
+   +---+-----

Multiplying by 2147483648 (2^31) gives 31 bits for the decimal portion
+,
leaving 1 bit (32-31) for the integer portion. In other words, B1.
1 fits in one bit.

+---+---+-----
| 1 | 0 | ...     1 B1
+---+---+-----

So what's the advantage over floating-point numbers?
Floating-point numbers does this the same way, and it does it for you.
However, in order to do that, it must save the scaling (called "expone
+nt")
along with the number. That means floats require more memory to store
+than these.

So what good is any of this anyway?
You can use integer arithmetic on these numbers!

X Bi + Y Bi = (X+Y) Bi
X Bi - Y Bi = (X-Y) Bi
X Bi * Y Bj = (X*Y) B(i+j)
X Bi / Y Bj = (X/Y) B(i-j)
X Bi >> j   = X B(i+j)
X Bi << j   = X B(i-j)

You can also compare numbers of the same scaling.

X Bi <  Y Bi
X Bi >  Y Bi
X Bi == Y Bi
etc.

A = ...    # B1
B = ...    # B1
C = A + B  # POTENTIAL OVERFLOW!
# 1 B1 + 1 B1 will overflow, for example

There are two ways of avoiding overflow.
You can make sure in advance that the numbers won't cause an overflow,
+ or
you can switch to a different B-scaling (at the cost of a bit of preci
+sion).

A = ...     # B1
A = A >> 1  # B2
B = ...     # B1
B = B >> 1  # B2
C = A + B   # B2

We do lots of works with values in [0.0..1.0]. When we need to calcula
+te by
how much a valve should be open, we calculate it using values in [0.0.
+.1.0].
Later on, we convert them to the right value to send to the Analog Out
+put.
It sounds like you want to deal with numbers in this same range, so I
+thought
the following would be very useful to you:

A = ...          # B1  Must be between 0.0 and 1.0.
B = ...          # B1  Must be between 0.0 and 1.0.
C = A * B        # B2
C = C << 1       # B1  Safe, since 1.0 * 1.0 = 1.0.

Although we've only dealt unsigned numbers, everything
I've said so far applies to signed 2s complement numbers.