Arithmetic Operations
Overview
When developing a dynamic system using floating-point arithmetic, you generally don't have to worry about numerical limitations since floating-point data types have high precision and range. Conversely, when working with fixed-point arithmetic, you must consider these factors when developing dynamic systems:
- Overflow - Adding two sufficiently large negative or positive values can produce a result that does not fit into the representation. This will have an adverse effect on the control system.
- Quantization - Fixed-point values are rounded. Therefore, the output signal to the plant and the input signal to the control system do not have the same characteristics as the ideal discrete-time signal.
- Computational noise - The accumulated errors that result from the rounding of individual terms within the realization introduce noise into the control signal.
- Limit cycles - In the ideal system, the output of a stable transfer function (digital filter) approaches some constant for a constant input. With quantization, limit cycles occur where the output oscillates between two values in steady state.
Recommendations for Arithmetic
This section describes the relationship between arithmetic operations and fixed-point scaling, and some basic recommendations that may be appropriate for your fixed-point design. For each arithmetic operation:
- The general slope/bias encoding scheme described in Scaling is used.
- The scaling of the result is automatically selected based on the scaling of the two inputs. In other words, the scaling is inherited.
- Scaling choices are based on minimizing the number of arithmetic operations of the result and maximizing the precision of the result.
- Radix point-only scaling is presented as a special case of the general encoding scheme.
In embedded systems, the scaling of variables at the hardware interface (the ADC or DAC) is fixed. However for most other variables, the scaling is something you can choose to give the best design. When scaling fixed-point variables, it is important to remember that:
- Your scaling choices depend on the particular design you are simulating.
- All scaling choices have associated advantages and disadvantages.
Addition
Consider the addition of two real-world values.
\(V_{a} = V_{b} + V_{c}\)
These values are represented by the general slope/bias encoding scheme.
\(V_{i} = F_{i} 2^{E_{i}} \cdot Q_{i} + B_{i}\)
In a fixed-point system, the addition of values results in finding the variable \(Q_{a}\).
\(Q_{a} = \dfrac{F_{b}}{F_{a}} 2^{E_{b} - E_{a}}\cdot Q_{b} + \dfrac{F_{c}}{F_{a}} 2^{E_{c} - E_{a}}\cdot Q_{c} + \dfrac{B_{b} + B_{c} - B_{a}}{F_{a}}\cdot 2^{-E_{a}}\)
This formula shows:
- \(Q_{a}\) is not computed through a simple addition of \(Q_{b}\) and \(Q_{c}\).
- In general, there are two multiplies of a constant and a variable, two additions, and some additional bit shifting.
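The general formula can be exercised numerically. The sketch below (in Python rather than MATLAB; the `encode`/`decode` helpers and all slope/bias values are illustrative assumptions, not blockset functions) confirms that \(Q_{a}\) computed this way recovers the real-world sum:

```python
def encode(V, F, E, B):
    """Quantize a real-world value V to its stored integer Q."""
    return round((V - B) / (F * 2**E))

def decode(Q, F, E, B):
    """Recover the approximate real-world value from Q."""
    return F * 2**E * Q + B

# Illustrative slope/bias scalings for operands b, c and result a.
Fb, Eb, Bb = 1.25, -4, 1.0
Fc, Ec, Bc = 1.0, -3, -2.0
Fa, Ea, Ba = 1.0, -3, -1.0

Qb = encode(3.5, Fb, Eb, Bb)    # stored integer for V_b = 3.5
Qc = encode(1.75, Fc, Ec, Bc)   # stored integer for V_c = 1.75

# The general addition formula: two constant-times-variable
# multiplies, two additions, and a bias-correction term.
Qa = (Fb / Fa) * 2**(Eb - Ea) * Qb \
   + (Fc / Fa) * 2**(Ec - Ea) * Qc \
   + (Bb + Bc - Ba) / Fa * 2**-Ea
Qa = round(Qa)

print(decode(Qa, Fa, Ea, Ba))   # recovers 3.5 + 1.75 = 5.25
```

The scalings above were chosen so the result is exact; in general the final `round` introduces quantization error.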
Inherited Scaling for Speed
In the process of finding the scaling of the sum, one reasonable goal is to simplify the calculations. Simplifying the calculations should reduce the number of operations thereby increasing execution speed. The following choices can help to minimize the number of arithmetic operations:
- Set \(B_{a}\) = \(B_{b}\) + \(B_{c}\) which eliminates one addition.
- Set \(F_{a}\) = \(F_{b}\) or \(F_{a}\) = \(F_{c}\) which eliminates one of the two constant-times-variable multiplies.
The resulting formula is one of the following:
\(Q_{a} = 2^{E_{b} - E_{a}}\cdot Q_{b} + \dfrac{F_{c}}{F_{a}} 2^{E_{c} - E_{a}}\cdot Q_{c}\)
\(Q_{a} = \dfrac{F_{b}}{F_{a}} 2^{E_{b} - E_{a}}\cdot Q_{b} + 2^{E_{c} - E_{a}}\cdot Q_{c}\)
These equations appear to be equivalent. However, your choice of rounding and precision may make one choice stand out over the other. To further simplify matters you could choose \(E_{a}= E_{c}\) or \(E_{a}= E_{b}\), which eliminates some bit shifting.
Inherited Scaling for Maximum Precision
In the process of finding the scaling of the sum, one reasonable goal is maximum precision. The maximum precision scaling can be determined if the range of the variable is known. The range of a fixed-point operation can be determined from \(\text{min}(\tilde V_{a})\) and \(\text{max}(\tilde V_{a})\). For summation, the range can be determined from:
- \(\text{min}(\tilde V_{a})=\text{min}(\tilde V_{b})+\text{min}(\tilde V_{c})\)
- \(\text{max}(\tilde V_{a})=\text{max}(\tilde V_{b})+\text{max}(\tilde V_{c})\)
The maximum precision slope can now be derived.
\(F_{a}2^{E_{a}} = \dfrac{\text{max}(\tilde V_{a}) - \text{min}(\tilde V_{a})}{2^{ws_{a}}-1}\)
Radix Point-Only Scaling
For radix point-only scaling, finding \(Q_{a}\) results in this simple expression.
\(Q_{a} = 2^{E_{b}-E_{a}}\cdot Q_{b}+2^{E_{c}-E_{a}}\cdot Q_{c}\)
This scaling choice results in only one addition and some bit shifting. The avoidance of any multiplications is a big advantage of radix point-only scaling.
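A minimal sketch of this case (the exponents and stored integers below are illustrative assumptions) shows the single addition plus one alignment shift:

```python
# Radix point-only addition: F = 1 and B = 0 for every operand.
Eb, Ec, Ea = -4, -3, -3

Qb = 40                  # represents 40 * 2**-4 = 2.5
Qc = 12                  # represents 12 * 2**-3 = 1.5

# One right shift aligns Qb to the result's radix point; choosing
# Ea = Ec means Qc needs no shift at all.
Qa = (Qb >> (Ea - Eb)) + Qc

print(Qa * 2**Ea)        # 4.0 == 2.5 + 1.5
```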
Note
The subtraction of values produces results that are analogous to those produced by the addition of values.
Multiplication
Consider the multiplication of two real-world values.
\(V_{a} = V_{b}\cdot V_{c}\)
These values are represented by the general slope/bias encoding scheme.
\(V_{i} = F_{i} 2^{E_{i}} \cdot Q_{i} + B_{i}\)
In a fixed-point system, the multiplication of values results in finding the variable \(Q_{a}\).
\(Q_{a}=\dfrac{F_{b}F_{c}}{F_{a}}\cdot 2^{E_{b}+E_{c}-E_{a}}\cdot Q_{b}Q_{c} + \dfrac{F_{b}B_{c}}{F_{a}}\cdot 2^{E_{b}-E_{a}}\cdot Q_{b} + \dfrac{F_{c}B_{b}}{F_{a}}\cdot 2^{E_{c}-E_{a}}\cdot Q_{c} + \dfrac{B_{b}B_{c}-B_{a}}{F_{a}}\cdot 2^{-E_{a}}\)
This formula shows:
- \(Q_{a}\) is not computed through a simple multiplication of \(Q_{b}\) and \(Q_{c}\)
- In general, there is one multiply of a constant and two variables, two multiplies of a constant and a variable, three additions, and some additional bit shifting.
Inherited Scaling for Speed
The number of arithmetic operations can be reduced by setting the following:
- \(B_{a} = B_{b}B_{c}\) which eliminates one addition operation.
- \(F_{a} = F_{b}F_{c}\) which simplifies the triple multiplication.
- \(E_{a} = E_{b} + E_{c}\) which eliminates some of the bit shifting.
The resulting formula is
\(Q_{a} = Q_{b}Q_{c} + \dfrac{B_{c}}{F_{c}}\cdot 2^{-E_{c}}\cdot Q_{b} + \dfrac{B_{b}}{F_{b}}\cdot 2^{-E_{b}}\cdot Q_{c}\)
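This speed-scaled multiply can be checked numerically. The sketch below (Python; all slope/bias values are illustrative assumptions) applies the three choices and confirms the simplified formula recovers the product:

```python
# Illustrative scalings for the two operands.
Fb, Eb, Bb = 1.0, -3, 2.0
Fc, Ec, Bc = 1.0, -2, -1.0
# Speed choices from the text: Ba = Bb*Bc, Fa = Fb*Fc, Ea = Eb+Ec.
Fa, Ea, Ba = Fb * Fc, Eb + Ec, Bb * Bc

Vb, Vc = 3.0, 1.5
Qb = round((Vb - Bb) / (Fb * 2**Eb))    # encode V_b -> 8
Qc = round((Vc - Bc) / (Fc * 2**Ec))    # encode V_c -> 10

# Simplified formula: one Qb*Qc multiply plus two constant multiplies.
Qa = Qb * Qc + (Bc / Fc) * 2**-Ec * Qb + (Bb / Fb) * 2**-Eb * Qc
Qa = round(Qa)

Va = Fa * 2**Ea * Qa + Ba
print(Va)                               # 4.5 == 3.0 * 1.5
```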
Inherited Scaling for Maximum Precision
The maximum precision scaling can be determined if the range of the variable is known. The range of a fixed-point operation can be determined from \(\text{max}(\tilde V_{a})\) and \(\text{min}(\tilde V_{a})\).
For multiplication, the range can be determined from
\(\text{min}(\tilde V_{a})=\text{min}(\tilde V_{LL},\tilde V_{LH},\tilde V_{HL},\tilde V_{HH})\)
\(\text{max}(\tilde V_{a})=\text{max}(\tilde V_{LL},\tilde V_{LH},\tilde V_{HL},\tilde V_{HH})\)
where:
- \(\tilde V_{LL}=\text{min}(\tilde V_{b})\cdot \text{min}(\tilde V_{c})\)
- \(\tilde V_{LH}=\text{min}(\tilde V_{b})\cdot \text{max}(\tilde V_{c})\)
- \(\tilde V_{HL}=\text{max}(\tilde V_{b})\cdot \text{min}(\tilde V_{c})\)
- \(\tilde V_{HH}=\text{max}(\tilde V_{b})\cdot \text{max}(\tilde V_{c})\)
Radix Point-Only Scaling
For radix point-only scaling, finding \(Q_{a}\) results in this simple expression.
\(Q_{a} = 2^{E_{b}+E_{c}-E_{a}}\cdot Q_{b}Q_{c}\)
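A one-line sketch (illustrative scalings) shows that choosing \(E_{a} = E_{b} + E_{c}\) leaves just the integer multiply:

```python
# Radix point-only multiplication: F = 1 and B = 0 everywhere.
Eb, Ec, Ea = -3, -3, -6
Qb, Qc = 12, 20          # represent 1.5 and 2.5
Qa = Qb * Qc             # shift factor 2**(Eb+Ec-Ea) = 2**0 vanishes

print(Qa * 2**Ea)        # 3.75 == 1.5 * 2.5
```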
Gain
Consider the multiplication of a constant and a variable.
\(V_{a} = K\cdot V_{b}\)
where \(K\) is a constant called the gain. Since \(V_{a}\) results from the multiplication of a constant and a variable, finding \(Q_{a}\) is a simplified version of the general fixed-point multiply formula.
\(Q_{a} = \dfrac{KF_{b}2^{E_{b}}}{F_{a}2^{E_{a}}}\cdot Q_{b} + \dfrac{KB_{b}-B_{a}}{F_{a}2^{E_{a}}}\)
To implement the above equation without changing it to a more complicated form, the constants need to be encoded using a radix point-only format. For each of these constants, the range is the trivial case of only one value. Despite the trivial range, the radix point formulas for maximum precision are still valid. The maximum precision representations are the most useful choices unless there is an overriding need to avoid any shifting. The encoding of the constants is:
\(\dfrac{KF_{b}2^{E_{b}}}{F_{a}2^{E_{a}}} = 2^{E_{x}}Q_{x}\)
\(\dfrac{KB_{b}-B_{a}}{F_{a}2^{E_{a}}} = 2^{E_{y}}Q_{y}\)
The resulting formula is:
\(Q_{a} = 2^{E_{x}}Q_{x}Q_{b} + 2^{E_{y}}Q_{y}\)
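The gain formula can be sketched end to end. In the Python below, the `to_radix_point` helper and every numeric value are illustrative assumptions; the helper picks a maximum precision radix point-only encoding with a given number of bits:

```python
import math

def to_radix_point(c, word_size=8):
    """Best-precision radix point-only encoding: c ~= Qx * 2**Ex."""
    if c == 0:
        return 0, 0
    # Place the radix point so c uses all word_size bits of precision.
    Ex = math.floor(math.log2(abs(c))) - (word_size - 1)
    return round(c / 2**Ex), Ex

Fb, Eb, Bb = 1.0, -4, 1.0        # operand scaling (illustrative)
Fa, Ea, Ba = 1.0, -2, 0.0        # result scaling (illustrative)
K = 2.5                          # the gain

# Encode the two constants from the gain formula.
Qx, Ex = to_radix_point(K * Fb * 2**Eb / (Fa * 2**Ea))
Qy, Ey = to_radix_point((K * Bb - Ba) / (Fa * 2**Ea))

Qb = round((2.0 - Bb) / (Fb * 2**Eb))       # encode V_b = 2.0
Qa = round(Qx * 2**Ex * Qb + Qy * 2**Ey)    # the resulting formula

print(Fa * 2**Ea * Qa + Ba)                 # 5.0 == 2.5 * 2.0
```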
Inherited Scaling for Speed
The number of arithmetic operations can be reduced by setting the following:
- \(B_{a} = KB_{b}\) which eliminates one constant term.
- \(F_{a} = KF_{b}\) and \(E_{a} = E_{b}\) which sets the other constant term to unity.
The resulting formula is simply \(Q_{a} = Q_{b}\).
If the number of bits differs between input and output, the only operations involved are handling potential overflows and performing sign extensions.
Inherited Scaling for Maximum Precision
The scaling for maximum precision does not need to differ from the scaling for speed unless the output has fewer bits than the input. If so, saturation should be avoided by dividing the slope by 2 for each lost bit. This prevents saturation but causes rounding to occur.
Division
Division of values is an operation that should be avoided in fixed-point embedded systems, but it can occur in places. Therefore, consider the division of two real-world values.
\(V_{a} = V_{b}/V_{c}\)
These values are represented by the general slope/bias encoding scheme described in Scaling.
\(V_{i} = F_{i}2^{E_{i}}Q_{i} + B_{i}\)
In a fixed-point system, the division of values results in finding the variable \(Q_{a}\).
\(Q_{a} = \dfrac{F_{b}2^{E_{b}}Q_{b} + B_{b}}{F_{c}F_{a}2^{E_{c} + E_{a}}Q_{c} + B_{c}F_{a}\cdot 2^{E_{a}}} - \dfrac{B_{a}}{F_{a}}\cdot 2^{-E_{a}}\)
This formula shows:
- \(Q_{a}\) is not computed through a simple division of \(Q_{b}\) by \(Q_{c}\).
- There are two multiplies of a constant and a variable, two additions, one division of a variable by a variable, one division of a constant by a variable, and some additional bit shifting.
Inherited Scaling for Speed
The number of arithmetic operations can be reduced with these choices:
- \(B_{a} = 0\) which eliminates one addition operation.
- If \(B_{c} = 0\), then set the fractional slope \(F_{a} = F_{b}/F_{c}\). This eliminates one constant times variable multiplication.
The resulting formula is:
\(Q_{a} = \dfrac{Q_{b}}{Q_{c}}\cdot 2^{E_{b}-E_{c}-E_{a}} + \dfrac{B_{b}/F_{b}}{Q_{c}}\cdot 2^{-E_{c}-E_{a}}\)
If \(B_{c} \ne 0\), then no clear recommendation can be made.
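The speed-scaled division above can be checked numerically. In the Python sketch below all scalings are illustrative assumptions satisfying \(B_{a} = 0\) and \(B_{c} = 0\):

```python
# Illustrative scalings with Ba = 0, Bc = 0, Fa = Fb/Fc.
Fb, Eb, Bb = 1.0, -4, 2.0
Fc, Ec, Bc = 1.0, -2, 0.0
Fa, Ea, Ba = Fb / Fc, -2, 0.0

Vb, Vc = 6.0, 2.0
Qb = round((Vb - Bb) / (Fb * 2**Eb))    # encode V_b -> 64
Qc = round((Vc - Bc) / (Fc * 2**Ec))    # encode V_c -> 8

# Resulting formula: one variable division, one constant-by-variable
# division, and bit shifts.
Qa = round(Qb / Qc * 2**(Eb - Ec - Ea)
           + (Bb / Fb) / Qc * 2**(-Ec - Ea))

print(Fa * 2**Ea * Qa + Ba)             # 3.0 == 6.0 / 2.0
```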
Inherited Scaling for Maximum Precision
The maximum precision scaling can be determined if the range of the variable is known. The range of a fixed-point operation can be determined from \(\text{min}(\tilde V_{a})\) and \(\text{max}(\tilde V_{a})\). For division, the range can be determined from:
\(\text{min}(\tilde V_{a})=\text{min}(\tilde V_{LL},\tilde V_{LH},\tilde V_{HL},\tilde V_{HH})\)
\(\text{max}(\tilde V_{a})=\text{max}(\tilde V_{LL},\tilde V_{LH},\tilde V_{HL},\tilde V_{HH})\)
where for nonzero denominators:
- \(\tilde V_{LL}=\text{min}(\tilde V_{b}) / \text{min}(\tilde V_{c})\)
- \(\tilde V_{LH}=\text{min}(\tilde V_{b}) / \text{max}(\tilde V_{c})\)
- \(\tilde V_{HL}=\text{max}(\tilde V_{b}) / \text{min}(\tilde V_{c})\)
- \(\tilde V_{HH}=\text{max}(\tilde V_{b}) / \text{max}(\tilde V_{c})\)
Radix Point-Only Scaling
For radix point-only scaling, finding \(Q_{a}\) results in this simple expression.
\(Q_{a} = \dfrac{Q_{b}}{Q_{c}}\cdot 2^{E_{b}-E_{c}-E_{a}}\)
Warning
For the last two formulas involving \(Q_{a}\), division by zero and zero divided by zero are possible. In these cases, the hardware gives some default behavior, but you must make sure that these default responses yield meaningful results for the embedded system.
Limits on Precision
Computer words consist of a finite number of bits, and the binary encoding of variables is only an approximation of an arbitrarily precise real-world value. Therefore, the limitations of the binary representation automatically introduce limitations on the precision of the value.
The precision of a fixed-point word depends on the word size and radix point location. Extending the precision of a word can always be accomplished with more bits although you face practical limitations with this approach. Instead, you must carefully select the data type, word size, and scaling such that numbers are accurately represented. Rounding and padding with trailing zeros are typical methods implemented on processors to deal with the precision of binary words.
Rounding
The result of any operation on a fixed-point number is typically stored in a register that is longer than the number's original format. When the result is put back into the original format, the extra bits must be disposed of. That is, the result must be rounded. Rounding involves going from high precision to lower precision and produces quantization errors and computational noise.
The blockset provides four rounding modes, which are shown below.

The Fixed-Point Blockset rounding modes are discussed below. In each case, the data is generated using Simulink's Signal Generator block, and the doubles are converted to signed 8-bit numbers with radix point-only scaling of \(2^{-2}\).
Round Toward Zero
The computationally simplest rounding mode is to drop all digits beyond the number required. This mode is referred to as rounding toward zero, and it results in a number whose magnitude is always less than or equal to the more precise original value. In MATLAB, you can round to zero using the fix function.
Rounding toward zero introduces a cumulative downward bias for positive numbers and a cumulative upward bias for negative numbers. That is, positive numbers are rounded to smaller positive numbers, while negative numbers are rounded to negative numbers of smaller magnitude.

Round Toward Nearest
When rounding toward nearest, the number is rounded to the nearest representable value. This mode has the smallest errors associated with it and these errors are symmetric. As a result, rounding toward nearest is the most useful approach for most applications.
In MATLAB, you can round to nearest using the round function.

Round Toward Ceiling
When rounding toward ceiling, both positive and negative numbers are rounded toward positive infinity. As a result, a positive cumulative bias is introduced in the number.
In MATLAB, you can round to ceiling using the ceil function.

Round Toward Floor
When rounding toward floor, both positive and negative numbers are rounded to negative infinity. As a result, a negative cumulative bias is introduced in the number.
In MATLAB, you can round to floor using the floor function.

Rounding toward ceiling and rounding toward floor are sometimes useful for diagnostic purposes. For example, after a series of arithmetic operations, you may not know the exact answer because of word-size limitations, which introduce rounding. If every operation in the series is performed twice, once rounding to positive infinity and once rounding to negative infinity, you obtain an upper limit and a lower limit on the correct answer. You can then decide if the result is sufficiently accurate or if additional analysis is required.
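The four modes can be sketched with Python's standard rounding functions (the `quantize` helper and the scaling choice are illustrative assumptions; note also that Python's `round` breaks ties to even, whereas a hardware round-to-nearest may round ties upward):

```python
import math

def quantize(v, E, mode):
    """Quantize v to radix point-only scaling 2**E with one of the
    four rounding modes."""
    q = v / 2**E                       # ideal (unrounded) stored value
    q = {"zero":    math.trunc,        # drop extra digits
         "nearest": round,             # ties to even in Python
         "ceiling": math.ceil,         # toward +infinity
         "floor":   math.floor}[mode](q)
    return q * 2**E

v = -1.3
print(quantize(v, -2, "zero"))      # -1.25
print(quantize(v, -2, "nearest"))   # -1.25
print(quantize(v, -2, "ceiling"))   # -1.25
print(quantize(v, -2, "floor"))     # -1.5
```

Running `quantize` twice, once with `"ceiling"` and once with `"floor"`, brackets the exact answer as described above.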
Padding with Trailing Zeros
Padding with trailing zeros involves extending the least significant bit (LSB) of a number with extra bits. This method involves going from low precision to higher precision.
For example, suppose two numbers are subtracted from each other. First, the exponents must be aligned, which typically involves a right shift of the number with the smaller value. In performing this shift, significant digits can "fall off" to the right. However, when the appropriate number of extra bits is appended, the precision of the result is maximized. Consider two 8-bit fixed-point numbers that are close in value and subtracted from each other:
\(1.0000000\cdot 2^{q} - 1.1111111\cdot 2^{q-1} \hspace{1cm} \text{Where q is an integer}\)
To perform this operation, the exponents must be equal, which requires shifting the second number right by one bit. In an 8-bit word, the shifted number's least significant bit falls off, and the computed difference \(2^{-7}\cdot 2^{q}\) is twice the exact difference \(2^{-8}\cdot 2^{q}\).
\(\!\begin{aligned} &1.0000000\cdot 2^{q} \\ -&0.1111111\cdot 2^{q} \\ \hline &0.0000001\cdot 2^{q} \end{aligned}\)
If the top number is padded with two trailing zeros and the bottom number is padded with one trailing zero before the shift, no bits are lost and the subtraction produces the exact result.
\(\!\begin{aligned} &1.000000000\cdot 2^{q} \\ -&0.111111110\cdot 2^{q} \\ \hline &0.000000010\cdot 2^{q} \end{aligned}\)
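The effect of padding can be sketched in integer arithmetic. The sketch below (an illustration, not blockset behavior) works in integer multiples of a fine LSB, \(2^{-9}\cdot 2^{q}\), so both the padded and truncated subtractions are visible:

```python
# All quantities are integer counts of the unit 2**-9 * 2**q.
top = 512               # 1.000000000 * 2**q
bot = 510               # 1.111111110 * 2**(q-1), padded zero kept

print(top - bot)        # 2 units = 2**-8 * 2**q, the exact difference

# Without padding, bot is truncated to 7 fraction bits of 2**q first,
# dropping its two low units before the subtraction.
bot_trunc = (bot >> 2) << 2
print(top - bot_trunc)  # 4 units = 2**-7 * 2**q, twice the exact value
```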
Example: Limits on Precision
Fixed-point variables have a limited precision because digital systems represent numbers with a finite number of bits. For example, suppose you must represent the real-world number 35.375 with a fixed-point number. Using the general encoding scheme, the representation is:
\(\tilde{V}=2^{-2} Q+32\)
The two closest approximations to the real-world value are \(Q\) = 13 and \(Q\) = 14.
\(\tilde{V}=2^{-2}(13)+32=35.25\)
\(\tilde{V}=2^{-2}(14)+32=35.50\)
In either case, the absolute error is the same.
\(|{\tilde{V}-V}|=0.125=F\dfrac {2^{E}}{2}\)
For fixed-point values within the limited range, this represents the worst-case error if round-to-nearest is used. If other rounding modes are used, the worst-case error can be twice as large.
\(|{\tilde{V}-V}|=F2^{E}\)
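The round-to-nearest bound can be checked by sweeping real-world values through the encoding \(\tilde{V}=2^{-2}Q+32\) used in this example (the sweep range and step below are illustrative):

```python
# Verify |V~ - V| <= F*2**E / 2 for round-to-nearest.
F, E, B = 1.0, -2, 32.0
step = F * 2**E          # quantization step, 0.25

worst = 0.0
V = 32.0
while V < 40.0:
    Q = round((V - B) / step)        # round-to-nearest encoding
    Vtilde = step * Q + B
    worst = max(worst, abs(Vtilde - V))
    V += 0.001

print(worst <= step / 2 + 1e-12)     # True: never exceeds 0.125
```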
Example: Maximizing Precision
Precision is limited by slope. To achieve maximum precision, the slope should be made as small as possible while keeping the range adequately large. The bias will be adjusted in coordination with the slope.
Assume the maximum and minimum real-world value is given by max(V) and min(V), respectively. These limits may be known based on physical principles or engineering considerations. To maximize the precision, you must decide upon a rounding scheme and whether overflows saturate or wrap. To simplify matters, this example assumes the minimum real-world value corresponds to the minimum encoded value, and the maximum real-world value corresponds to the maximum encoded value. Using the general encoding scheme, these values are given by:
\(\text{max}(V)=F2^{E}\cdot \text{max}(Q)+B\)
\(\text{min}(V)=F2^{E}\cdot \text{min}(Q)+B\)
Solving for the slope, you get:
\(F2^{E}=\dfrac{\text{max}(V)-\text{min}(V)}{\text{max}(Q)-\text{min}(Q)}=\dfrac{\text{max}(V)-\text{min}(V)}{2^{ws}-1}\)
This result is independent of rounding and overflow issues, and depends only on word size (\(ws\)).
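The slope formula is a one-liner; the sketch below (the range \([-4, 4]\) and 8-bit word size are illustrative assumptions) computes the maximum precision slope:

```python
def best_slope(vmax, vmin, ws):
    """Maximum precision slope F*2**E for a known real-world range
    and word size ws, per max(V)-min(V) over 2**ws - 1."""
    return (vmax - vmin) / (2**ws - 1)

slope = best_slope(4.0, -4.0, 8)
print(slope)            # 8/255, roughly 0.0314
```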
Limits on Range
Limitations on the range of a fixed-point word occur for the same reason as limitations on its precision. Namely, fixed-point words have limited size.
In binary arithmetic, a processor may need to take an n-bit fixed-point number and store it in m bits, where m ≠ n . If m < n, the range of the number has been reduced and an operation can produce an overflow condition. Some processors identify this condition as infinity or NaN. For other processors, especially digital signal processors (DSPs), the value saturates or wraps. If m > n, the range of the number has been extended. Extending the range of a word requires the inclusion of guard bits, which act to "guard" against potential overflow. In both cases, the range depends on the word's size and scaling.
The Fixed-Point Blockset supports saturation and wrapping for all fixed-point data types, while guard bits are supported only for fractional data types. As shown below, you can select saturation or wrapping with the Saturate to max or min when overflows occur check box, and you can specify guard bits with the Output data type parameter.

Saturation and Wrapping
Saturation and wrapping describe a particular way that some processors deal with overflow conditions. For example, Analog Devices' ADSP-2100 family of processors supports either of these modes. If a register has a saturation mode of operation, then an overflow condition is set to the maximum positive or negative value allowed. Conversely, if a register has a wrapping mode of operation, an overflow condition is set to the appropriate value within the range of the representation.
Consider an 8-bit unsigned word with radix point-only scaling of \(2^{-5}\). Suppose this data type must represent a sine wave that ranges from -4 to 4. For values between 0 and 4, the word can represent these numbers without regard to overflow. This is not the case with negative numbers. If overflows saturate, all negative values are set to zero, which is the smallest number representable by the data type.

If overflows wrap, all negative values are set to the appropriate positive value.

Note
For most control applications, saturation is the safer way of dealing with fixed-point overflow. However, some processor architectures allow automatic saturation by hardware. If hardware saturation is not available, then extra software is required resulting in larger, slower programs. This cost is justified in some designs - perhaps for safety reasons. Other designs accept wrapping to obtain the smallest, fastest software.
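The two overflow behaviors for the 8-bit unsigned example above can be sketched as follows (the `store_u8` helper is an illustrative assumption):

```python
def store_u8(q, saturate=True):
    """Store an integer in 8 unsigned bits, saturating or wrapping."""
    if saturate:
        return min(max(q, 0), 255)   # clamp to [0, 255]
    return q & 0xFF                  # wrap modulo 2**8

# A sine sample of -2.0 encoded with scaling 2**-5 gives q = -64.
q = -64
print(store_u8(q, saturate=True))    # 0   -> real-world 0.0
print(store_u8(q, saturate=False))   # 192 -> real-world 6.0
```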
Guard Bits
You can eliminate the possibility of overflow by appending the appropriate number of guard bits to a binary word.
For a two's complement signed value, the guard bits are filled with either 0's or 1's depending on the value of the most significant bit (MSB). This is called sign extension. For example, consider a 4-bit two's complement number with value 1011. If this number is extended in range to 7 bits with sign extension, then the number becomes 1111011 and the value remains the same.
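Sign extension can be sketched directly on bit patterns (the `sign_extend` helper is an illustrative assumption):

```python
def sign_extend(value, from_bits, to_bits):
    """Extend a two's complement bit pattern, preserving its value."""
    sign = (value >> (from_bits - 1)) & 1
    if sign:
        # Fill the new high bits with 1's for a negative value.
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value

x = 0b1011                       # -5 in 4-bit two's complement
y = sign_extend(x, 4, 7)
print(format(y, "07b"))          # 1111011, still -5
```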
Guard bits are supported only for fractional data types. For both signed and unsigned fractionals, the guard bits lie to the left of the default radix point.
Fixed-point variables have a limited range for the same reason they have limited precision - because digital systems represent numbers with a finite number of bits. As a general example, consider the case where an integer is represented as a fixed-point word of size \(ws\). The range for signed and unsigned words is:
\(\text{max}(Q) - \text{min}(Q)\)
where:
\(\text{max}(Q) = \begin{cases} \ 2^{ws}-1 & \text{unsigned} \\[8pt] \ 2^{ws-1}-1 & \text{signed} \\[8pt] \end{cases}\)
\(\text{min}(Q) = \begin{cases} \ 0 & \text{unsigned} \\[8pt] \ -2^{ws-1} & \text{signed} \\[8pt] \end{cases}\)
Using the general slope/bias encoding scheme, the approximate real-world value has the range:
\(\text{max}(\tilde{V}) - \text{min}(\tilde{V})\)
where:
\(\text{max}(\tilde{V}) = \begin{cases} \ F2^{E}(2^{ws}-1)+B & \text{unsigned} \\[8pt] \ F2^{E}(2^{ws-1}-1)+B & \text{signed} \\[8pt] \end{cases}\)
\(\text{min}(\tilde{V}) = \begin{cases} \ B & \text{unsigned} \\[8pt] \ -F2^{E}(2^{ws-1})+B & \text{signed} \\[8pt] \end{cases}\)
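The signed case of these range formulas can be sketched numerically (the slope, bias, and word size below are illustrative assumptions):

```python
def signed_range(F, E, B, ws):
    """Real-world [min, max] of a signed slope/bias encoding."""
    vmax = F * 2**E * (2**(ws - 1) - 1) + B
    vmin = -F * 2**E * 2**(ws - 1) + B
    return vmin, vmax

# Signed 8-bit word with radix point-only scaling 2**-5, zero bias.
lo, hi = signed_range(1.0, -5, 0.0, 8)
print(lo, hi)        # -4.0 3.96875
```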
If the real-world value exceeds the limited range of the approximate value, the accuracy of the representation can become significantly worse.