
Data Types and Scaling

Overview

In digital hardware, numbers are stored in binary words. A binary word is a fixed-length sequence of binary digits (1's and 0's). The way in which hardware components or software functions interpret this sequence of 1's and 0's is described by a data type.

Binary numbers are represented as either fixed-point or floating-point data types. A fixed-point data type is characterized by the word size in bits, the radix (binary) point, and whether it is signed or unsigned. The radix point is the means by which fixed-point values are scaled. Within the Fixed-Point Blockset, fixed-point data types can be integers, fractionals, or generalized fixed-point numbers. The main difference between these data types is their default radix point. Floating-point data types are characterized by a sign bit, a fraction (or mantissa) field, and an exponent field. The blockset adheres to the IEEE Standard 754-1985 for Binary Floating-Point Arithmetic (referred to simply as the IEEE Standard 754 throughout this guide) and supports singles, doubles, and a nonstandard IEEE-style floating-point data type.

When choosing a data type, you must consider these factors:

  • The numerical range of the result
  • The precision required of the result
  • The associated quantization error (i.e., the rounding mode)
  • The method for dealing with exceptional arithmetic conditions

These choices depend on your specific application, the computer architecture used, and the cost of development, among other factors.

With the Fixed-Point Blockset, you can explore the relationship between data types, range, precision, and quantization error in the modeling of dynamic digital systems. With the Real-Time Workshop, you can generate production code based on that model.

Floating-Point Numbers

Fixed-point numbers are limited in that they cannot simultaneously represent very large or very small numbers using a reasonable word size. This limitation is overcome by using scientific notation. With scientific notation, you can dynamically place the radix point at a convenient location and use powers of the radix to keep track of that location. Thus, a range of very large and very small numbers can be represented with only a few digits.

Any binary floating-point number can be represented in scientific notation form as \(\pm f \times 2^{\pm e}\), where \(f\) is the fraction (or mantissa), 2 is the radix or base (binary in this case), and \(e\) is the exponent of the radix. The radix is always a positive number, while \(f\) and \(e\) can be positive or negative.

When performing arithmetic operations, floating-point hardware must take into account that the sign, exponent, and fraction are all encoded within the same binary word. This results in complex logic circuits when compared with the circuits for binary fixed-point operations.

The Fixed-Point Blockset supports single-precision and double-precision floating-point numbers as defined by the IEEE Standard 754. Additionally, a nonstandard IEEE-style number is supported. To link the world of fixed-point numbers with the world of floating-point numbers, the concepts behind scientific notation are reviewed below.

Scientific Notation

A direct analogy exists between scientific notation and radix point notation. For example, scientific notation using five decimal digits for the fraction has the form:

\(\pm d.dddd\times10^{p} = \pm ddddd.0\times10^{p-4} = \pm 0.ddddd\times10^{p+1}\)

where \(p\) is an integer of unrestricted range. Radix point notation using five bits for the fraction is the same except for the number base.

\(\pm b.bbbb\times2^{q} = \pm bbbbb.0\times2^{q-4} = \pm 0.bbbbb\times2^{q+1}\)

where \(q\) is an integer of unrestricted range. The previous equation is valid for both fixed- and floating-point numbers. For both these data types, the fraction can be changed at any time by the processor. However, for fixed-point numbers the exponent never changes, while for floating-point numbers the exponent can be changed any time by the processor.

For fixed-point numbers, the exponent is fixed but there is no reason why the radix point must be contiguous with the fraction. For example, a word consisting of three unsigned bits is usually represented in scientific notation in one of these four ways.

\(\begin{align*} bbb. &= bbb. \times 2^{0} \\ bb.b &= bbb. \times 2^{-1} \\ b.bb &= bbb. \times 2^{-2} \\ .bbb &= bbb. \times 2^{-3} \end{align*}\)

If the exponent is greater than 0 or less than -3, the representation would involve lots of zeros.

\(\begin{align*} bbb00000. &= bbb. \times 2^{5} \\ bbb00. &= bbb. \times 2^{2} \\ .00bbb &= bbb. \times 2^{-5} \\ .00000bbb &= bbb. \times 2^{-8} \end{align*}\)

However, these extra zeros never change to ones so they don't show up in the hardware. Furthermore, unlike floating-point exponents, a fixed-point exponent never shows up in the hardware, so fixed-point exponents are not limited by a finite number of bits.
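
The blockset itself is MATLAB-based; purely as an illustration, the scaling idea above can be sketched in Python (`fixed_point_value` is a hypothetical helper, not a blockset function):

```python
# Real-world value of a fixed-point word: stored_integer * 2**exponent.
# The exponent is fixed by convention and is never stored in hardware.
def fixed_point_value(stored_integer, exponent):
    return stored_integer * 2.0 ** exponent

# The 3-bit pattern 101 (stored integer 5) under the four radix point
# placements shown above:
print(fixed_point_value(0b101, 0))    # bbb.  -> 5.0
print(fixed_point_value(0b101, -1))   # bb.b  -> 2.5
print(fixed_point_value(0b101, -2))   # b.bb  -> 1.25
print(fixed_point_value(0b101, -3))   # .bbb  -> 0.625
```

The stored bit pattern never changes; only the agreed-upon exponent determines which real-world value the three bits represent.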

Note

The restriction of the radix point being contiguous with the fraction is unnecessary, and the Fixed-Point Blockset allows you to extend the radix point to any arbitrary location.

The IEEE Format

The IEEE Standard 754 has been widely adopted and is used with virtually all floating-point processors and arithmetic coprocessors, with the notable exception of many DSP floating-point processors.

Among other things, this standard specifies four floating-point number formats of which singles and doubles are the most widely used. Each format contains three components: a sign bit, a fraction field, and an exponent field. These components, as well as the specific formats for singles and doubles, are discussed below.

The Sign Bit

While two's complement is the preferred representation for signed fixed-point numbers, IEEE floating-point numbers use a sign/magnitude representation, where the sign bit is explicitly included in the word. Using this representation, a sign bit of 0 represents a positive number and a sign bit of 1 represents a negative number.

The Fraction Field

In general, floating-point numbers can be represented in many different ways by shifting the number to the left or right of the radix point and decreasing or increasing the exponent of the radix by a corresponding amount.

To simplify operations on these numbers, they are normalized in the IEEE format. A normalized binary number has a fraction of the form 1.f where f has a fixed size for a given data type. Since the leftmost fraction bit is always a 1, it is unnecessary to store this bit and is therefore implicit (or hidden). Thus, an n-bit fraction stores an n+1-bit number. The IEEE format also supports denormalized numbers, which have a fraction of the form 0.f.

The Exponent Field

In the IEEE format, exponent representations are biased. This means a fixed value (the bias) is subtracted from the field to get the true exponent value. For example, if the exponent field is 8 bits, then the numbers 0 through 255 are represented, and there is a bias of 127. Note that some values of the exponent are reserved for flagging infinity, NaN, and denormalized numbers, so the true exponent values range from -126 to 127.

Single Precision Format

The IEEE single-precision floating-point format is a 32-bit word divided into a 1-bit sign indicator s, an 8-bit biased exponent e, and a 23-bit fraction f.

The relationship between this format and the representation of real numbers is given by:

\(value = \begin{cases} \ (-1)^{s} \cdot 2^{e-127} \cdot 1.f & \text{normalized, 0 < e < 255} \\[8pt] \ (-1)^{s} \cdot 2^{e-126} \cdot 0.f & \text{denormalized, e = 0, f > 0} \\[8pt] \ \text{exceptional value} & \text{otherwise} \end{cases}\)
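
To make the formula concrete, the following Python sketch unpacks a single-precision bit pattern into its sign, exponent, and fraction fields and reconstructs the value using the cases above (`decode_single` is an illustrative helper, not part of the blockset):

```python
import struct

def decode_single(x):
    """Decompose a float into IEEE 754 single-precision fields s, e, f
    and reconstruct its value from those fields."""
    bits, = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1-bit sign
    e = (bits >> 23) & 0xFF   # 8-bit biased exponent
    f = bits & 0x7FFFFF       # 23-bit fraction
    if 0 < e < 255:           # normalized: (-1)^s * 2^(e-127) * 1.f
        value = (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)
    elif e == 0:              # denormalized (or zero): (-1)^s * 2^(-126) * 0.f
        value = (-1) ** s * 2.0 ** (-126) * (f / 2 ** 23)
    else:                     # e == 255: infinity or NaN
        value = float("inf") * (-1) ** s if f == 0 else float("nan")
    return s, e, f, value

print(decode_single(-6.25))   # -6.25 = (-1)^1 * 2^2 * 1.5625, so e = 129
```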

Double Precision Format

The IEEE double-precision floating-point format is a 64-bit word divided into a 1-bit sign indicator s, an 11-bit biased exponent e, and a 52-bit fraction f.

The relationship between this format and the representation of real numbers is given by:

\(value = \begin{cases} \ (-1)^{s} \cdot 2^{e-1023} \cdot 1.f & \text{normalized, 0 < e < 2047} \\[8pt] \ (-1)^{s} \cdot 2^{e-1022} \cdot 0.f & \text{denormalized, e = 0, f > 0} \\[8pt] \ \text{exceptional value} & \text{otherwise} \end{cases}\)

Nonstandard IEEE Format

The Fixed-Point Blockset supports a nonstandard IEEE-style floating-point data type. This data type adheres to the definitions and formulas previously given for IEEE singles and doubles. You create nonstandard floating-point numbers with the float function.

float(TotalBits,ExpBits)

TotalBits is the total word size and ExpBits is the size of the exponent field. The size of the fraction field and the bias are calculated from these input arguments. You can specify any number of exponent bits up to 11, and any number of total bits such that the fraction field is no more than 53 bits.

When specifying a nonstandard format, you should remember that the number of exponent bits largely determines the range of the result and the number of fraction bits largely determines the precision of the result.
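
The bookkeeping implied by the text can be sketched as follows; this Python function mirrors the formulas given in this chapter (one sign bit, bias \(2^{e-1}-1\)), not the blockset's float function itself:

```python
def float_format(total_bits, exp_bits):
    """Derived quantities for an IEEE-style format: fraction bits, bias,
    largest normalized value, smallest positive normalized value."""
    frac_bits = total_bits - exp_bits - 1        # one bit reserved for the sign
    assert exp_bits <= 11 and frac_bits <= 53    # blockset limits from the text
    bias = 2 ** (exp_bits - 1) - 1
    largest = (2 - 2.0 ** -frac_bits) * 2.0 ** bias
    smallest_normalized = 2.0 ** (1 - bias)
    return frac_bits, bias, largest, smallest_normalized

# IEEE single precision corresponds to float(32, 8):
print(float_format(32, 8))   # 23 fraction bits, bias 127
```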

Note

These numbers are normalized with a hidden leading one for all exponents except the smallest possible exponent. However, the largest possible exponent might not be treated as a flag for infinity or NaNs.

Range and Precision

The range of a number gives the limits of the representation while the precision gives the distance between successive numbers in the representation. The range and precision of an IEEE floating-point number depend on the specific format.

Range

The range of representable numbers for an IEEE floating-point number with \(f\) bits allocated for the fraction, \(e\) bits allocated for the exponent, and a bias given by \(bias = 2^{e-1} - 1\) is described below.

where:

  • Normalized positive numbers are defined within the range \(2^{1 - bias}\) to \((2 - 2^{-f}) \cdot 2^{bias}\).
  • Normalized negative numbers are defined within the range \(-2^{1 - bias}\) to \(-(2 - 2^{-f}) \cdot 2^{bias}\).
  • Positive numbers greater than \((2 - 2^{-f}) \cdot 2^{bias}\), and negative numbers less than \(-(2 - 2^{-f}) \cdot 2^{bias}\), are overflows.
  • Positive numbers less than \(2^{1 - bias}\), and negative numbers greater than \(-2^{1 - bias}\), are either underflows or denormalized numbers.
  • Zero is given by a special bit pattern, where \(e = 0\) and \(f = 0\).

Overflows and underflows result from exceptional arithmetic conditions. Floating-point numbers outside the defined range are always mapped to \(\pm\)inf.

Info

You can use the MATLAB commands realmin and realmax to determine the dynamic range of double-precision floating-point values for your computer.

Precision

Due to a finite word size, a floating-point number is only an approximation of the true value. Therefore, it is important to have an understanding of the precision of a floating-point result. In general, a value \(v\) with an accuracy \(q\) is specified by \(v \pm q\). For IEEE floating-point numbers, \(v = (-1)^{s} \cdot 2^{e-bias} \cdot 1.f\) and \(q = 2^{-f} \cdot 2^{e-bias}\), where \(f\) is the number of bits in the fraction field. Thus, the precision is associated with the number of bits in the fraction field.

In MATLAB, floating-point relative accuracy is given by the command eps, which returns the distance from 1.0 to the next larger floating-point number. For a computer that supports the IEEE Standard 754, eps = \(2^{-52}\), or approximately \(2.2204 \cdot 10^{-16}\).
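
The same quantity is available outside MATLAB; for example, in Python (shown here only for illustration) it can be checked directly:

```python
import sys

# MATLAB's eps corresponds to sys.float_info.epsilon in Python: the
# distance from 1.0 to the next larger double-precision number.
eps = sys.float_info.epsilon
print(eps == 2.0 ** -52)      # True
print(1.0 + eps > 1.0)        # True: a full eps is representable
print(1.0 + eps / 2 == 1.0)   # True: half an eps rounds away
```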

Exceptional Arithmetic

In addition to specifying a floating-point format, the IEEE Standard 754 specifies practices and procedures so that predictable results are produced independent of the hardware platform. Specifically, denormalized numbers, infinity, and NaNs are defined to deal with exceptional arithmetic (underflow and overflow).

If an underflow or overflow is handled as infinity or NaN, then significant processor overhead is required to deal with this exception. Although the IEEE Standard 754 specifies practices and procedures to deal with exceptional arithmetic conditions in a consistent manner, microprocessor manufacturers may handle these conditions in ways that depart from the standard.

Denormalized Numbers

Denormalized numbers are used to handle cases of exponent underflow. When the exponent of the result is too small (i.e., a negative exponent with too large a magnitude), the result is denormalized by right-shifting the fraction and leaving the exponent at its minimum value. The use of denormalized numbers is also referred to as gradual underflow. Without denormalized numbers, the gap between the smallest representable nonzero number and zero is much wider than the gap between the smallest representable nonzero number and the next larger number. Gradual underflow fills that gap and reduces the impact of exponent underflow to a level comparable with round off among the normalized numbers. Thus, denormalized numbers provide extended range for small numbers at the expense of precision.
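
Gradual underflow can be observed directly in any IEEE 754 double-precision environment; the following Python sketch (illustrative only) shows the denormalized range below the smallest normalized double:

```python
import sys

# Smallest positive normalized double: 2**-1022.
smallest_normal = sys.float_info.min
print(smallest_normal == 2.0 ** -1022)   # True

# Denormalized doubles extend the range down to 2**-1074, trading
# precision for representable range near zero (gradual underflow).
smallest_denormal = 2.0 ** -1074
print(smallest_denormal > 0.0)           # True: still representable
print(smallest_denormal / 2 == 0.0)      # True: halving it underflows to zero
```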

Infinity

Arithmetic involving infinity is treated as the limiting case of real arithmetic, with infinite values defined as those outside the range of representable numbers.

\(-\infty \le \text{representable number} \le \infty\)

With the exception of the special cases discussed below (NaNs), any arithmetic operation involving infinity yields infinity. Infinity is represented by the largest biased exponent allowed by the format and a fraction of zero.

NaNs

A NaN (not-a-number) is a symbolic entity encoded in floating-point format. There are two types of NaNs: signaling and quiet. A signaling NaN signals an invalid operation exception. A quiet NaN propagates through almost every arithmetic operation without signaling an exception. NaNs are produced by these operations: \(\infty - \infty, -\infty + \infty, 0 \cdot \infty, 0 / 0, \text{and } \infty / \infty\).

Both types of NaNs are represented by the largest biased exponent allowed by the format and a nonzero fraction. The bit pattern for a quiet NaN is given by \(0.f\), where the most significant bit of \(f\) must be 1. The bit pattern for a signaling NaN is given by \(0.f\), where the most significant bit of \(f\) must be 0 and at least one of the remaining bits must be nonzero.
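
The NaN-producing operations listed above can be reproduced in any IEEE 754 environment; for example, in Python (quiet NaNs propagate without raising an exception):

```python
import math

inf = float("inf")

# Each of these exceptional operations yields a NaN:
print(math.isnan(inf - inf))    # True
print(math.isnan(-inf + inf))   # True
print(math.isnan(0.0 * inf))    # True
print(math.isnan(inf / inf))    # True

# A quiet NaN propagates through arithmetic without signaling:
nan = float("nan")
print(math.isnan(nan + 1.0))    # True
```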

Fixed-Point Numbers

Fixed-point numbers are stored in data types that are characterized by their word size in bits, radix point, and whether they are signed or unsigned. The Fixed-Point Blockset supports integers, fractionals, and generalized fixed-point numbers. The main difference between these data types is their default radix point.

Info

Fixed-point word sizes up to 128 bits are supported.

A common representation of a binary fixed-point number (either signed or unsigned) is shown below.

where:

  • \(b_{i}\) are the binary digits (bits).
  • The size of the word in bits is given by \(ws\).
  • The most significant bit (MSB) is the leftmost bit, and is represented by location \(b_{ws-1}\).
  • The least significant bit (LSB) is the rightmost bit, and is represented by location \(b_{0}\).
  • The radix point is shown four places to the left of the LSB.

Signed Fixed-Point Numbers

Computer hardware typically represents the negation of a binary fixed-point number in one of three ways:

  • Sign/magnitude
  • One's complement
  • Two's complement

Two's complement is the preferred representation of signed fixed-point numbers and is supported by the Fixed-Point Blockset. Negation using two's complement consists of a bit inversion (translation into one's complement) followed by the addition of a one. For example, the two's complement of 000101 is 111011.
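
The invert-then-increment rule can be sketched as follows (Python used only for illustration; `twos_complement` is a hypothetical helper):

```python
def twos_complement(word, ws):
    """Negate a ws-bit word in two's complement: invert the bits
    (one's complement), then add one, keeping the result to ws bits."""
    mask = (1 << ws) - 1
    return (~word + 1) & mask

# The example from the text: the two's complement of 000101 is 111011.
print(format(twos_complement(0b000101, 6), "06b"))   # 111011
# Negating again recovers the original word:
print(format(twos_complement(0b111011, 6), "06b"))   # 000101
```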

Whether a fixed-point value is signed or unsigned is usually not encoded explicitly within the binary word (i.e., there is no sign bit). Instead, the sign information is implicitly defined within the computer architecture.

Radix Point Interpretation

The radix point is the means by which fixed-point numbers are scaled. It is usually the software that determines the radix point. When performing basic math functions such as addition or subtraction, the hardware uses the same logic circuits regardless of the value of the scale factor. In essence, the logic circuits have no knowledge of a scale factor. They are performing signed or unsigned fixed-point binary algebra as if the radix point is to the right of \(b_{0}\).

Within the Fixed-Point Blockset, the main difference between fixed-point data types is the default radix point. For integers and fractionals, the radix point is fixed at the default value. For generalized fixed-point data types, you must explicitly specify the scaling by configuring dialog box parameters, or inherit the scaling from another block. The supported fixed-point data types are described below.

  • Integers - The default radix point for signed and unsigned integer data types is assumed to be just to the right of the LSB. You specify unsigned and signed integers with the uint and sint functions, respectively.
  • Fractionals - The default radix point for unsigned fractional data types is just to the left of the MSB, while for signed fractionals the radix point is just to the right of the MSB. If you specify guard bits, then they lie to the left of the radix point. You specify unsigned and signed fractional numbers with the ufrac and sfrac functions, respectively.
  • Generalized Fixed-Point Numbers - For signed and unsigned generalized fixed-point numbers, there is no default radix point. You specify unsigned and signed generalized fixed-point numbers with the ufix and sfix functions, respectively.

Scaling

The dynamic range of fixed-point numbers is much less than that of floating-point numbers with equivalent word sizes. To avoid overflow conditions and minimize quantization errors, fixed-point numbers must be scaled.

With the Fixed-Point Blockset, you can select a fixed-point data type whose scaling is defined by its default radix point, or you can select a generalized fixed-point data type and choose an arbitrary linear scaling that suits your needs. This section presents the scaling choices available for generalized fixed-point data types.

A fixed-point number can be represented by a general slope/bias encoding scheme.

\(V \approx \tilde{V} = SQ + B\)

where:

  • \(V\) is an arbitrarily precise real-world value.
  • \(\tilde V\) is the approximate real-world value.
  • \(Q\) is the integer that encodes \(V\).
  • \(S\) = \(F \cdot 2^{E}\) is the slope.
  • \(B\) is the bias.

The slope is partitioned into two components:

  • \(2^{E}\) specifies the radix point. \(E\) is the fixed power-of-two exponent.
  • \(F\) is the fractional slope. It is normalized such that \(1 \le F \lt 2\).

\(S\) and \(B\) are constants and do not show up in the computer hardware directly - only the quantization value \(Q\) is stored in computer memory.
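
The slope/bias encoding can be sketched in a few lines of Python (illustrative only; `encode` and `decode` are hypothetical helpers, and round-to-nearest with saturation is assumed here as the quantization policy):

```python
def encode(v, slope, bias, ws, signed=True):
    """Quantize a real-world value V to the stored integer Q so that
    V is approximately S*Q + B, rounding to nearest and saturating."""
    q = round((v - bias) / slope)
    lo, hi = (-(1 << (ws - 1)), (1 << (ws - 1)) - 1) if signed \
        else (0, (1 << ws) - 1)
    return max(lo, min(hi, q))

def decode(q, slope, bias):
    """Recover the approximate real-world value from the stored integer."""
    return slope * q + bias

# 8-bit signed word with radix point-only scaling S = 2**-4, B = 0:
q = encode(3.1416, 2 ** -4, 0, 8)
print(q, decode(q, 2 ** -4, 0))   # 50 3.125
```

Note that only the integer q would be stored in hardware; the slope and bias live entirely in the interpretation applied by software.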

Radix Point-Only Scaling

As the name implies, radix point-only (or "powers-of-two") scaling involves moving only the radix point within the generalized fixed-point word. The advantage of this scaling mode is that the number of processor arithmetic operations is minimized.

With radix point-only scaling, the components of the slope/bias formula have these values:

  • \(F = 1\)
  • \(S = 2^{E}\)
  • \(B = 0\)

That is, the scaling of the quantized real-world number is defined only by the slope \(S\), which is restricted to a power of two.

Radix point-only scaling is specified with the syntax \(2^{-E}\) where \(E\) is unrestricted. This creates a MATLAB structure with a bias \(B = 0\) and a fractional slope \(F = 1.0\). For example, the syntax \(2^{-10}\) defines a scaling such that the radix point is at a location 10 places to the left of the least significant bit.

Slope/Bias Scaling

When scaling by slope and bias, the slope S and bias B of the quantized real-world number can take on any value. Scaling by slope and bias is specified with the syntax [slope bias], which creates a MATLAB structure with the given slope and bias. For example, a slope/bias scaling specified by [5/9 10] defines a slope of 5/9 and a bias of 10. The slope must be a positive number.
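
As a sketch of the [5/9 10] example (Python used only for illustration), the scaling determines both the representable range and how a real-world value maps to a stored integer:

```python
# Slope/bias scaling [5/9 10]: V is approximately (5/9)*Q + 10.
slope, bias = 5 / 9, 10

# For an 8-bit unsigned word, the stored integer Q runs from 0 to 255,
# so the representable real-world range is:
print(slope * 0 + bias)      # smallest value: 10.0
print(slope * 255 + bias)    # largest value: about 151.67

# Quantizing a real-world value rounds to the nearest stored integer:
q = round((25 - bias) / slope)
print(q)                     # 27
```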

Quantization

The quantization \(Q\) of a real-world value \(V\) is represented by a weighted sum of bits. Within the context of the general slope/bias encoding scheme, the value of an unsigned fixed-point quantity is given by:

\(\tilde{V} = S \cdot \left[ \sum_{i=0}^{ws-1} b_i \, 2^i \right] + B\)

The value of a signed fixed-point quantity is given by:

\(\tilde{V} = S \cdot \left[ -b_{ws-1} 2^{ws-1} + \sum_{i=0}^{ws-2} b_i \, 2^i \right] + B\)

where:

  • \(b_{i}\) are binary digits, with values = 0,1.
  • \(ws\) is the word size in bits, with values = 1,2,3,...,128.
  • \(S\) is given by \(F \cdot 2^{E}\), where the scaling is unrestricted since the radix point does not have to be contiguous with the word.
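
The signed weighted sum above can be sketched directly (Python used only for illustration; `fixed_point_value` is a hypothetical helper):

```python
def fixed_point_value(bits, slope=1.0, bias=0.0, signed=True):
    """Value of a fixed-point word given as a list of bits (MSB first),
    using the weighted sum from the text: the MSB carries weight
    -2**(ws-1) for signed words, and all other bits carry +2**i."""
    ws = len(bits)
    total = 0
    for i, b in enumerate(reversed(bits)):   # i = 0 is the LSB
        weight = 2 ** i
        if signed and i == ws - 1:
            weight = -weight                 # two's complement MSB
        total += b * weight
    return slope * total + bias

# 4-bit signed word 1011: -8 + 0 + 2 + 1 = -5
print(fixed_point_value([1, 0, 1, 1]))       # -5.0
```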

Range and Precision

The range of a number gives the limits of the representation, while the precision gives the distance between successive numbers in the representation. The range and precision of a fixed-point number depend on the length of the word and the scaling.

Range

The range of representable numbers for an unsigned fixed-point number of word size \(ws\), scaling \(S\), and bias \(B\) runs from \(B\) (all bits zero) to \(S \cdot (2^{ws} - 1) + B\). For a two's complement fixed-point number, the range runs from \(S \cdot (-2^{ws-1}) + B\) to \(S \cdot (2^{ws-1} - 1) + B\).

For both signed and unsigned fixed-point numbers of any data type, the number of different bit patterns is \(2^{ws}\).

For example, if the fixed-point data type is an integer with scaling defined as \(S = 1\) and \(B = 0\), then the maximum unsigned value is \(2^{ws} - 1\) since zero must be represented. In two's complement, negative numbers must be represented as well as zero so the maximum value is \(2^{ws-1} - 1\). Additionally, since there is only one representation for zero, there must be an unequal number of positive and negative numbers. This means there is a representation for \(-2^{ws-1}\) but not for \(2^{ws-1}\).
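
For the integer case (\(S = 1\), \(B = 0\)), the limits above reduce to familiar values, as this small Python sketch shows (`fixed_point_range` is a hypothetical helper):

```python
def fixed_point_range(ws, signed):
    """Smallest and largest stored integers for a ws-bit word with
    integer scaling (S = 1, B = 0)."""
    if signed:
        # Two's complement: one extra negative value, since zero
        # consumes one of the 2**ws bit patterns.
        return -(2 ** (ws - 1)), 2 ** (ws - 1) - 1
    return 0, 2 ** ws - 1

print(fixed_point_range(8, signed=False))   # (0, 255)
print(fixed_point_range(8, signed=True))    # (-128, 127)
```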

Precision

The precision (scaling) of integer and fractional data types is specified by the default radix point. For generalized fixed-point data types, the scaling must be explicitly defined as either slope/bias or radix point-only. In either case, the precision is given by the slope.