IEEE 754 standard

IEEE 754-1985 was an industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.

IEEE 754-1985 represents numbers in binary, providing definitions for four levels of precision, of which the two most commonly used are:

Level              Width    Range                            Precision*
Single precision   32 bits  ±1.18×10^−38 to ±3.4×10^38       approx. 7 decimal digits
Double precision   64 bits  ±2.23×10^−308 to ±1.80×10^308    approx. 15 decimal digits
  • Precision: the number of decimal digits of precision is calculated as number_of_mantissa_bits × log10(2), giving approximately 7.2 for single and 15.9 for double precision.

The standard also defines representations for positive and negative infinity, a "negative zero", five exceptions to handle invalid results like division by zero, special values called NaNs for representing those exceptions, denormal numbers to represent numbers smaller than shown above, and four rounding modes.

Representation of numbers

The number 0.15625 represented as a single-precision IEEE 754-1985 floating-point number. See text for explanation.

The three fields in a 64-bit IEEE 754 float

Floating-point numbers in IEEE 754 format consist of three fields: a sign bit, a biased exponent, and a fraction. The following example illustrates the meaning of each.

The decimal number 0.15625₁₀ represented in binary is 0.00101₂ (that is, 1/8 + 1/32). (Subscripts indicate the number base.) Analogous to scientific notation, where numbers are written with a single non-zero digit to the left of the decimal point, we rewrite this number so it has a single 1 bit to the left of the "binary point". We simply multiply by the appropriate power of 2 to compensate for shifting the bits left by three positions:

    0.00101₂ = 1.01₂ × 2⁻³

Now we can read off the fraction and the exponent: the fraction is .01₂ and the exponent is −3.

As illustrated in the pictures, the three fields in the IEEE 754 representation of this number are:

sign = 0, because the number is positive. (1 indicates negative.)
biased exponent = −3 + the "bias". In single precision, the bias is 127, so in this example the biased exponent is 124; in double precision, the bias is 1023, so the biased exponent in this example is 1020.
fraction = .01000…₂.
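These three fields can be checked directly by reinterpreting the 32-bit pattern of 0.15625 as an integer, for instance with Python's standard struct module:

```python
import struct

# Reinterpret the single-precision bit pattern of 0.15625 as a 32-bit integer.
bits = struct.unpack('>I', struct.pack('>f', 0.15625))[0]

sign = bits >> 31                      # 1 sign bit
biased_exponent = (bits >> 23) & 0xFF  # 8 exponent bits
fraction = bits & 0x7FFFFF             # 23 fraction bits

print(sign, biased_exponent, f'{fraction:023b}')
# 0 124 01000000000000000000000
```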

IEEE 754 adds a bias to the exponent so that numbers can in many cases be compared conveniently by the same hardware that compares signed two's-complement integers. Using a biased exponent, the lesser of two positive floating-point numbers comes out "less than" the greater, following the same ordering as for sign-and-magnitude integers. If two floating-point numbers have different signs, the sign-and-magnitude comparison also works with biased exponents. However, if both biased-exponent floating-point numbers are negative, then the ordering must be reversed. If the exponent were represented as, say, a two's-complement number, comparison to see which of two numbers is greater would not be as convenient.

The leading 1 bit is omitted: since all numbers except zero start with a leading 1, the leading 1 is implicit and doesn't actually need to be stored, which gives an extra bit of precision for "free."

Zero

The number zero is represented specially:

sign = 0 for positive zero, 1 for negative zero.
biased exponent = 0.
fraction = 0.
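The two zeros differ only in the sign bit, which is easy to confirm from the bit patterns:

```python
import struct

pos_zero_bits = struct.unpack('>I', struct.pack('>f', 0.0))[0]
neg_zero_bits = struct.unpack('>I', struct.pack('>f', -0.0))[0]

print(hex(pos_zero_bits))  # 0x0
print(hex(neg_zero_bits))  # 0x80000000 (only the sign bit is set)
print(0.0 == -0.0)         # True: the two zeros compare equal
```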

Denormalized numbers

The number representations described above are called normalized, meaning that the implicit leading binary digit is a 1. To reduce the loss of precision when an underflow occurs, IEEE 754 includes the ability to represent fractions smaller than are possible in the normalized representation, by making the implicit leading digit a 0. Such numbers are called denormal. They don't include as many significant digits as a normalized number, but they enable a gradual loss of precision when the result of an arithmetic operation is not exactly zero but is too close to zero to be represented by a normalized number.

A denormal number is represented with a biased exponent of all 0 bits, which represents an exponent of −126 in single precision (not −127), or −1022 in double precision (not −1023).[1] In contrast, the smallest biased exponent representing a normal number is 1 (see examples below).
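This boundary can be probed by constructing single-precision bit patterns directly:

```python
import struct

def bits_to_float32(pattern):
    """Interpret a 32-bit integer pattern as a single-precision float."""
    return struct.unpack('>f', struct.pack('>I', pattern))[0]

smallest_denormal = bits_to_float32(0x00000001)  # biased exponent 0, fraction 1
smallest_normal = bits_to_float32(0x00800000)    # biased exponent 1, fraction 0

print(smallest_denormal == 2.0**-149)  # True
print(smallest_normal == 2.0**-126)    # True
```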

Representation of non-numbers

The biased-exponent field is filled with all 1 bits to indicate either infinity or an invalid result of a computation.

Positive and negative infinity

Positive and negative infinity are represented thus:

sign = 0 for positive infinity, 1 for negative infinity.
biased exponent = all 1 bits.
fraction = all 0 bits.

NaN

Some operations of floating-point arithmetic are invalid, such as dividing by zero or taking the square root of a negative number. The act of reaching an invalid result is called a floating-point exception. An exceptional result is represented by a special code called a NaN, for "Not a Number". All NaNs in IEEE 754-1985 have this format:

sign = either 0 or 1.
biased exponent = all 1 bits.
fraction = anything except all 0 bits (since all 0 bits represents infinity).
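A NaN can be built from such a bit pattern; note that a NaN compares unequal even to itself:

```python
import math
import struct

# Exponent field all 1s, non-zero fraction: one common quiet-NaN pattern.
nan = struct.unpack('>f', struct.pack('>I', 0x7FC00000))[0]

print(math.isnan(nan))  # True
print(nan == nan)       # False: a NaN is unequal to everything, itself included
```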

Range and precision

Precision is defined as the minimum difference between two successive mantissa representations; thus it is a function only of the mantissa, while the gap is defined as the difference between two successive numbers.[2]

Single precision

Single-precision numbers occupy 32 bits. In single precision:

  • The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the exponent field and the binary value 1 in the fraction field) are
    ±2^−149 ≈ ±1.4012985×10^−45
  • The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the exponent field and 0 in the fraction field) are
    ±2^−126 ≈ ±1.175494351×10^−38
  • The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the exponent field and all 1s in the fraction field) are
    ±(1−2^−24)×2^128[3] ≈ ±3.4028235×10^38
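The largest finite single-precision value can be confirmed from its bit pattern (254 in the exponent field, all 1s in the fraction):

```python
import struct

largest_float32 = struct.unpack('>f', struct.pack('>I', 0x7F7FFFFF))[0]

# (1 − 2^−24) × 2^128 is the same value written as (2 − 2^−23) × 2^127.
print(largest_float32 == (2.0 - 2.0**-23) * 2.0**127)  # True
print(largest_float32)  # 3.4028234663852886e+38
```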

Some example range and gap values for given exponents in single precision:

Actual exponent (unbiased)   Exp (biased)   Minimum     Maximum          Gap
0                            127            1           1.999999880791   1.19209289551e−7
1                            128            2           3.99999976158    2.38418579102e−7
2                            129            4           7.99999952316    4.76837158203e−7
10                           137            1024        2047.99987793    1.220703125e−4
11                           138            2048        4095.99975586    2.44140625e−4
23                           150            8388608     16777215         1
24                           151            16777216    33554430         2
127                          254            1.7014e38   3.4028e38        2.02824096037e31

As an example, 16,777,217 cannot be encoded as a 32-bit float: it is rounded to 16,777,216. This illustrates why floating-point arithmetic is unsuitable for accounting software. However, every integer whose magnitude is at most 2^24 = 16,777,216 can be stored in a 32-bit float without rounding.
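The rounding of 16,777,217 can be demonstrated by round-tripping through single precision (Python's floats are doubles, so struct is used here to force 32-bit storage):

```python
import struct

def to_float32(x):
    """Round-trip a value through single-precision storage."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(to_float32(16777216.0))  # 16777216.0  (2**24 is exactly representable)
print(to_float32(16777217.0))  # 16777216.0  (2**24 + 1 is rounded away)
```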

Double precision

Double-precision numbers occupy 64 bits. In double precision:

  • The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
    ±2^−1074 ≈ ±5×10^−324
  • The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are
    ±2^−1022 ≈ ±2.2250738585072014×10^−308
  • The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the Exp field and all 1s in the fraction field) are
    ±(1−2^−53)×2^1024[3] ≈ ±1.7976931348623157×10^308
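These double-precision limits are exposed directly in Python's sys.float_info, which serves as a cross-check:

```python
import sys

print(sys.float_info.max)                # 1.7976931348623157e+308
print(sys.float_info.min == 2.0**-1022)  # True: smallest normalized double
print(5e-324 == 2.0**-1074)              # True: smallest denormalized double
```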

Some example range and gap values for given exponents in double precision:

Actual exponent (unbiased)   Exp (biased)   Minimum                  Maximum                  Gap
0                            1023           1                        1.9999999999999997       2.2204460492503130808472633e−16
1                            1024           2                        3.9999999999999995       8.8817841970012523233890533447266e−16
2                            1025           4                        7.9999999999999990       3.5527136788005009293556213378906e−15
10                           1033           1024                     2047.9999999999997       2.27373675443232059478759765625e−13
11                           1034           2048                     4095.9999999999995       4.5474735088646411895751953125e−13
52                           1075           4503599627370496         9007199254740991         1
53                           1076           9007199254740992         18014398509481982        2
1023                         2046           8.9884656743115800e307   1.7976931348623157e308   1.9958403095347198116563727130368e292

Extended formats

The standard also recommends extended format(s) to be used to perform internal computations at a higher precision than that required for the final result, to minimise round-off errors: the standard only specifies minimum precision and exponent requirements for such formats. The x87 80-bit extended format is the most commonly implemented extended format that meets these requirements.

Examples

Here are some examples of single-precision IEEE 754 representations:

Type | Sign | Actual exponent | Exp (biased) | Exponent field | Significand (fraction field) | Value
Zero | 0 | −127 | 0 | 0000 0000 | 000 0000 0000 0000 0000 0000 | 0.0
Negative zero | 1 | −127 | 0 | 0000 0000 | 000 0000 0000 0000 0000 0000 | −0.0
One | 0 | 0 | 127 | 0111 1111 | 000 0000 0000 0000 0000 0000 | 1.0
Minus one | 1 | 0 | 127 | 0111 1111 | 000 0000 0000 0000 0000 0000 | −1.0
Smallest denormalized number | * | −127 | 0 | 0000 0000 | 000 0000 0000 0000 0000 0001 | ±2^−23 × 2^−126 = ±2^−149 ≈ ±1.4×10^−45
"Middle" denormalized number | * | −127 | 0 | 0000 0000 | 100 0000 0000 0000 0000 0000 | ±2^−1 × 2^−126 = ±2^−127 ≈ ±5.88×10^−39
Largest denormalized number | * | −127 | 0 | 0000 0000 | 111 1111 1111 1111 1111 1111 | ±(1−2^−23) × 2^−126 ≈ ±1.18×10^−38
Smallest normalized number | * | −126 | 1 | 0000 0001 | 000 0000 0000 0000 0000 0000 | ±2^−126 ≈ ±1.18×10^−38
Largest normalized number | * | 127 | 254 | 1111 1110 | 111 1111 1111 1111 1111 1111 | ±(2−2^−23) × 2^127 ≈ ±3.4×10^38
Positive infinity | 0 | 128 | 255 | 1111 1111 | 000 0000 0000 0000 0000 0000 | +∞
Negative infinity | 1 | 128 | 255 | 1111 1111 | 000 0000 0000 0000 0000 0000 | −∞
Not a number | * | 128 | 255 | 1111 1111 | non-zero | NaN
* The sign bit can be either 0 or 1.

Comparing floating-point numbers

Every possible bit combination is either a NaN or a number with a unique value in the affinely extended real number system with its associated order, except for the two bit combinations for negative zero and positive zero, which sometimes require special attention (see below). The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign-and-magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bits differ, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal); otherwise, relative order is the same as lexicographical order, but inverted for two negative numbers; endianness issues apply.

Floating-point arithmetic is subject to rounding that may affect the outcome of comparisons on the results of the computations.

Although negative zero and positive zero are generally considered equal for comparison purposes, some programming language relational operators and similar constructs may treat them as distinct. According to the Java Language Specification,[4] comparison and equality operators treat them as equal, but Math.min() and Math.max() distinguish them (officially starting with Java version 1.1 but actually with 1.1.1), as do the comparison methods equals(), compareTo() and even compare() of the classes Float and Double.
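The distinction is easy to observe; in Python, for example, comparison treats the zeros as equal while the sign bit remains visible to other operations:

```python
import math

print(0.0 == -0.0)               # True: relational comparison ignores the sign of zero
print(math.copysign(1.0, -0.0))  # -1.0: the sign bit of negative zero is still there
print(str(-0.0))                 # -0.0
```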

Rounding floating-point numbers

The IEEE standard has four different rounding modes; the first is the default; the others are called directed roundings.

  • Round to Nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time (in IEEE 754-2008 this mode is called roundTiesToEven to distinguish it from another round-to-nearest mode)
  • Round toward 0 – directed rounding towards zero
  • Round toward +∞ – directed rounding towards positive infinity
  • Round toward −∞ – directed rounding towards negative infinity.
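The default round-to-nearest-even mode is visible in ordinary double arithmetic: a sum that falls exactly halfway between two representable values goes to the neighbor whose least significant significand bit is zero.

```python
# Doubles have a 53-bit significand, so above 2**53 only even integers exist.
# 2**53 + 1 is exactly halfway between 2**53 and 2**53 + 2; the tie is
# resolved toward 2**53, whose last significand bit is 0 (even).
print(2.0**53 + 1.0 == 2.0**53)        # True
# 2**53 + 3 is halfway between 2**53 + 2 (odd last bit) and 2**53 + 4 (even),
# so this tie rounds *up* instead.
print(2.0**53 + 3.0 == 2.0**53 + 4.0)  # True
```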

Extending the real numbers

The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, however, the projective mode was dropped. The Intel 8087 and Intel 80287 floating-point coprocessors both supported this projective mode.[5][6][7]

Functions and predicates

Standard operations

The following functions must be provided:

  • Add, subtract, multiply, divide
  • Square root
  • Floating-point remainder. Unlike a normal modulo operation, it can be negative for two positive numbers; it returns the exact value of x−(round(x/y)·y).
  • Round to nearest integer. For undirected rounding when halfway between two integers the even integer is chosen.
  • Comparison operations. Besides the more obvious results, IEEE 754 defines that −∞ = −∞, +∞ = +∞ and x ≠ NaN for any x (including NaN).
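C99's remainder() and Python's math.remainder implement this IEEE remainder; a small example shows the possibly surprising negative result:

```python
import math

# IEEE remainder: x − round(x/y)·y with round-to-nearest-even.
# 5/3 ≈ 1.67 rounds to 2, so the result is 5 − 2·3 = −1,
# negative even though both operands are positive.
print(math.remainder(5.0, 3.0))  # -1.0
print(5.0 % 3.0)                 # 2.0  (the ordinary modulo, for contrast)
```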

Recommended functions and predicates

  • Under some C compilers, copysign(x,y) returns x with the sign of y, so abs(x) equals copysign(x,1.0). This is one of the few operations which operates on a NaN in a way resembling arithmetic. The function copysign is new in the C99 standard.
  • −x returns x with the sign reversed. This is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.
  • scalb(y, N)
  • logb(x)
  • finite(x) a predicate for "x is a finite value", equivalent to −Inf < x < Inf
  • isnan(x) a predicate for "x is a NaN", equivalent to "x ≠ x"
  • x <> y which turns out to have different exception behavior than NOT(x = y).
  • unordered(x, y) is true when "x is unordered with y", i.e., either x or y is a NaN.
  • class(x)
  • nextafter(x,y) returns the next representable value from x in the direction towards y
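Most of these recommended operations have direct counterparts in Python's math module (nextafter requires Python 3.9 or later):

```python
import math

print(math.copysign(3.0, -0.0))     # -3.0: copysign reads the sign bit, even of -0.0
print(math.isnan(float('nan')))     # True
print(math.isfinite(float('inf')))  # False
print(math.nextafter(1.0, 2.0) == 1.0 + 2.0**-52)  # True: one ulp above 1.0
```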

History

In 1976 Intel began planning to produce a floating-point coprocessor. Dr John Palmer, the manager of the effort, persuaded them that they should try to develop a standard for all their floating-point operations. William Kahan was hired as a consultant; he had helped improve the accuracy of Hewlett-Packard's calculators. Kahan initially recommended that the floating-point base be decimal[8] but the hardware design of the coprocessor was too far advanced to make that change.

The work within Intel worried other vendors, who set up a standardization effort to ensure a 'level playing field'. Kahan attended the second IEEE 754 standards working group meeting, held in November 1977. There he received permission from Intel to put forward a draft proposal based on the standard arithmetic part of their design for a coprocessor. The arguments over gradual underflow lasted until 1981, when an expert commissioned by DEC to assess it sided against the dissenters.

Even before it was approved, the draft standard had been implemented by a number of manufacturers.[9][10] The Intel 8087, which was announced in 1980, was the first chip to implement the draft standard.


References

  1. ^ Hennessy. Computer Organization and Design. Morgan Kaufmann. p. 270.
  2. ^ Hossam A. H. Fahmy, Shlomo Waser, and Michael J. Flynn. Computer Arithmetic. http://arith.stanford.edu/~hfahmy/webpages/arith_class/arith.pdf
  3. ^ a b W. Kahan (October 1, 1997). Lecture Notes on the Status of IEEE 754 (PDF). Elect. Eng. & Computer Science, University of California. http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF. Retrieved 2007-04-12.
  4. ^ The Java Language Specification
  5. ^ John R. Hauser (March 1996). "Handling Floating-Point Exceptions in Numeric Programs" (PDF). ACM Transactions on Programming Languages and Systems 18 (2). http://www.jhauser.us/publications/1996_Hauser_FloatingPointExceptions.html.
  6. ^ David Stevenson (March 1981). "IEEE Task P754: A proposed standard for binary floating-point arithmetic". IEEE Computer 14 (3): 51–62.
  7. ^ William Kahan and John Palmer (1979). "On a proposed floating-point standard". SIGNUM Newsletter 14 (Special): 13–21. doi:10.1145/1057520.1057522.
  8. ^ W. Kahan 2003, pers. comm. to Mike Cowlishaw and others after an IEEE 754 meeting.
  9. ^ Charles Severance (20 February 1998). "An Interview with the Old Man of Floating-Point". http://www.eecs.berkeley.edu/~wkahan/ieee754status/754story.html.
  10. ^ Charles Severance. "History of IEEE Floating-Point Format". Connexions. http://cnx.org/content/m32770/latest/.
