# Scala - IEEE754 floating point standard and Float / Double conversion

## I Introduction

The previous article covered conversion between binary and decimal numbers. This article introduces the widely used floating-point standard IEEE754.

## II IEEE754 introduction

### 1. Overall introduction

IEEE754 is the standard for binary floating-point arithmetic. The commonly used formats are single precision (32 bits) and double precision (64 bits); the less common ones are extended single precision (at least 43 bits) and extended double precision (at least 79 bits). Float and Double in Scala use the IEEE754 single precision 32-bit and double precision 64-bit formats respectively. Each value consists of three fields, Sign + Exponent + Fraction:

Sign: the sign bit. 0 represents positive and 1 represents negative. When a positive number is printed as a binary string, the leading sign bit 0 is often omitted.

Exponent: the exponent field (also called the order code), which determines the magnitude of the number, i.e. where the binary point sits.

Fraction: the fractional value. The corresponding M is the mantissa and holds the significant digits of the floating-point number.
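To make the three fields concrete, here is a minimal sketch that pulls them out of a Float and a Double with shifts and masks. The object and method names (`Ieee754Fields`, `floatFields`, `doubleFields`) are made up for this illustration; only the `java.lang.Float` / `java.lang.Double` calls are standard API. The field widths used are 1 + 8 + 23 bits for Float and 1 + 11 + 52 bits for Double.

```scala
object Ieee754Fields {
  /** Splits a Float into its raw (sign, exponent, fraction) bit fields. */
  def floatFields(x: Float): (Int, Int, Int) = {
    val bits = java.lang.Float.floatToRawIntBits(x)
    val sign = bits >>> 31                // 1 bit
    val exponent = (bits >>> 23) & 0xFF   // 8 bits, still includes the bias
    val fraction = bits & 0x7FFFFF        // 23 bits
    (sign, exponent, fraction)
  }

  /** Same split for a Double: 1 + 11 + 52 bits. */
  def doubleFields(x: Double): (Long, Long, Long) = {
    val bits = java.lang.Double.doubleToRawLongBits(x)
    val sign = bits >>> 63                 // 1 bit
    val exponent = (bits >>> 52) & 0x7FFL  // 11 bits, still includes the bias
    val fraction = bits & 0xFFFFFFFFFFFFFL // 52 bits
    (sign, exponent, fraction)
  }
}
```

For 66.59375f this yields sign 0 and a stored exponent of 133, the same values derived by hand later in the article.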

### 2. Formula

For a 32-bit single precision floating-point number, IEEE754 expresses the value as:

$$V = (-1)^{S} \times (1 + M) \times 2^{E - 127}$$

For a 64-bit double precision floating-point number, the value is:

$$V = (-1)^{S} \times (1 + M) \times 2^{E - 1023}$$

Sign: the first bit is the sign bit S, which distinguishes positive from negative numbers.

M: the significand 1 + M ∈ [1, 2) is written in the form 1.xxxx. Since the leading bit of a normalized binary number is always 1, only the xxxx part needs to be stored. Writing the significand as 1 + M therefore saves one bit of storage.

Exponent: in the 32-bit format, the exponent field E is 8 bits wide and occupies bits 2-9; in the 64-bit format, E is 11 bits wide and occupies bits 2-12. Take single precision as an example: its exponent field is 8 bits, so its fixed bias is 2 ^ {8-1} - 1 = 127. IEEE754 stores the exponent with this bias added, which is why the formula contains E - 127. Similarly, the 64-bit format uses the bias 2 ^ {11-1} - 1 = 1023.

This explanation may not be easy to follow on its own. The examples later in the article should make it clear.
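As a first small check, the value formula V = (-1)^S * (1 + M) * 2^(E - bias) can be evaluated directly. This is only a sketch for normalized values; `Ieee754Formula`, `bias` and `normalizedValue` are names invented for this example, and M is passed in as an ordinary fraction in [0, 1).

```scala
object Ieee754Formula {
  /** Bias for an exponent field that is `expBits` wide: 2^(expBits - 1) - 1. */
  def bias(expBits: Int): Int = (1 << (expBits - 1)) - 1

  /** V = (-1)^s * (1 + m) * 2^(e - bias); valid for normalized values only. */
  def normalizedValue(s: Int, e: Int, m: Double, expBits: Int): Double = {
    val sign = if (s == 0) 1.0 else -1.0
    sign * (1.0 + m) * math.pow(2.0, (e - bias(expBits)).toDouble)
  }
}
```

With S = 0, E = 133 and M = 0.00001010011 in binary (which is 83 / 2048 in decimal), `normalizedValue(0, 133, 83.0 / 2048, 8)` evaluates to 66.59375.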

### 3. Expression form

For the above binary expression of IEEE754, the numbers expressed are mainly divided into three types:

A. Normalized form

When the bits of the exponent field E are neither all 0 nor all 1, the value is a normalized floating-point number, i.e. one in conventional form.

B. Denormalized form

When the bits of the exponent field E are all 0, the value is a denormalized (subnormal) floating-point number. In this case the exponent used in the formula is 1 - 127 = -126 for single precision or 1 - 1023 = -1022 for double precision, and the significand no longer gets the implicit leading 1 but takes the form 0.xxxx. Such values represent ±0 or numbers very close to 0.

C. Special form

± Infinity: when the exponent field E is all 1 and the mantissa bits M are all 0, the value is positive or negative infinity, depending on the sign bit S

NaN: when E is all 1 and the mantissa bits M are not all 0, the floating-point number is not a number, that is, NaN
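The three forms above can be told apart by looking only at the E and M bit fields. A minimal sketch for Float follows; the object name `Ieee754Classify`, the method `classify` and the returned labels are hypothetical and chosen just for this example. Zero is reported separately even though it shares the all-zero exponent with the denormalized values.

```scala
object Ieee754Classify {
  /** Classifies a Float by inspecting its exponent (E) and fraction (M) bits. */
  def classify(x: Float): String = {
    val bits = java.lang.Float.floatToRawIntBits(x)
    val e = (bits >>> 23) & 0xFF // 8 exponent bits
    val m = bits & 0x7FFFFF      // 23 fraction bits
    if (e == 0xFF) {
      // E is all 1: infinity if M is all 0, otherwise NaN
      if (m == 0) "infinity" else "NaN"
    } else if (e == 0) {
      // E is all 0: +0 / -0 if M is all 0, otherwise a denormalized value
      if (m == 0) "zero" else "denormalized"
    } else {
      "normalized"
    }
  }
}
```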

## III Convert Double to IEEE754

To convert a decimal floating-point number into its IEEE754 representation, first convert it into an ordinary binary expression, then rewrite it in the normalized form that IEEE754 expects and splice the fields in the order S + E + M to get the 32-bit or 64-bit result.

### 1. Convert single precision Float to IEEE754 (manual version)

Using the example from the previous article, Float = 66.59375 has the binary form 1000010.10011, which normalizes to 1.00001010011 * 2 ^ {6}. Applying the formula:

S: 66.59375 is a positive number, so S = 0

M: 1 + M = 1.00001010011 has the form 1.xxxx, so M = 0.00001010011

E: 2^{E-127} = 2^{6} gives E - 127 = 6, so E = 133. The binary form of E is 10000101

Splice according to the form of S + E + M:

IEEE754 66.59375f = 0 + 10000101 + 00001010011 + zero padding to 32 bits

=> 01000010100001010011000000000000

### 2. Convert Double precision to IEEE754 (manual version)

Using the same example, Double = 66.59375 = 1000010.10011 = 1.00001010011 * 2 ^ {6}. Applying the formula:

S: 66.59375 is a positive number, so S = 0

M: 1 + M = 1.00001010011 has the form 1.xxxx, so M = 0.00001010011

E: 2^{E-1023} = 2^{6} gives E - 1023 = 6, so E = 1029. The binary form of E is 10000000101

Splice according to the form of S + E + M:

IEEE754 66.59375d = 0 + 10000000101 + 00001010011 + zero padding to 64 bits

=> 0100000001010000101001100000000000000000000000000000000000000000

### 3. Convert float / double to IEEE754 (code version)

The code mainly reproduces the manual process above. It does not handle denormalized or special values, so it only works for common normalized floating-point numbers. The process has four steps:

A. Determine S from the sign of num

B. Drop the leading 1 and take the remaining digits as the mantissa M

C. Use e_dec to compute the unbiased exponent, then add the bias for Float or Double according to the formula to obtain the stored E

D. Splice in the order S + E + M and pad with 0s to get the final result. If the order is hard to remember, note that it sounds like "SIM" card

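```
def doubleToIEEE754(num: Double, stringType: String): String = {
  // Binary form of |num|, e.g. "1000010.10011" (doubleToBin comes from the previous article)
  val binaryString = doubleToBin(math.abs(num))
  // S: 0 for positive numbers, 1 for negative numbers
  val s = if (num >= 0) "0" else "1"
  // M: every digit after the leading 1
  val m = binaryString.replace(".", "").drop(1)
  // Unbiased exponent: digits before the binary point, minus 1 (assumes |num| >= 1)
  val e_dec = binaryString.split("\\.")(0).length - 1

  // (exponent width, bias, total width) of the requested format
  val layout = stringType.toUpperCase match {
    // V = (-1)^s * (1+M) * 2^(E-127) (single precision)
    case "F" => Some((8, 127, 32))
    // V = (-1)^s * (1+M) * 2^(E-1023) (double precision)
    case "D" => Some((11, 1023, 64))
    case _ => None
  }

  layout match {
    case Some((expBits, bias, total)) =>
      val raw = (e_dec + bias).toBinaryString
      // Left-pad E to its field width so that small exponents keep their leading 0s
      val e = repeatString("0", expBits - raw.length) + raw
      // Cut M to the mantissa width (truncating, not rounding), then right-pad with 0s
      val mBits = m.take(total - 1 - expBits)
      s + e + mBits + repeatString("0", total - 1 - expBits - mBits.length)
    case None => ""
  }
}
```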

Try the example above:

```
val num = 66.59375
println(doubleToIEEE754(num, "D"))
println(doubleToIEEE754(num, "F"))
```

```
0100000001010000101001100000000000000000000000000000000000000000
01000010100001010011000000000000
```

### 4. Convert float / double to IEEE754 (official API version)

Java provides an API for converting Float and Double into their IEEE754 bits:

Float 32-bit:

num must be of type Float

```
val bitF = java.lang.Integer.toBinaryString(java.lang.Float.floatToRawIntBits(num))
```

Double 64-bit:

num must be of type Double

```
val bitD = java.lang.Long.toBinaryString(java.lang.Double.doubleToRawLongBits(num))
```

Both the manual derivation and the code version can be verified against the results of the official API. Note that toBinaryString drops leading zeros, so for a positive number the API result is shorter than the full width. For 66.59375 it is 31 bits and 63 bits long respectively, because the leading sign bit 0 is omitted.
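The dropped leading zeros can be restored by left-padding the API result to the full width. A small sketch follows; `Ieee754Bits`, `leftPad`, `floatBits` and `doubleBits` are names introduced here for illustration, while the `java.lang` calls are the same ones shown above.

```scala
object Ieee754Bits {
  /** Left-pads a binary string with '0' until it is `width` characters long. */
  def leftPad(bits: String, width: Int): String =
    ("0" * (width - bits.length)) + bits

  /** Full 32-bit IEEE754 string of a Float, sign bit included. */
  def floatBits(x: Float): String =
    leftPad(java.lang.Integer.toBinaryString(java.lang.Float.floatToRawIntBits(x)), 32)

  /** Full 64-bit IEEE754 string of a Double, sign bit included. */
  def doubleBits(x: Double): String =
    leftPad(java.lang.Long.toBinaryString(java.lang.Double.doubleToRawLongBits(x)), 64)
}
```

Padding to a fixed width also covers values below 2, where the exponent field itself starts with a 0 and more than one leading zero is dropped.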

## IV Convert IEEE754 to Double

The section above converts Double to IEEE754 with a binary number as the intermediate step. Converting IEEE754 back to Double goes through a binary number in the same way:

A. Cut the IEEE754 string into S + E + M according to the bit widths of Float / Double

B. Substitute S, E and M into the value formula to obtain the corresponding binary form

C. Convert the binary floating-point number to decimal to obtain the Double

```
def IEEE754ToDouble(binaryString: String, stringType: String): Double = {
  // (exponent width, bias) of the requested format
  val layout = stringType.toUpperCase match {
    // V = (-1)^s * (1+M) * 2^(E-127) (single precision)
    case "F" => Some((8, 127))
    // V = (-1)^s * (1+M) * 2^(E-1023) (double precision)
    case "D" => Some((11, 1023))
    case _ => None
  }

  layout match {
    case Some((expBits, bias)) =>
      // Cut the string into S (1 bit), E (expBits bits) and M (the rest)
      val s = binaryString.slice(0, 1)
      val e = binaryString.slice(1, 1 + expBits)
      val m = binaryString.slice(1 + expBits, binaryString.length)
      // Normalized values carry an implicit leading 1 in front of M
      val digits = if (e.forall(_ == '0')) m else "1" + m
      // Unbiased exponent = number of digits between the leading 1 and the binary point
      // (this simple split assumes 0 <= cut <= length of M)
      val cut = binToInteger(e) - bias
      val binary = digits.slice(0, cut + 1) + "." + digits.slice(cut + 1, digits.length)
      val value = binToDouble(binary)
      // Apply the sign
      if (s == "0") value else -value
    case None => Double.NaN
  }
}
```

Denormalized and special values are not handled here, and the full range of Double is not supported. The code is only meant to illustrate the idea; interested readers can extend the method.

## V Verification

The following calls verify the conversion methods above against the API:
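For comparison, the reverse direction can also be delegated to the standard Java API, which covers denormalized and special values as well. The sketch below assumes the input is a full-width bit string (32 or 64 characters of 0 and 1); `Ieee754Parse`, `bitsToFloat` and `bitsToDouble` are names invented for this example.

```scala
object Ieee754Parse {
  /** Decodes a full 32-character bit string into a Float. */
  def bitsToFloat(bits: String): Float = {
    require(bits.length == 32, "expected exactly 32 bits")
    // parseUnsignedInt accepts strings whose first bit is 1 (negative floats)
    java.lang.Float.intBitsToFloat(java.lang.Integer.parseUnsignedInt(bits, 2))
  }

  /** Decodes a full 64-character bit string into a Double. */
  def bitsToDouble(bits: String): Double = {
    require(bits.length == 64, "expected exactly 64 bits")
    java.lang.Double.longBitsToDouble(java.lang.Long.parseUnsignedLong(bits, 2))
  }
}
```

The unsigned parsing methods are used because a bit string that starts with 1 does not fit into a signed Int or Long when parsed as a positive number.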

```
// Single precision and double precision (num is the 66.59375 defined earlier)
println(repeatString("=", 50))
val floatBit = java.lang.Integer.toBinaryString(java.lang.Float.floatToRawIntBits(num.toFloat))
val floatBitDiy = doubleToIEEE754(num, "F")
// The API result lacks the leading sign bit 0 of a positive number, so restore it
val floatNum = IEEE754ToDouble("0" + floatBit, "F")
println(s"Num: $num FloatNum: $floatNum Single precision: $floatBit length: ${floatBit.length}")
println("API:" + "0" + floatBit)
println("DIY:" + floatBitDiy)

println(repeatString("=", 50))
val doubleBit = java.lang.Long.toBinaryString(java.lang.Double.doubleToRawLongBits(num))
val doubleBitDiy = doubleToIEEE754(num, "D")
val doubleNum = IEEE754ToDouble("0" + doubleBit, "D")
println(s"Num: $num DoubleNum: $doubleNum Double precision: $doubleBit, length: ${doubleBit.length}")
println("API:" + "0" + doubleBit)
println("DIY:" + doubleBitDiy)
println(repeatString("=", 50))
```

For ordinary normalized values, the hand-written conversions agree with the API.

The repeatString function used above is:

```
def repeatString(char: String, n: Int): String = List.fill(n)(char).mkString
```

For the other helpers, BinToInteger, IntegerToBin, DecimalToBin and BinToDecimal, please refer to the previous articles.

## VI Summary

This article breaks down the IEEE754 floating-point standard through worked examples and code, mainly by moving between decimal, binary and the value formula. Denormalized and special values are not discussed in depth and can be explored further in the future.
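The code in this article calls doubleToBin, binToInteger and binToDouble, whose definitions live in the previous articles. The following are hypothetical stand-ins, with signatures inferred only from how they are called here, so that the examples can be run on their own. The original implementations may well differ; the `maxFracBits` parameter in particular is an assumption of this sketch.

```scala
object BinaryHelpers {
  /** Binary string such as "10000101" to Int. */
  def binToInteger(bin: String): Int = Integer.parseInt(bin, 2)

  /** Binary string with an optional point, e.g. "1000010.10011", to Double. */
  def binToDouble(bin: String): Double = {
    val parts = bin.split("\\.")
    // Integer part: shift left and add each digit
    val intPart = parts(0).foldLeft(0.0)((acc, c) => acc * 2 + (c - '0'))
    // Fractional part: digit i (0-based) has weight 2^-(i+1)
    val fracPart =
      if (parts.length > 1)
        parts(1).zipWithIndex.map { case (c, i) => (c - '0') * math.pow(2, -(i + 1)) }.sum
      else 0.0
    intPart + fracPart
  }

  /** Magnitude of `num` as a binary string; the sign is ignored. */
  def doubleToBin(num: Double, maxFracBits: Int = 52): String = {
    val x = math.abs(num)
    val intPart = java.lang.Long.toBinaryString(x.toLong)
    val frac = new StringBuilder
    var f = x - x.toLong
    // Repeated doubling: each step emits the next fractional bit
    while (f > 0 && frac.length < maxFracBits) {
      f *= 2
      if (f >= 1) { frac.append('1'); f -= 1 } else frac.append('0')
    }
    if (frac.isEmpty) intPart else intPart + "." + frac.toString
  }
}
```

With these stand-ins, `doubleToBin(66.59375)` gives 1000010.10011, the binary form used in the manual derivation.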

Tags: Scala

Posted by AndyMoore on Mon, 02 May 2022 10:44:31 +0300