Experimenting with Floating-Point Values
Representations
Java uses the IEEE 754 standard to represent floating-point numbers. Let’s examine the following program and the output it produces:
public class FloatpintDemo {
    public static void main(String[] args) {
        float f;
        f = 1.5f;
        String fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
        f = -1.5f;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
    }

    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();
        int b = Float.floatToIntBits(f);
        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }
        return sb.toString();
    }
}
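To make the printed bit patterns easier to interpret, the 32 bits can be split into the three IEEE 754 single-precision fields: 1 sign bit, 8 exponent bits (stored with a bias of 127), and 23 fraction bits. The following is a minimal sketch of that split; the class name FloatFieldsSketch and its helper methods are hypothetical and are not part of the program above.

```java
public class FloatFieldsSketch {
    // Sign field: bit 31 (0 for positive, 1 for negative).
    public static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Exponent field: bits 30..23, stored with a bias of 127.
    public static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Fraction field: bits 22..0 (the leading 1 of a normal number is implicit).
    public static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        // 1.5f  -> sign 0, biased exponent 127, fraction 0x400000
        // -1.5f -> sign 1, biased exponent 127, fraction 0x400000
        for (float f : new float[] {1.5f, -1.5f}) {
            System.out.printf("%5.1f: sign=%d, exponent=%d (unbiased %d), fraction=0x%06X%n",
                    f, signBit(f), biasedExponent(f), biasedExponent(f) - 127, fraction(f));
        }
    }
}
```

Since 1.5 is 1.1 in binary, the fraction field holds a single leading 1 bit and the unbiased exponent is 0. The two values differ only in the sign bit.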
Largest and Smallest Values
The following code prints the largest positive value, the most negative value, and the smallest positive value that a float can represent.
public class FloatpintDemo {
    public static void main(String[] args) {
        float f;
        f = Float.MAX_VALUE;
        // fStr must be declared before its first use.
        String fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);
        f = - Float.MAX_VALUE;
        fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);
        f = Float.MIN_VALUE;
        fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);
    }

    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();
        int b = Float.floatToIntBits(f);
        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }
        return sb.toString();
    }
}
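The limits printed above follow directly from the single-precision layout. As a hedged illustration, the sketch below rebuilds them from powers of two and checks them against the constants in java.lang.Float. The class name FloatLimitsSketch and its methods are hypothetical; Math.scalb(x, n) is the standard library call that computes x times 2 to the power n.

```java
public class FloatLimitsSketch {
    // Largest finite float: (2 - 2^-23) * 2^127, built exactly in double, then narrowed.
    public static float maxFromFormula() {
        return (float) Math.scalb(2.0 - Math.scalb(1.0, -23), 127);
    }

    // Smallest positive float (a subnormal number): 2^-149.
    public static float minFromFormula() {
        return Math.scalb(1.0f, -149);
    }

    public static void main(String[] args) {
        System.out.println(maxFromFormula() == Float.MAX_VALUE);          // true
        System.out.println(minFromFormula() == Float.MIN_VALUE);          // true
        // Smallest positive normal float: 2^-126.
        System.out.println(Math.scalb(1.0f, -126) == Float.MIN_NORMAL);   // true
        // Going past either limit overflows to infinity or underflows to zero.
        System.out.println(Float.MAX_VALUE * 2.0f);                       // Infinity
        System.out.println(Float.MIN_VALUE / 2.0f);                       // 0.0
    }
}
```

Note that Float.MIN_VALUE is the smallest positive subnormal value, not the most negative value; the most negative finite value is -Float.MAX_VALUE.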
Special Values
The floating-point representation also encodes several special values: positive and negative zero, positive and negative infinity, and NaN (not a number):
public class FloatpintSpecialDemo {
    public static void main(String[] args) {
        float f;
        f = +0.0f;
        // fStr must be declared before its first use.
        String fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
        f = -0.0f;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
        f = Float.NEGATIVE_INFINITY;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
        f = Float.POSITIVE_INFINITY;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
        f = Float.NaN;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
    }

    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();
        int b = Float.floatToIntBits(f);
        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }
        return sb.toString();
    }
}
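Beyond their bit patterns, the special values also behave differently from ordinary numbers in arithmetic and comparisons. The sketch below illustrates a few of these behaviors; the class name FloatSpecialSketch and the divide helper are hypothetical names introduced only for this example.

```java
public class FloatSpecialSketch {
    // Plain float division; dividing by zero does not throw for floating-point types.
    public static float divide(float a, float b) {
        return a / b;
    }

    public static void main(String[] args) {
        float posZero = 0.0f;
        float negZero = -0.0f;
        float nan = Float.NaN;

        // The two zeros compare equal with ==, but Float.compare tells them apart.
        System.out.println(posZero == negZero);                    // true
        System.out.println(Float.compare(posZero, negZero) > 0);   // true

        // The sign of a zero divisor decides the sign of the infinity.
        System.out.println(divide(1.0f, posZero));                 // Infinity
        System.out.println(divide(1.0f, negZero));                 // -Infinity

        // 0/0 has no meaningful value and yields NaN.
        System.out.println(divide(posZero, posZero));              // NaN

        // NaN is not equal to anything, including itself; use Float.isNaN instead.
        System.out.println(nan == nan);                            // false
        System.out.println(Float.isNaN(nan));                      // true
    }
}
```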
Pitfalls and Errors
To avoid common pitfalls and errors when using floating-point data types in our programs, let’s examine several examples:
- What is the output of the following code snippet?
double big = 1.0;
double small = 0.1;
double result = big - small;
System.out.printf("The result %f - %f is %.20f\n", big, small, result);
result = big - small - small - small - small - small;
System.out.printf("The result %f - %f - %f - %f - %f - %f is %.20f\n",
        big, small, small, small, small, small, result);
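One way to investigate a snippet like this is to print the exact value a double actually stores and to compare results with a tolerance instead of ==. The sketch below is only an illustration under those assumptions: the class ExactDecimalSketch, its helper methods, and the tolerance 1e-9 are hypothetical choices, while the BigDecimal(double) constructor is the standard library call that preserves the exact binary value.

```java
import java.math.BigDecimal;

public class ExactDecimalSketch {
    // Exact decimal expansion of the binary value stored in a double.
    public static String exact(double d) {
        return new BigDecimal(d).toString();
    }

    // Subtracts step from start the given number of times, one rounding per step.
    public static double subtractRepeatedly(double start, double step, int times) {
        double result = start;
        for (int i = 0; i < times; i++) {
            result -= step;
        }
        return result;
    }

    // Tolerance-based comparison; the tolerance is an arbitrary illustrative choice.
    public static boolean nearlyEqual(double a, double b, double tolerance) {
        return Math.abs(a - b) <= tolerance;
    }

    public static void main(String[] args) {
        // 0.1 has no finite binary expansion, so the stored value is slightly off.
        System.out.println(exact(0.1));
        // prints 0.1000000000000000055511151231257827021181583404541015625

        double result = subtractRepeatedly(1.0, 0.1, 5);
        System.out.println(exact(result));
        System.out.println(result == 0.5);                  // false
        System.out.println(nearlyEqual(result, 0.5, 1e-9)); // true
    }
}
```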
- Suppose we want to print a table of the interest to be paid on a $1,000 loan at different interest rates, as follows:

    Interest Rate    Interest to be Paid ($)
    0.01%            …
    0.02%            …
    0.03%            …
    …                …
    100.00%          …

Someone provides the following solution:
public class IncorrectInterestTable {
    public static void main(String[] args) {
        double principle = 10_000;
        System.out.printf("%15s %23s\n", "Interest Rate", "Interest to be Paid ($)");
        for (double r = 0.01; r <= 1.0; r += 0.01) {
            double interest = principle * r;
            System.out.printf("%15.2f %23.2f\n", r, interest);
        }
    }
}
Is the solution correct, and why? If it is not correct, can you provide a correct solution?
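One possible direction, offered as a sketch and not as the definitive answer, is to drive the loop with an integer counter so that no rounding error accumulates in the loop variable. The example assumes the $1,000 principal from the problem statement and rates from 0.01% to 100.00% in steps of 0.01% (that is, 1 to 10,000 basis points). The class name InterestTableSketch and the interest helper are hypothetical.

```java
public class InterestTableSketch {
    // Interest for a rate given in basis points (1 basis point = 0.01%).
    public static double interest(double principal, int basisPoints) {
        return principal * basisPoints / 10_000.0;
    }

    public static void main(String[] args) {
        double principal = 1_000; // assumption: the loan amount from the problem statement
        System.out.printf("%15s %23s%n", "Interest Rate", "Interest to be Paid ($)");
        // The integer counter is exact, so the loop runs exactly 10,000 times
        // and each rate is derived with a single division.
        for (int bp = 1; bp <= 10_000; bp++) {
            double ratePercent = bp / 100.0;
            System.out.printf("%14.2f%% %23.2f%n", ratePercent, interest(principal, bp));
        }
    }
}
```

Because each row is computed from the integer counter with one division, an error in one row does not carry over to the next, and the last row is never skipped because of accumulated rounding error.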
- Suppose we want to compute the sum of the series 1, 0.99, 0.98, 0.97, …, 0.03, 0.02, 0.01, which consists of 100 floating-point values that decrease in steps of 0.01. Can you order the following four solutions (Solutions A to D) by accuracy, from the one with the largest error to the one with the smallest error, and explain your reasoning?
- Solution A
public class SumOfFloatSeries1 {
    public static void main(String[] args) {
        float sum, value;
        sum = 0.0f;
        value = 1.0f;
        for (int i=0; i<100; i++) {
            sum += value;
            value -= 0.01f;
        }
        System.out.printf("The sum is %.20f\n", sum);
    }
}
- Solution B
public class SumOfFloatSeries1 {
    public static void main(String[] args) {
        float sum, value;
        sum = 0.0f;
        value = 0.01f;
        for (int i=0; i<100; i++) {
            sum += value;
            value += 0.01f;
        }
        System.out.printf("The sum is %.20f\n", sum);
    }
}
- Solution C
public class SumOfFloatSeries1 {
    public static void main(String[] args) {
        float sum, value;
        int intValue;
        sum = 0.0f;
        intValue = 1;
        for (int i=0; i<100; i++) {
            value = intValue / 100.f;
            sum += value;
            intValue ++;
        }
        System.out.printf("The sum is %.20f\n", sum);
    }
}
- Solution D
public class SumOfFloatSeries1 {
    public static void main(String[] args) {
        float sum;
        int intValue, intSum;
        intSum = 0;
        intValue = 1;
        for (int i=0; i<100; i++) {
            intSum += intValue;
            intValue ++;
        }
        sum = intSum / 100.f;
        System.out.printf("The sum is %.20f\n", sum);
    }
}
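To check a proposed ordering empirically, each computed sum can be compared with the exact answer, which is (1 + 2 + … + 100) / 100 = 50.5. The sketch below shows one way to measure that error for two of the approaches. The class name SeriesErrorSketch and its method names are hypothetical, and the two methods only mirror the style of Solutions A and D for illustration.

```java
public class SeriesErrorSketch {
    // Exact value of the series: (1 + 2 + ... + 100) / 100.
    static final double EXACT = 50.5;

    // Steps the term by 0.01f and accumulates in float (in the style of Solution A).
    public static float steppedFloatSum() {
        float sum = 0.0f;
        float value = 1.0f;
        for (int i = 0; i < 100; i++) {
            sum += value;
            value -= 0.01f;
        }
        return sum;
    }

    // Sums exact integers and divides once at the end (in the style of Solution D).
    public static float integerSum() {
        int intSum = 0;
        for (int k = 1; k <= 100; k++) {
            intSum += k;
        }
        return intSum / 100.0f;
    }

    // Absolute error of a float result, measured in double precision.
    public static double absoluteError(float computed) {
        return Math.abs((double) computed - EXACT);
    }

    public static void main(String[] args) {
        System.out.printf("stepped float sum error: %.10e%n", absoluteError(steppedFloatSum()));
        System.out.printf("integer sum error:       %.10e%n", absoluteError(integerSum()));
    }
}
```

The integer-based version performs a single floating-point operation, and 50.5 is exactly representable as a float, so its error is zero. The other solutions can be measured in the same way by passing their results to absoluteError.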