Experimenting with Floating-Point Values

Representations

Java adopts IEEE-754 for its floating-point number representations. Let’s examine the following program, which prints the 32-bit patterns of 1.5f and -1.5f:

public class FloatPointDemo {
    public static void main(String[] args) {
        float f;
        
        f = 1.5f;
        String fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = -1.5f;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
    }

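    // Returns the 32 bits of f as a string, most significant bit first,
    // with a space between each group of four bits.
    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();

        // Reinterpret the float's IEEE-754 bit pattern as an int.
        int b = Float.floatToIntBits(f);

        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            // Extract the lowest bit and prepend it, so the bits end up MSB-first.
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }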
        return sb.toString();
    }
}
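
To make the printed bit strings easier to read, the sketch below splits a float into the three IEEE-754 single-precision fields: 1 sign bit, 8 exponent bits (stored with a bias of 127), and 23 fraction bits. The class name FloatFields and its helper methods are illustrative assumptions, not part of the program above.

```java
class FloatFields {
    // Sign bit: the most significant of the 32 bits (0 = positive, 1 = negative).
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Biased exponent: the next 8 bits; subtract 127 to get the true exponent.
    static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Fraction: the low 23 bits; a normal number has an implicit leading 1.
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float[] samples = {1.5f, -1.5f};
        for (float f : samples) {
            System.out.printf("%10f: sign=%d, biased exponent=%d, fraction=0x%06X%n",
                f, sign(f), biasedExponent(f), fraction(f));
        }
    }
}
```

For 1.5f the fields are sign 0, biased exponent 127 (true exponent 0), and fraction 0x400000, so the value is (1 + 0.5) × 2^0. Only the sign bit differs for -1.5f.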

Largest and Smallest Values

The following code prints the largest finite positive value (Float.MAX_VALUE), the most negative finite value (-Float.MAX_VALUE), and the smallest positive value (Float.MIN_VALUE).

public class FloatPointDemo {
    public static void main(String[] args) {
        float f;

        f = Float.MAX_VALUE;
        String fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);
        
        f = - Float.MAX_VALUE;
        fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);


        f = Float.MIN_VALUE;
        fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);

    }

    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();

        int b = Float.floatToIntBits(f);

        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }
        return sb.toString();
    }
}
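
As a hedged companion to the program above, the following sketch builds the same limits directly from their bit patterns and shows what happens one step beyond them. The class name FloatLimits and the helpers fromBits, twice, and half are assumptions introduced only for illustration.

```java
class FloatLimits {
    // Hypothetical helper: reinterpret a 32-bit pattern as a float.
    static float fromBits(int bits) {
        return Float.intBitsToFloat(bits);
    }

    static float twice(float f) {
        return f * 2.0f;
    }

    static float half(float f) {
        return f / 2.0f;
    }

    public static void main(String[] args) {
        // Exponent field 1111 1110 with all fraction bits set: largest finite value.
        System.out.println(fromBits(0x7F7FFFFF) == Float.MAX_VALUE);
        // Exponent field 0000 0001 with a zero fraction: smallest normal value.
        System.out.println(fromBits(0x00800000) == Float.MIN_NORMAL);
        // Exponent field all zeros with only the lowest fraction bit set:
        // smallest subnormal value, which is what Float.MIN_VALUE holds.
        System.out.println(fromBits(0x00000001) == Float.MIN_VALUE);

        // Going past either limit leaves the finite, nonzero range.
        System.out.println(twice(Float.MAX_VALUE)); // overflows to infinity
        System.out.println(half(Float.MIN_VALUE));  // underflows to zero
    }
}
```

Note that Float.MIN_VALUE is a subnormal number; the smallest positive value with full 24-bit precision is Float.MIN_NORMAL.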

Special Values

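IEEE-754 also reserves bit patterns for special values: positive and negative zero, positive and negative infinity, and NaN (not a number). The following program prints their representations: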

public class FloatPointSpecialDemo {
    public static void main(String[] args) {
        float f;

        f = +0.0f;
        String fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = -0.0f;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = Float.NEGATIVE_INFINITY;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = Float.POSITIVE_INFINITY;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = Float.NaN;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

    }

    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();

        int b = Float.floatToIntBits(f);

        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }
        return sb.toString();
    }
}
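
Beyond their bit patterns, the special values also behave differently in comparisons and arithmetic. The sketch below probes a few of those behaviors; the class SpecialValueChecks and its method names are hypothetical, chosen only for this illustration.

```java
class SpecialValueChecks {
    // The two zeros compare equal with ==, even though their sign bits differ.
    static boolean zerosCompareEqual() {
        return 0.0f == -0.0f;
    }

    // Their bit patterns are different.
    static boolean zerosShareBits() {
        return Float.floatToIntBits(0.0f) == Float.floatToIntBits(-0.0f);
    }

    // NaN is not equal to anything, including itself; use Float.isNaN instead.
    static boolean nanEqualsItself() {
        float nan = Float.NaN;
        return nan == nan;
    }

    // Floating-point division never throws; it yields an infinity or NaN.
    static float divide(float a, float b) {
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println("0.0f == -0.0f      : " + zerosCompareEqual());
        System.out.println("same bit pattern   : " + zerosShareBits());
        System.out.println("NaN == NaN         : " + nanEqualsItself());
        System.out.println("1.0f / 0.0f        : " + divide(1.0f, 0.0f));
        System.out.println("1.0f / -0.0f       : " + divide(1.0f, -0.0f));
        System.out.println("0.0f / 0.0f is NaN : " + Float.isNaN(divide(0.0f, 0.0f)));
    }
}
```

The sign of a zero is observable through division: dividing by +0.0f and by -0.0f produces infinities of opposite sign.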

Pitfalls and Errors

To avoid common pitfalls and errors when using floating-point data types in our programs, let’s examine several examples:

  1. What is the output of the following code snippet?

     double big = 1.0;
     double small = 0.1;
     double result = big - small;
     System.out.printf("The result of %f - %f is %.20f\n", big, small, result);
    
     result = big - small - small - small - small - small;
     System.out.printf("The result of %f - %f - %f - %f - %f - %f is %.20f\n",
         big, small, small, small, small, small, result);
    
  2. Suppose we want to print a table of the interest to be paid on a $1,000 loan at different interest rates, as follows:

    Interest Rate    Interest to be Paid ($)
    0.01%
    0.02%
    0.03%
    ...
    100.00%

    Someone provides the following solution:

     public class IncorrectInterestTable {
         public static void main(String[] args) {
             double principal = 1_000;
             System.out.printf("%15s    %23s\n", "Interest Rate", "Interest to be Paid ($)");
             for (double r = 0.01; r <= 1.0; r += 0.01) {
                 double interest = principal * r;
                 System.out.printf("%15.2f    %23.2f\n", r, interest);
             }
         }
     }
    

    Is the solution correct, and why? If it is not correct, can you provide a correct solution?

  3. Suppose we want to compute the sum of the series 1, 0.99, 0.98, 0.97, …, 0.03, 0.02, 0.01, that is, 100 floating-point values decreasing in steps of 0.01 (the exact sum is 50.5). Can you order the following four solutions (Solutions A - D) by accuracy, from the one with the largest error to the one with the smallest error, and explain your reasoning?

    1. Solution A
        public class SumOfFloatSeries1 {
            public static void main(String[] args) {
                float sum, value;

                sum = 0.0f;
                value = 1.0f;
                for (int i = 0; i < 100; i++) {
                    sum += value;
                    value -= 0.01f;
                }
                System.out.printf("The sum is %.20f\n", sum);
            }
        }
      
    2. Solution B
        public class SumOfFloatSeries2 {
            public static void main(String[] args) {
                float sum, value;

                sum = 0.0f;
                value = 0.01f;
                for (int i = 0; i < 100; i++) {
                    sum += value;
                    value += 0.01f;
                }
                System.out.printf("The sum is %.20f\n", sum);
            }
        }
      
    3. Solution C
        public class SumOfFloatSeries3 {
            public static void main(String[] args) {
                float sum, value;
                int intValue;

                sum = 0.0f;
                intValue = 1;
                for (int i = 0; i < 100; i++) {
                    value = intValue / 100.0f;
                    sum += value;
                    intValue++;
                }
                System.out.printf("The sum is %.20f\n", sum);
            }
        }
      
    4. Solution D
        public class SumOfFloatSeries4 {
            public static void main(String[] args) {
                float sum;
                int intValue, intSum;

                intSum = 0;
                intValue = 1;
                for (int i = 0; i < 100; i++) {
                    intSum += intValue;
                    intValue++;
                }
                sum = intSum / 100.0f;
                System.out.printf("The sum is %.20f\n", sum);
            }
        }
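
When working through the exercises above, it can help to look at the values the machine actually stores. The sketch below is one possible way to probe them, not a model answer: it prints the exact decimal expansion of the double nearest to 0.1, counts the iterations of a loop driven by a double counter, and derives a rate from an integer counter instead. The class RoundingProbe and its method names are assumptions made for this illustration.

```java
import java.math.BigDecimal;

class RoundingProbe {
    // BigDecimal(double) preserves the exact binary value of its argument,
    // so this exposes what the literal 0.1 really stores.
    static BigDecimal exactValueOfTenth() {
        return new BigDecimal(0.1);
    }

    // Counts the iterations of a loop whose counter is a double that is
    // repeatedly incremented by 0.01 (rounding errors accumulate in r).
    static int stepsWithDoubleCounter() {
        int steps = 0;
        for (double r = 0.01; r <= 1.0; r += 0.01) {
            steps++;
        }
        return steps;
    }

    // Derives the rate from an exact integer counter, so each value is
    // rounded only once and errors do not accumulate across iterations.
    static double rateFromCounter(int hundredths) {
        return hundredths / 100.0;
    }

    public static void main(String[] args) {
        System.out.println("double nearest to 0.1      = " + exactValueOfTenth());
        System.out.println("steps with double counter  = " + stepsWithDoubleCounter());
        System.out.println("last rate from int counter = " + rateFromCounter(100));
    }
}
```

Comparing the printed step count with the 100 rows the table calls for, and comparing the stored value of 0.1 with the decimal 0.1, gives concrete evidence to reason from in each exercise.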