Experimenting with Floating-Point Values

Representations

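Java adopts IEEE-754 for its floating-point representations: a float uses the 32-bit single-precision format and a double uses the 64-bit double-precision format. Let’s examine the following program, which prints the bit pattern of a float value: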

public class FloatPointDemo {
    public static void main(String[] args) {
        float f;
        
        f = 1.5f;
        String fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = -1.5f;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);
    }

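    // Returns the 32 bits of f, most significant bit first, in groups of four.
    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();

        // Reinterpret the float's IEEE-754 bit pattern as a 32-bit int.
        int b = Float.floatToIntBits(f);

        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            // Extract the current lowest bit and prepend it to the output.
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            // Move the next bit into the lowest position.
            b = b >> 1;
        }
        return sb.toString();
    }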
}
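
To read the printed bit patterns, it can help to split them into the three IEEE-754 single-precision fields: 1 sign bit, 8 biased-exponent bits (bias 127), and 23 fraction bits. The following is a minimal sketch of that split; the class name FloatFieldsSketch and its helper methods are illustrative and not part of the program above.

```java
// Illustrative sketch: the names below are assumptions, not part of the original program.
final class FloatFieldsSketch {
    // Sign: the top bit (bit 31) of the 32-bit pattern.
    static int signOf(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Biased exponent: the 8 bits below the sign (bits 30..23).
    static int biasedExponentOf(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Fraction: the low 23 bits (bits 22..0).
    static int fractionOf(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        for (float f : new float[] {1.5f, -1.5f}) {
            System.out.printf("%5.1f: sign=%d, biased exponent=%d (unbiased %d), fraction=0x%06X%n",
                    f, signOf(f), biasedExponentOf(f), biasedExponentOf(f) - 127, fractionOf(f));
        }
    }
}
```

For 1.5f and -1.5f only the sign field differs: 1.5 is 1.1 in binary times 2^0, so the biased exponent is 127 and the fraction has just its top bit set.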

Largest and Smallest Values

The following code prints out the most positive value, the most negative value, and the smallest positive value.

public class FloatPointDemo {
    public static void main(String[] args) {
        float f;

        f = Float.MAX_VALUE;
        String fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);
        
        f = - Float.MAX_VALUE;
        fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);


        f = Float.MIN_VALUE;
        fStr = getBinaryString(f);
        System.out.printf("%16e_10 = %s_2\n", f, fStr);

    }

    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();

        int b = Float.floatToIntBits(f);

        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }
        return sb.toString();
    }
}
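
Note that Float.MIN_VALUE is the smallest positive subnormal value (2^-149), not the smallest normalized one; Java exposes the latter as Float.MIN_NORMAL (2^-126). The sketch below prints the raw bit patterns of these constants in hexadecimal to make the difference visible. The class name FloatRangeSketch and the helper hexBits are hypothetical names chosen for this illustration.

```java
// Illustrative sketch: FloatRangeSketch and hexBits are assumed names.
final class FloatRangeSketch {
    // Formats the raw IEEE-754 bit pattern of f as 8 hexadecimal digits.
    static String hexBits(float f) {
        return String.format("0x%08X", Float.floatToIntBits(f));
    }

    public static void main(String[] args) {
        System.out.printf("MAX_VALUE   = %16e  bits %s%n", Float.MAX_VALUE, hexBits(Float.MAX_VALUE));
        System.out.printf("-MAX_VALUE  = %16e  bits %s%n", -Float.MAX_VALUE, hexBits(-Float.MAX_VALUE));
        System.out.printf("MIN_NORMAL  = %16e  bits %s%n", Float.MIN_NORMAL, hexBits(Float.MIN_NORMAL));
        System.out.printf("MIN_VALUE   = %16e  bits %s%n", Float.MIN_VALUE, hexBits(Float.MIN_VALUE));
    }
}
```

In the MIN_VALUE pattern the exponent field is all zeros and only the lowest fraction bit is set, which is what marks it as subnormal.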

Special Values

With floating-point representation, we can represent special values:

public class FloatPointSpecialDemo {
    public static void main(String[] args) {
        float f;

        f = +0.0f;
        String fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = -0.0f;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = Float.NEGATIVE_INFINITY;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = Float.POSITIVE_INFINITY;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

        f = Float.NaN;
        fStr = getBinaryString(f);
        System.out.printf("%10f_10 = %s_2\n", f, fStr);

    }

    public static String getBinaryString(float f) {
        StringBuilder sb = new StringBuilder();

        int b = Float.floatToIntBits(f);

        int bitMask = 0x1;
        for (int i=0; i<Float.SIZE; i++) {
            int bit = bitMask & b;
            if (i != 0 && i % 4 == 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, Character.forDigit(bit, 10));
            b = b >> 1;
        }
        return sb.toString();
    }
}
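
Beyond their bit patterns, the special values also behave differently from ordinary numbers in comparisons and arithmetic. The following sketch shows a few of those behaviors; the class name SpecialValueBehaviorSketch and its two helper methods are illustrative assumptions, not part of the program above.

```java
// Illustrative sketch: class and helper names are assumptions.
final class SpecialValueBehaviorSketch {
    // NaN is the only float value for which f == f is false.
    static boolean equalsItself(float f) {
        return f == f;
    }

    // -0.0f compares equal to +0.0f with ==, so its bit pattern is inspected instead.
    static boolean isNegativeZero(float f) {
        return Float.floatToIntBits(f) == 0x80000000;
    }

    public static void main(String[] args) {
        System.out.println(equalsItself(Float.NaN));   // false
        System.out.println(Float.isNaN(Float.NaN));    // true
        System.out.println(0.0f == -0.0f);             // true
        System.out.println(isNegativeZero(-0.0f));     // true
        System.out.println(1.0f / 0.0f);               // Infinity
        System.out.println(1.0f / -0.0f);              // -Infinity
        System.out.println(0.0f / 0.0f);               // NaN
    }
}
```

Because NaN never compares equal to anything, including itself, a test such as Float.isNaN(f) is needed instead of f == Float.NaN.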

Pitfalls and Errors

To avoid common pitfalls and errors using floating-point data types in our programs, let’s examine several examples:

What is the output of the following code snippet, and why?

 double big = 1.0;
 double small = 0.1;
 double result = big - small;
 System.out.printf("The result %f - %f is %.20f\n", big, small, result);

 result = big - small - small - small - small - small;
 System.out.printf("The result %f - %f - %f - %f - %f - %f is %.20f\n",
     big, small, small, small, small, small, result);
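
The underlying issue is that 0.1 has no exact binary representation, so each subtraction can carry a small rounding error. Two common ways of coping are to compare doubles within a tolerance, or to use java.math.BigDecimal constructed from decimal strings. The sketch below illustrates both; the class name, the helper names, and the tolerance 1e-9 are assumptions chosen for this example, not recommendations from the text above.

```java
import java.math.BigDecimal;

// Illustrative sketch: names and the tolerance value are assumptions.
final class DecimalSubtractionSketch {
    static final double EPSILON = 1e-9; // assumed tolerance; pick one suited to the data

    // Treats two doubles as equal when they differ by less than EPSILON.
    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    // Subtracts step from start the given number of times using exact decimal arithmetic.
    static BigDecimal subtractRepeatedly(String start, String step, int times) {
        BigDecimal result = new BigDecimal(start);
        BigDecimal decrement = new BigDecimal(step);
        for (int i = 0; i < times; i++) {
            result = result.subtract(decrement);
        }
        return result;
    }

    public static void main(String[] args) {
        double result = 1.0 - 0.1 - 0.1 - 0.1 - 0.1 - 0.1;
        System.out.println("result == 0.5            : " + (result == 0.5));
        System.out.println("nearlyEqual(result, 0.5) : " + nearlyEqual(result, 0.5));
        System.out.println("BigDecimal result        : " + subtractRepeatedly("1.0", "0.1", 5));
    }
}
```

The BigDecimal values are built from strings on purpose: new BigDecimal(0.1) would capture the already-rounded binary value of the double literal.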

Suppose we want to print a table of the interest to be paid on a $1,000 loan at different interest rates, as follows:

Interest Rate	Interest to be Paid ($)
0.01%	…
0.02%	…
0.03%	…
…	…
100.00%	…

Someone proposes the following solution:

 public class IncorrectInterestTable {
     public static void main(String[] args) {
         double principle = 10_000;
         System.out.printf("%15s    %23s\n", "Interest Rate", "Interest to be Paid ($)");
         for (double r = 0.01; r <= 1.0; r += 0.01) {
             double interest = principle * r;
             System.out.printf("%15.2f    %23.2f\n", r, interest);
         }
     }
 }

Is the solution correct, and why? If it is not correct, can you provide a correct solution?
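
One possible direction, offered as a sketch rather than as the intended answer, is to drive the loop with an integer counter and derive each rate from it, so that no rounding error accumulates in the loop variable. The sketch assumes the table is read literally (rates from 0.01% to 100.00% in steps of 0.01%, on a $1,000 loan) and counts in basis points, where 1 basis point is 0.01%. The class name InterestTableSketch and the helper interestFor are hypothetical.

```java
// Illustrative sketch: names and the basis-point interpretation are assumptions.
final class InterestTableSketch {
    // Interest for a rate given in basis points (1 basis point = 0.01%).
    static double interestFor(double principal, int basisPoints) {
        return principal * basisPoints / 10_000.0;
    }

    public static void main(String[] args) {
        double principal = 1_000; // the $1,000 loan from the problem statement
        System.out.printf("%15s    %23s%n", "Interest Rate", "Interest to be Paid ($)");
        // The counter is an int, so every row is visited exactly once and the
        // loop bound is compared exactly; the rate is recomputed for each row.
        for (int bp = 1; bp <= 10_000; bp++) {
            System.out.printf("%14.2f%%    %23.2f%n", bp / 100.0, interestFor(principal, bp));
        }
    }
}
```

Each rate is computed with a single division from an exact integer, so an error in one row does not carry over into the next.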

Consider that we want to compute the sum of the series 1, 0.99, 0.98, 0.97, … 0.03, 0.02, 0.01: 100 floating-point values in total, each 0.01 smaller than the previous one. Can you order the following four solutions (Solutions A - D) by accuracy, from the one with the largest error to the one with the smallest error, and explain your reasoning?

Solution A

  public class SumOfFloatSeries1 {
      public static void main(String[] args) {
          float sum, value;

          sum = 0.0f;
          value = 1.0f;
          for (int i = 0; i < 100; i++) {
              sum += value;
              value -= 0.01f;
          }
          System.out.printf("The sum is %.20f\n", sum);
      }
  }

Solution B

  public class SumOfFloatSeries1 {
      public static void main(String[] args) {
          float sum, value;

          sum = 0.0f;
          value = 0.01f;
          for (int i = 0; i < 100; i++) {
              sum += value;
              value += 0.01f;
          }
          System.out.printf("The sum is %.20f\n", sum);
      }
  }

Solution C

  public class SumOfFloatSeries1 {
      public static void main(String[] args) {
          float sum, value;
          int intValue;

          sum = 0.0f;
          intValue = 1;
          for (int i = 0; i < 100; i++) {
              value = intValue / 100.f;
              sum += value;
              intValue++;
          }
          System.out.printf("The sum is %.20f\n", sum);
      }
  }

Solution D

  public class SumOfFloatSeries1 {
      public static void main(String[] args) {
          float sum;
          int intValue, intSum;

          intSum = 0;
          intValue = 1;
          for (int i = 0; i < 100; i++) {
              intSum += intValue;
              intValue++;
          }
          sum = intSum / 100.f;
          System.out.printf("The sum is %.20f\n", sum);
      }
  }
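
To check an ordering empirically, one option is to measure each result against the exact sum. The exact value is 5050 / 100 = 50.5, which is representable exactly in binary floating point. The sketch below is a small measuring harness under that observation; the class name SeriesErrorSketch and its methods are illustrative, and only two of the four approaches are reproduced as examples.

```java
// Illustrative sketch: names are assumptions; the two sum methods mirror Solutions B and D.
final class SeriesErrorSketch {
    // Exact value of 0.01 + 0.02 + ... + 1.00, that is 5050 / 100.
    static final double EXACT_SUM = 50.5;

    // Absolute difference between a computed float sum and the exact sum.
    static double absoluteError(float computed) {
        return Math.abs((double) computed - EXACT_SUM);
    }

    // Same approach as Solution B: accumulate both the value and the sum in float.
    static float sumAscendingFloats() {
        float sum = 0.0f;
        float value = 0.01f;
        for (int i = 0; i < 100; i++) {
            sum += value;
            value += 0.01f;
        }
        return sum;
    }

    // Same approach as Solution D: sum exactly in int, then divide once.
    static float sumViaIntegers() {
        int intSum = 0;
        for (int i = 1; i <= 100; i++) {
            intSum += i;
        }
        return intSum / 100.f;
    }

    public static void main(String[] args) {
        System.out.printf("float accumulation error: %.10f%n", absoluteError(sumAscendingFloats()));
        System.out.printf("integer sum error:        %.10f%n", absoluteError(sumViaIntegers()));
    }
}
```

The other two solutions can be added to the harness in the same way to compare all four errors side by side.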