Structuring computations in batches of five or ten

Floating-point operations are pipelined in the 440 processor: a new floating-point calculation can start every cycle, and each result takes five cycles to complete. To keep the floating-point unit busy, organize floating-point computations so that independent, step-wise operations come in batches of five; that is, arrays of five elements and loops of five iterations. For the 440d, which has two FPUs, use batches of ten.
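The arithmetic behind these batch sizes can be sketched with a simple issue model. The model is an idealization assumed here for illustration, not a measured 440 or 440d timing: each FPU accepts one independent operation per cycle, and every result is ready a fixed number of cycles after issue. The function names pipelined_cycles and serialized_cycles are hypothetical.

```c
#include <assert.h>

/*
 * Idealized model (an assumption, not vendor timing data):
 * each FPU accepts one independent operation per cycle, and a result
 * is available `latency` cycles after its operation is issued.
 */
static int pipelined_cycles(int n_ops, int n_fpus, int latency)
{
    assert(n_ops > 0 && n_fpus > 0 && latency > 0);

    /* Issue slots needed when n_fpus operations start per cycle:
       ceil(n_ops / n_fpus). */
    int issue_slots = (n_ops + n_fpus - 1) / n_fpus;

    /* The last slot issues (issue_slots - 1) cycles after the first
       and completes `latency` cycles after that. */
    return (issue_slots - 1) + latency;
}

/* Cost if every operation had to wait for the previous result,
   so that nothing overlaps in the pipeline. */
static int serialized_cycles(int n_ops, int latency)
{
    assert(n_ops > 0 && latency > 0);
    return n_ops * latency;
}
```

Under this model, with a latency of five cycles, five independent operations on one FPU or ten on two FPUs both finish in 9 cycles, only four more than the 5 cycles of a single operation, whereas the same work with no overlap would take 25 and 50 cycles respectively. Batches smaller than the latency leave issue slots empty, which is the motivation for sizing batches to match it.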

For example, with the 440d, at high optimization with -qfloat=norngchk, the function in Figure 4 should compute ten reciprocal roots in about five cycles more than it takes to compute a single reciprocal root. This is because the compiler starts one pair of reciprocal roots in parallel on the two FPUs and then uses the otherwise empty pipeline cycles to run the four remaining pairs.

Figure 4. A function to calculate reciprocal roots for arrays of ten elements
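#include <math.h>

/* Compute f[i] = 1/sqrt(x[i]) for exactly ten elements. */
__inline void ten_reciprocal_roots (double* x, double* f)
{
/* x and f never refer to overlapping storage */
#pragma disjoint (*x, *f)
    int i;
    for (i = 0; i < 10; i++)
        f[i] = 1.0 / sqrt (x[i]);
}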

The definition in Figure 5 wraps the inlined, optimized ten_reciprocal_roots function in a function that accepts arrays of any number of elements. The wrapper passes the values to ten_reciprocal_roots in batches of ten and then computes any remaining elements individually.

Figure 5. A function to pass values in batches of ten
static void unaligned_reciprocal_roots (double* x, double* f, int n)
{
#pragma disjoint (*x, *f)
    /* full batches of ten */
    while (n >= 10) {
        ten_reciprocal_roots (x, f);
        x += 10;
        f += 10;
        n -= 10;
    }
    /* remainder, one element at a time */
    while (n > 0) {
        *f = 1.0 / sqrt (*x);
        f++, x++;
        n--;
    }
}