The Blue Gene/L architecture can load two double-precision values in parallel in a single cycle, provided that the load address is aligned so that the loaded values do not cross a cache-line boundary (32 bytes). If they cross this boundary, the hardware generates an alignment trap. The trap might cause the program to crash, or the kernel might fix it up at run time at the cost of a severe performance penalty.
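The boundary condition can be made concrete with a little address arithmetic. The following is a minimal sketch, not part of the Blue Gene/L toolchain: the helper name crosses_boundary and its parameters are hypothetical, and the 32-byte boundary and the 16-byte access size (two 8-byte doubles) are taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an access of `len` bytes starting at
   address `addr` spans a `boundary`-byte boundary, and 0 otherwise.
   Assumes `boundary` is a power of two and 0 < len <= boundary. */
static int crosses_boundary(uintptr_t addr, size_t len, size_t boundary)
{
    /* Offset of the first byte within its boundary-sized line. */
    size_t offset = (size_t)(addr & (boundary - 1));
    return offset + len > boundary;
}
```

Under these assumptions, a 16-byte access at offset 0 or 16 within a 32-byte line stays inside the line, whereas one at offset 24 spans two lines. Every 16-byte-aligned address has offset 0 or 16, which is why asserting 16-byte alignment is sufficient.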
The compiler does not generate these parallel load and store instructions unless it is sure that it is safe to do so. For non-pointer local and global variables, the compiler knows when this is safe. To allow the compiler to generate these parallel loads and stores for accesses through pointers, include code that tests for correct alignment and gives the compiler hints.
To test for alignment, first create one version of a function which asserts the alignment of an input variable at that point in the program flow. You can use the C/C++ __alignx builtin function or the Fortran ALIGNX function to inform the compiler that the incoming data is correctly aligned according to a specific byte boundary, so it can efficiently generate loads and stores.
The function takes two arguments. The first argument is an integer constant expressing the number of alignment bytes (must be a positive power of two). The second argument is the variable name, typically a pointer to a memory address.
The C/C++ prototype for the function is:
extern
#ifdef __cplusplus
"builtin"
#endif
void __alignx (int n, const void *addr)
Here n is the number of bytes. For example, __alignx(16, y) specifies that the address y is 16-byte aligned.
In Fortran, the built-in subroutine is ALIGNX(K,M) , where K is of type INTEGER(4), and M is a variable of any type. When M is an integer pointer, the argument refers to the address of the pointee.
Figure 6 asserts that the variables x and f are aligned along 16-byte boundaries.
#include <math.h>

__inline void aligned_ten_reciprocal_roots (double* x, double* f)
{
#pragma disjoint (*x, *f)
  int i;
  __alignx (16, x);
  __alignx (16, f);
  for (i=0; i < 10; i++)
    f[i]= 1.0 / sqrt (x[i]);
}
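Because __alignx is specific to the IBM XL compilers, source that must also build with other compilers needs a guard around it. The following is a minimal sketch under stated assumptions: the macro name ASSERT_ALIGNED and the function aligned_scale are hypothetical, and the test on __IBMC__ and __IBMCPP__ assumes that the XL C and C++ compilers predefine those macros. With other compilers the hint expands to nothing, so the code stays correct but the compiler receives no alignment information.

```c
#include <stddef.h>

/* Assumption: XL C predefines __IBMC__ and XL C++ predefines __IBMCPP__. */
#if defined(__IBMC__) || defined(__IBMCPP__)
#define ASSERT_ALIGNED(n, p) __alignx((n), (p))
#else
#define ASSERT_ALIGNED(n, p) ((void)0)   /* no hint on other compilers */
#endif

/* Hypothetical example: multiplies each element of x by a and stores the
   result in f. The caller must guarantee that x and f really are 16-byte
   aligned; the hint informs the compiler and does not test the addresses. */
static void aligned_scale(const double *x, double *f, size_t n, double a)
{
    size_t i;
    ASSERT_ALIGNED(16, x);
    ASSERT_ALIGNED(16, f);
    for (i = 0; i < n; i++)
        f[i] = a * x[i];
}
```

Keeping the hint behind one macro means the alignment assumption is stated in a single place and can be disabled without touching the loop.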
After you create a function to handle input variables that are correctly aligned, you can then create a function that tests for alignment and calls the appropriate function to perform the calculations. The function in Figure 7 checks whether the incoming values are correctly aligned. Then it calls the aligned (Example 1-6) or unaligned (Example 1-4) version of the function according to the result.
void reciprocal_roots (double *x, double *f, int n)
{
  /* The helper functions take only x and f and always process ten
     elements, so n is not passed on. */
  /* are both x & f 16 byte aligned? */
  if ( ((((int) x) | ((int) f)) & 0xf) == 0)
    /* This could also be done as:
       if (((int) x % 16 == 0) && ((int) f % 16) == 0) */
    aligned_ten_reciprocal_roots (x, f);
  else
    ten_reciprocal_roots (x, f);
}
The alignment test in Figure 7 provides an optimized way to test for 16-byte alignment: it performs a bit-wise OR on the two incoming addresses and tests whether the lowest four bits of the result are 0, which is true only if both addresses are 16-byte aligned.
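To see why the single OR test is equivalent to two separate checks, the following minimal sketch writes out both forms. The function names are hypothetical, and the casts use uintptr_t from <stdint.h> rather than int, on the assumption that the code might also be compiled where pointers are wider than int.

```c
#include <stdint.h>

/* Hypothetical helper: combined test. Any bit set among the low four bits
   of either address survives the OR, so the masked result is 0 only if
   both addresses are 16-byte aligned. */
static int both_aligned_or(uintptr_t a, uintptr_t b)
{
    return ((a | b) & 0xf) == 0;
}

/* Hypothetical helper: the same condition written as two remainder tests. */
static int both_aligned_mod(uintptr_t a, uintptr_t b)
{
    return (a % 16 == 0) && (b % 16 == 0);
}

/* Pointer front end, as a caller such as reciprocal_roots might use it. */
static int pointers_aligned_16(const void *x, const void *f)
{
    return both_aligned_or((uintptr_t)x, (uintptr_t)f);
}
```

The OR form needs one mask and one comparison for both pointers, whereas the remainder form tests each address separately; both give the same answer for every pair of addresses.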