The Blue Gene/L 440d processor's dual FPU includes special instructions for parallel computations. The compiler tries to pair adjacent single-precision or double-precision floating-point values so that it can operate on them in parallel. You can therefore speed up computations by defining data objects that occupy adjacent memory blocks and are naturally aligned. Such objects include arrays or structures of floating-point values and complex data types.
Whether you use an array, a structure, or a complex scalar, the compiler searches for sequential pairs of data for which it can generate parallel instructions. For example, the C code in Figure 1 allows each pair of elements in a structure to be operated on in parallel.
struct quad {
    double a, b, c, d;
};

struct quad x, y, z;

void foo()
{
    z.a = x.a + y.a;
    z.b = x.b + y.b;
    /* can load parallel (x.a, x.b), and (y.a, y.b), do parallel add,
       and store parallel (z.a, z.b) */

    z.c = x.c + y.c;
    z.d = x.d + y.d;
    /* can load parallel (x.c, x.d), and (y.c, y.d), do parallel add,
       and store parallel (z.c, z.d) */
}
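The same pairing idea can be sketched for arrays. The example below is illustrative only: the names xs, ys, zs, N, and add_pairs are hypothetical, and the 16-byte alignment request uses the GCC-style aligned attribute, which is assumed here to be accepted by your compiler. Whether parallel instructions are actually generated depends on the compiler and the optimization options in effect.

```c
#include <stddef.h>

#define N 8   /* even, so the elements split cleanly into adjacent pairs */

/* Hypothetical arrays for illustration. The aligned attribute is an
   assumption: it asks that each array start on a 16-byte boundary so that
   every (even, odd) pair of doubles sits in one aligned 16-byte block. */
double xs[N] __attribute__((aligned(16)));
double ys[N] __attribute__((aligned(16)));
double zs[N] __attribute__((aligned(16)));

/* Adds the arrays two elements at a time, so each iteration exposes one
   adjacent pair (i, i + 1) that a compiler could load, add, and store in
   parallel. */
void add_pairs(void)
{
    size_t i;

    for (i = 0; i < N; i += 2) {
        zs[i]     = xs[i]     + ys[i];
        zs[i + 1] = xs[i + 1] + ys[i + 1];
    }
}
```

Writing the loop body as an explicit pair only makes the pairing visible to the reader; a plain one-element-per-iteration loop expresses the same computation.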
The advantage of using complex types in arithmetic operations is that the compiler automatically uses parallel add, subtract, and multiply instructions when complex types appear as operands to addition, subtraction, and multiplication operators. Furthermore, the data that you provide does not actually need to represent complex numbers: internally, a complex value is simply stored as two real values. See Complex type manipulation functions for a description of the built-in functions available for Blue Gene/L. These functions are designed to manipulate complex-type data efficiently and include a function to convert non-complex data to complex types.
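As a minimal sketch of treating non-complex data as a complex value, the following uses only standard C99 complex arithmetic rather than the Blue Gene/L built-in functions. The function name add_pair_as_complex is hypothetical. It relies on the C99 guarantee that a double complex has the same layout as an array of two doubles, real part first.

```c
#include <complex.h>
#include <string.h>

/* Hypothetical helper: adds two pairs of unrelated doubles by viewing each
   pair as one complex value. */
void add_pair_as_complex(const double x[2], const double y[2], double z[2])
{
    double complex cx, cy, cz;

    /* C99 lays out a double complex like double[2] (real part first),
       so each pair can be copied straight into a complex object. */
    memcpy(&cx, x, sizeof cx);
    memcpy(&cy, y, sizeof cy);

    /* One complex addition adds the real and imaginary parts independently,
       which is exactly the two element-wise additions we want. */
    cz = cx + cy;

    memcpy(z, &cz, sizeof cz);
}
```

Note that only addition and subtraction act element by element. Complex multiplication combines the real and imaginary parts, so it does not compute the element-wise products of two pairs.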