You can use prefetching to instruct the compiler to load specific data from main memory into the cache before the data is referenced. Hardware that is POWER3(TM) or later can perform some prefetching automatically, but because compiler-assisted software prefetching can use information directly from your source code, specifying prefetch directives can significantly reduce the number of cache misses.
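XL Fortran provides directives for compiler-assisted software prefetching.
The forms of the PREFETCH directive used in this section are PREFETCH_BY_LOAD, PREFETCH_BY_STREAM_BACKWARD, PREFETCH_BY_STREAM_FORWARD, PREFETCH_FOR_LOAD, and PREFETCH_FOR_STORE.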
To use the PREFETCH_BY_STREAM_BACKWARD, PREFETCH_BY_STREAM_FORWARD, PREFETCH_FOR_LOAD and PREFETCH_FOR_STORE directives, you must compile for PowerPC(R) hardware.
When you prefetch a variable, the memory block that includes the variable address is loaded into the cache. A memory block is the same size as a cache line. Because the variable you are loading into the cache may appear anywhere within the memory block, a single prefetch may not bring all the elements of an array into the cache.
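As a rough illustration of the arithmetic involved, the following standard Fortran sketch computes how many elements fit in one cache line and which array elements would need to be referenced to touch every line of an array. The 64-byte line size, the 4-byte element size, and the assumption that the array starts on a cache-line boundary are illustrative assumptions, not properties the compiler guarantees. The program and variable names are hypothetical.

```fortran
program cache_line_coverage
  implicit none
  ! Assumed values for illustration only; check your hardware for the real line size.
  integer, parameter :: line_bytes = 64   ! assumed cache line size in bytes
  integer, parameter :: elem_bytes = 4    ! size of one REAL*4 element in bytes
  integer, parameter :: n = 2**5          ! number of elements in the array
  integer :: per_line, n_lines, i

  per_line = line_bytes / elem_bytes          ! elements held by one cache line
  n_lines  = (n + per_line - 1) / per_line    ! ceiling division: lines spanned

  print '(a,i0)', 'elements per cache line: ', per_line
  print '(a,i0)', 'cache lines spanned: ', n_lines

  ! Assuming the array starts on a cache-line boundary, referencing one
  ! element per line is enough to touch every line of the array.
  do i = 1, n, per_line
    print '(a,i0)', 'reference element ', i
  end do
end program cache_line_coverage
```

With these assumed values the array spans two cache lines, so references to elements 1 and 17 cover it. If the array does not start on a line boundary, it spans three lines, and these two references leave its last elements uncovered.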
These directives may appear anywhere in your source code where executable constructs may appear.
These directives can add run-time overhead to your program. Therefore, use them only where necessary.
To maximize the effectiveness of the prefetch directives, it is recommended that you specify the LIGHT_SYNC directive after a single prefetch or at the end of a series of prefetches.
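The placement described above might look like the following sketch, in which a single LIGHT_SYNC follows a short series of prefetches. The subroutine scale_array and its arguments are hypothetical and serve only as illustration. The sketch assumes that LIGHT_SYNC takes no arguments and uses the same !IBM* trigger as the prefetch directives. The PREFETCH_FOR_LOAD and PREFETCH_FOR_STORE forms require compiling for PowerPC hardware, and a compiler that does not recognize the directives treats them as comments.

```fortran
! Hypothetical routine: scales y into x after prefetching the first
! cache line of each array.
subroutine scale_array(x, y, n)
  implicit none
  integer, intent(in)  :: n
  real(4), intent(in)  :: y(n)
  real(4), intent(out) :: x(n)
  integer :: i

  ! A series of prefetches: y is read and x is written in the loop below.
!IBM* PREFETCH_FOR_LOAD (y(1))
!IBM* PREFETCH_FOR_STORE (x(1))
  ! One LIGHT_SYNC at the end of the series (assumed argument-free syntax).
!IBM* LIGHT_SYNC

  do i = 1, n
    x(i) = 2.0 * y(i)
  end do
end subroutine scale_array
```

Each prefetch here covers only the cache line that contains the first element of its array, so longer arrays would need further prefetches.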
Example 1: This example shows valid uses of the PREFETCH_BY_LOAD, PREFETCH_FOR_LOAD, and PREFETCH_FOR_STORE directives.
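For this example, assume that the size of the cache line is 64 bytes and that none of the declared data items exist in the cache at the beginning of the program. The rationale for using the directives is as follows. Every element of ARRA is assigned in the loop, so PREFETCH_FOR_STORE brings the array into the cache for writing. A 64-byte cache line holds 16 REAL*4 elements, so if ARRA begins on a cache-line boundary, the references ARRA(1) and ARRA(2**4+1) address its first and second 16 elements. Every element of ARRC is read, so PREFETCH_FOR_LOAD brings ARRC into the cache in the same way. The variables A, B, C, TEMP, I, and K and the array element ARRB(I*K) are used in each iteration of the loop, so PREFETCH_BY_LOAD brings them into the cache.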
PROGRAM GOODPREFETCH
REAL*4 A, B, C, TEMP
REAL*4 ARRA(2**5), ARRB(2**10), ARRC(2**5)
INTEGER(4) I, K
! Bring ARRA into cache for writing.
!IBM* PREFETCH_FOR_STORE (ARRA(1), ARRA(2**4+1))
! Bring ARRC into cache for reading.
!IBM* PREFETCH_FOR_LOAD (ARRC(1), ARRC(2**4+1))
! Bring all variables into the cache.
!IBM* PREFETCH_BY_LOAD (A, B, C, TEMP, I , K)
! A subroutine is called to allow clock cycles to pass so that the
! data is loaded into the cache before the data is referenced.
CALL FOO()
K = 32
DO I = 1, 2 ** 5
! Bring ARRB(I*K) into the cache
!IBM* PREFETCH_BY_LOAD (ARRB(I*K))
A = -I
B = I + 1
C = I + 2
TEMP = SQRT(B*B - 4*A*C)
ARRA(I) = ARRC(I) + (-B + TEMP) / (2*A)
ARRB(I*K) = (-B - TEMP) / (2*A)
END DO
END PROGRAM GOODPREFETCH
Example 2: In this example, assume that the size of the cache line is 256 bytes and that none of the declared data items are initially stored in the cache or in registers. All elements of arrays ARRA and ARRC are then read into the cache.
PROGRAM PREFETCH_STREAM
REAL*4 A, B, C, TEMP
REAL*4 ARRA(2**5), ARRC(2**5), ARRB(2**10)
INTEGER*4 I, K
! All elements of ARRA and ARRC are read into the cache.
!IBM* PREFETCH_BY_STREAM_FORWARD(ARRA(1))
! You can substitute PREFETCH_BY_STREAM_BACKWARD (ARRC(2**5)) to read all
! elements of ARRA and ARRC into the cache.
K = 32
DO I = 1, 2**5
A = -I
B = I + 1
C = I + 2
TEMP = SQRT(B*B - 4*A*C)
ARRA(I) = ARRC(I) + (-B + TEMP) / (2*A)
ARRB(I*K) = (-B - TEMP) / (2*A)
END DO
END PROGRAM PREFETCH_STREAM
For information on applying prefetch techniques to loops with a large iteration count, see the STREAM_UNROLL directive.