Setting OMP and SMP run-time options

The following sections describe how to set the environment variables that control the run-time behavior of parallel code.

The XLSMPOPTS environment variable

The XLSMPOPTS environment variable allows you to specify options that affect SMP execution. You can declare XLSMPOPTS by using the following bash command format:

XLSMPOPTS=["]runtime_option_name=option_setting[:runtime_option_name=option_setting]...["]

You can specify option names and settings in uppercase or lowercase. You can add blanks before and after the colons and equal signs to improve readability. However, if the XLSMPOPTS option string contains embedded blanks, you must enclose the entire option string in double quotation marks (").
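
For example, either of the following bash commands sets the schedule and parthds run-time options (both are described later in this section); the second form contains embedded blanks and therefore requires double quotation marks:

export XLSMPOPTS=schedule=dynamic=10:parthds=4
export XLSMPOPTS="schedule = dynamic=10 : parthds = 4"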

You can specify the following run-time options with the XLSMPOPTS environment variable:

schedule
Selects the scheduling type and chunk size to be used as the default at run time. The scheduling type that you specify will only be used for loops that were not already marked with a scheduling type at compilation time.

Work is assigned to threads in a different manner, depending on the scheduling type and chunk size used. A brief description of the scheduling types and their influence on how work is assigned follows:

dynamic or guided
The run-time library dynamically schedules parallel work for threads on a "first-come, first-do" basis. "Chunks" of the remaining work are assigned to available threads until all work has been assigned. Work is not assigned to threads that are asleep.
static
Chunks of work are assigned to the threads in a "round-robin" fashion. Work is assigned to all threads, both active and asleep. The system must activate sleeping threads in order for them to complete their assigned work.
affinity
The run-time library performs an initial division of the iterations into number_of_threads partitions. The number of iterations that these partitions contain is:
   CEILING(number_of_iterations / number_of_threads)
These partitions are then assigned to each of the threads. It is these partitions that are then subdivided into chunks of iterations. If a thread is asleep, the threads that are active will complete their assigned partition of work.
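
For example, with the affinity scheduling type, a loop of 1000 iterations run by 4 threads is initially divided into 4 partitions of
   CEILING(1000 / 4) = 250
iterations each, one preassigned to each thread.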

Choosing chunking granularity is a tradeoff between overhead and load balancing. The syntax for this option is schedule=suboption, where the suboptions are defined as follows:

affinity[=n]
As described previously, the iterations of a loop are initially divided into partitions, which are then preassigned to the threads. Each of these partitions is then further subdivided into chunks that contain n iterations. If you have not specified n, a chunk consists of CEILING(number_of_iterations_left_in_local_partition / 2) loop iterations.

When a thread becomes available, it takes the next chunk from its preassigned partition. If there are no more chunks in that partition, the thread takes the next available chunk from a partition preassigned to another thread.

dynamic[=n]
The iterations of a loop are divided into chunks that contain n iterations each. If you have not specified n, a chunk consists of CEILING(number_of_iterations / number_of_threads) iterations.
guided[=n]
The iterations of a loop are divided into progressively smaller chunks until a minimum chunk size of n loop iterations is reached. If you have not specified n, the default value for n is 1 iteration.

The first chunk contains CEILING(number_of_iterations / number_of_threads) iterations. Subsequent chunks consist of CEILING(number_of_iterations_left / number_of_threads) iterations; a worked example appears after this list.

static[=n]
The iterations of a loop are divided into chunks that contain n iterations. Threads are assigned chunks in a "round-robin" fashion. This is known as block cyclic scheduling. If the value of n is 1, the scheduling type is specifically referred to as cyclic scheduling.

If you have not specified n, the chunks will contain CEILING(number_of_iterations / number_of_threads) iterations. Each thread is assigned one of these chunks. This is known as block scheduling.

If you have not specified schedule, the default is set to schedule=static, resulting in block scheduling.
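
For example, with guided scheduling, 1000 iterations, 4 threads, and the default minimum chunk size of 1, the chunk sizes decrease as follows:

   1st chunk: CEILING(1000 / 4) = 250 iterations (750 remaining)
   2nd chunk: CEILING(750 / 4)  = 188 iterations (562 remaining)
   3rd chunk: CEILING(562 / 4)  = 141 iterations (421 remaining)

and so on, until the minimum chunk size of 1 is reached.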

Related Information:
For more information, see the description of the SCHEDULE directive in the XL Fortran Advanced Edition V10.1 for Linux Language Reference.
Parallel execution options
The parallel execution options are parthds, usrthds, stack, startproc, and stride; a combined example appears after this list:
parthds=num
Specifies the number of threads (num) to be used for parallel execution of code that you compiled with the -qsmp option. By default, this is equal to the number of online processors. Some applications cannot use more than a certain maximum number of processors, while others can achieve performance gains by using more threads than there are processors.

This option allows you full control over the number of execution threads. The default value for num is 1 if you did not specify -qsmp. Otherwise, it is the number of online processors on the machine. For more information, see the NUM_PARTHDS intrinsic function.

usrthds=num
Specifies the maximum number of threads (num) that you expect your code to create explicitly, if it performs explicit thread creation. The default value for num is 0. For more information, see the NUM_USRTHDS intrinsic function in the XL Fortran Advanced Edition V10.1 for Linux Language Reference.
stack=num
Specifies the largest amount of space in bytes (num) that a thread's stack will need. The default value for num is 4194304 (4 MB).

Set stack=num so it is within the acceptable upper limit. num can be up to 256 MB for 32-bit mode, or up to the limit imposed by system resources for 64-bit mode. An application that exceeds the upper limit may cause a segmentation fault.

startproc=cpu_id
Enables thread binding and specifies the CPU ID to which the first thread binds. If the value provided is outside the range of available processors, the SMP run time issues a warning message and no threads are bound.
stride=num
Specifies the increment used to determine the CPU ID to which subsequent threads bind. num must be greater than or equal to 1. If the value provided causes a thread to bind to a CPU outside the range of available processors, a warning message is issued and no threads are bound.
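
For example, the following bash command (the values shown are illustrative only) runs 8 threads, each with an 8 MB stack, and binds the first thread to CPU 0 and each subsequent thread to every second CPU ID:

export XLSMPOPTS="parthds=8 : stack=8388608 : startproc=0 : stride=2"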
Performance tuning options
When a thread completes its work and there is no new work to do, it can go into either a "busy-wait" state or a "sleep" state. In "busy-wait", the thread keeps executing in a tight loop looking for additional new work. This state is highly responsive but harms the overall utilization of the system. When a thread sleeps, it completely suspends execution until another thread signals it that there is work to do. This state provides better utilization of the system but introduces extra overhead for the application.

The xlsmp run-time library routines use both "busy-wait" and "sleep" states in their approach to waiting for work. You can control these states with the spins, yields, and delays options.

During the busy-wait search for work, the thread repeatedly scans the work queue up to num times, where num is the value that you specified for the spins option. If a thread cannot find work during a given scan, it intentionally wastes cycles in a delay loop that executes num times, where num is the value that you specified for the delays option. This delay loop consists of a single meaningless iteration, so the actual time it takes varies among processors. If the number of scans specified by spins is exceeded and the thread still cannot find work, the thread yields the current time slice (the processor time allocated to that thread) to the other threads. The thread yields its time slice up to num times, where num is the value that you specified for the yields option. If this limit is also exceeded, the thread goes to sleep.

In summary, the ordered approach to looking for work consists of the following steps:

  1. Scan the work queue for up to spins number of times. If no work is found in a scan, then loop delays number of times before starting a new scan.
  2. If work has not been found, then yield the current time slice.
  3. Repeat the above steps up to yields number of times.
  4. If work has still not been found, then go to sleep.

The syntax for specifying these options is as follows:

spins[=num]
where num is the number of spins before a yield. The default value for spins is 100.
yields[=num]
where num is the number of yields before a sleep. The default value for yields is 10.
delays[=num]
where num is the number of delays while busy-waiting. The default value for delays is 500.

Zero is a special value for spins and yields, as it can be used to force complete busy-waiting. Normally, in a benchmark test on a dedicated system, you would set both options to zero. However, you can set them individually to achieve other effects.

For instance, on a dedicated 8-way SMP, setting these options to the following:

export XLSMPOPTS="parthds=8 : schedule=dynamic=10 : spins=0 : yields=0"

results in one thread per CPU, with each thread assigned chunks consisting of 10 iterations each, with busy-waiting when there is no immediate work to do.

Options to enable and control dynamic profiling
You can use dynamic profiling to reevaluate the compiler's decision to parallelize loops in a program. The three options that control dynamic profiling are parthreshold, seqthreshold, and profilefreq; a combined example appears at the end of this section.
parthreshold=num
Specifies the time, in milliseconds, below which a loop must execute serially; a parallelized loop whose parallel execution time falls below this threshold is serialized. If you set parthreshold to 0, every loop that the compiler has parallelized executes in parallel. The default setting is 0.2 milliseconds, meaning that a loop requiring fewer than 0.2 milliseconds to execute in parallel is serialized.

Typically, parthreshold is set to be equal to the parallelization overhead. If the computation in a parallelized loop is very small and the time taken to execute these loops is spent primarily in the setting up of parallelization, these loops should be executed sequentially for better performance.

seqthreshold=num
Specifies the time, in milliseconds, beyond which a loop that was previously serialized by the dynamic profiler should revert to being a parallel loop. The default setting is 5 milliseconds, meaning that if a loop requires more than 5 milliseconds to execute serially, it should be parallelized.

seqthreshold acts as the reverse of parthreshold.

profilefreq=num
Specifies the frequency with which the dynamic profiler revisits a loop to determine whether it is better suited to parallel or serial execution. The execution time of a loop can depend on its input data, so a loop that was serialized during one pass of dynamic profiling may benefit from parallelization in subsequent executions, when the input data differs. Therefore, these loops need to be reexamined periodically to reevaluate the decision to serialize them at run time.

The allowed values for this option are the numbers from 0 to 32. If you set profilefreq to one of these values, the following results will occur.

  • If profilefreq is 0, all profiling is turned off, regardless of other settings. The overheads that occur because of profiling will not be present.
  • If profilefreq is 1, loops parallelized automatically by the compiler will be monitored every time they are executed.
  • If profilefreq is 2, loops parallelized automatically by the compiler will be monitored every other time they are executed.
  • If profilefreq is a value n from 2 to 32, each loop will be monitored once every nth time it is executed.
  • If profilefreq is greater than 32, then 32 is assumed.

It is important to note that dynamic profiling is not applicable to user-specified parallel loops (for example, loops for which you specified the PARALLEL DO directive).
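
For example, the following bash command (the threshold values are illustrative only) serializes parallelized loops that run in under 0.5 milliseconds, reverts serialized loops that take longer than 10 milliseconds back to parallel execution, and profiles each automatically parallelized loop once in every 4 executions:

export XLSMPOPTS="parthreshold=0.5 : seqthreshold=10 : profilefreq=4"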

OpenMP environment variables

The following environment variables, which are included in the OpenMP standard, allow you to control the execution of parallel code.

Note:
If you specify both the XLSMPOPTS environment variable and an OpenMP environment variable, the OpenMP environment variable takes precedence.
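
For example, if both of the following variables are set, the program runs with 8 threads, because the OMP_NUM_THREADS environment variable takes precedence over the parthds suboption of XLSMPOPTS:

export XLSMPOPTS=parthds=4
export OMP_NUM_THREADS=8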

OMP_DYNAMIC environment variable

The OMP_DYNAMIC environment variable enables or disables dynamic adjustment of the number of threads available for the execution of parallel regions. The syntax is as follows:

OMP_DYNAMIC=TRUE | FALSE

If you set this environment variable to TRUE, the run-time environment can adjust the number of threads it uses for executing parallel regions so that it makes the most efficient use of system resources. If you set this environment variable to FALSE, dynamic adjustment is disabled.

The default value for OMP_DYNAMIC is FALSE. If your code needs to use a specific number of threads to run correctly, you should disable dynamic thread adjustment.

The omp_set_dynamic subroutine takes precedence over the OMP_DYNAMIC environment variable.
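
For example, the following bash command enables dynamic adjustment of the number of threads:

export OMP_DYNAMIC=TRUE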

OMP_NESTED environment variable

The OMP_NESTED environment variable enables or disables nested parallelism. The syntax is as follows:

OMP_NESTED=TRUE | FALSE

If you set this environment variable to TRUE, nested parallelism is enabled. This means that the run-time environment might deploy extra threads to form the team of threads for the nested parallel region. If you set this environment variable to FALSE, nested parallelism is disabled.

The default value for OMP_NESTED is FALSE.

The omp_set_nested subroutine takes precedence over the OMP_NESTED environment variable.

Currently, XL Fortran does not support OpenMP nested parallelism.

OMP_NUM_THREADS environment variable

The OMP_NUM_THREADS environment variable sets the number of threads that a program will use when it runs. The syntax is as follows:

OMP_NUM_THREADS=num

num
is the maximum number of threads to use if dynamic adjustment of the number of threads is enabled, or the exact number of threads to use if dynamic adjustment is disabled. It must be a positive, scalar integer.

The default number of threads that a program uses when it runs is the number of online processors on the machine.

If you specify the number of threads with both the PARTHDS suboption of the XLSMPOPTS environment variable and the OMP_NUM_THREADS environment variable, the OMP_NUM_THREADS environment variable takes precedence. The omp_set_num_threads subroutine takes precedence over the OMP_NUM_THREADS environment variable.

If the number of threads you request exceeds the number your execution environment can support, your application will terminate.

The following example shows how you can set the OMP_NUM_THREADS environment variable:

export OMP_NUM_THREADS=16

OMP_SCHEDULE environment variable

The OMP_SCHEDULE environment variable applies to PARALLEL DO and work-sharing DO directives that have a schedule type of RUNTIME. The syntax is as follows:

OMP_SCHEDULE=sched_type[,chunk_size]

sched_type
is either DYNAMIC, GUIDED, or STATIC.
chunk_size
is a positive, scalar integer that represents the chunk size.

This environment variable is ignored for PARALLEL DO and work-sharing DO directives that have a schedule type other than RUNTIME.

If you have not specified a schedule type either at compile time (through a directive) or at run time (through the OMP_SCHEDULE environment variable or the SCHEDULE option of the XLSMPOPTS environment variable), the default schedule type is STATIC. In that case, where N is the total number of threads and Iters is the total number of iterations in the DO loop, the default chunk size for the first N - 1 threads is:

chunk_size = CEILING(Iters / N)

For the Nth thread, it is:

chunk_size = Iters - ((N - 1) * CEILING(Iters / N))
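
For example, with Iters = 1000 and N = 3, the first two threads each receive a chunk of CEILING(1000 / 3) = 334 iterations, and the third thread receives the remaining 1000 - (2 * 334) = 332 iterations.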

If you specify both the SCHEDULE option of the XLSMPOPTS environment variable and the OMP_SCHEDULE environment variable, the OMP_SCHEDULE environment variable takes precedence.

The following examples show how you can set the OMP_SCHEDULE environment variable:

export OMP_SCHEDULE="GUIDED,4"
export OMP_SCHEDULE="DYNAMIC"