OpenMP: Compiler Directive Standard


The material here is old, and hence some of it may be outdated. Use the Google, Luke. Or at least the links below.

This material summarizes some of the implementation details for using OpenMP. However, for B673 the most important take-home messages are:

  1. all variables are shared unless otherwise specified
  2. use guided self-scheduling whenever a scheduling choice is available; both points show up in the short C sketch below
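
For the C/C++ flavor, both points fit in a single directive. Here is a minimal sketch; the arrays a and b, the size N, and the loop body are placeholders, not anything from the course:

    #define N 1000000

    static double a[N], b[N];   /* file-scope arrays: shared by default */

    int main(void)
    {
        int i;
        double scale = 2.0;

        /* a, b, and scale are shared (the default); the loop index i is
           private to each thread.  schedule(guided) asks for guided
           self-scheduling of the iterations.                            */
        #pragma omp parallel for schedule(guided) private(i)
        for (i = 0; i < N; i++)
            a[i] = scale * b[i];

        return 0;
    }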

The OpenMP standard is a compiler-directive-driven parallel programming system. It uses a fork-join model, so it relies on a (logically) shared memory system and a global data model. OpenMP is a consortium in which several vendors (SGI, IBM, DEC, etc.) and the major compiler technology firms (including the Portland Group) are taking part. The standard has been available for Fortran for some time and now has C implementations. Because of the complex ways that pointers and loops can be used in C, it has taken longer to implement - unlike Fortran, where pointer arithmetic is simply not supported. OpenMP replaces the ANSI X3H5 effort, which has become outdated and has not evolved much in the recent past - particularly to handle CC-NUMA architectures, ones where the memory system is physically partitioned, but presented to the user as a logically single address space.

The home page for OpenMP is at http://www.openmp.org, and you should consult it for the C/C++ (or Fortran) version of the following notes. The language interface specifications are what you should learn, and they have nifty vade mecums for C, C++, and Fortran for quick reference. Also download the "examples" document, since it demonstrates the usage of many of the specs.

OpenMP is a standard intended for Windows, Mac, and Unix systems, and the participation of all major supercomputer and HPC compiler vendors is likely to make it a standard in effect as well as name. The notes which follow address mainly the Fortran version because it is the most commonly used language in scientific programming. Also, anyone under the age of 40 is going to write new programs in C/C++, so you'll learn that variant anyway.

Sentinels

Like all compiler directive systems, OpenMP uses a "sentinel" which looks like a comment line to indicate a directive; in Fortran it is
!$OMP [directive]
while in C it takes the form
#pragma omp directive clauses
{ ... }
where the braces delimit the lines for which the directive applies. Because Fortran77 does not have code block delimiters, in Fortran you end a section with
!$OMP END [directive]

OpenMP also allows conditional compilation, so calls to OpenMP functions can be made invisible on systems without an OpenMP compiler. The sentinel for that in Fortran is !$. For example,

!$    myrank = OMP_GET_THREAD_NUM()
will let you find your thread number - without having to write a stub function for it on non-OpenMP systems.
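
C has no !$ sentinel; a common way to get the same effect is to test the _OPENMP preprocessor macro that OpenMP-aware compilers define. A small sketch (the variable name myrank is just an example):

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void)
    {
        int myrank = 0;                 /* sensible default for a serial build */

    #ifdef _OPENMP
        myrank = omp_get_thread_num();  /* 0 here, since we are outside any parallel region */
    #endif

        printf("my rank is %d\n", myrank);
        return 0;
    }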

Parallel Regions

All the action in OpenMP occurs in "parallel regions", which are started/ended by
            !$OMP PARALLEL [clause]
             ...
            !$OMP END PARALLEL
although C/C++ uses braces to specify the end of a parallel region. Until a parallel region is encountered, only one thread is running: the master thread. On encountering the directive, a team of threads is created, and the master and the other team members share the work in the parallel region. On encountering the end of the parallel region, an implicit barrier causes the threads to join, and only the master thread continues past that point.
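
The same structure in C/C++ looks roughly like this; the printf calls are only there to make the fork and join visible:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("before the region: only the master thread\n");

        #pragma omp parallel        /* fork: a team of threads is created */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                           /* implicit barrier; the team joins here */

        printf("after the region: only the master thread again\n");
        return 0;
    }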

Nested parallel regions are also allowed, although by default on encountering an inner parallel region a team of only one thread is created. However, there are mechanisms (OMP_SET_NESTED()) for allowing more threads to be created. In that case, "master thread" refers to the one that invoked the (nested) parallel region.
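
A sketch of enabling nesting from C; omp_set_nested is the C spelling of the OMP_SET_NESTED call above, and the team size of 2 is chosen only to keep the output small:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_nested(1);        /* allow inner parallel regions to get real teams */
        omp_set_num_threads(2);   /* keep the example small: 2 threads per team     */

        #pragma omp parallel
        {
            int outer = omp_get_thread_num();

            /* each outer thread becomes the master of its own inner team */
            #pragma omp parallel
            {
                printf("outer thread %d, inner thread %d\n",
                       outer, omp_get_thread_num());
            }
        }
        return 0;
    }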

Worksharing Constructs

There are two basic ways to actually get parallel work done: parallel loops and parallel sections. The first is declared via
            !$OMP DO
            ...
            !$OMP END DO
and it applies to the next loop only. The END DO directive is actually not required, but is good form especially if you use the deprecated form of do-loops that don't end with "end do". [And if you do, stop that. You're doing it wrong.] The second has the form
            !$OMP SECTIONS
                block 1
            !$OMP SECTION
                block 2
                 ...
            !$OMP SECTION
                block n
            !$OMP END SECTIONS
and a thread is assigned in parallel to each section. There are two major restrictions on the worksharing constructs:
  1. Like a BARRIER in MPI, all or none of the threads in a team must encounter the construct.
  2. You cannot branch into or out of the construct.
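
In C/C++ the same two constructs, placed inside one parallel region, look roughly like this (the loop body and the section bodies are placeholders):

    #include <stdio.h>
    #include <omp.h>

    #define N 8

    int main(void)
    {
        double a[N];

        #pragma omp parallel
        {
            /* worksharing loop: the N iterations are split among the team */
            #pragma omp for
            for (int i = 0; i < N; i++)
                a[i] = 2.0 * i;

            /* worksharing sections: each block goes to one thread */
            #pragma omp sections
            {
                #pragma omp section
                printf("section 1 ran on thread %d\n", omp_get_thread_num());

                #pragma omp section
                printf("section 2 ran on thread %d\n", omp_get_thread_num());
            }
        }   /* implicit barrier; only the master continues */

        printf("a[%d] = %g\n", N - 1, a[N - 1]);
        return 0;
    }
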
Because it is often the case that you only want to parallelize a single loop or section, there are also combined versions that declare both the parallel region and the worksharing construct, e.g.,
            !$OMP PARALLEL DO
            ...
            !$OMP END PARALLEL DO
However, keep in mind that a team of threads is created on encountering a parallel region, then synchronized and deleted on exiting it. It is typically more efficient to keep the threads around if only a few operations separate parallel regions.
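
To make the cost difference concrete, compare two back-to-back combined constructs (two fork/join cycles) with one parallel region containing two worksharing loops, where the team is created once and reused. A sketch; the loop bodies are placeholders:

    #include <math.h>

    #define N 100000
    static double a[N], b[N];

    /* Version 1: two combined constructs, hence two fork/join cycles */
    void two_regions(void)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < N; i++) a[i] = sin((double)i);

        #pragma omp parallel for
        for (i = 0; i < N; i++) b[i] = a[i] * a[i];
    }

    /* Version 2: one team kept alive across both loops (usually cheaper) */
    void one_region(void)
    {
        int i;
        #pragma omp parallel private(i)
        {
            #pragma omp for
            for (i = 0; i < N; i++) a[i] = sin((double)i);

            #pragma omp for
            for (i = 0; i < N; i++) b[i] = a[i] * a[i];
        }
    }

    int main(void)
    {
        two_regions();
        one_region();
        return 0;
    }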

Synchronization

Because creating and destroying teams of threads may be relatively costly, there are mechanisms (the SINGLE and MASTER directives) that let you execute single-threaded stretches of code without destroying the team. Other useful constructs for synchronization are ORDERED, which causes the code inside to be executed in sequential order (but not necessarily by a single thread), and BARRIER, an old friend from MPI. A barrier is implicit at the end of most parallel worksharing constructs.
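
A C sketch of those pieces; SINGLE is used here as the "one thread only, team kept alive" mechanism, with NOWAIT added so that the explicit BARRIER actually has something to do:

    #include <stdio.h>
    #include <omp.h>

    #define N 16

    int main(void)
    {
        double a[N];

        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < N; i++)
                a[i] = (double)i * i;

            /* one thread executes this block; nowait suppresses the
               implicit barrier SINGLE would otherwise add at its end  */
            #pragma omp single nowait
            printf("table filled in; thread %d reports\n", omp_get_thread_num());

            #pragma omp barrier         /* explicit barrier, as in MPI */

            /* the ordered blocks run in sequential loop order, though
               different threads may execute different iterations      */
            #pragma omp for ordered
            for (int i = 0; i < N; i++) {
                #pragma omp ordered
                printf("a[%2d] = %g\n", i, a[i]);
            }
        }
        return 0;
    }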

Loop Scheduling

Suppose a parallel loop is encountered, with several iterations. Should the system assign chunks of iterations to each thread, or assign iterations cyclically to each thread? The first will entail smaller overhead associated with dispatching and synchronizing the work. The second will have a better chance at load balancing, since the threads may enter the loop at different times, or may take widely differing amounts of time to execute even identical code because of interrupts, etc.

One of the "clauses" you can insert into a worksharing construct is a schedule. The allowed ones are:

  1. STATIC: iterations are divided into chunks and handed out to the threads in a fixed order before the loop runs; lowest overhead, no load balancing.
  2. DYNAMIC: each thread grabs a chunk of iterations and comes back for another when it finishes; good load balancing, more dispatch overhead.
  3. GUIDED: like DYNAMIC, but the chunks start large and shrink as the loop nears completion, giving load balancing with fewer dispatch operations.
  4. RUNTIME: the decision is deferred until run time, via the OMP_SCHEDULE environment variable.

In order of preference, use GUIDED, DYNAMIC, and then STATIC. Guided self-scheduling (GSS) is one of the most innovative ideas to emerge from the parallelizing compiler community in the late 1980's, and has now worked its way into optimizing compilers. DYNAMIC handles the problem of when one thread gets interrupted or takes much longer to finish its task - which is more common than you think.
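
The clause itself is easy to write. A C sketch; the chunk size of 100 and the loop bodies are only illustrative:

    #include <math.h>

    #define N 100000
    static double x[N];

    void fill_and_transform(void)
    {
        int i;

        /* guided self-scheduling: chunks start large and shrink, so threads
           that finish early pick up the slack near the end of the loop      */
        #pragma omp parallel for schedule(guided)
        for (i = 0; i < N; i++)
            x[i] = sqrt((double)i);

        /* dynamic with chunks of 100 iterations: threads grab a new chunk
           whenever they finish the previous one                             */
        #pragma omp parallel for schedule(dynamic, 100)
        for (i = 0; i < N; i++)
            x[i] = exp(-x[i]);
    }

    int main(void)
    {
        fill_and_transform();
        return 0;
    }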

Data Scope

By default, all data is shared in your OpenMP-ized code. However, when declaring a parallel region other scopes can be specified for the variables. This is handy for situations like the double-nested loop in HPF, where the inner loop index had to be declared before the outer loop could be parallelized. OpenMP environments include