B673: Introduction to Parallel Computing

Shared Memory Parallelism and Threads

Recall that shared memory systems present to the user a (logically) single address space. That means each processor is accessing what appears to be a single memory system, rather than each addressing their own memory system. Shared memory systems

Allow incremental porting of an application to a parallel version, unlike SPMD versions which are "all or nothing" for parallel versions.
Do not require explicit communications via message-passing. However, this shifts the real burden of work from communications to synchronization.
Are not considered "scalable" in the architectural sense. However, sophisticated interconnect networks tend to be log(p) in memory access time, which is scalable for most practical purposes.

Cache-coherent nonuniform memory access (CC-NUMA) systems have a physically distributed memory, but present a single address space to a user. These systems have a (not yet fully realized) potential to provide the best of both worlds.

Although shared memory machines do not require the programmer to explicitly partition data amongst processors, achieving good performance on them still requires some mental assignment of data to processes - for cache data locality if nothing else.

Shared memory programming approaches loosely fall into three categories:

Annotation systems, in which the user inserts compiler directives specifying how parallelism is to be handled. These are coupled with some automatic parallelizing compilers, which analyze the code and typically try to run loop nests in parallel. A major recent development in this is OpenMP, an attempt to provide a uniform, cross-vendor set of compiler directives.
VHL extensions such as High-Performance Fortran (HPF) which extends Fortran 90, or High-Performance C++ (HPC++).
Thread programming.

First consider thread programming because it requires more time to master and often the other approaches are built on top of threads.

Thread Programming

MPI (Message-Passing Interface) delivers parallelism by splitting the task into processes. Each subtask lives in its own process. Thread programs allow multiple subtasks as individual control streams within a single process. This requires a way of partitioning data and resources within that single process.

Several libraries provide multithreading capabilities: Java threads, Solaris threads, Mach threads, P4 (Parmacs), NT threads. We will use Posix standard threads, otherwise called pthreads. For most of this course, "thread" can be taken to mean a pthread. Most vendors have built pthreads on top of one of their existing thread libraries - much the way MPI is built on a variety of native communication libraries. And like MPI, this provides a single API, portable between machines.

User level threads are used for all forms of concurrency. Places where they are particularly useful include

System calls which may get blocked for long periods of time. The usual suspect here is I/O operations.
Asynchronous signal handling, for such things as listen/wait tasks.
Assigning different scheduling or priority to a subtask.

Note that there is no practical way for a user to achieve the last using MPI or most other SPMD models, unless you can dynamically assign system priority to a process after it has started.

Relation of Threads and Processes

Because multiple threads can share a single process, it is important to know how process resources are allocated amongst them. The general rule is: everything possible is shared by the threads within a single process. Consider the figure below, which shows the layout in memory of a typical Unix process with a main function and two other functions inside of main(). The program code itself is typically called the "text" section of an assembly routine. Below that are any global variables. Above the text section are the two parts which can grow dynamically: the heap section which malloc() or new() use for dynamic memory managment and a stack, used for pushing down stack frames. These frames keep track of where the program is in possibly several layers of function calls. The figure has the program currently executing in function f1(), so both main() and f1() have stack frames on the stack.

In addition to the process in memory, the operating system keeps some associated data for the process:

Process identifiers, including the process ID, the user ID, etc.
Resources assigned to the process, such as open sockets and files.
Registers. These include two particularly important ones: the program counter, which keeps track of which line of code in the text segment is being executed, and a stack pointer which specifies the current stack frame.

Pthreads separates out the process data, which is accessible to and shared by all the threads) from the thread data, which is typically just what is needed to allow multiple control streams. That would be a separate stack (since the threads may be in different layers or even places in the program), and the registers. Automatic variables (ones allocated dynamically on entry) in the start function and its descendents have separate copies for each thread. All else is shared by the threads. This is shown in the next figure, which has two threads indicated in the process.

Several problems occur with this model. Among them:

What if two threads modify global variables "simultaneously"?
What if two threads malloc() data "simultaneously"?
What about process limits, such as number of open files, stack size, etc?
If a Unix signal (e.g., ctl-C to kill a runaway process) is sent to the process, who catches the signal? One particular thread, any thread, or all threads?
If one thread bombs, should the whole process bomb?
If the number of threads exceeds the number of physical processors, how is priority and scheduling established?

Every thread library has its own answers to these questions; the particular answers pthreads supply are not necessarily the best.

This way of getting parallelism in a process should be compared with just using fork() to spawn off another process. The Unix fork() function

Makes an exact copy of the process, differing only in the process ID (PID) and the value returned from the fork() call: 0 to the child process and the child's PID to the parent process.
Thereafter the two processes go their own way
Can specify shared data via a variety of mechanisms, such as "shmem" system calls.

The Unix fork() creates an explicit parent/child relationship between the two processes. Threads, however, are peers as far as the OS is concerned - only the first thread in the process has any special status (the main thread).

Getting Started

The header file pthread.h has all the definitions needed for working with threads. Pthread creation is via

int pthread_create(pthread_t *thread,
                   const pthread_attr_t *attr,
                   void* (*start_ftn) (void *),
                   void *arg);

In this,

pthread_t *thread is a pointer to a thread (a thread handle), which gets set up by the call.
pthread_attr_t is a set of attributes to assign the newly created thread. this allows specifying priority, scheduling, stacksizes, etc.
start_ftn() is pointer to the function the newly created thread will start its life executing.
arg is the one and only argument value you can provide to the start_ftn().

Pthread Return Values

Pthread routines in general return the value 0 on success and nonzero otherwise. The only two exceptions to this are pthread_self() which never fails, and pthread__getspecific() which returns NULL on failure. The latter is associated with the use of "keys", which is more than we need. You can use the normal errno.h stuff:

#include 
...
rtnval = pthread_create( ... );
if (rtnval != 0) {
   printf("Unable to create thread ... ");
   if (rtnval == EINVAL)
      printf("bad arguments to pthread_create()\n");
   else if (rtnval == EAGAIN)
      printf("not enough resources available for pthread_create()\n");
   exit(-1);
}

Depending on the system you are using, perror() may also be available. However, in general there is no guarantee that a thread will set the variable errno, or that perror() will be usable. So don't count on it; use the above code fragment style of working with the returned value. Note that the above executes exit(-1), which kills the process - and hence all threads associated with that process. In some situations you may want to just kill the thread, or wait and try again later, etc.

When a thread finishes

A thread terminates when the end of the start function is reached and a return is executed from it, or when you explicitly call pthread_exit(void *val). The second form allows you to return a value from the thread.

The invoking thread can synchronize on the completion of the created thread by calling

pthread_join(pthread_t thread, void **val);

The caller thread waits until the specified thread terminates, and the return value of the calling thread is in val. By default, threads are joinable in this fashion. You can specify instead that a thread is detached, in which case its exit state and return value is not saved. This allows the OS to reclaim all resources associated with the detached thread as soon as it completes. After a thread is created it can be changed to detached state via pthread_detach(thread). Another way is to do this using thread attributes at the thread creation time, to be covered later.

Example: Dotproduct

The first example is a simple stride-1 dotproduct, using two threads to compute different segments of the vectors. A problem is the need to have a single argument to the thread starting function, so we have to pack up the usual three arguments into a single dp_args structure.

typedef struct{int length;
               double *x;
               double *y;
} dp_args;

...
double dotprod(int n, double x[], double y[]) {

   dp_args seg; 
   pthread_t chunk[2];    /* will start only two threads */
   int retval = 0;
   double sum = 0.0;
   double *val;

   val = (double *) malloc(sizeof(double));

   seg.length = n/2;
   seg.x      = x;
   seg.y      = y;
   retval = pthread_create(&chunk[0], NULL, (void *) local_dot,
                           (void *) &seg);
   if (retval != 0) { /* error handling here */ }

   seg.length = n - n/2 + 1;
   seg.x      = &x[n/2];
   seg.y      = &y[n/2];
   retval = pthread_create(&chunk[1], NULL, (void *(*)(void *))local_dot,
                           (void *) &seg);
   if (retval != 0) { /* error handling here */ }

   pthread_join(chunk[0], (void *) (&val));
   sum += *val;
   pthread_join(chunk[1], (void *) (&val));
   sum += *val;
   return sum;
}

double local_dot(dp_args *seg) {
  int k;
  double sum = 0.0;
  double *x = seg->x;
  double *y = seg->y;
  int n = seg->length;
  for (k = 0; k < n; k++)
    sum += x[k]*y[k];
  return sum;
}

There is a major problem with the above code, which strikes at the heart of shared memory processing and is especially easy to make after dealing with distributed memory programming. Recall that everything not otherwise specified is shared. Above we have only one dp_args variable, seg. After creating the first thread, we then start reloading that variable with the arguments for the second thread. However, the variable seg is shared by both threads, so it is possible that before the first thread can get the argument values from it, the main thread has already started changing them.

Actually, the above code has several errors which are easy to make. One is the "return" value from local_dot(). Pthreads wants only a void return value from the start function. In that case, how do you get results back? Easy: since threads are shared memory, they can just write their results to a global variable - it will automatically belong to the other threads, including the main one.

This is fixed by the second version of the dotproduct, which also sets up the computation for a general number of threads. The CHUNKSZ argument gives the maximum length of a vector segment to use, and MXSEGS is the maximum number of segments. Each segment thread writes its results to a global array locsum.

#include
#include
#include

#define FALSE 0
#define TRUE 1
#define CHUNKSZ 1000
#define MXSEGS  128

double locsum[MXSEGS];

typedef struct{int segment;     /* Segment number */
               int length;      /* Segment length */
               double *x;      
               double *y; } dp_args;

void local_dot(dp_args *seg) {
  int k;
  double *x = seg->x;
  double *y = seg->y;
  double val = 0.0;
  for (k = 0; k < seg->length; k++)
    val += x[k]*y[k];
  locsum[seg->segment] = val;
};

double dotprod(int n, double x[], double y[]) {
   dp_args *seg; 
   pthread_t *chunk;
   static int chunksize = CHUNKSZ;  /* Not a great way to do this */
   int k = 0;
   int retval;
   int start = 0;
   double sum = 0.0;
   double *val;
   int odd = FALSE;
   int nsegs = n/chunksize;

   val = (double *) malloc(sizeof(double));

/* -----------------------------------------------------*/
/* Increase number of segments if chunksize does not    */
/* evenly divide n, and allocate threads and dotproduct */
/* argument structures                                  */
/* -----------------------------------------------------*/
   if (n%chunksize != 0) {nsegs++; odd = TRUE;}
   chunk = (pthread_t *) malloc(nsegs*sizeof(pthread_t));
   seg   = (dp_args *) malloc(nsegs*sizeof(dp_args));
   if (seg == NULL || chunk == NULL) {
       printf("failure to allocate chunk/seg\n");
       exit(-1);
   }

/* ------------------------------------------------------- */
/* Spawn off nsegs threads to compute chunks of dotproduct */
/* ------------------------------------------------------- */

   for (k = 0; k < nsegs; k++) {
/*   ------------------------------------------- */
/*   Load up the dp_args object for k-th segment */
/*   ------------------------------------------- */
     start = k*chunksize;
     seg[k].length = chunksize;
     if (odd == TRUE && k == nsegs-1) {
        seg[k].length = n - (nsegs-1)*chunksize;
     }
     seg[k].x  = &x[start];
     seg[k].y  = &y[start];
     seg[k].segment  = k;

/*   ------------------------ */
/*   Try to create the thread */
/*   ------------------------ */
     printf("Spawning thread %d \n", k);
     retval = pthread_create(&chunk[k], NULL, (void *(*)(void *)) local_dot,
                           (void *) &(seg[k]));
     if (retval != 0) {
        printf("Unable to create thread ... ");
        if (retval == EINVAL)
           printf("bad arguments to pthread_create()\n");
        else if (retval == EAGAIN)
           printf("not enough resources available for pthread_create()\n");
        exit(-1);
     }

   }

/* --------------------------------------------*/
/* Gather results from the different threads.  */
/* --------------------------------------------*/
   for (k = 0; k < nsegs; k++) {
      pthread_join(chunk[k], (void **)&val);
      sum += locsum[k];
   }
   free(chunk);
   free(seg);
   free(val);
   return sum;
};

The above is correct enough. However, in terms of performant computing there may be some problems. A minor problem is the arbitrarily set "chunksize". However, that can be determined fairly easily, and is something you should be able to answer after Exercise 6.

Another problem is that we wait for the threads to complete in order. However, that is not really necessary - since the variable "sum" in function dotprod() is shared, we can just have every thread add its contribution to it on completion. However, we have to make sure only one at a time does this, and that sum is not returned until all the threads complete their computations. These topics lead to thread synchronization.

How many threads are running in the above code? The answer is nsegs+1, not nsegs. This is because the main thread is also present. During the dotproduct computation, if there are nseg or fewer physical processors available, the main thread will be competing with the other threads for processor resources - even though all it is doing is waiting for the other threads to complete. This is a recurring problem in thread programming. This leads to examining thread programming models.

Last Modified: Tue 06 Feb 2018, 03:45 PM