I/O and Collective Communications with MPI-1


A problem with using MPI-1 is I/O. Suppose that in the round-robin communication program you want to read in the message length and a value with which to fill the message. So you add the line
scanf("%d %f", &size, &val);
and then the MPI program is run with three processors and the user inputs the line
   24  3.1415
Who gets what? Does each process get size and val, or does just process 0 get them? It is possible for process 1 to get the "24", process 0 to get the "3.1415", and process 2 to hang, waiting in vain for input.

Because of these problems, always have just process 0 do I/O and relay the results to the other processes. MPI-1 itself guarantees very little about I/O, but in practice you can count on the process with rank 0 being able to do it. Vendors usually try to supply better I/O mechanisms, but standardized parallel I/O is a new part of the MPI standard (MPI-2). The algorithm to have just process 0 handle input would be something like

   if (myrank = 0)
       read in integer i , double d
       for k = 1 to p-1
           MPI_Send message with i to process of rank k.
           MPI_Send message with d to process of rank k.
       end for 
   else
       MPI_Recv message with i from process of rank 0.
       MPI_Recv message with d from process of rank 0.
   end if 
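
A minimal C sketch of this step, assuming myrank and p have already been obtained from MPI_Comm_rank and MPI_Comm_size (the function name distribute_input is just for illustration):

   #include <stdio.h>
   #include <mpi.h>

   /* Process 0 reads an int and a double from stdin and sends them,
      one pair of messages at a time, to every other process.        */
   void distribute_input(int myrank, int p, int *i, double *d)
   {
       int k;
       MPI_Status status;

       if (myrank == 0) {
           scanf("%d %lf", i, d);            /* only rank 0 touches stdin */
           for (k = 1; k < p; k++) {
               MPI_Send(i, 1, MPI_INT,    k, 0, MPI_COMM_WORLD);
               MPI_Send(d, 1, MPI_DOUBLE, k, 1, MPI_COMM_WORLD);
           }
       } else {
           MPI_Recv(i, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, &status);
           MPI_Recv(d, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
       }
   }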

This is not scalable. Part of the problem is that one process does all the reading of the data, but that is all we can count on being able to do with MPI-1. A larger scalability problem is that one process sends all the messages, and this we can improve on. The trick is to send the messages out along a tree. Suppose that the number p of processes is a power of 2. Then use the algorithm
      k = p/2
      not_received = true
      for i = 1 to log2(p)
         if (myrank mod (2*k) = 0) then
            MPI_Send data to myrank+k
         else if (not_received and myrank mod (2*k) = k) then
            MPI_Recv data
            not_received = false
         endif
         k = k/2
      endfor
This is the classical binary tree for data transfers, with time moving downwards:
                        0
                       /  \
                      /    \
                     /      \
                    /        \
                   0          4   
                  / \        / \
                 /   \      /   \
                0     2    4     6
               / \   / \  / \   / \
              0   1 2   3 4  5 6   7
Of course, no message needs to be sent along the left branches, since the left child at each interior node is the same process as its parent. Just to test your understanding: in the algorithm given above, what rank number should the process issuing the MPI_Recv() give as the "source" field?
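
The same pattern in C, as a sketch that assumes p is a power of two and that the data is a single double starting out on rank 0 (the source field of MPI_Recv is left as the wildcard MPI_ANY_SOURCE; filling in the specific rank is exactly the question above):

   #include <mpi.h>

   /* Tree broadcast: at each stage every process that already holds the
      data passes it to the partner k ranks above it.                    */
   void tree_bcast(double *data, int myrank, int p)
   {
       int k;
       int not_received = 1;
       MPI_Status status;

       for (k = p / 2; k >= 1; k /= 2) {
           if (myrank % (2 * k) == 0) {
               MPI_Send(data, 1, MPI_DOUBLE, myrank + k, 0, MPI_COMM_WORLD);
           } else if (not_received && myrank % (2 * k) == k) {
               MPI_Recv(data, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                        MPI_COMM_WORLD, &status);
               not_received = 0;
           }
       }
   }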

Collective Communications

Because it is often the case that one process needs to send or receive data to or from all other processes, MPI provides collective communication functions. If you are lucky, the vendor has optimized them for the machine topology; in MPICH they use a tree algorithm like the one above. Here are some of the functions:
  1. int MPI_Bcast(void *msg, int count, MPI_Datatype, int root, MPI_Comm): sends a message from process root to all others in the communicator group. This function must be called by all participating processes. Also, count and datatype must match on all processes, unlike MPI_Send and MPI_Recv.
  2. int MPI_Reduce(void *operand, void *result, int count, MPI_Datatype, MPI_Op, int root, MPI_Comm): combines the values stored in operand across all processes and leaves the answer in result on the process with rank "root". Here count, the datatype, and the operation MPI_Op must be the same on all processes. The predefined operations include MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, and various logical and bitwise operations. Warning: although only root gets the result, every participating process must still pass a result argument, even though it is significant only at root.
  3. Often you want a global reduction operation with the result left on every process, not just the root. Instead of following MPI_Reduce with an MPI_Bcast, use MPI_Allreduce(). In general, the word "All" embedded in an MPI function name means that the result of the operation ends up on all tasks in the communicator group.
Other collective communications that can be useful are MPI_Gather, MPI_Scatter, MPI_Allgather, and MPI_Allgatherv. But the most important one for debugging is MPI_Barrier(MPI_Comm comm). Its only argument is the communicator, and it blocks each process until all of them have called it. This is a synchronization primitive. The most common mistake in parallel computing is to implicitly assume a synchronization that does not in fact exist, except in your mind. Follow each communication phase in your program with an MPI_Barrier until the code is running correctly; then remove, one by one, the barriers you think are unnecessary.
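
As a sketch of how these collectives replace the hand-written code above, here is a small complete program (the computation of local is purely illustrative) that broadcasts the input read on rank 0, forms a global sum on every process, and uses a barrier as a debugging aid:

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char *argv[])
   {
       int myrank, size;
       double val, local, global;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

       /* Only rank 0 reads; MPI_Bcast replaces the explicit send loop
          (and, in a good implementation, the hand-written tree too).   */
       if (myrank == 0)
           scanf("%d %lf", &size, &val);
       MPI_Bcast(&size, 1, MPI_INT,    0, MPI_COMM_WORLD);
       MPI_Bcast(&val,  1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

       /* Some per-process work, then a sum left on every process. */
       local = val * myrank;
       MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

       /* Debugging barrier; remove it once the code runs correctly. */
       MPI_Barrier(MPI_COMM_WORLD);

       if (myrank == 0)
           printf("global sum = %g\n", global);

       MPI_Finalize();
       return 0;
   }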

Note about "Blocking"

Until now we have used blocking sends and receives. Here is what blocking actually means: the call returns as soon as its message buffer can safely be reused, which implies almost no synchronization between the two processes. Typically an outgoing message is buffered by the system, so the sending process can return from MPI_Send before the receiving process has started receiving the message, or has even posted a corresponding MPI_Recv. Beware of this. Equally weird, a receive can complete before the matching send completes. See if you can figure out how that could happen!

Although it seems that the above is haphazard, there is one property that MPI imposes on messages: they are non-overtaking. That is, if process 0 sends three messages a, b, and c, in that order and with identical tags, to process 1, and process 1 posts three matching receives, then the messages will be received in the order a, b, c. However, if more than two processes are involved, and process 2 also sends messages d, e, and f to process 1, and process 1 specifies the wildcard src = MPI_ANY_SOURCE, then the messages from process 0 can be mingled in any order with those from process 2. So, for example, orders in which process 1 could receive the messages are

   a, d, b, e, c, f     or     d, e, a, f, b, c

but it could not receive the messages in the order

   b, a, d, c, e, f

since b cannot overtake a.
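
A sketch of the ordering rules, meant to be run with at least three processes (the message contents here are just placeholders for a, b, c and d, e, f):

   #include <mpi.h>

   /* Ranks 0 and 2 each send three messages, with identical tags, to
      rank 1.  Within each sender the messages cannot overtake one
      another, but with MPI_ANY_SOURCE the two streams may interleave. */
   void ordering_demo(int myrank)
   {
       double msg[3] = {1.0, 2.0, 3.0};
       double recvd;
       int j;
       MPI_Status status;

       if (myrank == 0 || myrank == 2) {
           for (j = 0; j < 3; j++)
               MPI_Send(&msg[j], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
       } else if (myrank == 1) {
           for (j = 0; j < 6; j++) {
               MPI_Recv(&recvd, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                        MPI_COMM_WORLD, &status);
               /* status.MPI_SOURCE records which sender this one came from */
           }
       }
   }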