Basic Architecture of Parallel Systems


Parallel systems tend to be divided into two groups: shared memory, where there is a single address space and physical memory system, and distributed memory, where each processor has its share of the system's memory attached to it. Here address space refers to what you as a programmer see. On a four-processor shared memory machine you might declare an array A(100) of length 100 and access its k-th element by simply referencing it via A(k) from any processor. A distributed memory machine requires you to declare A in four parts, each 25 entries long - and when you reference an element you must keep track of both the entry and which processor "owns" that particular datum.
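
To make the bookkeeping concrete, here is a minimal sketch (in C, using 0-based indexing and made-up names like NPROCS and BLOCK) of the translation a distributed memory program has to do from a global index into an owning processor and a local index.

    /* Sketch: block distribution of a 100-element array over 4 processors.
     * On a shared memory machine you would simply index A[k]; here the
     * program must translate a global index k into (owner, local index).
     * The names NPROCS and BLOCK are illustrative, not from any system. */
    #include <stdio.h>

    #define N      100
    #define NPROCS 4
    #define BLOCK  (N / NPROCS)   /* 25 entries per processor */

    /* Which processor owns global element k, and where it sits locally. */
    static int owner(int k)       { return k / BLOCK; }
    static int local_index(int k) { return k % BLOCK; }

    int main(void)
    {
        int k = 67;
        printf("Global A(%d) lives on processor %d as local element %d\n",
               k, owner(k), local_index(k));
        return 0;
    }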

The division is not strict; the Illinois Cedar machine (circa late 1980s) had local memory attached to each group of eight processors forming a shared memory component, with each processor able to access a global shared memory as well. Long before multicore systems, many companies provided "distributed shared memory" machines, which present a single address space to the user but have that memory physically distributed. The OS, runtime system, and hardware then handle accessing the right entry, and the user does not have to keep track of which processor "owns" it. They are also called NUMA machines, for Nonuniform Memory Access, since it usually takes longer to access an operand physically residing in a remote memory than in a local one. Another related term is CC-NUMA: cache-coherent NUMA.

Some understanding of basic uniprocessor memory systems is needed. The emphasis on memory systems here is due to the fundamental performance principle of scientific computing: most numerical computations are limited not by processor speed, but by the time it takes to move data to and from the processor.


Memory Banks

[Note: this is not critical for non-vector machines, and is here for completeness. If you don't understand it, don't worry too much.] Modern computer systems (including workstations) have memory divided into a set of banks or modules. Logically, memory appears to a user as a single vector running from address 0 to M-1, where M is the number of addressable words or bytes on the machine. Physically, however, the memory addresses are interleaved among the banks.

An interleaved memory with b banks is said to be b-way interleaved, no big surprise. Memory address m then resides in memory bank mod(m,b). This way consecutive addresses reside in different banks, so if a program is accessing one word after another the memory system can have the different banks processing the requests simultaneously. The processor can request a transfer from location m on one cycle and from m+1 on the next cycle; the information will be returned on successive cycles. Note that the latency of a request, i.e. the number of cycles a processor has to wait before receiving the contents of a single location, is not affected, but the bandwidth is improved. If there are enough banks the memory system can potentially deliver information at a rate of one word per processor cycle, regardless of the memory cycle time.
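
A small sketch of the bank mapping, with an illustrative bank count of 8; stride-1 accesses visit the banks round-robin, which is what lets the requests overlap.

    /* Sketch: b-way interleaved memory.  Address m resides in bank mod(m, b),
     * so consecutive addresses fall in consecutive banks.  The bank count of
     * 8 is just an illustrative value. */
    #include <stdio.h>

    #define NBANKS 8

    static int bank_of(unsigned long m) { return (int)(m % NBANKS); }

    int main(void)
    {
        /* A stride-1 sweep touches each bank in turn, so requests can
           overlap; a stride equal to NBANKS would hammer a single bank. */
        for (unsigned long m = 0; m < 16; m++)
            printf("address %2lu -> bank %d\n", m, bank_of(m));
        return 0;
    }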

The decision to allocate addresses as contiguous blocks or in interleaved fashion depends on how one expects information to be accessed. Programs are compiled so instructions reside in successive addresses, so there is a high probability that after a processor executes the instruction at location m it will execute the instruction at m+1. Compilers can also allocate vector elements to successive addresses, so operations on entire vectors can take advantage of interleaving. For these reasons, vector processors invariably have interleaved memory. However, shared memory multiprocessors use a block-oriented scheme since memory referencing patterns in an MIMD system are quite different. There the goal is to connect a processor to a single memory and use as much information as possible from that memory before switching to another memory.


Caches

A cache is a small, fast memory located near or on-chip with a processor. Access to a memory word causes an entire line or block of words to be loaded into the cache (line sizes are typically 4-32 8-byte words). Accessing a word from cache is typically 50-100 times faster than fetching it from memory.

The reason for using a cache and cache lines is spatial data locality: if you use the word at location m on one step, the next word you access is likely to have an address near or adjacent to m. When the cache is full and a new line is brought in, some line must be removed. The most commonly used replacement policy is LRU: least recently used. The line whose most recent access is furthest in the past is replaced. The idea here is temporal locality: recently accessed words are likely to be accessed again soon (think of a loop index variable, for example).
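
The practical payoff of spatial locality shows up in something as simple as loop ordering. The sketch below (illustrative C, arbitrary matrix size) sums the same matrix twice; the row-major traversal reuses each cache line, while the column-major traversal jumps N doubles at a time and tends to miss on every access once the matrix is large.

    /* Sketch: spatial locality and cache lines. */
    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    double sum_row_major(void)           /* good spatial locality */
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_col_major(void)           /* poor spatial locality */
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }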

A modified line can be propagated to memory in one of two ways:

  1. Write-back: when a word is stored, its value in the cache is changed and a dirty bit is set for its cache line. When a line is replaced, if its dirty bit is set the line is written back to memory; otherwise it is discarded since the version in memory is the same as the version in cache.
  2. Write-through: when a word is stored, the modified cache line is sent to memory immediately.
Write-back is generally more efficient, since it involves far fewer stores to memory. Write-through makes some aspects of parallel processing easier, as will be seen later.
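
The following sketch contrasts the two policies for a single cache line; the structure and function names are illustrative, not taken from any real cache design.

    /* Sketch contrasting the two store policies for a single cache line. */
    #include <string.h>

    #define LINE_WORDS 8

    struct cache_line {
        unsigned long tag;            /* which memory block this line holds */
        double        data[LINE_WORDS];
        int           valid;
        int           dirty;          /* used only by the write-back policy */
    };

    /* Write-back: update the cached copy and mark the line dirty.  Memory
     * is touched only later, when the line is evicted. */
    void store_write_back(struct cache_line *line, int word, double value)
    {
        line->data[word] = value;
        line->dirty = 1;
    }

    /* Write-through: update the cached copy and push the line to memory
     * immediately. */
    void store_write_through(struct cache_line *line, int word, double value,
                             double *memory_block)
    {
        line->data[word] = value;
        memcpy(memory_block, line->data, sizeof line->data);
    }

    /* On eviction, a write-back cache flushes only dirty lines. */
    void evict(struct cache_line *line, double *memory_block)
    {
        if (line->dirty)
            memcpy(memory_block, line->data, sizeof line->data);
        line->valid = 0;
        line->dirty = 0;
    }

    int main(void)
    {
        struct cache_line line = { 0 };
        double memory_block[LINE_WORDS] = { 0 };

        store_write_through(&line, 2, 7.5, memory_block); /* memory current now */
        store_write_back(&line, 3, 2.5);   /* memory_block[3] is stale for now */
        evict(&line, memory_block);        /* dirty line flushed; memory current */
        return 0;
    }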

Modern processors use multiple levels of cache; three is common and four increasingly so. This trend is helpful for serial computing - otherwise the vendors would not build them. However, this "deep memory hierarchy" has some serious consequences for parallel computing. Essentially, when processors need to communicate by sending a message or datum from one to the other, that datum must burrow its way upwards through the first processor's memory hierarchy, then downwards through the second processor's memory hierarchy, before it can be used by the second processor. So in addition to the cost of sending the data across whatever communication substratum exists, there is the cost of traversing two memory systems - and perturbing data in the caches along the way.


Shared Memory Processor-Memory Organization

A straightforward way to connect several processors together to build a multiprocessor is to have each processor and memory module hang off of a bus, a memory channel which on PCs often takes the form of a broad ribbon connector. The physical connections are quite simple. Most bus structures allow an arbitrary (but not too large) number of devices to communicate over the bus. Bus protocols were initially designed to allow a single processor and one or more disk or tape controllers to communicate with memory. If the I/O controllers are replaced by processors, one has a small single-bus multiprocessor.

The problem with this design is that processors contend for access to the bus. If processor P is fetching an instruction, all other processors must wait until the bus is free. If there are only two processors, they can run close to their maximum rate since the bus can alternate between them: while one processor is decoding and executing an instruction, the other can be using the bus to fetch its next instruction. However, when a third processor is added performance begins to degrade, and usually about 10 processors on the bus is enough to flatten the performance curve, so that adding more processors does not increase performance. The memory and bus have a fixed bandwidth, determined by a combination of the memory cycle time and the bus protocol, and in a single-bus multiprocessor this bandwidth is divided among several processors. If the processor cycle time is slow compared to the memory cycle, a fairly large number of processors can be accommodated this way, but since processor cycles are usually faster than memory cycles the scheme is not scalable.
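
A back-of-the-envelope calculation makes the saturation point concrete. The numbers below (the bus delivers one word every 10 ns, each processor requests a word roughly every 100 ns) are purely illustrative assumptions, not measurements of any real machine, but they reproduce the rule of thumb of roughly 10 processors per bus.

    /* Sketch: how many processors can one bus support before it saturates? */
    #include <stdio.h>

    int main(void)
    {
        double bus_ns_per_word = 10.0;   /* bus delivers one word every 10 ns  */
        double proc_demand_ns  = 100.0;  /* one processor wants a word ~100 ns */

        /* Each processor uses only a fraction of the bus bandwidth; once the
           combined demand equals the bus bandwidth, extra processors add
           nothing. */
        double saturation = proc_demand_ns / bus_ns_per_word;
        printf("Bus saturates at about %.0f processors\n", saturation);
        return 0;
    }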

A modification to this design improves performance, but it cannot indefinitely postpone the flattening of the performance curve. If each processor has its own local cache and the program has good data locality, then the data a processor needs is likely to be in its cache. A good cache hit rate greatly reduces the number of bus and memory accesses a processor makes and thus improves overall efficiency. The dogleg of the performance curve, which identifies the point up to which it is still cost-effective to add processors, extends to around 20 processors, and the curve does not flatten out until around 30 processors.

Giving each processor its own cache introduces the cache coherency problem. Suppose two processors use data item A, so A ends up in the cache of both processors. Next suppose processor 1 performs a calculation that changes A. When it is done, the new value of A is written out to main memory. At a later time, processor 2 needs to fetch A. However, since A was already in its cache, it will use the cached value and not the newly updated value calculated by processor 1. Maintaining a consistent version of shared data requires providing new versions of the cached data to each processor whenever one of the processors updates its copy. The typical approach is called a "snooping protocol", where each processor "listens" on the bus for address requests and update postings.
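
The scenario can be simulated in a few lines; the "caches" below are just local copies of A and the values are illustrative, but the stale read at the end is exactly the coherency problem a snooping protocol is designed to prevent.

    /* Sketch: the stale-value scenario with no coherence protocol. */
    #include <stdio.h>

    int main(void)
    {
        double memory_A = 1.0;      /* the shared datum A in main memory   */
        double cache1_A = memory_A; /* processor 1 reads A into its cache  */
        double cache2_A = memory_A; /* processor 2 reads A into its cache  */

        /* Processor 1 updates A and writes the new value back to memory. */
        cache1_A = 42.0;
        memory_A = cache1_A;

        /* Processor 2 later needs A again.  Without coherence it hits in
           its own cache and never sees processor 1's update. */
        printf("memory holds A = %.1f, but processor 2 reads A = %.1f\n",
               memory_A, cache2_A);

        /* A snooping protocol would have invalidated (or updated) cache2_A
           when processor 1's write appeared on the bus. */
        return 0;
    }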

Another way of building a shared memory multiprocessor is to replace the bus with a switch that routes requests from a processor to one of several different memory modules. Even though there are several physical memories, there is one large virtual address space. The advantage of this organization comes from having switches that can handle multiple requests in parallel. Each processor can be paired up with a memory, and each can then run at full speed as it accesses the memory it is currently connected to. Contention still occurs: if two processors make requests of the same memory module, only one is given access and the other is blocked.
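
A tiny sketch of that contention, assuming requests are routed to module mod(address, number of modules); the addresses and module count are illustrative.

    /* Sketch: contention at the memory modules behind a switch. */
    #include <stdio.h>

    #define NMODULES 4

    int main(void)
    {
        unsigned long request[2] = { 12, 20 };   /* addresses from P0 and P1 */

        int m0 = (int)(request[0] % NMODULES);
        int m1 = (int)(request[1] % NMODULES);

        if (m0 == m1)
            printf("both processors want module %d: one proceeds, one is blocked\n", m0);
        else
            printf("P0 uses module %d and P1 uses module %d in parallel\n", m0, m1);
        return 0;
    }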

Various switch designs are used for this purpose, ranging from a full crossbar to multistage interconnection networks.


Distributed Memory Machines

The great divide in programming parallel architectures is shared versus distributed memory systems, although the distinction is becoming more blurred as time goes on. In practice, the distinction to a programmer is whether the memory is logically shared or distributed, which in turn depends on whether the memory presents a single or a multiple address space. Logically shared memory machines are ones with a single address space. As we will see, this makes the porting and programming problem much easier.

As the material on interconnection networks between processors and memory shows, the problems of bandwidth limits and network congestion can be alleviated by giving each processor a large cache - at the price of worrying about cache coherency. If this idea is carried to its extreme, all of the memory is moved to be local to the processors. This gives a distributed memory system, one where each processor has its own memory - and its own address space. Now the programmer is required to explicitly distribute the program data among the processors, synchronize between them, and communicate results by sending messages.

The advantage of distributed memory systems is that they are more "scalable". The word scalable is bandied about a great deal in parallel computing, but it is like the word "soup" - it means drastically different things to different people at different times. Here scalability is primarily of an architectural variety: distributed memory machines consist of fungible components, and you buy more and plug them in as needed. With a suitable problem and code, their performance can also be scalable. But using distributed memory introduces two sources of overhead: it takes time to construct and send a message from one processor to another, and a receiving processor must be interrupted to deal with messages from other processors.

So in a distributed memory system the memory is associated with individual processors and a processor is only able to address its own memory. Sometimes this is called a multicomputer system, since the building blocks in the system are themselves small computer systems complete with processor and memory. The IBM SP/2, for example, was originally just a collection of RS/6000 workstations tied together with a fast interconnect network, with each RS/6000 running its own copy of the OS.

In a distributed memory system, each processor can utilize the full bandwidth to its own local memory without interference from other processors. There is no inherent limit to the number of processors as with bus-based systems. The size of the system is now constrained only by the network used to connect processors to each other. There are no cache coherency problems (more accurately, the user becomes responsible for maintaining coherency). Each processor is in charge of its own data, and other processors cannot access it without going through explicit actions commanded by the program.

Programming on a distributed memory machine means organizing your program as a set of independent tasks that communicate with each other via messages. The programmer must be aware of where data is stored (that is, on which processor it resides), which introduces a new form of locality in algorithm design. An algorithm that allows data to be partitioned into discrete units and then runs with minimal communication between units will be more efficient than an algorithm that requires random access to global structures.
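
A minimal sketch of this style using MPI (assuming an MPI installation; compile with mpicc and run with at least two processes). Rank 0 owns a block of data and explicitly ships it to rank 1, which must post a matching receive - nothing moves unless both sides take action.

    /* Sketch of the message-passing style using MPI. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double block[25];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 25; i++) block[i] = (double)i;
            /* Explicitly send our block to the processor that needs it. */
            MPI_Send(block, 25, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(block, 25, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received block; last entry = %.1f\n", block[24]);
        }

        MPI_Finalize();
        return 0;
    }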

Semaphores, monitors, and other concurrent programming techniques are not directly applicable on distributed memory machines, but they can be implemented by a layered software approach. User code can invoke a semaphore, for example, which is itself implemented by passing a message to the node that "owns" the semaphore. This approach is not efficient.
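
A sketch of what that layering might look like, using MPI and a made-up tag protocol; the owning node is assumed to run a small service loop (not shown) that queues requests and sends grants. The point to notice is that every acquire costs a full message round trip.

    /* Sketch of a semaphore layered on message passing (hypothetical
     * protocol, not a real library). */
    #include <mpi.h>

    #define TAG_ACQUIRE 1
    #define TAG_RELEASE 2
    #define TAG_GRANT   3

    /* Called on a non-owner node: one round trip per acquire. */
    void semaphore_acquire(int owner_rank)
    {
        int dummy = 0;
        MPI_Send(&dummy, 1, MPI_INT, owner_rank, TAG_ACQUIRE, MPI_COMM_WORLD);
        MPI_Recv(&dummy, 1, MPI_INT, owner_rank, TAG_GRANT, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    void semaphore_release(int owner_rank)
    {
        int dummy = 0;
        MPI_Send(&dummy, 1, MPI_INT, owner_rank, TAG_RELEASE, MPI_COMM_WORLD);
    }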

Which programming style is easier - shared memory with semaphores, etc., or distributed memory with message passing - is often a matter of background; however, most users find the shared memory model easier to deal with. The message passing style can fit well with an object oriented programming methodology, and if a program is already organized in terms of objects it may be quite easy to adapt it for a distributed memory system. Choosing to implement a program in shared memory versus distributed memory is usually based on the amount of information that must be shared by parallel tasks. Whatever information is shared among tasks must be copied from one node to another via messages in a distributed memory system, and this overhead may reduce efficiency to the point where a shared memory system is preferred.


Single nodes in a distributed memory system are called processing elements, or PEs. To any PE, the other PEs are simply I/O devices. To send a message to another PE, a processor copies information into a buffer in its local memory and then tells its local controller to transfer the information to an external device, much the same way a disk controller in a microcomputer would write a block on a disk drive. In this case, however, the block of data is transferred over the interconnection network to an I/O controller in the receiving node. That controller finds room for the incoming message in its local memory and then notifies the processor that a message has arrived. On the Intel Paragon, to avoid tying up computation while communication was going on, each PE contained two i860 processors: one handled communication alone and the other handled computation, allowing the two to be overlapped. Modern multicore systems have made that design clunky and unnecessary, however.


Hybrid Memory Organization

As always happens in computer science when there are two paradigms, each with complementary strengths, hybridization efforts try to build systems that have the strengths of both. A blurring of the distinction between shared and distributed memory systems has been going on recently, and takes at least three forms:

  1. Hybrid systems mix the two flavors of memory. One form consists of an array of shared memory multiprocessors, tied together with an ultrafast network. Another flavor is to connect shared memory multiprocessors together with a global memory system, separate from the individual shared memories.
  2. Parallel languages such as Titanium and HPC++ rely on compilers to map user data to different address spaces, and then a runtime system manages the necessary message passing for a user.
  3. "Distributed shared memory" systems have physically distributed memory, but rely on a combination of operating system and hardware to move address references where they are needed. Here the user has a single logically shared address space, but accessing data beloging to another processor can take significantly longer than accessing it from the local memory - leading to the term NUMA, or non-uniform memory access.
The SGI Origin 2000 was a hybrid system. It used a 4D hypercube to connect up to 16 "nodes", where each node was a board with two processors sharing a single memory. Systems with more than 16 nodes were then tied together with a "Craylink Interconnect", high speed links and routers connecting the hypercubes. The user sees a single address space and need not explicitly partition program data or write message passing code. However, practical experience with the machine was that to get good performance, the user needed to be aware of and take an active role in locating data on the machine. More deadly, a single clueless or evil user could make the entire system run hundreds of times slower. If a job on one processor allocated more memory than its node contained, the excess memory would be located on other nodes - which typically were running other jobs. That meant those processors had to handle sending data to the first processor, drastically slowing down those nodes. If they were fully or mostly using the memory on their node, then their data would in turn be displaced to yet other nodes. A job that might take 30 minutes to run could end up taking over two days, and the only way to avoid the issue was to run a single job at a time.