Basic Parallel Program Modeling


Suppose that a given program is to be run on a multiprocessor system. The following model introduces some basic terminology and ideas of parallel computing, at the cost of ignoring many real-world issues. Let t1 be the execution time on one processor and tp the execution time on p processors, and define the speedup as Sp = t1/tp and the parallel efficiency as Ep = Sp/p. Now some warnings. First, "speedup" can vary from machine to machine, depends strongly on the compiler and system technology used, and can vary strongly from one run to another. Unless it is measured carefully, it is a useless number.

Secondly, in earlier (more innocent and nobler) days, speedup for a problem was defined as the time of the best uniprocessor algorithm divided by the time of the multiprocessor algorithm. The distinction matters: sometimes we have to change algorithms to get a parallel code, often at the price of using a less sophisticated and slower-converging algorithm. The hope is that the gain from using multiple processors outweighs the loss from using a less efficient algorithm. Clearly it is unfair to measure t1 by simply running the multiprocessor code on one processor. Even when the same algorithm is used, the parallel code typically contains synchronization and other features that are unnecessary in the uniprocessor case. So when someone claims a "parallel efficiency of 80%", check carefully how the uniprocessor time was obtained; if that information is not available, you can usually disregard the speedup numbers.

Thirdly, although it seems that we should have 1 ≤ Sp ≤ p, both bounds can be violated. Sp < 1 occurs when the costs of parallelism (synchronization, communication) exceed the benefits; this is common for codes with tight data or temporal dependencies between the computations. Sp = p is called linear speedup, and when Sp > p the problem is said to have achieved superlinear speedup. Superlinear speedup can occur in three cases:

  1. Sp > p can occur if the algorithm being implemented is nondeterministic, and the parallel version actually does less work. An example would be a branch-and-bound algorithm in which the parallel code succeeds in pruning more branches than the sequential one, avoiding work that the uniprocessor version would have done.
  2. Sp > p can occur because bringing in more processors also means bringing in more cache and more registers, so a faster aggregate memory system is working on the problem.
  3. Sp > p also occurs when people cheat. E.g., make multiple runs and choose the slowest time for t1 and the fastest time for tp.
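
Given honestly measured times t1 and tp, the bookkeeping behind these definitions is trivial. A minimal Python sketch (the times below are made up for illustration only) computes the speedup and efficiency and flags a superlinear result:

    # Speedup and efficiency from measured wall-clock times.
    # The times below are invented for illustration only; in practice t1
    # should come from the best serial algorithm, measured as carefully as tp.
    t1 = 100.0   # seconds on one processor
    tp = 11.5    # seconds on p processors
    p = 8

    Sp = t1 / tp      # speedup
    Ep = Sp / p       # parallel efficiency
    print(f"speedup = {Sp:.2f}, efficiency = {Ep:.2f}")
    if Sp > p:
        print("superlinear speedup: check the measurement before celebrating")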

Amdahl's Law

Ware's formulation of Amdahl's law says that if a computation is performed in two modes that run at different rates of execution, the net speed depends on the fraction of the computation done in each mode. Let t1 be the time to run the entire computation in the slow (serial) mode and tp the time to run it entirely in the fast (parallel) mode. If f is the fraction of the program that can be run in parallel (so 1-f is the serial or sequential fraction), then the time using the two modes is

t = f*tp + (1-f)*t1

and the minimum rate is r1 = 1/t1, the maximum rate is rp = 1/tp, and the net rate of execution is r = 1/t.

Some simple algebra shows that the fraction of the maximum possible (all-parallel) rate that is actually realized is

g = r/rp = tp/t = 1/[f + (1-f)R]

where R = rp/r1 is the ratio of the two execution rates. The figure below shows this gain g for values of R equal to 2, 10, and 20:

[Figure: the Ware/Amdahl gain g as a function of the parallel fraction f, for R = 2, 10, and 20]
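
The numbers behind such a figure are easy to tabulate; the Python sketch below simply evaluates g(f, R) = 1/[f + (1-f)R] for the three values of R shown:

    # Ware/Amdahl gain: realized fraction of the all-parallel rate,
    # g = 1 / (f + (1-f)*R), for parallel fraction f and rate ratio R = rp/r1.
    def gain(f, R):
        return 1.0 / (f + (1.0 - f) * R)

    for R in (2, 10, 20):
        row = "  ".join(f"f={f:.2f}: g={gain(f, R):.3f}"
                        for f in (0.0, 0.5, 0.9, 0.99, 1.0))
        print(f"R={R:2d}  {row}")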

Assuming that p processors together run p times faster than one processor (so R = p), to get a 50% effective gain you need over 90% of your code running in parallel. Under those assumptions, the achievable speedup is

Sp = t1/[(1-f)t1 + f(t1/p)] = p/[(1-f)p + f]

and so the parallel efficiency is Ep = Sp/p = 1/[1 + (1-f)(p-1)].
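
To make the "over 90%" claim concrete, here is a small Python sketch that evaluates these two expressions for a few illustrative values of f at p = 16 (the numbers are chosen only for illustration):

    # Amdahl speedup and efficiency for parallel fraction f on p processors.
    def amdahl_speedup(f, p):
        return p / ((1.0 - f) * p + f)

    def amdahl_efficiency(f, p):
        return 1.0 / (1.0 + (1.0 - f) * (p - 1))

    p = 16
    for f in (0.5, 0.9, 0.95, 0.99):
        print(f"f={f:.2f}: Sp={amdahl_speedup(f, p):5.2f}, "
              f"Ep={amdahl_efficiency(f, p):.2f}")

Even with 90% of the code parallel, the efficiency at p = 16 is only 40%.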

Note that this model assumes the overhead due to synchronization, interprocessor communication, locks on shared resources, etc. remains fixed regardless of the number of processors. A more realistic model would add another term that is a monotonically increasing function of p; at best, that term grows logarithmically in p (a tree-structured barrier or reduction, for example).
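
As a rough illustration of what such a term does, the sketch below adds a hypothetical overhead of c*log2(p), expressed as a fraction of the serial time, to the Amdahl model; the constant c is invented for illustration:

    import math

    # Amdahl model extended with a hypothetical overhead term c*log2(p),
    # expressed as a fraction of the serial time t1 (normalized to 1).
    # The constant c is made up for illustration only.
    def speedup_with_overhead(f, p, c=0.01):
        parallel_time = (1.0 - f) + f / p + c * math.log2(p)
        return 1.0 / parallel_time

    f = 0.99
    for p in (2, 16, 128, 1024):
        print(f"p={p:5d}: Sp={speedup_with_overhead(f, p):6.2f}")

With these made-up numbers the speedup peaks around p ≈ 70 and then declines as more processors are added.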

Gustafson proposed a different model in the 1980s, starting from the parallel execution time as the base unit. That model is less pessimistic.
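
In the usual statement of Gustafson's model, the serial fraction s is measured as a fraction of the time on the parallel machine, and the scaled speedup is s + (1-s)*p. The sketch below compares this with Amdahl's formula; note that the two fractions are measured against different baselines, so the comparison is only illustrative:

    # Gustafson's scaled speedup: s is the serial fraction of the time
    # measured on the parallel machine (not of the serial run).
    def gustafson_speedup(s, p):
        return s + (1.0 - s) * p

    def amdahl_speedup(f, p):     # f = parallel fraction of the serial run
        return p / ((1.0 - f) * p + f)

    s = 0.05                      # 5% serial time on the parallel machine
    for p in (16, 256, 4096):
        print(f"p={p:5d}: Gustafson={gustafson_speedup(s, p):8.1f}, "
              f"Amdahl(f=0.95)={amdahl_speedup(0.95, p):6.1f}")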

Both models are often inappropriate because they assume the parallel fraction of a program remains fixed. Much algorithmic research in parallel computing consists of trying to "break the law". An approach that is often fruitful uses John Rice's idea of a polyalgorithm, where the algorithm used is changed as the problem size or the machine changes.

Scalability

The term "scalability" is another often-abused term in parallel computing, and means different things to different people. A "scalable algorithm" is one for which the parallel efficiency remains bounded away from 0 for all values of p; Ep ≥ α > 0, as p increases. Amdahl's Law says there are no scalable algorithms, except for perfectly parallel ones - but if an algorithm is completely parallel, then really it is simply processing totally decoupled problems, something usually called multiprogramming. A problem with this definition of scalability is that it assumes there will always be machines with more and more processors. However, there is good reason to believe that no one will ever build a machine with more than 1060 processors. And quantum computers don't count here - they work with a form of massive parallelism for which programming methodologies are not yet available.

Another problem arises. Consider adding two vectors of length n. That is a completely parallel operation, but once the number of processors p exceeds n, nothing is gained from adding more processors: a vector update of length 32 may run on a 32-processor machine with perfect parallelism, but on 64 processors the efficiency is at best 50%. Larger machines, however, are typically used for larger problems, not just to run the same problem faster. So a variant on scalability is "scaled speedup", in which the problem size n(p) is allowed to grow with the number of processors so that the algorithm can keep all of them busy. One common way is to fix the amount of memory per processor at M and, as more processors are brought to bear, increase n(p) to keep M constant. This leads to the concept of scaled speedup and efficiency.

Define the scaled speedup as T(1, n(p)) / T(p, n(p)) and the scaled efficiency as T(1, n(p)) / (p * T(p, n(p))). For the matrix multiplication update C = C + A*B, where all the matrices are square of order N = N(p), N should satisfy 3*N^2 = p*M, where M is the (fixed) amount of memory used per processor, counted in matrix entries. This means the order of the matrix needs to grow as sqrt(p) as p increases.
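
A small sketch of this memory-constrained scaling rule, with a made-up per-processor budget M (counted in matrix entries):

    import math

    # Memory-constrained scaling for C = C + A*B: three N-by-N matrices
    # must fit in p*M entries, so 3*N^2 = p*M. M is an illustrative budget.
    M = 1_000_000   # entries per processor (made up)

    def matrix_order(p, M=M):
        return int(math.sqrt(p * M / 3.0))

    for p in (1, 4, 16, 64):
        print(f"p={p:3d}: N={matrix_order(p)}")   # N grows like sqrt(p)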

A problem with this for matrix multiplication is that the number of floating point operations grows as 2*N^3 while the data size grows only as 3*N^2. So in this case the function n(p) would need to be changed so that the amount of work, rather than memory, per processor is fixed as p increases.
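
For comparison, a work-constrained rule would hold 2*N^3/p fixed, so N grows only like the cube root of p; a sketch with an illustrative baseline order N0:

    # Work-constrained scaling: keep flops per processor, 2*N^3 / p, constant.
    # The baseline order N0 on one processor is made up for illustration.
    N0 = 1000
    work_per_proc = 2 * N0**3          # flops each processor should do

    def matrix_order_constant_work(p):
        return round((work_per_proc * p / 2) ** (1.0 / 3.0))

    for p in (1, 8, 64, 512):
        print(f"p={p:3d}: N={matrix_order_constant_work(p)}")  # N grows like p**(1/3)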

The scaled speedup is a good measure whenever it is applicable; it factors out the effect of the additional memory brought to bear on the problem, eliminating one spurious source of superlinear speedup. However, it is not always possible or practical to (re)define the problem size in this way. For example, growing the problem for a PDE solved on a mesh requires refining the mesh, and such refinements may introduce new physical effects with deleterious (or advantageous!) consequences for an iterative solver, requiring many more or many fewer iterations, for example.