Parallel Programming Models

Unless your application is in the large class of applications, you must be exploiting parallelism to get benefit from a high performance computer. Even if you are using large amounts of memory, if you are running on only one processor, you are effectively starving the other processors of memory. A reasonable resource allocation system would charge you for using 50% of the machine if you use 50% of its memory, regardless of whether you're using 1 processor, or half of the processors. Clearly, if you can exploit parallelism, your job will use your allocated resources more efficiently. However, there is a trade off between efficiency of resource usage, and how much effort is required to parallelise your program.

So parallelism is at the core for using high performance computers. What are the main parallel programming paradigms available? It is convenient to introduce the terms coarse grained and fine grained parallelism. Coarse grained parallelism is a top down parallelism, parallelising the outer loop as it were. Fine grained parallelism is the opposite -- parallelising the inner loop, or using explicit vector type operations.

Package
If you are the user of a major commercial or freeware package, there may well already be an option to run the package in parallel. Sometime, this may require the purchase of an additional license. Most of these packages will only use SMP or vector computers.

Library
If you can code your application to use algorithms available in standard numerical libraries, then often a parallel version of the algorithm can be obtained by linking against the vendor supplied version of the library. The classic example of this type of thing is to use LAPACK calls for performing linear algebra operations. These can be called from C and Fortran, and public domain implementations are available for code development on your workstation. Matlab code can be converted to use the Octave interpreter, which in turn (since it is open source) can be linked against the vendor supplied LAPACK library. These libraries are mostly available for SMP or vector computers, although some exist for distributed memory computers, eg ScaLAPACK.

Shell or Queue level parallelisation
This is really breaking a task up into a number of coarse grained tasks that can be submitted to a batch queue, or spawned from a shell using the backgrounding operator &. Communication between processes (if any) take place via persistent objects -- usually files, but sometimes named pipes.

Autoparallelisation
On an SMP or vector computer, the compiler can often automatically parallelise your code by specifying a compiler flag. This is fine-grained parallelism.

OpenMP
Often, a compiler will refuse to parallelise a particular loop because of assumed data dependencies. You as a programmer must supply ``hints'' to the compiler, in the form of compiler directives. One model for doing this is OpenMP, which has been adopted as an industry standard. The advantage of OpenMP is incremental parallelism. You can profile your code, and concentrate changes on those parts of the code that consume most time, without needing to change other parts of the code. OpenMP requires a shared memory.

Data Parallel
programming uses explicit array-based commands and structures to explicitly specify fine-grained parallelism. For example, a whole vector of sines may be computed by the statement y=sin(x), where x and y are array variables. The languages include data distribution statements, which allow the programmer to control which data goes on what processor to minimise the amount of communication between processors. Data Parallel programming can be used on distributed memory systems (ie clusters), and thus is more scalable than OpenMP programming, but at a cost of greater programming complexity, particularly when porting legacy codes. The main languages used for this are High Performance Fortran (HPF) and HPC++.

Message Passing
is a ``lower level'' programming model, where each thread has its own memory space and the programmer explicitly controls the communication between threads.2 This model is ideal for implementing master-slave type calculations, but is otherwise quite complex to program. The industry standard interface for message passing is MPI, which is a library callable from C or Fortran. Numerous class libraries for C++ built on top of MPI can be found which ease the use of MPI in C++. An alternative programming interface is called PVM, which should be considered a legacy environment now.

Threads
At a lower level than OpenMP, one can use a threads library, either natively (eg Solaris threads or Irix threads), or Posix threads. There is little to recommend a threads application over writing the application in OpenMP (unless of course you are writing an OpenMP compiler!).

System V IPC
An alternative to threads (which share the same address space), is to program using processes. Communication between processes can be implemented using the shared memory segments, locks and message queues of System V Interprocess Communication (IPC), or using persistent objects such as pipes and files.

The order in which I have listed the strategies above is roughly going from simplest to use to most difficult. If you have an application already parallelised for you, or you can use a standard library, then this is great. Alternatively, if you have a many style calculation with next to no communication requirements, such as an ensemble or parameter space calculation, you write a sequential program that computes a subtask, then call this from a shell scripts that can be submitted to the batch queue. Otherwise, a coarse-grained calculation is perhaps best implemented as an MPI application. An example of the latter might be domain decomposition, particularly if the domains are irregular in size. (A regular domain decomposition may fit one of the data parallel paradigms). Fine grained strategies on the other hand are best handled by autoparallelisation + OpenMP techniques, unless the problem is so big that a distributed memory computer is required. One important point to note is that SMP computers have large overheads in setting up threads for a parallel loop -- effectively requiring loop counts of order thousands of iterations for the parallelisation to be worthwhile. Vector computers can often deliver acceptable speeds with much smaller loop counts.



Footnotes

... threads.2
Threads are lightweight cousins of processes, and basically refers to an independent execution path through the program


Aegis
2008-01-21