|
Unless your application is in the large class of applications,
you must be exploiting parallelism to get benefit from a high
performance computer. Even if you are using large amounts of memory,
if you are running on only one processor, you are effectively starving
the other processors of memory. A reasonable resource allocation
system would charge you for using 50% of the machine if you use 50%
of its memory, regardless of whether you're using 1 processor, or half
of the processors. Clearly, if you can exploit parallelism, your job
will use your allocated resources more efficiently. However, there is a trade
off between efficiency of resource usage, and how much effort is
required to parallelise your program.
So parallelism is at the core for using high performance
computers. What are the main parallel programming paradigms available?
It is convenient to introduce the terms coarse grained and fine grained parallelism. Coarse grained parallelism is a top down
parallelism, parallelising the outer loop as it were. Fine grained
parallelism is the opposite -- parallelising the inner loop, or using
explicit vector type operations.
- Package
- If you are the user of a major commercial or freeware
package, there may well already be an option to run the package in
parallel. Sometime, this may require the purchase of an additional
license. Most of these packages will only use SMP or vector computers.
- Library
- If you can code your application to use algorithms
available in standard numerical libraries, then often a parallel
version of the algorithm can be obtained by linking against the
vendor supplied version of the library. The classic example of this
type of thing is to use
LAPACK
calls
for performing linear algebra operations. These can be called from C
and Fortran, and public domain implementations are available for code
development on your workstation. Matlab code can be converted to use
the Octave interpreter, which in turn (since it is open source) can
be linked against the vendor supplied LAPACK library. These
libraries are mostly available for SMP or vector computers, although
some exist for distributed memory computers, eg
ScaLAPACK.
- Shell or Queue level parallelisation
- This is really breaking a
task up into a number of coarse grained tasks that can be submitted
to a batch queue, or spawned from a shell using the backgrounding
operator
&. Communication between processes (if any) take
place via persistent objects -- usually files, but sometimes named pipes.
- Autoparallelisation
- On an SMP or vector computer, the compiler
can often automatically parallelise your code by specifying a
compiler flag. This is fine-grained parallelism.
- OpenMP
- Often, a compiler will refuse to parallelise a
particular loop because of assumed data dependencies. You as a
programmer must supply ``hints'' to the compiler, in the form of
compiler directives. One model for doing this is
OpenMP, which has been
adopted as an industry standard. The advantage of OpenMP is incremental parallelism. You can profile your code, and concentrate
changes on those parts of the code that consume most time, without
needing to change other parts of the code. OpenMP requires a shared
memory.
- Data Parallel
- programming uses explicit array-based commands
and structures to explicitly specify fine-grained parallelism. For
example, a whole vector of sines may be computed by the statement
y=sin(x), where x and y are array variables. The
languages include data distribution statements, which allow the
programmer to control which data goes on what processor to minimise
the amount of communication between processors. Data Parallel
programming can be used on distributed memory systems (ie clusters),
and thus is more scalable than OpenMP programming, but at a cost of
greater programming complexity, particularly when porting legacy
codes. The main languages used for this are High
Performance Fortran (HPF)
and HPC++.
- Message Passing
- is a ``lower level'' programming model, where
each thread has its own memory space and the programmer explicitly
controls the communication between threads.2 This model is
ideal for implementing master-slave type calculations, but is
otherwise quite complex to program. The industry standard interface
for message passing is
MPI,
which is a library callable from C or Fortran. Numerous class
libraries for C++ built on top of MPI can be found which ease the
use of MPI in C++. An alternative programming interface is called
PVM, which
should be considered a legacy environment now.
- Threads
- At a lower level than OpenMP, one can use a
threads
library, either natively (eg Solaris threads or Irix threads), or
Posix threads. There is little to recommend a threads application over
writing the application in OpenMP (unless of course you are writing an
OpenMP compiler!).
- System V IPC
- An alternative to threads (which share the same
address space), is to program using processes. Communication between
processes can be implemented using the shared memory segments, locks
and message queues of System V Interprocess
Communication
(IPC), or using
persistent objects such as pipes and files.
The order in which I have listed the strategies above is roughly going
from simplest to use to most difficult. If you have an application
already parallelised for you, or you can use a standard library, then
this is great. Alternatively, if you have a many style
calculation with next to no communication requirements, such as an
ensemble or parameter space calculation, you write a sequential
program that computes a subtask, then call this from a shell scripts
that can be submitted to the batch queue. Otherwise, a coarse-grained
calculation is perhaps best implemented as an MPI application. An
example of the latter might be domain decomposition, particularly if
the domains are irregular in size. (A regular domain decomposition may
fit one of the data parallel paradigms). Fine grained strategies on
the other hand are best handled by autoparallelisation + OpenMP
techniques, unless the problem is so big that a distributed memory
computer is required. One important point to note is that SMP
computers have large overheads in setting up threads for a parallel
loop -- effectively requiring loop counts of order thousands of
iterations for the parallelisation to be worthwhile. Vector computers
can often deliver acceptable speeds with much smaller loop counts.
Footnotes
- ... threads.2
- Threads are
lightweight cousins of processes, and basically refers to an
independent execution path through the program
Aegis
2008-01-21 |