There are many ways of taking advantage of machines with multiple processors. One characteristic of the many methods is the degree to which the developer modifies the code from his/her knowledge of what is going on.
Totally automatic methods may be as simple as using compiler flags (eg: -pca). An example of a semiautomatic method is the use of compiler directives in the code (eg: #pragma parallel). The user locates those portions of the code that may run in parallel and the compiler is given information on how that might be accomplished. A totally manual method is one where the developer actually writes the C code in parallel.
The first totally automatic method described above is certainly the easiest but it is also the one which is likely to yield the lowest speed increase. There is no harm trying totally automatic methods but the authors experience with -pca in particular has been unsatisfactory, there have been serious bugs with the PCA compiler in the past!
The following describes one method which was used to speed up existing
software to take advantage of a 12 processor SGI Power Challenge. The code
existed and was relatively simple, the time consuming parts consisting of
long "for" loops. The operations within these loops were independent
and thus chunks of the outermost loops could readily be split up and
run in parallel.
The particular approach was to use the routines
int m_get_numprocs(void);
int m_set_procs (int numprocs);
int m_fork (void (*func)(), ...);
int m_get_myid();
int m_kill_procs();
The basic idea is as follows, consider a C function which does something
significant inside a large for loop. Note the code presented here is
intentionally made trivial in order to illustrate the ideas.
void DoSomething(void)
{
int i;
for (i=0;i<N;i++) {
do something, preferably significant, N preferable large
where the i1th iteration is independent of any other i2th iteration
}
}
This can be modified to work with m_fork() as shown below:
void DoSomething(void)
{
int i;
int nproc,theid;
nproc = m_get_numprocs();
theid = m_get_myid();
for (i=theid*N/nproc;i<(theid+1)*N/nproc;i++) {
whatever was in the loop before goes here unmodified
}
}
This obviously relies on the fact that the process id's increase by
1 from 0 (parent).
The code that calls this function might do something like
m_set_procs(7);
flag = m_fork((void (*)())DoSomething);
The for loop is simply split across 7 processes which on a multiple
processor machine will be different processors. Interestingly even on
a single processor machine the above can yield significant speed improvements
as a result of cache coherency.
The following shows the speed up factor for a 12 processor machine under "commonly" encountered load conditions. The loop size is increased in powers of 10 from 100, the number of processes increased from 1 to 14. The lines plotted are the ratio of the time taken to the time taken for a single processor.

#include <sys/resource.h>
#include <sys/time.h>
double GetRunTime(void)
{
struct rusage ru;
double sec = 0;
getrusage(RUSAGE_SELF,&ru);
sec += ru.ru_utime.tv_sec;
sec += ru.ru_utime.tv_usec * 1e-6;
sec += ru.ru_stime.tv_sec;
sec += ru.ru_stime.tv_usec * 1e-6;
return(sec);
}
It returns the time in seconds, call before and after a piece of code,
and take the difference.
M_FORK(3P) M_FORK(3P)
NAME
m_fork, m_kill_procs, m_set_procs, m_get_numprocs, m_get_myid, m_next,
m_lock, m_unlock, m_park_procs, m_rele_procs, m_sync - parallel
programming primitives
C SYNOPSIS
#include <ulocks.h>
#include <task.h>
int m_fork (void (*func)(), ...);
int m_kill_procs (void);
int m_set_procs (int numprocs);
int m_get_numprocs (void);
int m_get_myid (void);
unsigned m_next (void);
void m_lock (void);
void m_unlock (void);
int m_park_procs (void);
int m_rele_procs (void);
void m_sync (void);
DESCRIPTION
The m_fork routine creates n-1 processes that execute the given func in
parallel with the calling process. The processes are created using the
sproc(2) system call, and share all attributes (virtual address space,
file descriptors, uid, etc.). Once the processes are all created, they
each start executing the subprogram func taking up to six arguments as
passed to m_fork. The arguments passed must not be larger than pointers
in size, i.e. floating point numbers must be passed by reference.
The processes execute the subprogram and wait until they all return, at
which point the m_fork call returns. The other threads are left busy-
wait spinning until the master again calls m_fork. The overhead of
creating the additional process was done in the first call to m_fork and
is not repeated. m_fork sets a flag (see prctl(2)) so that if any of the
threads terminate, they will all receive a SIGTERM signal. Certain
operations such as profiling require that the various threads exit
normally (via the exit(2)) call. This can best be done by using
m_kill_procs to terminate the slave threads, then have the master itself
exit.
The number of subtasks n can be set and queried using m_set_procs and
m_get_numprocs, where the default is the number of processors in the
system (and hence the maximum number of processes that can truly be run
in parallel). Note that although up to 256 tasks may be requested, the
busy-wait rendezvous mechanism will usually result in a large performance
loss with any more threads than processors.
When the processes are created, each is assigned a unique thread
identifier (its tid). This identifier can be obtained through
m_get_myid. Thread id's range from 0 to n-1.
A global counter and a global lock are provided to simplify
synchronization between the processes. On each m_fork call, the counter
is reset to zero. The counter value is gotten and post incremented
through the m_next routine. The first time m_next is called, it returns
a zero. The global lock is set and unset through m_lock and m_unlock.
The m_park_procs and m_rele_procs are provided to suspend and resume the
child processes created by m_fork. This is useful if you have a phase of
the program where the parent will do setup or reinitialization code and
you do not want to have the children spinning and wasting resources.
m_park_procs should not be called when processes are already suspended.
m_sync is provided to synchronize all threads at some point in the code.
When m_sync is called by each thread, it waits at that point for all
other threads to call m_sync. The global counter is reset, and all
threads resume after the m_sync call.
m_kill_procs terminates the processes created from the previous m_fork.
Most errors from m_fork come from either sproc(2) or usinit(3P). See
those manual pages for more information on errors.
NOTES
These primitives are based on the Sequent Computer Systems parallel
programming primitives, but may not conform to all Sequent semantics.
Diagnostic output analyzing the failures in m_lock(3P) and the routines
that m_fork calls can be gotten using the _utrace mechanism discussed in
the usinit(3P) manual page.
SEE ALSO
sproc(2), blockproc(2), prctl(2), barrier(3P), usinit(3P), ussetlock(3P).
DIAGNOSTICS
m_fork, m_set_procs, m_park_procs, m_rele_procs, and m_kill_procs return
a 0 when successful, and a -1 with errno set upon failure.
m_get_numprocs, m_get_myid, and m_next all return integers. m_lock,
m_unlock, and m_sync return no value.