

The PGI shared-memory parallel C/C++ programming model is defined by a collection of compiler pragmas, library routines, and environment variables that can be used to specify shared-memory parallelism in C/C++ programs. The pragmas include a parallel region construct for writing coarse grain SPMD programs, work-sharing constructs which specify that for loop iterations should be split among the available threads of execution, and synchronization constructs. The data environment is controlled using clauses on the pragmas or with additional pragmas. Run-time library routines are provided to query the parallel runtime environment, for example to determine how many threads are participating in execution of a parallel region. Finally, environment variables are provided to control the execution behavior of parallel programs.
Parallelization pragmas in a C or C++ program are interpreted by the PGCC C and C++ compilers when the option -mp is specified on the pgcc or pgCC command line. The form of a parallelization pragma is:
#pragma pragma_name [clauses]
With the exception of the #, the pragma must start in column 1 (one), and must appear as a single word without embedded white space. Standard C syntax restrictions apply to the pragma line.
The order in which clauses appear in parallelization pragmas is not significant. Commas separate clauses within the pragmas, but commas are not allowed between the pragma name and the first clause. Clauses may be repeated subject to the restrictions listed in the description of each clause. C/C++ pragmas apply to the first statement after the pragma only. Multi-statement parallel regions can be implemented using curly braces:
#pragma parallel
{
...
}
is a parallel region that can encapsulate any number of statements. When in a parallel region, pragmas such as critical, pfor, and one processor have no implied barrier. When exiting the parallel region, a barrier is implied as control is returned to the main thread of execution.
Jumping into or out of parallel regions or subregions is not supported. Multiple directives can be put on a single line or on multiple consecutive lines, as long as all the directives apply to the next statement. Most of the pragmas cannot be nested, and any attempts to nest pragmas will result in all but the outermost pragma being ignored.
In the examples given with each section, the routines omp_get_num_threads() and omp_get_thread_num() are used (refer to section 11.7, Run-time Library Routines). They return the number of threads currently in the team executing the parallel region and the thread number within the team, respectively. Within a C program unit that calls these functions, they must be declared as follows:
extern int omp_get_thread_num(), omp_get_num_threads();
Within a C++ program unit, they must be declared as follows:
extern "C" {
extern int omp_get_thread_num(), omp_get_num_threads();
}
Syntax:
#pragma parallel [shared(shared_list) local(local_list)]
or
#pragma parallel
#pragma shared(shared_list)
#pragma local(local_list)
The next statement is a parallel region. All variables are shared by default. Any variables listed in local_list are local to each thread within the parallel region (i.e. private). Variables declared within the parallel region are shared by default, not private; since their scope is only within the parallel region they can't be declared local. Automatic variables declared within the parallel region are private. Static variables declared within the parallel region are shared.
Example:
#include <stdio.h>
extern int omp_get_thread_num(), omp_get_num_threads();
void main() {
int a[2];
a[0] = -1;
a[1] = -1;
#pragma parallel
a[omp_get_thread_num()] = omp_get_thread_num();
printf("a[0]=%d, a[1]=%d\n",a[0],a[1]);
return;
}
The variables specified in a local clause are private to each thread in a team. In effect, the compiler creates a separate copy of each of these variables for each thread in the team. When an assignment to a private variable occurs, each thread assigns to its local copy of the variable. When operations involving a private variable occur, each thread performs the operations using its local copy of the variable. Other important points to note about private variables are the following:
Example:
/* managing my own parallel thread */
#include <stdio.h>
extern int omp_get_thread_num(), omp_get_num_threads();
void main() {
int a[2], b[10], i, start, end, index;
a[0] = -1;
a[1] = -1;
for (i=0;i<10;i++) b[i] = 0;
#pragma parallel shared (b)
#pragma local(i,start,end,index)
{
a[omp_get_thread_num()] = omp_get_thread_num();
i = omp_get_num_threads();
start = omp_get_thread_num() * (10/i);
end = start + (10/i);
for(index = start; index < end; index++) {
b[index] = 1000*omp_get_thread_num() + index;
}
}
printf("a[0] = %d a[1]=%d\n",a[0],a[1]);
for(i=0; i<10; i++) printf("b[%d] = %d\n", i, b[i]);
return;
}
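The manual partition above gives each thread 10/i iterations. Because that is an integer division, trailing iterations are left unassigned whenever the thread count does not divide the loop length evenly (with three threads, index 9 is never written). The following is a minimal sketch of a helper that spreads the remainder over the first threads. The name block_range and its parameters are hypothetical and are not part of the PGI runtime.

```c
/* Hypothetical helper, not part of the PGI runtime library.
 * Computes the half-open range [*start, *end) of iterations that
 * thread `tid` (0 <= tid < nthreads) should handle out of `n` total.
 * The first (n % nthreads) threads each take one extra iteration,
 * so every iteration is assigned to exactly one thread. */
void block_range(int tid, int nthreads, int n, int *start, int *end)
{
    int base  = n / nthreads;   /* iterations every thread gets    */
    int extra = n % nthreads;   /* leftover iterations to hand out */

    *start = tid * base + (tid < extra ? tid : extra);
    *end   = *start + base + (tid < extra ? 1 : 0);
}
```

Inside the parallel region, a call such as block_range(omp_get_thread_num(), omp_get_num_threads(), 10, &start, &end) could replace the two assignments to start and end, with start and end still listed in the local clause.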
Syntax:
#pragma critical
The critical pragma defines a subsection of code within a parallel region, referred to as a critical section, which will be executed one thread at a time. The first thread to arrive at a critical section will be the first to execute the code within the section. The second thread to arrive will not begin execution of statements in the critical section until the first thread has exited the critical section. Likewise each of the remaining threads will wait its turn to execute the statements in the critical section.
Critical sections cannot be nested, and any such specifications are ignored. Branching into or out of a critical section is illegal.
Example:
/* critical section example */
extern int omp_get_thread_num(), omp_get_num_threads();
void main() {
int a[2000], i, mx;
#pragma parallel local(a, i)
{
. . .
#pragma critical
{
for (i=0; i<2000; i++) {
if (a[i] > mx) {
mx = a[i];
}
}
}
. . .
}
. . .
}
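A common use of a critical section is to combine per-thread partial results into one shared value. The sketch below assumes the pragmas behave as described in this chapter. Each thread scans its share of a shared array under pfor, keeps a private maximum, and then folds that maximum into the shared result one thread at a time. The function name parallel_max and the variable local_mx are hypothetical. A compiler that does not recognize the pragmas ignores them, and the function then computes the same result serially.

```c
#include <limits.h>

/* Hypothetical example: return the largest element of a[0..n-1],
 * or INT_MIN when n is 0. The array a and the result mx are shared;
 * the loop index i and the partial maximum local_mx are private. */
int parallel_max(const int *a, int n)
{
    int mx = INT_MIN;
    int i, local_mx;

#pragma parallel shared(mx) local(i, local_mx)
    {
        local_mx = INT_MIN;

        /* Each thread examines only its portion of the iterations. */
#pragma pfor
        for (i = 0; i < n; i++) {
            if (a[i] > local_mx) {
                local_mx = a[i];
            }
        }

        /* Only one thread at a time may update the shared maximum. */
#pragma critical
        {
            if (local_mx > mx) {
                mx = local_mx;
            }
        }
    }
    /* The barrier implied at the end of the parallel region ensures
     * all threads have merged their results before mx is read. */
    return mx;
}
```

Keeping the scan outside the critical section means the threads serialize only for the final comparison, not for the whole loop.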
Syntax:
#pragma one processor
This pragma causes the next statement to be executed by the master thread only. Note that this is not the first thread to arrive at the statement, but rather the main thread (thread 0). There is no barrier on exit, and jumping into or out of a one processor region is not supported.
Example:
/* one processor pragma example */
#include <stdio.h>
extern int omp_get_thread_num(), omp_get_num_threads();
void main() {
int a[2], b[100], c[100], d[100];
int i, j, k, l, m, n, index;
a[0] = -1;
a[1] = -1;
for (i=0; i<100; i++) b[i] = 0;
for (i=0; i<100; i++) c[i] = -1;
for (i=0; i<100; i++) d[i] = -1;
#pragma parallel shared (b,c,d) local(j,index)
{
a[omp_get_thread_num()] = omp_get_thread_num();
j = 500*omp_get_thread_num();
#pragma pfor
for (index=0; index<100; index++) {
b[index] = j + index;
}
#pragma one processor
printf("You should only see this once\n");
}
return;
}
Syntax:
#pragma pfor
The real purpose of supporting parallel execution is the distribution of work across the available threads. The user can explicitly manage work distribution with constructs such as:
if (omp_get_thread_num() == 0) {
...
}
else if (omp_get_thread_num() == 1) {
...
}
However, these constructs are not in the form of pragmas. The pfor pragma provides a convenient mechanism for the distribution of loop iterations across the available threads in a parallel region.
The pfor pragma directs the PGCC C and C++ compilers to distribute the iterative for loop immediately following the pfor pragma across the threads available to the program. The for loop is executed in parallel by the team that was started by an enclosing parallel region. pfor pragmas may not be nested. Branching into or out of a pfor loop is not supported.
By default, there is no implicit barrier after the end of the parallel loop; the first thread to complete its portion of the work will not wait until the other threads have finished their portion of work (unless the end of the parallel region is reached). A synchronize pragma should be used to create a barrier.
Other items to note about pfor loops:
Example:
/* letting pfor manage the loop index */
#include <stdio.h>
extern int omp_get_thread_num(), omp_get_num_threads();
void main() {
int a[2], b[10], i, index;
a[0] = -1;
a[1] = -1;
for (i=0; i<10; i++) b[i] = 0;
#pragma parallel shared(b)
#pragma local(index)
{
a[omp_get_thread_num()] = omp_get_thread_num();
#pragma pfor
for(index=0; index<10; index++) {
b[index] = 1000*omp_get_thread_num() + index;
}
}
printf("a[0] = %d, a[1] = %d\n",a[0],a[1]);
for(i=0; i<10; i++) printf("b[%d] = %d\n", i, b[i]);
return;
}
/* combining parallel and pfor */
#include <stdio.h>
extern int omp_get_thread_num(), omp_get_num_threads();
void main() {
int b[10], i, index;
for (i=0; i<10; i++) b[i] = 0;
#pragma parallel shared(b) local(i, index) pfor
for(index=0; index<10; index++) {
b[index] = 1000*omp_get_thread_num() + index;
}
for(i=0; i<10; i++) printf("b[%d] = %d\n", i, b[i]);
return;
}
...
Syntax:
#pragma synchronize
There may be occasions in a parallel region when it is necessary that all threads complete work to that point before any thread is allowed to continue. The synchronize pragma synchronizes all threads at such a point (a barrier) in a program.
Multiple barrier points are allowed within a parallel region. The synchronize directive must either be executed by all threads executing the parallel region or by none at all.
Example:
/* synchronize pragma example */
#include <stdio.h>
extern int omp_get_thread_num(), omp_get_num_threads();
void main() {
int a[2], b[5000], c[5000], d[5000], i, j, k, l, index;
a[0] = -1;
a[1] = -1;
for (i=0; i<5000; i++) b[i] = 0;
#pragma parallel
j = omp_get_num_threads();
if (j==1) k = 0;
if (j==2) k = 10000;
for (i=0; i<2500; i++) c[i] = i;
for (i=2500; i<5000; i++) c[i] = k + i;
#pragma parallel shared (b,d) local(j,k,l,index)
{
a[omp_get_thread_num()] = omp_get_thread_num();
j = 10000*omp_get_thread_num();
#pragma pfor
for(index=0; index<5000; index++) {
b[index] = j + index;
}
/*
* without the next pragma, the next 'for loop' will
* cause the threads to work on their 'partner' threads
* results from the previous distributed loop.
*/
#pragma synchronize
k = 0;
l = 0;
#pragma pfor
for (l=0; l<5000; l++) {
k = 4999 - l;
d[k] = b[k];
}
}
k = 0;
l = 0;
printf("a[0] = %d, a[1]=%d\n",a[0],a[1]);
for (i=0; i<5000; i++) {
if( d[i] != c[i]) {
k = k + 1;
l = l + 1;
if (l<10) {
printf("!! expected d[%d] = %d, actual d[%d] = %d\n",i,c[i],i,d[i]);
}
}
}
printf("total errors = %d out of 5000\n",k);
return;
}
User-callable functions are available to the C/C++ programmer to query the parallel execution environment.
int omp_get_num_threads()
returns the number of threads in the team executing the parallel region from which it is called. When called from a serial region, this function returns 1. A nested parallel region is the same as a single parallel region.
int omp_get_thread_num()
returns the thread number within the team. The thread number lies between 0 and omp_get_num_threads()-1. When called from a serial region, this function returns 0. A nested parallel region is the same as a single parallel region.
The environment variable NCPUS specifies the number of threads to use during execution of parallel regions. The default value for this variable is the number of physical processors configured in the system on which the program is executed.
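A program may want to report the thread count it has been asked to use before it enters a parallel region. The following is a minimal sketch of a helper that interprets an NCPUS-style string. The function name ncpus_or_default, the fallback parameter, and the upper bound of 65536 are assumptions made for illustration. The sketch does not describe how the PGI runtime itself parses the variable.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the PGI runtime library.
 * Interprets `value` (for example, the result of getenv("NCPUS"))
 * as a positive decimal thread count. Returns `fallback` when the
 * string is missing, empty, not a number, or out of range. */
int ncpus_or_default(const char *value, int fallback)
{
    char *endp;
    long n;

    if (value == NULL || *value == '\0') {
        return fallback;
    }
    n = strtol(value, &endp, 10);
    /* Reject trailing garbage and implausible counts; the upper
     * bound here is an arbitrary illustrative limit. */
    if (*endp != '\0' || n < 1 || n > 65536) {
        return fallback;
    }
    return (int)n;
}
```

A typical call would be ncpus_or_default(getenv("NCPUS"), 1), where the fallback of 1 is only a placeholder; the run-time default described above is the number of physical processors.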

