Parallelization |
10 |
![]() |
Note - Parallelization features are only available on SPARC platforms running Solaris 2.x, and require a Sun WorkShop license.
DO loop, to run over multiple processors with a potentially significant execution speedup.Before an application program can be run efficiently on a multiprocessor system like the SPARCstation 10 or SPARCcenter 2000, it needs to be multithreaded. That is, tasks that can be performed in parallel need to be identified and reprogrammed to distribute their computations.
Multithreading an application can be done manually by making appropriate calls to the libthread primitives. However, a significant amount of analysis and reprogramming may be required. (See the Solaris Multithreaded Programming Guide for more information.)
Sun compilers can automatically generate multithreaded object code that to run on multiprocessor systems. The Fortran compilers focus on
DO loops as the primary language element supporting parallelism. Parallelization distributes the computational work of a loop over several processors without requiring modifications to the Fortran source program. Choice of which loops to parallelize and how they should be distributed can be left entirely up to the compiler (-autopar), determined explicitly by the programmer with source code directives (-explicitpar), or done in combination (-parallel).
Not all loops in a program can be profitably parallelized. Loops containing only a small amount of computational work (compared to the overhead spent starting and synchronizing parallel tasks) may actually run slower when parallelized. Also, some loops cannot be safely parallelized at all; they would compute different results when run in parallel due to dependencies between statements or iterations.
Note - Programs that do their own (explicit) thread management should not be compiled with any of the compiler's parallelization options. Explicit multithreading (calls to libthread primitives) cannot be combined with routines compiled with these parallelization options.
Sun compilers can detect loops that may be safely and profitably parallelized automatically. However, in most cases, the analysis is necessarily conservative, due to the concern for possible hidden side effects. (A display of which loops were and were not parallelized can be produced by the
-loopinfo option.) By inserting source code directives before loops, you can explicitly influence the analysis, controlling how a specific loop is (or is not) to be parallelized. However, it then becomes your responsibility to ensure that such explicit parallelization of a loop does not lead to incorrect results.
Probably not. It can be shown (by Amdahl's law) that the overall speedup of a program is strictly limited by the fraction of the execution time spent in code running in parallel. This is true no matter how many processors are applied. In fact, if c is the percentage of the execution time run in parallel, the theoretical speedup limit is 100/(100-c); therefore, if only 60% of a program runs in parallel, the maximum increase in speed is 2.5, independent of the number of processors. And with just four processors, the theoretical speedup for this program (assuming maximum efficiency) would be just 1.8 and not 4. With overhead, the actual speedup would be less.
As with any optimization, choice of loops is critical. Parallelizing loops that participate only minimally in the total program execution time has only minimal effect. To be effective, the loops that consume the major part of the run time must be parallelized. The first step, therefore, is to determine which loops are significant and to start from there.
Problem size also plays an important role in determining the fraction of the program running in parallel and consequently the speedup. Increasing the problem size increases the amount of work done in loops. A triply nested loop could see a cubic increase in work. If the outer loop in the nest is parallelized, a small increase in problem size could contribute to a significant performance improvement (compared to the unparallelized performance).
DO I=2,N A(I) = A(I-1)*B(I)+C(I) END DO |
requires the value computed for
A(I) in the previous iteration to be used (as A(I-1)) in the current iteration. To produce results running each iteration in parallel that are the same as with single processor, iteration I must complete before iteration I+1 can execute. Reduction
Reduction operations reduce the elements of an array into a single value. For example, summing the elements of an array into a single variable involves updating that variable in each iteration:
DO K = 1,N SUM = SUM + A(I)*B(I) END DO |
If each processor running this loop in parallel takes some subset of the iterations, the processors will interfere with each other, overwriting the value in
SUM. For this to work, each processor must execute the summation one at a time, although the order is not significant. Indirect Addressing
Loop dependencies can result from stores into arrays that are indexed in the loop by subscripts whose values are not known. For example, indirect addressing could be order dependent if there are repeated values in the index array:
DO L = 1,NW A(ID(L)) = A(L) + B(L) END DO |
In the preceding, repeated values in
ID cause elements in A to be overwritten. In the serial case, the last store is the final value. In the parallel case, the order is not determined. The values of A(L) that are used, old or updated, are order dependent. Data Dependent Loops
It may be possible to rewrite a loop to eliminate data dependencies, making it parallelizable. However, extensive restructuring may be required.
The following tables list f77 and f90 parallel directives.
|
f90 |
Parallel Directives |
Purpose |
|
|
!MIC$ DOALL optional qualifiers |
Parallelize next loop, if possible |
-reduction requires -autopar.
-autopar includes -depend and loop structure optimization.
-parallel is equivalent to -autopar -explicitpar.
-noautopar, -noexplicitpar, -noreduction are the negations.
demo% setenv PARALLEL 4 |
enables, for example, the execution of a program using at most four threads. If the target machine has four processors available, the threads will map to independent processors. If there are fewer than four processors available, some threads may run on the same processor as others, possibly degrading performance.
demo% psrinfo 0 on-line since 03/18/96 15:51:03 1 on-line since 03/18/96 15:51:03 2 on-line since 03/18/96 15:51:03 3 on-line since 03/18/96 15:51:03 |
Stacks, Stack Sizes, and Parallelization
The executing program maintains a main memory stack for the parent program and distinct stacks for each thread. Stacks are temporary memory address spaces used to hold arguments and AUTOMATIC variables over subprogram invocations. STATIC (not on the stack). However, the -stackvar option forces allocation of all local variables and arrays on the stack (as if they were AUTOMATIC variables). Use of -stackvar is recommended with parallelization because it improves the optimizer's ability to parallelize CALLs in loops. -stackvar is required with explicitly parallelized loops containing subprogram calls. (See discussion of -stackvar in the Fortran User's Guide.)
demo% setenv STACKSIZE 8192 <- Set thread stack size to 8 Mb |
Setting the thread stack size to a value larger than the default may be necessary for most parallelized Fortran codes. However, it may not be possible to know just how large to set it, except by trial and error, especially if private/local arrays are involved. If the stack size is too small for a thread to run, the program will abort with a segmentation fault.
Automatic Parallelization
With the f77 option -autopar and the f90 option -parallel, the compilers automatically find those DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the threads calls needed in the compiled code to make this happen. Loop Parallelization
The compiler's dependency analysis transforms a DO loop into a parallelizable task. The compiler may restructure the loop to split out unparallelizable sections that will run serially. It then distributes the work evenly over the available processors. Each processor executes a different chunk of iterations.
|
Processor 1 executing iterations |
1 |
through |
250 |
|
Processor 2 executing iterations |
251 |
through |
500 |
|
Processor 3 executing iterations |
501 |
through |
750 |
|
Processor 4 executing iterations |
751 |
through |
1000 |
Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependency analysis rejects loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.
Definitions: Array, Scalar, and Pure Scalar
A few definitions, from the point of view of automatic parallelization, are needed:
dimension a(10) real m(100,10), s, u, x, z equivalence ( u, z ) pointer ( px, x ) s = 0.0 ... |
The variables u, x, z, and px are scalar variables, but not pure scalars.
Automatic Parallelization Criteria
DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar (f77) or -parallel (f90). The general criteria for automatic parallelization are:
Example: Using -autopar, with dependencies eliminated by private arrays:
parameter (n=1000) real a(n), b(n), c(n,n) do i = 1, 1000 <--Parallelized do k = 1, n a(k) = b(k) + 2.0 end do do j = 1, n c(i,j) = a(j) + 2.3 end do end do end |
In the preceding example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array
a(*) appear to result in a data dependency, the compiler generates temporary private copies of the array to make the outer loop iterations independent. Inhibitors to Automatic Parallelization
Under automatic parallelization, the compilers do not parallelize a loop if:
DO loop is a variable.
DO loop is the innermost in a nest or is a singly-nested loop.
Note - f90 1.2: Innermost or singly-nested loops are not automatically parallelized.
Example: Reduction summation of the elements of a vector:
s = 0.0 do i = 1, 1000 s = s + v(i) end do t(k) = s |
However, for some operations, if the reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.
-reduction compiler option is specified along with -autopar or -parallel. -reduction is specified.
Recognized Reduction Operations
The following table lists the reduction operations that are recognized by f77.
Note - All forms of the MIN and MAX functions are recognized.
Numerical Accuracy and Reduction Operations
Floating-point sum or product reduction operations may be inaccurate due to the following conditions:
Example: Overflow and underflow, with and without reduction:
Example: Roundoff: get the sum of 100,000 random numbers between -1 and +1:
|
Number of Processors |
Output |
|
1 |
s = 0.568582080884714E+02 |
|
2 |
s = 0.568582080884722E+02 |
|
3 |
s = 0.568582080884721E+02 |
|
4 |
s = 0.568582080884724E+02 |
In this situation, roundoff error on the order of 10-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.
Explicit Parallelization
This section describes the source code directives recognized by f77 4.2 and f90 1.2 to explicitly indicate which loops to parallelize and what strategy to use.
Note - Be aware that there are differences in directive syntax and features between the f77 and f90 implementations. f77 accepts either Sun or Cray style directives, while f90 accepts only Cray style directives.
DO loops are marked for parallelization by directives placed immediately before them. The compiler options -parallel and -explicitpar must be used for DO loops to be recognized and parallel code generated. Take care when choosing which loops to mark for parallelization. The compiler generates threaded, parallel code for all loops marked with DOALL directives, even if there are data dependencies that will cause the loop to compute incorrect results when run in parallel.libthread primitives, do not use any of the compilers' parallelization options--the compilers cannot parallelize code that has already been parallelized with user calls to the threads library. Parallelizable Loops
A loop is appropriate for explicit parallelization if:
A shared variable or array is shared with all other iterations. The value assigned to a shared variable or array in an iteration is seen by other iterations of the loop.
If an explicitly parallelized loop contains shared references, then you must ensure that sharing does not cause correctness problems. The compiler does no synchronization on updates or accesses to shared variables.
If you specify a variable as private in one loop, and its only initialization is within some other loop, the value of that variable may be left undefined in the loop.
C$PAR) explicit directives, the compiler uses default rules to determine whether a scalar or array is shared or private. You can override the default rules to specify the attributes of scalars or arrays referenced inside a loop. (With Cray-style !MIC$ directives, all variables that appear in the loop must be explicitly declared either shared or private on the DOALL directive.)The compiler applies these default rules:
Example: Potential problem through equivalence:
equivalence (a(1),y) C$PAR DOALL do i = 1,n y = i a(i) = y end do |
In the preceeding example, since the scalar variable y has been equivalenced to a(1), it is no longer a private variable, even though the compiler treats it as such by the default scoping rule. Thus, the presence of the DOALL directive may lead to erroneous results when the parallelized i loop is executed.
Sun-Style Parallelization Directives (f77 only)
Parallelization directives are comment lines that tell the compiler to parallelize (or not to parallelize) the DO loop that follows the directive. Directives are also called pragmas.-mp=sun option). Cray-style or f90 directives are discussed on page 167. A Sun-style directive line is defined as follows:
C$PAR Directive [ Qualifiers ] <- Initial directive line C$PAR& [More_Qualifiers] <- Optional continuation lines |
|
Directive |
Action |
DOALL |
Parallelize the next loop. |
DOSERIAL |
Do not parallelize the next loop. |
DOSERIAL* |
Do not parallelize the next nest of loops. |
Examples: f77 parallel directives:
DOALL Directive
The compilers will parallelize the DO loop following a DOALL directive (if compiled with the -parallel or -explicitpar options).
Example: Explicit parallelization of a loop:
Note - Analysis and transformation of reduction operations within loops is not done if they are explicitly parallelized.
demo% cat t4.f ... C$PAR DOALL do i = 1, n a(i) = b(i) * c(i) end do do k = 1, m x(k) = x(k) * z(k,k) end do ... demo% f77 -explicitpar t4.f |
CALL in a Loop
A subprogram call in a loop (or in any subprograms called from within the called routine) may introduce data dependencies that could go unnoticed without a deep analysis of the data and control flow through the chain of calls. While it is best to parallelize outermost loops that do a significant amount of the work, these tend to be the very loops that involve subprogram calls. DOALL directive that contains calls to subprograms. It is still the programmer's responsibility to insure that no data dependencies exist within the loop and all that the loop encloses, including called subprograms.AUTOMATIC statement or by compiling the subprogram with the -stackvar option. However, local variables initialized in DATA statements must be rewritten to be initialized in actual assignments.
Data dependencies can still be introduced through the data passed down the call tree as arguments or through
Note - Allocating local variables to the stack may cause stack overflow. See page 140 about increasing the size of the stack.
COMMON blocks. This data flow should be analyzed carefully before parallelizing a loop with subprogram calls.
All qualifiers on the DOALL Qualifiers
DOALL directive are optional. Table 10-6 summarizes them.
PRIVATE(varlist)
The PRIVATE(varlist) qualifier specifies that all scalars and arrays in the list varlist are private for the DOALL loop. Both arrays and scalars can be specified as private. In the case of an array, each thread of the DOALL loop gets a copy of the entire array. All other scalars and arrays referenced in the DOALL loop, but not contained in the private list, conform to their appropriate default scoping rules.
C$PAR DOALL PRIVATE(a) do i = 1, n a(1) = b(i) do j = 2, n a(j) = a(j-1) + b(j) * c(j) end do x(i) = f(a) end do |
In the preceding example, the array a is specified as private to the i loop.
SHARED(varlist)
The SHARED(varlist) qualifier specifies that all scalars and arrays in the list varlist are shared for the DOALL loop. Both arrays and scalars can be specified as shared. Shared scalars and arrays are common to all the iterations of a DOALL loop. All other scalars and arrays referenced in the DOALL loop, but not contained in the shared list, conform to their appropriate default scoping rules.
equivalence (a(1),y) C$PAR DOALL SHARED(y) do i = 1,n a(i) = y end do |
In the preceding example, the variable y has been specified as a variable whose value should be shared among the iterations of the i loop.
READONLY(varlist)
The READONLY(varlist) qualifier specifies that all scalars and arrays in the list varlist are read-only for the DOALL loop. Read-only scalars and arrays are a special class of shared scalars and arrays that are not modified in any iteration of the DOALL loop. Specifying scalars and arrays as READONLY indicates to the compiler that it does not need use a separate copy of that variable or array for each thread of the DOALL loop.
x = 3 C$PAR DOALL SHARED(x),READONLY(x) do i = 1, n b(i) = x + 1 end do |
In the preceding example, x is a shared variable but the compiler can rely on the fact that it will not change over each iteration of the i loop because of its READONLY specification.
STOREBACK(varlist)
A STOREBACK variable or array is one whose value is computed in a DOALL loop. The computed value can be used after the termination of the loop. In other words, the last loop iteration values of storeback scalars and arrays may be visible outside of the DOALL loop.
C$PAR DOALL PRIVATE(x), STOREBACK(x,i) do i = 1, n x = ... end do ... = i ... = x |
In the preceding example, both the variables x and i are STOREBACK variables, even though both variables are private to the i loop.
C$PAR DOALL PRIVATE(x), STOREBACK(x) do i = 1, n if (...) then x = ... end if end do print *,x |
In the preceding example, the value of the STOREBACK variable x that is printed out may not be the same as that printed out by a serial version of the i loop. In the explicitly parallelized case, the processor that processes the last iteration of the i loop (when i = n) and performs the STOREBACK operation for x, may not be the same processor that currently contains the last updated value of x. The compiler issues a warning message about these potential problems.
SAVELAST
The SAVELAST qualifier specifies that all private scalars and arrays are STOREBACK for the DOALL loop. A STOREBACK variable or array is one whose value is computed in a DOALL loop; this computed value can be used after the termination of the loop. In other words, the last loop iteration values of STOREBACK scalars and arrays may be visible outside of the DOALL loop.
C$PAR DOALL PRIVATE(x,y), SAVELAST do i = 1, n x = ... y = ... end do ... = i ... = x ... = y |
In the preceding example, variables x, y, and i are STOREBACK variables.
REDUCTION(varlist)
The REDUCTION(varlist) qualifier specifies that all variables in the list varlist are reduction variables for the DOALL loop. A reduction variable is one whose partial values can be individually computed on various processors, and whose final value can be computed from all its partial values.
C$PAR DOALL REDUCTION(x) do i = 1, n x = x + a(i) end do |
In the preceeding example, the variable x is a (sum) reduction variable; the i loop is a (sum) reduction loop.
SCHEDTYPE(t)
The SCHEDTYPE(t) qualifier specifies that the specific scheduling type t be used to schedule the DOALL loop.
Multiple Qualifiers
Qualifiers can appear multiple times with cumulative effect. In the case of conflicting qualifiers, the compiler issues a warning message, and the qualifier appearing last prevails.
C$PAR DOALL MAXCPUS(4) READONLY(S) PRIVATE(A,B,X) MAXCPUS(2) C$PAR DOALL SHARED(B,X,Y) PRIVATE(Y,Z) C$PAR DOALL READONLY(T) |
Example: A one-line equivalent of the preceding three lines:
C$PAR DOALL MAXCPUS(2), PRIVATE(A,Y,Z), SHARED(B,X), READONLY(S,T) |
DOSERIAL Directive
The DOSERIAL directive tells f77 not to parallelize the specified loop. This directive applies to the one loop immediately following it (if you compile it with -explicitpar or -parallel).
do i = 1, n C$PAR DOSERIAL do j = 1, n do k = 1, n ... end do end do end do |
In the preceding example, the j loop is not parallelized, but the i or k loop can be.
DOSERIAL* Directive
The DOSERIAL* directive tells f77 not to parallelize the specified nest of loops. This directive applies to the whole nest of loops immediately following it (if you compile with -explicitpar or -parallel).
do i = 1, n C$PAR DOSERIAL* do j = 1, n do k = 1, n ... end do end do end do |
In the preceeding loops, the j and k loops are not parallelized; the i loop may be.
Interaction Between DOSERIAL* and DOALL
If both DOSERIAL and DOALL are specified, the last one prevails.
C$PAR DOSERIAL* do i = 1, 1000 C$PAR DOALL do j = 1, 1000 ... end do end do |
In the preceeding example, the i loop is not parallelized, but the j loop is.
program caller common /block/ a(10,10) C$PAR DOSERIAL* do i = 1, 10 call callee(i) end do end subroutine callee(k) common /block/a(10,10) do j = 1, 10 a(j,k) = j + k end do return end |
In the preceeding example, DOSERIAL* applies only to the i loop and not to the j loop, regardless of whether the call to the subroutine callee is inlined.
Inhibitors to Explicit Parallelization
In general, the compiler parallelizes a loop if you explicitly direct it to. There are exceptions--some loops the compiler just cannot parallelize.
-vpara.
... C$PAR DOALL do 900 i = 1, 1000 ! do 200 j = 1, 1000 ! ... 200 continue 900 continue ... demo% f77 -explicitpar -vpara t6.f |
Example: A parallelized loop in subroutine:
In the preceeding example, the loop within the subroutine is not parallelized because the subroutine itself is run in parallel.
C$PAR DOALL do i = 1, 1000 ! ... if (a(i) .gt. min_threshold ) go to 20 ... end do 20 continue ... demo% f77 -explicitpar -vpara t9.f |
Example: Index variable subject to side effects:
equivalence ( a(1), y ) ! ... C$PAR DOALL do i = 1, 2000 ! y = i a(i) = y end do ... demo% f77 -explicitpar -vpara t11.f |
Example: Variable in loop has loop-carried dependency:
C$PAR DOALL do 100 i = 1, 200 ! y = y * i ! a(i) = y 100 continue ... demo% f77 -explicitpar -vpara t12.f |
I/O With Explicit Parallelization
You can do I/O in a loop that executes in parallel, provided that:
In the preceeding example, the program may deadlock in libF77_mt and hang. Type Control-C to regain keyboard control.
An interface is colloquially called MT hot if the implementation has been tuned for performance advantage, using the techniques of multithreading. For some formal definitions of multithreading technology, read The Solaris Multithreaded Programming Guide. See also the Threads page by searching for "threads" at: http://www.sun.com/search/search.html
Mixing program units compiled with both Sun and Cray directives can produce different results.
A major difference between Sun and Cray directives is that Cray style requires explicit scoping of every scalar and array in the loop as either
SHARED or PRIVATE.The following table shows Cray style directive syntax.
|
Parallel Directive Syntax (Cray Style) |
!MIC$ DOALL !MIC$& SHARED( v1, v2, ... ) !MIC$& PRIVATE( u1, u2, ... ) ...optional qualifiers |
Cray Directive Syntax
A parallel directive consists of one or more directive lines. A directive line is defined as follows:
f90 -free free-format, leading blanks may appear before !MIC$.
For Cray-style directives, the DOALL directive allows a single scheduling qualifier, for example, !MIC$& CHUNKSIZE(100). Table 10-10 shows the Cray-style DOALL directive
scheduling qualifiers:
The f77 default scheduling type is the Sun-style STATIC. The f90 default is
GUIDED. Inhibitors to f90 Explicit Parallelization
In addition to the explicit parallelization problems listed on page 162, the parallelization inhibitors for f90 include:
DO increment parameter, if specified, is a variable.
-reduction and -depend. Some alternative ways to debug parallelized code are suggested in the following section.
-C.
Note - If you need -explicitpar only (without -autopar), do not compile with -explicitpar and -depend. This method is the same as compiling with -parallel, which, of course, includes -autopar.
|
Replace: DO I=1,N ... CALL SNUBBER(I) ... ENDDO With: DO I1=1,N I=I1 ... CALL SNUBBER(I) ... ENDDO |