Performance and Optimization
Some of the issues considered here are choosing compiler optimization options, specifying target hardware, profile feedback, loop unrolling, and source-level restructuring.
Optimization and performance tuning is an art that depends heavily on being able to determine what to optimize or tune.
Performance options are normally off by default because most optimizations force the compiler to make assumptions about a user's source code. Programs that conform to standard coding practices and do not introduce hidden side effects should optimize correctly. However, programs that take liberties with standard practices may run afoul of some of the compiler's assumptions. The resulting code may run faster, but the computational results may not be correct.
Recommended practice is to first compile with all options off, verify that the computational results are correct and accurate, and use these results and the performance profile as a baseline. Then proceed in steps, recompiling with additional options and comparing execution results and performance against the baseline. If numerical results change, the program may have questionable code, which needs careful analysis to locate and reprogram.
If performance does not improve significantly, or degrades, as a result of adding optimization options, the coding may not provide the compiler with opportunities for further performance improvements. The next step would then be to analyze and restructure the program at the source code level to achieve better performance.
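In practice, this stepwise approach might look like the following (the program and file names here are illustrative):

demo% f77 -o base prg.f
demo% base > base.out
demo% f77 -o opt -O3 prg.f
demo% opt > opt.out
demo% diff base.out opt.out

If the outputs differ, examine the program for nonstandard coding before adding further options; if they match, compare timings against the baseline and continue adding options one step at a time.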
-fast: use various optimization options together
-On: set compiler optimization level to n
-xtarget: specify target hardware
-xprofile: optimize using performance profile data
-unroll=n: unroll loops by n
-fsimple: permit simplifications and optimization of floating-point operations
-depend: perform dependency analysis to optimize loops
Some of these options will increase compilation time because they invoke a deeper analysis of the program. Some options work best when routines are collected into files along with the routines that call them (rather than splitting each routine into its own file); this allows the analysis to be global.
The -fast option selects a combination of optimization options. The options selected by -fast are subject to change from one release to another, and not all are available on each platform:
-xtarget=native: generates code optimized for the host architecture
-O4: sets the optimization level to -O4
-libmil: inlines calls to some simple library functions
-fsimple=1: simplifies floating-point code (SPARC only)
-dalign: uses faster double-word loads and stores (SPARC only)
-xlibmopt: uses the optimized libm math library (SPARC, PowerPC only)
-fns -ftrap=%none: turns off all trapping
-depend: analyzes loops for data dependencies (SPARC only)
-nofstore: disables forcing precision on expressions (Intel only)
-fast provides a quick way to engage much of the optimizing power of the compilers. Each of the composite options may be specified individually, and each may have side effects to be aware of (discussed in the Fortran User's Guide). Following -fast with additional options adds further optimizations. For example:

demo% f77 -fast -O5 ...
Because -fast selects a combination of other options, following it with additional options overrides or adds to those selections; here the explicit -O5 overrides the -O4 that -fast selects. Note that -fast also selects options such as -fns, -fsimple=1, and -xtarget=native. These options may have unexpected side effects for some programs.
No optimization is performed unless an -O option is specified explicitly (or implicitly with macro options like -fast). In nearly all cases, specifying an optimization level for compilation improves program execution performance. On the other hand, higher levels of optimization increase compilation time and may significantly increase code size.
For most cases, level -O3 is a good balance between performance gain, code size, and compilation time. Level -O4 adds automatic inlining of calls to routines contained in the same source file as the caller routine, among other things. Level -O5 adds more aggressive optimization techniques that would not be applied at lower levels. In general, levels above -O3 should be applied only to those routines that make up the most compute-intensive parts of the program and thereby have a high certainty of improving performance. (There is no problem linking together parts of a program compiled with different optimization levels.)
The optimizer utilizes levels -O3 and above much more efficiently if combined with -xprofile=use. With this option (available only on SPARC processors), the optimizer is directed by a runtime execution profile produced by the program (compiled with -xprofile=collect) with typical input data. The feedback profile indicates to the compiler where optimization will have the greatest effect. This may be particularly important with -O5. Here's a typical example of profile collection with higher optimization levels:
demo% f77 -o prg -fast -xprofile=collect prg.f ...
demo% f77 -o prgx -fast -O5 -xprofile=use:prg.profile prg.f ...
The first compilation above generates an executable that produces statement coverage statistics when run. The second compilation uses this performance data to guide the optimization of the program.
With -dalign, the compiler is able to generate double-word load/store instructions whenever possible. Programs that do much data motion may benefit significantly when compiled with this option. (It is one of the options selected by -fast.) The double-word instructions are almost twice as fast as the equivalent single-word operations.
However, -dalign (and therefore -fast) may cause problems with some programs that have been coded expecting a specific alignment of data in COMMON blocks. With -dalign, the compiler may add padding to ensure that all double (and quad) precision data (either REAL or COMPLEX) are aligned on double-word boundaries, with the result that:
Programs that reference a COMMON block of mixed data types as a single array may not work properly with -dalign because the block will be larger (due to padding of double and quad precision variables) than the program expects.
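For example, consider a program unit that declares (the names here are illustrative):

      COMMON /BLK/ R, D
      REAL R
      DOUBLE PRECISION D

Without padding, this block occupies 12 bytes. With -dalign, the compiler may insert 4 bytes of padding after R so that D falls on a double-word boundary, making the block 16 bytes; any code that steps through /BLK/ as a single 12-byte array will no longer work as expected.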
Adding -depend to optimization levels -O3 and higher (on SPARC processors) extends the compiler's ability to optimize DO loops and loop nests. With this option, the optimizer analyzes inter-iteration loop dependencies to determine whether or not certain transformations of the loop structure can be performed. Only loops without dependencies can be restructured. However, the added analysis may increase compilation time.
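For example, of the two loops below (the array names are illustrative), the first has no inter-iteration dependencies and is a candidate for restructuring, while the second cannot be restructured because each iteration reads a value written by the previous one:

      DO 10 I = 2, N
   10 A(I) = B(I) + C(I)

      DO 20 I = 2, N
   20 A(I) = A(I-1) + C(I)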
By default, the compiler does not attempt to simplify floating-point computations (the default is -fsimple=0). With the -fast option, -fsimple=1 is used and some conservative assumptions are made. Adding -fsimple=2 enables the optimizer to make further simplifications with the understanding that this may cause some programs to produce slightly different results due to rounding effects. If -fsimple level 1 or 2 is used, all program units should be similarly compiled to ensure consistent numerical accuracy.
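The differences arise because floating-point arithmetic is not associative, and higher -fsimple levels allow the optimizer to reorder operations. A simple single-precision example:

      X = (1.0E20 - 1.0E20) + 1.0
      Y = 1.0E20 + (-1.0E20 + 1.0)

Here X is 1.0, but Y is 0.0, because adding 1.0 to -1.0E20 leaves it unchanged in single precision.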
The -unroll=n option advises the optimizer to unroll loops to a depth of n. If a DO loop with a variable loop limit can be unrolled, both an unrolled version and the original loop are compiled. A runtime test on the iteration count determines whether or not executing the unrolled loop is appropriate. Loop unrolling, especially with simple one- or two-statement loops, increases the amount of computation done per iteration and provides the optimizer better opportunities to schedule registers and simplify operations. The tradeoff between number of iterations, loop complexity, and choice of unrolling depth is not easy to determine, and some experimentation may be needed.
The example that follows shows how a simple loop might be unrolled to a depth of four with -unroll=4 (the source code is not changed with this option):

      DO I=1, 20000
      X(I) = X(I) + Y(I)*A(I)
      END DO

Unrolled by 4 compiles as:

      DO I=1, 19997, 4
      TEMP1 = X(I) + Y(I)*A(I)
      TEMP2 = X(I+1) + Y(I+1)*A(I+1)
      TEMP3 = X(I+2) + Y(I+2)*A(I+2)
      X(I+3) = X(I+3) + Y(I+3)*A(I+3)
      X(I) = TEMP1
      X(I+1) = TEMP2
      X(I+2) = TEMP3
      END DO
This example shows a simple loop with a fixed loop count. The restructuring is more complex with variable loop counts.
The performance of some programs can be improved by specifying the target hardware with -xtarget=. For any given system name (for example, ss1000 for a SPARC Server 1000), -xtarget expands into a specific combination of -xarch, -xcache, and -xchip that properly matches that system. The optimizer uses these specifications to determine strategies to follow and instructions to generate.
-xtarget=native enables the optimizer to compile code targeted at the host system (the system doing the compilation). This is obviously useful when compilation and execution are done on the same system. When the execution system is not known, it is desirable to compile for a generic architecture; therefore, -xtarget=generic is the default, although this may produce suboptimal performance.
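For example, to compile on one system for later execution on a SPARC Server 1000 (the source file name here is illustrative):

demo% f77 -fast -xtarget=ss1000 prg.f

Because the explicit -xtarget follows -fast on the command line, it overrides the -xtarget=native that -fast selects.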
Other Performance Strategies
Assuming that you have experimented with using a variety of optimization options, compiling your program and measuring actual runtime performance, the next step might be to look closely at the Fortran source program to see what further tuning can be tried.
For example, the Sun Performance Library(TM) is a suite of highly optimized mathematical subroutines based on the standard LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK libraries. Performance improvement using these routines can be significant when compared with hand coding.
Reprogramming techniques that improve performance are dealt with in more detail in some of the reference books listed at the end of the chapter. Three major approaches are worth mentioning here:
Automatic inlining of subprogram calls (using -O4) is one way to let the compiler replace the actual call with the subprogram itself (pulling the subprogram into the loop). The subprogram source code for the routines that are to be inlined must be found in the same file as the calling routine.
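For example, if a small routine called inside a loop is placed in the same file as its caller (the routine names here are illustrative):

      SUBROUTINE DOALL(X, N)
      REAL X(N)
      DO 10 I = 1, N
   10 CALL SCALE(X(I))
      END

      SUBROUTINE SCALE(V)
      REAL V
      V = V*2.0
      END

compiling the file with -O4 permits the compiler to replace the CALL in the loop with the body of SCALE, eliminating the call overhead on every iteration.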
There are other ways to eliminate subprogram calls as well.
Another approach is to untangle loops that branch through crossed GOTO's and arithmetic IF's, replacing them with block IF constructs. For example, a tangled sequence like the following (the outer branch test A(I).GT.0. is illustrative):

      IF(A(I).GT.0.) GOTO 11
   10 XA(I) = XB(I)*B(I,I)
      XY(I) = XA(I) - A(I)
      GOTO 13
   11 XA(I) = Z(I)
      XY(I) = Z(I)
      IF(QZDATA.LT.0.) GOTO 12
      ICNT = ICNT + 1
      ROX(ICNT) = XA(I)-DELTA/2.
   12 SUM = SUM + X(I)
      GOTO 14
   13 SUM = SUM + XA(I)
   14 CONTINUE

can be rewritten with block IF's:

      IF(A(I).GT.0.) THEN
         XA(I) = Z(I)
         XY(I) = Z(I)
         IF(QZDATA.GE.0.) THEN
            ICNT = ICNT + 1
            ROX(ICNT) = XA(I)-DELTA/2.
         ENDIF
         SUM = SUM + X(I)
      ELSE
         XA(I) = XB(I)*B(I,I)
         XY(I) = XA(I) - A(I)
         SUM = SUM + XA(I)
      ENDIF

Restructuring with block IF not only improves the opportunities for the compiler to generate optimal code, it also improves readability and assures portability.
The following reference books provide more details: