Chapter 9: Performance and Optimization
Optimization and performance tuning is an art that depends heavily on being able to determine what to optimize or tune. The issues considered in this chapter include the choice of compiler options and the restructuring of source code.
Performance options are normally off by default because most optimizations force the compiler to make assumptions about a user's source code. Programs that conform to standard coding practices and do not introduce hidden side effects should optimize correctly. However, programs that take liberties with standard practices may run afoul of some of the compiler's assumptions. The resulting code may run faster, but the computational results may not be correct.
Recommended practice is to first compile with all options off, verify that the computational results are correct and accurate, and use these results and the performance profile as a baseline. Then proceed in steps, recompiling with additional options and comparing execution results and performance against the baseline. If numerical results change, the program may contain questionable code that needs careful analysis to locate and reprogram.
If performance does not improve significantly, or degrades, as a result of adding optimization options, the coding may not provide the compiler with opportunities for further performance improvements. The next step would then be to analyze and restructure the program at the source code level to achieve better performance.
Some of these options will increase compilation time because they invoke a deeper analysis of the program. Some options work best when routines are collected into files along with the routines that call them (rather than splitting each routine into its own file); this allows the analysis to be global.
-fast
This single option selects a number of performance options that, working together, produce object code optimized for execution speed without an excessive increase in compilation time.
The options selected by -fast are subject to change from one release to another, and not all are available on each platform:
-xtarget=native - generates code optimized for the host architecture
-O4 - sets optimization level
-libmil - inlines calls to some simple library functions
-fsimple=1 - simplifies floating-point code (SPARC only)
-dalign - uses faster, double word loads and stores (SPARC only)
-xlibmopt - uses the optimized libm math library (SPARC, PowerPC only)
-fns -ftrap=%none - turns off all trapping
-depend - analyzes loops for data dependencies (SPARC only)
-nofstore - disables forcing precision on expressions (Intel only)
-fast provides a quick way to engage much of the optimizing power of the compilers. Each of the composite options may be specified individually, and each may have side effects to be aware of (discussed in the Fortran User's Guide). Following -fast with additional options adds further optimizations. For example:
f77 -fast -O5 ...
Note: -fast includes -dalign and -native. These options may have unexpected side effects for some programs.
-On
No optimization is performed unless an -O option is specified explicitly (or implicitly with macro options like -fast). In nearly all cases, specifying an optimization level for compilation improves program execution performance. On the other hand, higher levels of optimization increase compilation time and may significantly increase code size.
For most cases, level -O3 is a good balance between performance gain, code size, and compilation time. Level -O4 adds automatic inlining of calls to routines contained in the same source file as the caller routine, among other things. Level -O5 adds more aggressive optimization techniques that would not be applied at lower levels. In general, levels above -O3 should be applied only to those routines that make up the most compute-intensive parts of the program and thereby have a high certainty of improving performance. (There is no problem linking together parts of a program compiled with different optimization levels.)
The compiler performs optimization at levels -O3 and above much more efficiently if combined with -xprofile=use. With this option (available only on SPARC processors), the optimizer is directed by a runtime execution profile produced by the program (compiled with -xprofile=collect) with typical input data. The feedback profile indicates to the compiler where optimization will have the greatest effect. This may be particularly important with -O5. Here is a typical example of profile collection with higher optimization levels:
demo% f77 -o prg -fast -xprofile=collect prg.f ...
demo% prg
demo% f77 -o prgx -fast -O5 -xprofile=use:prg.profile prg.f ...
demo% prgx
The first compilation above generates an executable that produces statement coverage statistics when run. The second compilation uses this performance data to guide the optimization of the program.
(See the compiler documentation for details on the -xprofile options.)
-dalign
With -dalign the compiler is able to generate double-word load/store instructions whenever possible. Programs that do much data motion may benefit significantly when compiled with this option. (It is one of the options selected by -fast.) The double-word instructions are almost twice as fast as the equivalent single-word operations.
However, -dalign (and therefore -fast) may cause problems with some programs that have been coded expecting a specific alignment of data in COMMON blocks. With -dalign, the compiler may add padding to ensure that all double (and quad) precision data (either REAL or COMPLEX) are aligned on double-word boundaries, with the result that:
COMMON blocks may be larger than expected due to added padding
All program units sharing COMMON must be compiled with -dalign if any one of them is compiled with -dalign
A program that treats a COMMON block of mixed data types as a single array may not work properly with -dalign because the block will be larger (due to padding of double and quad precision variables) than the program expects.
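The padding issue can be pictured with a small sketch. The block name /MIXED/, the routine SHOW, and the variable names are hypothetical, and the byte offsets in the comments assume a 4-byte INTEGER and REAL and an 8-byte DOUBLE PRECISION; the actual layout depends on the compiler and platform.

```fortran
C     Sketch only: block name /MIXED/ and all variable names are
C     hypothetical. Offsets assume 4-byte INTEGER and REAL and
C     8-byte DOUBLE PRECISION.
      PROGRAM PADDEM
      INTEGER N
      DOUBLE PRECISION D
      REAL R
C     Without padding: N at offset 0, D at offset 4, R at offset 12.
C     With -dalign the compiler may pad after N so that D starts on
C     an 8-byte boundary, which moves D and R and enlarges the block.
      COMMON /MIXED/ N, D, R
      N = 1
      D = 2.0D0
      R = 3.0
      CALL SHOW
      END

      SUBROUTINE SHOW
C     Same declaration as in the main program. Every unit that
C     declares /MIXED/ should be compiled with the same alignment
C     option so that all units agree on the layout.
      INTEGER N
      DOUBLE PRECISION D
      REAL R
      COMMON /MIXED/ N, D, R
      PRINT *, N, D, R
      END
```

Listing the widest items first (here D, then N and R) is one way to keep double-precision data naturally aligned without relying on padding.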
-depend (f77 only)
Adding -depend to optimization levels -O3 and higher (on SPARC processors) extends the compiler's ability to optimize DO loops and loop nests. With this option, the optimizer analyzes inter-iteration loop dependencies to determine whether or not certain transformations of the loop structure can be performed. Only loops without dependencies can be restructured. However, the added analysis may increase compilation time.
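The difference between a loop with and without an inter-iteration dependency can be sketched as follows. The array names and sizes are hypothetical, and the comments describe only the dependency structure, not what a particular compiler release will actually do with each loop.

```fortran
C     Sketch only: array names and sizes are hypothetical.
      PROGRAM DEPDEM
      INTEGER N, I
      PARAMETER (N = 8)
      REAL A(N), B(N), C(N)
      DO 10 I = 1, N
         A(I) = 1.0
         B(I) = REAL(I)
         C(I) = 0.0
   10 CONTINUE
C     No inter-iteration dependency: each C(I) depends only on
C     A(I) and B(I), so the iterations are independent.
      DO 20 I = 1, N
         C(I) = A(I) + B(I)
   20 CONTINUE
C     Loop-carried dependency: A(I) needs the value of A(I-1)
C     computed in the previous iteration, so order must be kept.
      DO 30 I = 2, N
         A(I) = A(I-1) + B(I)
   30 CONTINUE
      PRINT *, C(N), A(N)
      END
```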
-fsimple=2 (f77 only)
By default, the compiler does not attempt to simplify floating-point computations (-fsimple=0). With the -fast option, -fsimple=1 is used and some conservative assumptions are made. Adding -fsimple=2 enables the optimizer to make further simplifications with the understanding that this may cause some programs to produce slightly different results due to rounding effects. If -fsimple level 1 or 2 is used, all program units should be similarly compiled to ensure consistent numerical accuracy.
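To see why an algebraic simplification can change results through rounding, consider replacing a division by a multiplication with a precomputed reciprocal. This is offered only as an illustration of the kind of rewrite an aggressive optimizer might make; it is an assumption, not a statement of what -fsimple=2 does in any given release. All names and values are hypothetical.

```fortran
C     Sketch only: names and values are hypothetical.
      PROGRAM FSDEMO
      INTEGER I, NDIFF
      REAL X, Y, RECIP, Q1, Q2
      Y = 3.0
      RECIP = 1.0 / Y
      NDIFF = 0
      DO 10 I = 1, 100
         X = REAL(I)
C        Division as written in the source.
         Q1 = X / Y
C        Algebraically equivalent form using a precomputed
C        reciprocal; the rounded result can differ in the last bit.
         Q2 = X * RECIP
         IF (Q1 .NE. Q2) NDIFF = NDIFF + 1
   10 CONTINUE
      PRINT *, 'results that differ: ', NDIFF
      END
```

The count printed depends on the floating-point arithmetic of the platform and on the optimization options used to compile the sketch itself.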
-unroll=n
This option directs the optimizer to unroll loops, where possible, to a depth of n. If a DO loop with a variable loop limit can be unrolled, both an unrolled version and the original loop are compiled. A runtime test on the iteration count determines whether executing the unrolled loop is appropriate. Loop unrolling, especially with simple one- or two-statement loops, increases the amount of computation done per iteration and provides the optimizer better opportunities to schedule registers and simplify operations. The tradeoff between number of iterations, loop complexity, and choice of unrolling depth is not easy to determine, and some experimentation may be needed. The example that follows shows how a simple loop might be unrolled to a depth of four with -unroll=4 (the source code is not changed with this option):
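A minimal sketch of such a transformation, written out by hand, appears below. The array names, sizes, and values are hypothetical, and the unrolled loop is only a source-level picture of what the optimizer does internally; the code the compiler actually generates may differ.

```fortran
C     Sketch only: array names, sizes, and values are hypothetical.
      PROGRAM UNRDEM
      INTEGER N, I
      PARAMETER (N = 20000)
      REAL X1(N), X2(N), Y(N), A(N), ERR
      DO 10 I = 1, N
         X1(I) = REAL(I)
         X2(I) = REAL(I)
         Y(I) = 0.5
         A(I) = 2.0
   10 CONTINUE
C     Original loop as written in the source.
      DO 20 I = 1, N
         X1(I) = X1(I) + Y(I)*A(I)
   20 CONTINUE
C     Hand-written equivalent of unrolling to a depth of four.
C     N is a multiple of 4, so no cleanup loop is needed.
      DO 30 I = 1, N - 3, 4
         X2(I)   = X2(I)   + Y(I)*A(I)
         X2(I+1) = X2(I+1) + Y(I+1)*A(I+1)
         X2(I+2) = X2(I+2) + Y(I+2)*A(I+2)
         X2(I+3) = X2(I+3) + Y(I+3)*A(I+3)
   30 CONTINUE
C     Compare the two versions element by element.
      ERR = 0.0
      DO 40 I = 1, N
         ERR = MAX(ERR, ABS(X1(I) - X2(I)))
   40 CONTINUE
      PRINT *, 'max difference: ', ERR
      END
```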
This example shows a simple loop with a fixed loop count. The restructuring is more complex with variable loop counts.
-xtarget=system
The performance of some programs may benefit if the compiler has an accurate description of the target computer hardware. When program performance is critical, the proper specification of the target hardware could be very important. This is especially true when running on the newer SPARC processors. However, for most programs and older SPARC processors, the performance gain may be negligible and a generic specification may be sufficient.
The target system is specified by name with -xtarget=. For any given system name (for example, ss1000, for SPARC Server 1000), -xtarget expands into a specific combination of -xarch, -xcache, and -xchip that properly matches that system. The optimizer uses these specifications to determine strategies to follow and instructions to generate.
-xtarget=native enables the optimizer to compile code targeted at the host system (the system doing the compilation). This is obviously useful when compilation and execution are done on the same system. When the execution system is not known, it is desirable to compile for a generic architecture. Therefore, -xtarget=generic is the default, although this may produce suboptimal performance.
Other Performance Strategies
Assuming that you have experimented with using a variety of optimization options, compiling your program and measuring actual runtime performance, the next step might be to look closely at the Fortran source program to see what further tuning can be tried.
For example, the Sun Performance Library™ is a suite of highly optimized mathematical subroutines based on the standard LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK libraries. Performance improvement using these routines can be significant when compared with hand coding.
Reprogramming techniques that improve performance are dealt with in more detail in some of the reference books listed at the end of the chapter. Three major approaches are worth mentioning here:
Automatic inlining of subprogram calls (using -inline=x,y,...z, or -O4) is one way to let the compiler replace the actual call with the subprogram itself (pulling the subprogram into the loop). The subprogram source code for the routines that are to be inlined must be found in the same file as the calling routine.
There are other ways to eliminate subprogram calls:
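Two commonly used source-level alternatives are sketched here as illustrations: rewriting a simple function as a statement function, and moving the loop into the subprogram so that it is called once rather than once per iteration. The names LINFN and SCLALL and the arithmetic are hypothetical.

```fortran
C     Sketch only: routine and variable names are hypothetical.
      PROGRAM CALDEM
      INTEGER N, I
      PARAMETER (N = 1000)
      REAL V(N), W(N), T, LINFN
C     Statement function: defined in the calling unit, so no
C     external call is needed inside the loop.
      LINFN(T) = 2.0*T + 1.0
      DO 10 I = 1, N
         V(I) = REAL(I)
   10 CONTINUE
      DO 20 I = 1, N
         W(I) = LINFN(V(I))
   20 CONTINUE
C     Alternative: one call that contains the whole loop, instead
C     of one call per iteration.
      CALL SCLALL(V, W, N)
      PRINT *, W(1), W(N)
      END

      SUBROUTINE SCLALL(V, W, N)
      INTEGER N, I
      REAL V(N), W(N)
      DO 10 I = 1, N
         W(I) = 2.0*V(I) + 1.0
   10 CONTINUE
      END
```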
A general rule is to eliminate arithmetic and logical IF's, replacing them with block IF's:
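As a small sketch of this kind of rewrite, the program below computes the same value twice: once with an arithmetic IF and GO TO statements, and once with a block IF. The variable names and values are hypothetical.

```fortran
C     Sketch only: variable names and values are hypothetical.
      PROGRAM IFDEMO
      REAL A, DELTA, R1, R2
      A = 0.25
      DELTA = 0.5
C     Tangled form: the arithmetic IF branches to label 10 when
C     A-DELTA is negative or zero, and to label 20 when positive.
      IF (A - DELTA) 10, 10, 20
   10 R1 = A * 2.0
      GO TO 30
   20 R1 = DELTA
   30 CONTINUE
C     Same logic written as a block IF.
      IF (A .LE. DELTA) THEN
         R2 = A * 2.0
      ELSE
         R2 = DELTA
      END IF
      PRINT *, R1, R2
      END
```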
Using block IF not only improves the opportunities for the compiler to generate optimal code, it also improves readability and assures portability.
Further Reading
The following reference books provide more details: