5.3.6. Performance-Profiling and Optimization

The OpenFAST team has been engaged in performance-profiling and optimization work in an effort to improve the time-to-solution performance for the most computationally expensive use cases. This work is supported by Intel® through its designation of NREL as an Intel® Parallel Computing Center (IPCC).

After initial profiling and hotspot analysis, specific subroutines in the physics modules of OpenFAST were targeted for optimization. Among other takeaways, it was learned that improving the memory alignment of the derived data types could yield a significant increase in performance. Ultimately, tuning the Intel® tools to perform best on NREL’s hardware and adding high-level multithreading yielded a maximum 3.8x time-to-solution improvement for one of the benchmark cases.

5.3.6.1. Approach

The general mechanisms identified for performance improvements in OpenFAST are:

  • Intel® compiler suite and Intel® Math Kernel Library (Intel® MKL)

  • Algorithmic improvements

  • Memory-access optimization enabling more efficient cache usage

  • Data type alignment allowing for SIMD vectorization (see the sketch following this list)

  • Multithreading with OpenMP
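
As a minimal sketch of how the alignment and SIMD mechanisms above look in Fortran (the routine and the 64-byte alignment assumption are illustrative, not taken from the OpenFAST source):

    ! Minimal sketch of an OpenMP SIMD hint on an aligned array; purely
    ! illustrative, not OpenFAST code. Compile with an OpenMP-enabled
    ! compiler (e.g., ifort -qopenmp or gfortran -fopenmp).
    subroutine axpy(n, a, x, y)
       implicit none
       integer, intent(in)    :: n
       real(8), intent(in)    :: a, x(n)
       real(8), intent(inout) :: y(n)
       integer :: i
       ! The aligned clause promises 64-byte alignment so the compiler may
       ! emit aligned SIMD loads/stores; the promise must hold at call sites.
       !$omp simd aligned(x, y : 64)
       do i = 1, n
          y(i) = y(i) + a * x(i)
       end do
    end subroutine axpy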

To establish a path forward with any of these options, OpenFAST was first profiled with Intel® VTune™ Amplifier, which provides a clear breakdown of where time is spent in the simulation. Then, the optimization report generated by the Intel® Fortran compiler was analyzed to determine which areas were not auto-vectorized. Finally, Intel® Advisor was used to highlight areas of the code that the compiler identified as candidates for multithreading.

5.3.6.2. Test cases

Two OpenFAST test cases have been chosen to provide meaningful and realistic timing benchmarks. In addition to using real-world turbine and atmospheric models, these cases are computationally expensive and exercise the areas of the code where performance improvements would make the greatest difference.

5.3.6.2.1. 5MW_Land_BD_DLL_WTurb

Download files here.

The physics modules used in this case are:

  • BeamDyn

  • InflowWind

  • AeroDyn 15

  • ServoDyn

This is a land-based NREL 5-MW turbine simulation using BeamDyn as the structural module. It simulates 20 seconds at a time step of 0.001 seconds (20,000 steps) and executes in 3m 55s on NREL’s Peregrine supercomputer.

5.3.6.2.2. 5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth

Download files here.

This is an offshore, fixed-bottom NREL 5-MW turbine simulation with the majority of the computational expense occurring in the HydroDyn wave-dynamics calculation.

The physics modules used in this case are:

  • ElastoDyn

  • InflowWind

  • AeroDyn 15

  • ServoDyn

  • HydroDyn

  • SubDyn

It simulates 60 seconds at a time step of 0.01 seconds (6,000 steps) and executes in 20m 27s on NREL’s Peregrine supercomputer.

5.3.6.3. Profiling

The OpenFAST test cases were profiled with Intel® VTune™ Amplifier to identify performance hotspots. Because the two test cases exercise different portions of the OpenFAST software, different hotspots were identified. In all cases and environment settings, the majority of the CPU time was spent in the fast_solution loop, a high-level subroutine that coordinates the solution calculation across the physics modules.
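
As a rough, compilable sketch of what this coordination looks like (the names below are schematic stand-ins, not OpenFAST’s actual API):

    ! Schematic glue-code loop: advance each physics module one step, then
    ! exchange coupled quantities between modules. Names are illustrative.
    program solution_loop_sketch
       implicit none
       integer, parameter :: n_steps = 3
       real(8), parameter :: dt = 1.0d-3
       real(8) :: t
       integer :: n_t

       do n_t = 1, n_steps
          t = real(n_t - 1, 8) * dt
          call structural_update(t, dt)   ! e.g., structural-module states
          call aero_update(t, dt)         ! e.g., aerodynamic-module states
          call transfer_meshes()          ! map motions and loads between modules
       end do

    contains

       subroutine structural_update(t, dt)
          real(8), intent(in) :: t, dt
          ! Placeholder for a structural-state update from t to t + dt.
       end subroutine structural_update

       subroutine aero_update(t, dt)
          real(8), intent(in) :: t, dt
          ! Placeholder for an aerodynamic-state update.
       end subroutine aero_update

       subroutine transfer_meshes()
          ! Placeholder for mesh-to-mesh mapping of motions and loads.
       end subroutine transfer_meshes

    end program solution_loop_sketch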

5.3.6.3.1. LAPACK

In the offshore case, LAPACK usage was identified as a performance hotspot. Within the fast_solution loop, the calls to the LAPACK routine dgetrs consume 3.3% of the total CPU time.

(Figure: VTune™ Amplifier profile of the offshore case showing the LAPACK dgetrs hotspot. Image: offshore_lapack.png)
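
For context, dgetrs solves a linear system by reusing an LU factorization previously computed by dgetrf, so the factorization cost can be amortized over many right-hand sides. A minimal, self-contained usage sketch, assuming a LAPACK implementation is linked (the 3x3 system is purely illustrative):

    ! Factor A once with dgetrf, then solve with dgetrs for each new
    ! right-hand side. Link against a LAPACK implementation, e.g.:
    !   gfortran demo.f90 -llapack
    program dgetrs_demo
       implicit none
       integer, parameter :: n = 3, nrhs = 1
       real(8) :: a(n,n), b(n,nrhs)
       integer :: ipiv(n), info

       ! A small symmetric test matrix and right-hand side.
       a = reshape([4d0, 1d0, 0d0, &
                    1d0, 3d0, 1d0, &
                    0d0, 1d0, 2d0], [n, n])
       b(:,1) = [1d0, 2d0, 3d0]

       call dgetrf(n, n, a, n, ipiv, info)                ! LU factorization
       if (info /= 0) stop 'dgetrf failed'

       call dgetrs('N', n, nrhs, a, n, ipiv, b, n, info)  ! solve A x = b
       if (info /= 0) stop 'dgetrs failed'

       print '(3f8.4)', b(:,1)                            ! solution vector x
    end program dgetrs_demo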

5.3.6.3.2. BeamDyn

While BeamDyn provides a high-fidelity blade-response calculation, it is a computationally expensive module. Initial profiling highlighted the bd_elementmatrixga2 subroutine, in particular, as a hotspot. However, initial attempts to improve performance in BeamDyn highlighted the need for algorithmic improvements and for refinements to the module’s data structures.
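
One such data-structure refinement can be illustrated with a small, hypothetical sketch (these derived types are illustrative, not BeamDyn’s actual types): an array-of-structures layout interleaves fields in memory, while a structure-of-arrays layout keeps each field contiguous and unit-stride, which is what the vectorizer needs.

    ! Illustrative layouts only -- not OpenFAST/BeamDyn types.
    module layout_sketch
       implicit none

       ! Array-of-structures: x, y, z for one node are adjacent, so a loop
       ! over nodes accesses x with a stride of three reals.
       type :: node_aos
          real(8) :: x, y, z
       end type node_aos

       ! Structure-of-arrays: each field is a contiguous, alignable array.
       type :: nodes_soa
          real(8), allocatable :: x(:), y(:), z(:)
       end type nodes_soa

    contains

       subroutine scale_x(nodes, n, factor)
          type(nodes_soa), intent(inout) :: nodes
          integer,         intent(in)    :: n
          real(8),         intent(in)    :: factor
          integer :: i
          ! Unit-stride loop over a contiguous array: readily vectorized.
          do i = 1, n
             nodes%x(i) = factor * nodes%x(i)
          end do
       end subroutine scale_x

    end module layout_sketch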

5.3.6.4. Results

Though work is ongoing, OpenFAST time-to-solution performance has improved and the performance potential is better understood.

Some key outcomes from the first year of the IPCC project are as follows:

  • Use of the Intel® compiler and Intel® MKL provides a dramatic speedup over GCC and reference LAPACK

    • Additional significant gains are possible through MKL threading for offshore simulations (e.g., via the MKL_NUM_THREADS environment variable)

  • Offshore-wind-turbine simulations are poorly load balanced across modules

    • The land-based-turbine configuration is better balanced

    • OpenMP tasks are employed to achieve better load balancing (see the sketch after this list)

  • OpenMP module-level parallelism provides significant, but limited, speedup due to imbalance across the different module tasks

  • Core algorithms need significant modification to enable OpenMP and SIMD benefits
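
A minimal sketch of the OpenMP tasking pattern referenced in the list above (the work routines are hypothetical stand-ins for module calculations of unequal cost):

    ! Independent units of work become tasks; idle threads pick up whatever
    ! task is ready, which helps when per-module costs are very uneven.
    program task_sketch
       implicit none

       !$omp parallel
       !$omp single
       !$omp task
       call cheap_work()      ! stand-in for a low-cost module calculation
       !$omp end task
       !$omp task
       call expensive_work()  ! stand-in for a high-cost calculation
       !$omp end task
       ! All tasks complete at the implicit barrier ending the single region.
       !$omp end single
       !$omp end parallel

    contains

       subroutine cheap_work()
       end subroutine cheap_work

       subroutine expensive_work()
       end subroutine expensive_work

    end program task_sketch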

5.3.6.4.1. Speedup - Intel® Compiler and MKL

By employing the standard Intel® developer tool suite, a performance improvement over the GNU tools was demonstrated:

Compiler     Math Library        5MW_Land_BD_DLL_WTurb   5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth
GNU          LAPACK              2265 s (1.0x)           673 s (1.0x)
Intel® 17    LAPACK              1650 s (1.4x)           251 s (2.7x)
Intel® 17    MKL                 1235 s (1.8x)           –
Intel® 17    MKL Multithreaded   722 s (3.1x)            –

(A dash indicates that no benchmark was reported for that configuration.)

5.3.6.4.2. Speedup - OpenMP at FAST_Solver

A performance improvement was demonstrated by adding OpenMP directives to the FAST_Solver module. Although the solution scheme is not well balanced, parallelizing the mesh-mapping and calculation routines resulted in the following speedup:

Compiler     Math Library      5MW_Land_BD_DLL_WTurb   5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth
Intel® 17    MKL - 1 thread    1073 s (2.1x)           100 s (6.7x)
Intel® 17    MKL - 8 threads   597 s (3.8x)            –

(A dash indicates that no benchmark was reported for that configuration.)
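
The flavor of this change can be sketched with a simple directive-parallel loop; the mapping arithmetic below is a stand-in for the real mesh-mapping routines, not actual FAST_Solver code:

    ! Schematic directive-based parallelism over independent mesh nodes.
    subroutine map_loads(n, src, weight, dest)
       implicit none
       integer, intent(in)  :: n
       real(8), intent(in)  :: src(n), weight(n)
       real(8), intent(out) :: dest(n)
       integer :: i
       ! Each destination node is independent, so iterations can be divided
       ! among threads with no synchronization inside the loop body.
       !$omp parallel do
       do i = 1, n
          dest(i) = weight(i) * src(i)
       end do
       !$omp end parallel do
    end subroutine map_loads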

5.3.6.5. Ongoing Work

The next phase of the OpenFAST performance improvements is focused on two key areas:

  1. Implementing the outcomes from previous work throughout OpenFAST modules and glue codes

  2. Preparing OpenFAST for efficient execution on Intel®’s next-generation platforms