Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA

https://doi.org/10.1016/j.jpdc.2009.01.006

Abstract

We port a high-order finite-element application that performs the numerical simulation of seismic wave propagation resulting from earthquakes in the Earth, to NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it at the cost of reduced performance. We discuss and compare two implementations of the code: one that has maximum efficiency but is limited by the memory size of the card, and one that can handle larger problems but is less efficient. We use a coloring scheme to efficiently handle summation operations over nodes on a topology with variable valence. We perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25.

Introduction

Over the past several years, non-graphical applications ported to GPUs have steadily grown more numerous [25]. Many applications have been adapted to the GPU using one of several specialty languages, ranging from graphics-oriented languages such as Cg, HLSL, and GLSL to languages that abstract away the graphics component to ease the implementation of non-graphical code, such as Brook, Scout [20], Sh, RapidMind, and CUDA. CUDA in particular gives users a large amount of control over a powerful streaming/SIMD computer. Since CUDA became available in 2007, the number of General-Purpose computing on Graphics Processing Units (GPGPU) applications has increased drastically. CUDA is being used to study physical problems as diverse as molecular dynamics [2], fluid dynamics [4], and astrophysical simulations [24].

Our research group has developed a software package called SPECFEM3D_GLOBE for the three-dimensional numerical simulation of seismic wave propagation resulting from earthquakes in the Earth, based on the so-called spectral-element method (SEM) [16], [19], [17]. The SEM is similar to a finite-element method with high-degree polynomial basis functions. The mesh of elements required to cover the Earth is usually so large that calculations are very expensive in terms of CPU time and require a large amount of memory, at least a few terabytes. We therefore usually turn to parallel programming with MPI [17], [15], OpenMP, or both. Typical runs require a few hundred processors and a few hours of elapsed wall-clock time. Current large runs require a few thousand processors, typically 2000–4000, and two to five days of elapsed wall-clock time [17], [15]. It is therefore of interest to try to speed up the calculations on each node of such a parallel machine by turning to GPGPU. We will see in the next section that a significant amount of work has been devoted in the community to porting low-order finite-element codes to the GPU, but to our knowledge we are the first to address the issue of porting a more complex, higher-order spectral-like finite-element technique.

Rather than attempt to solve the fully resolved problem using both MPI and the GPU simultaneously, which we will consider in a future study, we choose to investigate the potential for speedup on a single CPU, with acceleration provided solely by the GPU. Such an implementation can be very useful in the absence of a GPU cluster for quick parametric studies and code development. In the future, the experience gained from this exercise will provide the intuition needed to accelerate an MPI version of the code using graphics processors. Our seismic wave propagation application is written in Fortran95 + MPI, but we wrote a serial version in C for these tests to facilitate interfacing with CUDA, which is currently easier from C.

A major issue in all finite-element codes, whether low or high order, is that dependencies arise from the summation of elemental contributions at global points shared among multiple mesh elements, which would ordinarily require atomic operations. We will show that a mesh coloring technique allows us to overcome this difficulty and obtain a speedup of 25x, consistent with speedups obtained by many other researchers for other realistic applications in different fields.
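A schematic version of this idea is sketched below; it is illustrative rather than the exact kernel used in our code, and the names (ibool, color_offset) and the 125-point element size, which corresponds to polynomial degree 4, are assumptions. The scatter-add kernel is launched once per color; within a color no two elements share a global point, so the summation needs no atomics, and successive launches on the same CUDA stream execute in order, which synchronizes the colors.

    #define NODES_PER_ELEM 125  /* 5 x 5 x 5 GLL points, polynomial degree 4 */

    /* Sum elemental contributions into the global acceleration array.
       Grid: one block per element of the current color, one thread per node. */
    __global__ void scatter_add(const int   *ibool,         /* local -> global */
                                const float *local_contrib, /* element forces  */
                                float       *accel,         /* global array    */
                                int          first_elem)    /* start of color  */
    {
        int elem  = first_elem + blockIdx.x;
        int node  = threadIdx.x;
        int iglob = ibool[elem * NODES_PER_ELEM + node];

        /* no atomicAdd needed: within one color, no two elements in
           flight ever touch the same global point iglob */
        accel[iglob] += local_contrib[elem * NODES_PER_ELEM + node];
    }

    /* Host side: serialize over colors; elements are renumbered so that
       color c occupies the range [color_offset[c], color_offset[c+1]).
       Launches on the same stream run in order, synchronizing the colors. */
    void assemble_forces(const int *d_ibool, const float *d_local,
                         float *d_accel, const int *color_offset, int num_colors)
    {
        for (int c = 0; c < num_colors; c++) {
            int first = color_offset[c];
            int count = color_offset[c + 1] - first;
            scatter_add<<<count, NODES_PER_ELEM>>>(d_ibool, d_local,
                                                   d_accel, first);
        }
    }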

The remainder of the paper is organized as follows: We begin with a description of similar problems ported to the GPU in Section 2. The serial algorithm is discussed in Section 3, followed by the CUDA implementation of the spectral-element algorithm in Section 4. Within this section, we discuss an implementation that runs fully on the GPU, including the coloring scheme that is needed to avoid atomic operations. We also describe a version of our algorithm capable of solving larger problems that cannot fit fully on the GPU. Optimizations we considered to improve efficiency are treated in Section 5. We provide some numerical code validation in Section 6, and analyze some performance results in Section 7. We conclude in Section 8.

Section snippets

Related work

Depending on the algorithms used and the ease of parallelization, applications ported to the GPU have achieved speedups ranging from 3x to 50x. Of course, whether a specific speedup is considered impressive or not depends strongly on how well the original code was optimized. Algorithms such as finite-difference and finite-volume methods are amenable to large accelerations due to their sparse structure. On the other hand, dense methods such as spectral…

Serial algorithm

The SEM is a high-order finite-element algorithm in which a domain is subdivided into cells within which variables are approximated with high-order interpolation. We have adopted the SEM to simulate numerically the propagation of seismic waves resulting from earthquakes in the Earth [16]. It solves the variational form of the elastic wave equation in the time domain on a non-structured mesh of hexahedral elements called spectral elements to compute the displacement vector of any point of the…
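The core of each time step is a gather-compute-scatter loop over the spectral elements. The serial C sketch below illustrates that structure; it is schematic only, compute_element_forces stands in for the tensorized derivative-matrix products, and the array names are illustrative.

    #define NGLL3 125  /* 5 x 5 x 5 Gauss-Lobatto-Legendre points per element */

    /* Local stiffness computation (tensorized derivative-matrix products),
       omitted here; a placeholder declaration keeps the sketch compilable. */
    void compute_element_forces(const float *u_loc, float *f_loc);

    void sem_internal_forces(int nspec, const int *ibool,
                             const float *displ, float *accel)
    {
        float u_loc[NGLL3], f_loc[NGLL3];

        for (int ispec = 0; ispec < nspec; ispec++) {
            /* gather: copy the global displacement at this element's points */
            for (int k = 0; k < NGLL3; k++)
                u_loc[k] = displ[ibool[ispec * NGLL3 + k]];

            /* compute: apply the elemental stiffness operator locally */
            compute_element_forces(u_loc, f_loc);

            /* scatter: sum into the global array; grid points shared by
               neighboring elements create the dependency discussed above */
            for (int k = 0; k < NGLL3; k++)
                accel[ibool[ispec * NGLL3 + k]] += f_loc[k];
        }
    }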

CUDA implementation

In this section, we present two implementations of our code using CUDA. The first runs fully on the GPU and is therefore expected to deliver the largest speedup, but the problem size is limited by the on-board GPU memory. The host calls a sequence of kernels at every time iteration (Fig. 3). This approach was also adopted in [27]. All local and global arrays are stored on the GPU. Pre-processing, i.e., mesh creation, and post-processing are the responsibility of the CPU. Since these are cheap…
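A schematic host-side driver for this fully GPU-resident version might look as follows; the kernel and variable names are illustrative, not the exact code. All fields remain in device memory for the whole run, and only a few bytes per time step, the seismogram sample, cross the PCI Express bus.

    /* Schematic host time loop for the fully GPU-resident version.
       Kernels (update_displ, compute_forces, add_source, update_veloc)
       and device arrays are declared elsewhere; names are illustrative. */
    void time_loop(int nstep, int num_colors,
                   const int *elems_in_color, const int *color_offset,
                   int nglob_blocks, float dt,
                   const int *d_ibool, float *d_displ, float *d_veloc,
                   float *d_accel, float *h_seis, int irec)
    {
        for (int it = 0; it < nstep; it++) {
            /* predictor: update displacement from velocity and acceleration */
            update_displ<<<nglob_blocks, 128>>>(d_displ, d_veloc, d_accel, dt);

            /* internal forces: one launch per mesh color (see above) */
            for (int c = 0; c < num_colors; c++)
                compute_forces<<<elems_in_color[c], 125>>>(d_ibool, d_displ,
                                                           d_accel,
                                                           color_offset[c]);

            add_source<<<1, 1>>>(d_accel, it);   /* point source term */

            /* corrector: update velocity from the assembled acceleration */
            update_veloc<<<nglob_blocks, 128>>>(d_veloc, d_accel, dt);

            /* only the seismogram sample leaves the card each step */
            cudaMemcpy(&h_seis[it], &d_displ[irec], sizeof(float),
                       cudaMemcpyDeviceToHost);
        }
    }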

Optimizations common to the two CUDA implementations

In this section, we discuss some of the optimizations incorporated into the kernels. As a rule of thumb, the fastest kernels minimize access to device memory, avoid non-coalesced accesses to global memory, avoid bank conflicts when reading from or writing to shared memory, and try to minimize register and/or shared memory usage to maximize occupancy. At the same time, one strives to have many blocks running per multiprocessor to overlap the latencies of memory transfers with useful…
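As an illustration of these rules, the schematic kernel fragment below (names and sizes are illustrative, not the actual kernel) stages the small derivative matrix in shared memory once per block, so every thread reuses it without further global-memory traffic, and lays out per-element arrays so that consecutive threads read consecutive addresses, i.e., coalesced accesses.

    /* Schematic kernel fragment illustrating the optimization rules above.
       One block per spectral element, one thread per GLL point. */
    #define NGLL  5                     /* 1D Gauss-Lobatto-Legendre points */
    #define NGLL2 (NGLL * NGLL)         /* size of the derivative matrix    */
    #define NGLL3 (NGLL * NGLL * NGLL)  /* points per element (125)         */

    __global__ void forces_sketch(const float *d_hprime, /* NGLL x NGLL matrix */
                                  const float *d_u_loc,  /* element-major data */
                                  float       *d_f_loc)
    {
        __shared__ float s_hprime[NGLL2];
        int tid = threadIdx.x;          /* 0..NGLL3-1 */

        /* stage the small derivative matrix in shared memory once per
           block: a single coalesced load, then reused by every thread */
        if (tid < NGLL2)
            s_hprime[tid] = d_hprime[tid];
        __syncthreads();

        /* coalesced global read: thread tid of block b touches address
           b * NGLL3 + tid, so consecutive threads hit consecutive words */
        float u = d_u_loc[blockIdx.x * NGLL3 + tid];

        /* placeholder arithmetic standing in for the tensorized
           derivative-matrix products, kept so the sketch compiles */
        d_f_loc[blockIdx.x * NGLL3 + tid] = u * s_hprime[tid % NGLL2];
    }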

Numerical validation of the two CUDA versions

In Fig. 7 we validate the two CUDA implementations of our spectral-element code by comparing the time evolution of the vertical component of the displacement vector recorded at one grid point, called a ‘seismogram’ in the field of seismic wave propagation, produced by waves generated by an earthquake source located at another grid point and propagating across the mesh. We take the mesh of Fig. 1 (left) and put the earthquake source in element 3711 at global grid point 256,406 and record the…
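A simple way to quantify the agreement between the CPU and GPU seismograms is the maximum pointwise difference, as in the schematic helper below (ours, for illustration); in single precision one expects discrepancies close to the 32-bit rounding level.

    #include <math.h>

    /* Schematic check: largest absolute sample-by-sample difference
       between the CPU and GPU seismograms over nstep time steps. */
    float max_abs_diff(const float *seis_cpu, const float *seis_gpu, int nstep)
    {
        float max_d = 0.0f;
        for (int i = 0; i < nstep; i++) {
            float d = fabsf(seis_cpu[i] - seis_gpu[i]);
            if (d > max_d)
                max_d = d;
        }
        return max_d;
    }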

Performance analysis and speedup obtained

Our experimental setup is composed of an NVIDIA GeForce 8800 GTX card installed on the PCI Express 1 bus of a dual-processor dual-core 64-bit Intel Xeon E5345 2.33 GHz PC with 8 GB of RAM and running Linux kernel 2.6.23; and of an NVIDIA GeForce GTX 280 card installed on the PCI Express 1 bus of a dual-processor dual-core 64-bit AMD Opteron PC with 8 GB of RAM and running Linux kernel 2.6.20. The 8800 GTX card has 16 multiprocessors, i.e., 128 cores, and 768 MB of memory, and the memory…

Lessons learned, conclusions and future work

We have ported a high-order spectral-element application, which performs the numerical simulation of seismic wave propagation resulting from earthquakes, to NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. Since this application runs in single precision, current GPU hardware that supports quasi-IEEE-754 single-precision arithmetic is suitable and sufficient.

We discussed two possible implementations of the code: the first is limited to the memory size of the card, and the second…

Acknowledgments

The authors would like to thank Christophe Merlet, Jean Roman, Cyril Zeller and Bruno Jobard for fruitful discussions. The Web page of Dominik Göddeke on GPGPU and the http://www.gpgpu.org Web site were very useful. The comments of three anonymous reviewers improved the manuscript. This research was funded in part by French ANR grant NUMASIS ANR-05-CIGC-002 and by US National Science Foundation grant NSF-0345656. The authors thank Hélène Barucq and INRIA for funding a one-month visit of Gordon Erlebacher…

References (30)

  • Dr. Dobb’s Portal web site (March 2008). URL...
  • R. Dolbeau, S. Bihan, F. Bodin, HMPP: A hybrid multi-core parallel programming environment, in: Proceedings of the...
  • D. Göddeke et al., Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, Internat. J. Parallel Emerg. Distrib. Syst. (2007)
  • G. Jost, H. Jin, J. Labarta, J. Giménez, J. Caubet, Performance analysis of multilevel parallel applications on shared...
  • T. Kim, Hardware-aware analysis and optimization of ‘Stable Fluids’, in: Proceedings of the ACM Symposium on...

Dimitri Komatitsch is a Professor of Computational Geophysics at University of Pau, France. He was born in 1970 and did his Ph.D. at Institut de Physique du Globe de Paris, France, in 1997.

David Michéa is a researcher at INRIA, University of Pau and CNRS, France. He was born in 1973 and did his Master’s thesis at University of Strasbourg, France, in 2006.

Gordon Erlebacher is a Professor of Computer Science at Florida State University, Tallahassee, USA. He was born in 1957 and did his Ph.D. at Columbia University in New York, USA, in 1983.
