Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA

https://doi.org/10.1016/j.jpdc.2009.01.006

Abstract

We port a high-order finite-element application that performs the numerical simulation of seismic wave propagation resulting from earthquakes in the Earth, to NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it at the cost of reduced performance. We discuss and compare two implementations of the code: one that has maximum efficiency but is limited by the memory size of the card, and one that can handle larger problems but is less efficient. We use a coloring scheme to efficiently handle summation operations over nodes on a topology with variable valence. We perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25.

Introduction

Over the past several years, non-graphical applications ported to GPUs have steadily grown more numerous [25]. Many applications have been adapted to the GPU using one of several specialty languages, ranging from graphics-oriented languages such as Cg, HLSL, and GLSL to languages that abstract away the graphics component to ease the implementation of non-graphical code, such as Brook, Scout [20], Sh, RapidMind, and CUDA. CUDA in particular gives users a large amount of control over a powerful streaming/SIMD computer. Since CUDA became available in 2007, the number of General-Purpose computing on Graphics Processing Units (GPGPU) applications has increased drastically. CUDA is being used to study physical problems as diverse as molecular dynamics [2], fluid dynamics [4], and astrophysical simulations [24].

Our research group has developed a software package called SPECFEM3D_GLOBE for the three-dimensional numerical simulation of seismic wave propagation resulting from earthquakes in the Earth, based on the so-called spectral-element method (SEM) [16], [19], [17]. The SEM is similar to a finite-element method with high-degree polynomial basis functions. The mesh of elements required to cover the Earth is usually so large that calculations are very expensive in terms of CPU time and require a large amount of memory, at least a few terabytes. We therefore usually turn to parallel programming with MPI [17], [15], OpenMP, or both. Typical runs require a few hundred processors and a few hours of elapsed wall-clock time. Current large runs require a few thousand processors, typically 2000–4000, and two to five days of elapsed wall-clock time [17], [15]. It is therefore of interest to try to speed up the calculations on each node of such a parallel machine by turning to GPGPU. We will see in the next section that a significant amount of work has been devoted in the community to porting low-order finite-element codes to the GPU, but to our knowledge we are the first to address the issue of porting a more complex, higher-order spectral-like finite-element technique.

Rather than attempt to solve the fully resolved problem using both MPI and the GPU simultaneously, which we will consider in a future study, we choose to investigate the potential for speedup on a single CPU, with acceleration provided solely by the GPU. Such an implementation can be very useful in the absence of a GPU cluster for quick parametric studies and code development. In the future, the experience gained from this exercise will provide the intuition needed to accelerate an MPI version of the code using graphics processors. Our seismic wave propagation application is written in Fortran95 + MPI, but we wrote a serial version in C for these tests to facilitate interfacing with CUDA, which is currently easier from C.

A major issue in all finite-element codes, whether low or high order, is that dependencies arise from the summation of elemental contributions at global points shared among multiple mesh elements, which would ordinarily require atomic operations. We will show that a mesh coloring technique allows us to overcome this difficulty and obtain a speedup of 25x, consistent with speedups obtained by many other researchers for other realistic applications in different fields.
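A schematic version of this idea is sketched below; it is illustrative rather than the exact kernel used in our code, and the names (ibool, color_offset) and the 125-point element size, which corresponds to polynomial degree 4, are assumptions. The scatter-add kernel is launched once per color; within a color no two elements share a global point, so the summation needs no atomics, and successive launches on the same CUDA stream execute in order, which synchronizes the colors.

    #define NODES_PER_ELEM 125  /* 5 x 5 x 5 GLL points, polynomial degree 4 */

    /* Sum elemental contributions into the global acceleration array.
       Grid: one block per element of the current color, one thread per node. */
    __global__ void scatter_add(const int   *ibool,         /* local -> global */
                                const float *local_contrib, /* element forces  */
                                float       *accel,         /* global array    */
                                int          first_elem)    /* start of color  */
    {
        int elem  = first_elem + blockIdx.x;
        int node  = threadIdx.x;
        int iglob = ibool[elem * NODES_PER_ELEM + node];

        /* no atomicAdd needed: within one color, no two elements in
           flight ever touch the same global point iglob */
        accel[iglob] += local_contrib[elem * NODES_PER_ELEM + node];
    }

    /* Host side: serialize over colors; elements are renumbered so that
       color c occupies the range [color_offset[c], color_offset[c+1]).
       Launches on the same stream run in order, synchronizing the colors. */
    void assemble_forces(const int *d_ibool, const float *d_local,
                         float *d_accel, const int *color_offset, int num_colors)
    {
        for (int c = 0; c < num_colors; c++) {
            int first = color_offset[c];
            int count = color_offset[c + 1] - first;
            scatter_add<<<count, NODES_PER_ELEM>>>(d_ibool, d_local,
                                                   d_accel, first);
        }
    }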

The remainder of the paper is organized as follows: We begin with a description of similar problems ported to the GPU in Section 2. The serial algorithm is discussed in Section 3, followed by the CUDA implementation of the spectral-element algorithm in Section 4. Within this section, we discuss an implementation that runs fully on the GPU, including the coloring scheme that is needed to avoid atomic operations. We also describe a version of our algorithm capable of solving larger problems that cannot fit fully on the GPU. Optimizations we considered to improve efficiency are treated in Section 5. We provide some numerical code validation in Section 6, and analyze some performance results in Section 7. We conclude in Section 8.

Section snippets

Related work

Depending on the algorithms used and the ease of parallelization, applications ported to the GPU have achieved speedups ranging from 3x to 50x. Of course, whether a specific speedup is considered impressive or not depends strongly on how well the original code was optimized. Algorithms such as finite-difference and finite-volume methods are amenable to large accelerations due to their sparse structure. On the other hand, dense methods such as spectral…

Serial algorithm

The SEM is a high-order finite-element algorithm in which a domain is subdivided into cells within which variables are approximated with high-order interpolation. We have adopted the SEM to simulate numerically the propagation of seismic waves resulting from earthquakes in the Earth [16]. It solves the variational form of the elastic wave equation in the time domain on a non-structured mesh of hexahedral elements called spectral elements to compute the displacement vector of any point of the…
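The core of each time step is a gather-compute-scatter loop over the spectral elements. The serial C sketch below illustrates that structure; it is schematic only, compute_element_forces stands in for the tensorized derivative-matrix products, and the array names are illustrative.

    #define NGLL3 125  /* 5 x 5 x 5 Gauss-Lobatto-Legendre points per element */

    /* Local stiffness computation (tensorized derivative-matrix products),
       omitted here; a placeholder declaration keeps the sketch compilable. */
    void compute_element_forces(const float *u_loc, float *f_loc);

    void sem_internal_forces(int nspec, const int *ibool,
                             const float *displ, float *accel)
    {
        float u_loc[NGLL3], f_loc[NGLL3];

        for (int ispec = 0; ispec < nspec; ispec++) {
            /* gather: copy the global displacement at this element's points */
            for (int k = 0; k < NGLL3; k++)
                u_loc[k] = displ[ibool[ispec * NGLL3 + k]];

            /* compute: apply the elemental stiffness operator locally */
            compute_element_forces(u_loc, f_loc);

            /* scatter: sum into the global array; grid points shared by
               neighboring elements create the dependency discussed above */
            for (int k = 0; k < NGLL3; k++)
                accel[ibool[ispec * NGLL3 + k]] += f_loc[k];
        }
    }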

CUDA implementation

In this section, we present two implementations of our code using CUDA. The first runs fully on the GPU and is therefore expected to deliver the largest speedup, but the problem size is limited by the on-board GPU memory. The host calls a sequence of kernels at every time iteration (Fig. 3). This approach was also adopted in [27]. All local and global arrays are stored on the GPU. Pre-processing, i.e., mesh creation, and post-processing are the responsibility of the CPU. Since these are cheap…
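A schematic host-side driver for this fully GPU-resident version might look as follows; the kernel and variable names are illustrative, not the exact code. All fields remain in device memory for the whole run, and only a few bytes per time step, the seismogram sample, cross the PCI Express bus.

    /* Schematic host time loop for the fully GPU-resident version.
       Kernels (update_displ, compute_forces, add_source, update_veloc)
       and device arrays are declared elsewhere; names are illustrative. */
    void time_loop(int nstep, int num_colors,
                   const int *elems_in_color, const int *color_offset,
                   int nglob_blocks, float dt,
                   const int *d_ibool, float *d_displ, float *d_veloc,
                   float *d_accel, float *h_seis, int irec)
    {
        for (int it = 0; it < nstep; it++) {
            /* predictor: update displacement from velocity and acceleration */
            update_displ<<<nglob_blocks, 128>>>(d_displ, d_veloc, d_accel, dt);

            /* internal forces: one launch per mesh color (see above) */
            for (int c = 0; c < num_colors; c++)
                compute_forces<<<elems_in_color[c], 125>>>(d_ibool, d_displ,
                                                           d_accel,
                                                           color_offset[c]);

            add_source<<<1, 1>>>(d_accel, it);   /* point source term */

            /* corrector: update velocity from the assembled acceleration */
            update_veloc<<<nglob_blocks, 128>>>(d_veloc, d_accel, dt);

            /* only the seismogram sample leaves the card each step */
            cudaMemcpy(&h_seis[it], &d_displ[irec], sizeof(float),
                       cudaMemcpyDeviceToHost);
        }
    }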

Optimizations common to the two CUDA implementations

In this section, we discuss some of the optimizations incorporated into the kernels. As a rule of thumb, the fastest kernels minimize access to device memory, avoid non-coalesced accesses to global memory, avoid bank conflicts when reading from or writing to shared memory, and try to minimize register and/or shared memory usage to maximize occupancy. At the same time, one strives to have many blocks running per multiprocessor to overlap the latencies of memory transfers with useful…
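As an illustration of these rules, the schematic kernel fragment below (names and sizes are illustrative, not the actual kernel) stages the small derivative matrix in shared memory once per block, so every thread reuses it without further global-memory traffic, and lays out per-element arrays so that consecutive threads read consecutive addresses, i.e., coalesced accesses.

    /* Schematic kernel fragment illustrating the optimization rules above.
       One block per spectral element, one thread per GLL point. */
    #define NGLL  5                     /* 1D Gauss-Lobatto-Legendre points */
    #define NGLL2 (NGLL * NGLL)         /* size of the derivative matrix    */
    #define NGLL3 (NGLL * NGLL * NGLL)  /* points per element (125)         */

    __global__ void forces_sketch(const float *d_hprime, /* NGLL x NGLL matrix */
                                  const float *d_u_loc,  /* element-major data */
                                  float       *d_f_loc)
    {
        __shared__ float s_hprime[NGLL2];
        int tid = threadIdx.x;          /* 0..NGLL3-1 */

        /* stage the small derivative matrix in shared memory once per
           block: a single coalesced load, then reused by every thread */
        if (tid < NGLL2)
            s_hprime[tid] = d_hprime[tid];
        __syncthreads();

        /* coalesced global read: thread tid of block b touches address
           b * NGLL3 + tid, so consecutive threads hit consecutive words */
        float u = d_u_loc[blockIdx.x * NGLL3 + tid];

        /* placeholder arithmetic standing in for the tensorized
           derivative-matrix products, kept so the sketch compiles */
        d_f_loc[blockIdx.x * NGLL3 + tid] = u * s_hprime[tid % NGLL2];
    }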

Numerical validation of the two CUDA versions

In Fig. 7 we validate the two CUDA implementations of our spectral-element code by comparing the time evolution of the vertical component of the displacement vector recorded at one grid point, called a ‘seismogram’ in the field of seismic wave propagation, produced by waves generated by an earthquake source located at another grid point and propagating across the mesh. We take the mesh of Fig. 1 (left) and put the earthquake source in element 3711 at global grid point 256,406 and record the…
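A simple way to quantify the agreement between the CPU and GPU seismograms is the maximum pointwise difference, as in the schematic helper below (ours, for illustration); in single precision one expects discrepancies close to the 32-bit rounding level.

    #include <math.h>

    /* Schematic check: largest absolute sample-by-sample difference
       between the CPU and GPU seismograms over nstep time steps. */
    float max_abs_diff(const float *seis_cpu, const float *seis_gpu, int nstep)
    {
        float max_d = 0.0f;
        for (int i = 0; i < nstep; i++) {
            float d = fabsf(seis_cpu[i] - seis_gpu[i]);
            if (d > max_d)
                max_d = d;
        }
        return max_d;
    }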

Performance analysis and speedup obtained

Our experimental setup is composed of an NVIDIA GeForce 8800 GTX card installed on the PCI Express 1 bus of a dual-processor dual-core 64-bit Intel Xeon E5345 2.33 GHz PC with 8 GB of RAM and running Linux kernel 2.6.23; and of an NVIDIA GeForce GTX 280 card installed on the PCI Express 1 bus of a dual-processor dual-core 64-bit AMD Opteron PC with 8 GB of RAM and running Linux kernel 2.6.20. The 8800 GTX card has 16 multiprocessors, i.e., 128 cores, and 768 MB of memory, and the memory…

Lessons learned, conclusions and future work

We have ported a high-order spectral-element application, which performs the numerical simulation of seismic wave propagation resulting from earthquakes, to NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. Since this application runs in single precision, current GPU hardware that supports quasi-IEEE-754 single-precision arithmetic is suitable and sufficient.

We discussed two possible implementations of the code: the first is limited to the memory size of the card, and the second…

Acknowledgments

The authors would like to thank Christophe Merlet, Jean Roman, Cyril Zeller and Bruno Jobard for fruitful discussions. The Web page of Dominik Göddeke on GPGPU and the http://www.gpgpu.org Web site were very useful. The comments of three anonymous reviewers improved the manuscript. This research was funded in part by French ANR grant NUMASIS ANR-05-CIGC-002 and by US National Science Foundation grant NSF-0345656. The authors thank Hélène Barucq and INRIA for funding a one-month visit of Gordon Erlebacher…

References (30)

  • Dr. Dobb’s Portal web site (March 2008). URL...
  • R. Dolbeau, S. Bihan, F. Bodin, HMPP: A hybrid multi-core parallel programming environment, in: Proceedings of the...
  • D. Göddeke et al., Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, Internat. J. Parallel Emerg. Distrib. Syst. (2007)
  • G. Jost, H. Jin, J. Labarta, J. Giménez, J. Caubet, Performance analysis of multilevel parallel applications on shared...
  • T. Kim, Hardware-aware analysis and optimization of ‘Stable Fluids’, in: Proceedings of the ACM Symposium on...

Dimitri Komatitsch is a Professor of Computational Geophysics at University of Pau, France. He was born in 1970 and did his Ph.D. at Institut de Physique du Globe de Paris, France, in 1997.

David Michéa is a researcher at INRIA, University of Pau and CNRS, France. He was born in 1973 and did his Master’s thesis at University of Strasbourg, France, in 2006.

Gordon Erlebacher is a Professor of Computer Science at Florida State University, Tallahassee, USA. He was born in 1957 and did his Ph.D. at Columbia University in New York, USA, in 1983.
