ABSTRACT
Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead high-frequency multi-level checkpoint technique in which we integrate a highly-reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one Fault-Tolerance dedicated thread per node. We implement our technique in the Fault Tolerance Interface FTI. We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw9.0 Tohoku Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
- A. Moody, G. Bronevetsky, K. Mohror, B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, 2010 Google ScholarDigital Library
- X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, Y. Xie. Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead for Future Exascale Systems. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Portland, 2009. Google ScholarDigital Library
- Z. Cheng, J. Dongarra, A scalable Checkpoint Encoding Algorithm for Diskless Checkpointing. Proceedings of the 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008, Nanjing, China, December, 2008. Google ScholarDigital Library
- J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate, Plfs: A checkpoint filesystem for parallel applications. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Portland, 2009. Google ScholarDigital Library
- L. Bautista-Gomez, N. Maruyama, A. Nukada, F. Cappello, S. Matsuoka, "Low-overhead diskless checkpoint for hybrid computing systems", International Conference on High Performance Computing, Goa, India, December 2010.Google Scholar
- L. Bautista-Gomez, N. Maruyama, F. Cappello, S. Matsuoka, "Distributed Diskless Checkpoint for large scale systems", IEEE/ACM International Symposium on Cluster, Cloud and Grid computing (CCGrid2010), Melbourne, Australia, May 2010. Google ScholarDigital Library
- The Top 500 http://www.top500.org/Google Scholar
- The Green 500 http://www.green500.org/Google Scholar
- F. Cappello, Fault tolerance in Petascale/Exascale systems: current knowledge, challenges and research opportunities International Journal on High Performance Computing Applications, SAGE, Volume 23, Issue 3, 2009. Google ScholarDigital Library
- B. Schroeder, E. Pinheiro, W. Weber. DRAM errors in the wild: A Large-Scale Field Study. In Proceedings of the 11th international joint conference on Measurement and modeling of computer systems (SIGMETRICS), ACM, New York, NY, USA, 2009. Google ScholarDigital Library
- B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable performance of the panasas parallel file system. In FAST'08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1--17, Berkeley, CA, USA, 2008. USENIX Association. Google ScholarDigital Library
- F. Schmuck, R. Haskin, GPFS: A Shared-Disk File System for Large Computing Clusters, Proceedings of the Conference on File and Storage Technologies, p.231--244, January 28--30, 2002 Google ScholarDigital Library
- S. Microsystems. Lustre file system, October 2008Google Scholar
- J. S. Plank, Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications, Technical Report CS-07-603, University of Tennessee, September, 2007.Google Scholar
- J. S. Plank, J. Luo, C. D. Schuman, L. Xu, Z. Wilcox-O'Hearn. A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST), San Francisco, CA, 2009. Google ScholarDigital Library
- S. Matsuoka, The Road to TSUBAME and beyond, Petascale Computing: Algorithms and Applications, Chapman & Hall Crc Computational Science Series, 2008, pp. 289--310.Google Scholar
- A GPU Accelerated Storage System, Abdullah Gharaibeh, Samer Al-Kiswany, Sathish Gopalakrishnan, Matei Ripeanu, IEEE/ACM International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, IL, June 2010. Google ScholarDigital Library
- A. Petitet, R. Whaley, J. Dongarra and A. Cleary. HPL -- a portable implementation of the high performance Linpack benchmark for distributed computers. http://www.netlib.org/benchmark/hplGoogle Scholar
- NA Kofahi, S Al-Bokhitan, A Al-Nazer, On Disk-based and Diskless Checkpointing for Parallel and Distributed Systems: An Empirical Analysis - Information Technology Journal, v.4 n.4, p.367--376, 2005.Google Scholar
- http://www.nvidia.com/object/fermi_architecture.htmlGoogle Scholar
- J. Duell, P. Hargrove and E. Roman, Requirements for Linux Checkpoint/Restart Lawrence Berkeley National Laboratory Technical Report LBNL-49659, 2002.Google ScholarCross Ref
- E. Roman, A Survey of Checkpoint/Restart Implementations Lawrence Berkeley National Laboratory Technical Report LBNL-54942, 2003.Google Scholar
- J. Duell, P. Hargrove and E. Roman, The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart Lawrence Berkeley National Laboratory Technical Report LBNL -- 54941, 2002.Google Scholar
- S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove and E. Roman, The LAM/MPI checkpoint/restart framework: system-initiated checkpointing Proc. Los Alamos Computer Science Institute (LACSI) Symp. Santa Fe, New Mexico, USA, October 2003.Google Scholar
- J. S. Plank, M. Beck, G. Kingsley and K. Li, Libckpt: Transparent checkpointing under UNIX. In Proceedings of the USENIX, Technical Conference, 213--223, 1995. Google ScholarDigital Library
- S. Matsuoka, I. Yamagata, H. Jitsumoto, H. Nakada, Speculative Checkpointing: Exploiting Temporal Affinity of Memory Operations, HPC Asia 2009, pp. 390--396, 2009.Google Scholar
- Z. Chen and J. J. Dongarra. Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources. In 20th International Parallel and Distributed Processing Symposium (IPDPS), Rhodes Island, Greece, april 2006. Google ScholarDigital Library
- J. Plank, K. Li, M. A. Puening, Diskless Checkpointing, IEEE Transactions on Parallel and Distributed Systems, v.9 n.10, p.972--986, October 1998. Google ScholarDigital Library
- J. S. Plank and L. Xu, Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Network Storage Applications, NCA-06: 5th IEEE International Symposium on Network Computing Applications, Cambridge, MA, July, 2006. Google ScholarDigital Library
- C. Lu, Scalable diskless checkpointing for large parallel systems, PhD. Thesis, University of Illinois at Urbana-Champaign, IL, 2005. Google ScholarDigital Library
- A. Moody, G. Bronevetsky, Scalable I/O Systems via Node-Local Storage: Approaching 1 TB/sec File I/O. DOE technical report, 2009.Google Scholar
- S. Matsuoka, T. Aoki, T. Endo, A. Nukada, T. Kato, A. Hasegawa, GPU-accelerated computing-from hype to mainstream, the rebirth of vector computing. Journal of Physics: Conference Series, v.180, no.012043, 2009.Google Scholar
- B. Schroeder, G. A. Gibson, Understanding failures in petascale computers, SciDAC, Journal of Physics: Conference Series, v.78, no.012022, 2007.Google Scholar
- M. Curry, L. Ward, T. Skjellum, and R. Brightwell. Accelerating reed-solomon coding in raid systems with gpus. In International Parallel and Distributed Processing Symposium, April 2008.Google ScholarCross Ref
- W. D. Gropp, R. Ross, and N. Miller. Providing efficient I/O redundancy in MPI environments. Lecture Notes in Computer Science, 3241:7786, September 2004.Google Scholar
- A. Nukada, S. Matsuoka, NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA in Proceedings at the International Heterogeneity in Computing Workshop, Alaska, 2011. (To appear) Google ScholarDigital Library
- D. Komatitsch, S. Tsuboi, C. Ji and J. Tromp, A 14.6 billion degrees of freedom, 5 teraflops, 2.5 terabyte earthquake simulation on the Earth Simulator, Proceedings of the ACM/IEEE Supercomputing SC'2003 conference, November 2003. Google ScholarDigital Library
- G. Grider, J. Loncaric, and D. Limpart, Roadrunner System Management Report, Los Alamos National Laboratory, Tech. Rep. LA-UR-07-7405, 2007.Google Scholar
- R. A. Oldfield, S. Arunagiri, P. J. Teller et al., Modeling the Impact of Checkpoints on Next-Generation Systems, in MSST'07. Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, 2007, pp. 30--46. Google ScholarDigital Library
- S. Y. Borkar, Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation, IEEE Micro, vol. 25, no. 6, pp. 10--16, 2005. Google ScholarDigital Library
- D. Reed, High-End Computing: The Challenge of Scale, Director's Colloquium, LANL, May 2004.Google Scholar
- K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin, J. Sancho, Entering the petaflop era: the architecture and performance of Roadrunner, Proceedings of the 2008 ACM/IEEE conference on Supercomputing, November 15--21, 2008, Austin, Texas. Google ScholarDigital Library
- B. Schroeder, G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proceedings of the International Conference on Dependable Systems and Networks (DSN'06), p.249--258, June 25--28, 2006. Google ScholarDigital Library
- http://www.open-mpi.org/Google Scholar
- John W. Young. 1974. A first order approximation to the optimum checkpoint interval. Commun. ACM 17, 9 (September 1974), 530--531. DOI=10.1145/361147.361115 http://doi.acm.org/10.1145/361147.361115 Google ScholarDigital Library
- http://www.gsic.titech.ac.jp/ccwww/index.php?www&&&/tgc/trouble_list.htmlGoogle Scholar
- D. Komatitsch, D. Michéa, G. Erlebacher, Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA, Journal of Parallel and Distributed Computing, vol. 69(5), p. 451--460, doi: 10.1016/j.jpdc.2009.01.006, 2009. Google ScholarDigital Library
- D. Komatitsch, G. Erlebacher, D. Göddeke, D. Michéa, High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster, Journal of Computational Physics, vol. 229(20), p. 7692--7714, doi: 10.1016/j.jcp.2010.06.024, 2010. Google ScholarDigital Library
- http://icl.cs.utk.edu/papi/Google Scholar
- http://www.geodynamics.org/cig/software/specfem3d-globeGoogle Scholar
- B. Kennet, E. Engdahl, Traveltimes for global earthquake location and phase identification. Geophys. J. Int., 105, 429--465, 1991.Google ScholarCross Ref
- M. Kikuchi, H. Kanamori, Inversion of complex body waves. III, Bull. Seismol. Soc. Am., 81, 2335--2350, 1991.Google Scholar
- M. Kikuchi, H. Kanamori, Note on Teleseismic Body-Wave Inversion Program, 2003. http://www.eri.u-tokyo.ac.jp/ETAL/KIKUCHI/Google Scholar
- D. Komatitsch, J. Ritsema, J. Tromp, The spectral-element method, Beowulf computing, and global seismology, Science 298, 1737--1742, 2002.Google ScholarCross Ref
- C. Lawson, R. Hanson, Solving Least Squares Problems, Prentice-Hall, New Jersey, 340 pp, 1974.Google Scholar
- T. Nakamura, S. Tsuboi, Y. Kaneda, Y. Yamanaka, Rupture process of the 2008 Wenchuan, China earthquake inferred from teleseismic waveform inversion and forward modeling of broadband seismic waves, Tectonophysics, vol. 491, 72--84, 2010.Google ScholarCross Ref
- S. Tsuboi, D. Komatitsch, C. Ji, J. Tromp, Broadband modelling of the 2002 Denali fault earthquake on the Earth Simulator, Phys. Earth Planet. Inter. 139, 305--312, 2003.Google ScholarCross Ref
Recommendations
Recovery Device for Real-Time Dual-Redundant Computer Systems
This paper proposes the design of specialized hardware, called Recovery Device, for a dual-redundant computer system that operates in real-time. Recovery Device executes all fault-tolerant services including fault detection, fault type determination, ...
Evaluation of Rodinia Codes on Intel Xeon Phi
ISMS '13: Proceedings of the 2013 4th International Conference on Intelligent Systems, Modelling and SimulationHigh performance computing (HPC) is a niche area where various parallel benchmarks are constantly used to explore and evaluate the performance of Heterogeneous computing systems on the horizon. The Rodinia benchmark suite, a collection of parallel ...
Sampling + DMR: practical and low-overhead permanent fault detection
ISCA '11With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes ...
Comments