DOI: 10.1145/2063384.2063427
research-article

FTI: high performance fault tolerance interface for hybrid systems

Published: 12 November 2011

ABSTRACT

Large scientific applications deployed on current petascale systems spend a significant fraction of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
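
To make the role of the Reed-Solomon encoding concrete, the following is a minimal sketch of a generic erasure code over GF(2^8), built from a Cauchy matrix. It is not FTI's implementation and ignores topology awareness and threading. The function names (`encode`, `recover`), the group size of 4 data blocks with 2 parity blocks, and the sample payloads are all assumptions chosen for illustration. The idea it shows is that a group of per-node checkpoint blocks can be rebuilt after losing as many blocks as there are parity blocks.

```python
# GF(2^8) arithmetic via log/antilog tables (primitive polynomial 0x11d).
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    """Multiplicative inverse of a nonzero field element."""
    return EXP[255 - LOG[a]]

def generator_rows(k, m):
    """Identity rows (data) followed by m Cauchy rows (parity); needs k + m <= 256."""
    rows = [[1 if r == c else 0 for c in range(k)] for r in range(k)]
    rows += [[gf_inv(j ^ (m + i)) for i in range(k)] for j in range(m)]
    return rows

def mat_vec(rows, blocks):
    """Apply a coefficient matrix to equal-length byte blocks, byte by byte."""
    n = len(blocks[0])
    out = []
    for row in rows:
        acc = bytearray(n)
        for coef, blk in zip(row, blocks):
            if coef:
                for b in range(n):
                    acc[b] ^= gf_mul(coef, blk[b])
        out.append(bytes(acc))
    return out

def encode(data, m):
    """Return m parity blocks for the k equal-length data blocks."""
    k = len(data)
    return mat_vec(generator_rows(k, m)[k:], data)

def invert(a):
    """Gauss-Jordan inversion over GF(2^8); subtraction is XOR."""
    n = len(a)
    aug = [list(row) + [1 if r == c else 0 for c in range(n)]
           for r, row in enumerate(a)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(scale, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [v ^ gf_mul(f, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def recover(survivors, k, m):
    """Rebuild the k data blocks from any k surviving blocks.

    survivors maps block index (0..k-1 data, k..k+m-1 parity) to bytes.
    """
    idx = sorted(survivors)[:k]
    g = generator_rows(k, m)
    return mat_vec(invert([g[i] for i in idx]), [survivors[i] for i in idx])

# Hypothetical group of four node checkpoints protected by two parity blocks.
data = [b"node0 state", b"node1 state", b"node2 state", b"node3 state"]
parity = encode(data, 2)
blocks = dict(enumerate(data + parity))
del blocks[1], blocks[3]          # simulate two simultaneous node losses
restored = recover(blocks, 4, 2)
```

In this sketch `restored` equals `data` even though two of the six blocks were lost. A production encoder would process large buffers with vectorised field arithmetic rather than a per-byte Python loop.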

Published in: SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2011, 866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Copyright © 2011 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Acceptance rates: SC '11 paper acceptance rate: 74 of 352 submissions (21%). Overall acceptance rate: 1,516 of 6,373 submissions (24%).