Frontiers in Signal Processing
Adaptive and Power-Aware Resilience for Extreme-scale Computing
PP. 24-40. Pub. Date: July 10, 2017
Author(s)
- Xiaolong Cui*
Department of Computer Science, University of Pittsburgh, Pittsburgh, United States
- Taieb Znati
Department of Computer Science, University of Pittsburgh, Pittsburgh, United States
- Rami Melhem
Department of Computer Science, University of Pittsburgh, Pittsburgh, United States
Abstract
Keywords
References
[1] S. Ahern et al., “Scientific discovery at the exascale, a report from the doe ascr 2011 workshop on exascale data management, analysis, and visualization,” 2011.
[2] O. Sarood et al., “Maximizing throughput of overprovisioned hpc data centers under a strict power budget,” ser. SC '14, Piscataway, NJ, USA, 2014, pp. 807–818. [Online]. Available: http://dx.doi.org/10.1109/SC.2014.71
[3] O. Villa et al., “Scaling the power wall: A path to exascale,” ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 830–841. [Online]. Available: http://dx.doi.org/10.1109/SC.2014.73
[4] E. Elnozahy et al., “A survey of rollback-recovery protocols in message-passing systems,” ACM Comput. Surv., vol. 34, no. 3, pp. 375–408, 2002.
[5] K. Chandy and C. Ramamoorthy, “Rollback and recovery strategies for computer programs,” Computers, IEEE Transactions on, vol. C-21, no. 6, pp. 546–556, June 1972.
[6] E. Elnozahy and J. Plank, “Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery,” DSC, vol. 1, no. 2, pp. 97–108, April-June 2004.
[7] R. Riesen, K. Ferreira, J. R. Stearley, R. Oldfield, J. H. Laros III, K. T. Pedretti, and R. Brightwell, “Redundant computing for exascale systems,” December 2010.
[8] P. Hargrove and J. Duell, “Berkeley lab checkpoint/restart (blcr) for linux clusters,” in Journal of Physics: Conference Series, vol. 46, no. 1, 2006, p. 494.
[9] J. Plank and M. Thomason, “The average availability of parallel checkpointing systems and its importance in selecting runtime parameters,” in Fault-Tolerant Computing, 1999, pp. 250–257.
[10] B. Randell, “System structure for software fault tolerance,” in Proceedings of the international conference on Reliable software. New York, NY, USA: ACM, 1975, pp. 437–449. [Online]. Available: http://doi.acm.org/10.1145/800027.808467
[11] G. Zheng, L. Shi, and L. Kale, “FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI,” in Cluster Computing, 2004 IEEE International Conference on, Sept 2004, pp. 93–103.
[12] A. Guermouche et al., “Uncoordinated checkpointing without domino effect for send-deterministic mpi applications,” in IPDPS, May 2011, pp. 989–1000.
[13] H.-C. Nam, J. Kim, S. Lee, and S. Lee, “Probabilistic checkpointing,” in Proceedings of Intl. Symposium on Fault-Tolerant Computing, 1997, pp. 153–160.
[14] S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira, “Adaptive incremental checkpointing for massively parallel systems,” in ICS, St. Malo, France, 2004.
[15] J. Plank and K. Li, “Faster checkpointing with n+1 parity,” in Fault-Tolerant Computing, June 1994, pp. 288–297.
[16] E. Elnozahy and W. Zwaenepoel, “Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output commit,” TC, vol. 41, pp. 526–531, 1992.
[17] K. Li, J. F. Naughton, and J. S. Plank, “Low-latency, concurrent checkpointing for parallel programs,” IEEE Trans. Parallel Distrib. Syst., vol. 5, no. 8, pp. 874–879, Aug. 1994. [Online]. Available: http://dx.doi.org/10.1109/71.298215
[18] A. Moody, G. Bronevetsky, K. Mohror, and B. Supinski, “Design, modeling, and evaluation of a scalable multilevel checkpointing system,” in SC, 2010, pp. 1–11. [Online]. Available: http://dx.doi.org/10.1109/SC.2010.18
[19] D. Hakkarinen and Z. Chen, “Multilevel diskless checkpointing,” Computers, IEEE Transactions on, vol. 62, no. 4, pp. 772–783, April 2013.
[20] F. Chen, D. A. Koufaty, and X. Zhang, “Hystor: Making the best use of solid state drives in high performance storage systems,” ser. ICS, New York, USA, 2011, pp. 22–32. [Online]. Available: http://doi.acm.org/10.1145/1995896.1995902
[21] D. Fiala et al., “Detection and correction of silent data corruption for large-scale high-performance computing,” ser. SC, Los Alamitos, CA, USA, 2012, pp. 78:1–78:12. [Online]. Available: http://dl.acm.org/citation.cfm?id=2388996.2389102
[22] A. Lefray, T. Ropars, and A. Schiper, “Replication for send-deterministic MPI HPC applications,” ser. FTXS '13. New York, NY, USA: ACM, 2013, pp. 33–40. [Online]. Available: http://doi.acm.org/10.1145/2465813.2465819
[23] F. Cappello, “Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities,” IJHPCA, vol. 23, no. 3, pp. 212–226, 2009.
[24] X. Ni, E. Meneses, N. Jain, and L. V. Kalé, “Acr: Automatic checkpoint/restart for soft and hard error protection,” ser. SC. New York, NY, USA: ACM, 2013, pp. 7:1–7:12. [Online]. Available: http://doi.acm.org/10.1145/2503210.2503266
[25] J. Stearley et al., “Does partial replication pay off?” in DSN-W, June 2012, pp. 1–6.
[26] J. Elliott et al., “Combining partial redundancy and checkpointing for HPC,” ser. ICDCS. Washington, DC, USA: IEEE Computer Society, 2012, pp. 615–626. [Online]. Available: http://dx.doi.org/10.1109/
[27] C. Engelmann and S. Böhm, “Redundant execution of hpc applications with mr-mpi,” in PDCN, 2011, pp. 15–17.
[28] H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, “Combining Process Replication and Checkpointing for Resilience on Exascale Systems,” INRIA, Rapport de recherche RR-7951, May 2012. [Online]. Available: http://hal.inria.fr/hal-00697180
[29] L. Alvisi and K. Marzullo, “Message logging: Pessimistic, optimistic, causal, and optimal,” IEEE Trans. Softw. Eng., vol. 24, no. 2, pp. 149–159, 1998.
[30] J. Daly, “A higher order estimate of the optimum checkpoint interval for restart dumps,” Future Gener. Comput. Syst., vol. 22, no. 3, pp. 303–312, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1016/j.future.2004.11.016
[31] S. Albers, A. Antoniadis, and G. Greiner, “On multi-processor speed scaling with migration: Extended abstract,” ser. SPAA '11. New York, NY, USA: ACM, 2011, pp. 279–288. [Online]. Available: http://doi.acm.org/10.1145/1989493.1989539
[32] P. Kling and P. Pietrzyk, “Profitable scheduling on multiple speed-scalable processors,” ser. SPAA '13. New York, NY, USA: ACM, 2013, pp. 251–260. [Online]. Available: http://doi.acm.org/10.1145/2486159.2486183