Tsinghua Science and Technology  2021, Vol. 26 Issue (3): 370-383    doi: 10.26599/TST.2020.9010012
Special Issue on High Performance Computing     
More Bang for Your Buck: Boosting Performance with Capped Power Consumption
Juan Chen*, Xinxin Qi, Feihao Wu, Jianbin Fang, Yong Dong, Yuan Yuan, Zheng Wang, Keqin Li
College of Computer, National University of Defense Technology, Changsha 410073, China.
School of Computing, University of Leeds, Leeds LS2 9JT, UK.
Department of Computer Science, State University of New York, New Paltz, NY 12561, USA.

Abstract  

Achieving faster performance without increasing power and energy consumption for computing systems is an outstanding challenge. This paper develops a novel resource allocation scheme for memory-bound applications running on High-Performance Computing (HPC) clusters, aiming to improve application performance without breaching peak power constraints and total energy consumption. Our scheme estimates how the number of processor cores and the CPU frequency setting affect the application performance. It then uses the estimate to provide additional compute nodes to memory-bound applications if it is profitable to do so. We implement and apply our algorithm to 12 representative benchmarks from the NAS parallel benchmark and HPC Challenge (HPCC) benchmark suites and evaluate it on a representative HPC cluster. Experimental results show that our approach can effectively mitigate memory contention to improve application performance, and it achieves this without significantly increasing the peak power or overall energy consumption. Our approach obtains on average 12.69% performance improvement over the default resource allocation strategy, while using 7.06% less total power, which translates into 17.77% energy savings.
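The allocation idea the abstract describes can be sketched as a small search: estimate how extra nodes and a lower CPU frequency would change runtime, and accept a configuration only if it is predicted to be faster while staying within the peak-power budget of the default allocation. This is a minimal illustration, not the paper's actual implementation; the runtime and power models below are toy placeholders.

```python
# Sketch of the search (illustrative models, not the paper's measured ones).

def predicted_time(nodes, freq, base_time=100.0, base_nodes=4, mem_frac=0.8):
    """Toy model: the memory-bound fraction speeds up with extra nodes
    (less contention); the CPU-bound fraction slows down below 2.6 GHz."""
    mem_part = mem_frac * base_time * base_nodes / nodes
    cpu_part = (1.0 - mem_frac) * base_time * 2.6 / freq
    return mem_part + cpu_part

def node_power(freq):
    """Toy per-node power: a fixed part plus a frequency-dependent part."""
    return 20.0 + 90.0 * freq / 2.6

def choose(base_nodes=4, freqs=(2.6, 2.2, 1.8, 1.4), max_extra=2):
    """Pick (nodes, frequency) minimizing predicted time, subject to the
    peak-power cap of the default allocation at the highest frequency."""
    t_best = predicted_time(base_nodes, 2.6)
    p_cap = base_nodes * node_power(2.6)
    best = (base_nodes, 2.6)
    for extra in range(1, max_extra + 1):
        nodes = base_nodes + extra
        for f in freqs:
            t = predicted_time(nodes, f)
            if t < t_best and nodes * node_power(f) <= p_cap:
                t_best, best = t, (nodes, f)
    return best

print(choose())  # (6, 1.4): two extra nodes at a reduced frequency win
```

Under these toy parameters the search trades frequency for nodes: the memory-bound fraction gains more from extra nodes than the CPU-bound fraction loses from the lower clock, which is the trade-off the paper exploits.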



Key words: energy efficiency; high-performance computing; performance boost; power control; processor frequency scaling
Received: 25 March 2020      Published: 23 October 2020
Corresponding author: Juan Chen, E-mail: juanchen@nudt.edu.cn. Author e-mails: qixinxin19@nudt.edu.cn; wufeihao16@nudt.edu.cn; j.fang@nudt.edu.cn; yongdong@nudt.edu.cn; yuanyuan@nudt.edu.cn; z.wang5@leeds.ac.uk; lik@newpaltz.edu
About the authors:
Juan Chen received the PhD degree from National University of Defense Technology, China in 2007. She is now an associate professor at National University of Defense Technology. Her research interests focus on supercomputer systems and energy-efficient software optimization methods.
Xinxin Qi received the BS degree from Sun Yat-Sen University, China in 2019, and is now a master student at National University of Defense Technology. Her research interests include high performance computing and energy-efficient computing.
Feihao Wu received the MS degree from National University of Defense Technology in 2018. His research interests include large-scale parallel numerical simulation and energy-efficient computing.
Jianbin Fang is an assistant professor in computer science at NUDT. He obtained the PhD degree from Delft University of Technology in 2014. His research interests include parallel programming for many-cores, parallel compilers, performance modeling, and scalable algorithms.
Yong Dong received the PhD degree from National University of Defense Technology, China in 2012. He is now an associate professor at National University of Defense Technology. His main research interests include supercomputer systems and storage systems.
Yuan Yuan received the PhD degree from National University of Defense Technology, China in 2011. He is now an associate professor at National University of Defense Technology. His research interests include supercomputer systems and HPC monitoring and diagnosis.
Zheng Wang received the PhD degree in computer science from the University of Edinburgh, UK in 2011. He is an associate professor at the University of Leeds. His research interests include parallel program optimisation, systems security, and applied machine learning. He received four best paper awards for his work on machine learning based compiler optimisation (PACT’10, CGO’17, PACT’17, and CGO’19).
Keqin Li received the PhD degree in computer science from the University of Houston, USA in 1990. He is a SUNY distinguished professor of computer science at the State University of New York and a distinguished professor at Hunan University, China. His current research interests include cloud computing, fog computing and mobile edge computing, energy-efficient computing and communication, embedded systems and cyber-physical systems, heterogeneous computing systems, big data computing, high-performance computing, CPU-GPU hybrid and cooperative computing, computer architectures and systems, computer networking, machine learning, and intelligent and soft computing. He has published over 710 journal articles, book chapters, and refereed conference papers, and has received several best paper awards.
Cite this article:

Juan Chen, Xinxin Qi, Feihao Wu, Jianbin Fang, Yong Dong, Yuan Yuan, Zheng Wang, Keqin Li. More Bang for Your Buck: Boosting Performance with Capped Power Consumption. Tsinghua Science and Technology, 2021, 26(3): 370-383.

URL:

http://tst.tsinghuajournals.com/10.26599/TST.2020.9010012     OR     http://tst.tsinghuajournals.com/Y2021/V26/I3/370

Fig. 1 Performance improvement when adding additional computing nodes without using extra parallel processes.
Fig. 2 How memory traffic on the master node changes when the number of computing nodes increases.
Fig. 3 Total peak power across multiple nodes as the number of nodes increases. The red dashed line denotes the target total power value of 2.6 GHz under N nodes.
Fig. 4 Total energy consumption at different CPU frequencies when using different numbers of compute nodes. The red dashed line denotes the target total energy consumption of 2.6 GHz under N nodes.
Notation | Description
c | Number of processor cores on each computing node
n | Number of processes
N | Number of computing nodes assigned by the default resource allocation strategy
ΔN | Number of computing nodes to be added
b_i(t) | Memory traffic of computing node i at time t
B_N(t) | Average value of b_i(t) over the N nodes, 0 ≤ i ≤ N
B_N | Maximum value of B_N(t) for 0 ≤ t ≤ T, where T is the program running time on N nodes
B | Physical memory bandwidth of a single computing node
α | Threshold for the ratio of memory traffic to memory bandwidth
f_max | Maximum CPU clock frequency
f_min | Minimum CPU clock frequency
f_i | Processor clock frequency, which satisfies f_min ≤ f_i ≤ f_max
P_cpu(f_i) | Power consumption of one CPU core at clock frequency f_i
P_cpu^idle | Power consumption of each CPU core in the idle state, equal to k·P_cpu(f_max)
P_mem | Power consumption of the DRAM on one computing node
P_other | Power consumption of components other than the CPU and DRAM on one computing node
Table 1 Parameters used by our power-performance model.
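The notation in Table 1 implies a simple per-node power decomposition: active cores at P_cpu(f), idle cores at k·P_cpu(f_max), plus DRAM and other components, together with the α test that classifies an application as memory-bound. A minimal sketch follows; the constants k, P_mem, P_other, and alpha are illustrative assumptions, not the paper's measured values.

```python
# Per-node power model implied by Table 1 (constants are assumed).

def node_power(active, c, p_cpu_f, p_cpu_fmax, k=0.3, p_mem=15.0, p_other=40.0):
    """P_node = active*P_cpu(f) + (c - active)*k*P_cpu(f_max) + P_mem + P_other."""
    return active * p_cpu_f + (c - active) * k * p_cpu_fmax + p_mem + p_other

def is_memory_bound(b_n_peak, b_phys, alpha=0.8):
    """Table 1's alpha test: peak average memory traffic B_N above a
    threshold fraction alpha of the physical bandwidth B."""
    return b_n_peak > alpha * b_phys

# A 16-core node with all cores active at f_max, taking 7.8 W/core
# (the 2.6 GHz entry of Table 4): about 179.8 W in total.
print(node_power(16, 16, 7.8, 7.8))
print(is_memory_bound(45.0, 50.0))  # True: 45 exceeds 0.8 * 50
```

The idle-core term matters because the scheme may run fewer processes per node after spreading a job across more nodes, so unused cores still draw k·P_cpu(f_max) each.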
CPU frequency (GHz) | P_cpu(f_i) (W)
f_max | P_cpu(f_max)
f_max - Δf | P_cpu(f_max - Δf)
f_max - 2Δf | P_cpu(f_max - 2Δf)
... | ...
f_min | P_cpu(f_min)
Table 2 Model parameters for single-core power estimation.
Benchmark | Type | N | ΔN* | f* (GHz) | Performance boost (%) | Power saving (%) | Energy saving (%)
sp.D.169 | Memory-bound | 9 | 2 | 1.2 | 13.98 | 2.75 | 16.22
RandomAccess.160 | Memory-bound | 8 | 1 | 1.2 | 9.80 | 0.35 | 11.27
mg.D.128 | Memory-bound | 7 | 1 | 1.9 | 18.63 | 0.40 | 18.73
STREAM.160 | Memory-bound | 8 | 1 | 1.7 | 10.19 | 2.42 | 8.84
DGEMM.160 | Memory-bound | 8 | 2 | 1.5 | 18.82 | 2.12 | 19.85
cg.D.128 | Memory-bound | 7 | 2 | 1.2 | -3.34 | 2.90 | -0.29
bt.D.169 | Memory-bound | 9 | 1 | 1.8 | -1.98 | 4.27 | 2.26
lu.D.128 | Memory-bound | 7 | 0 | 2.6 | 0 | 0 | 0
ft.D.128 | CPU-bound | 7 | 0 | 2.6 | 0 | 0 | 0
FFT.160 | CPU-bound | 8 | 0 | 2.6 | 0 | 0 | 0
PTRANS.160 | CPU-bound | 8 | 0 | 2.6 | 0 | 0 | 0
ep.D.128 | CPU-bound | 7 | 0 | 2.6 | 0 | 0 | 0
Table 3 Algorithm results for about ten compute nodes.
Fig. 5 Memory traffic histogram for benchmark sp.
CPU frequency (GHz) | P_cpu(f_i) (W)
2.6 | 7.8
2.5 | 7.7
2.4 | 7.5
2.3 | 7.3
2.2 | 7.1
2.1 | 7.0
2.0 | 6.8
1.9 | 6.7
1.8 | 6.4
1.7 | 6.2
1.6 | 6.1
1.5 | 6.0
1.4 | 5.9
1.3 | 5.8
1.2 | 5.7
Table 4 Single-core power and corresponding CPU frequencies.
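Table 4 can be used directly as a lookup table. The sketch below compares aggregate CPU power before and after the adjustment Table 3 reports for sp.D (N = 9 nodes at 2.6 GHz versus N + ΔN* = 11 nodes at f* = 1.2 GHz); cores_per_node = 16 is an assumed value for illustration, not one the tables fix.

```python
# Single-core CPU power (W) per frequency (GHz), copied from Table 4.
P_CPU = {2.6: 7.8, 2.5: 7.7, 2.4: 7.5, 2.3: 7.3, 2.2: 7.1, 2.1: 7.0,
         2.0: 6.8, 1.9: 6.7, 1.8: 6.4, 1.7: 6.2, 1.6: 6.1, 1.5: 6.0,
         1.4: 5.9, 1.3: 5.8, 1.2: 5.7}

def cpu_power(nodes, cores_per_node, freq_ghz):
    """Aggregate CPU power of a job spread over `nodes` nodes."""
    return nodes * cores_per_node * P_CPU[freq_ghz]

baseline = cpu_power(9, 16, 2.6)   # about 1123.2 W
tuned = cpu_power(11, 16, 1.2)     # about 1003.2 W
assert tuned < baseline  # more nodes, yet lower aggregate CPU power
```

Even with two extra nodes, the lower per-core power at 1.2 GHz keeps the aggregate CPU power below the baseline, which is the effect the abstract describes.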
Benchmark | Type | N | ΔN* | f* (GHz) | Performance boost (%) | Power saving (%) | Energy saving (%)
sp.E.529 | Memory-bound | 27 | 4 | 1.9 | 10.36 | 8.02 | 16.23
RandomAccess.512 | Memory-bound | 26 | 1 | 1.2 | 21.11 | 11.19 | 30.37
mg.E.512 | Memory-bound | 26 | 5 | 1.8 | 13.00 | 8.72 | 20.45
STREAM.512 | Memory-bound | 26 | 3 | 1.9 | 13.70 | 0.35 | 14.73
DGEMM.512 | Memory-bound | 26 | 4 | 2.2 | 10.24 | -1.21 | 8.24
cg.E.512 | Memory-bound | 26 | 3 | 1.9 | 3.14 | 11.18 | 13.25
bt.E.529 | Memory-bound | 27 | 2 | 2.0 | 17.31 | 11.20 | 24.50
lu.E.529 | Memory-bound | 27 | 0 | 2.6 | 0 | 0 | 0
ft.E.512 | CPU-bound | 26 | 0 | 2.6 | 0 | 0 | 0
FFT.512 | CPU-bound | 26 | 0 | 2.6 | 0 | 0 | 0
PTRANS.512 | CPU-bound | 26 | 0 | 2.6 | 0 | 0 | 0
ep.E.529 | CPU-bound | 27 | 0 | 2.6 | 0 | 0 | 0
Table 5 Algorithm results for about 30 compute nodes.
Fig. 6 Parallel time reduction due to adding one or more nodes.
Fig. 7 Breakdown of the percentage increase in power consumption when all processor frequencies are 2.6 GHz.
Fig. 8 Memory traffic of N nodes and (N+ΔN*) nodes.
Fig. 9 Performance improvement and power and energy savings of NAS benchmarks. The further from the centre, the better the improvement.
Benchmark | Type | N | ΔN* by ours | ΔN* by optimal setting | Frequency by ours (GHz) | Frequency by optimal setting (GHz)
sp | Memory-bound | 9 | 2 | 2 | 1.5 | 1.7
RandomAccess | Memory-bound | 8 | 1 | 1 | 1.8 | 1.8
mg | Memory-bound | 7 | 1 | 1 | 1.8 | 1.8
STREAM | Memory-bound | 8 | 1 | 1 | 1.8 | 1.9
DGEMM | Memory-bound | 8 | 2 | 2 | 1.2 | 1.4
cg | Memory-bound | 7 | 2 | 1 | 1.8 | 2.1
bt | Memory-bound | 9 | 1 | 1 | 2.3 | 2.4
lu | Memory-bound | 7 | 0 | 1 | 2.6 | 2.3
ft | CPU-bound | 7 | 0 | 0 | 2.6 | 2.6
FFT | CPU-bound | 8 | 0 | 0 | 2.6 | 2.6
PTRANS | CPU-bound | 8 | 0 | 0 | 2.6 | 2.6
ep | CPU-bound | 7 | 0 | 0 | 2.6 | 2.6
Table 6 Extra nodes ΔN* and CPU frequency f* given by our approach and the optimal settings.
Fig. 10 Performance improvement comparison of our approach and optimal setting for 10 nodes.
Fig. 11 Performance improvement comparison of our approach and optimal setting for 30 nodes.
Fig. 12 Power savings given by our approach and optimal settings for using 10 computing nodes.
Fig. 13 Power savings given by our approach and optimal settings for using 30 computing nodes.