More Bang for Your Buck: Boosting Performance with Capped Power Consumption
Received: 2020-03-25 Accepted: 2020-04-02
About authors
Juan Chen received the PhD degree from National University of Defense Technology, China in 2007. She is now an associate professor at National University of Defense Technology, China. Her research interests focus on supercomputer systems and energy-efficient software optimization methods. E-mail: email@example.com.
Xinxin Qi received the BS degree from Sun Yat-Sen University, China in 2019, and is now a master student at National University of Defense Technology. Her research interests include high performance computing and energy-efficient computing. E-mail: firstname.lastname@example.org.
Feihao Wu received the MS degree from National University of Defense Technology in 2018. His research interests include large-scale parallel numerical simulation and energy-efficient computing. E-mail: email@example.com.
Jianbin Fang is an assistant professor in computer science at NUDT. He obtained the PhD degree from Delft University of Technology in 2014. His research interests include parallel programming for many-cores, parallel compilers, performance modeling, and scalable algorithms. E-mail: firstname.lastname@example.org.
Yong Dong received the PhD degree from National University of Defense Technology, China in 2012. He is now an associate professor at National University of Defense Technology, China. His main research interests include supercomputer systems and storage systems. E-mail: email@example.com.
Yuan Yuan received the PhD degree from National University of Defense Technology, China in 2011. He is now an associate professor at National University of Defense Technology, China. His research interests include supercomputer systems and HPC monitoring and diagnosis. E-mail: firstname.lastname@example.org.
Zheng Wang received the PhD degree in computer science from the University of Edinburgh, UK in 2011. He is an associate professor at the University of Leeds. His research interests include the boundaries of parallel program optimisation, systems security, and applied machine learning. He received four best paper awards for his work on machine learning based compiler optimisation (PACT'10, CGO'17, PACT'17, and CGO'19). E-mail: email@example.com.
Keqin Li received the PhD degree in computer science from the University of Houston, USA in 1990. He is a SUNY distinguished professor of computer science at the State University of New York and a distinguished professor at Hunan University, China. His current research interests include cloud computing, fog computing and mobile edge computing, energy-efficient computing and communication, embedded systems and cyber-physical systems, heterogeneous computing systems, big data computing, high-performance computing, CPU-GPU hybrid and cooperative computing, computer architectures and systems, computer networking, machine learning, and intelligent and soft computing. He has published over 710 journal articles, book chapters, and refereed conference papers, and has received several best paper awards. E-mail: firstname.lastname@example.org.
Cite this article
Chen Juan, Qi Xinxin, Wu Feihao, Fang Jianbin, Dong Yong, Yuan Yuan, Wang Zheng, Li Keqin.
High-Performance Computing (HPC) systems' capability is increasingly constrained by their power consumption, and this constraint will worsen with the end of Dennard scaling[1, 2]. However, HPC users want higher performance to run more complex models or larger datasets. As such, there is a vital need to find ways to boost the efficiency of HPC applications without significantly increasing power consumption.
Power and performance optimization for HPC systems is certainly not a new research topic. There is considerable work on designing innovative compute architectures for better energy efficiency in HPC systems[3, 4, 5, 6, 7, 8, 9, 10, 11]. Other work utilizes a software-based resource scheduling approach, carefully determining computation resource settings, such as the number of assigned compute nodes and the processor frequency, to match the workload and improve application performance under a power constraint[12, 13]. A key advantage of a software-based approach is that it is readily deployable on existing hardware, as no hardware modification is required[14, 15].
The allocation of computing resources, such as computing nodes or Central Processing Unit (CPU) cores, is essential for power and performance optimization. Current resource allocation strategies aim at maximizing system utilization, i.e., no additional computing nodes are allocated unless the processor cores of existing nodes have been fully utilized. However, by using as few computing nodes as possible to limit overall power consumption, these techniques do not offer optimal performance for memory-bound applications. As we show in this paper, maximizing processor core usage can lead to serious memory contention, which in turn leads to sub-optimal performance. Because the memory sub-system is becoming a bottleneck of HPC systems[16, 17], we need a better resource allocation strategy that effectively tackles the memory contention problem to achieve higher performance.
This work aims at developing a new approach for HPC resource allocation, specifically targeting memory-bound data-parallel applications. Our insight is that instead of maximizing processor utilization, one can allocate additional computing nodes to reduce memory contention: doing so frees some processor cores (and reduces the number of parallel processes) on each node, which lowers memory contention and improves the performance of memory-bound workloads.
However, translating this high-level idea into a practical system is non-trivial. A key challenge is to determine the optimal number of computing nodes to be allocated to an application, for a given number of user-requested processes. If we allocate too few or too many computing nodes, we will either be unable to gain much, or the extra power consumption will outweigh the benefit. Also, our scheduling decision needs to ensure that additional computing nodes will not generate substantially more power consumption than the default approach. To overcome these challenges, we need a new technique that accurately models application behavior to drive a precise resource allocation strategy.
In this paper, we present a novel HPC resource allocation strategy, which aims to boost application performance without significantly compromising the power consumption budget for memory-bound workloads. We achieve this by first characterizing the application workload behavior based on offline profiling information (Section 3.1). The workload characteristics and profiling information are used to decide on how many computing nodes the target application will run. We then propose a new technique for modeling the subtle interactions among application performance, power consumption, compute nodes, memory bandwidth congestion, and CPU clock frequency. The model is then used to adjust the number of allocated computing nodes and the CPU frequency so that the hardware configuration fits the workload behavior. Specifically, if the target application is perceived to be a memory-bound workload, we reduce the CPU frequency to lower the system's power consumption, as the bottleneck is memory access.
We apply our approach to 12 representative Message Passing Interface (MPI) benchmarks from the NAS parallel benchmark and HPC Challenge (HPCC) benchmark suites. Our evaluation platform is an HPC cluster with 20 cores per node. Experimental results show that our approach achieves on average a 12.69% performance improvement over the conventional resource allocation strategy, but uses 7.06% less total power, which translates into 17.77% energy savings.
This paper makes the following technical contributions:
(1) We show that by allocating additional computing nodes with appropriate CPU frequency settings, one can boost application performance without increasing power consumption of the system for memory-bound workloads (Section 2.2).
(2) We present a novel technique for capturing interactions between workload characteristics, computation resource allocation, and CPU settings (Section 3.1).
(3) We present a novel algorithm for scheduling and configuring resource allocation at the application level, showing how performance can be enhanced without violating the overall power consumption constraint (Section 3.2).
2 Background and Overview
2.1 Problem scope
Our work tackles the question of resource allocation for distributed HPC workloads. We target typical HPC environments where the user job is submitted to a job queue. As part of the job submission, the user may specify the number of processes required. In this work, we are interested in developing a resource scheduler for mapping requested resources (i.e., the number of parallel processes) to compute nodes. A default strategy for this problem is to allocate the minimal number of compute nodes such that each parallel process runs on a physical core. However, as we will show in this paper, such an approach often leads to sub-optimal performance for memory-intensive applications. Our goal is to determine the optimal number of compute nodes and the processor clock frequency of each compute node to reduce the running time of memory-bound applications, while at the same time capping the peak power and energy consumption at the level of the default strategy.
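The default strategy above amounts to packing one parallel process per physical core; a minimal sketch of how the baseline node count follows from this (function name is ours, not the paper's):

```python
import math

def baseline_nodes(num_processes: int, cores_per_node: int) -> int:
    """Default strategy: pack one parallel process per physical core,
    using the minimum number of compute nodes."""
    return math.ceil(num_processes / cores_per_node)

# With 20-core nodes (as in Section 4.1):
print(baseline_nodes(169, 20))  # bt with 169 processes -> 9 nodes
print(baseline_nodes(128, 20))  # cg with 128 processes -> 7 nodes
```

These two values match the allocations used as the baseline in the motivating example of Section 2.2.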
2.2 Motivating example
Consider scheduling bt and cg from the NAS MPI parallel benchmark suite on an HPC cluster where each computing node has 20 processor cores (see Section 4.1). In this example, we assume that a user requests to run bt and cg with 169 and 128 processes, respectively, and both benchmarks run with the class D input.
Figure 1 shows how the number of compute nodes affects application performance. The baseline here is to maximize the processor utilization, i.e., using the minimum number of compute nodes where each parallel process runs on a physical processor core. With this baseline strategy, on our evaluation platform, we would allocate 9 and 7 compute nodes respectively to bt and cg. In this example, we experimentally allocate one and two extra compute nodes to the applications but keep the number of parallel processes unchanged (which will be evenly distributed across compute nodes).
Figure 3 compares the total peak power under different numbers of nodes and different frequencies. The red dashed line denotes the target total power value of 2.6 GHz under
To summarize, the running examples have shown that it is possible to achieve better performance, without increasing power consumption and energy usage, by using additional compute nodes. However, the optimal number of compute nodes and the CPU frequency setting depend on both user-specified resource requirements and workload characteristics. In the remainder of this paper, we show how an adaptive resource scheduler can be developed to perform compute resource allocation that boosts application performance without increasing the total power and energy consumption.
2.3 Overview of our approach
Our approach is implemented on an HPC cluster as part of the central job scheduler. It is completely automated, and no modification to the application source code is required. To determine the compute resources to use for a job that is ready to run, our approach takes as input the program binary, the input data, and the user-supplied requirement (the number of parallel processes in our case). It then determines the number of compute nodes to be provided to the job and the CPU clock frequency to be used across the allocated machines.
To determine the optimal hardware resource allocation and frequency settings, we first profile the application as part of the resource allocation strategy to capture its memory and CPU characteristics (Section 3.1). Specifically, VTune is used to collect the memory trace and performance metrics of the target application from a single node, and we use the Intel Running Average Power Limit (RAPL) interface to take power measurements from all computing nodes. Then, the profiling information is used to estimate the performance gain and energy consumption under different allocation strategies to find the optimal setting (Section 3.1). Finally, we actively analyze the instantaneous power readings to dynamically adjust the CPU frequency to cap the peak power of each computing node (Section 3.2).
3 Our Approach
The heart of our approach is a set of functions for modeling how the resource allocation and frequency setting affect the application’s performance and power consumption.
3.1 Power and performance modeling
The overall goal of our power-performance model is to improve the program performance without increasing the total power consumption. It achieves this by finding the relationship of energy consumption of allocated computing nodes, CPU frequencies, and parallel execution time. The application execution time,
Table 1 lists all the parameters used by our model and their descriptions.
|Number of processor cores on each computing node|
|Number of processes|
|Number of assigned computing nodes with the default resource allocation strategy|
|Number of computing nodes to be increased|
|Memory traffic of computing node |
|Average value of |
|Maximum value of |
|Physical memory bandwidth on a single computing node|
|Threshold for the ratio of memory traffic to memory bandwidth|
|Maximum CPU clock frequency|
|Minimum CPU clock frequency|
|Processor clock frequency, which satisfies |
|Power consumption of one CPU core with clock frequency |
|Power consumption for each CPU core in the idle state, which equals |
|Power consumption of DRAM on one computing node|
|Power consumption of components other than the CPU and DRAM on one computing node|
Suppose we allocate
We use the maximum value of
However, Eq. (
where bound denotes the degree of memory contention. We use VTune in this work to count the bound value for each program, but other profiling tools can also be used. For example, running sp with 169 processes on 9 of our computing nodes gives a bound value of 0.401. As such, the performance improvement,
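As a concrete illustration of this classification step, a memory-boundedness check might look as follows; the threshold value of 0.4 is our assumption for illustration (chosen so that sp's measured bound of 0.401 qualifies), not a value stated by the model:

```python
def is_memory_bound(bound: float, threshold: float = 0.4) -> bool:
    """Treat a program as memory-bound when the profiled memory-bound
    metric (e.g., from VTune) exceeds a threshold (0.4 is an assumed
    illustrative cut-off, not the paper's exact value)."""
    return bound > threshold

# sp with 169 processes on 9 nodes has a measured bound of 0.401:
print(is_memory_bound(0.401))  # -> True
```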
For memory-bound programs, the increased memory traffic per node after adding extra computing nodes can be calculated as
The maximum number of extra computing nodes,
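Assuming an application's aggregate memory traffic redistributes evenly across nodes, this bandwidth-side bound on the number of extra nodes can be sketched as follows (a simplified model with our own names; the paper's exact equations are not reproduced here):

```python
import math

def per_node_traffic(total_traffic: float, n_nodes: int) -> float:
    """Memory traffic per node, assuming even redistribution."""
    return total_traffic / n_nodes

def max_useful_extra_nodes(total_traffic: float, bandwidth: float,
                           threshold: float, n_default: int) -> int:
    """Smallest number of extra nodes that brings per-node traffic
    below threshold * physical bandwidth (illustrative assumption)."""
    needed = math.ceil(total_traffic / (threshold * bandwidth))
    return max(needed - n_default, 0)

# Illustrative numbers: 900 GB/s aggregate traffic, 120 GB/s per-node
# bandwidth, a 0.9 threshold, and 7 default nodes:
print(max_useful_extra_nodes(900, 120, 0.9, 7))  # -> 2
```

Beyond this point, per-node traffic already sits under the threshold, so additional nodes yield no further contention relief.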
Since current multi-core designs often do not provide voltage scaling on a per-core basis, using extra computing nodes will increase energy consumption, even if some of the cores are not utilized, unless we scale down the CPU frequency. Recall that our goal is to ensure the total power consumption of (
Here, the left term of Formula (
Using Formula (
Taking the power consumption into consideration, the number of extra computing nodes to use needs to fall into the range of [0,
Considering both the power and performance constraints, the additional profitable computing nodes
To maximize the performance boost, we take the maximum number of this interval in Formula (
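Taking the maximum of the intersected interval can be sketched as follows; the names are ours, and the two upper bounds stand for the performance-side and power-side limits derived above:

```python
def profitable_extra_nodes(n_perf_max: int, n_power_max: int) -> int:
    """The performance-profitable range [0, n_perf_max] intersected
    with the power-feasible range [0, n_power_max] is [0, min(...)];
    to maximize the performance boost we take its upper end."""
    return min(n_perf_max, n_power_max)

# e.g., bandwidth analysis allows 3 extra nodes but the power budget
# only permits 2:
print(profitable_extra_nodes(3, 2))  # -> 2
```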
In Formula (
Let the initial optimal processor frequency
The adjustable frequency range for a parallel program running on our supercomputer system is
Besides the optimal
3.2 Resource allocation
Our resource allocation scheme is described in Algorithm 1. This algorithm obtains nearly optimal values for a given program,
The algorithm’s input parameters include the maximum/minimum CPU frequencies
Our resource allocation scheme works as follows:
Firstly, we measure all the platform-related parameters, including all power values in Table 2, under different frequency levels and
Secondly, we run the program and obtain the corresponding parameters from performance and power profile data. Performance-related profile data for the given program include
Characterizing applications via profiling does not restrict our approach, as most scientific computing applications run many times. Even though the profile-based approach spends a great deal of time on data profiling, all later runs benefit from it.
Finally, we get the target power
According to this algorithm, to obtain a better frequency setting, we run the program again with real-time power limiting. We use RAPL to ensure that the average power does not surpass
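The workflow above can be sketched end-to-end as follows; every name, threshold, and formula here is an illustrative assumption rather than the paper's exact Algorithm 1, and the power model is deliberately simplified (it ignores DRAM, idle-core, and other component power):

```python
import math

def allocate_resources(procs, cores_per_node, bound, total_traffic,
                       bandwidth, threshold, f_max, f_min, f_step,
                       per_core_power, n_power_max):
    """Sketch of the profile-driven allocation workflow: choose the
    node count and CPU frequency under the default power budget."""
    n_default = math.ceil(procs / cores_per_node)
    if bound <= 0.4:                       # assumed memory-bound cut-off
        return n_default, f_max           # CPU-bound: keep the default
    # Bandwidth side: extra nodes needed so per-node traffic drops
    # below threshold * physical bandwidth.
    n_perf = max(math.ceil(total_traffic / (threshold * bandwidth)) - n_default, 0)
    n_extra = min(n_perf, n_power_max)    # intersect with power-side bound
    # Highest frequency whose total CPU power fits the default budget.
    budget = n_default * cores_per_node * per_core_power(f_max)
    f = f_max
    while f > f_min and (n_default + n_extra) * cores_per_node * per_core_power(f) > budget:
        f = round(f - f_step, 1)
    return n_default + n_extra, f
```

For instance, with a hypothetical linear per-core power model p(f) = 5 + 10f W, 169 processes on 20-core nodes, a bound of 0.5, aggregate traffic of 1080 units against a 120-unit per-node bandwidth with threshold 0.9, and at most 2 power-feasible extra nodes, the sketch allocates 10 nodes at 2.2 GHz.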
4 Experimental Setup
4.1 Platform and benchmarks
Hardware. We evaluate our approach with 64 computing nodes on an HPC cluster. Each computing node has 64 GB DDR4 RAM and two Intel Haswell 10-core E5-2660v3 processors running at 2.6 GHz. The multi-core processor supports Dynamic Voltage and Frequency Scaling (DVFS) with 15 states, from 1.2 GHz to 2.6 GHz at a step of 0.1 GHz. The Thermal Design Power (TDP) of this processor is 105 W. Each computing node supports RAPL for power measurement. We disable hyperthreading to obtain stable performance.
Software. Each computing node runs CentOS 7.4 with Linux kernel 3.10. We rely on the local Operating System (OS) to schedule processes and do not bind tasks to specific cores. All benchmarks are compiled with gcc 4.8.5 using the "-O3" option and run with Open MPI 4.0.0.
4.2 Evaluation methodology
Memory tracing. We use VTune to collect the memory trace with a sample rate of one second by running the program on the master node of a cluster. We take the weighted average of the memory histogram produced by VTune as the memory traffic value, i.e.,
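The weighted average just described can be computed as follows; the histogram layout, value/count pairs, is our assumption about how VTune's bandwidth histogram is consumed:

```python
def weighted_memory_traffic(histogram):
    """Sample-weighted average of a memory-bandwidth histogram given
    as (bin_value, sample_count) pairs; the result is taken as the
    program's memory traffic value."""
    total = sum(count for _, count in histogram)
    return sum(value * count for value, count in histogram) / total

# e.g., bandwidth bins in GB/s with observed sample counts
# (illustrative numbers only):
print(weighted_memory_traffic([(10, 2), (30, 5), (50, 3)]))  # -> 32.0
```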
Energy measurement. We use powergov to measure processor and memory power consumption. Powergov uses RAPL to model the power consumption of the processor and memory. We use mlc to measure memory bandwidth and memory access delay under various memory traffic conditions. We measure the CPU power consumption under different frequency settings at a per-core level. Table 4 gives the measured per-core power consumption under different CPU frequencies. For memory power consumption, our experiments on real hardware show that it stays constant as the CPU frequency changes. Power consumption of the remaining system components, i.e.,
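As an aside on how such measurements can be obtained programmatically: on Linux, RAPL exposes cumulative energy counters through the powercap sysfs interface, which tools such as powergov build upon. A hedged sketch of deriving average power from two counter reads (the sysfs path is the usual location and may differ per system):

```python
def read_energy_uj(path="/sys/class/powercap/intel-rapl:0/energy_uj"):
    """Read a cumulative RAPL package-energy counter in microjoules
    from the Linux powercap interface."""
    with open(path) as f:
        return int(f.read().strip())

def average_power_watts(e_start_uj, e_end_uj, interval_s):
    """Average power over an interval from two counter reads,
    ignoring counter wrap-around for brevity."""
    return (e_end_uj - e_start_uj) / 1e6 / interval_s

# Two reads taken 1 s apart that differ by 35,000,000 uJ -> 35 W:
print(average_power_watts(0, 35_000_000, 1.0))  # -> 35.0
```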
Performance report. We run each model on each input to collect performance and power profile data until the 95% confidence bound per model per input is less than 5%. On average, we run each benchmark three times for each evaluation setting and remove obvious outliers. We then report the average performance across the runs.
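This stopping rule can be sketched as follows; we use a normal approximation (z = 1.96) for the 95% interval, which is our assumption since the exact method for computing the bound is not stated:

```python
import statistics

def relative_ci_halfwidth(samples, z=1.96):
    """Half-width of the ~95% confidence interval (normal
    approximation) relative to the sample mean; data collection stops
    once this drops below 0.05."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return z * sem / mean

runs = [10.2, 10.0, 10.1]            # illustrative run times in seconds
print(relative_ci_halfwidth(runs) < 0.05)  # stop collecting when True
```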
5 Experimental Results
5.1 Overall result
Table 3 shows the performance improvement over the default resource allocation strategy when using our approach. It also gives the number of compute nodes and CPU frequency settings given by our analysis. Here, column "
Overall, our approach significantly improves the performance of memory-bound applications. It achieves an average 9.44% improvement for the seven memory-bound benchmarks while using 2.17% less total power. This translates to an average 10.98% reduction in energy consumption. Our approach improves the performance of most memory-bound applications (by up to 18.82%) without increasing power consumption.
We also evaluated our approach on a larger dataset with a larger number of parallel processes using 30 computing nodes. The results are given in Table 5, where our approach delivers a performance improvement of up to 12.69% for the seven memory-bound benchmarks, with average savings of 7.06% in power and 17.77% in energy consumption.
5.2 Impact of using more nodes
Figure 6 shows that using more nodes to run programs can reduce execution time when the CPU frequency is fixed at 2.6 GHz. The performance improvement depends on the degree of memory contention: the higher the memory contention, the greater the improvement that can be achieved by adding nodes. However, using more nodes increases the total power consumption.
Figure 7 shows the details of the increase of power consumption in percentage when all processor frequencies are 2.6 GHz.
5.3 Bandwidth comparison
We compare the sum of memory traffic on
5.4 Result using power limit
After calculating the optimal number of compute nodes
Improvement of performance, and power and energy savings of NAS benchmarks. The further away from the centre, the better the improvement.
Figure 9 shows the performance, power, and energy consumption changes of seven applications under three scenarios (Case 0, Case 1, and Case 2) compared with the default resource allocation strategy. The further away from the center, the better the performance improvement, power savings, and energy savings of an application. Combining these three graphs, we find that, compared with Case 1, Case 2 improves performance the most, but it causes the worst power consumption, which runs counter to the goal of this paper. Our purpose is to improve application performance under power constraints; improving performance at the cost of a large increase in power consumption is not what we advocate. Unlike Case 2, Case 1 strictly complies with the power constraints, with the reduction in power consumption tending to zero, and thereby maximizes application performance within the power constraints.
Cases 0 and 1 are the recommended methods. Both use the resource allocation strategy proposed in this paper to find the optimal number of compute nodes and the frequency. The only difference is that RAPL-based power capping is applied in Case 1, which achieves a higher performance improvement. It is worth emphasizing that both Cases 0 and 1 apply the approach proposed in this paper, which shows that the algorithm is effective and necessary.
5.5 Using up all the cores of
Because our approach adds
5.6 Compared to the optimal performance
In this experiment, we enumerate all
Performance improvement comparison of our approach and optimal setting for 10 nodes.
Performance improvement comparison of our approach and optimal setting for 30 nodes.
The power savings of our approach and of the optimal setting found by exhaustive search are shown in Figs. 12 and 13. The difference between them across the 12 benchmarks for 10 nodes is 0.9% on average. With our approach, mg and DGEMM obtain more power savings than with the optimal setting, because the processor frequency chosen by our approach is slightly lower than that of the optimal setting.
Power savings given by our approach and optimal settings for using 10 computing nodes.
Power savings given by our approach and optimal settings for using 30 computing nodes.
6 Related Work
Energy consumption has become one of the most important concerns in computing systems, and in HPC in particular. Several researchers have developed techniques and systems to save energy with a slight increase in execution time. Reference  used Duty Cycle Modulation technology to save energy in MPI applications, and Ref.  used DVFS technology to save energy in OpenMP applications. Reference  found that combining DVFS and Duty Cycle Modulation can yield more energy savings. These approaches save processor energy by scaling down the processor clock frequency with a modest increase in execution time; the size of the increase depends on the accuracy of processor idle time prediction. References [25, 26] used Near Threshold Computing (NTC) to save processor energy. Besides processor energy consumption, Refs. [27, 28] focused on reducing memory energy consumption.
Besides low-power techniques, power-constrained problems for compute nodes have also received attention. References [29, 30] allocated power between CPU and memory to improve performance under power limits. The main idea is that the power demands of the processor and memory differ across applications; according to an application's characteristics, these approaches allocate power to CPU and memory to satisfy its demands for performance and power. Furthermore, Refs. [31, 32] focused on clusters: when the power of a cluster is limited, they first set the number of active nodes according to the application's scalability, then allocate power to the compute nodes, and finally allocate power between processor and memory within each node, thereby improving application performance under power constraints.
Processor overclocking has been used to improve energy efficiency. For some applications, Ref.  found that using turbo technology can achieve better performance under power constraints. It adjusts the clock speed of each socket (including the turbo frequency), the cores used per socket, hyperthreading, the number of sockets in use, and the number of memory controllers in use to improve performance. Reference  found that F-overclocking technology can achieve greater energy efficiency than DVFS, low-voltage technology, and the baseline. Reference  also found that Turbo Boost technology can enhance an application's energy efficiency. This prior work is thus complementary to ours and can be used to control the CPU frequency when adding extra nodes.
7 Conclusion
This paper has presented a novel resource allocation scheme for HPC workloads, specifically targeting memory-bound data-parallel applications. Our approach exploits a key observation: by reducing the number of parallel processes on a single host, one can reduce memory contention and thereby improve performance. Unlike prior work that aims to maximize system utilization, our approach judiciously allocates additional computing nodes to run fewer parallel processes per node. Furthermore, to cap the total power consumption, our approach automatically determines the best CPU frequency to match CPU performance with memory throughput. We propose a set of analytical models to estimate the profitability of using additional compute nodes based on profiling information. We evaluate our approach by applying it to 12 MPI benchmarks on a high-performance cluster. Experimental results show that our approach improves the performance of seven memory-bound applications by 12.69% on average while using 7.06% less overall power, which translates into 17.77% energy savings compared to the default resource allocation strategy.
[1] Design of ion-implanted MOSFET's with very small physical dimensions.
[2] A 30 year retrospective on Dennard's MOSFET scaling paper.
[3] Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction.
[4] Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling.
[5] Heterogeneous chip multiprocessors.
[6] Energy conservation in heterogeneous server clusters.
[7] Performance, energy, and thermal considerations for SMT and CMP architectures.
[8] Composite cores: Pushing heterogeneity into a core.
[9] Hierarchical power management for asymmetric multi-core in dark silicon era.
[10] Optimizing energy efficiency of 3-D multicore systems with stacked DRAM under power and thermal constraints.
[11] The Yin and Yang of power and performance for asymmetric hardware and managed software.
[12] Power tuning HPC jobs on power-constrained systems.
[13] Practical resource management in power-constrained, high performance computing.
[14] An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget.
[15] Energy efficiency on multi-core architectures with multiple voltage islands.
[16] Roofline: An insightful visual performance model for multicore architectures.
[17] The Landscape of Parallel Computing Research: A View from Berkeley, Electrical Engineering and Computer Sciences, Tech. Rep. UCB/EECS-2006-183.
[18] The HPC Challenge (HPCC) benchmark suite.
[19] Intel® VTune™ Amplifier.
[20] Intel® Power Governor.
[21] Intel® Memory Latency Checker v3.8.
[22] Adagio: Making DVS practical for complex HPC applications.
[23] Using per-loop CPU clock modulation for energy efficiency in OpenMP applications.
[24] An adaptive core-specific runtime for energy efficiency.
[25] Variation-aware voltage island formation for power efficient near-threshold manycore architectures.
[26] EnergySmart: Toward energy-efficient manycores for near-threshold computing.
[27] Energy-performance trade-offs on energy-constrained devices with multi-component DVFS.
[28] DReAM: An approach to estimate per-task DRAM energy in multicore systems.
[29] Predicting optimal power allocation for CPU and DRAM domains.
[30] Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques.
[31] CLIP: Cluster-level intelligent power coordination for power-bounded systems.
[32] Exploring hardware overprovisioning in power-constrained, high performance computing.
[33] Dynamic management of TurboMode in modern multi-core chips.
[34] Leveraging process variation for performance and energy: In the perspective of overclocking.
Copyright © Editorial Office of Tsinghua Science and Technology.