Tsinghua Science and Technology  2019, Vol. 24 Issue (06): 677-693    doi: 10.26599/TST.2018.9010103
 REGULAR ARTICLES
Deep Model Compression for Mobile Platforms: A Survey
Kaiming Nan, Sicong Liu, Junzhao Du, Hui Liu*
∙ Kaiming Nan and Sicong Liu are with School of Computer Science and Technology, Xidian University, Xi’an 710071, China. E-mail: nankaiming@stu.xidian.edu.cn; liusc@stu.xidian.edu.cn.
∙ Junzhao Du and Hui Liu are with School of Software and Institute of Software Engineering, Xidian University, Xi’an 710071, China. E-mail: dujz@xidian.edu.cn.

Abstract

Despite the rapid development of mobile and embedded hardware, directly executing computation-expensive and storage-intensive deep learning algorithms locally on these devices for sensory data analysis remains constrained. In this paper, we first summarize the layer compression techniques for state-of-the-art deep learning models in three categories: weight factorization and pruning, convolution decomposition, and special layer architecture design. For each category, we quantify the storage and computation savings the techniques make tunable and discuss their practical challenges and possible improvements. Then, we implement Android projects using TensorFlow Mobile to test these 10 compression methods and compare their practical performance in terms of accuracy, parameter size, intermediate feature size, computation, processing latency, and energy consumption. To further discuss their advantages and bottlenecks, we test their performance on four standard recognition tasks across six resource-constrained Android smartphones. Finally, we survey two types of run-time Neural Network (NN) compression techniques that are orthogonal to the layer compression techniques: run-time resource management and cost optimization with special NN architectures.
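To make the weight-factorization category concrete, the following is a minimal sketch (not the paper's exact implementation) of the low-rank idea behind techniques such as Tech1: a dense fully-connected weight matrix W of size m x n is approximated by the product of two thin matrices obtained from a truncated SVD, reducing the parameter count from m*n to k*(m+n). The sizes m, n, and the rank k here are illustrative choices, not values from the paper.

```python
import numpy as np

# Illustrative low-rank weight factorization of an FC layer.
rng = np.random.default_rng(0)
m, n, k = 256, 512, 32                  # layer shape and target rank (assumed)
W = rng.standard_normal((m, n))         # stand-in for a trained weight matrix

# Truncated SVD: keep the k largest singular components.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k] * s[:k]                  # absorb singular values into the left factor
V_k = Vt[:k, :]
W_approx = U_k @ V_k                    # rank-k approximation of W

# The FC layer x @ W is replaced by (x @ U_k) @ V_k, shrinking storage and MACs.
params_before = m * n                   # 131 072 parameters
params_after = m * k + k * n            # 24 576 parameters at rank 32
print(params_before, params_after)
```

The same rank k acts as the accuracy/cost knob studied in the paper's hyper-parameter experiments: a smaller k cuts parameter size and MAC count further but increases the approximation error of W.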

Received: 31 January 2018      Published: 20 June 2019
Corresponding Authors: Hui Liu
Table 1 Comparison and quantification of layer compression techniques for DNNs.
Fig. 1 Weight factorization by using Tech1 or Tech2.
Fig. 2 Example of direct sparse convolution computation (Tech5).
Fig. 3 An example of Tech9.
Fig. 4 Novel combination of compression techniques in LeNet.
Fig. 5 Novel combination of compression techniques in AlexNet.
Table 2 Recognition tasks, datasets, and models.
Table 3 Resource constraints study on six resource-constrained Android platforms.
Fig. 6 Impact of various compression hyper-parameters on the metrics for different techniques: (a) different k values on Tech1 at FC layers, (b) different k values on Tech1 at CONV layers, (c) different k values on Tech2 at FC layers, (d) different γ values on Tech6 at CONV layers, and (e) different θ values on Tech7 at CONV layers. In this figure, A means accuracy, Sp is parameter size, Sf is intermediate feature size, T stands for latency, and E and C are energy cost and MAC number, respectively. The x-axis shows different parameter setups, and the y-axis is the accuracy/cost ratio of compressed layers to original layers.
Fig. 7 Comparison of the impact of different compression techniques on certain layers: (a) LeNet model and (b) AlexNet model in terms of accuracy A, parameter size Sp, intermediate feature size Sf, latency T, energy cost E, and MAC computation C. Tech1 is used for FC1 and CONV2; Tech2 and Tech3 act on FC1; Tech4, Tech5, Tech6, Tech7, Tech8, and Tech9 target CONV2; and Tech10 is used for all FC layers. The y-axis shows the cost ratio of compressed layers to the original layers.
Fig. 8 Comparison of the impact of different compression techniques on the whole model: (a) LeNet model and (b) AlexNet model. Tech1 is used for FC1 and CONV2; Tech2 and Tech3 act on FC1; Tech4, Tech5, Tech6, Tech7, Tech8, and Tech9 target CONV2; and Tech10 is used for all FC layers. The y-axis shows the overhead of the compressed model compared with the original model.
Table 4 Performance of compression techniques on LeNet + MNIST and AlexNet + CIFAR-10, as evaluated on a RedMi 3S phone (Device1).
Table 5 Performance of different recognition tasks that choose the best compression techniques with Eq. (4).
Table 6 Performance of NN compression techniques on different resource-constrained Android devices.