Tsinghua Science and Technology  2019, Vol. 24 Issue (06): 663-676    doi: 10.26599/TST.2018.9010100
SPECIAL SECTION ON COGNITIVE SYSTEMS AND COMPUTATION     
Deep Learning Based 2D Human Pose Estimation: A Survey
Qi Dang, Jianqin Yin*, Bin Wang, Wenqing Zheng
∙ Qi Dang and Jianqin Yin are with Automation School, Beijing University of Posts and Telecommunications, Beijing 100876, China. E-mail: dangqi213@163.com.
∙ Qi Dang and Bin Wang are with State Key Lab. of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China. E-mail: wangbinth@tsinghua.edu.cn.
∙ Wenqing Zheng is with School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China. E-mail: zhengwenqing@bupt.edu.cn.

Abstract  

Human pose estimation has received significant attention in recent years because of its wide range of real-world applications. Since deep learning has substantially improved the performance of state-of-the-art methods, this paper presents a comprehensive survey of deep learning based human pose estimation methods and analyzes the methodologies they employ. We summarize and discuss recent works under a methodology-based taxonomy. Single-person and multi-person pipelines are first reviewed separately; the deep learning techniques applied in these pipelines are then compared and analyzed. The datasets and metrics used for this task are also discussed and compared. The aim of this survey is to make every step in the estimation pipelines interpretable and to provide readers with a readily comprehensible explanation. Finally, the unsolved problems and challenges for future research are discussed.



Keywords: human pose estimation; deep learning; computer vision
Received: 05 March 2018      Published: 20 June 2019
Corresponding Author: Jianqin Yin
About author:

Wenqing Zheng is an undergraduate at Beijing University of Posts and Telecommunications.

Cite this article:

Qi Dang, Jianqin Yin, Bin Wang, Wenqing Zheng. Deep Learning Based 2D Human Pose Estimation: A Survey. Tsinghua Science and Technology, 2019, 24(06): 663-676.

URL:

http://tst.tsinghuajournals.com/10.26599/TST.2018.9010100 or http://tst.tsinghuajournals.com/Y2019/V24/I06/663

Fig. 1 Examples of pose estimation results. (a) Single person pose estimation results from Ref. [5]. (b) Multi-person pose estimation results from Ref. [3].
10].">
Fig. 2 Example of HOG features for keypoint detection[10].
Fig. 3 Taxonomy of this review.
24].">
Fig. 4 Example of stickman annotations[24].
Fig. 5 Framework of the pipeline for single-person pose estimation. (a) Heatmap-based framework: two steps, first generating heatmaps and then regressing keypoints from them. (b) One-step framework: human keypoints are regressed directly in a single step.
Fig. 6 An example of the heatmap-based single-person pipeline. (a) Original image, (b) heatmap generated by the estimator, and (c) detection result.
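To make the heatmap representation in Fig. 6 concrete, the following is a minimal sketch, not taken from any surveyed method: a keypoint is encoded as a 2D Gaussian peak and decoded by taking the argmax of the map. The heatmap size and sigma are illustrative assumptions.

```python
import numpy as np

def make_heatmap(x, y, height=64, width=64, sigma=2.0):
    """Render one keypoint (x, y) as a 2D Gaussian heatmap."""
    xs = np.arange(width)[None, :]   # shape (1, W)
    ys = np.arange(height)[:, None]  # shape (H, 1)
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def decode_heatmap(heatmap):
    """Recover the keypoint as the location of the maximum response."""
    idx = np.argmax(heatmap)
    y, x = np.unravel_index(idx, heatmap.shape)
    return x, y

heatmap = make_heatmap(20, 30)
print(decode_heatmap(heatmap))  # -> (20, 30)
```

In practice one such map is predicted per keypoint type, and the loss is computed between predicted and rendered target maps rather than on coordinates directly.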
Framework               | Advantage                                             | Disadvantage
Direct regression based | Quick and direct; trained in an end-to-end fashion.   | The mapping is difficult to learn.
                        | Can be applied to 3D scenarios without many changes.  | Cannot be applied to the multi-person case.
Heatmap-based           | Easy to visualize.                                    | High memory consumption to obtain high-resolution heatmaps.
                        | Can be applied to complicated cases.                  | Hard to extend to 3D scenarios.
Table 1 Comparison between the direct regression based framework and the heatmap-based framework.
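The memory trade-off noted in Table 1 can be illustrated with a small sketch; the keypoint count and heatmap resolution below are assumptions chosen for illustration, not values from the surveyed papers.

```python
import numpy as np

K, H, W = 16, 64, 48  # assumed keypoint count and heatmap resolution

# Direct regression head: one (x, y) pair per keypoint, a 2K-dim vector.
regression_output = np.zeros(2 * K, dtype=np.float32)

# Heatmap head: one H x W score map per keypoint.
heatmap_output = np.zeros((K, H, W), dtype=np.float32)

# The gap behind the "high memory consumption" entry in Table 1:
print(regression_output.nbytes)  # 128 bytes
print(heatmap_output.nbytes)     # 196 608 bytes, and it grows with resolution
```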
Fig. 7 Framework of the top-down pipeline.
Fig. 8 An illustration of the top-down pipeline. (a) Input image, (b) two persons detected by the human detector, (c) cropped single-person image, (d) single-person pose detection result, and (e) multi-person pose detection result.
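The top-down pipeline of Figs. 7 and 8 can be sketched in a few lines. Here `detect_humans` and `estimate_single_pose` are hypothetical placeholders for a human detector (e.g., Faster R-CNN [56]) and a single-person estimator; they are not functions from any specific library.

```python
def top_down_pose_estimation(image, detect_humans, estimate_single_pose):
    """Run a detector first, then a single-person estimator on each crop."""
    poses = []
    for box in detect_humans(image):            # (a) -> (b): person boxes
        x0, y0, x1, y1 = box
        crop = image[y0:y1, x0:x1]              # (b) -> (c): crop one person
        keypoints = estimate_single_pose(crop)  # (c) -> (d): per-crop pose
        # (d) -> (e): map crop coordinates back to the full image.
        poses.append([(x + x0, y + y0) for (x, y) in keypoints])
    return poses
```

This structure also explains Fig. 9: the quality of the keypoints depends directly on the quality of the boxes the detector provides.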
Fig. 9 Relationship between human detection mAP and keypoint mAP in Ref. [39]. The data in this chart are from the slides presented by the authors.
Fig. 10 Framework of the bottom-up pipeline.
Fig. 11 An illustration of the bottom-up pipeline. (a) Input image, (b) keypoints of all persons, and (c) all detected keypoints connected to form human instances.
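The grouping step of Fig. 11 can be sketched as greedy matching over pairwise affinity scores. This is a deliberate simplification of how methods such as part affinity fields [3] or associative embeddings [51] associate keypoints; `affinity` is a hypothetical scoring function standing in for those mechanisms.

```python
def group_keypoints(candidates, limbs, affinity):
    """candidates: {joint_type: [(x, y), ...]}; limbs: [(type_a, type_b)]."""
    connections = []
    for type_a, type_b in limbs:
        # Score every candidate pair for this limb type.
        pairs = [(affinity(a, b), a, b)
                 for a in candidates[type_a]
                 for b in candidates[type_b]]
        used_a, used_b = set(), set()
        for score, a, b in sorted(pairs, reverse=True):  # best matches first
            if a not in used_a and b not in used_b:
                connections.append((type_a, a, type_b, b))
                used_a.add(a)
                used_b.add(b)
    return connections  # limb connections, later assembled into skeletons
```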
Key procedure      | Approach                                                                                                                        | Single | Top-down | Bottom-up
Data augmentation  | Traditional: cropping, rotating, scaling, and horizontal flipping                                                               |   ✓    |    ✓     |    ✓
                   | Using unlabeled data: data distillation                                                                                         |        |    ✓     |    ✓
Data preprocessing | Resizing without distortion                                                                                                     |   ✓    |    ✓     |    ✓
Network design     | Hole algorithm, upsampling, output stride <= 8, skip connections, a big effective receptive field, and automatic architecture search |   ✓    |    ✓     |    ✓
Post-processing    | Detection NMS                                                                                                                   |        |    ✓     |
                   | Skeleton NMS                                                                                                                    |        |    ✓     |    ✓
Table 2 A list of key procedures.
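Two of the procedures in Table 2 can be sketched directly: aspect-preserving resizing with padding ("resize without distortion") and horizontal flipping with left/right keypoint swapping, so that, e.g., a left-wrist label still points at a left wrist after mirroring. The sketch assumes (H, W, C) numpy images and is illustrative rather than the implementation of any surveyed method.

```python
import numpy as np

def resize_without_distortion(image, target_h, target_w):
    """Scale by the smaller ratio and pad, instead of stretching."""
    h, w = image.shape[:2]
    scale = min(target_h / h, target_w / w)
    new_h, new_w = int(h * scale), int(w * scale)
    # Nearest-neighbor resize via indexing keeps the sketch dependency-free.
    rows = (np.arange(new_h) / scale).astype(int)
    cols = (np.arange(new_w) / scale).astype(int)
    resized = image[rows][:, cols]
    padded = np.zeros((target_h, target_w, image.shape[2]), image.dtype)
    padded[:new_h, :new_w] = resized
    return padded, scale

def flip_horizontally(image, keypoints, swap_pairs):
    """Mirror the image and keypoints; swap_pairs lists (left, right) ids."""
    w = image.shape[1]
    flipped = image[:, ::-1]
    kps = [(w - 1 - x, y) for (x, y) in keypoints]
    for left, right in swap_pairs:
        kps[left], kps[right] = kps[right], kps[left]
    return flipped, kps
```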
Year | Dataset | Size | Type | Num | Description
2008 | Buffy [24] | 472 training frames, 276 testing frames | Upper body | 6 body parts | Data are from a TV show. Line segments are provided to indicate position; the size and orientation of body parts are also provided. Only one person is annotated in each image.
2010 | LSP [57] | 1000 training images, 1000 testing images | Full body | 14 keypoints | Data are from Flickr, collected with sport category tags. Images are scaled. Only one person is annotated in each image.
2013 | FLIC [34] | 3987 training images, 1016 testing images | Upper body | 10 keypoints | Data are from Hollywood movies. Images in which the person is occluded or severely non-frontal were removed.
2014 | Parse [58] | 100 training images, 205 testing images | Full body | 14 keypoints | A small dataset with extended annotations, including facial expression, gaze direction, and gender.
2014 | MPII Human Pose [59] | 410 activities, 25 000 images | Full body | 16 keypoints | Data are from YouTube videos. The dataset covers 410 human activities, and each image is provided with an activity label.
2014 | Poses in the Wild [60] | 30 sequences, 900 frames | Upper body | 5 keypoints | 30 video sequences extracted from 3 Hollywood movies.
2014 | MSCOCO [61] | 115 000 training images, 5000 validation images, 20 000 test-dev images, 20 000 test-challenge images | Full body | 17 keypoints | Data are from the Internet and contain diverse activities.
2017 | AI Challenger [62] | 210 000 training images, 30 000 validation images, 60 000 testing images | Full body | 14 keypoints | Data are crawled from the Internet. It is currently the largest human pose image dataset.
2017 | PoseTrack [63] | 514 videos (66 374 frames): 300 training, 50 validation, 208 testing | Full body | 15 keypoints | The videos are from the MPII Human Pose dataset. The dataset focuses on three aspects: (1) single-frame multi-person pose estimation, (2) multi-person pose estimation in videos, and (3) multi-person articulated tracking.
Table 3 Human pose estimation datasets.
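On the evaluation side, the MSCOCO keypoint metric [64] scores a prediction against ground truth with the Object Keypoint Similarity (OKS). The sketch below follows the published formula, with d_i the distance for keypoint i, s the object scale (square root of the object area), and k_i a per-keypoint constant left as an input; only labeled keypoints (visibility v_i > 0) contribute.

```python
import numpy as np

def oks(pred, gt, visibility, k, area):
    """pred, gt: (N, 2) arrays; visibility, k: length-N arrays; area: float."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    d2 = np.sum((pred - gt) ** 2, axis=1)  # squared distances d_i^2
    s2 = float(area)                       # s^2, since s = sqrt(area)
    labeled = np.asarray(visibility) > 0   # assumes at least one labeled kp
    similarities = np.exp(-d2 / (2 * s2 * np.asarray(k) ** 2))
    return similarities[labeled].mean()    # average over labeled keypoints
```

A predicted pose is counted as correct at a given threshold when its OKS exceeds that threshold, analogous to the IoU threshold in object detection.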
Fig. 12 Examples of different datasets.
[1]   Papandreou G., Zhu T., Kanazawa N., Toshev A., Tompson J., Bregler C., and Murphy K., Towards accurate multi-person pose estimation in the wild, arXiv preprint arXiv:1701.01779, 2017.
[2]   Insafutdinov E., Pishchulin L., Andres B., Andriluka M., and Schiele B., Deepercut: A deeper, stronger, and faster multi-person pose estimation model, in European Conference on Computer Vision, 2016, pp. 34-50.
[3]   Cao Z., Simon T., Wei S.-E., and Sheikh Y., Realtime multi-person 2d pose estimation using part affinity fields, in CVPR, 2017, vol. 1, p. 7.
[4]   Wei S.-E., Ramakrishna V., Kanade T., and Sheikh Y., Convolutional pose machines, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 4724-4732.
[5]   Toshev A. and Szegedy C., Deeppose: Human pose estimation via deep neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1653-1660.
[6]   Chen X. and Yuille A. L., Articulated pose estimation by a graphical model with image dependent pairwise relations, in Advances in Neural Information Processing Systems, 2014, pp. 1736-1744.
[7]   Martinez J., Hossain R., Romero J., and Little J. J., A simple yet effective baseline for 3d human pose estimation, in IEEE International Conference on Computer Vision, Venice, Italy, 2017, vol. 206, p. 3.
[8]   Chen Y., Shen C., Wei X.-S., Liu L., and Yang J., Adversarial posenet: A structure-aware convolutional network for human pose estimation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1212-1221.
[9]   Pishchulin L., Insafutdinov E., Tang S., Andres B., Andriluka M., Gehler P. V., and Schiele B., Deepcut: Joint subset partition and labeling for multi person pose estimation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 4929-4937.
[10]   Yang Y. and Ramanan D., Articulated pose estimation with flexible mixtures-of-parts, in Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 2011, pp. 1385-1392.
[11]   Yang Y. and Ramanan D., Articulated human detection with flexible mixtures of parts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2878-2890, 2013.
[12]   Wang F. and Li Y., Beyond physical connections: Tree models in human pose estimation, in Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 2013, pp. 596-603.
[13]   Sun M. and Savarese S., Articulated part-based model for joint object detection and pose estimation, in Computer Vision (ICCV), Barcelona, Spain, 2011, pp. 723-730.
[14]   Eichner M., Marin-Jimenez M., Zisserman A., and Ferrari V., 2d articulated human pose estimation and retrieval in (almost) unconstrained still images, International Journal of Computer Vision, vol. 99, no. 2, pp. 190-214, 2012.
[15]   Eichner M. and Ferrari V., We are family: Joint pose estimation of multiple persons, in European Conference on Computer Vision, Crete, Greece, 2010, pp. 228-242.
[16]   Guo Y., Liu Y., Oerlemans A., Lao S., Wu S., and Lew M. S., Deep learning for visual understanding: A review, Neurocomputing, vol. 187, pp. 27-48, 2016.
[17]   Poppe R., Vision-based human motion analysis: An overview, Computer Vision and Image Understanding, vol. 108, nos. 1&2, pp. 4-18, 2007.
[18]   Liu Z., Zhu J., Bu J., and Chen C., A survey of human pose estimation: The body parts parsing based methods, Journal of Visual Communication and Image Representation, vol. 32, pp. 10-19, 2015.
[19]   Zhang H.-B., Lei Q., Zhong B.-N., Du J.-X., and Peng J., A survey on human pose estimation, Intelligent Automation & Soft Computing, vol. 22, no. 3, pp. 483-489, 2016.
[20]   Gong W., Zhang X., Gonzàlez J., Sobral A., Bouwmans T., Tu C., and Zahzah E.-H., Human pose estimation from monocular images: A comprehensive survey, Sensors, vol. 16, no. 12, p. 1966, 2016.
[21]   Murphy-Chutorian E. and Trivedi M. M., Head pose estimation in computer vision: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 607-626, 2009.
[22]   Erol A., Bebis G., Nicolescu M., Boyle R. D., and Twombly X., Vision-based hand pose estimation: A review, Computer Vision and Image Understanding, vol. 108, nos. 1&2, pp. 52-73, 2007.
[23]   Asadi-Aghbolaghi M., Clapés A., Bellantonio M., Escalante H. J., Ponce-López V., Baró X., Guyon I., Kasaei S., and Escalera S., A survey on deep learning based approaches for action and gesture recognition in image sequences, in Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 2017, pp. 476-483.
[24]   Ferrari V., Marin-Jimenez M., and Zisserman A., Progressive search space reduction for human pose estimation, in Computer Vision and Pattern Recognition, Anchorage, AK, USA, 2008, pp. 1-8.
[25]   Carreira J., Agrawal P., Fragkiadaki K., and Malik J., Human pose estimation with iterative error feedback, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 4733-4742.
[26]   Sun X., Shang J., Liang S., and Wei Y., Compositional human pose regression, in The IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, vol. 2.
[27]   Luvizon D. C., Tabia H., and Picard D., Human pose regression by combining indirect part detection and contextual information, arXiv preprint arXiv:1710.02322, 2017.
[28]   Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., and Bengio Y., Generative adversarial nets, in Advances in Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672-2680.
[29]   Newell A., Yang K., and Deng J., Stacked hourglass networks for human pose estimation, in European Conference on Computer Vision, Amsterdam, Netherlands, 2016, pp. 483-499.
[30]   Chu X., Yang W., Ouyang W., Ma C., Yuille A. L., and Wang X., Multi-context attention for human pose estimation, arXiv preprint arXiv:1702.07432, 2017.
[31]   Pfister T., Simonyan K., Charles J., and Zisserman A., Deep convolutional neural networks for efficient pose estimation in gesture videos, in Asian Conference on Computer Vision, Singapore, 2014, pp. 538-552.
[32]   Tompson J. J., Jain A., LeCun Y., and Bregler C., Joint training of a convolutional network and a graphical model for human pose estimation, in Advances in Neural Information Processing Systems, Montreal, Canada, 2014, pp. 1799-1807.
[33]   Pfister T., Charles J., and Zisserman A., Flowing convnets for human pose estimation in videos, in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 1913-1921.
[34]   Sapp B. and Taskar B., Modec: Multimodal decomposable models for human pose estimation, in Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 2013, pp. 3674-3681.
[35]   Radosavovic I., Dollár P., Girshick R., Gkioxari G., and He K., Data distillation: Towards omni-supervised learning, arXiv preprint arXiv:1712.04440, 2017.
[36]   Fang H., Xie S., Tai Y.-W., and Lu C., Rmpe: Regional multi-person pose estimation, in The IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017.
[37]   He K., Gkioxari G., Dollár P., and Girshick R., Mask R-CNN, in Computer Vision (ICCV), Venice, Italy, 2017, pp. 2980-2988.
[38]   Iqbal U. and Gall J., Multi-person pose estimation with local joint-to-person associations, in European Conference on Computer Vision, Amsterdam, Netherlands, 2016, pp. 627-642.
[39]   Chen Y., Wang Z., Peng Y., Zhang Z., Yu G., and Sun J., Cascaded pyramid network for multi-person pose estimation, arXiv preprint arXiv:1711.07319, 2017.
[40]   Simonyan K. and Zisserman A., Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[41]   He K., Zhang X., Ren S., and Sun J., Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
[42]   Szegedy C., Ioffe S., Vanhoucke V., and Alemi A. A., Inception-v4, inception-resnet and the impact of residual connections on learning, in AAAI, San Francisco, CA, USA, 2017, vol. 4, p. 12.
[43]   Lin T.-Y., Doll?′ar P., Girshick R., He K., Hariharan B., and Belongie S., Feature pyramid networks for object detection, in CVPR, Honolulu, HI, USA, 2017, vol. 1, p. 4.
[44]   Bodla N., Singh B., Chellappa R., and Davis L. S., Improving object detection with one line of code, arXiv preprint arXiv:1704.04503, 2017.
[45]   Wang Y. W., Wang C., Li Q., Leng B., Li Z., and Yan J., Team oks keypoint detection, 2017.
[46]   Burgos-Artizzu X. P., Hall D. C., Perona P., and Dollár P., Merging pose estimates across space and time, in British Machine Vision Conference (BMVC), Bristol, UK, 2013.
[47]   Chen X. and Yuille A., Parsing occluded people by flexible compositions, in Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 3945-3954.
[48]   Chen X., Mottaghi R., Liu X., Fidler S., Urtasun R., and Yuille A., Detect what you can: Detecting and representing objects using holistic models and body parts, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, WI, USA, 2014, pp. 1971-1978.
[49]   Insafutdinov E., Andriluka M., Pishchulin L., Tang S., Levinkov E., Andres B., and Schiele B., Arttrack: Articulated multi-person tracking in the wild, in IEEE Conference on Computer Vision and Pattern Recognition(CVPR), Honolulu, HI, USA, 2017.
[50]   Zhu X., Jiang Y., and Luo Z., Multi-person pose estimation for posetrack with enhanced part affinity fields, presented at the ICCV PoseTrack Workshop, Venice, Italy, 2017.
[51]   Newell A., Huang Z., and Deng J., Associative embedding: End-to-end learning for joint detection and grouping, in Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 2274-2284.
[52]   Xie S., Girshick R., Dollár P., Tu Z., and He K., Aggregated residual transformations for deep neural networks, in Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 5987-5995.
[53]   Huang G., Liu Z., Weinberger K. Q., and van der Maaten L., Densely connected convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, vol. 1, p. 3.
[54]   Chen L.-C., Papandreou G., Kokkinos I., Murphy K., and Yuille A. L., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, 2018.
[55]   Zhong Z., Yan J., and Liu C.-L., Practical network blocks design with q-learning, arXiv preprint arXiv:1708.05552, 2017.
[56]   Ren S., He K., Girshick R., and Sun J., Faster R-CNN: Towards real-time object detection with region proposal networks, in Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 91-99.
[57]   Johnson S. and Everingham M., Clustered pose and nonlinear appearance models for human pose estimation, in Proceedings of the British Machine Vision Conference, Aberystwyth, UK, 2010.
[58]   Antol S., Zitnick C. L., and Parikh D., Zero-shot learning via visual abstraction, in European Conference on Computer Vision, Zurich, Switzerland, 2014, pp. 401-416.
[59]   Andriluka M., Pishchulin L., Gehler P., and Schiele B., 2d human pose estimation: New benchmark and state of the art analysis, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 3686-3693.
[60]   Cherian A., Mairal J., Alahari K., and Schmid C., Mixing body-part sequences for human pose estimation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 2353-2360.
[61]   Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., and Zitnick C. L., Microsoft coco: Common objects in context, in European Conference on Computer Vision, Zurich, Switzerland, 2014, pp. 740-755.
[62]   Wu J., Zheng H., Zhao B., Li Y., Yan B., Liang R., Wang W., Zhou S., Lin G., Fu Y., et al., Ai challenger: A large-scale dataset for going deeper in image understanding, arXiv preprint arXiv:1711.06475, 2017.
[63]   Andriluka M., Iqbal U., Milan A., Insafutdinov E., Pishchulin L., Gall J., and Schiele B., Posetrack: A benchmark for human pose estimation and tracking, arXiv preprint arXiv:1710.10000, 2017.
[64]   MSCOCO keypoint evaluation metric, 2017.