Tsinghua Science and Technology  2019, Vol. 24 Issue (2): 207-215    doi: 10.26599/TST.2018.9010044
    
CasNet: A Cascade Coarse-to-Fine Network for Semantic Segmentation
Zhenyang Wang, Zhidong Deng*, Shiyao Wang
∙ Zhenyang Wang, Zhidong Deng, and Shiyao Wang are with the Department of Computer Science, Tsinghua University, Beijing 100084, China. E-mail: crazycry2010@gmail.com; sy-wang14@mails.tsinghua.edu.cn.

Abstract  

Semantic segmentation is a fundamental task in computer vision. Because it requires dense predictions over an entire image, a single network can hardly perform well across a wide variety of scenes. In this paper, we propose a cascade coarse-to-fine network called CasNet, which concentrates on the regions whose pixel-level labels are difficult to predict. CasNet comprises three branches. The first branch produces coarse predictions for easy-to-label pixel regions. The second branch learns to distinguish the relatively difficult-to-label pixels from the rest of the image. The last branch generates the final predictions by combining the coarse and fine prediction results through a weighting coefficient estimated by the second branch. The three branches each focus on their own objectives and collaboratively learn to refine predictions from coarse to fine, and they can be trained in an end-to-end manner. To evaluate the performance of the proposed network, we conduct experiments on two public datasets: SIFT Flow and Stanford Background. The experimental results demonstrate that CasNet outperforms existing state-of-the-art models, achieving pixel accuracies of 91.6% and 89.7% on SIFT Flow and Stanford Background, respectively.
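The fusion step described above — blending the coarse and refined per-pixel predictions with a weighting coefficient produced by the second branch — can be sketched in NumPy. This is a minimal illustration of the idea, not the paper's implementation; the function names, the sigmoid gating, and the per-pixel weight map are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the class axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascade_fuse(coarse_logits, refine_logits, attention_logits):
    """Sketch of a coarse-to-fine fusion.

    coarse_logits, refine_logits: (H, W, C) class scores from the
    coarse and refine branches.
    attention_logits: (H, W) scores from the attention branch; a
    sigmoid turns them into a weight w in [0, 1] that shifts
    hard-to-label pixels toward the refined prediction.
    Returns the fused per-pixel label map of shape (H, W).
    """
    w = 1.0 / (1.0 + np.exp(-attention_logits))   # (H, W), in [0, 1]
    w = w[..., None]                              # broadcast over classes
    fused = (1.0 - w) * softmax(coarse_logits) + w * softmax(refine_logits)
    return fused.argmax(axis=-1)
```

With this formulation the weight map is learned jointly with both prediction branches, so gradients from the final labels reach all three, which is consistent with the end-to-end training described in the abstract.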



Key words: semantic segmentation; convolutional neural network; hard negative mining
Received: 25 October 2017      Published: 29 April 2019
Corresponding Authors: Zhidong Deng   
About author:

Shiyao Wang received the BS degree from Tianjin University, China, in 2014 and is pursuing the PhD degree at Tsinghua University, Beijing, China. Her research interests include computer vision, deep learning, and machine learning.

Cite this article:

Zhenyang Wang, Zhidong Deng, Shiyao Wang. CasNet: A Cascade Coarse-to-Fine Network for Semantic Segmentation. Tsinghua Science and Technology, 2019, 24(2): 207-215.

URL:

http://tst.tsinghuajournals.com/10.26599/TST.2018.9010044     OR     http://tst.tsinghuajournals.com/Y2019/V24/I2/207

Fig. 1 Examples of semantic segmentation.
Fig. 2 A cascade coarse-to-fine network architecture for semantic segmentation.
DatasetMethodCoarse branchRefine branchAttention branchPixel acc. (%)
SIFT Flow(a) 89.2
(b) 91.0 0.8
(c)91.6 1.4
Stanford Background(a) 88.5
(b) 89.20.7
(c)89.71.2
Table 1 Ablation study on SIFT Flow and Stanford Background datasets.
Method                          | Pixel acc. (%) | Class acc. (%)
Liu et al. [6]                  | 76.7           |
Tighe and Lazebnik [23] SVM     | 75.6           | 41.4
Tighe and Lazebnik [24] SVM+MRF | 78.6           | 39.2
Farabet et al. [11] natural     | 72.3           | 50.8
Farabet et al. [11] balanced    | 78.5           | 29.6
Pinheiro and Collobert [25]     | 77.7           | 29.8
Liang et al. [17]               | 84.3           | 41.0
Long et al. [4]                 | 85.9           | 53.9
Jin et al. [26]                 | 86.9           | 56.5
He et al. [1]                   | 90.52          |
Ours                            | 91.6           | 52.5
Table 2 Segmentation results on SIFT Flow.
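The two metrics reported in the tables can be computed as follows: pixel accuracy is the fraction of correctly labeled pixels over the whole image, while class accuracy averages per-class recall over the classes present in the ground truth. This sketch of the standard definitions is for clarity only; the function name and the convention of skipping absent classes are assumptions.

```python
import numpy as np

def pixel_and_class_accuracy(pred, gt, num_classes):
    """Compute pixel accuracy and (mean) class accuracy.

    pred, gt: integer label maps of identical shape.
    num_classes: total number of semantic classes.
    Classes absent from gt are excluded from the class-accuracy mean.
    """
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    pixel_acc = float((pred == gt).mean())
    per_class = []
    for c in range(num_classes):
        mask = gt == c
        if mask.any():                       # skip classes not in gt
            per_class.append((pred[mask] == c).mean())
    return pixel_acc, float(np.mean(per_class))
```

Because pixel accuracy is dominated by frequent classes (e.g., sky, road), a model can score high on it while doing poorly on rare classes, which is why the tables report both numbers.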
Method                  | Pixel acc. (%) | Class acc. (%)
Gould et al. [7]        | 76.4           |
Tighe and Lazebnik [23] | 77.5           |
Eigen and Fergus [27]   | 75.3           | 66.5
Singh and Kosecka [28]  | 74.1           | 62.2
Lempitsky et al. [9]    | 81.9           | 72.4
Liang et al. [17]       | 83.1           | 74.8
Jin et al. [26]         | 86.6           | 79.0
Ours                    | 89.7           | 75.4
Table 3 Segmentation results on the Stanford Background benchmark.
Fig. 3 Prediction results on Stanford Background dataset.
Fig. 4 CasNet visualization predictions.
[1]   He K. M., Zhang X. Y., Ren S. Q., and Sun J., Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
[2]   Graves A., Mohamed A. R., and Hinton G., Speech recognition with deep recurrent neural networks, in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013, pp. 6645-6649.
[3]   Sermanet P., Eigen D., Zhang X., Mathieu M., Fergus R., and LeCun Y., Overfeat: Integrated recognition, localization and detection using convolutional networks, arXiv preprint arXiv: 1312.6229, 2013.
[4]   Long J., Shelhamer E., and Darrell T., Fully convolutional networks for semantic segmentation, in Proc. 2015 IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3431-3440.
[5]   Sung K. K. and Poggio T., Example-based learning for view-based human face detection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 39-51, 1998.
[6]   Liu C., Yuen J., and Torralba A., Sift flow: Dense correspondence across scenes and its applications, IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978-994, 2011.
[7]   Gould S., Fulton R., and Koller D., Decomposing a scene into geometric and semantically consistent regions, in Proc. IEEE 12th Int. Conf. Computer Vision, Kyoto, Japan, 2009, pp. 1-8.
[8]   Ladicky L., Russell C., Kohli P., and Torr P. H. S., Associative hierarchical CRFs for object class image segmentation, in Proc. IEEE 12th Int. Conf. Computer Vision, Kyoto, Japan, 2009, pp. 739-746.
[9]   Lempitsky V., Vedaldi A., and Zisserman A., A pylon model for semantic segmentation, in Proc. 24th Int. Conf. Neural Information Processing Systems, Granada, Spain, 2011, pp. 1485-1493.
[10]   He X. M., Zemel R. S., and Carreira-Perpinan M. A., Multiscale conditional random fields for image labeling, in Proc. 2004 IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Washington, DC, USA, 2004, pp. 695-702.
[11]   Farabet C., Couprie C., Najman L., and LeCun Y., Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915-1929, 2013.
[12]   Couprie C., Farabet C., Najman L., and LeCun Y., Indoor semantic segmentation using depth information, arXiv preprint arXiv: 1301.3572, 2013.
[13]   Zhao H. S., Shi J. P., Qi X. J., Wang X. G., and Jia J. Y., Pyramid scene parsing network, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017.
[14]   Lin T. Y., Dollár P., Girshick R., He K. M., Hariharan B., and Belongie S., Feature pyramid networks for object detection, arXiv preprint arXiv: 1612.03144, 2016.
[15]   Yu F. and Koltun V., Multi-scale context aggregation by dilated convolutions, arXiv preprint arXiv: 1511.07122, 2015.
[16]   Visin F., Romero A., Cho K., Matteucci M., Ciccone M., Kastner K., Bengio Y., and Courville A., Reseg: A recurrent neural network-based model for semantic segmentation, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 2016, pp. 41-48.
[17]   Liang M., Hu X. L., and Zhang B., Convolutional neural networks with intra-layer recurrent connections for scene labeling, in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2015, pp. 937-945.
[18]   Chen L. C., Papandreou G., Kokkinos I., Murphy K., and Yuille A. L., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, arXiv preprint arXiv: 1606.00915, 2016.
[19]   Zheng S., Jayasumana S., Romera-Paredes B., Vineet V., Su Z. Z., Du D. L., Huang C., and Torr P. H. S., Conditional random fields as recurrent neural networks, in Proc. 2015 IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 1529-1537.
[20]   Mostajabi M., Yadollahpour P., and Shakhnarovich G., Feedforward semantic segmentation with zoom-out features, in Proc. 2015 IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3376-3385.
[21]   Veit A., Wilber J. M., and Belongie S., Residual networks behave like ensembles of relatively shallow networks, in Advances in Neural Information Processing Systems 29, Barcelona, Spain, 2016, pp. 550-558.
[22]   Jia Y. Q., Shelhamer E., Donahue J., Karayev S., Long J., Girshick R., Guadarrama S., and Darrell T., Caffe: Convolutional architecture for fast feature embedding, in Proc. 22nd ACM Int. Conf. Multimedia, Orlando, FL, USA, 2014, pp. 675-678.
[23]   Tighe J. and Lazebnik S., Superparsing: Scalable nonparametric image parsing with superpixels, in Proc. 11th European Conf. Computer Vision, Heraklion, Greece, 2010, pp. 352-365.
[24]   Tighe J. and Lazebnik S., Finding things: Image parsing with regions and per-exemplar detectors, in Proc. 2013 IEEE Conf. Computer Vision and Pattern Recognition, Portland, OR, USA, 2013, pp. 3001-3008.
[25]   Pinheiro P. O. and Collobert R., Recurrent convolutional neural networks for scene labeling, in Proc. 31st International Conference on Machine Learning, Beijing, China, 2014, pp. 82-90.
[26]   Jin X. J., Chen Y. P., Jie Z. Q., Feng J. S., and Yan S. C., Multi-path feedback recurrent neural networks for scene parsing, in Proc. 31st AAAI Conf. on Artificial Intelligence, San Francisco, CA, USA, 2017, pp. 4096-4102.
[27]   Eigen D. and Fergus R., Nonparametric image parsing using adaptive neighbor sets, in Proc. 2012 IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2799-2806.
[28]   Singh G. and Kosecka J., Nonparametric scene parsing with adaptive feature relevance and semantic context, in Proc. 2013 IEEE Conf. Computer Vision and Pattern Recognition, Portland, OR, USA, 2013, pp. 3151-3157.