Incremental Face Clustering with Optimal Summary Learning Via Graph Convolutional Network

Xuan Zhao, Zhongdao Wang, Lei Gao, Yali Li, Shengjin Wang*

Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. First Research Institute of Ministry of Public Security, Beijing 100084, China.

In this study, we address the problems encountered in incremental face clustering. Without the benefit of observing the entire data distribution, incremental face clustering is more challenging than clustering a static dataset. Conventional methods rely on the statistical information of previous clusters to improve the efficiency of incremental clustering; thus, errors may accumulate. Therefore, this study proposes to predict the summaries of previous data directly from the data distribution via supervised learning. Moreover, an efficient framework for clustering previous summaries together with new data is explored. Although learning summaries from the original data costs more than deriving them from previous clusters, the entire framework consumes only slightly more time, because clustering the current data and generating summaries for new data share most of the computation. Experiments show that the proposed approach significantly outperforms existing incremental face clustering methods, improving the average F-score from 0.644 to 0.762. Compared with state-of-the-art static face clustering methods, our method yields comparable accuracy while consuming much less time.

Fund: National Natural Science Foundation of China (61701277); State Key Development Program in 13th Five-Year (2016YFB0801301)

Corresponding author:
Shengjin Wang
E-mail: wgsgj@tsinghua.edu.cn

About the authors:

Xuan Zhao is a PhD candidate at the Department of Electronic Engineering, Tsinghua University, advised by Prof. Shengjin Wang. Before that, he received the bachelor and master degrees from Tsinghua University in 2004 and 2007, respectively. His research interests include image processing, pattern recognition, computer vision, and video analysis.

Zhongdao Wang is a PhD candidate at the Department of Electronic Engineering, Tsinghua University, advised by Prof. Shengjin Wang. Before that, he received the bachelor degree in mathematics and physics from Tsinghua University in 2017. His research interests are computer vision and machine learning.

Lei Gao received the PhD degree in computer science from Beihang University, Beijing, China in 2008. He is an associate professor at the First Research Institute of Ministry of Public Security. His current research interests include computer vision and artificial intelligence.

Yali Li received the BE degree with excellent graduates award from Nanjing University in 2007 and the PhD degree from Tsinghua University in 2013. Currently, she is a research assistant at the Department of Electronic Engineering, Tsinghua University. Her research interests include image processing, pattern recognition, computer vision, and video analysis.

Shengjin Wang received the BEng degree from Tsinghua University in 1985 and the PhD degree from the Tokyo Institute of Technology, Tokyo, Japan in 1997. From May 1997 to August 2003, he was a member of the research staff at Internet System Research Laboratories, NEC Corporation, Japan. Since September 2003, he has been a professor at the Department of Electronic Engineering, Tsinghua University, where he is currently also the director of the Research Institute of Image and Graphics (RIIG). He has published more than 50 papers on image processing and holds ten patents. His current research interests include image processing, computer vision, video surveillance, and pattern recognition.

Xuan Zhao, Zhongdao Wang, Lei Gao, Yali Li, and Shengjin Wang. Incremental Face Clustering with Optimal Summary Learning Via Graph Convolutional Network. Tsinghua Science and Technology, 2021, 26(4): 536-547.

Fig. 1 High-level idea of the proposed method. (a) We address the problem of incremental face clustering. (b) Estimating summaries is a fundamental approach for incremental clustering. Summaries are a small number of features that represent existing data in feature space. In (b), $n$ denotes a new face feature, $s_1$ is the summary of cluster 1, and $s_2$ is the summary of cluster 2. If $l_1 < l_2$, then $n$ should be grouped into cluster 1; thus, we do not need to compare $n$ with each existing feature. In (c) and (d), we analyze a situation in which existing clusters contain errors, that is, features of the blue and red persons are grouped into one cluster. To estimate summaries, (c) previous methods mainly use the statistical information of existing clusters, such as cluster centers, whereas (d) our method learns summaries directly from the original data distribution using a graph convolutional network. In addition, our method generates improved summaries when previous clusters contain errors (best viewed in color).
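The summary-based assignment of Fig. 1b can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the Euclidean distance and the `threshold` cutoff for opening a new cluster are assumptions made here for concreteness.

```python
import numpy as np

def assign_to_summaries(n, summaries, threshold=0.5):
    """Assign a new face feature to the cluster whose summary is closest.

    n: 1-D feature vector; summaries: dict mapping cluster id -> summary vector.
    `threshold` is a hypothetical distance cutoff for opening a new cluster.
    """
    # Distance l_k from the new feature n to each cluster summary s_k.
    dists = {cid: np.linalg.norm(n - s) for cid, s in summaries.items()}
    best = min(dists, key=dists.get)
    if dists[best] <= threshold:
        return best   # e.g., l_1 < l_2  ->  n joins cluster 1
    return None       # too far from every summary: start a new cluster

# Usage: two clusters summarized by s_1 and s_2; only the summaries are
# compared, never the full set of previously seen features.
summaries = {1: np.array([0.0, 0.0]), 2: np.array([3.0, 3.0])}
n = np.array([0.2, 0.1])
print(assign_to_summaries(n, summaries))
```

Because each cluster is represented by a handful of summaries rather than all of its features, the per-batch comparison cost stays bounded as the dataset grows.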

Fig. 2 Main framework of our method for incremental face clustering. When a new batch of face embedding features $F_i$ arrives, we take four steps. In Step 1, we generate candidate graphs from summaries: we cluster $F_i$ with the previous summaries $S_{i-1}$ and then replace each summary in $S_{i-1}$ with its corresponding features in $E_{i-1}$, thereby forming a set of clusters. For each cluster, we generate several graphs using multiple thresholds. In Step 2, we predict a confidence score for each graph through a learned GCN. In Step 3, because the candidate graphs overlap, we merge them into clustering results according to their confidence scores. In Step 4, we generate the current summaries $S_i$ and their corresponding features $E_i$ and store them for new data.
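Step 1 of the framework above (candidate graphs via multiple thresholds) can be sketched as below. The cosine-similarity adjacency and the specific threshold values are illustrative assumptions; the paper's threshold range corresponds to the $\eta$ studied in Table 5.

```python
import numpy as np

def candidate_graphs(features, thresholds=(0.40, 0.45)):
    """Step 1 sketch: one candidate adjacency matrix per similarity threshold.

    `features` is an (N, D) array of L2-normalized embeddings for one cluster;
    `thresholds` stands in for the multi-threshold range. A lower threshold
    yields a denser (larger) candidate graph, a higher one a tighter graph.
    """
    sim = features @ features.T                # cosine similarity matrix
    graphs = []
    for t in thresholds:
        adj = (sim >= t).astype(np.float32)    # keep edges above threshold t
        np.fill_diagonal(adj, 0.0)             # no self-loops
        graphs.append(adj)
    return graphs

# Usage: three features, two of them near-duplicates.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
g_low, g_high = candidate_graphs(feats, thresholds=(0.5, 0.995))
```

The overlapping graphs produced this way are the candidates that the GCN then scores in Step 2, so that Step 3 can keep the most confident ones.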

Fig. 3 Toy example to illustrate the effectiveness of the confidence metric: (a) perfect cluster, (b) an error node, (c) a missing node, and (d) an error node and a missing node. The proposed fusion metric is sensitive to error and missing nodes.
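A training target with the behavior Fig. 3 illustrates can be sketched as follows. The exact definitions of purity, coverage, and their fusion are not reproduced here; a harmonic-mean fusion over majority-label purity and class coverage is an assumption chosen so that both error nodes and missing nodes lower the score.

```python
from collections import Counter

def graph_confidence(node_labels, full_class_sizes):
    """Sketch of a graph-confidence target in the spirit of Fig. 3.

    `node_labels`: ground-truth identity of each node in the candidate graph.
    `full_class_sizes`: total number of features per identity in the batch.
    Purity drops when error nodes are included; coverage drops when nodes of
    the majority identity are missing; the fusion penalizes both defects.
    """
    counts = Counter(node_labels)
    major, major_count = counts.most_common(1)[0]
    purity = major_count / len(node_labels)             # (b) error node lowers this
    coverage = major_count / full_class_sizes[major]    # (c) missing node lowers this
    return 2 * purity * coverage / (purity + coverage)  # harmonic-mean fusion

# Fig. 3 toy cases for an identity "a" with 4 features in total:
perfect = graph_confidence(["a"] * 4, {"a": 4})          # (a) perfect cluster
error   = graph_confidence(["a"] * 4 + ["b"], {"a": 4})  # (b) one error node
missing = graph_confidence(["a"] * 3, {"a": 4})          # (c) one missing node
```

Under this sketch, the perfect cluster scores 1.0 and every defect strictly lowers the score, matching the sensitivity the caption describes.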

| Method | AF | ANMI | AT (s) |
|---|---|---|---|
| i-Kmeans [10] | 0.375 | 0.856 | 849 |
| BSAS [28] | 0.462 | 0.859 | 342 |
| MBSAS [29] | 0.470 | 0.898 | 1383 |
| i-DBSCAN [10] | 0.644 | 0.925 | 109 |
| Xmeans [30] | 0.540 | 0.883 | 136 800 |
| HAC [7] | 0.637 | 0.934 | 20 877 |
| ARO [3] | 0.192 | 0.862 | 367 |
| Affinity [1] | 0.740 | 0.951 | 339 |
| Linkage [2] | 0.692 | 0.957 | 1099 |
| Our method | 0.762 | 0.955 | 257 |

Table 1 Results on the MS small testing dataset. Comparison with baseline and state-of-the-art methods in terms of the AF, ANMI, and AT of 10 batches. For all methods, we tune the corresponding hyperparameters and report the best result.

| Method | AF | ANMI | AT (min) |
|---|---|---|---|
| Affinity [1] | 0.660 | 0.950 | 237 |
| Linkage [2] | 0.661 | 0.952 | 554 |
| Our method | 0.672 | 0.951 | 203 |

Table 2 Results on the MS big testing dataset. We compare the proposed method with state-of-the-art methods in terms of the AF, ANMI, and AT of 10 batches. For all methods, we tune the corresponding hyperparameters and report the best result.

| Method | AF | ANMI | AT (s) |
|---|---|---|---|
| Affinity [1] | 0.677 | 0.934 | 41 |
| Linkage [2] | 0.669 | 0.937 | 126 |
| Our method | 0.690 | 0.936 | 31 |

Table 3 Results on the IJB-B testing dataset. We compare our method with state-of-the-art methods in terms of the AF, ANMI, and AT of 10 batches. For all methods, we tune the corresponding hyperparameters and report the best result.

Fig. 4 AF of 10 batches vs. AT (s). The closer to the top left, the better. For ease of display, we cap AT at 500 s; the AT of i-Kmeans, BSAS, MBSAS, Xmeans, and HAC is 849, 1876, 1383, 136 860, and 20 877 s, respectively.

| Metric | Precision | Recall | AF | ANMI |
|---|---|---|---|---|
| Coverage | 0.801 | 0.706 | 0.750 | 0.954 |
| Purity | 0.912 | 0.616 | 0.735 | 0.946 |
| Fusion | 0.839 | 0.699 | 0.762 | 0.955 |

Table 4 Comparison with different metrics for learning graph confidence.

| $\eta$ | AF | ANMI | AT (s) |
|---|---|---|---|
| (0.30, 0.35) | 0.624 | 0.918 | 44 |
| (0.35, 0.40) | 0.670 | 0.929 | 35 |
| (0.40, 0.45) | 0.690 | 0.936 | 31 |
| (0.45, 0.50) | 0.672 | 0.934 | 30 |
| (0.50, 0.55) | 0.652 | 0.927 | 29 |

Table 5 AF, ANMI, and AT of 10 batches vs. $\eta$ on the IJB-B testing set.

Fig. 5 AF, ANMI, and AT of 10 batches vs. $\tau$ on the IJB-B testing set. For better visibility, AT is divided by 50 s in the figure.

[1]

Yang L., Zhan X. H., Chen D. P., Yan J. J., Loy C. C., and Lin D. H., Learning to cluster faces on an affinity graph, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 2293-2301.

[2]

Wang Z. D., Zheng L., Li Y. L., and Wang S. J., Linkage based face clustering via graph convolution network, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 1117-1125.

[3]

Otto C., Wang D. Y., and Jain A. K., Clustering millions of faces by identity, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 289-303, 2018.

[4]

Lloyd S., Least squares quantization in PCM, IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129-137, 1982.

[5]

Shi J. and Malik J., Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888-905, 2000.

[6]

Ester M., Kriegel H. P., Sander J., and Xu X. W., A density-based algorithm for discovering clusters in large spatial databases with noise, in KDD-96 Proc., Menlo Park, CA, USA, 1996, pp. 226-231.

[7]

Sibson R., SLINK: An optimally efficient algorithm for the single-link cluster method, Comput. J., vol. 16, no. 1, pp. 30-34, 1973.

[8]

Ailon N., Jaiswal R., and Monteleoni C., Streaming k-means approximation, in Proc. 23rd Annu. Conf. Neural Information Processing Systems, Vancouver, Canada, 2009, pp. 10-18.

[9]

Shindler M., Wong A., and Meyerson A., Fast and accurate k-means for large datasets, in Proc. 24th Int. Conf. Neural Information Processing Systems, Granada, Spain, 2011, pp. 2375-2383.

[10]

Bagirov A. M., Ugon J., and Webb D., Fast modified global k-means algorithm for incremental cluster construction, Pattern Recognit., vol. 44, no. 3, pp. 866-876, 2011.

[11]

Ester M., Kriegel H. P., Sander J., Wimmer M., and Xu X. W., Incremental clustering for mining in a data warehousing environment, in Proc. 24th Int. Conf. Very Large Data Bases, New York, NY, USA, 1998, pp. 323-333.

[12]

Mansfield P. A., Wang Q., Downey C., Wan L., and Moreno I. L., Links: A high-dimensional online clustering method, arXiv preprint arXiv: 1801.10123, 2018.

[13]

Kulshreshtha P. and Guha T., An online algorithm for constrained face clustering in videos, in Proc. 25th IEEE Int. Conf. Image Proc. (ICIP), Athens, Greece, 2018, pp. 2670-2674.

[14]

Arthur D. and Vassilvitskii S., k-means++: The advantages of careful seeding, in Proc. 18th Annu. ACM-SIAM Symp. on Discrete Algorithms, Philadelphia, PA, USA, 2007.

[15]

Sun Y. F. and Lu Y. S., A grid-based subspace clustering algorithm for high-dimensional data streams, in Proc. Int. Conf. Web Information Systems, Wuhan, China, 2006.

[16]

Zhou A. Y., Cao F., Qian W. N., and Jin C. Q., Tracking clusters in evolving data streams over sliding windows, Knowl. Inf. Syst., vol. 15, no. 2, pp. 181-214, 2008.

[17]

Namadchian A. and Esfandani G., DSCLU: A new data stream clustering algorithm for multi density environments, in Proc. 13th ACIS Int. Conf. Software Engineering, Artificial Intelligence, Networking and Parallel Distributed Computing (SNPD), Kyoto, Japan, 2012.

[18]

Hamilton W. L., Ying Z., and Leskovec J., Inductive representation learning on large graphs, arXiv preprint arXiv: 1706.02216, 2017.

[19]

Kipf T. N. and Welling M., Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv: 1609.02907, 2017.

[20]

Yan S. J., Xiong Y. J., and Lin D. H., Spatial temporal graph convolutional networks for skeleton-based action recognition, arXiv preprint arXiv: 1801.07455, 2018.

[21]

Chen Z. M., Wei X. S., Wang P., and Guo Y. W., Multi-label image recognition with graph convolutional networks, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 5172-5181.

[22]

Shi Y. C., Otto C., and Jain A. K., Face clustering: Representation and pairwise constraints, IEEE Trans. Inf. Forens. Secur., vol. 13, no. 7, pp. 1626-1640, 2018.

[23]

Li Y. L., Wang S. J., Tian Q., and Ding X. Q., A survey of recent advances in visual feature detection, Neurocomputing, vol. 149, pp. 736-751, 2015.

[24]

Li Y. L., Wang S. J., Tian Q., and Ding X. Q., Feature representation for statistical-learning-based object detection: A review, Pattern Recognit., vol. 48, no. 11, pp. 3542-3559, 2015.

[25]

Guo Y. D., Zhang L., Hu Y. X., He X. D., and Gao J. F., MS-celeb-1m: A dataset and benchmark for large-scale face recognition, in Proc. 14th European Conf. Computer Vision, Amsterdam, the Netherlands, 2016.

[26]

Yi D., Lei Z., Liao S. C., and Li S. Z., Learning face representation from scratch, arXiv preprint arXiv: 1411.7923, 2014.

[27]

Whitelam C., Taborsky E., Blanton A., Maze B., Adams J. C., Miller T., Kalka N. D., Jain A. K., Duncan J. A., Allen K., et al., IARPA Janus benchmark-B face dataset, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 2017, pp. 592-600.

[28]

Theodoridis S. and Koutroumbas K., Pattern Recognition. 4th ed. San Diego, CA, USA: Elsevier, 2009.

[29]

Duda R. O., Hart P. E., and Stork D. G., Pattern Classification. 2nd ed. New York, NY, USA: John Wiley & Sons, 2001.

[30]

Pelleg D. and Moore A., X-means: Extending k-means with efficient estimation of the number of clusters, in Proc. 17th Int. Conf. Machine Learning, Pittsburgh, PA, USA, 2000.