Tsinghua Science and Technology  2021, Vol. 26 Issue (4): 387-402    doi: 10.26599/TST.2020.9010021
    
DGA-Based Botnet Detection Toward Imbalanced Multiclass Learning
Yijing Chen, Bo Pang, Guolin Shao*, Guozhu Wen, Xingshu Chen
College of Cybersecurity, Sichuan University, Chengdu 610065, China.
Cybersecurity Research Institute, Sichuan University, Chengdu 610065, China.

Abstract  

Botnets based on the Domain Generation Algorithm (DGA) mechanism pose great challenges to mainstream detection methods because of their strong concealment and robustness. However, the complexity of DGA families and the imbalance of samples continue to impede research on DGA detection. Existing work treats the sample size of each DGA family as the most important determinant of the resampling proportion; differences in the characteristics of the samples are therefore ignored, and the optimal resampling effect is not achieved. In this paper, a Long Short-Term Memory-based Property and Quantity Dependent Optimization (LSTM.PQDO) method is proposed. The method uses LSTM to automatically mine the comprehensive features of DGA domain names. It iteratively updates the resampling proportion, considering both the original number and the characteristics of the samples, and heuristically searches around the initial solution for a better one; in this way, the resampling proportion is optimized dynamically. The experimental results show that LSTM.PQDO achieves better performance than existing models on unbalanced datasets; moreover, it can serve as a reference for resampling tasks in similar scenarios.



Key words: botnet, Domain Generation Algorithm (DGA), multiclass imbalance, resampling
Received: 04 October 2019      Published: 12 January 2021
Fund: National Natural Science Foundation of China (61272447); National Entrepreneurship & Innovation Demonstration Base of China (C700011); Key Research & Development Project of Sichuan Province of China (2018G20100)
Corresponding Authors: Guolin Shao     E-mail: 2016141531037@stu.scu.edu.cn;2016141231157@stu.scu.edu.cn;sgllearn@163.com;2016141531010@stu.scu.edu.cn;chenxsh@scu.edu.cn
About the authors:
Yijing Chen received the bachelor's degree from Sichuan University, Chengdu, China in 2020. Her research interests include deep learning for network security and big data analysis. She won the network security scholarship of Sichuan University in 2017 and 2018, the first prize of the 12th China University Computer Design Competition in 2019, and the Honorable Mention of the Mathematical Contest in Modeling in 2019.
Bo Pang received the bachelor's degree from Sichuan University, Chengdu, China in 2020. His research interests include Web security, deep learning for Web security, and big data analysis. He has received numerous honors and awards, including fifth place in the Sichuan University AI Challenge in 2018 and the third prize of the Sichuan University Student Information Security Technology Competition.
Guolin Shao received the BS and PhD degrees from Sichuan University, Chengdu, China in 2013 and 2018, respectively. His research interests include deep learning for cyber security and big data analysis. He has published more than 16 peer-reviewed papers. He has received numerous honors and awards, including the National Cyber Security Scholarship in 2016 and 2018, the National Scholarship in 2017, the Top Ten Academic Star of Sichuan University, and the First Prize Scholarship in 2013, 2015, and 2017.
Guozhu Wen received the bachelor's degree from Sichuan University, Chengdu, China in 2020. His research interests include Web security, deep learning for Web security, and big data analysis. He has received numerous honors and awards, including fifth place in the Sichuan University AI Challenge in 2018 and the third prize of the Sichuan University Student Information Security Technology Competition.
Xingshu Chen is a full professor at the College of Cybersecurity, Sichuan University, Chengdu, China. She received the master's and PhD degrees from Sichuan University in 1999 and 2004, respectively. Her main research interests are cloud computing, big data analysis, and network security.
Cite this article:

Yijing Chen,Bo Pang,Guolin Shao,Guozhu Wen,Xingshu Chen. DGA-Based Botnet Detection Toward Imbalanced Multiclass Learning. Tsinghua Science and Technology, 2021, 26(4): 387-402.

URL:

http://tst.tsinghuajournals.com/10.26599/TST.2020.9010021     OR     http://tst.tsinghuajournals.com/Y2021/V26/I4/387

Fig. 1 Detection model for DGA based on LSTM.
Fig. 2 Number of samples in different DGA families.
Fig. 3 Flow chart of LSTM.PQDO algorithm in a single iteration.
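| DGA class | Number of samples | DGA class | Number of samples |
|---|---|---|---|
| Bamital | 239 | Proslikefan | 249 |
| Banjori | 452 426 | Padcrypt | 377 |
| Bedep | 177 | Post | 62 429 |
| Beebone | 208 | Pushdo | 865 |
| Corebot | 278 | Pykspa | 1453 |
| Cryptolocker | 848 | Qadars | 1999 |
| Cryptowall | 92 | Qakbot | 15 882 |
| Dircrypt | 763 | Ramdo | 429 |
| Dyre | 1267 | Ramnit | 55 161 |
| Fobber | 282 | Ranbyus | 22 312 |
| Geodo | 574 | Shifu | 2547 |
| Hesperbot | 190 | Shiotob | 8270 |
| Kraken | 5548 | Simda | 23 872 |
| Locky | 1036 | Suppobox | 2306 |
| Matsnu | 432 | Symmi | 4256 |
| Murofet | 25 247 | Tempedre | 204 |
| Necurs | 33 467 | Tinba | 30 950 |
| Nymaim | 463 | Volatile | 238 |

Table 1 Summary of the collected dataset.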
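
To make the degree of imbalance in Table 1 concrete, the following minimal sketch computes the ratio between the largest and smallest families and the share of the largest family. The counts are copied from Table 1 (family names as printed there); the helper name `imbalance_summary` is an assumption introduced only for this illustration and does not come from the paper.

```python
# Sample counts copied from Table 1 (family names as printed in the table).
counts = {
    "Bamital": 239, "Banjori": 452426, "Bedep": 177, "Beebone": 208,
    "Corebot": 278, "Cryptolocker": 848, "Cryptowall": 92, "Dircrypt": 763,
    "Dyre": 1267, "Fobber": 282, "Geodo": 574, "Hesperbot": 190,
    "Kraken": 5548, "Locky": 1036, "Matsnu": 432, "Murofet": 25247,
    "Necurs": 33467, "Nymaim": 463, "Proslikefan": 249, "Padcrypt": 377,
    "Post": 62429, "Pushdo": 865, "Pykspa": 1453, "Qadars": 1999,
    "Qakbot": 15882, "Ramdo": 429, "Ramnit": 55161, "Ranbyus": 22312,
    "Shifu": 2547, "Shiotob": 8270, "Simda": 23872, "Suppobox": 2306,
    "Symmi": 4256, "Tempedre": 204, "Tinba": 30950, "Volatile": 238,
}


def imbalance_summary(counts):
    """Hypothetical helper: summarize how skewed the class sizes are."""
    total = sum(counts.values())
    largest = max(counts, key=counts.get)
    smallest = min(counts, key=counts.get)
    return {
        "total": total,
        "largest": largest,
        "smallest": smallest,
        # How many times larger the biggest family is than the smallest.
        "ratio": counts[largest] / counts[smallest],
        # Fraction of all samples that belong to the biggest family.
        "largest_share": counts[largest] / total,
    }


summary = imbalance_summary(counts)
print(summary["largest"], summary["smallest"], round(summary["ratio"], 1))
print(f"largest family share: {summary['largest_share']:.1%}")
```

A ratio of this size between the largest and smallest families is the kind of skew that motivates resampling before training.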
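| Mode | Init-method | Micro-F1 | Macro-PRE | Macro-REC | Macro-F1 | AVG |
|---|---|---|---|---|---|---|
| 1 | Original | 89.31 | 61.22 | 54.46 | 55.26 | 72.29 |
| 2 | Random | 86.37 | 62.68 | 70.79 | 63.42 | 74.89 |
| 3 | α=0.1 | 89.10 | 58.18 | 58.39 | 57.00 | 73.05 |
| 3 | α=0.2 | 89.40 | 62.16 | 58.40 | 57.77 | 73.59 |
| 3 | α=0.3 | 87.68 | 64.04 | 67.93 | 62.98 | 75.33 |
| 3 | α=0.4 | 88.81 | 63.91 | 69.26 | 64.97 | 76.89 |
| 3 | α=0.5 | 84.70 | 58.95 | 67.64 | 60.88 | 72.79 |
| 3 | α=0.6 | 86.75 | 61.49 | 73.88 | 63.96 | 75.35 |
| 3 | α=0.7 | 85.70 | 61.77 | 72.16 | 63.15 | 74.42 |
| 3 | α=0.8 | 85.15 | 61.30 | 74.68 | 63.44 | 74.30 |
| 3 | α=0.9 | 81.46 | 54.43 | 73.38 | 57.27 | 69.37 |
| 3 | α=1.0 | 76.67 | 57.04 | 71.69 | 58.10 | 67.39 |

Table 2 Performance comparison of the initialization methods (%).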
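
The AVG column of Table 2 is not defined on this page. The values appear to match the mean of Micro-F1 and Macro-F1 to within rounding; that reading is inferred from the numbers and is an assumption, not a statement from the paper. The short check below uses figures copied from Table 2 to test the assumption and to pick the initialization with the highest AVG.

```python
# (init method, Micro-F1, Macro-F1, AVG) copied from Table 2, in percent.
rows = [
    ("Original", 89.31, 55.26, 72.29),
    ("Random", 86.37, 63.42, 74.89),
    ("alpha=0.1", 89.10, 57.00, 73.05),
    ("alpha=0.2", 89.40, 57.77, 73.59),
    ("alpha=0.3", 87.68, 62.98, 75.33),
    ("alpha=0.4", 88.81, 64.97, 76.89),
    ("alpha=0.5", 84.70, 60.88, 72.79),
    ("alpha=0.6", 86.75, 63.96, 75.35),
    ("alpha=0.7", 85.70, 63.15, 74.42),
    ("alpha=0.8", 85.15, 63.44, 74.30),
    ("alpha=0.9", 81.46, 57.27, 69.37),
    ("alpha=1.0", 76.67, 58.10, 67.39),
]


def mean_f1(micro_f1, macro_f1):
    """Assumed definition of AVG: plain mean of the two F1 scores."""
    return (micro_f1 + macro_f1) / 2


# Largest disagreement between the assumed formula and the printed AVG.
max_gap = max(abs(mean_f1(mi, ma) - avg) for _, mi, ma, avg in rows)

# Row with the highest printed AVG.
best = max(rows, key=lambda r: r[3])

print(f"largest gap vs. printed AVG: {max_gap:.3f} percentage points")
print(f"best initialization by AVG: {best[0]} ({best[3]})")
```

Under this reading, AVG rewards settings that balance overall accuracy (Micro-F1) against per-family performance (Macro-F1) rather than maximizing either one alone.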
Fig. 4 Actual number of DGA classes and relative sampling coefficients optimized by LSTM.PQDO method.
Fig. 5 Variation of relative sampling coefficients of DGA classes optimized by LSTM.PQDO algorithm.
Fig. 6 Confusion matrix of LSTM and LSTM.PQDO.
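| Category | LSTM PRE | LSTM REC | LSTM F1 | LSTM.PQDO PRE | LSTM.PQDO REC | LSTM.PQDO F1 |
|---|---|---|---|---|---|---|
| Geodo | 0.1851 | 0.0434 | 0.0704 | 0.3323 | 1.0000 | 0.4989 |
| Beebone | 0.8387 | 0.6190 | 0.7123 | 0.9130 | 1.0000 | 0.9545 |
| Murofet | 0.9180 | 0.7507 | 0.8260 | 0.8295 | 0.8572 | 0.8431 |
| Pykspa | 0.8161 | 0.8804 | 0.8470 | 0.8347 | 0.8863 | 0.8597 |
| Padcrypt | 0.8640 | 0.9391 | 0.9000 | 0.8809 | 0.9652 | 0.9211 |
| Ramnit | 0.8046 | 0.9449 | 0.8691 | 0.8356 | 0.8877 | 0.8609 |
| Volatile | 0.9431 | 1.0000 | 0.9707 | 0.9900 | 1.0000 | 0.9950 |
| Ranbyus | 0.8368 | 0.8629 | 0.8496 | 0.9366 | 0.7578 | 0.8378 |
| Qakbot | 0.6525 | 0.6532 | 0.6529 | 0.6887 | 0.6330 | 0.6597 |
| Simda | 0.6388 | 0.9523 | 0.7647 | 0.7970 | 0.9926 | 0.8841 |
| Ramdo | 0.9925 | 1.0000 | 0.9962 | 0.9974 | 0.9925 | 0.9949 |
| Suppobox | 0.0000 | 0.0000 | 0.0000 | 0.4000 | 0.5000 | 0.4444 |
| Locky | 0.8837 | 0.3314 | 0.4821 | 0.5409 | 0.4897 | 0.5140 |
| Tempedreve | 0.0000 | 0.0000 | 0.0000 | 0.0588 | 0.0612 | 0.0600 |
| Qadars | 0.0000 | 0.0000 | 0.0000 | 0.4444 | 0.5000 | 0.4705 |
| Symmi | 0.9868 | 0.9868 | 0.9868 | 1.0000 | 1.0000 | 1.0000 |
| Banjori | 0.9997 | 1.0000 | 0.9998 | 0.9998 | 1.0000 | 0.9999 |
| Tinba | 0.9379 | 0.9916 | 0.9640 | 0.9121 | 0.9935 | 0.9510 |
| Hesperbot | 0.0000 | 0.0000 | 0.0000 | 0.4000 | 0.0526 | 0.0930 |
| Fobber | 0.0000 | 0.0000 | 0.0000 | 0.3534 | 0.6333 | 0.4537 |
| Dyre | 0.9626 | 1.0000 | 0.9809 | 0.9690 | 0.9987 | 0.9836 |
| Cryptowall | 0.0000 | 0.0000 | 0.0000 | 0.0142 | 0.0555 | 0.0227 |
| Corebot | 0.9818 | 0.9642 | 0.9729 | 0.9333 | 1.0000 | 0.9655 |
| Proslikefan | 1.0000 | 0.1217 | 0.2171 | 0.4356 | 0.2820 | 0.3424 |
| Bedep | 0.0000 | 0.0000 | 0.0000 | 0.5294 | 0.5142 | 0.5217 |
| Matsnu | 0.0000 | 0.0000 | 0.0000 | 0.8181 | 1.0000 | 0.9000 |
| Post | 0.9995 | 0.9994 | 0.9995 | 0.9987 | 0.9995 | 0.9991 |
| Necurs | 0.9107 | 0.8280 | 0.8674 | 0.9444 | 0.8130 | 0.8738 |
| Pushdo | 0.8628 | 0.8422 | 0.8524 | 0.8027 | 0.8839 | 0.8413 |
| Cryptolocker | 0.4411 | 0.5558 | 0.4918 | 0.5194 | 0.5109 | 0.5151 |
| Dircrypt | 0.0000 | 0.0000 | 0.0000 | 0.2597 | 0.1388 | 0.1809 |
| Shifu | 0.1578 | 0.0652 | 0.0923 | 0.3125 | 0.9782 | 0.4736 |
| Bamital | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Kraken | 0.8637 | 0.8291 | 0.8461 | 0.8416 | 0.8135 | 0.8273 |
| Nymaim | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Shiotob | 0.9559 | 0.9972 | 0.9761 | 0.9569 | 0.9952 | 0.9757 |
| W32.Virut | 0.0000 | 0.0000 | 0.0000 | 0.4782 | 0.9166 | 0.6285 |
| Micro-AVG | 0.8927 | 0.8927 | 0.8927 | 0.8902 | 0.8902 | 0.8902 |
| Macro-AVG | 0.5523 | 0.5178 | 0.5186 | 0.6475 | 0.7055 | 0.6580 |

Table 3 Performance comparison between LSTM and LSTM.PQDO.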
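
Table 3 reports both micro- and macro-averaged scores, and the two tell different stories on an imbalanced dataset. The sketch below shows the standard definitions on a small made-up confusion matrix (the matrix and function names are illustrative assumptions, not data or code from the paper). Macro-F1 is taken here as the unweighted mean of per-class F1 scores, which is one common convention. In single-label multiclass classification every misclassified sample counts once as a false positive and once as a false negative, which is why the Micro-AVG row shows identical PRE, REC, and F1.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, F1) per class; rows = true, cols = predicted."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[r][k] for r in range(n)) - tp
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        metrics.append((pre, rec, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean over classes: small classes count as much as large ones."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))


def micro_average(confusion):
    """Pool TP/FP/FN over all classes: large classes dominate the score."""
    n = len(confusion)
    tp = sum(confusion[k][k] for k in range(n))
    total = sum(sum(row) for row in confusion)
    # Each error is one FP (for the predicted class) and one FN (for the true class).
    fp = fn = total - tp
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    return pre, rec, f1


# Illustrative 3-class matrix: one large class and two small ones.
confusion = [
    [90, 5, 5],
    [8, 2, 0],
    [4, 0, 6],
]

macro = macro_average(per_class_metrics(confusion))
micro = micro_average(confusion)
print("macro PRE/REC/F1:", [round(x, 4) for x in macro])
print("micro PRE/REC/F1:", [round(x, 4) for x in micro])
```

Because the large class is classified well and the small classes poorly, the micro average stays high while the macro average drops. This matches the pattern in Table 3, where the two models have nearly equal Micro-AVG but clearly different Macro-AVG.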
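| Category | Oversampling PRE | Oversampling REC | Oversampling F1 | QDBP PRE | QDBP REC | QDBP F1 | LSTM.MI PRE | LSTM.MI REC | LSTM.MI F1 | LSTM.PQDO PRE | LSTM.PQDO REC | LSTM.PQDO F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Geodo | 0.2365 | 0.9913 | 0.3819 | 0.2371 | 1.0000 | 0.3833 | 0.0000 | 0.0000 | 0.0000 | 0.3323 | 1.0000 | 0.4989 |
| Beebone | 0.9130 | 1.0000 | 0.9545 | 0.9767 | 1.0000 | 0.9882 | 1.0000 | 1.0000 | 1.0000 | 0.9130 | 1.0000 | 0.9545 |
| Murofet | 0.8146 | 0.7239 | 0.7665 | 0.8435 | 0.7709 | 0.8055 | 0.5330 | 0.7423 | 0.6205 | 0.8295 | 0.8572 | 0.8431 |
| Pykspa | 0.8877 | 0.7200 | 0.7951 | 0.8796 | 0.7503 | 0.8098 | 0.8023 | 0.7430 | 0.7715 | 0.8347 | 0.8863 | 0.8597 |
| Padcrypt | 0.7917 | 0.9913 | 0.8803 | 0.9829 | 1.0000 | 0.9914 | 1.0000 | 0.7500 | 0.8571 | 0.8809 | 0.9652 | 0.9211 |
| Ramnit | 0.8498 | 0.3919 | 0.5365 | 0.8376 | 0.8129 | 0.8250 | 0.6068 | 0.8062 | 0.6925 | 0.8356 | 0.8877 | 0.8609 |
| Volatile | 0.9707 | 1.0000 | 0.9851 | 0.9803 | 1.0000 | 0.9900 | 1.0000 | 1.0000 | 1.0000 | 0.9900 | 1.0000 | 0.9950 |
| Ranbyus | 0.7388 | 0.8301 | 0.7818 | 0.8239 | 0.8343 | 0.8290 | 0.3617 | 0.7073 | 0.4787 | 0.9366 | 0.7578 | 0.8378 |
| Qakbot | 0.7377 | 0.4338 | 0.5463 | 0.6574 | 0.5695 | 0.6103 | 0.7716 | 0.4350 | 0.5564 | 0.6887 | 0.6330 | 0.6597 |
| Simda | 0.5170 | 1.0000 | 0.6816 | 0.7143 | 0.9890 | 0.8295 | 0.9579 | 1.0000 | 0.9785 | 0.7970 | 0.9926 | 0.8841 |
| Ramdo | 0.9950 | 0.9975 | 0.9963 | 0.9877 | 1.0000 | 0.9938 | 0.9524 | 1.0000 | 0.9756 | 0.9974 | 0.9925 | 0.9949 |
| Suppobox | 0.3438 | 0.5500 | 0.4231 | 0.1867 | 0.7000 | 0.2947 | 0.4167 | 0.5000 | 0.4545 | 0.4000 | 0.5000 | 0.4444 |
| Locky | 0.3676 | 0.5745 | 0.4483 | 0.4389 | 0.5813 | 0.5001 | 0.0000 | 0.0000 | 0.0000 | 0.5409 | 0.4897 | 0.5140 |
| Tempedreve | 0.0256 | 0.2857 | 0.0470 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0588 | 0.0612 | 0.0600 |
| Qadars | 0.0488 | 0.5000 | 0.0889 | 0.5000 | 0.5000 | 0.5000 | 1.0000 | 0.6250 | 0.7692 | 0.4444 | 0.5000 | 0.4705 |
| Symmi | 0.9870 | 1.0000 | 0.9935 | 1.0000 | 1.0000 | 1.0000 | 0.5000 | 0.1538 | 0.2353 | 1.0000 | 1.0000 | 1.0000 |
| Banjori | 0.9985 | 1.0000 | 0.9992 | 0.9975 | 1.0000 | 0.9988 | 0.9988 | 1.0000 | 0.9994 | 0.9998 | 1.0000 | 0.9999 |
| Tinba | 0.9012 | 0.8374 | 0.8681 | 0.9006 | 0.9771 | 0.9372 | 0.8951 | 0.9961 | 0.9429 | 0.9121 | 0.9935 | 0.9510 |
| Hesperbot | 0.0119 | 0.2105 | 0.0224 | 0.1143 | 0.1053 | 0.1096 | 0.3333 | 0.0263 | 0.0488 | 0.4000 | 0.0526 | 0.0930 |
| Fobber | 0.0828 | 0.9083 | 0.1518 | 0.2445 | 0.6500 | 0.3554 | 0.0000 | 0.0000 | 0.0000 | 0.3534 | 0.6333 | 0.4537 |
| Dyre | 0.9666 | 0.9775 | 0.9720 | 0.9650 | 1.0000 | 0.9822 | 0.9816 | 1.0000 | 0.9907 | 0.9690 | 0.9987 | 0.9836 |
| Cryptowall | 0.0265 | 0.3333 | 0.0492 | 0.0177 | 0.4444 | 0.0340 | 0.6250 | 0.2632 | 0.3704 | 0.0142 | 0.0555 | 0.0227 |
| Corebot | 0.9333 | 1.0000 | 0.9655 | 0.9492 | 1.0000 | 0.9739 | 0.7500 | 0.6000 | 0.6667 | 0.9333 | 1.0000 | 0.9655 |
| Proslikefan | 0.1268 | 0.4551 | 0.1983 | 0.2153 | 0.1987 | 0.2067 | 0.3333 | 0.5500 | 0.4151 | 0.4356 | 0.2820 | 0.3424 |
| Bedep | 0.0651 | 0.5714 | 0.1170 | 0.7500 | 0.2571 | 0.3830 | 0.6875 | 0.3235 | 0.4400 | 0.5294 | 0.5142 | 0.5217 |
| Matsnu | 0.9000 | 1.0000 | 0.9474 | 0.4286 | 0.6667 | 0.5217 | 1.0000 | 0.7000 | 0.8235 | 0.8181 | 1.0000 | 0.9000 |
| Post | 0.9943 | 0.9983 | 0.9963 | 0.9922 | 0.9993 | 0.9957 | 0.9985 | 1.0000 | 0.9992 | 0.9987 | 0.9995 | 0.9991 |
| Necurs | 0.9734 | 0.7145 | 0.8241 | 0.9544 | 0.7619 | 0.8474 | 0.5248 | 0.1104 | 0.1824 | 0.9444 | 0.8130 | 0.8738 |
| Pushdo | 0.4976 | 0.9286 | 0.6480 | 0.7494 | 0.8720 | 0.8061 | 0.6571 | 0.6765 | 0.6667 | 0.8027 | 0.8839 | 0.8413 |
| Cryptolocker | 0.2473 | 0.6192 | 0.3535 | 0.4206 | 0.4767 | 0.4469 | 0.2000 | 0.0167 | 0.0308 | 0.5194 | 0.5109 | 0.5151 |
| Dircrypt | 0.0197 | 0.5139 | 0.0379 | 0.1048 | 0.2708 | 0.1512 | 0.0000 | 0.0000 | 0.0000 | 0.2597 | 0.1388 | 0.1809 |
| Shifu | 0.2209 | 0.8261 | 0.3486 | 0.3023 | 0.8478 | 0.4457 | 0.3711 | 0.7660 | 0.5000 | 0.3125 | 0.9782 | 0.4736 |
| Bamital | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.7500 | 0.8571 | 0.0000 | 0.0000 | 0.0000 |
| Kraken | 0.9389 | 0.7696 | 0.8459 | 0.8989 | 0.8019 | 0.8476 | 0.0800 | 0.0196 | 0.0315 | 0.8416 | 0.8135 | 0.8273 |
| Nymaim | 0.0579 | 0.7283 | 0.1072 | 0.0238 | 0.0326 | 0.0275 | 0.2989 | 0.2167 | 0.2512 | 0.0000 | 0.0000 | 0.0000 |
| Shiotob | 0.9815 | 0.9553 | 0.9682 | 0.9663 | 0.9964 | 0.9811 | 0.9741 | 0.9004 | 0.9358 | 0.9569 | 0.9952 | 0.9757 |
| W32.Virut | 0.4706 | 0.6667 | 0.5517 | 0.0815 | 0.9167 | 0.1497 | 0.7143 | 0.4167 | 0.5263 | 0.4782 | 0.9166 | 0.6285 |
| Micro-AVG | 0.7719 | 0.7719 | 0.7719 | 0.8646 | 0.8646 | 0.8646 | 0.8728 | 0.8775 | 0.8751 | 0.8902 | 0.8902 | 0.8902 |
| Macro-AVG | 0.5578 | 0.7298 | 0.5751 | 0.5979 | 0.6969 | 0.6095 | 0.6034 | 0.5350 | 0.5671 | 0.6475 | 0.7055 | 0.6580 |

Table 4 Performance comparison of oversampling, QDBP, LSTM.MI, and LSTM.PQDO.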