Twice Learning Based Semi-supervised Dictionary Learning for Software Defect Prediction
ZHANG Zhiwu1, JING Xiaoyuan1,2, WU Fei3
1. School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023
2. State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072
3. School of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210023
Abstract: When historical software repositories contain only a limited number of modules with known defect labels, building an effective prediction model is challenging. To address this problem, a twice learning based semi-supervised learning algorithm for software defect prediction is proposed. In the first learning stage, a sparse representation classifier assigns probabilistic soft labels to a large number of unlabeled samples, which are then added to the labeled training set. In the second learning stage, discriminative dictionary learning is performed on this extended dataset. Finally, defect proneness is predicted with the learned dictionary. Experiments on the widely used NASA MDP and PROMISE AR datasets indicate the superiority of the proposed algorithm.
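The first learning stage can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`matching_pursuit`, `soft_label`), the toy two-metric data, the greedy matching pursuit used in place of an l1 solver, and the inverse-residual rule for turning class-wise reconstruction residuals into probabilities are all assumptions made for illustration. The second stage (discriminative dictionary learning) is not shown.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v] if n > 0 else list(v)

def matching_pursuit(atoms, y, n_nonzero=3):
    """Greedy sparse code of y over unit-norm atoms.
    Assumption: stands in for the l1 solver normally used in SRC."""
    coef = [0.0] * len(atoms)
    residual = list(y)
    for _ in range(n_nonzero):
        corr = [dot(a, residual) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        if abs(corr[k]) < 1e-12:
            break
        coef[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coef

def soft_label(atoms, labels, y, classes=(0, 1)):
    """Return one probability per class for sample y.
    Assumption: probabilities are normalised inverse class residuals."""
    coef = matching_pursuit(atoms, y)
    residuals = []
    for c in classes:
        # Reconstruct y using only the coefficients of class c.
        recon = [0.0] * len(y)
        for atom, label, w in zip(atoms, labels, coef):
            if label == c:
                recon = [r + w * x for r, x in zip(recon, atom)]
        diff = [p - q for p, q in zip(y, recon)]
        residuals.append(math.sqrt(dot(diff, diff)))
    inv = [1.0 / (r + 1e-9) for r in residuals]
    total = sum(inv)
    return [v / total for v in inv]

# Hypothetical labeled modules described by two metrics each:
# class 0 = non-defective, class 1 = defective.
labeled = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]
atoms = [normalize(x) for x in labeled]

# An unlabeled module that resembles the non-defective ones.
unlabeled = normalize([0.95, 0.15])
probs = soft_label(atoms, labels, unlabeled)

# The soft-labeled sample would then join the training set for stage two.
extended_training_set = list(zip(labeled, labels)) + [([0.95, 0.15], probs)]
print(probs)
```

In this toy setup the unlabeled module is reconstructed almost entirely from class 0 samples, so most of its probability mass goes to the non-defective class. Keeping the full probability vector, rather than a hard label, lets the second stage weight uncertain samples less.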