Semi-supervised Ensemble Learning Based Software Defect Prediction
WANG Tiejian1, WU Fei2, JING Xiaoyuan1
1. State Key Laboratory of Software Engineering, School of Computer, Wuhan University, Wuhan 430072; 2. School of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210023
Abstract: Software defect prediction is often hampered by the scarcity of labeled modules and by the class imbalance of software defect data. To address these problems, a semi-supervised ensemble learning approach to software defect prediction is proposed. Semi-supervised ensemble learning exploits a large number of unlabeled modules to build high-performance classifiers, and combining a series of weak classifiers reduces the bias toward the majority class, which improves prediction on class-imbalanced data. To account for the cost of risk in software defect prediction, a sample weight vector updating strategy is employed to reduce the cost incurred by misclassifying defective modules as non-defective ones. Experimental results on the NASA MDP datasets indicate that the proposed approach improves software defect prediction capability.
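The abstract does not state the weight updating rule, so the following is only a minimal sketch of what a cost-sensitive sample weight update could look like. It assumes an AdaBoost-style round in which misclassified defective modules receive an extra cost multiplier. The function name `update_weights` and the parameters `cost_fn` and `cost_fp` are hypothetical, and the semi-supervised part (use of unlabeled modules) is not covered here.

```python
import math


def update_weights(weights, y_true, y_pred, cost_fn=2.0, cost_fp=1.0):
    """One cost-sensitive boosting round (illustrative assumption, not the paper's rule).

    Labels: 1 = defective, 0 = non-defective.
    cost_fn: hypothetical cost of predicting a defective module as non-defective.
    cost_fp: hypothetical cost of the opposite mistake.
    Returns the normalized weight vector and the weak classifier's vote alpha.
    """
    # Weighted error of the current weak classifier.
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1 - err) / err)

    new_weights = []
    for w, t, p in zip(weights, y_true, y_pred):
        if t == p:
            # Correctly classified: shrink the weight.
            factor = math.exp(-alpha)
        else:
            # Misclassified: grow the weight, scaled by the cost of this mistake.
            cost = cost_fn if t == 1 else cost_fp
            factor = cost * math.exp(alpha)
        new_weights.append(w * factor)

    # Normalize so the weights form a distribution again.
    norm = sum(new_weights)
    return [w / norm for w in new_weights], alpha


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 0, 0]
    y_pred = [0, 1, 1, 0, 0, 0]  # one missed defect, one false alarm
    w, alpha = update_weights([1 / 6] * 6, y_true, y_pred)
    print(w, alpha)
```

With `cost_fn` larger than `cost_fp`, a missed defective module ends the round with a larger weight than a false alarm, so the next weak classifier is pushed toward the costlier error. The actual cost values would have to be chosen for the project at hand.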