Abstract: Large-scale labeled sample sets are difficult to acquire, and labeling samples is costly. To address these problems, this paper proposes an active learning algorithm for support vector machines (SVM) based on tri-training semi-supervised learning and convex-hull vectors, combining semi-supervised learning with active learning. Firstly, the convex-hull vectors of the sample set are computed, and these samples, which are the most likely to become support vectors, are selected for labeling. In existing active learning methods, the unlabeled samples are no longer used once the most informative samples have been selected for labeling. Secondly, to solve this problem, tri-training-based semi-supervised learning is introduced into SVM active learning. The unlabeled samples that are classified with higher confidence are selected, assigned their predicted labels, and added to the training sample set, so that the information in the unlabeled samples that is useful to the learning machine is exploited. Experimental results on UCI datasets show that the proposed algorithm achieves higher classification accuracy with fewer labeled samples, improves generalization performance, and reduces the labeling cost of SVM training.
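The convex-hull selection step can be illustrated with a minimal sketch. It assumes two-dimensional samples and uses Andrew's monotone chain method, which is not necessarily the hull computation used in the paper; the function names `cross` and `convex_hull_vertices` are hypothetical. The hull vertices it returns play the role of the candidate samples that would be sent for labeling.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_vertices(points: List[Point]) -> List[Point]:
    """Return the convex-hull vertices of 2-D points in counter-clockwise order.

    Assumption for illustration: samples are 2-D. Collinear boundary points
    and interior points are discarded; only the hull vertices are kept as
    candidates for labeling.
    """
    pts = sorted(set(points))  # sort by x, then y; drop duplicates
    if len(pts) <= 2:
        return pts

    # Build the lower hull from left to right.
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the upper hull from right to left.
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


# Toy unlabeled pool: a square with one interior and one edge point.
pool = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
candidates = convex_hull_vertices(pool)
```

In this toy pool only the four corner points are returned, so the interior and edge points would not be queried for labels. For higher-dimensional data a general-purpose hull routine would be needed instead.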