Abstract: From the perspective of granular computing, a feature selection algorithm based on a granulation-fusion framework is proposed for massive high-dimensional data. Following the idea of the bag of little bootstraps (BLB), the original massive dataset is first granulated into small data subsets (granules). On each granule, least absolute shrinkage and selection operator (LASSO) models are then built on multiple bootstrap subsets to perform granule-level feature selection. Finally, the feature selection results of all granules are fused with different weights and ranked, yielding an ordered feature selection result for the original dataset. Experimental results on artificial and real datasets show that the proposed algorithm is feasible and effective for feature selection on massive high-dimensional data.
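To make the granulation-fusion workflow concrete, the following is a minimal sketch of the idea described in the abstract, not the authors' implementation: the data are granulated into small BLB-style subsets, several bootstrap LASSO models are fitted on each granule, and the per-granule selection frequencies are fused and ranked. The granule size exponent, the number of granules and bootstrap rounds, the LASSO penalty alpha, and the equal-weight fusion scheme are all illustrative assumptions rather than values taken from the paper.

```python
# Sketch of granulation-fusion feature selection (assumed parameters throughout).
import numpy as np
from sklearn.linear_model import Lasso

def granulation_fusion_select(X, y, n_granules=10, n_bootstrap=20, alpha=0.1):
    n, p = X.shape
    rng = np.random.default_rng(0)
    granule_size = int(n ** 0.6)          # BLB-style small granule, b = n^0.6 (assumed exponent)
    fused = np.zeros(p)

    for _ in range(n_granules):
        # Granulation: draw one small subset (granule) from the massive dataset.
        idx = rng.choice(n, size=granule_size, replace=False)
        Xg, yg = X[idx], y[idx]

        granule_score = np.zeros(p)
        for _ in range(n_bootstrap):
            # Fit a LASSO model on a bootstrap resample of the granule.
            bidx = rng.choice(granule_size, size=granule_size, replace=True)
            model = Lasso(alpha=alpha, max_iter=5000).fit(Xg[bidx], yg[bidx])
            granule_score += (model.coef_ != 0).astype(float)  # selection frequency

        # Fusion: equal granule weights here; the paper uses weighted fusion.
        fused += granule_score / n_bootstrap

    order = np.argsort(-fused)            # features ranked by fused selection score
    return order, fused / n_granules

# Example usage on synthetic data:
# X = np.random.randn(100000, 200); beta = np.zeros(200); beta[:5] = 3.0
# y = X @ beta + np.random.randn(100000)
# ranked, freq = granulation_fusion_select(X, y)
# print(ranked[:10])
```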