Abstract: Tumor classification based on gene expression profiles, which greatly facilitates accurate cancer diagnosis and subtype recognition, has drawn considerable attention in recent years. Because gene expression profiles have few samples, high dimensionality, substantial noise, and redundant data, it is difficult to mine biological knowledge from them thoroughly and accurately, and selecting informative genes for tumor classification is correspondingly hard. Therefore, an iterative Lasso-based approach to gene selection, called Gene Selection Based on Iterative Lasso (GSIL), is proposed to select an informative gene subset with fewer genes and better classification ability. The proposed algorithm involves two main steps. In the first step, a gene ranking algorithm, Signal Noise Ratio, is applied to select the top-ranked genes as the candidate gene subset, thereby eliminating irrelevant genes. In the second step, Iterative Lasso, an improved Lasso-based method, is employed to eliminate redundant genes. Experimental results on 5 public datasets validate the feasibility and effectiveness of the proposed algorithm and show that it achieves better classification ability than other gene selection methods.
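The two-step pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact Signal Noise Ratio formula or how Iterative Lasso iterates, so the code assumes the common two-class score |mean_pos - mean_neg| / (sd_pos + sd_neg) and a block-wise scheme in which genes that survive one Lasso pass are pooled with the next block of candidates. The function names (`snr_scores`, `lasso_cd`, `select_genes`) and the parameters `top_k`, `block`, `lam`, and `tol` are all hypothetical, and the Lasso solver is a plain coordinate-descent routine written out for self-containment.

```python
from statistics import mean, pstdev


def snr_scores(X, y):
    """Assumed SNR per gene: |mean_pos - mean_neg| / (sd_pos + sd_neg)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label != 1]
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row in pos]
        b = [row[g] for row in neg]
        spread = pstdev(a) + pstdev(b)
        scores.append(abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def standardize(column):
    """Center a column and scale it to unit variance (all zeros if constant)."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s if s > 0 else 0.0 for v in column]


def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate-descent Lasso on standardized genes.

    Minimizes (1/2n) * ||y_c - Z w||^2 + lam * ||w||_1, where Z holds the
    standardized columns of X and y_c is the centered label vector.
    """
    n, p = len(X), len(X[0])
    cols = [standardize([row[j] for row in X]) for j in range(p)]
    norms = [sum(v * v for v in col) / n for col in cols]
    y_mean = mean(y)
    resid = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            if norms[j] == 0:
                continue  # constant gene carries no information
            # Correlation of gene j with the residual that excludes gene j.
            rho = sum(c * r for c, r in zip(col, resid)) / n + norms[j] * w[j]
            # Soft-thresholding sets weakly correlated genes exactly to zero.
            shrunk = max(abs(rho) - lam, 0.0) * (1 if rho > 0 else -1)
            delta = shrunk / norms[j] - w[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                w[j] += delta
    return w


def select_genes(X, y, top_k=3, block=2, lam=0.05, tol=1e-8):
    """Step 1: keep the top_k genes by SNR. Step 2: block-wise Lasso passes
    (an assumed iteration scheme, not taken from the paper)."""
    scores = snr_scores(X, y)
    candidates = sorted(range(len(scores)), key=lambda g: -scores[g])[:top_k]
    kept = []
    for start in range(0, len(candidates), block):
        pool = kept + candidates[start:start + block]
        w = lasso_cd([[row[g] for g in pool] for row in X], y, lam)
        kept = [g for g, coef in zip(pool, w) if abs(coef) > tol]
    return kept
```

Here `X` is a list of samples (one list of expression values per sample) and `y` holds class labels with 1 marking the positive class. On a toy matrix containing an informative gene, a rescaled copy of it, and uninformative genes, the SNR filter discards the uninformative genes and the Lasso passes keep only one of the two redundant copies. A practical version would tune `lam` by cross-validation and evaluate the selected subset with a classifier.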