Gene Markers Identification Algorithm for Detecting Colon Cancer Patients
XIE Juanying¹, FAN Wen²
1. School of Computer Science, Shaanxi Normal University, Xi'an 710119
2. School of Software Engineering, University of Science and Technology of China, Suzhou 215123
Abstract: To find the few informative genes that carry strong classification information, and to identify colon cancer patients as accurately as possible, this paper proposes an algorithm for identifying gene markers of colon cancer. A density and a distance are first defined for each gene. All genes are then plotted in a 2D space with gene density on the X-axis and gene distance on the Y-axis, and the genes at high density peaks are selected to form the optimal gene subset. Next, the samples of the colon dataset, represented only by the genes in this subset, are clustered with the DP_K-medoids clustering algorithm. Distances between genes and between samples are computed in turn with the Euclidean, Manhattan, Chebyshev and cosine distances. Experimental results show that the proposed algorithm finds a gene subset for colon cancer that achieves high accuracy, sensitivity, specificity and Matthews correlation coefficient (MCC) while containing very few genes.
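The gene-selection step summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the formulas for gene density and distance, so the sketch assumes the Gaussian-kernel density and the distance to the nearest denser point that are commonly used in density-peak clustering. It also assumes a cutoff `dc` taken as a percentile of the pairwise distances, and it ranks genes by the product of density and distance instead of reading peaks off the 2D plot. The function names `pairwise_distances`, `density_peaks` and `select_genes`, the parameter `dc_percentile`, and the toy matrix are all hypothetical.

```python
import numpy as np


def pairwise_distances(X, metric="euclidean"):
    """Distance matrix between the rows of X (one row per gene)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if metric == "cosine":
        # Cosine distance = 1 - cosine similarity; assumes no all-zero row.
        norms = np.linalg.norm(X, axis=1)
        sim = (X @ X.T) / np.outer(norms, norms)
        D = 1.0 - np.clip(sim, -1.0, 1.0)
        np.fill_diagonal(D, 0.0)
        return D
    D = np.zeros((n, n))
    for i in range(n):
        diff = np.abs(X - X[i])  # |x_j - x_i| for every gene j
        if metric == "euclidean":
            D[i] = np.sqrt((diff ** 2).sum(axis=1))
        elif metric == "manhattan":
            D[i] = diff.sum(axis=1)
        elif metric == "chebyshev":
            D[i] = diff.max(axis=1)
        else:
            raise ValueError(f"unknown metric: {metric}")
    return D


def density_peaks(D, dc_percentile=2.0):
    """Return (rho, delta) for each gene from a distance matrix D.

    Assumed definitions (not taken from the paper):
      rho_i   = sum over j != i of exp(-(d_ij / dc)^2)
      delta_i = distance to the nearest gene ranked denser than gene i;
                for the densest gene, its largest distance to any gene.
    """
    n = D.shape[0]
    dc = np.percentile(D[np.triu_indices(n, k=1)], dc_percentile)
    if dc <= 0:
        raise ValueError("cutoff distance dc must be positive")
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # remove the self term
    order = np.argsort(-rho)  # gene indices, densest first
    delta = np.zeros(n)
    delta[order[0]] = D[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = D[i, order[:rank]].min()
    return rho, delta


def select_genes(rho, delta, k):
    """Pick the k genes with the largest rho * delta (an assumed criterion)."""
    return np.argsort(-(rho * delta))[:k]


if __name__ == "__main__":
    # Toy data: 10 "genes" measured on 2 "samples", forming two groups
    # centred on gene 0 and gene 5.
    X = np.array([[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1],
                  [10, 10], [11, 10], [9, 10], [10, 11], [10, 9]])
    rho, delta = density_peaks(pairwise_distances(X, "euclidean"))
    print(sorted(int(g) for g in select_genes(rho, delta, 2)))  # [0, 5]
```

On the toy matrix the two group centres have both the highest density and the largest distance, so they stand apart from the other genes in the density-distance plane. The samples restricted to the selected genes would then be passed to the clustering step, which this sketch does not cover.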