1.Advanced Control Systems Laboratory, School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing 100044 2.Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing 100700
Abstract:As a special classification problem, classification of Boolean sequences is seldom studied. Definitions of the ordering and piecewise mapping are given. And then a dimension-reduction method called ordering and piecewise mapping (OPM ) is put forward. Thus an improved KNN algorithm (OPM-KNN) is presented by integrating OPM with KNN. Analytical and experimental results show the speed of OPM method is improved compared with that of traditional PCA algorithm in dimension reduction. As for classification, the accurate rate of OPM-KNN is almost equivalent to the traditional KNN algorithm or appreciably higher than it and the speed is also faster.
[1] Kim K H. Boolean Matrix Theory and Application. New York, USA: Marcel Dekker, 1982 [2] Gelfand M S, Roytberg M A. Prediction of the Exon-Intron Structure by a Dynamic Programming Approach. Biosystems, 1993, 30(1/2/3): 173-182 [3] Pedro A M. Walsh-Fourier Analysis and Its Statistical Applications: Comment. Journal of the American Statistical Association, 1991, 86(414): 482-483 [4] Pique-Regi R, Ortega A, Asgharzadeh S. Sequential Diagonal Linear Discriminant Analysis (SeqDLDA) for Microarray Classification and Gene Identification // Proc of the IEEE Workshops and Poster Abstracts on Computational Systems Bioinformatics Conference. Stanford, USA, 2005: 112-113 [5] Selbig J, Mevissen T, Lengauer T. Decision Tree-Based Formation of Consensus Protein Secondary Structure Prediction. Bioinformatics, 1999, 15(12): 1039-1046 [6] Lü Feng, Zhang Weiwei. Research on the Characters of Four Sequential Patterns Mining Algorithms. Journal of Wuhan University of Technology, 2006, 28(2): 57-60 (in Chinese) (吕 锋,张炜玮.4种序列模式挖掘算法的特性研究.武汉理工大学学报, 2006, 28(2): 57-60) [7] Zhang Chunting, Lin Zhesuai, Yan Ming, et al. A Novel Approach to Distinguish between Intron-Containing and Intronless Genes Based on the Format of Z Curves. Journal of Theoretical Biology, 1998, 192(4): 467-473 [8] Samuel S G, Olga R, Chuong B D, et al. Training Conditional Random Fields for Maximum Labelwise Accuracy [EB/OL]. [2007-09-01]. http://books.nips.cc/nips19.html [9] Meng Dazhi. The Structure of DNA Sequence and Simple Model. Mathematics in Practice and Theory, 2001, 31(1): 54-58 (in Chinese) (孟大志.DNA序列中的结构与简化模型.数学的实践与认识, 2001, 31(1): 54-58) [10] Gyllenberg M, Koski T, Verlaan M. Clustering and Quantization of Binary Vectors with Stochastic Complexity // Proc of the IEEE International Symposium on Information Theory. Trondheim, Norway, 1994, Ⅴ: 390-396 [11] Gyllenberg M, Koski T, Verlaan M. Classification of Binary Vectors by Stochastic Complexity. Journal of Multivariate Analysis, 1997, 63(1): 47-72 [12] Gyllenberg M, Koski T. Probabilistic Models for Bacterial Taxonomy. International Statistical Review, 2001, 69(2): 249-276 [13] Frnti P, Xu Mantao, Krkkinen I, et al. Classification of Binary Vectors by Using ΔSC Distance to Minimize Stochastic Complexity. Pattern Recognition Letters, 2003, 24(1/2/3): 65-73 [14] Kak S C. Classification of Random Binary Sequences Using Walsh-Fourier Analysis. IEEE Trans on Electromagnetic Compatibility, 1971, 13(3): 74-77 [15] Wedelin D. Efficient Estimation and Model Selection in Large Graphical Models. Statistics and Computing, 1996, 6(4): 313-323 [16] Jansche M. A Maximum Expected Utility Framework for Binary Sequence Labeling // Proc of the 45th Annual Meeting of the Association for Computational Linguistics. Prague, Czech Republic, 2007, 736-743 [17] Beeferman D, Berger A, Lafferty J. Statistical Models for Text Segmentation. Machine Learning, 1999, 34(1/2/3): 177-210 [18] Dasarathy B V. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, USA: IEEE Computer Society Press, 1990 [19] St-Jacques C, Barrière C. Similarity Judgments: Philosophical, Psychological and Mathematical Investigations // Proc of the Workshop on Linguistic Distances. Sydney, Australia, 2006: 8-15 [20] Krzysztof J C, Lukasz A K. SPECT Heart Data [DB/OL]. [2001-10-1]. http://mlearn.ics.uci.edu/MLSummary.html [21] Kushmerick N. Internet Advertisements Data [DB/OL]. [2001-10-1]. http://mlearn.ics.uci.edu/MLSummary.html [22] Preparata F P, Shamos M I. Computational Geometry: An Introduction. 3rd Edition. New York, USA: Springer-Verlag, 1991