Abstract: A solution combining multiple data mining approaches, including k-means, C4.5, Naïve Bayes, Bayes net and Co-training, is proposed to deal with the major problems of intrusion detection datasets, such as class imbalance, class overlapping, noise, and data distribution. The experimental results show its validity.
ZHOU Quan, ZHAO Feng-Ying, WANG Chong-Jun, CHEN Shi-Fu. Study of Data Mining in Intrusion Detection [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2008, 21(4): 520-526.