Study of Data Mining in Intrusion Detection
ZHOU Quan, ZHAO Feng-Ying, WANG Chong-Jun, CHEN Shi-Fu
National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210093
Abstract: A solution combining multiple data mining approaches, including k-means, C4.5, Naïve Bayes, Bayesian networks, and co-training, is proposed to address the major problems of intrusion detection datasets, such as class imbalance, class overlap, noise, and distribution issues. The experimental results demonstrate its validity.
Received: 07 March 2007