Abstract: Macro-averaged precision, macro-averaged recall, and macro-averaged F1 are commonly used to evaluate classification techniques. However, these measures are sensitive to the dataset: a value obtained on one dataset is meaningful only for that dataset and does not carry over to others. To address this problem, three factors are proposed to describe how a dataset affects the classification result. A new evaluation measure for categorization, called newmacroF1, is then defined in terms of these three factors. Experimental results show that the new measure remains stable across different datasets and that, given an algorithm's performance on one dataset, its precision on other datasets can be estimated with the help of the new measure.
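For readers unfamiliar with the baseline measures named in the abstract, the following is a minimal sketch of macro-averaged precision, recall, and F1. It is not the paper's newmacroF1, whose three factors are not defined in this abstract. The function name `macro_scores` and the sample labels are hypothetical, and macro F1 is taken here as the mean of per-class F1 values; some authors instead compute F1 from the macro-averaged precision and recall.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1).

    Assumption: classes are the labels that occur in y_true, and
    macro F1 is the unweighted mean of the per-class F1 values.
    """
    labels = sorted(set(y_true))
    true_count = Counter(y_true)   # documents that belong to each class
    pred_count = Counter(y_pred)   # documents assigned to each class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Precision is defined as 0 when nothing was assigned to the class.
        p = tp[c] / pred_count[c] if pred_count[c] else 0.0
        r = tp[c] / true_count[c]
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    # Every class gets equal weight, regardless of how many documents it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the paper.
    y_true = ["a", "a", "a", "b"]
    y_pred = ["a", "a", "b", "b"]
    print(macro_scores(y_true, y_pred))
```

Because each class contributes equally to the average, the scores depend on which classes a dataset contains and how hard each one is, which is one way to see the dataset sensitivity the abstract describes.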
GONG Bihong, PENG Bo. Study of Stability of Text Classification Evaluation [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2008, 21(1): 12-17.