Abstract: Blocking degrades the performance of hierarchical text classification. This paper proposes a two-step hierarchical text classification model based on prior knowledge of blocking to address the problem. First, the blocking distribution is estimated, and a blocking pair recognition technique is presented that focuses on mining the directions in which blocking is most serious. Second, the topology of the hierarchy is actively refined using the blocking prior knowledge, in an attempt to correct misclassifications and reduce blocking errors. Experimental results on TanCorp, a new corpus designed for Chinese text classification, show that the model improves performance significantly without increasing the number of classifiers, and that it offers a way to address the blocking problem in hierarchical classification. In addition, the method performs stably in comparison with flat text classification algorithms.
LI Wen, MIAO Duo-Qian, WEI Zhi-Hua, WANG Wei-Li. Hierarchical Text Classification Model Based on Blocking Priori Knowledge[J]. Pattern Recognition and Artificial Intelligence, 2010, 23(4): 456-463.