Abstract: The optimal structure of the topic set can be learned automatically from data with the Hierarchical Dirichlet Process (HDP) topic model. However, the resulting topic set may not meet semantic requirements, and in some labeled topic models the parameters are difficult to set. Therefore, based on known semantic labels and their certainty degrees, a semi-supervised labeled HDP topic model (SLHDP) and an accuracy evaluation index for random clustering are proposed in this paper. Higher weight is given to the known semantic labels, and, exploiting the Dirichlet process property that a finite space can be partitioned infinitely, the model is constructed via the Chinese restaurant process. Experimental results on several Chinese and English datasets show that the SLHDP model yields a more reasonable topic set for text classification on large-scale datasets.
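To illustrate the construction the abstract refers to, the following is a minimal sketch of the basic Chinese restaurant process seating rule that underlies HDP-style models, not the paper's full SLHDP sampler; the function name `crp_sample_tables` and the choice of concentration parameter are illustrative assumptions.

```python
import numpy as np

def crp_sample_tables(n_customers, alpha, rng=None):
    """Sample a table assignment for each customer under a Chinese
    restaurant process with concentration parameter alpha.

    Illustrative sketch only: in an HDP topic model, tables correspond
    to topics, and the number of distinct tables grows with the data
    rather than being fixed in advance.
    """
    rng = np.random.default_rng(rng)
    tables = []   # tables[i] = table index assigned to customer i
    counts = []   # counts[k] = number of customers seated at table k
    for _ in range(n_customers):
        # Sit at an existing table with probability proportional to its
        # occupancy, or open a new table with probability proportional
        # to alpha (the "rich get richer" dynamic).
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # a new table (topic) is created
        else:
            counts[k] += 1
        tables.append(k)
    return tables

# Example: seat 100 customers with alpha = 1.0.
assignments = crp_sample_tables(100, alpha=1.0, rng=0)
print(f"{max(assignments) + 1} tables used")
```

In a labeled variant such as SLHDP, the seating probabilities would additionally be reweighted by the known semantic labels and their certainty degrees, so that labeled documents favor tables consistent with their labels.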