Abstract: The assumption that all data features are equally important in the categorical data sequential information bottleneck (CD-sIB) algorithm lowers the quality of the data transformation. A weighted binary transformation method is proposed to reveal the features of non-co-occurrence data by highlighting the representative features and suppressing the redundant ones. Two weighting rules, applicability to stochastically distributed data and non-supervision of the weighting scheme, are introduced. Then, the weighted categorical data sequential information bottleneck (WCD-sIB) algorithm is presented based on the concept of weighting granularity. Experimental results show that the weighted binary transformation method generates a good co-occurrence data representation, and that the WCD-sIB algorithm is superior to the compared algorithms.
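The following is a minimal, hedged sketch of the kind of weighted binary transformation the abstract describes: categorical attributes are expanded into binary (one-hot) indicator features, each feature is scaled by an unsupervised weight, and the result is normalized into a co-occurrence-style joint distribution suitable as sIB input. The function name, the entropy-based weighting rule, and the example data are illustrative assumptions, not the paper's actual scheme.

```python
# Sketch only: the entropy-based weights below are an assumed stand-in for the
# paper's weighting rules; any unsupervised per-attribute weight could be used.
import numpy as np

def weighted_binary_transform(X, weights=None):
    """X: (n_objects, n_attributes) array of categorical values.
    Returns a normalized matrix usable as a joint distribution p(object, feature)."""
    n, m = X.shape
    columns, col_weights = [], []
    for j in range(m):
        values = np.unique(X[:, j])
        # Binary (one-hot) indicator for each value of attribute j.
        onehot = (X[:, j][:, None] == values[None, :]).astype(float)
        # Illustrative unsupervised weight: attributes whose values are spread
        # evenly (high entropy) are treated as less discriminative.
        p = onehot.mean(axis=0)
        entropy = -(p * np.log(p + 1e-12)).sum()
        w = 1.0 / (1.0 + entropy) if weights is None else weights[j]
        columns.append(onehot)
        col_weights.append(np.full(len(values), w))
    B = np.hstack(columns) * np.concatenate(col_weights)[None, :]
    return B / B.sum()  # normalize so entries form a joint distribution

# Example: 4 objects described by 2 categorical attributes.
X = np.array([["red", "small"], ["red", "large"], ["blue", "small"], ["blue", "small"]])
P = weighted_binary_transform(X)
print(P.shape)  # (4, 4): one column per attribute value, weighted and normalized
```

A representation of this form can then be clustered by a sequential IB procedure, which is where the weighting granularity mentioned in the abstract would come into play.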