1.合肥工业大学 计算机与信息学院 合肥 230009。 2.Department of Computer Science, University of Vermont, Burlington, VT 05405, USA
Frequent Pattern Mining from Biological Sequences Based on Score Matrix
YUAN Ermao1, GUO Dan1, HU Xuegang1, WU Xindong1,2
1.School of Computer and Information, Hefei University of Technology, Hefei 230009. 2.Department of Computer Science, University of Vermont, Burlington, VT 05405
Abstract:Mining significant frequent patterns from biological sequences is an important task in bioinformatics. An algorithm of mining approximate frequent pattern based on score matrix (MAPS) is proposed. Firstly, approximate matching score matrix (MSM) is constructed to handle insertion, replacement and deletion operations with interval constraints. Secondly, the approximate pattern matching based on score matrix (S-APM) scheme is designed to obtain counts of approximate occurrences of each pattern. Finally, a data driven pattern generation method and an Apriori-like rule are adopted to avoid unnecessary candidate patterns. Experiments on protein and DNA sequences show that the MAPS produces better performance, and it can be used to discover co-occurrence frequent patterns among different sequences.
[1] FERREIRA P G, AZEVEDO P J. Protein Sequence Pattern Mining with Constraints // Proc of the 9th European Conference on Prin-ciples and Practice of Knowledge Discovery in Databases. Berlin, Germany: Springer, 2005: 96-107. [2] HAN J W, PEI J, MORTAZAVI-ASL B, et al. FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining // Proc of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM, 2000: 355-359. [3] WONG T L, LAM W. Learning to Adapt Web Information Extraction Knowledge and Discovering New Attributes via a Bayesian Approach. IEEE Trans on Knowledge and Data Engineering, 2010, 22(4): 523-536. [4] KURTZ S, SCHLEIERMACHER C. REPuter: Fast Computation of Maximal Repeats in Complete Genomes. Bioinformatics, 1999, 15(5): 426-427. [5] SAGOT M F, ESCALIER V, VIARI A, et al. Searching for Repea-ted Words in a Text Allowing for Mismatches and Gaps // Proc of the 2nd South American Workshop on String Processing. Vinasdel Mar, Chili: University of Chili, 1995: 87-100. [6] BOEVA V, REGNIER M, PAPATSENKO D, et al. Short Fuzzy Tandem Repeats in Genomic Sequences, Identification, and Po-ssible Role in Regulation of Gene Expression. Bioinformatics, 2006, 22(6): 676-684. [7] 王海平,胡学钢,谢 飞,等.模式特征对带有通配符和长度约束的模式匹配问题的影响.模式识别与人工智能, 2012, 25(6): 1013-1021. (WANG H P, HU X G, XIE F, et al. Impact of Pattern Feature on Pattern Matching Problem with Wildcards and Length Constraints. Pattern Recognition and Artificial Intelligence, 2012, 25 (6): 1013-1021.) [8] 黄少滨,刘国峰,万庆生,等.一种基于部分已验证匹配关系的模式匹配模型.自动化学报, 2013, 39(10): 1642-1652. (HUANG S B, LIU G F, WAN Q S, et al. A Schema Matching Model Based on Partial Verified Matching Relations. Acta Automa-tica Sinica, 2013, 39(10): 1642-1652.) [9] WU X D, ZHU X Q, HE Y, et al. PMBC: Pattern Mining from Biological Sequences with Wildcard Constraints. Computers in Biology and Medicine, 2013, 43(5): 481-492. [10] 张 治,车皓阳,施鹏飞.模式匹配问题的描述框架与算法模型.模式识别与人工智能, 2006, 19(6): 715-721. (ZHANG Z, CHE H Y, SHI P F. Framework and Algorithm Mo-del of Schema Matching Problem. Pattern Recognition and Artificial Intelligence, 2006, 19(6): 715-721.) [11] WU Y X, WANG L L, REN J D, et al. Mining Sequential Pa-tterns with Periodic Wildcard Gaps. Applied Intelligence, 2014, 41(1): 99-116. [12] ZHANG M H, KAO B, CHEUNG D W, et al. Mining Periodic Patterns with Gap Requirement from Sequences // Proc of the ACM SIGMOD International Conference on Management of Data. New York, USA: ACM, 2005: 623-633. [13] 吴信东,谢 飞,黄咏明,等.带通配符和One-Off条件的序列模式挖掘.软件学报, 2013, 24(8): 1804-1815. (WU X D, XIE F, HUANG Y M, et al. Mining Sequential Pa-tterns with Wildcards and the One-Off Condition. Journal of Software, 2013, 24(8): 1804-1815.) [14] 王 乐,熊松泉,常艳芬,等.基于模式增长方式的高效用模式挖掘算法.自动化学报, 2015, 41(9): 1616-1626. (WANG L, XIONG S Q, CHANG Y F, et al. An Algorithm for Mining High Utility Patterns Based on Pattern-Growth. Acta Automatica Sinica, 2015, 41(9): 1616-1626.) [15] ALTSCHUL S F, GISH W, MILLER W, et al. Basic Local Alignment Search Tool. Journal of Molecular Biology, 1990, 215(3): 403-410. [16] HE D, ZHU X Q, WU X D. Mining Approximate Repeating Pa-tterns from Sequence Data with Gap Constraints. Computational Intelligence, 2011, 27(3): 336-362. [17] GUI J, WANG S L, LEI Y K. Multi-step Dimensionality Reduction and Semi-supervised Graph-Based Tumor Classification Using Gene Expression Data. Artificial Intelligence in Medicine, 2010, 50(3): 181-191. [18] YOU Z H, LEI Y K, GUI J, et al. Using Manifold Embedding for Assessing and Predicting Protein Interactions from High-Throughput Experimental Data. Bioinformatics, 2010, 26(21): 2744-2751. [19] WANG S L, LI X L, ZHANG S W, et al. Tumor Classification by Combining PNN Classifier Ensemble with Neighborhood Rough Set Based Gene Reduction. Computers in Biology and Medicine, 2010, 40(2): 179-189. [20] YOU Z H, WANG S L, GUI J, et al. A Novel Hybrid Method of Gene Selection and Its Application on Tumor Classification // Proc of the 4th International Conference on Intelligent Computing: Advanced Intelligent Computing Theories Applications. Berlin, Germany: Springer, 2008: 1055-1068.