Construction of Phylogenetic Tree of Flu Virus Proteins Based on Coarse Graining
LI Yang, TANG Xuqing
School of Science, Jiangnan University, Wuxi 214122
Abstract Based on coarse-graining theory, a method for constructing a phylogenetic tree of flu virus proteins is proposed using a total of 127 065 hemagglutinin and neuraminidase protein sequences. First, to determine the appropriate granularity, a feature vector is derived to represent each virus protein sequence, and a hierarchical structure of the virus system is then constructed by analyzing the similarity among multiple protein sequences. The suitable number of clusters is determined according to a hierarchical evaluation index based on this structure. Next, following the nearest-to-center principle, significant viruses are selected to represent the characteristics of the whole class. Finally, the phylogenetic tree is established from the distance metric. The test results indicate that influenza viruses with the same host, similar time span, close outbreak location and the same name are more likely to belong to the same branch. These results are consistent with those in the existing literature on flu viruses and provide a foundation for investigating the mutation, evolution and prediction of flu viruses.
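The nearest-to-center step can be illustrated with a minimal sketch. The abstract does not specify the feature vector or the distance metric, so the sketch assumes a 20-dimensional amino acid composition vector and Euclidean distance; the function names and the toy sequence fragments are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import dist

# The 20 standard amino acids, in a fixed order (assumed feature layout).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return the amino acid frequency vector of a protein sequence."""
    counts = Counter(sequence)
    # Count only standard residues; avoid division by zero for empty input.
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_to_center(vectors):
    """Return the index of the vector closest to the cluster centroid."""
    n = len(vectors)
    center = [sum(column) / n for column in zip(*vectors)]
    return min(range(n), key=lambda i: dist(vectors[i], center))


# Toy cluster of made-up sequence fragments (not real influenza data).
cluster = ["MKTIIALSYI", "MKTIIALSYA", "MKAIIVLLYT"]
vectors = [composition_vector(s) for s in cluster]
representative = cluster[nearest_to_center(vectors)]
```

Under these assumptions, `representative` is the member of the cluster whose feature vector lies closest to the cluster mean, which would then stand in for the whole class when pairwise distances between classes are computed for the tree.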
Received: 15 May 2015
Fund: Supported by the National Natural Science Foundation of China (No. 11371174) and the International Science and Technology Cooperation Program of China (No. 2011DFR70500)
About the authors: LI Yang, born in 1991, master's student. His research interests include intelligent computing and bioinformatics. TANG Xuqing (corresponding author), born in 1963, Ph.D., professor. His research interests include intelligent computing, bioinformatics, and modeling and simulation of ecological systems.