A Survey of Machine Learning Algorithms for Big Data
HE Qing1, LI Ning1,2,3, LUO Wen-Juan1,2, SHI Zhong-Zhi1
1.Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190 2.University of Chinese Academy of Sciences, Beijing 100049 3.College of Mathematics and Computer Science, Hebei University, Baoding 071002
Abstract:With the explosive growth of the industry data, more and more attention is paid to big data. However, due to the volume, complex and fast-changing characteristics of big data, traditional machine learning algorithms for small data are not applicable. Therefore, developing machine learning algorithms for big data is a research focus. In this paper, the state-of-the-art machine learning techniques for big data are introduced and analyzed. As parallelism is a mainstream strategy for applying machine learning algorithms to big data, some parallelism strategies are described in detail as well. Finally, the challenges of applying machine learning to big data and some interesting research trends of machine learning in big data are pointed out.
何清,李宁,罗文娟,史忠植. 大数据下的机器学习算法综述[J]. 模式识别与人工智能, 2014, 27(4): 327-336.
HE Qing, LI Ning, LUO Wen-Juan, SHI Zhong-Zhi. A Survey of Machine Learning Algorithms for Big Data. , 2014, 27(4): 327-336.
[1] Labrinidis A, Jagadish H V. Challenges and Opportunities with Big Data. Proc of the VLDB Endowment, 2012, 5(12): 2032-2033 [2] Bizer C, Boncz P, Brodie M L, et al. The Meaningful Use of Big Data: Four Perspectives-Four Challenges. ACM SIGMOD Record, 2012, 40(4): 56-60 [3] Li G J, Cheng X Q. Research Status and Scientific Thinking of Big Data. Bulletin of Chinese Academy of Sciences, 2012, 27(6): 647-657 (in Chinese) (李国杰,程学旗.大数据研究:未来科技及经济社会发展的重大战略领域——大数据的研究现状与科学思考.中国科学院院刊, 2012, 27(6): 647-657) [4] Wang F Y. A Big-Data Perspective on AI: Newton, Merton, and Analytics Intelligence. IEEE Intelligent Systems, 2012, 27(5): 2-4 [5] Simon H A. Why Should Machines Learn? // Michalski R S, Carbonell J G, Mitchell T M, et al., eds. Machine Learning: An Artificial Intelligence Approach. Berlin, Germany: Springer, 1983: 25-37 [6] Hart P. The Condensed Nearest Neighbor Rule. IEEE Trans on Information Theory, 1968, 14(3): 515-516 [7] Gates G. The Reduced Nearest Neighbor Rule. IEEE Trans on Information Theory, 1972, 18(3): 431-433 [8] Brighton H, Mellish C. Advances in Instance Selection for Instance-Based Learning Algorithms. Data Mining and Knowledge Discovery, 2002, 6(2): 153-172 [9] Li Y H, Maguire L. Selecting Critical Patterns Based on Local Geometrical and Statistical Information. IEEE Trans on Pattern Analysis and Machine Intelligence, 2011, 33(6): 1189-1201 [10] Angiulli F. Fast Nearest Neighbor Condensation for Large Data Sets Classification. IEEE Trans on Knowledge and Data Engineering, 2007, 19(11): 1450-1464 [11] Angiulli F, Folino G. Distributed Nearest Neighbor-Based Condensation of Very Large Data Sets. IEEE Trans on Knowledge and Data Engineering, 2007, 19(12): 1593-1606 [12] Jordan M I. Divide-and-Conquer and Statistical Inference for Big Data // Proc of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Beijing, China, 2012. DOI:10.1145/2339530.2339534 [13] Kolda T G, Sun J M. Scalable Tensor Decompositions for Multi-aspect Data Mining // Proc of the 8th IEEE International Conference on Data Mining. Pisa, Italy, 2008: 363-372 [14] Wahba G. Dissimilarity Data in Statistical Model Building and Machine Learning // Proc of the 5th International Congress of Chinese Mathematicians. Beijing, China, 2012: 785-809 [15] Hoi C H, Wang J L, Zhao P L, et al. Online Feature Selection for Mining Big Data // Proc of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. Beijing, China, 2012: 93-100 [16] Sagheer A, Tsuruta N, Taniguchi R I, et al. Fast Feature Extraction Approach for Multi-dimension Feature Space Problems // Proc of the 18th International Conference on Pattern Recognition. Hong Kong, China, 2006, III: 417-420 [17] Anaraki J R, Eftekhari M. Improving Fuzzy-Rough Quick Reduct for Feature Selection // Proc of the 19th Iranian Conference on Electrical Engineering. Tehran, Iran, 2011: 1-6 [18] Quevedo J R, Bahamonde A, Luaces O. A Simple and Efficient Method for Variable Ranking according to Their Usefulness for Learning. Computational Statistics & Data Analysis, 2007, 52(1): 578-595 [19] Gheyas I A, Smith L S. Feature Subset Selection in Large Dimensionality Domains. Pattern Recognition, 2010, 43(1): 5-13 [20] Pal M, Foody G M. Feature Selection for Classification of Hyperspectral Data by SVM. IEEE Trans on Geoscience and Remote Sensing, 2010, 48(5): 2297-2307 [21] Sun Y J, Todorovic S, Goodison S. Local-Learning-Based Feature Selection for High-Dimensional Data Analysis. IEEE Trans on Pattern Analysis and Machine Intelligence, 2010, 32(9): 1610-1626 [22] Hua J P, Tembe W D, Dougherty E R. Performance of Feature-Selection Methods in the Classification of High-Dimension Data. Pattern Recognition, 2009, 42(3): 409-424 [23] Song M, Yang H, Siadat S H, et al. A Comparative Study of Dimensionality Reduction Techniques to Enhance Trace Clustering Performances. Expert Systems with Applications, 2013, 40(9): 3722-3737 [24] Lau K W, Wu Q H. Online Training of Support Vector Classifier. Pattern Recognition, 2003, 36(8): 1913-1920 [25] Laskov P, Gehl C, Krüger S, et al. Incremental Support Vector Learning: Analysis, Implementation and Applications. Journal of Machine Learning Research, 2006, 7: 1909-1936 [26] Huang K, Yang H, King L, et al. Maxi-Min Margin Machine: Learning Large Margin Classifiers Locally and Globally. IEEE Trans on Neural Networks, 2008, 19(2): 260-272 [27] Kim B J. A Classifier for Big Data // Proc of the 6th International Conference on Convergence and Hybrid Information Technology. Daejeon, Republic of Korea, 2012: 505-512 [28] Franco-Arcega A, Carrasco-Ochoa J A, Sánchez-Díaz G, et al. Building Fast Decision Trees from Large Training Sets. Intelligent Data Analysis, 2012, 16(4): 649-664 [29] Hang Y, Fong S. Incrementally Optimized Decision Tree for Noisy Big Data // Proc of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. Beijing, China, 2012: 36-44 [30] Ben-Haim Y, Tom-Tov E. A Streaming Parallel Decision Tree Algorithm. Journal of Machine Learning Research, 2010, 11: 849-872 [31] Huang G B, Zhu Q Y, Siew C K. Extreme Learning Machine: Theory and Applications. Neurocomputing, 2006, 70(1/2/3): 489-501 [32] Liu N, Wang H. Ensemble Based Extreme Learning Machine. IEEE Signal Processing Letters, 2010, 17(8): 754-757 [33] He Q, Shang T F, Zhuang F Z, et al. Parallel Extreme Learning Machine for Regression Based on MapReduce. Neurocomputing, 2013, 102: 52-58 [34] Zhang R, Lan Y, Huang G B, et al. Universal Approximation of Extreme Learning Machine with Adaptive Growth of Hidden Nodes. IEEE Trans on Neural Networks and Learning Systems, 2012, 23(2): 365-371 [35] Rong H J, Huang G B, Sundararajan N, et al. Online Sequential Fuzzy Extreme Learning Machine for Function Approximation and Classification Problems. IEEE Trans on Systems, Man and Cybernetics, 2009, 39(4): 1067-1072 [36] Yang Y M, Wang X N, Yuan X F. Bidirectional Extreme Learning Machine for Regression Problem and Its Learning Effectiveness. IEEE Trans on Neural Networks and Learning Systems, 2012, 23(9): 1498-1505 [37] Li M, Zhou Z H. Improve Computer-Aided Diagnosis with Machine Learning Techniques Using Undiagnosed Samples. IEEE Trans on Systems, Man and Cybernetics, 2007, 37(6): 1088-1098 [38] Lin Y Q, Lü F J, Zhu S H, et al. Large-Scale Image Classification: Fast Feature Extraction and SVM Training // Proc of the IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA, 2011: 1689-1696 [39] Ling X, Xue G R, Dai W Y, et al. Can Chinese Web Pages Be Classified with English Data Source? // Proc of the 17th International Conference on World Wide Web. Beijing, China, 2008: 969-978 [40] Havens T C, Bezdek J C, Leckie C, et al. Fuzzy c-means Algorithms for Very Large Data. IEEE Trans on Fuzzy Systems, 2012, 20(6): 1130-1146 [41] Xue Z H, Shen G, Li J H, et al. Compression-Aware I/O Performance Analysis for Big Data Clustering // Proc of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. Beijing, China, 2012: 45-52 [42] Hall L O. Exploring Big Data with Scalable Soft Clustering // Proc of the 6th International Conference on Soft Methods in Probability and Statistics. Konstanz, Germany, 2012: 11-15 [43] Zhao W Z, Ma H F, He Q. Parallel k-means Clustering Based on MapReduce // Proc of the 1st International Conference on Cloud Computing and Big Data. Beijing, China, 2009: 674-679 [44] Papadimitriou S, Sun J M. DisCo: Distributed Co-clustering with MapReduce: A Case Study towards Petabyte-Scale End-to-End Mining // Proc of the 8th IEEE International Conference on Data Mining. Pisa, Italy, 2008: 512-521 [45] Zhang C, Li F F, Jeffrey J. Efficient Parallel kNN Joins for Large Data in MapReduce // Proc of the 15th International Conference on Extending Database Technology. Berlin, Germany, 2012: 38-49 [46] Ferreira C R L, Junior T C, Traina A J M, et al. Clustering Very Large Multi-dimensional Datasets with MapReduce // Proc of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, USA, 2011: 690-698 [47] Havens T C, Chitla R, Jain A K, et al. Speedup of Fuzzy and Possibilistic Kernel c-means for Large-Scale Clustering // Proc of the IEEE International Conference on Fuzzy Systems. Taipei, China, 2011: 463-470 [48] Niu D L, Dy J G, Jordan M I. Dimensionality Reduction for Spectral Clustering // Proc of the 14th International Conference on Artificial Intelligence and Statistics. Fort Lauderdale, USA, 2011: 552-560 [49] Kriegel H P, Krger P, Zimek A. Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering. ACM Trans on Knowledge Discovery from Data, 2009, 3(1): 1-58 [50] Vidal R. Subspace Clustering. IEEE Trans on Signal Processing, 2011, 28(2): 52-68 [51] Zhou Y, Cheng H, Yu J X. Graph Clustering Based on Structural/Attribute Similarities. Proc of the VLDB Endowment, 2009, 2(1): 718-729 [52] Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules in Large Databases // Proc of the 20th International Conference on Very Large Data Bases. Santiago de Chile, Chile, 1994: 487-499 [53] Agrawal R, Srikant R. Mining Sequential Patterns // Proc of the 11th International Conference on Data Engineering. Taipei, China, 1995: 3-14 [54] Srikanth R, Agrawal R. Mining Sequential Patterns: Generalizations and Performance Improvements // Proc of the 5th International Conference on Extending Database Technology: Advances in Database Technology. Avignon, France, 1996: 3-17 [55] Zaki M J. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001, 42(1/2): 31-60 [56] Han J W, Kamber M, Pei J. Data Mining Concepts and Techniques. 2nd Edition. New York, USA: Morgan Kaufmann, 2006 [57] Pei J, Han J W, Pinto H, et al. Prefixspan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth // Proc of the 17th International Conference on Data Engineering. Heidelberg, Germany, 2001: 215-224 [58] Lin M Y, Lee S Y. Fast Discovery of Sequential Patterns by Memory Indexing // Proc of the 4th International Conference on Data Warehousing and Knowledge Discovery. Aix-en-Provence, France, 2002: 150-160 [59] Garofalakis M N, Rastogi R, Shim K. Spirit: Sequential Pattern Mining with Regular Expression Constraints // Proc of the 25th International Conference on Very Large Data Bases. Edinburgh, Scotland, 1999: 223-234 [60] Li N, Zeng L, He Q, et al. Parallel Implementation of Apriori Algorithm Based on MapReduce // Proc of the 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing. Kyoto, Japan, 2012: 191-200 [61] Zhang M H, Kao B, Cheung D W, et al. Efficient Algorithms for Incremental Update of Frequent Sequences // Proc of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Taipei, China, 2002: 186-197 [62] Parthasarathy S, Zaki M J, Ogihara M, et al. Incremental and Interactive Sequence Mining // Proc of the 8th International Conference on Information and Knowledge Management. Kansas City, USA, 1999: 251-258 [63] Masseglia F, Poncelet P, Teisseire M. Incremental Mining of Sequential Patterns in Large Databases. Data & Knowledge Engineering, 2003, 46(1): 97-121 [64] Zheng Q G, Xu K, Ma S L, et al. The Algorithms of Updating Sequential Patterns[EB/OL]. [2013-05-20]. http://arxiv.org/ftp/cs/papers/0203/0203027.pdf [65] Wang C Y, Hong T P, Tseng S S. Maintenance of Sequential Patterns for Record Deletion // Proc of the IEEE International Conference on Data Mining. San Jose, USA, 2001: 536-541 [66] Zheng Q G, Xu K, Ma S L. When to Update the Sequential Patterns of Stream Data? // Proc of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Seoul, Republic of Korea, 2003: 545-550 [67] Upadhyaya S R. Parallel Approaches to Machine Learning-A Comprehensive Survey. Journal of Parallel and Distributed Computing, 2013, 73(3): 284-292 [68] Luo D J, Ding C, Huang H. Parallelization with Multiplicative Algorithms for Big Data Mining // Proc of the 12th IEEE International Conference on Data Mining. Brussels, Belgium, 2012: 489-498 [69] Shim K. MapReduce Algorithms for Big Data Analysis. Proc of the VLDB Endowment, 2012, 5(12): 2016-2017 [70] Zhang J B, Li T R, Pan Y. Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data // Proc of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. Beijing, China, 2012: 20-27 [71] Hefeeda M, Gao F, Abd-Almageed W. Distributed Approximate Spectral Clustering for Large-Scale Datasets // Proc of the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing. Delft, the Netherlands, 2012: 223-234 [72] Palit I, Reddy C K. Scalable and Parallel Boosting with MapReduce. IEEE Trans on Knowledge and Data Engineering, 2012, 24(10): 1904-1916 [73] Kaiser C, Pozdnoukhov A. Enabling Real-Time City Sensing with Kernel Stream Oracles and MapReduce. Pervasive and Mobile Computing, 2013, 9(5): 708-721 [74] Yan F, Xu N Y, Qi Y. Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units // Proc of the 22nd Annual Conference on Neural Information Processing Systems. Whistler, Canada, 2009: 2134-2142 [75] Jung G, Gnanasambandam N, Mukherjee T. Synchronous Parallel Processing of Big-Data Analytics Services to Optimize Performance in Federated Clouds // Proc of the 5th IEEE International Conference on Cloud Computing. Hawaii, USA, 2012: 811-818 [76] He Q, Zhuang F Z, Li J C, et al. Parallel Implementation of Classification Algorithms Based on MapReduce // Proc of the 5th International Conference on Rough Set and Knowledge Technology. Beijing, China, 2010: 655-662 [77] He Q, Tan Q, Ma X D, et al. The High-Activity Parallel Implementation of Data Preprocessing Based on MapReduce // Proc of the 5th International Conference on Rough Set and Knowledge Technology. Beijing, China, 2010: 646-654 [78] He Q, Wang Q, Du C Y, et al. A Parallel Hyper-Surface Classifier for High Dimensional Data // Proc of the 3rd International Symposium on Knowledge Acquisition and Modeling. Wuhan, China, 2010: 338-343 [79] He Q, Ma Y L, Wang Q, et al. Parallel Outlier Detection Using KD-Tree Based on MapReduce // Proc of the 3rd International Conference on Cloud Computing Technology and Science. Athens, Greece, 2011: 75-80 [80] He Q, Wang Q, Zhuang F Z, et al. Parallel CLARANS Clustering Based on MapReduce. Energy Procedia, 2011, 13: 3269-3279 [81] Tan Q, He Q, Shi Z Z. Parallel Max-Min Ant System Using MapReduce // Proc of the 3rd International Conference on Swarm Intelligence. Shenzhen, China, 2012: 182-189 [82] Cheng Y, Qin C J, Rusu F. GLADE: Big Data Analytics Made Easy // Proc of the ACM SIGMOD International Conference on Management of Data. Scottsdale, USA, 2012: 697-700