A MapReduce-based big data clustering using swarm-inspired meta-heuristic algorithms

Document Type: Article

Authors

1 Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Najafabad, Iran

2 Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Najafabad, Iran; Big Data Research Center, Najafabad Branch, Islamic Azad University, Najafabad, Iran

Abstract

Clustering is an important method in data analysis. Clustering big data is difficult because of the volume of data and the computational complexity of clustering algorithms, so methods that can cluster large amounts of data in a reasonable time are required. MapReduce is a powerful programming model that allows parallel algorithms to run in distributed computing environments. In this study, an improved artificial bee colony (ABC) clustering algorithm based on the MapReduce model (MR-CWABC) is proposed. Using a weighted average of the results, without greedy selection, improves both the local and the global search of ABC. The improved algorithm is implemented according to the MapReduce model on the Hadoop framework and assigns samples to clusters so that the compactness and separation of the clusters are preserved. The proposed method is compared with well-known bio-inspired algorithms such as particle swarm optimization (PSO), artificial bee colony (ABC), and gravitational search algorithm (GSA), each implemented according to the MapReduce model on the Hadoop framework. The results show that MR-CWABC is well suited to big data while maintaining clustering quality. In terms of average F-measure, MR-CWABC improves on MR-CABC, MR-CPSO, and MR-CGSA by 7.13%, 7.71%, and 6.77%, respectively.
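To illustrate how a swarm-based clustering method can fit the MapReduce model, the following minimal sketch evaluates one candidate solution (a set of cluster centroids) in a map step and a reduce step. It is not the MR-CWABC implementation: the abstract does not state the objective function, so the sum of Euclidean distances to the nearest centroid is an assumption, and the names `map_phase`, `reduce_phase`, and `fitness` are hypothetical. The sketch runs in plain Python on in-memory lists rather than on Hadoop.

```python
from collections import defaultdict
from math import dist


def map_phase(split, centroids):
    """Map: emit (cluster_index, distance) for every sample in one data split."""
    for sample in split:
        distances = [dist(sample, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        yield nearest, distances[nearest]


def reduce_phase(pairs):
    """Reduce: sum the emitted distances per cluster index."""
    totals = defaultdict(float)
    for cluster, distance in pairs:
        totals[cluster] += distance
    return dict(totals)


def fitness(splits, centroids):
    """Assumed objective: total distance of samples to their nearest centroid."""
    pairs = (pair for split in splits for pair in map_phase(split, centroids))
    return sum(reduce_phase(pairs).values())


# Toy data: two splits stand in for data blocks stored on different nodes.
splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
good = [(0.0, 0.5), (10.0, 10.5)]   # centroids placed inside the two groups
bad = [(5.0, 5.0), (6.0, 6.0)]      # centroids placed between the groups

print(fitness(splits, good), fitness(splits, bad))
```

In a swarm algorithm, each candidate solution (for example, a food source in ABC) would be scored this way, with lower values indicating more compact clusters. The map step is independent per split, which is what allows the evaluation to be distributed.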

