Machine Learning Theory and Practice, 2022, 3(2); doi: 10.38007/ML.2022.030201.

An Improved Random Forest Algorithm Based on Spark

Author(s)

Liming Zhang

Corresponding Author

Liming Zhang

Affiliation(s)

Jilin International Studies University, Changchun 130117, China

Abstract

Classification algorithms are an important branch of data mining and are of great importance in the era of big data (BD). The random forest algorithm (RFA) is a classification algorithm that is widely used across industries for its good classification performance. However, the RFA performs poorly on high-dimensional data and on unbalanced data. The main objective of this paper is to improve the RFA based on Spark. We first survey the relevant algorithm literature to gain a comprehensive understanding of feature selection and unbalanced classification: what they are, what characteristics they have, and how these problems can be solved, with particular attention to the solutions proposed by domestic and international scholars. We then analyse the strengths and weaknesses of the RFA and make improvements that address its two weaknesses. The algorithm is optimized separately for feature selection and for unbalanced classification, and a parallelized design of the optimized algorithm is finally implemented on Spark.

Keywords

Spark Distributed Platform, Random Forests, Optimization Algorithms, Parallelized Design

Cite This Paper

Liming Zhang. An Improved Random Forest Algorithm Based on Spark. Machine Learning Theory and Practice (2022), Vol. 3, Issue 2: 1-11. https://doi.org/10.38007/ML.2022.030201.
