Optimizing Real-Time Data Pipelines with a Hybrid CatBoost-XGBoost Framework in Apache Spark

Authors

  • Bhaskar Yakkanti, MGM Resorts, USA
  • Naveen Kumar Siripuram, CVS Health, USA
  • Prabhu Krishnaswamy, Oracle Corp, USA

Keywords

CatBoost, XGBoost, Apache Spark, real-time data pipelines, predictive analytics

Abstract

Predictive analytics in large-scale environments is enhanced by integrating real-time data pipelines with advanced machine learning frameworks. This study introduces a hybrid framework that utilises the complementary strengths of XGBoost and CatBoost within Apache Spark to optimise predictive performance, and demonstrates how the synergy of the two gradient boosting algorithms improves model accuracy, computational efficiency, and scalability in real-time decision-making applications.
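The abstract does not spell out how the two boosters are combined inside Spark, so the following is a minimal sketch, assuming a simple probability-averaging ensemble scored through a PySpark pandas UDF. The feature names, model settings, and equal 50/50 weighting are illustrative assumptions, not the paper's configuration.

# Hypothetical sketch of the hybrid idea: train XGBoost and CatBoost offline,
# then blend their predicted probabilities inside a Spark pandas UDF so that
# incoming micro-batches can be scored in parallel across executors.
import pandas as pd
from catboost import CatBoostClassifier
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

spark = SparkSession.builder.appName("hybrid-gbm-sketch").getOrCreate()

# Offline training on historical data (a synthetic stand-in here).
X, y = make_classification(n_samples=5000, n_features=4, random_state=0)
train = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
xgb_model = XGBClassifier(n_estimators=200, max_depth=6).fit(train, y)
cb_model = CatBoostClassifier(iterations=200, depth=6, verbose=False).fit(train, y)

@pandas_udf("double")
def hybrid_score(f0: pd.Series, f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # Both fitted models are serialized with the UDF and shipped to each
    # executor; each Arrow batch is scored locally, then averaged.
    feats = pd.concat([f0, f1, f2, f3], axis=1)
    feats.columns = ["f0", "f1", "f2", "f3"]
    p_xgb = xgb_model.predict_proba(feats)[:, 1]
    p_cb = cb_model.predict_proba(feats)[:, 1]
    return pd.Series(0.5 * p_xgb + 0.5 * p_cb)  # equal weights: an assumption

# Batch demo; the same UDF applies unchanged to a Structured Streaming DataFrame.
df = spark.createDataFrame(train.assign(label=y))
df.withColumn("score", hybrid_score("f0", "f1", "f2", "f3")).select("label", "score").show(5)

Because a pandas UDF receives whole Arrow batches rather than single rows, both boosters score vectorised chunks, which is the usual way to keep per-record latency low when the same scoring logic is attached to a streaming DataFrame.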




Published

04-06-2022

How to Cite

[1]
Bhaskar Yakkanti, Naveen Kumar Siripuram, and Prabhu Krishnaswamy, “Optimizing Real-Time Data Pipelines with a Hybrid CatBoost-XGBoost Framework in Apache Spark”, Newark J. Hum. Centric AI Robot Inter., vol. 2, pp. 188–223, Jun. 2022. [Online]. Available: https://njhcair.org/index.php/publication/article/view/31