Reinforcement-Learning Scheduler for Multi-Tenant Spark Clusters under Privacy Constraints
Keywords:
reinforcement learning, differential privacy, Apache Spark
Abstract
This work proposes a reinforcement-learning-based scheduler for multi-tenant Apache Spark clusters that operate under tight privacy constraints. Our deep Q-learning scheduler balances three competing goals in large-scale data processing: job completion time, privacy budget management under differential privacy guarantees, and infrastructure cost. The scheduler learns from workloads that resemble business workloads, with noise added for differential privacy. Scheduling is modeled as a Markov Decision Process (MDP), and the agent searches for the best scheduling policy. The policy's rewards capture the trade-offs among performance, cost-effectiveness, and privacy. Compared with the FIFO and FAIR schedulers, our approach improves throughput, privacy budget allocation, and task latency control across a wide range of batch and streaming applications. Our results indicate that reinforcement learning works well for adaptive scheduling in distributed computing systems that serve multiple tenants with privacy requirements.
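As an illustration of how the three goals could enter a single reward signal, the following minimal sketch combines completion time, privacy budget spent, and cost into one weighted penalty and applies a Q-learning update. It is not the scheduler described above: it uses a tabular Q-table as a stand-in for the deep Q-network, and the names `Outcome`, `reward`, `q_update`, the weights, and the state and action labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical result of one scheduling decision."""
    completion_time_s: float  # job completion time in seconds
    epsilon_spent: float      # differential-privacy budget consumed
    cost_usd: float           # infrastructure cost


def reward(o: Outcome, w_time: float = 1.0, w_privacy: float = 1.0,
           w_cost: float = 1.0) -> float:
    """Weighted penalty: lower time, budget use, and cost give higher reward.

    The linear form and the weights are assumptions, not the paper's reward.
    """
    return -(w_time * o.completion_time_s
             + w_privacy * o.epsilon_spent
             + w_cost * o.cost_usd)


def q_update(q: dict, state, action, r: float, next_state, actions,
             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """One tabular Q-learning step (stand-in for a deep Q-network update)."""
    # Best estimated value reachable from the next state; unseen pairs are 0.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Move the old estimate toward the bootstrapped target r + gamma * best_next.
    q[(state, action)] = old + alpha * (r + gamma * best_next - old)
    return q[(state, action)]


if __name__ == "__main__":
    actions = ["run_tenant_a", "run_tenant_b"]  # hypothetical action labels
    q_table = {}
    r = reward(Outcome(completion_time_s=10.0, epsilon_spent=0.5, cost_usd=2.0))
    q_update(q_table, "queue_state_0", "run_tenant_a", r, "queue_state_1", actions)
    print(q_table)
```

Raising `w_privacy` relative to the other weights would make the agent more reluctant to spend privacy budget, which is one simple way to express the trade-off the abstract describes.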