Apache Spark and Kubernetes
Apache Spark is a distributed data processing engine that can read data from a range of data sources and perform computations on that data in a distributed manner. A Spark application is a set of program instances running on different physical or virtual nodes: a master instance called the “driver” program and worker instances called “executor” programs.
Since the executor programs need to run on different nodes, Apache Spark needs a “cluster manager,” which is responsible for launching executor instances as requested by the driver program. Apache Spark delegates launching the executors to the selected cluster manager, and Kubernetes is one such cluster manager.
So, Apache Spark can run on Kubernetes using the built-in Kubernetes cluster manager. There is also the spark-operator project, which makes running Spark on Kubernetes even easier by letting you define Spark applications and their parameters in Kubernetes terms, via custom resource definitions (CRDs).
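To make this concrete, here is a rough sketch of creating a spark-operator SparkApplication from Go using the dynamic client and an unstructured object. The field names follow the operator’s v1beta2 CRD, but treat the concrete values (image, main class, jar path, kubeconfig location) as placeholders rather than a recommended setup.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (placeholder location).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A minimal SparkApplication object; all values are illustrative only.
	app := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "sparkoperator.k8s.io/v1beta2",
		"kind":       "SparkApplication",
		"metadata":   map[string]interface{}{"name": "spark-pi", "namespace": "default"},
		"spec": map[string]interface{}{
			"type":                "Scala",
			"mode":                "cluster",
			"image":               "apache/spark:3.5.0",
			"mainClass":           "org.apache.spark.examples.SparkPi",
			"mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples.jar",
			"sparkVersion":        "3.5.0",
			"driver":              map[string]interface{}{"cores": int64(1), "memory": "512m"},
			"executor":            map[string]interface{}{"cores": int64(1), "instances": int64(2), "memory": "512m"},
		},
	}}

	gvr := schema.GroupVersionResource{Group: "sparkoperator.k8s.io", Version: "v1beta2", Resource: "sparkapplications"}
	if _, err := client.Resource(gvr).Namespace("default").Create(context.TODO(), app, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```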
Scaling Apache Spark
Apache Spark has many optimizations to speed up execution, built up over the years as needs arose, ranging from computational optimizations to I/O optimizations. One of its handy features is dynamic executor allocation. Typically, a Spark application is configured with a static number of executors before the driver program starts, but we can instead tell Spark to choose the number of executors dynamically based on the workload. This is called dynamic allocation, and it’s Apache Spark’s horizontal scaling feature.
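As a quick illustration, the sketch below launches a hypothetical application with dynamic allocation enabled through spark-submit configuration properties; on Kubernetes, shuffle tracking is typically enabled as well, since there is no external shuffle service. The image, executor bounds, and jar path are placeholders.

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Illustrative spark-submit invocation; image and jar path are placeholders.
	args := []string{
		"--master", "k8s://https://kubernetes.default.svc",
		"--deploy-mode", "cluster",
		"--conf", "spark.kubernetes.container.image=apache/spark:3.5.0",
		// Horizontal scaling: let Spark decide how many executors to run.
		"--conf", "spark.dynamicAllocation.enabled=true",
		"--conf", "spark.dynamicAllocation.minExecutors=1",
		"--conf", "spark.dynamicAllocation.maxExecutors=10",
		// Needed on Kubernetes, where there is no external shuffle service.
		"--conf", "spark.dynamicAllocation.shuffleTracking.enabled=true",
		"local:///opt/spark/examples/jars/spark-examples.jar",
	}

	cmd := exec.Command("spark-submit", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```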
However, Apache Spark currently lacks a vertical scaling feature: we can’t tell it to increase executor CPU and memory based on the application’s needs. I’ll clarify what I mean by this in a minute.
Kubernetes VPA (Vertical Pod Autoscaler)
Kubernetes has a feature called the “Vertical Pod Autoscaler.” When enabled, it monitors the resource usage of applications, along with several other factors, and then recommends optimal resource amounts for each particular application. For example, if we allocate 512 MB of memory to a hypothetical application that never uses more than 256 MB, VPA will recommend (roughly speaking) 256 MB instead of 512 MB to save on costs. The opposite is also possible: if we allocate 512 MB but our application frequently needs more than that, VPA will tell us to allocate more, say 1024 MB, to improve performance.
Using VPA with Apache Spark
If you’re reading this, you probably expect me to say, “Well, we use VPA to add a vertical autoscaling feature to Apache Spark. Bye.” Unfortunately, it is not that simple.
VPA is a generic feature that can be used with any workload, including Apache Spark applications. However, enabling VPA alone is not sufficient for Apache Spark because Spark has its own resource configuration parameters (such as spark.executor.memory and spark.executor.cores). Even if we scale up the executor pods, the executor process inside will not use the extra capacity unless we set the corresponding Spark configuration parameters. Spark applications are unaware of the outer world; they are designed to be agnostic about where they run.
So, we must take additional actions to ensure that our Spark applications benefit from VPA predictions (a.k.a. VPA recommendations).
Integrating VPA with Apache Spark
We must consider how Spark applications and VPA interact on a Kubernetes cluster.
Firstly, we need to create VPA objects to match our Spark applications. Those VPA objects will aggregate metrics, and some recommendations will become available as we continue running our applications. Then, we need to apply (or allow VPA to apply) those recommendations to our Spark application pods so they can benefit from the optimized resource amounts.
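Roughly, a VPA object per Spark application group might look like the sketch below, created with the dynamic client and an unstructured object (the exact Go types depend on which VPA client library you use). The names and the target deployment are placeholders; note updateMode: "Off", which keeps VPA in recommendation-only mode.

```go
package vpa

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// CreateRecommendationOnlyVPA creates a VerticalPodAutoscaler that only produces
// recommendations (updateMode "Off"); applying them is left to us.
// targetDeployment is a placeholder for whatever controller object the VPA tracks.
func CreateRecommendationOnlyVPA(ctx context.Context, client dynamic.Interface, namespace, appGroup, targetDeployment string) error {
	vpa := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "autoscaling.k8s.io/v1",
		"kind":       "VerticalPodAutoscaler",
		"metadata":   map[string]interface{}{"name": appGroup + "-vpa", "namespace": namespace},
		"spec": map[string]interface{}{
			"targetRef": map[string]interface{}{
				"apiVersion": "apps/v1",
				"kind":       "Deployment",
				"name":       targetDeployment,
			},
			"updatePolicy": map[string]interface{}{
				"updateMode": "Off", // recommendations only; VPA never evicts or patches pods itself
			},
		},
	}}

	gvr := schema.GroupVersionResource{Group: "autoscaling.k8s.io", Version: "v1", Resource: "verticalpodautoscalers"}
	_, err := client.Resource(gvr).Namespace(namespace).Create(ctx, vpa, metav1.CreateOptions{})
	return err
}
```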
While doing that manually might be feasible for a single Spark application, it is not for thousands of them. So, we need to automate this process properly.
Designing the integration
Various design approaches could fit what we want to achieve here. My main design principles will be as follows:
- Don’t allow VPA itself to modify any pods so we can have more granular control over applying recommendations.
- Do not stop or restart running Spark applications to scale them because this could increase the total completion time for that run. Scale up/down only just before starting the pods.
- Allow users to easily tag their Spark applications to indicate instances of the same application so that their metrics can be aggregated and separated from the others.
- Make the integration an opt-in feature, disabled by default unless the user explicitly wants to use it.
I implemented the integration as a Kubernetes mutating admission webhook to satisfy these goals.
Kubernetes mutating admission webhook
Kubernetes is an extremely flexible and modular system that allows customizing almost every part of it. We can use the admission webhook feature to inspect pod creation requests for Spark applications just before the pods are actually created and modify them according to our needs, so the pods are created with our modifications applied.
When we implement the admission webhook and deploy it to our Kubernetes cluster, the Kubernetes API server will ask our webhook endpoint what to do with each pod creation request. We can reject or modify the request and return a response to the API server, which then applies the modifications we tell it to make.
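For completeness, registering such a webhook looks roughly like this, using the Go types from k8s.io/api/admissionregistration/v1. The service reference, endpoint path, and opt-in label are placeholders I made up for illustration; the objectSelector is what makes the whole integration opt-in, in line with the design goals above.

```go
package webhook

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NewMutatingWebhookConfiguration builds a webhook registration that intercepts
// pod CREATE requests, but only for pods carrying the (hypothetical) opt-in label.
// caBundle is the CA that signed the webhook server's TLS certificate.
func NewMutatingWebhookConfiguration(caBundle []byte) *admissionregistrationv1.MutatingWebhookConfiguration {
	path := "/mutate-pods"                                      // hypothetical endpoint path
	failurePolicy := admissionregistrationv1.Ignore             // don't block pod creation if the webhook is down
	sideEffects := admissionregistrationv1.SideEffectClassNone

	return &admissionregistrationv1.MutatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "spark-vpa-webhook"},
		Webhooks: []admissionregistrationv1.MutatingWebhook{{
			Name:                    "spark-vpa.example.com",
			AdmissionReviewVersions: []string{"v1"},
			FailurePolicy:           &failurePolicy,
			SideEffects:             &sideEffects,
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "spark-vpa", Name: "spark-vpa-webhook", Path: &path,
				},
				CABundle: caBundle,
			},
			Rules: []admissionregistrationv1.RuleWithOperations{{
				Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Create},
				Rule: admissionregistrationv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
				},
			}},
			// Opt-in: only pods explicitly labeled by the user are ever mutated.
			ObjectSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"spark-vpa/enabled": "true"},
			},
		}},
	}
}
```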
When my webhook endpoint captures a pod creation request, it creates the corresponding VPA object if it doesn’t exist yet. It also checks for available recommendations. If there are any, they can be applied to the pod by modifying resource requests/limits and Spark-specific environment variables like SPARK_EXECUTOR_MEMORY and SPARK_EXECUTOR_CORES. Making selective decisions based on other criteria is also possible if needed; we are entirely free to modify the pod creation request at hand.
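Here is a trimmed-down sketch of what that mutation can look like: decode the AdmissionReview, rewrite the executor container’s memory request/limit and the SPARK_EXECUTOR_MEMORY environment variable from a recommendation, and return a JSON patch. The recommendation lookup, the VPA/dummy-deployment bookkeeping, and the hard-coded "1433Mi" value are all placeholders.

```go
package webhook

import (
	"encoding/json"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
)

// patchOp is a single JSON Patch operation (RFC 6902).
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// mutatePods rewrites executor pod resources based on a recommendation.
// The recommendation lookup is elided; "1433Mi" stands in for a VPA target value.
func mutatePods(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	recommendedMemory := "1433Mi" // placeholder for the VPA target recommendation

	var patches []patchOp
	for i, c := range pod.Spec.Containers {
		if c.Name != "spark-kubernetes-executor" { // default executor container name; may differ with pod templates
			continue
		}
		patches = append(patches,
			// Assumes Spark already set memory requests/limits, which it does by default.
			patchOp{"replace", fmt.Sprintf("/spec/containers/%d/resources/requests/memory", i), recommendedMemory},
			patchOp{"replace", fmt.Sprintf("/spec/containers/%d/resources/limits/memory", i), recommendedMemory},
			// Tell the executor JVM about its new budget; Spark won't pick it up from the cgroup.
			// A real implementation would replace an existing env entry instead of blindly appending.
			patchOp{"add", fmt.Sprintf("/spec/containers/%d/env/-", i),
				corev1.EnvVar{Name: "SPARK_EXECUTOR_MEMORY", Value: recommendedMemory}},
		)
	}

	patchBytes, _ := json.Marshal(patches)
	patchType := admissionv1.PatchTypeJSONPatch
	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patchBytes,
		PatchType: &patchType,
	}
	_ = json.NewEncoder(w).Encode(review)
}
```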
Resolving a VPA challenge
VPA allows you to target specific controller objects (like Deployments) directly, but not pods. So we can’t define an arbitrary label selector to match our Spark application group; we need a controller object. This introduces a challenge for the integration.
Fortunately, VPA doesn’t check whether the controller object is the actual owner of the pods; it only uses the controller’s label selector to match pods. As a workaround, we can create dummy deployment objects with zero replicas and set them as VPA targets to match arbitrary pods.
The webhook also handles creating a dummy deployment object when a pod creation request is inspected. We need to be careful here to ensure we can map dummy deployments to VPA objects and Spark pods.
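A sketch of that dummy deployment: zero replicas, and a selector matching the label our users put on their Spark pods (here a hypothetical spark-vpa/app-group label). The VPA object from the earlier sketch can then point its targetRef at this deployment.

```go
package webhook

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// EnsureDummyDeployment creates a zero-replica Deployment whose selector matches
// the Spark application group's pods. VPA only accepts controller objects as
// targets, so this deployment exists purely to carry the label selector.
func EnsureDummyDeployment(ctx context.Context, client kubernetes.Interface, namespace, appGroup string) error {
	replicas := int32(0)
	labels := map[string]string{"spark-vpa/app-group": appGroup} // hypothetical label users set on Spark pods

	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: appGroup + "-dummy", Namespace: namespace},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas, // never actually runs any pods
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "placeholder", Image: "registry.k8s.io/pause:3.9"}},
				},
			},
		},
	}

	_, err := client.AppsV1().Deployments(namespace).Create(ctx, dep, metav1.CreateOptions{})
	return err
}
```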
Volcano batch scheduler as an additional challenge
Kubernetes has its default scheduler, of course. It simply assigns pods to the available nodes, one pod at a time, without any application-level logic tying related pods together.
However, that is not enough for distributed processing engines like Spark, which consist of multiple parts that need to run together. If the driver program can be scheduled but the executors cannot, the computation won’t start at all, and we waste time and resources. The default Kubernetes scheduler is not designed to resolve such problems.
Volcano is a batch scheduler designed to solve exactly this. It was originally developed at Huawei, made open source, and is now an incubating CNCF project. Volcano supports “gang scheduling,” meaning it will schedule the driver and executor pods together, or will not schedule any of them if there are not enough resources to do so.
Volcano can be used with both bare Spark on Kubernetes workloads and spark-operator workloads; we only need to set a few parameters, as sketched below.
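If I recall the knobs correctly, for bare Spark on Kubernetes (Spark 3.3+) it comes down to the configuration properties below, taken from the Spark “Running on Kubernetes” docs; with the spark-operator, the equivalent is setting batchScheduler: volcano on the SparkApplication spec. Treat this as a pointer, not an exhaustive setup.

```go
package sparkvolcano

// VolcanoSubmitConf lists the spark-submit options that switch scheduling to Volcano.
var VolcanoSubmitConf = []string{
	"--conf", "spark.kubernetes.scheduler.name=volcano",
	"--conf", "spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep",
	"--conf", "spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep",
}
```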
It is easy to enable Volcano with Spark on Kubernetes (or spark-operator), but it introduces another challenge for our integration. We can keep inspecting pod creation requests and modifying the pods according to the VPA recommendations; however, this introduces inconsistencies with the Volcano-enabled workflow, because Volcano considers total resource requirements while scheduling/assigning pods to the nodes. If we modify pod resources behind its back, we could end up with unschedulable pods and break Volcano internals.
Designing with Volcano in mind
Volcano creates “PodGroup” objects (a CRD defined by Volcano itself) for each Spark application and then uses the parameters defined on PodGroups to make scheduling decisions. So, if we could modify PodGroups according to our needs, there shouldn’t be any inconsistencies or problems when using Volcano.
We can add another endpoint to our webhook to listen for Volcano PodGroup creation requests. We can also do the same things we did to process pod creation requests, except this time, we need to modify the parameters corresponding to the total resource amounts on the PodGroup.
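A rough sketch of that second endpoint’s core logic: take the PodGroup (scheduling.volcano.sh/v1beta1) from the admission request as an unstructured object and overwrite spec.minResources with recommendation-adjusted totals. Computing those totals from the per-pod recommendations is elided here, and the example values are placeholders.

```go
package webhook

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// adjustPodGroupMinResources rewrites the gang-scheduling resource totals on a
// Volcano PodGroup so they stay consistent with the VPA-adjusted pod sizes.
// The totals passed in are assumed to be precomputed elsewhere.
func adjustPodGroupMinResources(pg *unstructured.Unstructured, totalCPU, totalMemory string) error {
	minResources := map[string]interface{}{
		"cpu":    totalCPU,    // e.g. "6"
		"memory": totalMemory, // e.g. "9Gi"
	}
	return unstructured.SetNestedMap(pg.Object, minResources, "spec", "minResources")
}
```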
Notes on implementation details
Go is the de facto language of Kubernetes and its extensions, such as admission webhooks. It integrates with Kubernetes effortlessly and brings performance and tooling advantages.
I implemented the first PoC in Scala since the team is familiar with Scala due to heavy Spark-related work. Still, I wasted a substantial amount of time just on serialization issues. Then I switched to Go, and the project grew smoothly, even though I learned Go for this project; it was my first Go project.
I believe development should stay local as much as possible. This approach increases productivity since the feedback cycle is shorter and the environment is safer to experiment in. So, I used K3d to set up a local Kubernetes test environment; I preferred K3d over Minikube for its performance and can recommend it.
I’ve crafted several scripts and a long Makefile to make manual testing easier and reproducible. Then, I created e2e tests combining various targets defined in my Makefile. This way, I can test multiple real-world scenarios in an end-to-end environment (e.g., actually deploying Volcano, spark-operator, etc.).
Conclusion
Integrating Kubernetes VPA with Apache Spark turned out to be more complex than it initially seemed. While VPA is a great feature, adapting it to Spark’s needs requires careful design and automation to make the solution practical and reliable.
This project gave me valuable knowledge and experience with the Go language, Kubernetes internals and testing, VPA, Volcano, spark-operator, automation, and the surrounding ecosystem. It also reinforced the importance of choosing the right tools and designing with the bigger picture in mind.
I believe in the power of hands-on experience and learning. It makes the process enjoyable, effective, and long-lasting. I’m grateful for the opportunity to work on this project.
There’s still room to improve and expand this integration, and I’m excited to take on new challenges in this space.
Thanks for reading. See you in the following article!