At Cognitree, most of our customer projects involve analysing a large amount of data to gather insights. These projects involve design and development of ETL pipelines, real-time analysis, report generation and training models using statistical learning algorithms.
The solutions are often built using open source tools and although the components of the big data stack remain the same there are always minor variations across the use-cases.
In this blog post, we will list the typical challenges faced by developers in setting up a big data stack for application development. We will also talk about how we leveraged Kubernetes to overcome these challenges.
Our typical big data stack includes
- Kafka: An ingest engine which serves as a distributed messaging system to handle a large stream of events from several clients at scale and is well suited for real-time streaming use cases.
- Spark: Data processing framework to execute data pipelines including machine learning workloads.
- Elasticsearch: Datastore with search and analytics capabilities powering analytics dashboard and real-time analysis at scale.
- Kibana: Flexible visualization platform that enables real-time analysis using ad-hoc queries and building dashboards.
- HDFS: Acts as the primary data store where all the raw data is stored and supports the batch processing and analytics workloads.
A good amount of effort is spent in setting up the stack on various environments including dev, QA, stage, and production. The typical challenge in a developer environment includes resource constraints, configuring clusters and connecting components to each other. Often a much-simplified variation of the production environment is used to develop and test the code.
Working with big data applications, we have often seen that even a slight change in cluster topology affects the overall performance of the application. The code itself presents with new challenges when deployed in a clustered environment.
There was a need to automate and set up developer environments closer to the production environment and flexible enough to scale across various development/testing scenarios like functional testing and performance bottleneck analysis etc.
The following are some of the development environment constraints:
- One click deployment and cleanup.
- Multiple deployment options for streaming and batch pipelines with different stack components.
- Ability to scale, fine-tune and modify the cluster topology.
- Ability to change the application code and re-deploy the code quickly with minimal interruption to the setup.
- Look into logs to quickly debug application issues.
- Application metrics to quickly analyze the source of the performance bottleneck.
Big data stack on Kubernetes
We explored containers and evaluated various orchestrating tools and Kubernetes appeared to be the defacto standard for stateless application and microservices.
Kubernetes is an open-source container-orchestration system for automating deployments, scaling and management of containerized applications. In case you are new to Kubernetes, I would recommend reading our earlier blog post on What is Kubernetes.
As part of the big data stack, the goal was to setup Kafka, Elasticsearch, Spark, and HDFS on Kubernetes and submit Spark jobs to the cluster.
Containers are inherently stateless whereas big data applications are truly stateful. With Kubernetes (>= 1.5) introducing the concept of StatefulSets and Spark (>= 2.3.0) providing direct support to submit Spark jobs, setting up a stateful big data stack on Kubernetes is now made possible.
We set up a single node Kubernetes (v1.10) cluster on our MacBooks. There were two options enabling one-click deployment of Kubernetes on Mac.
We chose docker edge over minikube since docker can now run natively on Mac/Windows and is lightweight as compared to running minikube on Virtualbox.
Picking the right docker image
One of the challenges was to pick the right docker image for each component of the stack with enough configurable options. We wanted to leverage community images to minimize the effort on building and maintaining our own docker image.
After evaluating available docker images in the community, we decided to pick the following community images for our big data stack.
We leveraged the concept of StatefulSets and dynamic provisioning of volumes to setup component like Kafka, Elasticsearch, HDFS etc. The application state and any other resilient data are maintained in a persistent storage object associated with the StatefulSet. Kubernetes ensures that each time the component is (re)scheduled; same persistent storage is made available to the component.
We used native Kubernetes scheduler added in Spark (>= 2.3.0) to submit the stream and batch processing jobs. The native scheduler takes care of spawning the Spark driver and executor pod to execute the jobs.
Eventually deploying multiple components individually became cumbersome. Our stack involved several objects including ConfigMaps, Services, StatefulSets, Persistent Volumes etc. We used HELM for packaging multiple Kubernetes resources into a single package to simplify the deployment of the stack.
We are now working on enabling production deployments for big data stack on various public cloud platforms on Kubernetes.
In the upcoming post, we will talk about the how we used Prometheus to monitor the big data stack on Kubernetes.
Would love to hear your feedback and stories on how you approached this topic. Do comment or email us at email@example.com