Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Apache Bigtop is a project for the development of packaging and tests of the Big Data and Data Analytics ecosystem.
The primary goal of Apache Bigtop is to build a community around the packaging and interoperability testing of bigdata-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.
The simplest way to get a feel for how Bigtop works is to cd into the provisioner directory and try out the recipes under vagrant or docker. Each one rapidly spins up, and runs the Bigtop smoke tests on, a local Bigtop-based big data distribution. Once you get the gist, you can hack around with the recipes to learn how the puppet/rpm/smoke-tests all work together, going deeper into the components you are interested in as described below.
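A minimal sketch of that workflow with the docker provisioner (the flags here are illustrative; see provisioner/docker/README.md for the exact options):

```
$ cd provisioner/docker
$ ./docker-hadoop.sh --create 3      # spin up a 3-node cluster
$ ./docker-hadoop.sh --smoke-tests   # run the Bigtop smoke tests against it
$ ./docker-hadoop.sh --destroy       # tear everything down
```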
Also, there is a new project underway, Apache Bigtop blueprints, which aims to create templates/examples that demonstrate/compare various Apache Hadoop ecosystem components with one another.
There are lots of ways to contribute. People with different expertise can help with various subprojects:
Also, opening JIRAs and getting started by posting on the mailing list is helpful.
This is the content for the talk given by Jay Vyas and Sid Mani at ApacheCon 2019 in Las Vegas; you can watch it here: https://www.youtube.com/watch?v=LUCE63q
```
# NFS
$ helm install stable/nfs-server-provisioner
$ kubectl patch storageclass nfs -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Minio
$ kubectl -n minio create secret generic my-minio-secret --from-literal=accesskey=minio --from-literal=secretkey=minio123
$ helm install --set existingSecret=my-minio-secret stable/minio --namespace=minio --name=minio

# Nifi
$ helm repo add cetic https://cetic.github.io/helm-charts
$ helm install nifi --namespace=minio

# Kafka
$ helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
$ helm install --name my-kafka incubator/kafka
$ kubectl edit statefulset kafka
#   envFrom:
#   - configMapRef:
#       name: kafka-cm

# Spark
$ kubectl create configmap spark-conf --from-file=core-site.xml --from-file=log4j.properties --from-file=spark-defaults.conf --from-file=spark-env.sh -n bigdata
$ helm install microsoft/spark --version 1.0.0 --namespace=minio

# Presto
$ cd ./presto3-minio/
$ kubectl create -f - -n minio
```
Installation of things has been commoditized by containers and K8s. The more important problems we have nowadays are around interoperation, learning, and integration of different tools for different problems in the analytics space.
Modern data scientists need ‘batteries included’ frameworks that can be used to model and address different types of analytics problems over time, which can replicate the integrated functionality of AWS, GCP, and so on.
This repository currently integrates installation of a full analytics stack for Kubernetes with batteries included, including storage.
Configuration isn't really externalized very well in most off-the-shelf Helm charts. The other obvious missing link is that storage isn't provided for you, which is a problem for folks who don't know how to do things in K8s. Here we've externalized configuration for all files into ConfigMaps (see Spark as a canonical example of this) and unified the ZooKeeper instances into a single instance for ease of deployment. Also, this repo has tested different Helm repos and YAML files to see what random code on the internet actually works the way it should.
For example, the stable Helm charts don't properly configure Zeppelin, allow for empty storage on ZK, or inject config into Kafka as you'd want to be able to in certain scenarios. In this repo, everything should just work provided you create things in the right order.
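To make the externalized-config pattern concrete, here is a hypothetical sketch (the names `demo-conf` and `LOG_LEVEL` are made up for illustration, not objects from this repo): configuration goes into a ConfigMap, and pods pick it up via `envFrom` or a volume mount.

```
# Create a ConfigMap from literals (or use --from-file for whole config files)
$ kubectl -n bigtop create configmap demo-conf --from-literal=LOG_LEVEL=INFO

# Then reference it from a pod/statefulset spec:
#   envFrom:
#   - configMapRef:
#       name: demo-conf
```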
Minikube is the easiest tool to run a single-node Kubernetes cluster.
```
$ cd $BIGTOP_HOME
$ minikube start --cpus 8 --memory 8196 --disk-size=80g --container-runtime=cri-o
$ kubectl cluster-info
```
Prerequisites:
If you want a multi-node cluster on local machine, you can create the cluster using Kubespray:
```
$ cd $BIGTOP_HOME
$ ./gradlew kubespray-clean kubespray-download && cd dl/ && tar xvfz kubespray-2.11.0.tar.gz
$ cd kubespray-2.11.0/ && cp ../../kubespray/vagrant/Vagrantfile .
$ vagrant up
```
Set up `kubectl` for the local cluster:

```
$ vagrant ssh k8s-1
k8s-1$ kubectl plugin list
k8s-1$ kubectl bigtop kubectl-config && kubectl bigtop helm-deploy
```
The easiest way to get simple persistent volumes with dynamic volume provisioning on a one-node cluster:
```
$ helm repo add rimusz https://charts.rimusz.net
$ helm repo update
$ helm upgrade --install hostpath-provisioner \
    --namespace kube-system \
    --set storageClass.defaultClass=true \
    rimusz/hostpath-provisioner
$ kubectl get storageclass
```
Mark the `hostpath` StorageClass as default:
```
$ kubectl patch storageclass hostpath -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
$ kubectl get storageclass
```
On Minikube, the `standard` StorageClass is the default. To make the `hostpath` StorageClass the default instead, unset `standard` first:
```
$ kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
$ kubectl get storageclass
```
You need to install the `lvm2` package for Rook-Ceph:
```
# CentOS
$ sudo yum install -y lvm2
# Ubuntu
$ sudo apt-get install -y lvm2
```
Refer to https://rook.io/docs/rook/v1.1/k8s-pre-reqs.html for the Rook prerequisites.
Run the download task to get the Rook binary:
```
$ ./gradlew rook-clean rook-download && cd dl/ && tar xvfz rook-1.1.2.tar.gz
```
Create Rook operator:
```
$ kubectl create -f dl/rook-1.1.2/cluster/examples/kubernetes/ceph/common.yaml
$ kubectl create -f dl/rook-1.1.2/cluster/examples/kubernetes/ceph/operator.yaml
$ kubectl -n rook-ceph get pod
```
Create Ceph cluster:
```
# test
$ kubectl create -f storage/rook/ceph/cluster-test.yaml
# production
$ kubectl create -f storage/rook/ceph/cluster.yaml

$ kubectl get pod -n rook-ceph
```
Deploy Ceph Toolbox:
```
$ kubectl create -f dl/rook-1.1.2/cluster/examples/kubernetes/ceph/toolbox.yaml
$ kubectl -n rook-ceph get pod -l "app=rook-ceph-tools"
$ kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
# ceph status && ceph osd status && ceph df && rados df
```
Refer to https://rook.io/docs/rook/v1.1/ceph-toolbox.html for more details.
Create a StorageClass for Ceph RBD:
```
$ kubectl create -f dl/rook-1.1.2/cluster/examples/kubernetes/ceph/csi/rbd/storageclass.yaml
$ kubectl get storageclass rook-ceph-block
```
Mark the `rook-ceph-block` StorageClass as default:
```
$ kubectl patch storageclass rook-ceph-block -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
```
Deploy the Rook NFS operator, then create NFS servers and a StorageClass:

```
$ kubectl create -f dl/rook-1.1.2/cluster/examples/kubernetes/nfs/operator.yaml
$ kubectl -n rook-nfs-system get pod

# NFS backed by the default storageClass
$ kubectl create -f storage/rook/nfs/nfs.yaml
# NFS backed by Ceph RBD volumes
$ kubectl create -f storage/rook/nfs/nfs-ceph.yaml
# NFS backed by the hostpath storageClass
$ kubectl create -f storage/rook/nfs/nfs-hostpath.yaml
# storageClass backed by an NFS export
$ kubectl create -f storage/rook/nfs/storageclass.yaml
```
If you want to mark the `rook-nfs-share1` storageClass as default:
```
$ kubectl patch storageclass rook-nfs-share1 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
$ kubectl get storageclass
```
Deploy Minio operator:
```
$ kubectl create -f dl/rook-1.1.2/cluster/examples/kubernetes/minio/operator.yaml
$ kubectl -n rook-minio-system get pod
```
Create standalone Minio object store:
```
$ kubectl create -f storage/rook/minio/object-store.yaml
$ kubectl -n rook-minio get objectstores.minio.rook.io
$ kubectl -n rook-minio get pod -l app=minio,objectstore=bigtop-rook-minio
```
If you want to deploy a distributed Minio cluster, increase the `nodeCount` in the object-store.yaml file (e.g., `nodeCount: 4`), as sketched below.
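A minimal sketch of that edit-and-reapply step, assuming the shipped spec contains `nodeCount: 1` (check your file and adjust the pattern):

```
# Hypothetical: bump nodeCount from 1 to 4, then re-apply the object store spec
$ sed -i 's/nodeCount: 1/nodeCount: 4/' storage/rook/minio/object-store.yaml
$ kubectl apply -f storage/rook/minio/object-store.yaml
```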
Deploy a Standalone Minio:
```
$ cd $BIGTOP_HOME
$ helm install --name bigtop-minio --namespace bigtop -f storage/minio/values.yaml stable/minio
```
Deploy a Distributed Minio:
```
$ cd $BIGTOP_HOME
$ helm install --name bigtop-minio --namespace bigtop --set mode=distributed,replicas=4 -f storage/minio/values.yaml stable/minio
```
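Either way, you can verify the release pods came up before proceeding; the `release=bigtop-minio` label is taken from the chart's notes shown below:

```
$ kubectl -n bigtop get pods -l release=bigtop-minio
```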
```
Minio can be accessed via port 9000 on the following DNS name from within your cluster:
bigtop-minio.bigtop.svc.cluster.local

To access Minio from localhost, run the below commands:

  1. export POD_NAME=$(kubectl get pods --namespace bigtop -l "release=bigtop-minio" -o jsonpath="{.items[0].metadata.name}")
  2. kubectl port-forward $POD_NAME 9000 --namespace bigtop

Read more about port forwarding here: http://kubernetes.io/docs/user-guide/kubectl/kubectl_port-forward/

You can now access Minio server on http://localhost:9000. Follow the below steps to connect to Minio server with mc client:

  1. Download the Minio mc client - https://docs.minio.io/docs/minio-client-quickstart-guide
  2. mc config host add bigtop-minio-local http://localhost:9000 minio minio123 S3v4
  3. mc ls bigtop-minio-local

Alternately, you can use your browser or the Minio SDK to access the server - https://docs.minio.io/categories/17
```
Deploy mc pod:
```
$ cd $BIGTOP_HOME
$ kubectl create -n bigtop -f storage/mc/minio-client.yaml
```
Check the Minio server info:

```
$ kubectl exec -n bigtop minio-client mc admin info server bigtop-minio
```
Make a bucket and remove the bucket:
```
$ kubectl exec -n bigtop minio-client mc mb bigtop-minio/testbucket1
Bucket created successfully `bigtop-minio/testbucket1`.
$ kubectl exec -n bigtop minio-client mc ls bigtop-minio/
[2019-10-31 04:46:44 UTC]     0B testbucket1/
$ kubectl exec -n bigtop minio-client mc rb bigtop-minio/testbucket1
Removing `bigtop-minio/testbucket1`.
```
Deploy Zookeeper cluster on Kubernetes cluster via Helm chart:
```
$ cd $BIGTOP_HOME
$ export NS="bigtop"
$ helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
$ helm install --name zookeeper --namespace $NS -f zookeeper/values.yaml incubator/zookeeper
$ kubectl get all -n $NS -l app=zookeeper
$ kubectl exec -n $NS zookeeper-0 -- bin/zkServer.sh status
```
Refer to https://github.com/helm/charts/tree/master/incubator/zookeeper for more configurations.
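For example, any chart value can be overridden at install time. A hedged sketch, assuming the chart's documented `replicaCount` value, for a throwaway single-node ensemble:

```
# Illustrative only: a single-node ZooKeeper for local experiments, not production
$ helm install --name zookeeper-test --namespace bigtop \
    --set replicaCount=1 \
    incubator/zookeeper
```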
Deploy a Kafka cluster via Helm:
```
$ helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
$ helm install --name kafka \
    --namespace bigtop \
    -f kafka/values.yaml \
    --set zookeeper.url="zookeeper-0.zookeeper-headless:2181\,zookeeper-1.zookeeper-headless:2181\,zookeeper-2.zookeeper-headless:2181" \
    incubator/kafka

# Deploy Kafka client:
$ kubectl create --namespace bigtop -f kafka/kafka-client.yaml

# Usage of Kafka client
$ export ZOOKEEPER_URL="zookeeper-0.zookeeper-headless:2181,zookeeper-1.zookeeper-headless:2181,zookeeper-2.zookeeper-headless:2181"

# List all topics
$ kubectl -n bigtop exec kafka-client -- kafka-topics \
    --zookeeper $ZOOKEEPER_URL \
    --list

# Create a new topic
$ kubectl -n bigtop exec kafka-client -- kafka-topics \
    --zookeeper $ZOOKEEPER_URL \
    --topic test1 --create --partitions 1 --replication-factor 1
```
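To sanity-check the new topic, you can produce and consume a few messages from the client pod. The broker address below is an assumption based on the headless service names used elsewhere in this README:

```
# Produce a message (type a line, then Ctrl-C to exit)
$ kubectl -n bigtop exec -it kafka-client -- kafka-console-producer \
    --broker-list kafka-0.kafka-headless:9092 --topic test1

# Consume it back
$ kubectl -n bigtop exec -it kafka-client -- kafka-console-consumer \
    --bootstrap-server kafka-0.kafka-headless:9092 --topic test1 --from-beginning
```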
Optionally, you can deploy a Schema Registry service for Kafka:
```
$ helm install --name kafka-schema-registry --namespace bigtop -f kafka/schema-registry/values.yaml \
    --set kafkaStore.overrideBootstrapServers="kafka-0.kafka-headless:9092\,kafka-1.kafka-headless:9092\,kafka-2.kafka-headless:9092" \
    incubator/schema-registry
```
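To confirm the registry is serving, you can hit its REST API; the service name is an assumption derived from the release name, and 8081 is Schema Registry's default port:

```
# Hypothetical check: list registered subjects (empty list on a fresh install)
$ kubectl -n bigtop run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
    curl -s http://kafka-schema-registry:8081/subjects
```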
See spark/README.md
See prometheus/README.md
You can also aggregate logs from the Kubernetes cluster (Pods) using the Loki stack (Loki and Promtail). See logs/loki/README.md for deploying Loki and querying logs from Loki on Grafana.
Below are some recipes for getting started with using Apache Bigtop. As Apache Bigtop has different subprojects, these recipes will continue to evolve. For specific questions it's always a good idea to ping the mailing list at dev-subscribe@bigtop.apache.org to get some immediate feedback, or open a JIRA.
The simplest way to test Bigtop is described in the bigtop-tests/smoke-tests/README file.
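An illustrative invocation (treat that README as authoritative for the exact tasks and flags):

```
# Illustrative only; see bigtop-tests/smoke-tests/README for the real steps
$ cd bigtop-tests/smoke-tests
$ ./gradlew clean test --info
```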
For integration (API level) testing with maven, read on.
The website can be built by running `mvn site:site` from the root directory of the project. The main page can be accessed from "project_root/target/site/index.html".
The source for the website is located in “project_root/src/site/”.
You can get in touch with us on the Apache Bigtop mailing lists.