You have CSV files and JSON files in AWS S3 ready to go. Let’s get querying!
Dremio is, first of all, free to play with in the community edition. Second, it is a great tool to have when you’re ready to go big with your data and queries. Dremio allows the analyst to query S3 data files directly without ETL jobs.
Bonus: you’re not limited to just files on S3. All the clouds are supported.
Today, our goal is to:
Step 1: Install Dremio locally in Ubuntu
Step 2: Query [example] files on S3 with SQL statements
Why Local Install?
If the data is in AWS on S3, why run locally? First we learn to crawl, then we learn to walk. To quote the great Mr. Miyagi of legend, “Wax on, wax off.” Experimenting in the cloud can get expensive fast once you go beyond the free tier, and you may find yourself spending more time managing IAM permissions and VPNs than using the tools. The local installation can still access the Dremio sample data on S3, so we can do many of the Dremio tutorials. That said, moving to the cloud is the planned next step. Mr. Miyagi would only throw young Daniel into the ring when he was ready. 😎
We’re running locally on Ubuntu…
“Microk8s is the click-and-run solution for deploying a Kubernetes cluster locally, originally developed by Canonical, the publisher of Ubuntu.”
Dremio also has a tutorial that uses MicroK8s, so we can assume it will work. We will adjust that tutorial a tiny bit to use the latest versions of things.
Why Helm 3 over Helm 2?
Helm works well with Kubernetes to install new pods/services in your cluster. Helm 3 was released in 2019 and is ready for prime time. At the time of this writing, the Dremio installation documents refer to Helm 2 and Tiller. Tiller is removed in Helm 3. This is good — less is more. Let’s avoid Tiller and pretend it never existed. We expect Dremio to update their documentation to use Helm 3.
“The internal implementation of Helm 3 has changed considerably from Helm 2. The most apparent change is the removal of Tiller….”
Let’s Do This!
I’m using Ubuntu 20.04 Desktop 64-bit which can be installed following these instructions on Ubuntu’s site. I’m using a 16GB RAM machine but you could get away with just 8GB. My Intel machine has 4 cores and 2 threads on each. My SSD is 500GB. The “Balena Etcher” software makes it very easy to build a bootable USB with the Ubuntu ISO file. Note the ISO file name will end in “amd64.iso” even when you are using an Intel machine. File name: ubuntu-20.04.2.0-desktop-amd64.iso
Here are the commands to install MicroK8s using snap. A few other things are installed via apt. More detailed instructions are here: https://microk8s.io/docs
$ sudo snap refresh
$ sudo snap install microk8s --classic --channel=1.21
$ sudo microk8s enable dns storage helm3
In addition to the above snaps, we’ll need ‘git’ in order to grab Dremio’s tools from GitHub.
$ sudo apt install git
Lastly, here is a shortcut so you don’t have to type ‘sudo’ every time you run microk8s commands. Be sure to reboot after running the following.
$ sudo usermod -a -G microk8s $USER
$ sudo chown -f -R $USER ~/.kube
--> Reboot machine
$ microk8s start
$ microk8s kubectl get pods -A
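After the reboot, one way to confirm that the group change took effect is to check your group list before relying on sudo-free commands. This is a minimal sketch; the helper name has_group is made up for illustration, while id -nG is the standard command that prints the current user’s group names.

```shell
# has_group GROUPS NAME: succeeds when NAME appears in the
# space-separated GROUPS list. (Hypothetical helper, not part of MicroK8s.)
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# id -nG prints the current user's group names on one line.
if has_group "$(id -nG)" microk8s; then
  echo "microk8s group active: sudo is not needed"
else
  echo "microk8s group not active yet: log out and back in, or reboot"
fi
```

If the second message appears after a reboot, re-run the usermod command above and check for typos in the group name.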
Configuration for WiFi Setup for Network Interface
If you’re connected to your network over WiFi and your pods won’t reach a “Running” state, read the following on how to update the cni.yaml file. Link below.
Installing Kubernetes with MicroK8s on an Intel NUC running Ubuntu
Issue: Stuck in the “ContainerCreating” Status
Adding Dremio (v2) to your MicroK8s Cluster
This section is a slightly modified version of what you can find on Dremio’s site here. The main differences (as of June 2021) are:
- I’m using Helm 3
- I change the CPU, count and memory settings in the values.yaml file
- I use NodePort instead of LoadBalancer to expose the Dremio UI to a local web browser
(1) Get the Helm chart from Dremio’s GitHub repository
$ git clone https://github.com/dremio/dremio-cloud-tools.git
$ cd dremio-cloud-tools/charts
(2) In the ‘charts’ directory, copy the values.yaml file for dremio_v2 and edit the new copy
$ cp dremio_v2/values.yaml ./local.values.yaml
$ vi ./local.values.yaml
(3) Change the following items in the local.values.yaml file: the CPU, memory, and count values in each section. I’m dropping these values significantly from the defaults because it’s just me using Dremio and I’m mostly using example datasets.
That’s right, for our setup you only need one ZooKeeper, one executor, and one coordinator. For the coordinator, count=0 means zero additional coordinators beyond the master, so the total number of coordinators is 0 + 1 = 1. You can also decrease the disk space. Anywhere in the file you see “100Gi” you can drop down to “50Gi”.
Note that we also changed “LoadBalancer” to “NodePort” to expose our Dremio UI.
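To double-check the edits, it can help to list only the lines that carry these settings. The sketch below assumes the settings appear as keys written cpu:, memory:, and count:, and that disk sizes and the service type appear literally as values such as 50Gi and NodePort. The helper name show_sizing is hypothetical.

```shell
# show_sizing FILE: print, with line numbers, the lines that mention the
# settings discussed above (cpu, memory, count, disk sizes, service type).
# The key spellings are assumptions about the file layout.
show_sizing() {
  grep -nE 'cpu:|memory:|count:|[0-9]+Gi|LoadBalancer|NodePort' "$1"
}

# Only run against the copy when it exists in the current directory.
if [ -f ./local.values.yaml ]; then
  show_sizing ./local.values.yaml
fi
```

Any remaining “LoadBalancer” or “100Gi” in the output points at a spot that was missed.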
(4) Install using Helm 3 from the dremio-cloud-tools/charts directory you created during the git clone. The following command names the release “dremio” and uses the chart in the “dremio_v2” directory. It only works if you are in the “charts” directory, which is also where you created your local .yaml file.
$ microk8s helm3 install dremio dremio_v2 -f ./local.values.yaml
If you make changes again and need to update, then use “upgrade” instead.
$ microk8s helm3 upgrade dremio dremio_v2 -f ./local.values.yaml
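If you would rather not remember which of the two commands applies, Helm’s upgrade command accepts an --install flag that installs the release when it does not exist yet and upgrades it otherwise. The wrapper below is a sketch only: the function name deploy_dremio and the DRY_RUN variable are made up for illustration, and the release name, chart directory, and values file are the ones used above.

```shell
# deploy_dremio: install the release if it is missing, upgrade it otherwise.
# With DRY_RUN=1 the command is printed instead of executed.
# (Hypothetical wrapper; run it from the charts directory.)
deploy_dremio() {
  set -- microk8s helm3 upgrade --install dremio dremio_v2 -f ./local.values.yaml
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Preview the command without touching the cluster.
DRY_RUN=1 deploy_dremio
```

Calling deploy_dremio without DRY_RUN set runs the command for real.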
Sometimes you may need to clear everything out and start over with uninstall. If you forgot the release name, use the list command. A few more helm3 commands below for fun.
$ microk8s helm3 list
$ microk8s helm3 uninstall dremio
(5) Validate the Installation
$ microk8s kubectl get all -A
Look for STATUS “Running” next to “pod/zk-0” (ZooKeeper), “pod/dremio-executor-0”, and “pod/dremio-master-0”.
The output will also include a line that shows the IP address of the dremio-client service. Open that address in your browser over HTTP to create your account.
Likely you have two options to get to the UI (IP address examples below):
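With a NodePort service, the PORT(S) column for dremio-client lists pairs of the form servicePort:nodePort/TCP. The sketch below pulls out the node port for a given service port. It assumes the web UI is served on Dremio’s default port 9047 and that node ports are reachable on localhost, which may not hold on every setup. The helper name node_port and the sample port values are hypothetical.

```shell
# node_port PORTS PORT: given a PORT(S) value from
# "microk8s kubectl get service dremio-client", print the node port
# mapped to the service port PORT. (Hypothetical helper.)
node_port() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F'[:/]' -v p="$2" '$1 == p { print $2 }'
}

# Made-up sample value; copy the real one from your own cluster.
ports='31010:31234/TCP,9047:30047/TCP'
echo "http://localhost:$(node_port "$ports" 9047)"
```

The other option is the service’s cluster IP together with the service port itself, rather than the node port.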
If a pod is not in the Running state, these commands may offer clues:
$ microk8s kubectl describe pod dremio-master-0
$ microk8s kubectl describe pod dremio-executor-0
$ microk8s kubectl describe pod zk-0
$ microk8s kubectl logs -f dremio-master-0
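Rather than describing each pod by hand, the commands above can be driven by a small filter that picks out the pods that are not running. This is a sketch that assumes the default column layout of “kubectl get pods” (NAME, READY, STATUS, ...) in the current namespace. The helper name not_running is made up for illustration.

```shell
# not_running: read "kubectl get pods" output on stdin and print the names
# of pods whose STATUS column (third field) is not "Running".
# The header row is skipped. (Hypothetical helper.)
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Describe every pod that is not running; does nothing without microk8s.
if command -v microk8s >/dev/null 2>&1; then
  for pod in $(microk8s kubectl get pods | not_running); do
    microk8s kubectl describe pod "$pod"
  done
fi
```

Note that pods in a finished state such as Completed would also be listed, since the filter only compares against “Running”.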
(6) Run your first tutorial!
I recommend starting with “Working with Your First Dataset” found here: