Kubernetes Cloud Native Monitoring with TICK & Prometheus - Up and Running

0 Introduction

This post doesn't set out to introduce Prometheus or InfluxDB; it serves as a reference for building a monitoring/logging system on Kubernetes with open-source software.
The monitoring/logging/alerting system is composed of four open-source components, as shown in the diagram below:

  1. Fluent Bit is used for log collection. It is deployed as a Kubernetes DaemonSet to all nodes, collects container logs, and enriches them with metadata from the Kubernetes API.
  2. InfluxDB stores the logs collected and enriched by Fluent Bit.
  3. Prometheus is used for monitoring; it pulls metrics from service monitoring targets and stores them in its own time-series database.
  4. Grafana is used for metrics analytics and visualization; it serves as the dashboard system.

1 Prerequisites

The deployment requires Helm, charts, and a few customized configuration files; follow the steps below to download them in advance.

1.1 Helm

Refer to Install applications with Helm in Azure Kubernetes Service (AKS) to install Helm.

1.2 TICK Charts

Clone the TICK charts from the repository below:

git clone https://github.com/influxdata/tick-charts.git

TICK stands for

  • T - Telegraf
  • I - InfluxDB
  • C - Chronograf
  • K - Kapacitor

1.3 Monitoring Configuration Files

Download the Kubernetes monitoring configuration files from the repository below:

git clone https://github.com/huangyingting/k8smon.git

2 Install/Configure TICK Stack & Fluentbit for Logging

2.1 Install TICK Stack

From a command prompt/shell, change directory to tick-charts (cloned in section 1.2), then run:

helm install --name data --namespace tick ./influxdb/ --set persistence.enabled=true,persistence.size=16Gi,config.udp.enabled=true

helm install --name alerts --namespace tick ./kapacitor/ --set persistence.enabled=true,persistence.size=8Gi

helm install --name dash --namespace tick ./chronograf/

helm install --name polling --namespace tick ./telegraf-s/
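
To confirm the releases came up, list the pods in the tick namespace. The exact pod names are derived from the release and chart names used above (data-influxdb, alerts-kapacitor, dash-chronograf, polling-telegraf-s); adjust if you chose different release names.

kubectl get pods -n=tick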

2.2 Configure InfluxDB UDP Port

The default InfluxDB service definition is incorrect for the UDP endpoint; the protocol needs to be changed from TCP to UDP with the steps below.
kubectl edit service data-influxdb -n=tick
Modify

  - name: udp
    port: 8089
    protocol: TCP
    targetPort: 8089

To

  - name: udp
    port: 8089
    protocol: UDP
    targetPort: 8089
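
Alternatively, the same change can be made non-interactively with a JSON patch. This is only a sketch: it assumes the UDP port is the second entry (index 1) in the service's .spec.ports list, so check the ordering first.

kubectl patch service data-influxdb -n=tick --type=json \
  -p='[{"op":"replace","path":"/spec/ports/1/protocol","value":"UDP"}]'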

2.3 InfluxDB Configuration

From a command prompt/shell, run kubectl port-forward svc/dash-chronograf -n=tick 8080:80
Open a browser, go to http://localhost:8080, and follow the wizard to set up the InfluxDB connection.
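When the wizard asks for the InfluxDB connection URL, the in-cluster address is typically http://data-influxdb.tick:8086 (release name data, chart influxdb, namespace tick, default InfluxDB HTTP port 8086); adjust it if your release or namespace names differ.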

2.4 Install Fluentbit to Collect Container Logs

Fluent Bit is a lightweight log collector; according to the Fluent Bit documentation:

Fluent Bit is an open source and multi-platform Log Processor and Forwarder which allows you to collect data/logs from different sources, unify and send them to multiple destinations. It's fully compatible with Docker and Kubernetes environments.
Fluent Bit is used here to collect Kubernetes container logs and forward them to InfluxDB. To install it into the cluster, change directory to k8smon and run:

kubectl create namespace logging
kubectl apply -f fluentbit-config.yaml
kubectl apply -f fluentbit-ds.yaml
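
For reference, the wiring in fluentbit-config.yaml boils down to a tail input, the kubernetes filter for metadata enrichment, and the influxdb output. The sketch below is only illustrative (the file in the k8smon repository is authoritative), and the host/database names are assumptions matching the releases installed above.

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush     5
        Log_Level info
    [INPUT]
        Name      tail
        Path      /var/log/containers/*.log
        Tag       kube.*
    [FILTER]
        Name      kubernetes
        Match     kube.*
    [OUTPUT]
        Name      influxdb
        Match     *
        Host      data-influxdb.tick
        Port      8086
        Database  fluentbit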

2.5 Verify Log Collection

Fluent Bit should start collecting container logs and storing them in InfluxDB. To verify it works, open Chronograf's "Explore" view and highlight the "fluentbit" database; it should list per-container measurements like below.
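
If you prefer a command-line check, the same thing can be verified through InfluxDB's HTTP query API; a quick sketch, assuming the data-influxdb service name and the fluentbit database created by the output plugin:

kubectl port-forward svc/data-influxdb -n=tick 8086:8086
curl -G 'http://localhost:8086/query' --data-urlencode 'db=fluentbit' --data-urlencode 'q=SHOW MEASUREMENTS'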

3 Install/Configure Prometheus for Monitoring

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud.

We choose the Prometheus Operator because it ships with a set of pre-configured dashboards and alerts.

3.1 Install Prometheus Operator

Change directory to k8smon, then from command prompt/shell, run

helm install --name prom --namespace prom stable/prometheus-operator -f prometheus_values.yaml

3.2 Configure ETCD Monitoring

Recent Kubernetes clusters should have SSL/TLS enabled for etcd, so Prometheus needs a client certificate to monitor it. In a kubeadm-provisioned cluster, for example, the client certificate, key, and CA certificate live on the master node under /etc/kubernetes/pki/etcd. Copy/rename the CA certificate to ca.crt, the etcd client certificate to etcd.crt, and the etcd client key to etcd.key, then create a secret manifest with the commands below:

cat <<-EOF > etcd-client-cert.yaml
apiVersion: v1
data:
  etcd-ca.crt: "$(cat ca.crt | base64 --wrap=0)"
  etcd-client.crt: "$(cat etcd.crt | base64 --wrap=0)"
  etcd-client.key: "$(cat etcd.key | base64 --wrap=0)"
kind: Secret
metadata:
  name: etcd-client-cert
  namespace: prom
type: Opaque
EOF

Then apply it to the Kubernetes cluster:
kubectl apply -f etcd-client-cert.yaml
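
On the scrape side, prometheus_values.yaml is expected to reference this secret so it gets mounted into the Prometheus pods; a minimal sketch of the relevant values for the stable/prometheus-operator chart (the file in the k8smon repository is the authoritative version):

prometheus:
  prometheusSpec:
    secrets:
      - etcd-client-cert
kubeEtcd:
  serviceMonitor:
    scheme: https
    caFile: /etc/prometheus/secrets/etcd-client-cert/etcd-ca.crt
    certFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client.crt
    keyFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client.key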

3.3 Configure Controller Manager and Scheduler Monitoring

Monitoring the Kubernetes controller manager and scheduler requires that they listen on 0.0.0.0. If Kubernetes was deployed with kubeadm, change the following settings and then reboot the master node:

sed -e "s/- --address=127.0.0.1/- --address=0.0.0.0/" -i /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -e "s/- --address=127.0.0.1/- --address=0.0.0.0/" -i /etc/kubernetes/manifests/kube-scheduler.yaml

3.4 Configure Telegraf Output to Prometheus

We'd like Prometheus to collect InfluxDB's metrics so that InfluxDB itself is monitored; unfortunately, InfluxDB doesn't expose its metrics in Prometheus format. However, Telegraf can collect InfluxDB's metrics and output them to Prometheus, so we configure Telegraf to relay InfluxDB metrics to Prometheus. To do that, change directory to tick-charts and, from a command prompt/shell, run:

helm upgrade polling --namespace tick ./telegraf-s/ -f ../k8smon/telegraf_values.yaml
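
Conceptually, telegraf_values.yaml enables Telegraf's influxdb input (which scrapes InfluxDB's /debug/vars endpoint) and its prometheus_client output (which re-exposes the metrics on a Prometheus-compatible /metrics endpoint, port 9273 by default). The sketch below is only a rough approximation; the exact schema depends on the telegraf-s chart, and the file in the k8smon repository is authoritative.

config:
  inputs:
    - influxdb:
        urls:
          - "http://data-influxdb.tick:8086/debug/vars"
  outputs:
    - prometheus_client:
        listen: ":9273"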

3.5 Verify Monitoring

Prometheus should start collecting metrics. To verify it works, from a command prompt/shell run:

kubectl port-forward svc/prom-prometheus-operator-prometheus -n=prom 9090:9090

Now, open a browser and visit
http://localhost:9090/targets
All targets should be green

Also, open http://localhost:9090/graph; all metrics should be populated.
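
As an optional spot check, the same information is available through the HTTP API (with the port-forward above still running); every healthy target reports an up value of 1:

curl -s 'http://localhost:9090/api/v1/query?query=up'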

4 Alerting

Refer to Alerting overview

Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, PagerDuty and HipChat.

The default configuration in prometheus_values.yaml will redirect alerts to a null notification channel.

  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
    receivers:
    - name: 'null'

To enable sending alerts to MSTeams, we can

  1. In prometheus_values.yaml, comment out the configuration above and uncomment the section below
    route:
      group_by: ['alertname', 'job']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 12h
      receiver: 'prometheus-msteams'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'prometheus-msteams'
    receivers:
    - name: 'prometheus-msteams'
      webhook_configs:
      - send_resolved: true
        url: 'http://prometheus-msteams.prom:2000/alertmanager'
  2. Run helm upgrade prom --namespace prom stable/prometheus-operator -f prometheus_values.yaml to update the settings
  3. Clone prometheus-msteams from https://github.com/bzon/prometheus-msteams.git
  4. From MSTeams, create an incoming webhook for a channel, then open values.yaml and set the alertmanager connector to the webhook URL
connectors:
  - alertmanager: https://outlook.office.com/webhook/xxx/IncomingWebhook/xxx/xxx
  5. Run helm install --name prometheus-msteams ./prometheus-msteams --namespace prom to install prometheus-msteams

If it is configured correctly, you should be able to receive alerts from prometheus in MSTeams
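
To exercise the webhook path without waiting for a real alert to fire, you can post a hand-crafted Alertmanager-style payload straight to the prometheus-msteams service. This is only a smoke-test sketch, assuming prometheus-msteams accepts standard Alertmanager webhook JSON on port 2000 at /alertmanager, as configured above.

kubectl port-forward svc/prometheus-msteams -n=prom 2000:2000
curl -X POST http://localhost:2000/alertmanager -H 'Content-Type: application/json' -d '{
  "version": "4",
  "status": "firing",
  "receiver": "prometheus-msteams",
  "groupLabels": {},
  "commonLabels": {},
  "commonAnnotations": {},
  "externalURL": "http://localhost:9093",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "SmokeTest", "severity": "info"},
     "annotations": {"summary": "Test alert sent with curl"}}
  ]
}'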

Prometheus also ships with Alertmanager for reviewing all alerts; it can be accessed with
kubectl port-forward svc/prom-prometheus-operator-alertmanager -n=prom 9093:9093
Then open a browser to http://localhost:9093/#/alerts

5 Grafana

The default setup installs a few pre-configured dashboards into Grafana. To access them, we need to log in to Grafana first:

  1. Run the commands below to get the username/password
kubectl get secret -n=prom prom-grafana -o jsonpath="{.data.admin-user}"|base64 --decode;echo
kubectl get secret -n=prom prom-grafana -o jsonpath="{.data.admin-password}"|base64 --decode;echo
  2. Run kubectl port-forward svc/prom-grafana -n=prom 18080:80, open a browser at http://localhost:18080/login, and log in with the username/password retrieved in step 1.
  3. From "Home", find the pre-configured dashboards
  4. Click any of them; for example, the "Node" dashboard lists all node performance metrics

6 Summary

With the setup above, Kubernetes cluster monitoring/alerting/logging should be up and running. This solution can also be applied to small or medium-sized self-hosted Kubernetes clusters.