Building the "world's worst" Kubernetes log catcher

Logs are a lifeline when troubleshooting faults or investigating a security breach. Without them, you often have to resort to guesswork to replicate an issue or an intrusion into your systems. In a traditional server environment, logs are written to binary or flat files on locally attached disk. The contents of the logs can then be viewed using application-specific tools (I'm looking at you, Varnish) or any text viewer. Large-scale deployments collate logs from multiple systems or virtual machines using something like rsyslogd. This allows informatics, dashboards, and on-call paging to be implemented centrally within the data center.

And then Kubernetes screwed it all up.

Don't get me wrong, I love k8s. I find it one of the most powerful tools we have for modern infrastructure management. Once you get over the conceptual gap between VMs and containers, it becomes a lot more approachable.

Yet, logging remains complicated.

When running an application in a container, you really have four different options for storing log files:

  1. Save to local ephemeral storage in the container.
  2. Save to local persistent storage in the container.
  3. Dump to STDOUT and STDERR to be stored by the k8s log catcher.
  4. Offload to an external networked log catcher.

The first two are the most basic, and often the ones people rely on when transitioning from a traditional server environment to a containerized one. You open a file somewhere in the container, and write the contents to disk as if it were a normal server. Unlike a normal server, however, you have two choices of where to put those files. Without specifically attaching a persistent disk to your pod, all files reside in ephemeral storage. They exist for the life of the container, and are deleted when the container is deleted. You also have no guarantee as to how much of that ephemeral storage you have available by default. It's first come, first served for all applications on the same node in the cluster.

You can negate these problems by attaching a persistent disk. Ideally, this disk would be specific to logs for the application, and not shared across applications for fear of creating a broader security hole in your infrastructure. This runs into two new problems:

  • Persistent disks on most major managed k8s providers are ReadWriteOnce.
  • The number of persistent disks you can mount at once per node is limited on most major managed k8s providers.

So, unless you're self-hosting k8s, or using a clustered file system like Rook, this isn't a practical solution. Even if you have the ability to mount disks using ReadWriteMany, you run into yet another problem. Ideally, you scale out applications by running multiple instances across multiple nodes in your cluster. Each instance of the application should then use the same config (usually stored in configmaps or secrets) so that they're all interchangeable. This creates a problem: every instance's logging configuration points at the same file on the same path. If you combine this with a shared disk and ReadWriteMany, each instance will write over the others' log files, resulting in corruption. This can be avoided if you can use environment variables or container hostnames in the file paths (see the entrypoint sketch just below), but not all applications support this.
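
For instance, if the application lets you template its log path, a tiny wrapper entrypoint along these lines (the binary name and paths are hypothetical) gives each replica its own file on the shared volume, since $HOSTNAME inside a container defaults to the pod name:

#!/bin/bash
# Hypothetical entrypoint: one log file per replica on the shared volume.
# $HOSTNAME defaults to the pod name, so each instance gets a unique path.
exec /usr/local/bin/myapp --log-file="/var/log/myapp/app-${HOSTNAME}.log"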

Skipping option 3 for the moment -- we'll get back to it, promise -- leaves offloading to a networked log catcher. This neatly avoids all the other issues with concurrency, persistent disks, and so on. Unfortunately, it is much, much less common for applications to support this out of the box.

None of the above options works across all applications or instance counts, although in some use cases they will work out just fine depending on the size and complexity of your cluster.

Also none of them are real recommendations. None.

At the time of this writing, if you want a proper logging, alarm, and reporting solution for Kubernetes, there's one already available: Prometheus and Grafana. The two are separate applications that work together, along with application-specific log catchers deployed using a sidecar pattern. That is, if your pod is a container running a web server, a second container in the same pod is set up as a log catcher which forwards logs to Prometheus. Grafana then uses Prometheus as a database to create dashboards and power reports.

So really, if you really, actually need real logging and informatics, just use Prometheus and Grafana. Stop reading, and go do that.

So you decided to keep reading and see what sort of cursed nonsense I'm up to for the purposes of learning and fun.

The problem with writing log files to disk is that to access them, you need to kubectl exec into the container and then use a log-reading utility in there, or kubectl cp the log files down and read them locally.
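
For reference, pulling file-based logs out by hand looks something like this; the in-container log path is hypothetical, and kubectl cp needs tar available inside the container:

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace exec -it web-8bcc8cf87-6cllp -- tail -n 100 /var/log/myapp/error.log

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml cp my-namespace/web-8bcc8cf87-6cllp:/var/log/myapp/error.log ./error.log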

This is where writing logs to STDOUT and STDERR comes into play. While it may look at first blush like logs written there disappear into the ether, it's actually one of the most accessible options for logging on Kubernetes. K8s will record output from STDOUT and STDERR (typically the last 24 hours or the life of the container, whichever is shorter) for you to access using kubectl logs. If your application runs in a Unix-like OS such as Linux, you can even redirect older applications to /dev/stdout and /dev/stderr as if they were local files (see the sketch below). This is useful for checking on a container when it starts, or peeking in on an active, recurring error, but it's not stored forever by default. There's a time limit depending on your managed k8s provider or your self-hosted k8s configuration.
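
A common trick for those older applications is to symlink their log files to the device paths in the image or entrypoint, so "writing to a file" actually lands on STDOUT and STDERR (the official nginx image does this, for instance); the paths here are placeholders:

ln -sf /dev/stdout /var/log/myapp/access.log
ln -sf /dev/stderr /var/log/myapp/error.log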

Yet, kubectl logs can be executed remotely, from within or outside of the cluster, given the right permissions and network access. Moreover, it can be done per pod, even if they're different instances of the same application, since the logs are associated with the pod and container name.

So what if we just ran kubectl logs on a regular schedule, with something like cron?

First, of course, we need to get the pods from which we want to collect logs.

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace get pods

NAME                        READY   STATUS    RESTARTS   AGE
cm-acme-http-solver-75mfb   1/1     Running   0          3h53m
cm-acme-http-solver-kflhn   1/1     Running   0          6d12h
memcache-0                  1/1     Running   0          8d
web-8bcc8cf87-6cllp         2/2     Running   0          8d
web-8bcc8cf87-bhmd6         2/2     Running   0          8d
web-8bcc8cf87-w47gz         2/2     Running   0          8d

That'll get us a list of pods, but it will get all the pods in the namespace. We may want to target a particular deployment or statefulset within our namespace. In the above example, we'd rather get just the pods whose names start with web-. While there are a few different ways to do this, the best is to use labels within the deployment or statefulset definition:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: web
  name: web
  namespace: my-namespace
spec:
  # truncated for brevity

With this, we can now use the label selector (app=web) to limit our output to just the pods we want:

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace get pods -l app=web

NAME                  READY   STATUS    RESTARTS   AGE
web-8bcc8cf87-6cllp   2/2     Running   0          8d
web-8bcc8cf87-bhmd6   2/2     Running   0          8d
web-8bcc8cf87-w47gz   2/2     Running   0          8d

That's better, but the results are still formatted for humans and not suitable for use in a script. What we really want is just a list of pod names. Fortunately, we can get that by customizing the kubectl output with custom columns:

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace get pods -l app=web -o custom-columns=:metadata.name

web-8bcc8cf87-6cllp
web-8bcc8cf87-bhmd6
web-8bcc8cf87-w47gz

Now that we have the pod names, it's a simple matter to get the logs for each by running these three commands in succession:

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs web-8bcc8cf87-6cllp

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs web-8bcc8cf87-bhmd6

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs web-8bcc8cf87-w47gz
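
One aside: the web pods listed earlier show 2/2 under READY, meaning each has two containers. Depending on your kubectl version, kubectl logs without a container name will either default to the first container (and tell you so) or ask you to pick one, so you may want to name a container explicitly with -c. The container name below is a placeholder:

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs web-8bcc8cf87-6cllp -c web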

But why should we do this by hand? We should get the computer to do it for us. Thankfully, we can with a bit of shell scripting: run kubectl logs in a for loop, using the output of kubectl get pods to form the list of items over which to iterate:

for pod in $(kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace get pods -l app=web -o custom-columns=:metadata.name); do bash -c "kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs  $pod"; done

That just dumps everything to the terminal, though, so let's save each pod's output to its own dated, pod-named file so we can keep them all separate and avoid overwriting anything. The filename format we want is <applicationName>-<dateLogWasTaken>-<podName>.log, where dateLogWasTaken is a timestamp with a four-digit year, a two-digit month and day delimited by hyphens, the letter 'T', a zero-padded 24-hour time delimited by colons, and a trailing 'Z'. While that makes for a long filename, it gives us reasonable confidence that it will be unique each time logs are written.

for pod in $(kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace get pods -l app=web -o custom-columns=:metadata.name); do bash -c "kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs  $pod > path/to/web-`date +%Y-%m-%dT%H:%M:%SZ`-$pod.log"; done

While our script certainly works for now, we're going to face yet another problem. Whenever we run it, we'll get every log line Kubernetes still has for each pod. Since we're writing them to separate files, we risk recording duplicate log entries, which may throw off any data collection or metrics we wish to generate from them. Instead of doing a complicated file compare, we can limit the period for which we ask kubectl logs for output. There's actually a neat option for this, --since-time.

for pod in $(kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace get pods -l app=web -o custom-columns=:metadata.name); do bash -c "kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs  $pod  --since-time=`date -d @$(($(date +%s) - 21600)) +%Y-%m-%dT%H:%M:%SZ` > path/to/web-`date +%Y-%m-%dT%H:%M:%SZ`-$pod.log"; done

A lot is going on in the new piece we added:

--since-time=`date -d @$(($(date +%s) - 21600)) +%Y-%m-%dT%H:%M:%SZ`

First, we add the parameter --since-time and pass the value between backticks. The backticks (in Bash) substitute the output of another command inline as the parameter's value. We then call the date command with the -d flag, which asks it to format the given date rather than the current time.
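
As a quick, throwaway illustration of backtick substitution on its own, the inner date command runs first and its output is spliced into the string that echo prints:

$ echo "Logs collected at `date +%Y-%m-%dT%H:%M:%SZ`"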

We then do some more trickery with nested command substitution and arithmetic expansion. The innermost command gets the current UNIX time in seconds:

$(date +%s)

Technically, we should set the timezone here so we always know when this is run. In my case, I was able to omit this as all the timezones were set to UTC. Once we have the current UNIX time, we need to subtract an arbitrary period in seconds from it. In this example, we used 21600, or six hours.

$(($(date +%s) - 21600))

This means we'll need to configure our cronjob later to run every six hours to match. You could change this period to be longer or shorter as your needs dictate. Importantly, it should be no longer than the amount of time you're willing to lose if your pods are replaced with new ones.

Finally, we want to get the time in the format kubectl expects: a four-digit year, a two-digit month and day delimited by hyphens, the letter 'T', a two-digit hour in 24-hour format, two-digit minutes and seconds delimited by colons, and the letter 'Z' indicating "zulu" or UTC time.

date -d @$(($(date +%s) - 21600)) +%Y-%m-%dT%H:%M:%SZ

All combined, this lets us write a log file for each pod matching our label in our cluster namespace, containing only the last six hours of output.
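
As an aside, the trailing 'Z' in that format claims UTC, so if the machine running the script isn't already set to UTC you'd want to force it. With GNU date (which the -d @ syntax already assumes), adding -u to the formatting calls does the trick:

date -u -d @$(($(date +%s) - 21600)) +%Y-%m-%dT%H:%M:%SZ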

Now, of course, we could run this on any external system with cron and network access to the cluster. We could even run it on somebody's workstation, provided we never turn the thing off. Since that's not good practice, it's better to have our log catcher script run inside the cluster itself. And because we want to keep those log files around, we should attach a persistent disk:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-log-files
  namespace: my-namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: logs-cron
  namespace: my-namespace
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /bin/bash
            - -c
            - for pod in $(kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace get pods -l app=web -o custom-columns=:metadata.name); do bash -c "kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace logs  $pod  --since-time=`date -d @$(($(date +%s) - 21600)) +%Y-%m-%dT%H:%M:%SZ` > path/to/web-`date +%Y-%m-%dT%H:%M:%SZ`-$pod.log"; done
            image: ten7/flightdeck-util
            imagePullPolicy: IfNotPresent
            name: cron
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /path/to
              name: vol-pvc-my-log-files
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - name: vol-pvc-my-log-files
            persistentVolumeClaim:
              claimName: my-log-files
  schedule: "0 0,6,12,18 * * *"
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 0
  suspend: false

While the above is lengthy, it's fairly straightforward. The first definition is for our persistent disk, my-log-files. We request 10Gi of space, although you may wish to request more or less depending on your needs and the limitations of your hosting provider.

After the three dashes, we get our cronjob definition. It runs our script in a container based on the ten7/flightdeck-util image, which has kubectl, bash, and the date command already inside it ready for use. We run the command every six hours at midnight, 6am, 12pm, and 6pm UTC.

The above doesn't quite work, though. You'll notice that we included a kubeconfig file in our command, but did not mount such a file anywhere in our cronjob. We could write that file to the persistent disk we created, but we shouldn't.

In general, you should never write a kubeconfig file anywhere inside your cluster if you can help it. Often, the kubeconfig files you get have overly broad permissions, including administrative access to your cluster. That can create big problems if someone abuses your log catcher to do something nefarious. So, instead, let's define a Service Account inside our cluster which only has permission to list pods and get their logs.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-logcatcher-service-account
  namespace: my-namespace

Not much there, of course. This just defines a new service account with the name of my-logcatcher-service-account.

Next, we need to define the permissions, by creating a new role:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-namespace
  name: log-catcher
rules:
- apiGroups:
    - ""
  resources:
    - pods
    - pods/log
  verbs:
    - get
    - list

The role, log-catcher, grants permissions to get and list pods and pod logs (pods/log). This lets us run the two kubectl commands we need: kubectl get pods and kubectl logs. Note also that this is specifically of kind Role and not ClusterRole, which bounds the permissions granted by the role to the current namespace only.

Finally, we need to define a RoleBinding to assign our service account our new role:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-catcher-rolebinding
  namespace: my-namespace
subjects:
- kind: ServiceAccount
  name: my-logcatcher-service-account
  namespace: my-namespace
roleRef:
  kind: Role
  name: log-catcher
  apiGroup: rbac.authorization.k8s.io
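
If you want to sanity-check the binding before wiring it into the cronjob, kubectl auth can-i can impersonate the service account (assuming your own credentials are allowed to impersonate); both of these should answer yes:

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace auth can-i list pods --as=system:serviceaccount:my-namespace:my-logcatcher-service-account

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace auth can-i get pods --subresource=log --as=system:serviceaccount:my-namespace:my-logcatcher-service-account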

That gets us all the permissions we need. To use them in our cronjob, we add the following lines to the pod template spec of the cronjob definition:

serviceAccount: my-logcatcher-service-account
serviceAccountName: my-logcatcher-service-account

This instructs Kubernetes to run the cronjob under our new service account with the permissions we specified. With that in place, we no longer need to specify --kubeconfig in any of our commands:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-log-files
  namespace: my-namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-logcatcher-service-account
  namespace: my-namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-namespace
  name: log-catcher
rules:
- apiGroups:
    - ""
  resources:
    - pods
    - pods/log
  verbs:
    - get
    - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-catcher-rolebinding
  namespace: my-namespace
subjects:
- kind: ServiceAccount
  name: my-logcatcher-service-account
  namespace: my-namespace
roleRef:
  kind: Role
  name: log-catcher
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: logs-cron
  namespace: my-namespace
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /bin/bash
            - -c
            - for pod in $(kubectl -n my-namespace get pods -l app=web -o custom-columns=:metadata.name); do bash -c "kubectl -n my-namespace logs  $pod  --since-time=`date -d @$(($(date +%s) - 21600)) +%Y-%m-%dT%H:%M:%SZ` > path/to/web-`date +%Y-%m-%dT%H:%M:%SZ`-$pod.log"; done
            image: ten7/flightdeck-util
            imagePullPolicy: IfNotPresent
            name: cron
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /path/to
              name: vol-pvc-my-log-files
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          serviceAccount: my-logcatcher-service-account
          serviceAccountName: my-logcatcher-service-account
          volumes:
          - name: vol-pvc-my-log-files
            persistentVolumeClaim:
              claimName: my-log-files
  schedule: "0 0,6,12,18 * * *"
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 0
  suspend: false

When the cronjob runs, the resulting container mounts our persistent volume and calls our script. With no kubeconfig specified, kubectl is smart enough to fall back to the in-cluster credentials Kubernetes provides for our service account and uses them to access resources within the cluster. Our script runs, the files are written, and the job finishes.
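
Those in-cluster credentials are just files on a volume Kubernetes mounts into the pod automatically, typically ca.crt, namespace, and token under /var/run/secrets/kubernetes.io/serviceaccount. If you're curious, you can peek at them from any running pod; the pod name here is a placeholder:

$ kubectl --kubeconfig=/path/to/kubeconfig.yaml -n my-namespace exec some-running-pod -- ls /var/run/secrets/kubernetes.io/serviceaccount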

Et voilà! A Kubernetes log catcher in a single manifest file.

The above is an abjectly terrible log catcher. It only works for one namespace, and one set of labels. While under the best of circumstances it'll catch and write logs every six hours, a container restart or a new deployment will replace the containers and lose critical log data. There are so, so many better ways to implement this. However, it is useful to know that it is possible with a minimum of tooling and programming. It's useful as a temporary solution if nothing else.