Lately I’ve been playing a bit with Sysdig Cloud, which lets you do great things in terms of monitoring and reporting.
For those who do not know it, Sysdig is an OSS monitoring tool: “Open Source Universal System Visibility With Native Container Support”. This means that you install the software on a host acting as the “hypervisor” for containers such as Docker, and you get visibility into different metrics without having to install anything inside the containers themselves. Even better, you can install it on the hypervisor of a Kubernetes environment. It works by installing a specific kernel module, which the sysdig software uses to copy and interpret system calls. You can even capture these system calls, much as you would capture network traffic with tcpdump, to do some safe forensic investigation afterwards.
Sysdig Cloud follows the same idea but adds several benefits, such as nice graphs for different metrics, preset metrics for the most common services, alerts, events, and, obviously, notifications.
I’ll talk about notifications now. I set up a few hosts to test, and I could see their stats and those of the different containers:
Then I reviewed the alerting system and set up the most common notification, which is receiving an email (personal email, group mail, or mailing list; you choose). But then I remembered some bots we built at my job for some ticketing systems (say OTRS) using a Telegram bot, and thought: “Hey, it would be nice to receive alerts on mobile using a Telegram bot as well”.
I soon found the documentation on notification methods, and among them was a webhook method, which basically requires a URL (which is also what a Telegram bot needs to receive in order to send a message). Unfortunately, the webhook sends the content of the alert as a JSON object and, as far as I have seen, the Telegram API does not interpret this JSON on POST requests. You could still set up static alerts (something like one notification per alert type), but this would be tedious and would not provide all the information you might need at a given moment to determine severity.
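To illustrate the gap: a small relay sitting between the webhook and Telegram would have to turn the alert JSON into the `chat_id` and `text` parameters that the Bot API's `sendMessage` method expects. The following is only a minimal sketch of that translation. The alert fields `name`, `severity` and `state` are hypothetical (the real payload layout should be taken from the Sysdig documentation), and the token and chat id are placeholders.

```python
import json
from urllib.parse import urlencode

TELEGRAM_API = "https://api.telegram.org"


def build_telegram_request(alert_json, bot_token, chat_id):
    """Return (url, form_body) for a Telegram sendMessage call.

    The keys read from the alert ("name", "severity", "state") are
    assumptions made for this sketch, not the documented Sysdig payload.
    """
    alert = json.loads(alert_json)
    text = "[{severity}] {name}: {state}".format(
        severity=alert.get("severity", "?"),
        name=alert.get("name", "unnamed alert"),
        state=alert.get("state", "unknown"),
    )
    url = "{}/bot{}/sendMessage".format(TELEGRAM_API, bot_token)
    # sendMessage accepts form-encoded parameters; chat_id and text are required.
    body = urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return url, body
```

The function only builds the request, so it can be tried without a real bot; actually sending it would be a plain HTTP POST of the body to the URL.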
But then I thought that, despite this, the webhook opened up good possibilities: JSON over POST is used by many APIs. I soon had an idea: creating reports on issues using Elastic and Kibana. The most typical use case would be an IT team leader who needs a quick and easy way to look at the workload and the severity of the issues the team is facing. Although this might look complex, it is much easier than it seems, once you work around some particularities.
The first step is installing ES and Kibana and setting up the index (the ES and Kibana docs are quite clear on how to do this). Once I had done that, I set up the webhook notification to post to my ES instance: http://xavy:MyPass@x.x.x.x:9200/sysdig-alerts2/sysdig2 . Please note that I was using auth because I was running a Docker image of ES with X-Pack installed. If you use a standard, unlicensed ES installation, you will not need it (but you should consider limiting the reachability of ES).
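Before pointing Sysdig at the instance, it can help to reproduce by hand what the webhook will do: an HTTP POST of a JSON document to the index/type URL, with basic auth. The sketch below only builds such a request. The host `http://localhost:9200` and the sample alert body are placeholders I made up for illustration, and the index/type style of URL matches the ES version used here, not necessarily newer releases.

```python
import base64
import json
import urllib.request


def build_index_request(base_url, index, doc_type, user, password, alert):
    """Build (but do not send) a POST that indexes `alert` as a new document."""
    url = "{}/{}/{}".format(base_url.rstrip("/"), index, doc_type)
    # Same credentials as the user:password@ part of the webhook URL,
    # expressed as an explicit Basic auth header.
    token = base64.b64encode(
        "{}:{}".format(user, password).encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(alert).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


# Hypothetical sample document; real alerts carry whatever Sysdig sends.
request = build_index_request(
    "http://localhost:9200", "sysdig-alerts2", "sysdig2",
    "xavy", "MyPass", {"timestamp": 1500000000000000, "severity": 2},
)
```

Passing `request` to `urllib.request.urlopen` would send it; if the document shows up in Kibana, the webhook URL should work too.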
Then the first issue appeared: the timestamp value was stored in ES as a long type instead of a date type. Nothing too complex; just create a mapping which ensures that the timestamp field uses the date type. The complexity comes from the fact that Sysdig Cloud sends the timestamp as epoch time in microseconds, while the ES date type only handles up to milliseconds. I thought of several possibilities here, but finally decided to use a scripted field in Kibana. Let me show you a screenshot of such a scripted field:
As you can see, I just divided the Sysdig microseconds by 1000 to get milliseconds, and chose a time format that suited the report I had in mind.
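The arithmetic behind the scripted field is easy to check outside Kibana. The snippet below is a Python illustration of the same conversion, not the scripted field itself; the function names are made up for this sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_millis(ts_micros):
    """Drop the microsecond precision: epoch microseconds -> epoch milliseconds."""
    return ts_micros // 1000


def micros_to_datetime(ts_micros):
    """Convert epoch microseconds to a timezone-aware UTC datetime."""
    # timedelta keeps the conversion exact, with no floating-point rounding.
    return EPOCH + timedelta(microseconds=ts_micros)


print(micros_to_millis(1500000000123456))                # 1500000000123
print(micros_to_datetime(1500000000000000).isoformat())  # 2017-07-14T02:40:00+00:00
```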
Now we had the right data available:
So we just had to build the visualization to present the data in a report-like format:
As you can see, I chose a count aggregation for the Y axis and, for the X axis, a date histogram aggregation split by a sub-aggregation on the severity of the alerts, so we can see not only how many issues happened each day, but also how important they were.
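For readers who want to see what that chart computes, here is a rough Python equivalent of the aggregation: bucket alerts by day, then count them per severity inside each bucket. It is only an illustration of the logic, not what Kibana runs. The `timestamp` field is the microsecond epoch value discussed above, while the `severity` field name and the sample data are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def count_by_day_and_severity(alerts):
    """Count alerts per (UTC day, severity), like a date histogram
    with a sub-aggregation on severity."""
    counts = Counter()
    for alert in alerts:
        moment = EPOCH + timedelta(microseconds=alert["timestamp"])
        counts[(moment.date().isoformat(), alert["severity"])] += 1
    return counts


# Made-up sample alerts: two on one day, one on the next.
sample = [
    {"timestamp": 1500000000000000, "severity": 2},
    {"timestamp": 1500000003000000, "severity": 2},
    {"timestamp": 1500086400000000, "severity": 4},
]
for (day, severity), total in sorted(count_by_day_and_severity(sample).items()):
    print(day, "severity", severity, "->", total)
```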
That’s all, folks. I hope you enjoyed it and/or found it useful!