When building alerts, engineers aim to create accurate, timely, and actionable alerts. In pursuit of this goal, many engineers will leverage PromQL throughout their careers. PromQL is the query language Prometheus uses to query metrics and define alerting rules, which Alertmanager then routes as notifications.
PromQL works very well for simple use cases, but as infrastructure scales, architectural patterns grow more complex, and engineering practices accelerate, alerting use cases become more multivariate. Let’s explore the limitations of PromQL and how a low-code alerting solution like Coralogix Flow Alerts will help you scale your alerting to match even the most complex cases.
PromQL is a fundamental technology in the observability industry and features in almost every reputable platform. It is so ubiquitous that many engineering teams treat it as the default approach for handling metrics and defining alerts, but this approach comes with a series of potential issues.
PromQL queries can grow into complex expressions that combine statistical, programming, and logical concepts. This means that your non-technical colleagues may struggle to use it and, potentially worse, many of your software engineers will have to navigate a difficult learning curve.
The readability of a PromQL query degrades fast. Take, for example, the following expression:
http_requests_total{job="apiserver", handler="/api/comments"}
We can guess what this does. It has a single clause and doesn’t perform any calculations. So what about this?
avg(rate(container_cpu_usage_seconds_total[5m])) / avg(rate(container_cpu_system_seconds_total[5m])) > 0.9 and avg(rate(http_requests_total{job="apiserver", handler="/api/comments"}[5m])) < 150
It all becomes a little difficult when you involve multiple clauses in your queries, and the larger the query becomes, the more difficult it is to understand.
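One way to see where the difficulty comes from is to pull the compound expression apart and give each clause a name. The following is a minimal Python sketch of that idea, not a feature of Prometheus or Coralogix: the variable names and the `all_of` helper are hypothetical, and the sketch only assembles the query string without evaluating it.

```python
# Each clause of the compound expression, given a descriptive (hypothetical) name.
cpu_ratio_high = (
    "avg(rate(container_cpu_usage_seconds_total[5m])) / "
    "avg(rate(container_cpu_system_seconds_total[5m])) > 0.9"
)
comment_traffic_low = (
    'avg(rate(http_requests_total{job="apiserver", handler="/api/comments"}[5m])) < 150'
)


def all_of(*clauses: str) -> str:
    """Join PromQL clauses with the `and` set operator (hypothetical helper)."""
    return " and ".join(clauses)


# The readable names disappear again as soon as the clauses are joined.
query = all_of(cpu_ratio_high, comment_traffic_low)
print(query)
```

The names help while the clauses are held separately, but the expression that ends up in the alerting rule is still the single long string, which is the form a reader has to maintain.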
Because large PromQL queries scale so poorly in readability, engineers are incentivized to write many small alerts instead. This means that for a given outage, dozens or perhaps hundreds of alerts might fire. This isn’t useful, nor is it actionable. Worse, it can cause alert fatigue.
As your engineering efforts scale, these issues translate into increased cost and complexity.
We need a tool that gives us the convenience of PromQL at a small scale but doesn’t burden us with the operational complexity of a multivariate PromQL alert. The answer is simple: we need a layer on top of PromQL that can orchestrate our modular alerts in many different complex ways, enabling us to grow and scale our alerts in response to our system demands without worrying about code complexity.
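Let’s try something more complex. Imagine we want to fire an alert if average CPU utilization rises above 90% over five minutes and, after that, average request latency increases sharply while the error rate also increases.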
To implement this in PromQL, you would first need to capture the average CPU utilization and test to see if it increased over 90% in the last five minutes:
avg(rate(container_cpu_usage_seconds_total[5m])) / avg(rate(container_cpu_system_seconds_total[5m])) > 0.9
Next, we need to check if the average request latency has increased sharply AND the error rate has increased.
increase(http_response_took_cx_avg[5m]) > 1000 and increase(http_error_perc_total[5m]) > 5
We now need to join these together so that they fire in sequence. Unfortunately, this is where we need to draw the line. After all of the work of putting these queries together, we can’t orchestrate PromQL queries over time. It’s simply not something the engine supports.
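To make "fire in sequence" concrete, here is a minimal Python sketch of the kind of check we would want: one alert fires first, and the others follow within a time window. This is only an illustration of the ordering logic, not how Coralogix implements Flow Alerts. The alert names, the `AlertEvent` type, and the 300-second window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    name: str        # hypothetical alert name
    fired_at: float  # seconds since an arbitrary reference time


def sequence_fired(events, first, then, within_seconds):
    """Return True if `first` fires and every alert in `then` fires
    strictly afterwards, no later than `within_seconds` after `first`."""
    for start in (e for e in events if e.name == first):
        deadline = start.fired_at + within_seconds
        followers = {
            e.name
            for e in events
            if e.name in then and start.fired_at < e.fired_at <= deadline
        }
        if followers == set(then):
            return True
    return False


followers = ["latency_spike", "error_rate_up"]

# CPU alert first, then latency and error rate within five minutes.
ordered = [
    AlertEvent("cpu_high", 0),
    AlertEvent("latency_spike", 120),
    AlertEvent("error_rate_up", 180),
]
# Same alerts, but the CPU alert fires last.
unordered = [
    AlertEvent("latency_spike", 0),
    AlertEvent("error_rate_up", 60),
    AlertEvent("cpu_high", 120),
]

in_order = sequence_fired(ordered, "cpu_high", followers, 300)
out_of_order = sequence_fired(unordered, "cpu_high", followers, 300)
print(in_order, out_of_order)
```

The check depends on when each alert fired relative to the others, which a single PromQL expression evaluated at one point in time does not express.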
In Flow Alerts, we first need to define the correct alarms. We can do this by breaking up our query into three clear components: a CPU usage alarm, a request latency alarm, and an error rate alarm.
Next, we declare those as alerts in their own right, on the Coralogix platform. For example, using the CPU usage alarm, that looks something like this:
Link to GIF: Screen-recording-clean (1).gif
And then we string those alerts together into a single flow alert. The flow alert is a simple, low-code alerting interface that allows you to create powerful relationships between individual alerts, to describe the full story of an incident as it travels through your system.
Link to GIF: Flow Alert Clean.gif
On top of this, as you build your alert, you also get a living process diagram. This makes maintenance and handover far more straightforward. It also greatly simplifies the process of creating these alerts. If the basic building blocks are all in place, your Flow Alert could be built by anyone who can think logically about what your alert needs to do. It removes the stress of trying to break down and understand large, complex PromQL queries.
| Feature | PromQL + Alertmanager | PromQL + Flow Alerts |
| --- | --- | --- |
| Easily declare simple alerting cases | ✅ | ✅ |
| Sequence multiple alerts over time | ❌ | ✅ |
| Create alerts based on logs, metrics, traces, and security data | ❌ | ✅ |
| Generate a clear diagram of your alert as you build it | ❌ | ✅ |
| Logically connect multiple PromQL statements together, using a visual UI | ❌ | ✅ |
PromQL offers a remarkable level of flexibility and control over your alerting, and at Coralogix, we understand how important it is for engineers to be able to do what they do best. That’s why PromQL is fully compatible with Flow Alerts. The Flow Alerts UI also safeguards against the weaknesses of PromQL: you can build more complex alerting flows, using features not available in Alertmanager, and sequence your alerts into a single, coherent story that describes an incident from end to end.