Adopting monitoring frameworks - LETS
You're looking for the best ways to monitor software and non-software systems. Despite your investment in a range of tools, the sheer number of possible metrics sometimes makes it unclear exactly what you should look at. Even with curated dashboards, it can be hard to tell what is important.
You've learned that the four "Golden Signals" of Latency, Errors, Traffic, and Saturation (LETS) provide a generic framework for understanding your software and infrastructure. You're interested in whether this framework can also be applied to scenarios outside of software.
A non-software LETS example
Let’s use a hypothetical non-software example to illustrate the power of the Golden Signals. Imagine you run a busy restaurant. The restaurant seems to be doing well, but you don’t know where to look to make improvements or cut costs. You decide to start measuring, but you're not sure how to decide what to measure.
Applying LETS in this scenario helps you focus on:
- Latency: How long does it take to get food to a customer? Understanding latency metrics will help you decide if you need to hire more cooks and servers, or upgrade your equipment.
- Errors: How often are we unable to prepare a meal, or forced to comp one? Understanding errors will help you measure improvements from better training, staffing, and equipment.
- Traffic: How many customers are we taking in, and when? Understanding traffic helps you understand how many staff you need, when you need them most, and when you can schedule fewer. Measuring customer traffic may even help you decide when it is time to expand.
- Saturation: How many meals can employees cook and serve at the same time? Understanding saturation helps uncover scheduling deficiencies, issues preparing certain popular dishes in parallel, and other unknown efficiency gaps.
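To make the four signals concrete, here is a minimal sketch that computes each one from a hypothetical order log for this imaginary restaurant. All of the records, field names, and the kitchen capacity figure are made up for illustration.

```python
from statistics import median

# Hypothetical order log for one dinner service: when each order was placed,
# when it was served (minutes into the service), and whether it was comped.
orders = [
    {"placed": 0,  "served": 18, "comped": False},
    {"placed": 5,  "served": 21, "comped": False},
    {"placed": 10, "served": 40, "comped": True},   # sent back and remade
    {"placed": 12, "served": 27, "comped": False},
]

# Latency: how long a customer waits for food.
wait_times = [o["served"] - o["placed"] for o in orders]
print("median wait (min):", median(wait_times))

# Errors: fraction of meals we had to comp.
error_rate = sum(o["comped"] for o in orders) / len(orders)
print("comp rate:", error_rate)

# Traffic: orders per hour during this service window.
window_minutes = max(o["served"] for o in orders) - min(o["placed"] for o in orders)
print("orders/hour:", len(orders) / (window_minutes / 60))

# Saturation: peak number of meals in flight at once vs. kitchen capacity.
KITCHEN_CAPACITY = 3  # assumed: max simultaneous meals the kitchen handles well
peak = max(
    sum(1 for o in orders if o["placed"] <= t < o["served"])
    for t in range(window_minutes + 1)
)
print("peak concurrency:", peak, "of", KITCHEN_CAPACITY)
```

In this toy data set, the peak of four simultaneous meals exceeds the assumed capacity of three, which is exactly the kind of saturation gap that measuring, rather than guessing, uncovers.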
These are all operational aspects you might have guessed at as a restaurant owner. But without measuring them, how would you know for sure?
These basic concepts provide a basis for understanding complex systems in general, like this imaginary restaurant. But where they really shine is in monitoring complex software architectures.
LETS in connected systems
In the age of microservices, aiming to develop specific domain knowledge of every element of a software system may be impractical. Applying the concept of LETS can provide the foundation for basic troubleshooting where issues arise in a complex system.
If you're an analyst who isn’t an expert on a service, you can use Latency, Errors, Traffic, and Saturation to more readily identify issues in connected systems. The types of insights you might find are:
- Latency appears to be much higher than normal to the database. Is that DB on-prem or in us-east-1?
- Errors are spiking after that last deployment. We should roll back.
- Traffic has totally dropped off at the load balancer! Did our cert expire?
- Saturation seems to be increasing more quickly than usual and we’ll run out of storage soon.
Using this framework allows you to quickly check known points of failure before diving down rabbit holes.
Using LETS metrics
You can use Distributed Tracing to find the metrics you need to use in a LETS framework. Distributed Tracing is concerned with the latency, errors, and traffic of requests traversing a system. When you feed your tracing data (sometimes called APM data) into a solution like Splunk APM you’ll start to get metrics right away that help you answer the LET part of LETS.
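The L, E, and T signals fall out of trace data almost directly. As a sketch, the records below stand in for tracing (APM) spans; the field names and values are assumptions for illustration, not Splunk APM's actual schema or API.

```python
# Illustrative span records: each captures which service handled a request,
# how long it took, and whether it failed. Field names are assumed.
spans = [
    {"service": "checkout", "duration_ms": 120, "error": False},
    {"service": "checkout", "duration_ms": 450, "error": True},
    {"service": "checkout", "duration_ms": 95,  "error": False},
    {"service": "payments", "duration_ms": 80,  "error": False},
]

def let_metrics(spans, service):
    """Derive the L, E, and T of LETS for one service from raw spans."""
    mine = [s for s in spans if s["service"] == service]
    durations = sorted(s["duration_ms"] for s in mine)
    return {
        "latency_p50_ms": durations[len(durations) // 2],
        "error_rate": sum(s["error"] for s in mine) / len(mine),
        "traffic": len(mine),  # request count in this window
    }

print(let_metrics(spans, "checkout"))
```

An APM solution performs this kind of aggregation for you continuously and at much higher fidelity, but the underlying idea is the same: spans carry enough information to answer latency, error, and traffic questions out of the box.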
Saturation metrics are affected by your software and design decisions. Some examples of saturation to consider are:
- Is your software CPU bound? Does it rely on a certain amount of CPU power being available at a given time?
- What about memory? Would increased memory usage cause your software to be killed by an out-of-memory (OOM) event, resulting in failures?
- Is storage your concern? Maybe a DB, network disk, or even local disk is at risk of filling up?
- Are you running enough hosts (containers/VMs/etc) to service all of your traffic?
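Two of the saturation checks above can be sketched in a few lines of Python. The 80% disk threshold, the request rates, and the per-host capacity figures are all assumptions to make the example concrete; real values depend on your environment.

```python
import shutil

# Storage saturation: how full is the disk? The alert threshold is an
# assumption; choose one based on how quickly your disk actually fills.
DISK_ALERT_THRESHOLD = 0.80

usage = shutil.disk_usage("/")  # POSIX path; use a drive letter on Windows
disk_fill = usage.used / usage.total
if disk_fill > DISK_ALERT_THRESHOLD:
    print(f"storage saturation: {disk_fill:.0%} full, investigate soon")
else:
    print(f"storage at {disk_fill:.0%} of capacity")

# Host-count saturation: are there enough workers for the traffic?
# All values here are hypothetical.
requests_per_sec = 900
capacity_per_host = 250
hosts = 4
utilization = requests_per_sec / (capacity_per_host * hosts)
print(f"fleet utilization: {utilization:.0%}")
```

A fleet running at 90% of theoretical capacity has very little headroom for a traffic spike, which is the kind of conclusion a saturation chart makes visible before the failure happens.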
The answers to some of the above are likely “no” for any given application in your environment. But taking the time to think about these questions and map out where resource constraints and saturation may cause failures helps reduce chart clutter and increase troubleshooting speed. Identifying the “known knowns”, or what you know already, will help you start to focus on the issues at hand and reduce side tracking.
You can apply LETS to look at latency, errors, and traffic between microservices, data centers, or even between individual software components. Applying these methods across microservices that share common infrastructure patterns (e.g., JVMs running on EC2 and using DynamoDB, Python-based Cloud Functions with a Cloud SQL datastore, or any other repeatable combination) will also allow you to minimize problems like dashboard and alert bloat. Imagine a single dashboard containing LETS charts for each piece of commonly used infrastructure. By including a dimension like servicename across all those metrics, that single dashboard can be filtered to quickly view a large swath of your microservices footprint. Alerts can be minimized similarly by focusing on the LETS fundamentals and repeatable infrastructure patterns.
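The dimension-filtering idea can be sketched as follows. The datapoint shape and metric names are assumptions for illustration; real observability platforms attach dimensions at emit time and do this filtering server-side.

```python
from collections import defaultdict

# Hypothetical metric datapoints, each tagged with a "servicename" dimension.
datapoints = [
    {"metric": "error_rate",  "servicename": "cart",     "value": 0.01},
    {"metric": "error_rate",  "servicename": "checkout", "value": 0.12},
    {"metric": "latency_p99", "servicename": "cart",     "value": 300},
    {"metric": "latency_p99", "servicename": "checkout", "value": 950},
]

def dashboard_filter(points, servicename):
    """One generic LETS dashboard, filtered down to a single service."""
    view = defaultdict(list)
    for p in points:
        if p["servicename"] == servicename:
            view[p["metric"]].append(p["value"])
    return dict(view)

print(dashboard_filter(datapoints, "checkout"))
# → {'error_rate': [0.12], 'latency_p99': [950]}
```

Because every service emits the same metric names with a distinguishing dimension, one dashboard definition serves the whole fleet; switching services is a filter change, not a new set of charts.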
The content in this article comes from a previously published blog, one of the thousands of Splunk resources available to help users succeed. In addition, these resources might help you understand and implement this guidance: