Splunk Lantern

Monitoring Postgres with OpenTelemetry

 

This article explores monitoring the open-source relational database PostgreSQL. PostgreSQL is widely used in enterprise applications for its scalability, extensibility, and strong community support, and its statistics collector exposes extensive internal server activity statistics. This guide collects those statistics with the OpenTelemetry Collector, focusing on database and infrastructure monitoring in Splunk Observability Cloud, and then explores how database performance connects to application performance data.

Step 1: Identify critical metrics

Monitoring database metrics is essential for identifying performance issues, optimizing queries, and ensuring database reliability. PostgreSQL generates a large volume of metrics, so it is important to focus on key operational metrics like those related to:

  • Query performance: Query throughput/latency, locks, query errors, index hit rate.
  • Resource utilization: Connections, CPU, memory, disk space, table/index size, disk I/O, cache hits.
  • Database health: Replication lag, deadlocks, rollbacks, autovacuum performance.
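With the OpenTelemetry PostgreSQL receiver used later in this guide, the categories above map roughly to receiver metrics like the following. This grouping is an illustrative sketch, not an exhaustive list; verify the exact metric names against your collector version's documentation.

```yaml
# Rough mapping of the categories above to OpenTelemetry
# postgresql receiver metrics (illustrative; verify names
# against your collector version):
query_performance:
  - postgresql.operations          # inserts/updates/deletes/fetches
  - postgresql.index.scans         # index scans per second
resource_utilization:
  - postgresql.backends            # active client connections
  - postgresql.db_size             # database disk usage
  - postgresql.table.size
  - postgresql.blocks_read         # disk vs. cache reads
database_health:
  - postgresql.rollbacks
  - postgresql.deadlocks
  - postgresql.table.vacuum.count  # autovacuum activity
```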

Query performance

Slow, resource-intensive queries, or queries with very high throughput, can increase the response time of your applications and degrade user experience. To prevent issues like slow page load times, focus on metrics related to query time – total response time, index scans per second, and database latency. These metrics indicate whether your database has the right indexes, whether indexes are missing, whether your tables are fragmented, or whether you have too many locks.

Resource utilization

Exceeding resource thresholds can halt application operations altogether. If total active connections are too high, resources might be exhausted and users might not be able to interact with your application at all. Monitoring resource usage like CPU, memory, and table/index size can keep your databases up and running, while also allowing for accurate capacity planning and optimal user experience.

Database health

Conditions like a high rollback to commit rate can indicate user experience issues. For example, users might be unable to complete product checkout on an e-commerce site. An increase in the number of dead rows can lead to degraded query performance or resource exhaustion with similar effects. Proactively monitoring these metrics helps you identify inefficiencies, eliminate bottlenecks, reduce database bloat, and ultimately improve user experience.

Step 2: Collect PostgreSQL metrics using the OpenTelemetry Collector

To collect and export PostgreSQL metrics, follow these steps:

Install the OpenTelemetry Collector

  1. Download and install the OpenTelemetry Collector.
  2. For the Splunk Distribution of the OpenTelemetry Collector, follow the guided installation instructions.
  3. Use Docker Compose to deploy PostgreSQL and the OpenTelemetry Collector.

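A minimal docker-compose.yaml for this setup might look like the following. The image tags, service names, and credentials here are illustrative assumptions, not values from the article; adjust them for your environment.

```yaml
# docker-compose.yaml — illustrative sketch; service names and
# credentials are assumptions, adjust for your environment.
version: "3.8"
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: otelu          # monitoring user (assumed name)
      POSTGRES_PASSWORD: otelp
      POSTGRES_DB: mydb
    ports:
      - "5432:5432"

  otel-collector:
    image: quay.io/signalfx/splunk-otel-collector:latest
    command: ["--config=/etc/otel/otel-collector-config.yaml"]
    environment:
      SPLUNK_ACCESS_TOKEN: ${SPLUNK_ACCESS_TOKEN}
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/otel-collector-config.yaml
    depends_on:
      - postgres
```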

Configure the OpenTelemetry Collector

  1. Make the following configurations in your existing OpenTelemetry Collector configuration file, or create an empty otel-collector-config.yaml file if you don't already have one.
    1. Add the PostgreSQL receiver under the receivers block.
    2. Define an exporter to send data to Splunk Observability Cloud.
    3. Ensure the receiver and exporter are included in the service pipeline.

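Putting those three pieces together, the collector configuration might look like the sketch below. The endpoint, credentials, databases list, and realm are assumptions for illustration; substitute your own values.

```yaml
# otel-collector-config.yaml — illustrative sketch; endpoint,
# credentials, and realm are assumptions for this example.
receivers:
  postgresql:
    endpoint: postgres:5432
    transport: tcp
    username: otelu
    password: ${env:POSTGRES_PASSWORD}
    databases:
      - mydb
    collection_interval: 10s
    tls:
      insecure: true   # acceptable only inside a trusted network

exporters:
  signalfx:
    access_token: ${env:SPLUNK_ACCESS_TOKEN}
    realm: us0         # your Splunk Observability Cloud realm

service:
  pipelines:
    metrics:
      receivers: [postgresql]
      exporters: [signalfx]
```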

Generally, your database and microservices sit behind network and API security layers, so your databases and services talk to each other unencrypted; that is why this example sets tls to insecure: true. If your database requires an authenticated connection, supply a certificate in the receiver's tls configuration instead.

Finally, build, start, or restart your service (you can use docker compose up --build for this) and watch your database metrics flow into Splunk Observability Cloud.

Step 3: Visualize PostgreSQL data in Splunk Observability Cloud

After metrics are collected, you can view them in Splunk Observability Cloud:

Analyze database metrics

  1. Navigate to the Datastores section in Splunk Infrastructure Monitoring.
  2. Select PostgreSQL databases for database-level metrics, or PostgreSQL hosts for metrics related to the infrastructure hosting your PostgreSQL databases.


  3. Open the PostgreSQL databases navigator to view metrics related to all of your databases.


  4. Review key metrics such as total operations, index scans per second, and rollbacks. Total operations show you at a glance whether your database resources can handle the current workload intensity. A drop in index scans per second can suggest you're not using indexes efficiently. Databases with a high number of rollbacks could be experiencing an increase in transaction failures or deadlocks. All of these issues can lead to slow or unreliable performance for your users.


  5. Click into your database to see database-specific metrics. Monitoring index size lets you optimize resources and right-size your indexes, while dead row monitoring helps ensure vacuuming is keeping table bloat down and performance up. If you're seeing, say, 18 total operations per second but 0 index scans per second, your queries probably aren't using indexes, pointing to query performance inefficiencies.


  6. To ensure your system can handle current workloads and maintain consistent performance, click into the PostgreSQL hosts navigator to view metrics like changes in operations per second, transactions, and disk usage.


  7. You can also click into a specific host to view individual host metrics like how many transactions succeeded, failed, or were rolled back, as well as the cache hits versus disk hits, all of which impact overall performance.


Step 4: Investigate query performance in APM

Your database monitoring journey will most likely start at the service level, or with the applications your databases back, so let's explore query performance and how to view its impact on overall application performance. To analyze query performance:

  1. From the PostgreSQL host navigator, select a host.
  2. View logs or related content in Splunk Application Performance Monitoring to see which services have a dependency on the currently selected host.


  3. Navigate to Database Query Performance to view and analyze query time, latency, and errors to see which specific areas are impacting response time and user experience, and see where you might be able to optimize your query performance.


  4. Navigate to the Service Map to investigate specific errors, traces, or related logs.


In this example, we moved from Splunk Infrastructure Monitoring to Splunk Application Performance Monitoring, but you could just as easily have started with your Service Map and begun troubleshooting database performance issues from there using Database Query Performance, database requests/errors, or traces.

Next steps

Monitoring key metrics from the databases that power your applications is critical to the performance and reliability that your users count on. Configuring the OpenTelemetry Collector to receive PostgreSQL telemetry data and export this data to a backend observability platform is an easy process that provides invaluable visibility into the databases that back your services.

These resources might help you understand and implement this guidance: