PostgreSQL integration with KloudMate
The PostgreSQL integration with KloudMate enables you to monitor the health and performance of your PostgreSQL databases in near real-time. With the integration of OpenTelemetry, you can collect vital metrics, track slow queries, and set up automated alerts to keep your team informed about potential performance issues. This document outlines how to set up the integration, monitor key metrics, and leverage additional features like slow query logging for comprehensive database monitoring.
- Real-Time Monitoring: Gain insights into your PostgreSQL database's health and performance with near real-time metrics.
- OpenTelemetry Integration: Leverage OpenTelemetry to collect and visualize metrics, traces, and logs from PostgreSQL.
- Slow Query Logging: Identify slow queries affecting your database performance.
- Dashboards: Visualize PostgreSQL performance metrics and set thresholds for key indicators like query response time, CPU usage, disk I/O, and more.
- Alerts: Create alerts to notify your team about performance issues such as high CPU usage, long-running queries, and storage bottlenecks.
Pre-requisites:
- PostgreSQL installed and running
- Install the OpenTelemetry collector on the specific server that requires monitoring. See Installing and Configuring OpenTelemetry Collector.
- PostgreSQL username and password for authentication.
Step 1: Configure the Receivers to scrape metrics as well as Logs.
To start monitoring PostgreSQL with Otel Collector, you need to configure PostgreSQL receiver
- Linux Users: Open the file located at /etc/otelcol-contrib/config.yaml using your preferred text editor.
- Windows Users: Create a new file called config.yaml in the C:\Program Files\OpenTelemetry Collector folder. You can use Notepad or any text editor to do this.
Add the suitable extensions :
In this step, two receivers are integrated to retrieve telemetry data from PostgreSQL.
- PostgreSQL Receiver: To retrieve metrics(health and performance)
- File Log Receiver: To retrieve logs(general logs, error logs, and slow query logs)
To get the metrics use only the postgresql receiver
Step 2: Set up the processor component to identify resource information from the host and either append or replace the resource values in the telemetry data with this information.
Please choose one of the following configuration options based on your provider (AWS EC2, Azure VM, or on-premises server).
- Server(Can be on-premise, non-cloud, or cloud)
- AWS EC2:
(Optional) To retrieve AWS EC2 instance tags along with logs and metrics, users need to associate an IAM role with the EC2 instance that includes the EC2:DescribeTags policy.
- Azure Virtual Machines:
Step 3: Configure the exporter, and extension and save the configuration
Set up the KloudMate Backend on the exporter part of the Open Telemetry configuration file and configure the pipeline.
Step 4: To restart and verify the status of the OpenTelemetry (Otel) Collector, follow these steps:
For Linux:
- Execute the following commands:
These commands will restart the Otel Collector and display its current status.
For Windows:
- Open the Services window:
- Press Win + R, type services.msc, and press OK.
- Alternatively, search for "Services" in the Windows Start menu.
- In the Services window, locate the "OpenTelemetry Collector" service.
- Right-click the service and select "Restart."
Subsequently, monitor the metrics on the KloudMate dashboard and set up an alarm to receive notifications if the potential metrics for a specific application rise.
Name | Description | Type | Unit |
---|---|---|---|
postgresql.backends | The number of backends | SUM | NONE |
postgresql.connection.max | Configured the maximum number of client connections allowed | GAUGE | NONE |
postgresql.database.count | Number of user databases | SUM | NONE |
postgresql.db_size | The database disk usage | SUM | Bytes |
postgresql.index.scans | The number of index scans on a table | SUM | NONE |
postgresql.index.size | The size of the index on the disk | GAUGE | Bytes |
postgresql.operations | The number of db row operations | SUM | NONE |
postgresql.replication.data_delay | The amount of data delayed in replication | GAUGE | Bytes |
postgresql.rollbacks | The number of rollbacks | SUM | NONE |
postgresql.rows | The number of rows in the database | SUM | NONE |
postgresql.table.count | Number of user tables in a database | SUM | NONE |
postgresql.table.size | Disk space used by a table | SUM | Bytes |
postgresql.table.vacuum.count | Number of times a table has manually been vacuumed | SUM | NONE |
postgresql.wal.age | Age of the oldest WAL file. This metric requires WAL to be enabled with at least one replica. | GAUGE | seconds |
postgresql.wal.lag | The time between flushing the recent WAL locally and receiving notification that the standby server has completed an operation on it. | GAUGE | seconds
|
postgresql.bgwriter.buffers.allocated | Number of buffers allocated | SUM | NONE |
postgresql.bgwriter.buffers.writes | Number of buffers written | SUM | NONE |
postgresql.bgwriter.checkpoint.count | The number of checkpoints performed | SUM | NONE |
postgresql.bgwriter.duration | Total time spent writing and syncing files to disk by checkpoints | SUM | Milliseconds |
postgresql.bgwriter.maxwritten | Number of times the background writer stopped a cleaning scan because it had written too many buffers | SUM | NONE |
postgresql.blocks_read | The number of blocks read | SUM | NONE |
postgresql.commits | The number of commits | SUM | NONE |
postgresql.deadlocks | The number of deadlocks. | SUM | NONE |
postgresql.database.locks | The number of database locks. | GAUGE | COUNT |