Monitoring Metrics

You can check the status and available resources of a cluster and learn about its real-time resource consumption through the GaussDB(DWS) monitoring items.

Table 1 describes GaussDB(DWS) monitoring metrics.

Table 1 GaussDB(DWS) monitoring metrics

Monitored Object

Metric

Value Range

Monitoring Interval (Raw Data)

Metric Usage

Recommended Threshold

Abnormal Job Monitoring Statistics

Query ID

String

30s

Collect data regarding abnormal jobs whose status is aborted in the PGXC_WLM_SESSION_HISTORY view.

N/A

Statement executed for exception handling

String

Block time before the statement is executed

>= 0

Elapsed time when the statement is executed

>= 0

Total time used by the CPU on the DN when the statement is executed for exception handling

>= 0

CPU usage skew on the DN when the statement is executed for exception handling

>= 0

cgroups used for exception handling during statement execution

String

Status of a statement after exception handling

String

Exception handling action executed by the statement

String

Reason why the statement is processed abnormally

String

Node Status Statistics

Host name

String

60s

Gather information on the current state and specifics of each ECS host instance (VM) in a cluster.

N/A

Host status

String

Instances by Status

Host name

String

60s

Gather information on the current state and specifics of each CN/DN instance in the cluster.

N/A

Instance type

String

Instance role

String

Instance status

String

Cause of the instance status.

String

Cluster Status

Cluster status

String

30s

Monitor the cluster status.

N/A

Whether an primary/standby switchover has occurred

String

Whether redistribution has occurred

String

Whether the current cluster is read-only

String

CPU usage

The default value is ALL.

String

30s

Gather CPU usage data from ECS instances (VMs) to monitor node CPU usage. High CPU usage can lead to SQL queuing and slow queries.

85%

User-mode CPU time %

>= 0.0

CPU time the process with a negative nice value (%)

>= 0.0

Kernel mode time (%)

>= 0.0

I/O wait time (%)

>= 0.0

Hard interruption time (%)

>= 0.0

Software interrupt time (%)

>= 0.0

Time spent by the virtual CPU in involuntary waiting when the virtual machine manager serves another virtual processor (%)

>= 0.0

Time spent running the virtual processor (%)

>= 0.0

Idle time except for disk waiting operations (%)

>= 0.0

Hyper-threading capability

Yes/No

Hyper-threading Enabled (Yes/No)

Yes/No

Number of processes in the runnable status

>= 0

Number of processes waiting for I/O completion

>= 0

Active Sessions

Database name

String

30s

Collect information on the current cluster's active sessions.

N/A

Instance name

String

Number of all user sessions

>= 0

Distinct username

>= 0

Distinct application name

>= 0

Number of queries in the active or fastpathfunctioncall state.

>= 0

Disk Capacity Statistics

Instance name

String

86400s

Collect statistics on the disk space used by each database in the current cluster.

N/A

Database name

String

Database size

>= 0

Transaction Status

Database name

String

60s

Gather data on the operational status of databases in the current cluster, such as the number of updated, deleted, and inserted rows, transactions, and deadlocks.

N/A

Instance name

String

Number of rows returned through a global database scan

>= 0

Number of rows returned by querying indexes in the database

>= 0

Number of rows inserted by queries in this database

>= 0

Number of rows updated by queries in this database

>= 0

Number of rows deleted by queries in this database

>= 0

Number of transactions in this database that have been committed

>= 0

Number of transactions in this database that have been rolled back

>= 0

Number of deadlocks detected in this database

>= 0

Number of disk blocks read in this database

>= 0

Number of disk blocks found in the buffer cache in the current database, that is, the number of blocks hit in the cache. (This only includes hits in the GaussDB(DWS) buffer cache, not in the file system cache.)

>= 0

Time spent reading data file blocks by backends in this database, in milliseconds

>= 0.0

Time spent reading data file blocks by backends in this database, in milliseconds

>= 0.0

Number of queries canceled due to database recovery conflicts (conflicts that occur only on the standby server).

>= 0

Number of temporary files created by queries in this database. This parameter calculates all temporary files (such as sorting or hashing) and ignores the log_temp_files setting.

>= 0

Total amount of data written to temporary files by queries in this database. This parameter calculates all temporary files and ignores the log_temp_files setting.

>= 0

Database capacity, in bytes.

>= 0

Number of rows returned by global database scanning in a unit time

>= 0

Number of rows returned by querying indexes in the database in a unit time

>= 0

Number of rows inserted through database query in a unit time

>= 0

Number of rows updated through database query in unit time

>= 0

Number of rows deleted by querying the database in a unit time

>= 0

Number of transactions that have been submitted in the database per unit time

>= 0

Number of transactions that have been rolled back in the database per unit time

>= 0

Number of deadlocks retrieved in the database per unit time

>= 0

File Handle

Name of the disk file system

String

30s

Gather data on the disk inode information of the cluster to monitor inode usage. High inode usage can pose risks.

90%

Total inode capacity (unit: KB)

>= 0

Used capacity (unit: KB)

>= 0

Node Disk Usage

Name of the disk file system

>= 0

30s

Track the disk usage of every ECS instance and switch the cluster to read-only mode when it hits 90%.

90%

Total space (unit: KB)

>= 0

Used capacity (unit: KB)

>= 0

Available capacity (unit: KB)

>= 0

Disk usage

>= 0

gsar NIC Usage Statistics

Node name

String

30s

Monitor the running status of the gsar NIC.

N/A

NIC name

String

NIC IP address

String

Data received by the NIC (unit: KB)

>= 0

Number of packets received by the NIC

>= 0

Average length of received packets (unit: byte)

>= 0

Number of received data packets that are dropped by the NIC

>= 0

Port transmit discard rate

>= 0.0

Data sent by the network adapter (unit: KB)

>= 0

Number of packets sent by the NIC

>= 0

Average length of received packets (unit: byte)

>= 0

gsar TCP Statistics

Number of retransmitted TCP packets due to timeout

>= 0

30s

Monitor the TCP retransmission rate.

Number of retransmitted TCP packets > 0

Number of sent TCP packets

>= 0

Number of retransmitted TCP packets

>= 0

TCP retransmission rate

>= 0.0

Node Disk I/O Statistics

Disk name (devicename)

String

30s

Keep track of the I/O status of every disk on a node, which is indicated by the data transfer rates and the number of read and write operations. Excessive data transfer rates could affect cluster services.

350 MB/s

Number of transmissions per second (transferpersecond). The size of each transmission is unknown.

>= 0.0

Amount of data read from the device per second (unit: KB)

>= 0.0

Amount of data written to the device per second (unit: KB)

>= 0.0

Total amount of read data (unit: KB)

>= 0.0

Total amount of written data (unit: KB)

>= 0.0

Number of times that read requests to the device are combined per second

>= 0.0

Number of times that write requests to the device are combined per second

>= 0.0

Number of completed reads per second

>= 0.0

Number of completed writes per second

>= 0.0

Amount of data read per second (unit: KB)

>= 0.0

Amount of data written per second (unit: KB)

>= 0.0

Average data volume of each I/O operation (unit: sector)

>= 0.0

Average request queue length

>= 0.0

Average waiting time for each I/O request (unit: ms)

>= 0.0

Average processing time for each I/O request (unit: ms)

>= 0.0

Percentage of the time when the I/O queue is not empty (I/O operation time divided by the total time)

>= 0.0

Instance Memory Monitoring Statistics

Instance name

String

60s

Monitor instance and dynamic memory, gather memory usage statistics for each CN and DN. If the instance memory usage goes beyond the threshold, there may not be enough instance memory in the cluster.

85%

Memory size occupied by the instance

>= 0.0

Memory size used by a process

>= 0.0

Maximum dynamic memory

>= 0.0

Used dynamic memory

>= 0.0

Dynamic peak memory

>= 0.0

Maximum dynamic shared memory context

>= 0.0

Dynamic peak value of the shared memory context

>= 0.0

Maximum shared memory

>= 0.0

Used shared memory

>= 0.0

Maximum memory allowed by column store

>= 0.0

Memory used in column store

>= 0.0

Maximum memory that can be used by the communication library

>= 0.0

Used memory size of the communication library

>= 0.0

Peak memory usage of the communication library

>= 0.0

Maximum memory that can be used by top SQLs to record historical job monitoring information

>= 0.0

Peak memory usage of the top SQL that records historical job monitoring information

>= 0.0

Memory used by top SQLs to record historical job monitoring information

>= 0.0

Other used memory

>= 0.0

Memory size occupied by pooler connections

>= 0.0

Memory size occupied by pooler idle connections

>= 0.0

Memory size used by column-store compression and decompression

>= 0.0

Memory reserved for the UDFWorker process

>= 0.0

Memory size used by the MMAP

>= 0.0

Instance Resource Statistics

Instance name

String

60s

Gather resource usage statistics for each instance in a cluster.

85%

Read the value (CPU usage %) in postmaster.pID/cm_server.pID/gtm.pID/etcd.pID.

>= 0.0

Read the value (memory usage %) in postmaster.pID/cm_server.pID/gtm.pID/etcd.pID.

>= 0.0

Instance Disk Size Statistics

Instance name

String

86400s

Collect statistics on instance disk usage and monitor it.

85%

Storage location

String

Disk space used by all databases on the current instance

>= 0

Node Memory Statistics

Size of all available RAMs, that is, the remaining physical memory minus the reserved bits and kernel usage (unit: KB)

>= 0

30s

Gather memory usage statistics for the ECS instance where the cluster is located. This metric tracks statistics on the VM OS-level memory, which differs from the instance memory.

70%

Unused memory in the system. The value is lowfree+highfree (unit: KB).

>= 0

Size of the cache used for the block device (unit: KB)

>= 0

Size of the file buffer (unit: KB)

>= 0

Total swap space (unit: KB)

>= 0

Size of the RAM memory temporarily stored in the swap file (unit: KB)

>= 0

Memory size of the virus-infected page (unit: KB)

>= 0

Network Status Statistics

NIC name

String

30s

Collect NIC status for each node in the cluster to monitor lost packets on the cluster NIC and network throughput.

N/A

NIC status (up/down)

up/down

NIC speed (1,000 Mbit/s or 100 Mbit/s)

>= 0

Total data received by the NIC (unit: byte)

>= 0

Number of packets received by the NIC

>= 0

Total number of NIC receiving errors

>= 0

Number of received data packets that are dropped by the NIC

>= 0

Number of FIFO buffer errors during reception

>= 0

Number of received packet frame errors

>= 0

Number of received compressed data packets

>= 0

Number of received multicast frames

>= 0

Total data sent by the NIC (unit: byte)

>= 0

Total number of packets sent by the NIC

>= 0

Total number of NIC sending errors

>= 0

Total number of data packets discarded by the NIC during transmission

>= 0

Number of FIFO buffer errors during sending

>= 0

Number of collisions detected on the sending interface

>= 0

Number of carrier losses detected by the device driver during transmission

>= 0

Number of sent compressed data packets

>= 0

Specifies whether NIC multi-queue is supported.

Yes/No

NIC multi-queue is enabled.

Yes/No

Specifies the CPU affinity of a multi-queue NIC.

String

Indicates whether the NIC works in duplex mode.

String

Network speed

>= 0.0

Node SQL Statistics

Node name

String

60s

Use the PGXC_SQL_COUNT view to query the number of running SQL statements on each node and monitor the cluster's QPS.

N/A

Username

String

Number of SELECT statements

>= 0

Number of UPDATE statements

>= 0

Number of INSERT statements

>= 0

Number of DELETE statements

>= 0

Number of MERGEINTO statements

>= 0

Number of DDL statements

>= 0

Number of DML statements

>= 0

Number of DCL statements

>= 0

Total response time of SELECT statements

>= 0

Average response time of SELECT statements

>= 0

Maximum response time of SELECT statements

>= 0

Minimum response time of SELECT statements

>= 0

Total response time of UPDATE statements

>= 0

Average response time of UPDATE statements

>= 0

Maximum response time of UPDATE statements

>= 0

Minimum response time of UPDATE statements

>= 0

Total response time of DELETE statements

>= 0

Average response time of DELETE statements

>= 0

Maximum response time of DELETE statements

>= 0

Minimum response time of DELETE statements

>= 0

Total response time of INSERT statements

>= 0

Average response time of INSERT statements

>= 0

Maximum response time of INSERT statements

>= 0

Minimum response time of INSERT statements

>= 0

Difference between the numbers of SELECT statements at two time points

>= 0

Difference between the numbers of UPDATE statements at two time points

>= 0

Difference between the numbers of INSERT statements at two time points

>= 0

Difference between the numbers of DELETE statements at two time points

>= 0

Difference between the numbers of MERGE INTO statements at two time points

>= 0

Difference between the numbers of DDL statements at two time points

>= 0

Difference between the numbers of DML statements at two time points

>= 0

Difference between the numbers of DCL statements at two time points

>= 0

Difference between the total SELECT response time at two time points

>= 0

Difference between the average SELECT response time at two time points

>= 0

Difference between the maximum SELECT response time at two time points

>= 0

Difference between the minimum SELECT response time at two time points

>= 0

Difference between the total UPDATE response time at two time points

>= 0

Difference between the average UPDATE response time at two time points

>= 0

Difference between the maximum UPDATE response time at two time points

>= 0

Difference between the minimum UPDATE response time at two time points

>= 0

Difference between the total DELETE response time at two time points

>= 0

Difference between the average DELETE response time at two time points

>= 0

Difference between the maximum DELETE response time difference at two time points

>= 0

Difference between the minimum DELETE response time difference at two time points

>= 0

Difference between the total INSERT response time at two time points

>= 0

Difference between the average INSERT response time at two time points

>= 0

Difference between the maximum INSERT response time at two time points

>= 0

Difference between the minimum INSERT response time at two time points

>= 0

System Status Statistics

TCP protocol stack retransmission rate (%)

>= 0.0

30s

Collect the TCP protocol, stack protocol, and stack retransmission rate of the ECS instance server.

>0

Top SQL Statistics

Database name

String

60s

Collect SQL statements running on each CN in the current cluster to identify deadlock SQL statements, slow SQL statements, or SQL statements with high resource usage. SQL statements can be scanned and killed on the management plane.

N/A

Instance name

String

Thread ID (session ID and session ID)

String

Internal query_ID used for statement execution

String

Job type, which can be set using the guc parameter query_band. The default value is a null string.

String

The value is obtained from the query_band field. The position is 0.

String

The value is obtained from the query_band field. The position is 1.

String

Username used for connecting to the backend

String

Name of the application that is connected to the backend

String

IP address of the client connected to the backend. A null value suggests a Unix socket connection or an internal server process, such as autovacuum.

String

Host name of the connected client, as reported by a reverse DNS lookup of client_addr. This column will only be non-null for IP connections, and only when log_hostname is enabled.

String

TCP port number used by the client to communicate with the backend. If the Unix socket is used, the value is -1.

String

Whether the backend is currently waiting on a lock. If yes, the value is true.

Yes/No

Time when the statement starts to be executed

>= 0

Block time before the statement is executed. The unit is ms.

>= 0

Duration that a statement has been executed. The unit is ms.

>= 0

Estimated execution time of a statement. The unit is ms.

>= 0

Estimated remaining time of statement execution. The unit is ms.

>= 0

Resource status in workload management

String

Resource pool used by the user

String

Priority of a job in the resource pool. The options are as follows:

  • 1: Low

  • 2: Medium

  • 4: High

  • 8: Rush

>= 0

Cgroup used by the statement

String

Minimum memory peak of a statement across all DNs. The unit is MB.

>= 0

Maximum memory peak of a statement across all DNs. The unit is MB.

>= 0

Average memory usage during statement execution. The unit is MB.

>= 0

Memory usage skew of a statement among DNs.

>= 0

Estimated memory used by the statement. The unit is MB.

>= 0

Statement spill information on all DNs

String

Minimum spilled data among all DNs when a spill occurs. The default value is 0 (unit: MB).

>= 0

Maximum spilled data among all DNs when a spill occurs. The default value is 0 (unit: MB).

>= 0

Average spilled data among all DNs when a spill occurs. The default value is 0 (unit: MB).

>= 0

DN spill skew when a spill occurs

>= 0

Minimum execution time of a statement across all DNs. The unit is ms.

>= 0

Maximum execution time of a statement across all DNs. The unit is ms.

>= 0

Average execution time of a statement across all DNs. The unit is ms.

>= 0

Execution time skew of a statement among DNs.

>= 0

Minimum CPU time of a statement across all DNs. The unit is ms.

>= 0

Maximum CPU time of a statement across all DNs. The unit is ms.

>= 0

Total CPU time of a statement across all DNs. The unit is ms.

>= 0

CPU time skew of a statement among DNs.

>= 0

Warning. The following warnings and warnings related to SQL self-diagnosis tuning are displayed:

String

Average IOPS peak of a statement across all DNs. It is counted by ones in a column-store table and by ten thousands in a row-store table.

>= 0

I/O skew of a statement among DNs.

>= 0

Maximum IOPS of a statement across all DNs. It is counted by ones in a column-store table and by ten thousands in a row-store table.

>= 0

Minimum IOPS of a statement across all DNs. It is counted by ones in a column-store table and by ten thousands in a row-store table.

>= 0

Query statement

String

Query plan

String

Real-time running status of the current query statement. The value can be active, idle, idleintransaction, idleintransaction (aborted), fastpathfunctioncall, or disabled.

String

Running status of the query statement in the resource pool. The value can be pending, running, finished, aborted, active or unknown.

String

Statement attribute (ordinary, simple, complicated, or internal).

String

Fast and slow lanes (fast or slow)

String

Whether a query is a system query

Yes/No

Whether the query is a system query (adapting to monitor)

Yes/No

Time when this process was started, that is, when the client connected to the server

>= 0

Execution time so far

>= 0

Time when the current transaction was started (NULL if no transactions are active) If the current query is the first of its transaction, the value of this column is the same as that of the query_start column.

>= 0

Time when the status was changed in the previous time

>= 0

Time when the statement starts to be executed

>= 0

Actual execution duration of the statement, in seconds.

>= 0

Historical Queries

Database name

String

180s

Collect historical data from the top SQL view to analyze SQL statements from the past, identify deadlock or slow SQL statements, and optimize cluster performance by rectifying any issues found.

N/A

Instance name

String

Username

String

Name of the application that is connected to the backend

String

IP address of the client connected to the backend. A null value suggests a Unix socket connection or an internal server process, such as autovacuum.

String

Host name of the connected client, as reported by a reverse DNS lookup of client_addr. This column will only be non-null for IP connections, and only when log_hostname is enabled.

String

TCP port number used by the client to communicate with the backend. If the Unix socket is used, the value is -1.

String

Job type, which can be set using the guc parameter query_band. The default value is a null string.

String

The value is obtained from the query_band field. The position is 0.

String

The value is obtained from the query_band field. The position is 1.

String

Duration that a statement is blocked before being executed, including the statement parsing and optimization duration. The unit is ms.

>= 0

Execution start time of a statement (unit: ms)

>= 0

Execution end time of a statement (unit: ms)

>= 0

Duration that a statement has been executed (unit: ms)

>= 0

Estimated statement execution time (unit: ms)

>= 0

Statement execution end status:

  • Finished (normal)

  • Aborted (Abnormal)

String

Exception information displayed if the final statement execution status is aborted.

String

Resource pool used by the user

String

Priority of a job in the resource pool. The options are as follows:

  • 8: Rush

  • 4: High

  • 2: Medium

  • 1: Low

>= 0

Cgroup used by the statement

String

Minimum memory peak of a statement across all DNs. The unit is MB.

>= 0

Maximum memory peak of a statement across all DNs. The unit is MB.

>= 0

Average memory usage during statement execution. The unit is MB.

>= 0

Memory usage skew of a statement among DNs

>= 0

Statement spill information on all DNs

String

Minimum spilled data among all DNs when a spill occurs. The unit is MB. The default value is 0.

>= 0

Maximum spilled data among all DNs when a spill occurs. The unit is MB. The default value is 0.

>= 0

Average spilled data among all DNs when a spill occurs. The unit is MB. The default value is 0.

>= 0

DN spill skew when a spill occurs

>= 0

Minimum execution time of a statement across all DNs. The unit is ms.

>= 0

Maximum execution time of a statement across all DNs. The unit is ms.

>= 0

Average execution time of a statement across all DNs. The unit is ms.

>= 0

Execution time skew of a statement among DNs.

>= 0

Minimum CPU time of a statement across all DNs. The unit is ms.

>= 0

Maximum CPU time of a statement across all DNs. The unit is ms.

>= 0

Total CPU time of a statement across all DNs. The unit is ms.

>= 0

CPU time skew of a statement among DNs.

>= 0

Minimum IOPS of a statement across all DNs. It is counted by ones in a column-store table and by ten thousands in a row-store table.

>= 0

Maximum IOPS of a statement across all DNs. It is counted by ones in a column-store table and by ten thousands in a row-store table.

>= 0

Average IOPS peak of a statement across all DNs. It is counted by ones in a column-store table and by ten thousands in a row-store table.

>= 0

I/O skew of a statement among DNs

>= 0

Warning. The following warnings and warnings related to SQL self-diagnosis tuning are displayed:

String

Query ID

String

Statement executed

String

Execution plan of a statement

String

Logical cluster of the user running the statement

String

Schema Usage Statistics

Database name

String

3600s

Monitor schema usage in a cluster by collecting usage data for each schema.

85%

Schema name

String

Used capacity (unit: byte)

>= 0

Total capacity (unit: byte)

>= 0

Session Statistics

Database name

String

180s

Gather session information for each CN in a cluster, including statistics on idle sessions and lock holdings. Use the management console to clear any idle sessions.

N/A

Instance name

String

Thread ID (It can be used as a session ID or a connection ID.)

String

Database username

String

User application name

String

Client address

String

Host name of the client

String

TCP port number used by the client to communicate with the background. If the Unix socket is used, the value is -1.

String

Time when this process was started, that is, when the client connected to the server

>= 0

Execution time so far.

>= 0

Time when the current transaction was started (NULL if no transactions are active). If the current query is the first of its transaction, the value of this column is the same as that of the query_start column.

>= 0

Time when the status was changed in the previous time

>= 0

Whether the backend is currently waiting on a lock. If yes, the value is true.

Yes/No

Current overall state of this backend.

String

Resource pool used by the user

String

Actual execution duration of the statement, in seconds.

>= 0

ID of a query

String

Time when the statement starts to be executed

>= 0

cgroups currently used by the statement.

String

Lock type

String

Lock mode

String

Indicates whether to hold a lock when lock waiting exists. The value is true.

Yes/No

Resource that is waiting for the lock

String

Statement type

String

Query SQL.

String

Indicates whether the query is performed by the system.

Yes/No

Query plan

String

SQL Probe Statistics

Query ID of a probe task

String

30s

Monitor cluster performance by collecting statistics on the duration of SQL probe execution, to detect sudden deterioration.

N/A

Cluster ID

String

Cluster project ID

String

Task type of the probe SQL

String

Time when a probe SQL task is created

>= 0

SQL execution time

>= 0

probe_ID of the associated probe SQL

String

Probe

String

Task status of the SQL probe. The options are as follows:

  • Running

  • Success

  • Fail

String

Table Dirty Page Rate Statistics

Database name

String

7200s

Gather information on dirty pages for cluster tables, as a high dirty page rate can cause a decline in table query and insertion performance.

50%

Schema name

String

Table name (full name)

String

Table owner

String

Table size (unit: byte)

>= 0

Dirty page rate

>= 0.0

Table Skew Monitoring Statistics

Database name

String

7200s

Identify tables in the cluster with a skew rate exceeding 5%, as this can negatively affect query performance.

10%

Schema name

String

Table name (full name)

String

Table owner

String

Table size (unit: byte)

>= 0

Skew rate

>= 0.0

Resource Pool Statistics

Load resource pool

String

120s

Monitor system resource usage and queuing status of the cluster resource pool by collecting relevant information.

N/A

CPU quota of the resource pool

>= 0

Memory quota of the resource pool

>= 0

Disk quota of the resource pool

>= 0

Maximum number of concurrent simple jobs allowed by the resource pool

>= 0

Maximum number of concurrent queries allowed by the resource pool

>= 0

CPU usage of the resource pool

>= 0.0

Memory usage of the resource pool

>= 0.0

Disk usage of the resource pool

>= 0.0

Number of concurrent simple jobs in the resource pool

>= 0

Current number of concurrent requests in the resource pool

>= 0

Resource Pool User Statistics

Load resource pool

String

30s

Gather data on users in the cluster resource pool and track their resource usage.

N/A

CPU quota of a user.

>= 0

Memory quota of a user

>= 0

Specifies the disk quota of a user

>= 0

CPU usage of a user

>= 0.0

Memory usage of a user

>= 0.0

Disk usage of a user

>= 0.0

User ID

String

Username

String