Supported Events

Table 1 GaussDB

| Source | Namespace | Name | ID | Severity | Description | Handling Suggestion | Impact |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GaussDB | SYS.GAUSSDBV5 | Process status alarm | ProcessStatusAlarm | Major | Key GaussDB processes exit, including CMS/CMA, ETCD, GTM, CN, and DN processes. | Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers. | If processes on primary nodes are faulty, services are interrupted and then rolled back. If processes on standby nodes are faulty, services are not affected. |
| GaussDB | SYS.GAUSSDBV5 | Component status alarm | ComponentStatusAlarm | Major | Key GaussDB components do not respond, including CMA, ETCD, GTM, CN, and DN components. | Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers. | If processes on primary nodes do not respond, services do not respond either. If processes on standby nodes are faulty, services are not affected. |
| GaussDB | SYS.GAUSSDBV5 | Cluster status alarm | ClusterStatusAlarm | Major | The cluster is abnormal. Possible faults: the cluster is read-only; the majority of ETCD members are faulty; the cluster resources are unevenly distributed. | Contact SRE engineers. | If the cluster status is read-only, only read requests are processed. If the majority of ETCD members are faulty, the cluster is unavailable. If resources are unevenly distributed, the instance performance and reliability deteriorate. |
| GaussDB | SYS.GAUSSDBV5 | Hardware resource alarm | HardwareResourceAlarm | Major | A major hardware fault occurs in the instance, such as disk damage or a GTM network fault. | Contact SRE engineers. | Some or all services are affected. |
| GaussDB | SYS.GAUSSDBV5 | Status transition alarm | StateTransitionAlarm | Major | One of the following events occurs in the instance: DN build attempt, DN build failure, forcible DN promotion, primary/standby DN switchover/failover, or primary/standby GTM switchover/failover. | Wait until the fault is automatically rectified and check whether services are recovered. If not, contact SRE engineers. | Some services are interrupted. |
| GaussDB | SYS.GAUSSDBV5 | Other abnormal alarm | OtherAbnormalAlarm | Major | Disk usage threshold alarm. | Monitor workload changes and scale up storage as needed. | If the used space exceeds the threshold, storage cannot be scaled up. |
| GaussDB | SYS.GAUSSDBV5 | Instance running status abnormal | TaurusInstanceRunningStatusAbnormal | Major | This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure. | Submit a service ticket. | The database service may be unavailable. |
| GaussDB | SYS.GAUSSDBV5 | Instance running status recovered | TaurusInstanceRunningStatusRecovered | Major | GaussDB provides an HA tool to automatically or manually rectify the catastrophic fault. After the fault is rectified, this event is reported. | No further action is required. | None |
| GaussDB | SYS.GAUSSDBV5 | Node status abnormal | TaurusNodeRunningStatusAbnormal | Major | This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure. | Check whether the database service is available and submit a service ticket. | The database service may be unavailable. |
| GaussDB | SYS.GAUSSDBV5 | Node recovered | TaurusNodeRunningStatusRecovered | Major | GaussDB provides an HA tool to automatically or manually rectify the catastrophic fault. After the fault is rectified, this event is reported. | No further action is required. | None |
| GaussDB | SYS.GAUSSDBV5 | Instance creation failure | GaussDBV5CreateInstanceFailed | Major | Instances fail to be created because the quota is insufficient or underlying resources are exhausted. | Release the instances that are no longer used and try to provision new instances again, or submit a service ticket to adjust the quota. | Instances fail to be created. |
| GaussDB | SYS.GAUSSDBV5 | Node adding failure | GaussDBV5ExpandClusterFailed | Major | The underlying resources are insufficient. | Submit a service ticket asking O&M personnel to coordinate resources, delete the node that failed to be added, and then add a new node. | None |
| GaussDB | SYS.GAUSSDBV5 | Storage scale-up failure | GaussDBV5EnlargeVolumeFailed | Major | The underlying resources are insufficient. | Submit a service ticket. After O&M personnel coordinate resources in the background, scale up the storage space again. | Services may be interrupted. |
| GaussDB | SYS.GAUSSDBV5 | Reboot failure | GaussDBV5RestartInstanceFailed | Major | The network is abnormal. | Retry the reboot operation or submit a service ticket to the O&M personnel. | The database service may be unavailable. |
| GaussDB | SYS.GAUSSDBV5 | Full backup failure | GaussDBV5FullBackupFailed | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to O&M personnel. | Data cannot be backed up. |
| GaussDB | SYS.GAUSSDBV5 | Differential backup failure | GaussDBV5DifferentialBackupFailed | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to O&M personnel. | Data cannot be backed up. |
| GaussDB | SYS.GAUSSDBV5 | Backup deletion failure | GaussDBV5DeleteBackupFailed | Major | Backup files fail to be cleared. | Submit a service ticket to O&M personnel. | There may be residual OBS files. |
| GaussDB | SYS.GAUSSDBV5 | EIP binding failure | GaussDBV5BindEIPFailed | Major | The EIP has been used or EIP resources are insufficient. | Submit a service ticket to O&M personnel. | The instance cannot be accessed from the Internet. |
| GaussDB | SYS.GAUSSDBV5 | EIP unbinding failure | GaussDBV5UnbindEIPFailed | Major | The network or the EIP service is faulty. | Unbind the EIP again or submit a service ticket to the O&M personnel. | Residual IP resources may be generated. |
| GaussDB | SYS.GAUSSDBV5 | Parameter template application failure | GaussDBV5ApplyParamFailed | Major | Changing a parameter group times out. | Change the parameter group again. | None |
| GaussDB | SYS.GAUSSDBV5 | Parameter modification failure | GaussDBV5UpdateInstanceParamGroupFailed | Major | Changing a parameter group times out. | Change the parameter group again. | None |
| GaussDB | SYS.GAUSSDBV5 | Backup and restoration failure | GaussDBV5RestoreFromBcakupFailed | Major | The underlying resources are insufficient or backup files fail to be downloaded. | Submit a service ticket. | The database service may be unavailable if the restoration fails. |
| GaussDB | SYS.GAUSSDBV5 | Hot patch installation failure | GaussDBV5UpgradeHotfixFailed | Major | Generally, this fault is caused by an error reported during a kernel upgrade. | View the error information about the workflow and redo or skip the job. | None |
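
The handling suggestions in the table fall into a few recurring patterns (wait for automatic recovery, contact SRE engineers, retry, submit a service ticket, or do nothing). As a minimal sketch, the mapping below shows how an alert handler might triage a subset of these events by ID. The event IDs are copied from the table; the `Action` grouping, the `triage` function, and the `event_id` key of the event dictionary are assumptions made for illustration and are not part of any GaussDB or monitoring API.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical triage categories summarizing the Handling Suggestion column."""
    WAIT_THEN_ESCALATE = "Wait for automatic recovery; contact SRE engineers if services stay down."
    CONTACT_SRE = "Contact SRE engineers."
    RETRY_OR_TICKET = "Retry the operation or submit a service ticket."
    SUBMIT_TICKET = "Submit a service ticket."
    NO_ACTION = "No further action is required."
    UNKNOWN = "Event ID not covered by this sketch; look it up in the table."


# Event IDs are taken from the table; the grouping is this sketch's simplification.
HANDLING = {
    "ProcessStatusAlarm": Action.WAIT_THEN_ESCALATE,
    "ComponentStatusAlarm": Action.WAIT_THEN_ESCALATE,
    "StateTransitionAlarm": Action.WAIT_THEN_ESCALATE,
    "ClusterStatusAlarm": Action.CONTACT_SRE,
    "HardwareResourceAlarm": Action.CONTACT_SRE,
    "GaussDBV5RestartInstanceFailed": Action.RETRY_OR_TICKET,
    "GaussDBV5UnbindEIPFailed": Action.RETRY_OR_TICKET,
    "GaussDBV5FullBackupFailed": Action.SUBMIT_TICKET,
    "GaussDBV5DifferentialBackupFailed": Action.SUBMIT_TICKET,
    "TaurusInstanceRunningStatusRecovered": Action.NO_ACTION,
    "TaurusNodeRunningStatusRecovered": Action.NO_ACTION,
}


def triage(event: dict) -> Action:
    """Return the triage category for an event.

    `event` is a hypothetical dictionary carrying the event ID under the
    key "event_id"; a real event payload may be shaped differently.
    """
    return HANDLING.get(event.get("event_id"), Action.UNKNOWN)


if __name__ == "__main__":
    print(triage({"event_id": "ClusterStatusAlarm"}).value)
```

Falling back to `Action.UNKNOWN` instead of raising keeps the handler from failing on event IDs that the sketch does not list, such as the remaining rows of the table.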