Events Supported by Event Monitoring

Note

Events in Event Monitoring come from operations on cloud service resources and are not collected by the Agent in Server Monitoring.

Table 1 Elastic Cloud Server (ECS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

ECS

Delete ECS

deleteServer

Major

The ECS was deleted

  • on the management console.

  • by calling APIs.

Check whether the deletion was performed intentionally by a user.

Services are interrupted.

Reboot ECS

rebootServer

Minor

The ECS was restarted

  • on the management console.

  • by calling APIs.

Check whether the restart was performed intentionally by a user.

  • Deploy service applications in HA mode.

  • After the ECS starts up, check whether services recover.

Services are interrupted.

Resize ECS

resizeServer

Minor

The ECS was resized

  • on the management console.

  • by calling APIs.

  • Check whether the operation was performed by a user.

  • Deploy service applications in HA mode.

  • After the ECS is resized, check whether services have recovered.

Services are interrupted.

Restart triggered due to hardware fault

startAutoRecovery

Major

ECSs on a faulty host were automatically migrated to another properly running host. During the migration, the ECSs were restarted.

Wait for the event to end and check whether services are affected.

Services may be interrupted.

Restart completed due to hardware failure

endAutoRecovery

Major

The ECS was restored to normal after the automatic migration.

This event indicates that the ECS has recovered and is working properly.

None

Auto recovery timeout (being processed on the backend)

faultAutoRecovery

Major

Migrating the ECS to a normal host timed out.

Migrate services to other ECSs.

Services are interrupted.

Improper ECS running

vmIsRunningImproperly

Major

The ECS was faulty or the ECS NIC was abnormal.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Services are interrupted.

Improper ECS running recovered

vmIsRunningImproperlyRecovery

Major

The ECS was restored to the normal status.

Wait for the ECS status to become normal and check whether services are affected.

None

VM faults caused by host process exceptions

VMFaultsByHostProcessExceptions

Critical

The processes of the host accommodating the ECS were abnormal.

Contact O&M personnel.

The ECS is faulty.

Restarted GuestOS

RestartGuestOS

Minor

The guest OS was restarted.

Contact O&M personnel.

Services may be interrupted.

Note

Once a physical host running ECSs breaks down, the ECSs are automatically migrated to a functional physical host. During the migration, the ECSs will be restarted.
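
The three auto-recovery events in Table 1 describe one lifecycle: startAutoRecovery opens it, and either endAutoRecovery or faultAutoRecovery closes it. The following Python sketch shows one way to derive the latest recovery state of each ECS from a list of events. The record layout (the resource_id, event_id, and time keys) and the state names are assumptions made for illustration; they are not a Cloud Eye data format.

```python
from typing import Dict, Iterable

# Event IDs taken from Table 1. The state names are hypothetical.
STATE_BY_EVENT = {
    "startAutoRecovery": "migrating",
    "endAutoRecovery": "recovered",
    "faultAutoRecovery": "timed_out",
}


def recovery_state(events: Iterable[dict]) -> Dict[str, str]:
    """Return the latest auto-recovery state for each ECS.

    Each event is assumed to be a dict with the hypothetical keys
    "resource_id", "event_id", and "time" (any sortable timestamp).
    Events with other IDs are ignored.
    """
    state: Dict[str, str] = {}
    # Process events in time order so that later events win.
    for event in sorted(events, key=lambda e: e["time"]):
        new_state = STATE_BY_EVENT.get(event["event_id"])
        if new_state is not None:
            state[event["resource_id"]] = new_state
    return state
```

An ECS that stays in the migrating state, or moves to timed_out, is a candidate for the "Migrate services to other ECSs" solution listed for faultAutoRecovery.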

Table 2 Advanced Anti-DDoS (AAD)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

AAD

DDoS Attack Events

ddosAttackEvents

Major

A DDoS attack occurs on the AAD-protected lines.

Assess the impact on services based on the attack traffic and attack type. If the attack traffic exceeds your purchased elastic bandwidth, change to another line or increase your bandwidth.

Services may be interrupted.

Domain name scheduling event

domainNameDispatchEvents

Major

The high-defense CNAME corresponding to the domain name is scheduled, and the domain name is resolved to another high-defense IP address.

Pay attention to the workloads involving the domain name.

Services are not affected.

Blackhole event

blackHoleEvents

Major

The attack traffic exceeds the purchased AAD protection threshold.

A blackhole is canceled after 30 minutes by default. The actual blackhole duration depends on the number of times a blackhole was triggered and the peak attack traffic on the current day. The maximum duration is 24 hours. If you need to permit access before the blackhole is canceled, contact technical support.

Services may be interrupted.

Cancel Blackhole

cancelBlackHole

Informational

The customer's AAD instance recovers from the blackhole state.

This is only a prompt and no action is required.

Customer services recover.
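
The blackhole durations in Table 2 give two bounds: the default 30 minutes and the maximum 24 hours. As a rough sketch, the Python function below turns a blackhole start time into the earliest default release time and the latest possible release time. The function name is hypothetical, and the actual duration within these bounds is decided by the service, so this only frames the window to plan around.

```python
from datetime import datetime, timedelta
from typing import Tuple

DEFAULT_BLACKHOLE = timedelta(minutes=30)  # default duration from Table 2
MAX_BLACKHOLE = timedelta(hours=24)        # maximum duration from Table 2


def blackhole_window(started_at: datetime) -> Tuple[datetime, datetime]:
    """Return (default release time, latest possible release time)."""
    return started_at + DEFAULT_BLACKHOLE, started_at + MAX_BLACKHOLE
```

If access is needed before the first of these times, the table's guidance is to contact technical support.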

Table 3 Cloud Backup and Recovery (CBR)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

CBR

Failed to create the backup.

backupFailed

Critical

The backup failed to be created.

Manually create a backup or contact customer service.

Data loss may occur.

Failed to restore the resource using a backup.

restorationFailed

Critical

The resource failed to be restored using a backup.

Restore the resource using another backup or contact customer service.

Data loss may occur.

Failed to delete the backup.

backupDeleteFailed

Critical

The backup failed to be deleted.

Try again later or contact customer service.

Charging may be abnormal.

Failed to delete the vault.

vaultDeleteFailed

Critical

The vault failed to be deleted.

Try again later or contact technical support.

Charging may be abnormal.

Replication failure

replicationFailed

Critical

The backup failed to be replicated.

Try again later or contact technical support.

Data loss may occur.

The backup is created successfully.

backupSucceeded

Major

The backup was created.

None

None

Resource restoration using a backup succeeded.

restorationSucceeded

Major

The resource was restored using a backup.

Check whether the data is successfully restored.

None

The backup is deleted successfully.

backupDeletionSucceeded

Major

The backup was deleted.

None

None

The vault is deleted successfully.

vaultDeletionSucceeded

Major

The vault was deleted.

None

None

Replication success

replicationSucceeded

Major

The backup was replicated successfully.

None

None
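
Table 3 assigns Critical to every CBR failure event and Major to every success event, so severity alone is enough to separate events that need action from those that do not. The Python sketch below illustrates that filtering. The event IDs and severities come from the table; the numeric ranking of severities and the function name are assumptions made for illustration.

```python
from typing import Iterable, List

# Assumed ordering of the severity levels that appear in this document.
SEVERITY_RANK = {"Informational": 0, "Minor": 1, "Major": 2, "Critical": 3}

# Event ID -> severity, copied from Table 3.
CBR_EVENTS = {
    "backupFailed": "Critical",
    "restorationFailed": "Critical",
    "backupDeleteFailed": "Critical",
    "vaultDeleteFailed": "Critical",
    "replicationFailed": "Critical",
    "backupSucceeded": "Major",
    "restorationSucceeded": "Major",
    "backupDeletionSucceeded": "Major",
    "vaultDeletionSucceeded": "Major",
    "replicationSucceeded": "Major",
}


def needs_attention(event_ids: Iterable[str], minimum: str = "Critical") -> List[str]:
    """Keep the known CBR event IDs whose severity is at least `minimum`."""
    threshold = SEVERITY_RANK[minimum]
    return [
        event_id
        for event_id in event_ids
        if event_id in CBR_EVENTS
        and SEVERITY_RANK[CBR_EVENTS[event_id]] >= threshold
    ]
```

With the default threshold, only the five failure events pass the filter.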

Table 4 Relational Database Service (RDS) — resource exception

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

RDS

Full backup failure

fullBackupFailed

Major

A single full backup failure does not affect the files that have already been backed up successfully, but it prolongs the incremental backup time during point-in-time restore (PITR).

Create a manual backup again.

Backup failed.

Primary/standby switchover or failover

PrimaryStandbySwitched

Major

This event is reported when a primary/standby switchover or a failover is triggered.

  1. After the switchover or failover is complete, check whether workloads are restored. If workloads are not restored, contact SRE engineers.

  2. Ignore the event if you have performed a switchover.

  3. If the failover is triggered by a node fault, contact the SRE engineers.

Downtime occurs during the switchover.

Replication status abnormal

abnormalReplicationStatus

Major

The possible causes are as follows:

  • The replication delay between the primary instance and the standby instance or a read replica is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked.

  • The network between the primary instance and the standby instance or a read replica is disconnected.

Submit a service ticket.

Your applications are not affected because this event does not interrupt data read and write.

Replication status recovered

replicationStatusRecovered

Major

The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.

No action is required.

None

DB instance faulty

faultyDBInstance

Major

A single or primary DB instance was faulty due to a disaster or a server failure.

Check whether an automated backup policy has been configured for the DB instance and submit a service ticket.

The database service may be unavailable.

DB instance recovered

DBInstanceRecovered

Major

RDS uses its high availability capability to rebuild the standby DB instance. This event is reported after the instance is rebuilt.

No action is required.

None

Failure of changing single DB instance to primary/standby

singleToHaFailed

Major

A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located.

Submit a service ticket.

Your applications are not affected because this event does not interrupt data read and write of the DB instance.

Database process restarted

DatabaseProcessRestarted

Major

The database process is stopped due to insufficient memory or high load.

Log in to the Cloud Eye console. Check whether the memory usage increases sharply, the CPU usage is too high for a long time, or the storage space is insufficient. You can increase the CPU and memory specifications or optimize the service logic.

When the process exits abnormally, workloads are interrupted. In this case, RDS automatically restarts the database process and attempts to recover the workloads.

Instance storage full

instanceDiskFull

Major

Generally, the cause is that the data space usage is too high.

Scale up the instance.

The DB instance becomes read-only because the storage space is full, and data cannot be written to the database.

Instance storage full recovered

instanceDiskFullRecovered

Major

The instance disk is recovered.

No action is required.

The read-only state of the instance is canceled, and write operations resume.

Read replica promotion failure

activeStandBySwitchFailed

Major

The read replica fails to be promoted to the primary DB instance due to network or server failures. The original primary DB instance takes over workloads quickly.

Submit a service ticket.

The read replica fails to be promoted to the primary DB instance.

Table 5 GaussDB(for MySQL)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

GaussDB(for MySQL)

Incremental backup failure

TaurusIncrementalBackupInstanceFailed

Major

The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal.

Submit a service ticket.

Backup jobs fail.

Read replica creation failure

addReadonlyNodesFailed

Major

The quota is insufficient or underlying resources are exhausted.

Check the read replica quota. Release resources and create read replicas again.

Read replicas fail to be created.

DB instance creation failure

createInstanceFailed

Major

The instance quota or underlying resources are insufficient.

Check the instance quota. Release resources and create instances again.

DB instances fail to be created.

Read replica promotion failure

activeStandBySwitchFailed

Major

The read replica fails to be promoted to the primary node due to network or server failures. The original primary node takes over services quickly.

Submit a service ticket.

The read replica fails to be promoted to the primary node.

Instance specifications change failure

flavorAlterationFailed

Major

The quota is insufficient or underlying resources are exhausted.

Submit a service ticket.

Instance specifications fail to be changed.

Faulty DB instance

TaurusInstanceRunningStatusAbnormal

Major

The instance process is faulty or the communications between the instance and the DFV storage are abnormal.

Submit a service ticket.

Services may be affected.

DB instance recovered

TaurusInstanceRunningStatusRecovered

Major

The instance is recovered.

Observe the service running status.

None

Faulty node

TaurusNodeRunningStatusAbnormal

Major

The node process is faulty or the communications between the node and the DFV storage are abnormal.

Observe the instance and service running statuses.

A read replica may be promoted to the primary node.

Node recovered

TaurusNodeRunningStatusRecovered

Major

The node is recovered.

Observe the service running status.

None

Read replica deletion failure

TaurusDeleteReadOnlyNodeFailed

Major

The communications between the management plane and the read replica are abnormal or the VM fails to be deleted from IaaS.

Submit a service ticket.

Read replicas fail to be deleted.

Password reset failure

TaurusResetInstancePasswordFailed

Major

The communications between the management plane and the instance are abnormal or the instance is abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Passwords fail to be reset for instances.

DB instance reboot failure

TaurusRestartInstanceFailed

Major

The network between the management plane and the instance is abnormal or the instance is abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Instances fail to be rebooted.

Restoration to new DB instance failure

TaurusRestoreToNewInstanceFailed

Major

The instance quota is insufficient, underlying resources are exhausted, or the data restoration logic is incorrect.

If the new instance fails to be created, check the instance quota, release resources, and try to restore to a new instance again. In other cases, submit a service ticket.

Backup data fails to be restored to new instances.

EIP binding failure

TaurusBindEIPToInstanceFailed

Major

The binding task fails.

Submit a service ticket.

EIPs fail to be bound to instances.

EIP unbinding failure

TaurusUnbindEIPFromInstanceFailed

Major

The unbinding task fails.

Submit a service ticket.

EIPs fail to be unbound from instances.

Parameter modification failure

TaurusUpdateInstanceParameterFailed

Major

The network between the management plane and the instance is abnormal or the instance is abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Instance parameters fail to be modified.

Parameter template application failure

TaurusApplyParameterGroupToInstanceFailed

Major

The network between the management plane and instances is abnormal or the instances are abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Parameter templates fail to be applied to instances.

Full backup failure

TaurusBackupInstanceFailed

Major

The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal.

Submit a service ticket.

Backup jobs fail.

Primary/standby failover

TaurusActiveStandbySwitched

Major

When the network, physical machine, or database of the primary node is faulty, the system promotes a read replica to primary based on the failover priority to ensure service continuity.

  1. Check whether the service is running properly.

  2. Check whether an alarm is generated, indicating that the read replica failed to be promoted to primary.

During the failover, database connection is interrupted for a short period of time. After the failover is complete, you can reconnect to the database.

Database read-only

NodeReadonlyMode

Major

The database supports only query operations.

Submit a service ticket.

After the database becomes read-only, write operations cannot be processed.

Database read/write

NodeReadWriteMode

Major

The database supports both write and read operations.

Submit a service ticket.

None.
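
The Primary/standby failover row in Table 5 notes that the database connection is interrupted briefly and that clients can reconnect after the failover completes. One common way to handle this on the client side is to retry the connection with exponential backoff. The Python sketch below assumes a caller-supplied connect function that raises ConnectionError while the database is unreachable; the function name, the retry count, and the delays are illustrative choices, not values recommended by the service.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts run out.

    The delay doubles after each failed attempt (0.5 s, 1 s, 2 s, ...).
    The last ConnectionError is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In practice, connect would wrap the connection call of whichever database driver the application uses, and its connection errors would need to be mapped to ConnectionError or added to the except clause.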

Table 6 GaussDB

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

GaussDB

Process status alarm

ProcessStatusAlarm

Major

Key processes exit, including CMS/CMA, ETCD, GTM, CN, and DN processes.

Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers.

If processes on primary nodes are faulty, services are interrupted and then rolled back. If processes on standby nodes are faulty, services are not affected.

Component status alarm

ComponentStatusAlarm

Major

Key components do not respond, including CMA, ETCD, GTM, CN, and DN components.

Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers.

If processes on primary nodes do not respond, neither do the services. If processes on standby nodes are faulty, services are not affected.

Cluster status alarm

ClusterStatusAlarm

Major

The cluster status is abnormal. For example, the cluster is read-only, a majority of ETCDs are faulty, or the cluster resources are unevenly distributed.

Contact SRE engineers.

If the cluster status is read-only, only read services are processed.

If the majority of ETCDs are faulty, the cluster is unavailable.

If resources are unevenly distributed, the instance performance and reliability deteriorate.

Hardware resource alarm

HardwareResourceAlarm

Major

A major hardware fault occurs in the instance, such as disk damage or a GTM network fault.

Contact SRE engineers.

Some or all services are affected.

Status transition alarm

StateTransitionAlarm

Major

The following events occur in the instance: DN build failure, forcible DN promotion, primary/standby DN switchover/failover, or primary/standby GTM switchover/failover.

Wait until the fault is automatically rectified and check whether services are recovered. If not, contact SRE engineers.

Some services are interrupted.

Other abnormal alarm

OtherAbnormalAlarm

Major

Disk usage threshold alarm

Focus on service changes and scale up storage space as needed.

If the used storage space exceeds the threshold, storage space cannot be scaled up.

Faulty DB instance

TaurusInstanceRunningStatusAbnormal

Major

This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.

Submit a service ticket.

The database service may be unavailable.

DB instance recovered

TaurusInstanceRunningStatusRecovered

Major

GaussDB(openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported.

No further action is required.

None

Faulty DB node

TaurusNodeRunningStatusAbnormal

Major

This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.

Check whether the database service is available and submit a service ticket.

The database service may be unavailable.

DB node recovered

TaurusNodeRunningStatusRecovered

Major

GaussDB(openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported.

No further action is required.

None

DB instance creation failure

GaussDBV5CreateInstanceFailed

Major

Instances fail to be created because the quota is insufficient or underlying resources are exhausted.

Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota.

DB instances cannot be created.

Node adding failure

GaussDBV5ExpandClusterFailed

Major

The underlying resources are insufficient.

Submit a service ticket. After the O&M personnel coordinate resources in the background, delete the node that failed to be added and add a new node.

None

Storage scale-up failure

GaussDBV5EnlargeVolumeFailed

Major

The underlying resources are insufficient.

Submit a service ticket. After the O&M personnel coordinate resources in the background, scale up the storage space again.

Services may be interrupted.

Reboot failure

GaussDBV5RestartInstanceFailed

Major

The network is abnormal.

Retry the reboot operation or submit a service ticket to the O&M personnel.

The database service may be unavailable.

Full backup failure

GaussDBV5FullBackupFailed

Major

The backup files fail to be exported or uploaded.

Submit a service ticket to the O&M personnel.

Data cannot be backed up.

Differential backup failure

GaussDBV5DifferentialBackupFailed

Major

The backup files fail to be exported or uploaded.

Submit a service ticket to the O&M personnel.

Data cannot be backed up.

Backup deletion failure

GaussDBV5DeleteBackupFailed

Major

This function does not need to be implemented.

N/A

N/A

EIP binding failure

GaussDBV5BindEIPFailed

Major

The EIP is bound to another resource.

Submit a service ticket to the O&M personnel.

The instance cannot be accessed from the Internet.

EIP unbinding failure

GaussDBV5UnbindEIPFailed

Major

The network is faulty or the EIP is abnormal.

Unbind the IP address again or submit a service ticket to the O&M personnel.

Residual IP addresses may remain.

Parameter template application failure

GaussDBV5ApplyParamFailed

Major

Modifying a parameter template times out.

Modify the parameter template again.

None

Parameter modification failure

GaussDBV5UpdateInstanceParamGroupFailed

Major

Modifying a parameter template times out.

Modify the parameter template again.

None

Backup and restoration failure

GaussDBV5RestoreFromBcakupFailed

Major

The underlying resources are insufficient or backup files fail to be downloaded.

Submit a service ticket.

The database service may be unavailable during the restoration failure.