Routine Maintenance
To keep the system running stably over the long term, administrators or maintenance engineers need to periodically check the items listed in Table 1 and rectify any faults that the checks reveal. It is recommended that they record the result of each task scenario and sign off in accordance with enterprise management regulations.
Routine Maintenance Frequency | Task Scenario | Check Item |
---|---|---|
Daily | Check the cluster service status. | |
Daily | Check the cluster host status. | |
Daily | Check the cluster alarm information. | Check whether any alarms for unhandled exceptions were generated on the previous day, including alarms that were automatically cleared. |
Daily | Check the cluster audit information. | Check whether critical and major operations were performed on the previous day and whether those operations were valid. |
Daily | Check the cluster backup status. | Check whether OMS, DBService, NameNode, and LDAP were automatically backed up on the previous day. |
Daily | View the health check result. | Perform a health check on MRS Manager and download the health check report to determine whether the cluster is abnormal. You are advised to enable the automatic health check, export the latest cluster health check result, and repair unhealthy items based on the result. |
Daily | Check the network communication. | Check the cluster network status and whether network communication between nodes is delayed. |
Daily | Check the storage status. | Check whether the total data storage volume of the cluster has increased abruptly. |
Daily | Check logs. | |
Weekly | Manage users. | Check whether any user passwords are about to expire and notify the affected users to change them. After changing the password of a machine-machine user, you need to download the keytab file again. |
Weekly | Analyze alarms. | Export and analyze the alarms generated in a specified period. |
Weekly | Scan disks. | Check the disk health status. You are advised to use a dedicated disk check tool. |
Weekly | Collect statistics on storage. | Check in batches whether data is evenly distributed across the disks of the cluster nodes, identify disks whose data volume has increased significantly or is unusually low, and check whether those disks are normal. |
Weekly | Record changes. | Organize and record the changes made to cluster configuration parameters and files as a reference for fault analysis and handling. |
Monthly | Analyze logs. | |
Monthly | Diagnose the network. | Analyze the network health status of the cluster. |
Monthly | Manage hardware. | Check the equipment room environment and clean the devices. |
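
The weekly storage statistics check can be partly scripted once per-disk usage figures have been collected. The following is a minimal sketch, not an MRS Manager feature: the function name `find_uneven_disks`, the sample node names and mount points, and the 20-point threshold are all hypothetical and chosen only for illustration. It flags disks whose used percentage deviates from the cluster-wide mean by more than the threshold.

```python
from statistics import mean


def find_uneven_disks(usage_by_disk, max_deviation=20.0):
    """Return disks whose usage (%) differs from the mean usage
    by more than max_deviation percentage points.

    usage_by_disk maps a disk identifier to its used percentage.
    The result maps each flagged disk to its signed deviation.
    """
    if not usage_by_disk:
        return {}
    average = mean(usage_by_disk.values())
    return {
        disk: round(used - average, 1)
        for disk, used in usage_by_disk.items()
        if abs(used - average) > max_deviation
    }


# Hypothetical sample data: (node, mount point) -> used percentage.
sample_usage = {
    ("node-1", "/srv/data1"): 52.0,
    ("node-1", "/srv/data2"): 55.0,
    ("node-2", "/srv/data1"): 91.0,
    ("node-2", "/srv/data2"): 50.0,
    ("node-3", "/srv/data1"): 12.0,
}

for disk, deviation in find_uneven_disks(sample_usage).items():
    print(disk, f"{deviation:+.1f} points from the mean")
```

A fixed deviation from the mean is only one possible criterion. Disks flagged this way still need the manual check described in the table to decide whether they are actually faulty.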