Routine Maintenance

To keep the system running stably over the long term, administrators or maintenance engineers need to periodically check the items listed in Table 1 and rectify any detected faults based on the check results. It is recommended that administrators or engineers record the result of each task scenario and sign off in accordance with enterprise management regulations.

Table 1 Routine maintenance check items

Routine Maintenance Frequency

Task Scenario

Check Item

Daily

Check the cluster service status.

  • Check whether the running status and configuration status of each service are normal and whether the status icons are green.

  • Check whether the running status and configuration status of the role instances in each service are normal and whether the status icons are green.

  • Check whether the active/standby status of role instances in each service can be properly displayed.

  • Check whether the dashboard of the services and role instances can be displayed properly.

Check the cluster host status.

  • Check whether the running status of each host is normal and whether the status icon is green.

  • Check the current disk usage, memory usage, and CPU usage of each host. Check whether the current memory usage and CPU usage are increasing.
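Whether memory or CPU usage is increasing is easier to judge from several periodic samples than from a single reading. The following is a minimal Python sketch, assuming the usage percentages have already been collected at regular intervals (for example, noted from the host monitoring page). The function names and the default slope of 0.5 are illustrative assumptions and are not part of MRS Manager.

```python
from statistics import mean


def usage_slope(samples):
    """Least-squares slope of the samples, in percentage points per interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    xs = list(range(n))
    x_mean = mean(xs)
    y_mean = mean(samples)
    # Standard simple linear regression: cov(x, y) / var(x).
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


def is_increasing(samples, min_slope=0.5):
    """True if usage grows by at least min_slope points per interval.

    min_slope is a hypothetical default; tune it to the sampling interval.
    """
    return usage_slope(samples) >= min_slope
```

For example, `is_increasing([41.0, 43.5, 47.0, 52.0])` can be evaluated on four hourly memory readings of one host. A steadily positive slope over several days is a reason to look at the host more closely, even if no alarm has been raised yet.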

Check the cluster alarm information.

Check whether alarms were generated for unhandled exceptions on the previous day, including alarms that were automatically cleared.

Check the cluster audit information.

Check whether critical or major operations were performed on the previous day and whether those operations were valid.

Check the cluster backup status.

Check whether OMS, DBService, NameNode, and LDAP were automatically backed up on the previous day.

View the health check result.

Perform a health check on MRS Manager and download the health check report to check whether the current cluster is abnormal. You are advised to enable the automatic health check, export the latest cluster health check result, and repair unhealthy items based on the result.

Check the network communication.

Check the cluster network status and check whether the network communication between nodes is delayed.
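One simple way to spot delayed communication between nodes is to time a TCP connection to each node and compare the result with a limit. The following is a minimal Python sketch of that idea. The node names, the port, and the 10 ms limit are assumptions chosen for illustration; use a port that is actually open between your nodes and a limit that matches your network.

```python
import socket
import time


def tcp_connect_latency_ms(host, port, timeout=3.0):
    """Return the time in milliseconds needed to open a TCP connection.

    Raises OSError if the node cannot be reached within the timeout.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # The connection is closed immediately; only setup time matters.
    return (time.perf_counter() - start) * 1000.0


def delayed_nodes(latencies_ms, threshold_ms=10.0):
    """Return the nodes whose latency exceeds threshold_ms, sorted by name.

    threshold_ms is a hypothetical default, not a documented limit.
    """
    return sorted(node for node, ms in latencies_ms.items() if ms > threshold_ms)
```

A typical use is to call `tcp_connect_latency_ms` for every node from one node, store the results in a dictionary keyed by node name, and pass that dictionary to `delayed_nodes`. A TCP connect time includes connection setup on the remote host, so treat it as a rough indicator rather than a precise round-trip measurement.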

Check the storage status.

  • Check whether the total data storage volume of the cluster increases abruptly.

  • Check whether the disk usage is close to the threshold. If so, locate the cause. For example, check whether junk data or cold data left by services needs to be cleared.

  • Check whether disk partitions need to be expanded based on the service growth trend.
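The threshold and growth checks above can be sketched in Python as follows. This is a minimal illustration, not the method MRS Manager uses: the 80% threshold, the 5-point margin, and the function names are assumptions, and the daily growth rate must come from your own records.

```python
import shutil


def disk_usage_percent(path):
    """Used space as a percentage of the total size of the file system at path.

    Note: this is used/total, so it can differ slightly from the value that
    `df` prints, which excludes blocks reserved for the superuser.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def is_near_threshold(percent, threshold=80.0, margin=5.0):
    """True if usage is within margin points of the threshold, or above it."""
    return percent >= threshold - margin


def days_until_threshold(percent, daily_growth, threshold=80.0):
    """Estimate the days left before the threshold is reached.

    Assumes linear growth of daily_growth percentage points per day.
    Returns 0.0 if the threshold is already reached and None if usage
    is not growing.
    """
    if percent >= threshold:
        return 0.0
    if daily_growth <= 0:
        return None
    return (threshold - percent) / daily_growth
```

For example, `is_near_threshold(disk_usage_percent("/srv"))` checks one mount point (the path is a placeholder), and `days_until_threshold` gives a rough lead time for planning a partition expansion.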

Check logs.

  • Check whether there are failed or unresponsive MapReduce and Spark tasks. Check the /tmp/logs/${username}/logs/${application id} log file in HDFS and rectify faults.

  • Check Yarn task logs, view the logs of failed and unresponsive tasks, and delete duplicate data.

  • Check the worker logs of Storm.

  • Back up logs to the storage server.

Weekly

Manage users.

Check whether any user password is about to expire and notify the user to change it. After the password of a machine-machine user is changed, the keytab file must be downloaded again.
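If the password expiry dates are kept in a list, the users to notify can be picked out with a short script. The following is a minimal Python sketch, assuming the expiry dates have been gathered by hand or exported by some other means. The input dictionary, the user names, and the 7-day warning window are all hypothetical; how the dates are obtained from the cluster is not covered here.

```python
from datetime import date, timedelta


def expiring_users(expiry_by_user, today, warn_days=7):
    """Return users whose password expires within warn_days, sorted by name.

    Users whose password has already expired are included as well.
    """
    deadline = today + timedelta(days=warn_days)
    return sorted(
        user for user, expires in expiry_by_user.items() if expires <= deadline
    )
```

Calling `expiring_users(expiry_dates, date.today())` during the weekly check yields the users to notify. Remember that machine-machine users in the result also need a fresh keytab file after the change.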

Analyze alarms.

Export and analyze alarms generated in a specified period.

Scan disks.

Check the disk health status. You are advised to use a dedicated disk check tool.

Collect statistics on storage.

Check in batches whether data is evenly distributed across the disks of the cluster nodes. Identify the disks whose data volume has increased significantly or is noticeably lower than that of the other disks, and check whether those disks are normal.
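One way to find unevenly filled disks is to compare the usage of each disk with the average across all disks. The following is a minimal Python sketch, assuming the usage percentage of each data disk has already been collected. The function name, the sample mount points, and the 10-point tolerance are illustrative assumptions.

```python
from statistics import mean


def unbalanced_disks(usage_by_disk, tolerance=10.0):
    """Return disks whose usage deviates from the average by more than tolerance.

    The result maps each flagged disk to its deviation in percentage points:
    positive values mean the disk is fuller than average, negative values
    mean it holds less data than average.
    """
    average = mean(list(usage_by_disk.values()))
    flagged = {}
    for disk, percent in usage_by_disk.items():
        deviation = percent - average
        if abs(deviation) > tolerance:
            flagged[disk] = round(deviation, 1)
    return flagged
```

For example, with `{"/srv/data1": 40.0, "/srv/data2": 42.0, "/srv/data3": 80.0, "/srv/data4": 38.0}` the average is 50%, so `/srv/data3` (30 points above) and `/srv/data4` (12 points below) are flagged, while `/srv/data1` at exactly 10 points below is not. Flagged disks are candidates for the health check, not proof of a fault.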

Record changes.

Organize and record the operations performed on cluster configuration parameters and files to provide a reference for fault analysis and handling.

Monthly

Analyze logs.

  • Collect and analyze hardware logs of cluster node servers, such as BMC system logs.

  • Collect and analyze the OS logs of the cluster node servers.

  • Collect and analyze cluster logs.

Diagnose the network.

Analyze the network health status of the cluster.

Manage hardware.

Check the equipment room environment and clean the devices.