ALM-12012 NTP Service Is Abnormal

Description

Every 60 seconds, the system checks whether the NTP service on each node is synchronized with the NTP service on the active OMS node. This alarm is generated when the NTP service fails to synchronize time in two consecutive checks.

This alarm is also generated when the time difference between the NTP service on a node and the NTP service on the active OMS node is greater than or equal to 20s in two consecutive checks. This alarm is cleared when the time difference is less than 20s.
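The time-difference rule above can be illustrated with a small sketch. The function name `alarm_state` and the idea of passing one whole-second offset per 60-second check are assumptions made for illustration; this is not the monitoring code used by the product.

```shell
#!/bin/sh
# Hypothetical helper: derive the alarm state from successive time
# offsets (whole seconds, oldest first), one value per 60-second check.
THRESHOLD=20   # seconds, taken from the alarm description

alarm_state() {
    state="cleared"
    streak=0
    for offset in "$@"; do
        abs=${offset#-}                    # absolute value of the offset
        if [ "$abs" -ge "$THRESHOLD" ]; then
            streak=$((streak + 1))
            # Two consecutive checks at or above the threshold raise the alarm.
            if [ "$streak" -ge 2 ]; then
                state="raised"
            fi
        else
            # Any check below the threshold clears the alarm.
            streak=0
            state="cleared"
        fi
    done
    echo "$state"
}
```

For example, `alarm_state 25 30` reports a raised alarm, whereas a single out-of-range check followed by a normal one (`alarm_state 25 3`) does not.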

Attribute

    Alarm ID    Alarm Severity    Auto Clear
    12012       Major             Yes

Parameters

    Name           Meaning
    Source         Specifies the cluster or system for which the alarm is generated.
    ServiceName    Specifies the service for which the alarm is generated.
    RoleName       Specifies the role for which the alarm is generated.
    HostName       Specifies the host for which the alarm is generated.

Impact on the System

The time on the node is inconsistent with that on other nodes in the cluster. Therefore, some FusionInsight applications on the node may not run properly.

Possible Causes

  • The NTP service on the current node cannot start properly.

  • The current node fails to synchronize time with the NTP service on the active OMS node.

  • The key value authenticated by the NTP service on the current node is inconsistent with that on the active OMS node.

  • The time offset between the node and the NTP service on the active OMS node is large.

Procedure

Check the NTP service mode of the node.

  1. Log in to the active management node as user root, run the su - omm command to switch to user omm, and run the following command to check the resource status on the active and standby nodes:

    sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh

    • If "chrony" is displayed in the ResName column of the command output, go to 2.

    • If "ntp" is displayed in the ResName column, go to 20.

    Note

    If both "chrony" and "ntp" are displayed in the ResName column of the command output, the NTP service mode is being switched. Wait 10 minutes and perform 1 again. If both "chrony" and "ntp" are still displayed, contact O&M personnel.
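The decision in this step can be sketched as a filter over the command output. The exact layout of the status-oms.sh output is not shown here, so the sketch assumes only that the resource names appear as whitespace-separated fields; `detect_ntp_mode` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: classify the NTP service mode from status output
# read on stdin. Assumes "chrony" or "ntp" appears as a standalone field.
detect_ntp_mode() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "chrony") chrony = 1
                if ($i == "ntp")    ntp = 1
            }
        }
        END {
            if (chrony && ntp) print "switching"   # mode switch in progress
            else if (chrony)   print "chrony"      # continue with the chrony checks
            else if (ntp)      print "ntp"         # continue with the NTP checks
            else               print "unknown"
        }'
}
```

A possible invocation, run as user omm, is `sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh | detect_ntp_mode`.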

Check whether the chrony service on the node is started properly.

  2. On FusionInsight Manager, choose O&M > Alarm > Alarms. On the page that is displayed, click image1 in the row containing the alarm, and view the name of the host for which the alarm is generated in Location.

  3. Check whether the chronyd process is running on the node where the alarm is generated. Log in to the node for which the alarm is generated as user root and run the ps -ef | grep chronyd | grep -v grep command to check whether the command output contains the chronyd process.

    • If yes, go to 6.

    • If no, go to 4.

  4. Run the systemctl start chronyd command to start the chrony service. (Currently, only CentOS and Red Hat Enterprise Linux 7.0 or later are supported.)

  5. Check whether the alarm is cleared 10 minutes later.

    • If yes, no further action is required.

    • If no, go to 6.
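The process check in this section can also be written without the grep -v grep idiom. The following is a minimal sketch, assuming the standard ps -ef column layout in which the command is the eighth field; `has_process` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the ps -ef output on stdin contains a
# process whose executable name (basename of the 8th field) equals $1.
has_process() {
    awk -v name="$1" '
        {
            n = split($8, parts, "/")      # strip any leading path
            if (parts[n] == name) found = 1
        }
        END { exit !found }'               # exit status 0 only when found
}
```

Because only the executable name is compared, a line such as `grep chronyd` is not mistaken for the daemon. One possible use is `ps -ef | has_process chronyd || systemctl start chronyd`.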

Check whether the current node can synchronize time properly with the chrony service on the active OMS node.

  6. Check whether the node can synchronize time with the NTP service on the active OMS node based on additional information of the alarm.

    • If yes, go to 7.

    • If no, go to 17.

  7. Check whether the synchronization with the chrony service on the active OMS node is faulty.

    Log in to the node for which the alarm is generated as user root and run the chronyc sources command.

    In the command output, if there is an asterisk (*) before the IP address of the chrony service on the active OMS node, the synchronization is normal. The command output is as follows:

    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^* 10.10.10.162             10  10   377   626    +16us[  +15us] +/-  308us
    

    In the command output, if there is no asterisk (*) before the IP address of the chrony service on the active OMS node, and the value of Reach is 0, the synchronization is abnormal.

    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^? 10.1.1.1                      0  10     0     -     +0ns[   +0ns] +/-    0ns
    
    • If the synchronization is faulty, go to 8.

    • If the synchronization is normal, go to 38.

  8. The chrony synchronization failure is typically caused by the system firewall. If the firewall can be disabled, disable it. If the firewall cannot be disabled, check the firewall configuration policy and ensure that UDP ports 123 and 323 are not blocked. (For details, see the firewall configuration policy of each system.)

  9. Check whether the alarm is cleared 10 minutes later.

    • If yes, no further action is required.

    • If no, go to 10.

  10. Log in to the active OMS node as user root and run the following command to view the authentication code whose key value index is 1M:

    In Red Hat Enterprise Linux, run the cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys command.

  11. Run the following command to check whether the key value is the same as that queried in 10:

    In Red Hat Enterprise Linux, run the diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys /etc/chrony.keys command.

    Note

    If the key values are the same, no result is returned after the command is executed. For example:

    host01:~ # cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys
    1 M sdYbq;o^CzEAWo<U=Tw5
    host01:~ # diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys /etc/chrony.keys
    host01:~ #
    
    • If yes, go to 12.

    • If no, go to 38.

  12. Run the cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile command to check whether the key value is the same as that queried in 10. (Compare the key value with that of the authentication key index field 1M queried in 10.)

    • If yes, go to 13.

    • If no, go to 15.

  13. Log in to the faulty node as user root and run the cat /etc/chrony.keys command in Red Hat Enterprise Linux to check whether the key value is the same as the value queried in 12 (use the key value of the authentication key index field 1M for comparison).

    • If yes, go to 38.

    • If no, go to 14.

  14. Run the su - omm command to switch to user omm, change the key value of the authentication key index field 1M in ${NODE_AGENT_HOME}/chrony.keys to the key value of ntpKeyFile in 12, and go to 16.

  15. Run the following commands as user root or omm to change the NTP key value of the active OMS node (change ntp.keys to ntpkeys in Red Hat Enterprise Linux):

    cd ${BIGDATA_HOME}/om-server/OMS/workspace/conf

    sed -i "`cat chrony.keys | grep -n '1 M'|awk -F ':' '{print $1}'`d" chrony.keys

    echo "1 M `cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile`" >> chrony.keys

    Check whether the key value of the authentication key index field 1M in chrony.keys is the same as that of ntpKeyFile.

    • If yes, go to 16.

    • If no, change the key value of the authentication key index field 1M in chrony.keys to the key value of ntpKeyFile and go to 16.

  16. After 5 minutes, run the systemctl restart chronyd command to restart the chrony service on the active OMS node. After 15 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 38.
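The chronyc sources check earlier in this section can be sketched as a small parser. It assumes the column layout shown in the sample output above (state marker, address, Stratum, Poll, Reach); `chrony_sync_state` is a hypothetical helper name, and the IP address of the active OMS node is passed as the argument.

```shell
#!/bin/sh
# Hypothetical helper: classify the source with address $1 in the
# chronyc sources output read on stdin.
chrony_sync_state() {
    awk -v ip="$1" '
        $2 == ip {
            seen  = 1
            state = substr($1, 2, 1)   # second character: "*" means selected
            reach = $5                 # Reach column
        }
        END {
            if (!seen)             print "missing"
            else if (state == "*") print "synced"
            else if (reach == "0") print "unreachable"
            else                   print "not-selected"
        }'
}
```

A possible invocation is `chronyc sources | chrony_sync_state 10.10.10.162`, where the address is the one used in the sample output. A result of `unreachable` corresponds to the abnormal case described above (no asterisk and a Reach value of 0).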

Check whether the time deviation between the node and the chrony service on the active OMS node is large.

  17. Check whether the time deviation is large in additional information of the alarm.

    • If yes, go to 18.

    • If no, go to 38.

  18. On the Hosts tab page, select the host for which the alarm is generated, and choose More > Stop All Instances to stop all the services on the node.

    If the time on the alarm node is later than that on the chrony service of the active OMS node, adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

    If the time on the alarm node is earlier than that on the chrony service of the active OMS node, wait until a period equal to the time deviation has elapsed, and then adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

    Note

    If you do not wait, data loss may occur.

  19. After 10 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 38.
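The two cases in this section (node time later or earlier than the reference) can be expressed as a short decision helper. This is a sketch only: `plan_time_adjustment` is a hypothetical name, both arguments are assumed to be epoch seconds sampled at the same moment, and the waiting period is assumed to equal the time deviation.

```shell
#!/bin/sh
# Hypothetical helper: decide how to proceed from the node time ($1)
# and the reference time on the active OMS node ($2), in epoch seconds.
plan_time_adjustment() {
    node=$1
    ref=$2
    diff=$((node - ref))
    if [ "$diff" -gt 0 ]; then
        # Node time is later than the reference: adjust right away.
        echo "adjust-now $diff"
    elif [ "$diff" -lt 0 ]; then
        # Node time is earlier: wait for the deviation, then adjust.
        echo "wait-then-adjust $((-diff))"
    else
        echo "in-sync 0"
    fi
}
```

In both cases the services on the node are assumed to be stopped before the adjustment and started again afterwards, as described in the step above.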

Check whether the NTP service on the node is started properly.

  20. On FusionInsight Manager, choose O&M > Alarm > Alarms. On the page that is displayed, click image2 in the row containing the alarm, and view the name of the host for which the alarm is generated in Location.

  21. Check whether the ntpd process is running on the node. Log in to the alarm node as user root and run the ps -ef | grep ntpd | grep -v grep command to check whether the command output contains the ntpd process.

    • If yes, go to 24.

    • If no, go to 22.

  22. Run the service ntp start command (or the service ntpd start command in Red Hat Enterprise Linux) to start the NTP service.

  23. After 10 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 24.

Check whether the node can synchronize time properly with the NTP service on the active OMS node.

  24. Check whether the node can synchronize time with the NTP service on the active OMS node based on additional information of the alarm.

    • If yes, go to 25.

    • If no, go to 35.

  25. Check whether the synchronization with the NTP service on the active OMS node is faulty.

    Log in to the alarm node as user root and run the ntpq -np command.

    If an asterisk (*) exists before the IP address of the NTP service on the active OMS node in the command output, the synchronization is in normal state. The command output is as follows:

    remote refid st t when poll reach delay offset jitter
    ==============================================================================
    *10.10.10.162 .LOCL. 1 u 1 16 377 0.270 -1.562 0.014
    

    If there is no asterisk (*) before the IP address of the NTP service on the active OMS node, as shown in the following command output, and the value of refid is .INIT., the synchronization is abnormal.

    remote refid st t when poll reach delay offset jitter
    ==============================================================================
    10.10.10.162 .INIT. 1 u 1 16 377 0.270 -1.562 0.014
    
    • If the synchronization is faulty, go to 26.

    • If the synchronization is normal, go to 38.

  26. The NTP synchronization failure is typically caused by the system firewall. If the firewall can be disabled, run the iptables -F command to disable it. If the firewall cannot be disabled, run the iptables -L command to check the firewall configuration policy and ensure that UDP port 123 is not blocked. (For details, see the firewall configuration policy of each system.)

  27. After 10 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 28.

  28. Log in to the active OMS node as user root and run the following command to view the authentication key index field 1M:

    In SUSE Linux, run the cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntp.keys command.

    In Red Hat Enterprise Linux or EulerOS, run the cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntpkeys command.

  29. Run the following command to check whether the key value is the same as that queried in 28:

    In SUSE Linux, run the diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntp.keys /etc/ntp.keys command.

    In Red Hat Enterprise Linux or EulerOS, run the diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntpkeys /etc/ntp/ntpkeys command.

    Note

    If the key values are the same, no result is returned after the command is executed. For example:

    host01:~ # cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntp.keys
    1 M sdYbq;o^CzEAWo<U=Tw5
    host01:~ # diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntp.keys /etc/ntp.keys
    host01:~ #
    
    • If yes, go to 30.

    • If no, go to 38.

  30. Run the cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile command to check whether the key value is the same as that queried in 28. (Compare the key value with that of the authentication key index field 1M queried in 28.)

    • If yes, go to 31.

    • If no, go to 33.

  31. Log in to the faulty node as user root and run the cat /etc/ntp.keys command in SUSE Linux (or the cat /etc/ntp/ntpkeys command in Red Hat Enterprise Linux) to check whether the key value is the same as the value queried in 30 (use the key value of the authentication key index field 1M for comparison).

    • If yes, go to 38.

    • If no, go to 32.

  32. Run the su - omm command to switch to user omm, change the key value of the authentication key index field 1M in ${NODE_AGENT_HOME}/ntp.keys (${NODE_AGENT_HOME}/ntpkeys in Red Hat Enterprise Linux) to the key value of ntpKeyFile in 30, and go to 34.

  33. Run the following commands as user root or omm to change the NTP key value of the active OMS node (change ntp.keys to ntpkeys in Red Hat Enterprise Linux):

    cd ${BIGDATA_HOME}/om-server/OMS/workspace/conf

    sed -i "`cat ntp.keys | grep -n '1 M'|awk -F ':' '{print $1}'`d" ntp.keys

    echo "1 M `cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile`" >>ntp.keys

    Check whether the key value of the authentication key index field 1M in ntp.keys is the same as that of ntpKeyFile.

    • If yes, go to 34.

    • If no, change the key value of the authentication key index field 1M in ntp.keys to the key value of ntpKeyFile and go to 34.

  34. After 5 minutes, run the service ntp restart command to restart the NTP service on the active OMS node. After 15 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 38.
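The key comparisons in the preceding steps look only at the entry with index 1 and type M, so comparing that single field can be more precise than a whole-file diff. The following is a minimal sketch, assuming the key file format shown in the sample output (index, type, key) and a key without whitespace; `key_for_index` and `keys_match` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the key stored under index 1, type M,
# in the key file given as $1.
key_for_index() {
    awk '$1 == "1" && $2 == "M" { print $3; exit }' "$1"
}

# Hypothetical helper: succeed if both key files hold the same 1 M key.
keys_match() {
    [ "$(key_for_index "$1")" = "$(key_for_index "$2")" ]
}
```

A possible use on SUSE Linux is `keys_match ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntp.keys /etc/ntp.keys && echo same`. On Red Hat Enterprise Linux the file names given in the steps above apply instead.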

Check whether the time deviation between the node and the NTP service on the active OMS node is large.

  35. Check whether the time deviation is large in additional information of the alarm.

    • If yes, go to 36.

    • If no, go to 38.

  36. On the Hosts tab page, select the host for which the alarm is generated, and choose More > Stop All Instances to stop all the services on the node.

    If the time on the alarm node is later than that on the NTP service of the active OMS node, adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

    If the time on the alarm node is earlier than that on the NTP service of the active OMS node, wait until a period equal to the time deviation has elapsed, and then adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

    Note

    If you do not wait, data loss may occur.

  37. After 10 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 38.

Collect the fault information.

  38. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.

  39. In the Services area, select NodeAgent and OmmServer, and click OK. Expand the Hosts dialog box and select the alarm node and the active OMS node.

  40. Click image3 in the upper right corner, and set Start Date and End Date for log collection to 30 minutes before and 30 minutes after the alarm generation time, respectively. Then, click Download.

  41. Contact O&M personnel and provide the collected logs.
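The log collection window described above is simple arithmetic on the alarm generation time. The following sketch works on epoch seconds; `log_window` is a hypothetical helper name, and converting the alarm time to and from epoch seconds is left to the operator.

```shell
#!/bin/sh
# Hypothetical helper: print the start and end of the log collection
# window (epoch seconds) for an alarm generated at epoch time $1.
log_window() {
    alarm_epoch=$1
    margin=$((30 * 60))     # 30 minutes before and after the alarm
    echo "$((alarm_epoch - margin)) $((alarm_epoch + margin))"
}
```

For example, `log_window 1700000000` prints `1699998200 1700001800`. These values can then be converted to the date format expected by the Start Date and End Date fields.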

Alarm Clearing

This alarm is automatically cleared after the fault is rectified.