CCE Node Problem Detector

Add-on Overview

CCE Node Problem Detector (node-problem-detector, NPD) is an add-on that monitors abnormal events on cluster nodes and can connect to a third-party monitoring platform. It runs as a daemon on each node, collects node problems from various daemons, and reports them to the API server. NPD can run either as a DaemonSet or as a standalone daemon.

Add-on Parameters

Table 1 Parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| basic | No | object | Basic configuration parameters. They do not need to be specified. |
| flavor | Yes | flavor object (Table 2) | Flavor parameters |
| custom | Yes | custom object (Table 3) | Custom parameters |

Table 2 Configuration of flavor

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| description | No | String | Add-on description |
| name | Yes | String | Add-on specification name. The value is fixed at Single-instance. |
| replicas | Yes | String | Number of pods. The default value is 1. |
| resources | Yes | List<Object> (Table 4) | Container resource (CPU and memory) quotas, one entry per container |

Table 3 Configuration of custom

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| feature_gate | No | String | Feature gate, which is used to enable beta features |
| multiAZBalance | No | Bool | Multi-AZ deployment |
| multiAZEnabled | No | Bool | Whether to deploy the add-on pods in multiple AZs. The default value is false. If this parameter is set to true, cross-AZ deployment is forcibly performed. If this parameter is set to false, cross-AZ deployment is preferred. |
| npc | Yes | Object (Table 5) | node-problem-controller configuration |
| tolerations | No | List<Object> (Table 6) | Tolerations of the add-on |
| node_match_expressions | No | List<Object> (Table 7) | Node affinity configuration of the add-on |

Table 4 Data structure of the resources field

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| limitsCpu | Yes | String | CPU size limit (unit: m) |
| limitsMem | Yes | String | Memory size limit (unit: Mi) |
| name | Yes | String | Add-on name. The value is fixed at custom-resources. |
| requestsCpu | Yes | String | Requested CPU size (unit: m) |
| requestsMem | Yes | String | Requested memory size (unit: Mi) |
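The resources fields carry their units inside the string ("100m", "300Mi"), so a client may want to sanity-check an entry before sending it. The following is a minimal sketch, not part of the add-on API: the helper names are hypothetical, and it assumes that only the m and Mi suffixes listed in the table are used and that a request should not exceed its limit, as is the usual Kubernetes convention.

```python
def parse_cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as "100m" into millicores."""
    if not value.endswith("m"):
        raise ValueError(f"expected a value ending in 'm', got {value!r}")
    return int(value[:-1])


def parse_memory_mib(value: str) -> int:
    """Convert a memory quantity such as "300Mi" into MiB."""
    if not value.endswith("Mi"):
        raise ValueError(f"expected a value ending in 'Mi', got {value!r}")
    return int(value[:-2])


def check_resources(entry: dict) -> None:
    """Raise ValueError if a request exceeds the matching limit (assumed rule)."""
    name = entry["name"]
    if parse_cpu_millicores(entry["requestsCpu"]) > parse_cpu_millicores(entry["limitsCpu"]):
        raise ValueError(f"{name}: requestsCpu exceeds limitsCpu")
    if parse_memory_mib(entry["requestsMem"]) > parse_memory_mib(entry["limitsMem"]):
        raise ValueError(f"{name}: requestsMem exceeds limitsMem")


if __name__ == "__main__":
    # Values copied from the example request in this document.
    detector = {
        "name": "node-problem-detector",
        "limitsCpu": "100m",
        "limitsMem": "300Mi",
        "requestsCpu": "30m",
        "requestsMem": "100Mi",
    }
    check_resources(detector)
    print("resources entry is consistent")
```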

Table 5 Data structure of the npc field

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| maxTaintedNode | Yes | String or Int | Maximum number of nodes that NPC can taint when the same fault occurs on multiple nodes. This limits the impact of the fault. The value can be an integer or a percentage. |
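To illustrate how an integer or percentage value of maxTaintedNode could translate into a node count, here is a rough sketch. The function name is hypothetical, and both the rounding direction (down) and the clamping to the cluster size are assumptions; the document does not state how NPC resolves a percentage.

```python
import math


def resolve_max_tainted_nodes(value, total_nodes: int) -> int:
    """Resolve a maxTaintedNode setting to an absolute node count.

    Accepts an int (3), a numeric string ("3"), or a percentage string ("10%").
    Rounding down and clamping to [0, total_nodes] are assumptions of this sketch.
    """
    if isinstance(value, int):
        count = value
    else:
        text = str(value).strip()
        if text.endswith("%"):
            percent = float(text[:-1])
            count = math.floor(total_nodes * percent / 100)
        else:
            count = int(text)
    return max(0, min(count, total_nodes))


if __name__ == "__main__":
    # With 50 nodes, "10%" resolves to 5 nodes under these assumptions.
    print(resolve_max_tainted_nodes("10%", 50))
```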

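Table 6 Taints and tolerations

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| key | No | String | Taint key |
| effect | No | String | Taint effect to tolerate, for example NoExecute |
| operator | No | String | Operator, for example Exists |
| tolerationSeconds | No | Int | Toleration time window, in seconds |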
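A toleration entry can be assembled from the four fields above. The helper below is a hypothetical sketch: it omits unset fields and checks effect and operator against the values defined by Kubernetes (NoSchedule, PreferNoSchedule, NoExecute; Exists, Equal). Whether the add-on accepts every one of these values is an assumption.

```python
from typing import Optional

# Value sets defined by Kubernetes tolerations; assumed to apply to this add-on.
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}
VALID_OPERATORS = {"Exists", "Equal"}


def make_toleration(
    key: Optional[str] = None,
    operator: Optional[str] = None,
    effect: Optional[str] = None,
    toleration_seconds: Optional[int] = None,
) -> dict:
    """Build one tolerations entry, leaving out fields that were not set."""
    if effect is not None and effect not in VALID_EFFECTS:
        raise ValueError(f"unknown effect: {effect!r}")
    if operator is not None and operator not in VALID_OPERATORS:
        raise ValueError(f"unknown operator: {operator!r}")
    fields = {
        "key": key,
        "operator": operator,
        "effect": effect,
        "tolerationSeconds": toleration_seconds,
    }
    return {name: value for name, value in fields.items() if value is not None}


if __name__ == "__main__":
    # Reproduces the first toleration of the example request in this document.
    print(make_toleration("node.kubernetes.io/not-ready", "Exists", "NoExecute", 60))
```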

Table 7 nodeMatchExpression node affinity

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| key | No | String | Node label key |
| values | No | List<String> | Node label values to match |
| operator | No | String | Operator |
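The sketch below shows how a node_match_expressions entry might be evaluated against a node. It assumes the entries follow Kubernetes node selector semantics, where key names a node label and operator is one of In, NotIn, Exists, or DoesNotExist; the document does not list the supported operators. The function name and the label example.com/pool are hypothetical.

```python
def expression_matches(expression: dict, node_labels: dict) -> bool:
    """Evaluate one match expression against a node's labels.

    Follows Kubernetes node selector semantics (an assumption for this add-on).
    """
    key = expression["key"]
    operator = expression["operator"]
    values = expression.get("values", [])
    if operator == "In":
        return key in node_labels and node_labels[key] in values
    if operator == "NotIn":
        # Nodes without the label also satisfy NotIn.
        return node_labels.get(key) not in values
    if operator == "Exists":
        return key in node_labels
    if operator == "DoesNotExist":
        return key not in node_labels
    raise ValueError(f"unsupported operator: {operator!r}")


if __name__ == "__main__":
    # Hypothetical label used for illustration only.
    expression = {"key": "example.com/pool", "operator": "In", "values": ["monitoring"]}
    print(expression_matches(expression, {"example.com/pool": "monitoring"}))
```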

Example Request

{
  "kind": "Addon",
  "apiVersion": "v3",
  "metadata": {
    "annotations": {
      "addon.install/type": "install"
    }
  },
  "spec": {
    "clusterID": "b78fb690-b82c-11ee-83cf-0255ac100b0f",
    "version": "1.18.48",
    "addonTemplateName": "npd",
    "values": {
      "basic": {
        "image_version": "1.18.48",
        "swr_addr": "***",
        "swr_user": "***",
        "rbac_enabled": true,
        "cluster_version": "v1.23"
      },
      "flavor": {
        "description": "custom resources",
        "name": "custom-resources",
        "replicas": 2,
        "resources": [
          {
            "limitsCpu": "100m",
            "limitsMem": "300Mi",
            "name": "node-problem-controller",
            "requestsCpu": "30m",
            "requestsMem": "100Mi"
          },
          {
            "limitsCpu": "100m",
            "limitsMem": "300Mi",
            "name": "node-problem-detector",
            "requestsCpu": "30m",
            "requestsMem": "100Mi"
          }
        ],
        "category": [
          "CCE",
          "Turbo"
        ]
      },
      "custom": {
        "annotations": {},
        "common": {},
        "feature_gates": "",
        "multiAZBalance": false,
        "multiAZEnabled": false,
        "node_match_expressions": [],
        "npc": {
          "maxTaintedNode": "10%"
        },
        "tolerations": [
          {
            "key": "node.kubernetes.io/not-ready",
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": 60
          },
          {
            "key": "node.kubernetes.io/unreachable",
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": 60
          }
        ]
      }
    }
  }
}
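The request body above can also be generated programmatically. The following sketch builds an equivalent structure with a hypothetical helper, build_npd_install_request. It omits the basic block because Table 1 says it does not need to be specified; whether the service accepts a request without it has not been verified here. No endpoint or authentication is shown, since the document does not describe them.

```python
import json


def build_npd_install_request(cluster_id: str, version: str, max_tainted_node="10%") -> dict:
    """Build an NPD add-on installation body modeled on the example request."""
    # Resource values copied from the example request.
    resources = [
        {
            "name": component,
            "limitsCpu": "100m",
            "limitsMem": "300Mi",
            "requestsCpu": "30m",
            "requestsMem": "100Mi",
        }
        for component in ("node-problem-controller", "node-problem-detector")
    ]
    return {
        "kind": "Addon",
        "apiVersion": "v3",
        "metadata": {"annotations": {"addon.install/type": "install"}},
        "spec": {
            "clusterID": cluster_id,
            "version": version,
            "addonTemplateName": "npd",
            "values": {
                "flavor": {
                    "name": "custom-resources",
                    "replicas": 2,
                    "resources": resources,
                },
                "custom": {
                    "multiAZBalance": False,
                    "multiAZEnabled": False,
                    "node_match_expressions": [],
                    "npc": {"maxTaintedNode": max_tainted_node},
                    "tolerations": [],
                },
            },
        },
    }


if __name__ == "__main__":
    body = build_npd_install_request("b78fb690-b82c-11ee-83cf-0255ac100b0f", "1.18.48")
    print(json.dumps(body, indent=2))
```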