CCE AI Suite (NVIDIA GPU)
Add-on Overview
CCE AI Suite (NVIDIA GPU) is a device management add-on that supports GPUs in containers. To use GPU nodes in a cluster, this add-on must be installed.
Add-on Parameters
Parameter | Mandatory | Type | Description |
---|---|---|---|
basic | Yes | object | Basic add-on configuration parameters |
custom | Yes | Table 3 object | Custom parameters |
Parameter | Mandatory | Type | Description |
---|---|---|---|
cluster_version | No | String | CCE cluster version |
device_version | Yes | String | Add-on version |
driver_version | Yes | String | Image tag of an add-on pod where a driver is installed. Generally, the value is the same as that of device_version. |
obs_url | Yes | String | Address of the GPU driver when the driver is downloaded from the default driver address. |
swr_addr | Yes | String | Image repository address |
swr_user | Yes | String | Tenant path of an image repository |
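
The mandatory columns of the basic table can be checked before a request is sent. The following is a minimal sketch only: `missing_basic_keys` is a hypothetical helper, not part of any CCE SDK, and it assumes that every mandatory field must be present as a string, as the table suggests.

```python
# Mandatory keys of the "basic" object, taken from the table above.
BASIC_REQUIRED = ("device_version", "driver_version", "obs_url", "swr_addr", "swr_user")


def missing_basic_keys(basic: dict) -> list:
    """Return mandatory keys that are absent or not strings (hypothetical helper)."""
    return [key for key in BASIC_REQUIRED if not isinstance(basic.get(key), str)]


basic = {
    "cluster_version": "v1.27",  # optional according to the table
    "device_version": "2.0.69",
    "driver_version": "2.0.69",
    "swr_addr": "***",
    "swr_user": "***",
}

# obs_url was left out on purpose to show the check.
print(missing_basic_keys(basic))  # prints ['obs_url']
```
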
Parameter | Mandatory | Type | Description |
---|---|---|---|
compatible_with_legacy_api | No | Bool | API compatibility switch. Default value: false. true: The add-on supports the GPU native mode and xGPU virtualization. |
component_schedulername | Yes | String | Name of the scheduler used by the add-on. Default value: default-scheduler |
disable_mount_path_v1 | No | Bool | Default value: false. true: /opt/cloud/cce/nvidia is not mounted to the /usr/lib/nvidia directory of a GPU container. |
disable_nvidia_gsp | No | Bool | Default value: true. true: The GPU GSP firmware is disabled. |
driver_mount_paths | No | String | Driver file directories that are automatically mounted to a GPU container. Default value: "bin,lib64" |
enable_fault_isolation | No | Bool | Default value: true. true: The add-on detects hardware faults or driver issues of a GPU and then sets the GPU to be unavailable. |
enable_health_monitoring | No | Bool | Default value: true. true: The add-on detects hardware faults or driver issues of a GPU. |
enable_metrics_monitoring | No | Bool | Default value: true. true: The add-on collects GPU metrics and reports these metrics to Prometheus. |
enable_simple_lib64_mount | No | Bool | Default value: true. true: Only the libxxx.so.x file is mounted to a container. |
enable_xgpu | No | Bool | Whether to enable xGPU virtualization. Default value: false |
gpu_driver_config | No | Map | Configurations of the GPU driver for a single node pool. Default value: {} |
health_check_xids_v2 | No | String | GPU error range for the add-on health checks. Default value: "74,79" |
inject_ld_Library_path | No | String | Value of the LD_LIBRARY_PATH environment variable automatically injected by the add-on to a GPU container. Default value: "" |
lib64_container_paths | No | String | Mount path of NVIDIA lib64 in a GPU container. Default value: "/usr/lib64,/usr/lib/x86_64-linux-gnu" |
metrics_delete_interval | No | int | Timeout threshold, in milliseconds, for deleting a metric when the metric cannot be obtained. Default value: 30000 |
metrics_monitor_interval | No | int | Interval for obtaining metrics, in milliseconds. Default value: 15000 |
nvidia_driver_download_url | Yes | String | Path for downloading the NVIDIA driver. Default value: "" |
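
The defaults in the custom table can be kept in one place and overridden per cluster. The sketch below is illustrative only: `CUSTOM_DEFAULTS` and `build_custom` are hypothetical names, and the type check simply assumes that an override must have the same Python type as the documented default.

```python
import copy

# Defaults copied from the "custom" table above.
CUSTOM_DEFAULTS = {
    "compatible_with_legacy_api": False,
    "component_schedulername": "default-scheduler",
    "disable_mount_path_v1": False,
    "disable_nvidia_gsp": True,
    "driver_mount_paths": "bin,lib64",
    "enable_fault_isolation": True,
    "enable_health_monitoring": True,
    "enable_metrics_monitoring": True,
    "enable_simple_lib64_mount": True,
    "enable_xgpu": False,
    "gpu_driver_config": {},
    "health_check_xids_v2": "74,79",
    "inject_ld_Library_path": "",
    "lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu",
    "metrics_delete_interval": 30000,   # milliseconds
    "metrics_monitor_interval": 15000,  # milliseconds
    "nvidia_driver_download_url": "",
}


def build_custom(**overrides):
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(CUSTOM_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown custom parameters: {sorted(unknown)}")
    for key, value in overrides.items():
        expected = type(CUSTOM_DEFAULTS[key])
        # Exact type comparison, so a bool is not accepted where an int is expected.
        if type(value) is not expected:
            raise TypeError(f"{key} expects {expected.__name__}")
    merged = copy.deepcopy(CUSTOM_DEFAULTS)  # never mutate the shared defaults
    merged.update(overrides)
    return merged


custom = build_custom(enable_xgpu=True, metrics_monitor_interval=30000)
print(custom["enable_xgpu"], custom["metrics_monitor_interval"])
```

Rejecting unknown keys catches misspellings such as `inject_ld_library_path`, which differs from the documented `inject_ld_Library_path` only by case.
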
Example Request
{
  "kind": "Addon",
  "apiVersion": "v3",
  "metadata": {
    "name": "gpu-beta"
  },
  "spec": {
    "clusterID": "80c9e306-***-***-***-0255ac100043",
    "version": "2.0.69",
    "addonTemplateName": "gpu-beta",
    "values": {
      "basic": {
        "cluster_version": "v1.27",
        "device_version": "2.0.69",
        "driver_version": "2.0.69",
        "obs_url": "***",
        "region": "***",
        "swr_addr": "***",
        "swr_user": "***"
      },
      "custom": {
        "compatible_with_legacy_api": true,
        "component_schedulername": "kube-scheduler",
        "disable_mount_path_v1": false,
        "disable_nvidia_gsp": true,
        "driver_mount_paths": "bin,lib64",
        "enable_fault_isolation": true,
        "enable_health_monitoring": true,
        "enable_metrics_monitoring": true,
        "enable_simple_lib64_mount": true,
        "enable_xgpu": true,
        "gpu_driver_config": {},
        "health_check_xids_v2": "74,79",
        "inject_ld_Library_path": "",
        "lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu",
        "metrics_delete_interval": 30000,
        "metrics_monitor_interval": 15000,
        "nvidia_driver_download_url": ""
      }
    }
  }
}
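
A request body with this shape can also be assembled programmatically and serialized with a JSON library, which rules out syntax slips such as stray commas. This is a sketch under stated assumptions: `build_addon_request` is a hypothetical helper, the masked `***` values are placeholders carried over from the example, and the HTTP endpoint and authentication are left out because this section does not specify them.

```python
import json


def build_addon_request(cluster_id, version, basic, custom, template="gpu-beta"):
    """Assemble a request body shaped like the example above (hypothetical helper)."""
    return {
        "kind": "Addon",
        "apiVersion": "v3",
        "metadata": {"name": template},
        "spec": {
            "clusterID": cluster_id,
            "version": version,
            "addonTemplateName": template,
            "values": {"basic": basic, "custom": custom},
        },
    }


body = build_addon_request(
    cluster_id="80c9e306-***-***-***-0255ac100043",  # masked, as in the example
    version="2.0.69",
    basic={
        "device_version": "2.0.69",
        "driver_version": "2.0.69",
        "obs_url": "***",
        "swr_addr": "***",
        "swr_user": "***",
    },
    custom={
        "component_schedulername": "default-scheduler",
        "nvidia_driver_download_url": "",
        "enable_xgpu": True,
    },
)

# json.dumps always emits syntactically valid JSON.
payload = json.dumps(body, indent=2)
print(payload)
```

Only the mandatory custom parameters and one override are set here; the sketch assumes that omitted optional parameters fall back to the defaults listed in the table.
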