GPU Metrics
The CCE AI Suite (NVIDIA GPU) add-on exposes GPU monitoring metrics, which extend the GPU observability options available in CCE. This section lists the metrics that the add-on provides.
GPU Metrics Provided by CCE
Category | Metric | Type | Unit | Monitoring Level | Description |
---|---|---|---|---|---|
Utilization | cce_gpu_utilization | Gauge | % | GPU cards | GPU compute usage |
Utilization | cce_gpu_memory_utilization | Gauge | % | GPU cards | GPU memory usage |
Utilization | cce_gpu_encoder_utilization | Gauge | % | GPU cards | GPU encoder usage |
Utilization | cce_gpu_decoder_utilization | Gauge | % | GPU cards | GPU decoder usage |
Utilization | cce_gpu_utilization_process | Gauge | % | GPU processes | GPU compute usage of each process |
Utilization | cce_gpu_memory_utilization_process | Gauge | % | GPU processes | GPU memory usage of each process |
Utilization | cce_gpu_encoder_utilization_process | Gauge | % | GPU processes | GPU encoder usage of each process |
Utilization | cce_gpu_decoder_utilization_process | Gauge | % | GPU processes | GPU decoder usage of each process |
Memory | cce_gpu_memory_used | Gauge | Byte | GPU cards | Used GPU memory |
Memory | cce_gpu_memory_total | Gauge | Byte | GPU cards | Total GPU memory |
Memory | cce_gpu_memory_free | Gauge | Byte | GPU cards | Free GPU memory |
Memory | cce_gpu_bar1_memory_used | Gauge | Byte | GPU cards | Used GPU BAR1 memory |
Memory | cce_gpu_bar1_memory_total | Gauge | Byte | GPU cards | Total GPU BAR1 memory |
Frequency | cce_gpu_clock | Gauge | MHz | GPU cards | GPU clock frequency |
Frequency | cce_gpu_memory_clock | Gauge | MHz | GPU cards | GPU memory clock frequency |
Frequency | cce_gpu_graphics_clock | Gauge | MHz | GPU cards | GPU graphics clock frequency |
Frequency | cce_gpu_video_clock | Gauge | MHz | GPU cards | GPU video processor clock frequency |
Physical status | cce_gpu_temperature | Gauge | °C | GPU cards | GPU temperature |
Physical status | cce_gpu_power_usage | Gauge | Milliwatt | GPU cards | GPU power draw |
Physical status | cce_gpu_total_energy_consumption | Gauge | Millijoule | GPU cards | Total GPU energy consumption |
Bandwidth | cce_gpu_pcie_link_bandwidth | Gauge | bit | GPU cards | GPU PCIe bandwidth |
Bandwidth | cce_gpu_nvlink_bandwidth | Gauge | Gbit/s | GPU cards | GPU NVLink bandwidth |
Bandwidth | cce_gpu_pcie_throughput_rx | Gauge | KB/s | GPU cards | GPU PCIe RX bandwidth |
Bandwidth | cce_gpu_pcie_throughput_tx | Gauge | KB/s | GPU cards | GPU PCIe TX bandwidth |
Bandwidth | cce_gpu_nvlink_utilization_counter_rx | Gauge | KB/s | GPU cards | GPU NVLink RX bandwidth |
Bandwidth | cce_gpu_nvlink_utilization_counter_tx | Gauge | KB/s | GPU cards | GPU NVLink TX bandwidth |
Retired memory pages | cce_gpu_retired_pages_sbe | Gauge | N/A | GPU cards | Number of retired (isolated) GPU memory pages with single-bit errors |
Retired memory pages | cce_gpu_retired_pages_dbe | Gauge | N/A | GPU cards | Number of retired (isolated) GPU memory pages with double-bit errors |
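
Several metrics in the table are reported in base units (Byte, Milliwatt, Millijoule), so it can help to convert them before displaying them on a dashboard. The following Python snippet is a minimal sketch of such a conversion. The `sample` dictionary, its values, and the helper names (`bytes_to_gib`, `summarize`, and so on) are hypothetical and only illustrate the units listed above; they are not part of the add-on. The derived memory percentage is calculated from `cce_gpu_memory_used` and `cce_gpu_memory_total` and is not guaranteed to match the value reported by `cce_gpu_memory_utilization`.

```python
# Hypothetical raw values for one GPU card, keyed by metric name.
# The numbers are illustrative only, not measured data.
sample = {
    "cce_gpu_memory_used": 8 * 1024**3,                 # Byte
    "cce_gpu_memory_total": 16 * 1024**3,               # Byte
    "cce_gpu_power_usage": 250_000,                     # Milliwatt
    "cce_gpu_total_energy_consumption": 3_600_000_000,  # Millijoule
}


def bytes_to_gib(value: float) -> float:
    """Convert bytes to gibibytes (1 GiB = 1024^3 bytes)."""
    return value / 1024**3


def milliwatts_to_watts(value: float) -> float:
    """Convert milliwatts to watts."""
    return value / 1000


def millijoules_to_kwh(value: float) -> float:
    """Convert millijoules to kilowatt-hours (1 kWh = 3.6e9 mJ)."""
    return value / 3.6e9


def memory_used_percent(used: float, total: float) -> float:
    """Return used memory as a percentage of total memory."""
    if total <= 0:
        raise ValueError("total memory must be positive")
    return 100 * used / total


def summarize(metrics: dict) -> dict:
    """Build a human-readable summary from raw metric values."""
    used = metrics["cce_gpu_memory_used"]
    total = metrics["cce_gpu_memory_total"]
    return {
        "memory_used_gib": bytes_to_gib(used),
        "memory_total_gib": bytes_to_gib(total),
        "memory_used_percent": memory_used_percent(used, total),
        "power_watts": milliwatts_to_watts(metrics["cce_gpu_power_usage"]),
        "energy_kwh": millijoules_to_kwh(
            metrics["cce_gpu_total_energy_consumption"]
        ),
    }


if __name__ == "__main__":
    for key, value in summarize(sample).items():
        print(f"{key}: {value}")
```

In practice, the raw values would come from the monitoring system that scrapes the add-on rather than from a hard-coded dictionary.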