FAQ
Supported Device Vendors and Specific Models
| GPU Vendor | GPU Model | Granularity | Multi-GPU Support |
|---|---|---|---|
| NVIDIA | Almost all mainstream consumer and data center GPUs | Core 1%, Memory 1M | Supported. Multi-GPU can still be split and shared using virtualization. |
| Ascend | 910A, 910B2, 910B3, 310P | Minimum granularity depends on the card type template. Refer to the official templates. | Supported, but splitting is not supported when npu > 1. The entire card is exclusively allocated. |
| Hygon | Z100, Z100L, K100-AI | Core 1%, Memory 1M | Supported, but splitting is not supported when dcu > 1. The entire card is exclusively allocated. |
| Cambricon | 370, 590 | Core 1%, Memory 256M | Supported, but splitting is not supported when mlu > 1. The entire card is exclusively allocated. |
| Iluvatar | All | Core 1%, Memory 256M | Supported, but splitting is not supported when gpu > 1. The entire card is exclusively allocated. |
| Mthreads | MTT S4000 | Core 1 core group, Memory 512M | Supported, but splitting is not supported when gpu > 1. The entire card is exclusively allocated. |
| Metax | MXC500 | Does not support splitting, only whole card allocation is possible. | Supported, but all allocations are for whole cards. |
What is vGPU? Why can't I allocate two vGPUs on the same card despite seeing 10 vGPUs?
TL;DR
vGPU increases GPU utilization by letting multiple tasks share one GPU through logical splitting. Setting deviceSplitCount: 10 means the GPU can serve up to 10 tasks simultaneously; it does not allow a single task to use multiple vGPUs from the same GPU.
Concept of vGPU
A vGPU is a logical instance of a physical GPU created using virtualization, allowing multiple tasks to share the same physical GPU. For example, setting deviceSplitCount: 10 means a physical GPU can allocate resources to up to 10 tasks. This allocation does not increase physical resources; it only defines logical visibility.
Why can't I allocate two vGPUs on the same card?
- Significance of vGPU: vGPU represents different task views of the same physical GPU. It is not a separate partition of physical resources. When a task requests nvidia.com/gpu: 2, it is interpreted as requiring two physical GPUs, not two vGPUs from the same GPU.
- Resource Allocation Mechanism: vGPU is designed to allow multiple tasks to share one GPU, not to bind multiple vGPUs to a single task on the same GPU. A deviceSplitCount: 10 configuration enables up to 10 tasks to use the same GPU concurrently but does not permit one task to use multiple vGPUs.
- Consistency Between Container and Node Views: The GPU UUID inside the container matches the physical node's UUID, reflecting the same GPU. Although there may be 10 visible vGPUs, these are logical overcommit views, not additional independent resources.
- Design Intent: The design of vGPU aims to allow one GPU to be shared by multiple tasks, rather than letting one task occupy multiple vGPUs on the same GPU. The purpose of vGPU overcommitment is to improve GPU utilization, not to increase resource allocation for individual tasks.
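The allocation rule above can be illustrated with a small model. This is a minimal sketch, not HAMi's scheduler code: the PhysicalGPU class and allocate function are hypothetical names, and the only behavior modeled is that each physical GPU offers deviceSplitCount slots and a task may take at most one slot per card.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalGPU:
    uuid: str
    split_count: int = 10                      # mirrors deviceSplitCount: 10
    tasks: list = field(default_factory=list)  # tasks currently sharing this card

    def has_free_slot(self) -> bool:
        return len(self.tasks) < self.split_count


def allocate(gpus, task, requested):
    """Give `task` the requested number of vGPUs, each on a different physical GPU."""
    # A task may hold at most one vGPU per physical card.
    candidates = [g for g in gpus if g.has_free_slot() and task not in g.tasks]
    if len(candidates) < requested:
        return None  # not enough distinct physical GPUs
    chosen = candidates[:requested]
    for gpu in chosen:
        gpu.tasks.append(task)
    return [gpu.uuid for gpu in chosen]


one_card = [PhysicalGPU("GPU-0")]
print(allocate(one_card, "job-a", 2))  # None: one card cannot back two vGPUs of one task
print(allocate(one_card, "job-a", 1))  # ['GPU-0']
print(allocate(one_card, "job-b", 1))  # ['GPU-0'], a second task shares the card
```

With a single card, a request for two vGPUs fails even though ten slots are visible, while two separate tasks can share that card.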
HAMi's nvidia.com/priority field only supports two levels. How can I implement multi-level, user-defined priority-based scheduling for a queue of jobs, especially when cluster resources are limited?
TL;DR
HAMi's built-in two-level priority is for runtime preemption on a single GPU (e.g., an urgent task pausing a less critical one on the same card). For scheduling a queue of jobs based on multiple user-defined priorities, integrate HAMi with a scheduler like Volcano, which supports multi-level queue priorities for job allocation and preemption.
HAMi's native nvidia.com/priority field (0 for high, 1 for low/default) is specifically designed for runtime preemption on a single GPU. The typical scenario it addresses is when a low-priority task (e.g., training) is running, and a high-priority task (e.g., inference) needs immediate access to that same GPU. In this case, the high-priority task will cause the low-priority task to pause, effectively ceding compute resources. Once the high-priority task completes, the low-priority task resumes. This mechanism is focused on immediate resource contention on a specific device, rather than for sorting a queue of many pending jobs with multiple priority levels for initial scheduling.
When resources are insufficient and many jobs are waiting to be ordered by multiple user-submitted priority levels, HAMi's two-level system is not intended to cover that broader scheduling requirement.
However, achieving multi-level priority scheduling is feasible. The recommended approach is to integrate HAMi with a full-featured scheduler like Volcano:
- Volcano for Multi-Level Scheduling Priority:
  - Volcano allows you to define multiple queues with different priority levels.
  - It uses these queue priorities to determine the order in which jobs are allocated resources (including HAMi-managed vGPUs) and can manage preemption between jobs based on these wider scheduling priorities. This directly addresses the need for sorting the job queue based on multiple priority levels.
- HAMi for GPU Sharing & Its Runtime Priority:
  - HAMi integrates with Volcano via the volcano-vgpu-device-plugin.
  - It continues to manage the vGPU sharing and its own two-level runtime priority for tasks contending on the same physical GPU, as described earlier.
While HAMi's own priority serves a different, device-specific purpose (runtime preemption on a single card), implementing multi-level job scheduling priority is achievable by using Volcano in conjunction with HAMi. Volcano would handle which job from the queue is prioritized for resource allocation based on multiple priority levels, and HAMi would manage the GPU sharing and its specific on-device preemption.
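To make the queue-ordering half of this split concrete, the following sketch orders pending jobs by an arbitrary number of priority levels. It only illustrates the idea and is not Volcano's implementation; the job names, the priority numbers, and the enqueue and next_job helpers are all hypothetical.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker: keeps submission order within one level


def enqueue(queue, job_name, priority):
    # heapq is a min-heap, so negate: in this sketch a larger number is more urgent.
    heapq.heappush(queue, (-priority, next(_counter), job_name))


def next_job(queue):
    """Pop the most urgent pending job, or None when the queue is empty."""
    return heapq.heappop(queue)[2] if queue else None


pending = []
enqueue(pending, "nightly-train", priority=1)
enqueue(pending, "batch-eval", priority=5)
enqueue(pending, "online-inference", priority=9)
enqueue(pending, "ad-hoc-eval", priority=5)

order = [next_job(pending) for _ in range(4)]
print(order)
```

Jobs leave the queue by priority level first and by submission order within a level. HAMi's two-level nvidia.com/priority only matters later, once jobs contend on the same physical GPU.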
Integration with Other Open-Source Tools
Currently Supported:
- Volcano: Can be integrated with Volcano by using the volcano-vgpu-device-plugin under the HAMi project for GPU resource scheduling and management.
- Koordinator: HAMi can also be integrated with Koordinator to provide end-to-end GPU sharing solutions. By deploying HAMi-core on nodes and configuring the appropriate labels and resource requests in Pods, Koordinator uses HAMi’s GPU isolation capabilities, allowing multiple Pods to share the same GPU and improve GPU resource utilization.
  For detailed configuration and usage instructions, refer to the Koordinator documentation: Device Scheduling - GPU Share With HAMi
Currently Not Supported:
- KubeVirt & Kata Containers: Incompatible due to their reliance on virtualization for resource isolation, whereas HAMi’s GPU Device Plugin depends on direct GPU mounting into containers. Supporting these would require adapting the device allocation logic, balancing performance overhead and implementation complexity. HAMi prioritizes high-performance scenarios with direct GPU mounting and thus does not currently support these virtualization solutions.
Why are there [HAMi-core Warn(...)] logs in my Pod's output? Can I disable them?
This is normal and can be ignored. If needed, disable the logs by setting the environment variable LIBCUDA_LOG_LEVEL=0 in the container.
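As a rough illustration, the snippet below builds a Pod manifest with that environment variable set and prints it as JSON, which kubectl apply -f accepts alongside YAML. The Pod name, container name, and image are placeholders, not values from the HAMi project.

```python
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "quiet-gpu-pod"},           # placeholder name
    "spec": {
        "containers": [{
            "name": "cuda-app",                        # placeholder name
            "image": "example.com/cuda-app:latest",    # placeholder image
            # Silences the [HAMi-core Warn(...)] lines in this container.
            "env": [{"name": "LIBCUDA_LOG_LEVEL", "value": "0"}],
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        }],
    },
}

manifest = json.dumps(pod, indent=2)
print(manifest)
```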
Does HAMi support multi-node, multi-GPU distributed training? Does it support cross-host and cross-GPU scenarios?
TL;DR
HAMi supports multi-node, multi-GPU distributed training by scheduling multiple Pods on different nodes and leveraging distributed frameworks for cross-host and cross-GPU collaboration. A single Pod supports multiple GPUs on the same node.
Multi-Node, Multi-GPU Distributed Training
HAMi supports distributed training in Kubernetes by running multiple Pods across different nodes and using distributed computing frameworks (e.g., PyTorch, TensorFlow, Horovod) to achieve multi-node, multi-GPU collaboration. Each Pod utilizes local GPU resources, and inter-node communication occurs via high-performance networks such as NCCL or RDMA.
Cross-Host and Cross-GPU Scenarios
- Cross-Host: Multiple Pods are scheduled on different nodes, and inter-node communication synchronizes gradients and updates parameters.
- Cross-GPU: A single Pod can utilize multiple GPUs on the same node for computation tasks.
A single Pod cannot span multiple nodes. If cross-host resource coordination is required, adopt multi-Pod distributed training, where the distributed framework manages task execution across hosts.
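The multi-Pod layout can be sketched as simple rank arithmetic. This assumes one Pod per node and one worker process per GPU, which is a common convention in distributed frameworks rather than something HAMi mandates; the function names are illustrative.

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_pod: int) -> int:
    """Rank of a worker when every Pod runs one process per GPU."""
    return node_rank * gpus_per_pod + local_rank


def world_size(num_pods: int, gpus_per_pod: int) -> int:
    """Total number of worker processes across all Pods."""
    return num_pods * gpus_per_pod


# Two Pods on two nodes, each requesting nvidia.com/gpu: 4.
pods, gpus = 2, 4
layout = [(node, local, global_rank(node, local, gpus))
          for node in range(pods) for local in range(gpus)]
print(world_size(pods, gpus))
print(layout[-1])
```

Each Pod only ever addresses its own local GPUs (local ranks 0 to 3 here). The framework uses the global rank to coordinate across hosts.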
Relationship and Compatibility Between HAMi Device Plugin, Volcano vGPU Device Plugin, and NVIDIA Official Device Plugin
TL;DR
Use only one GPU management plugin per node in a cluster to ensure clarity and stability in resource allocation.
Their Relationship
These three Device Plugins all manage GPU resources but differ in usage scenarios and resource reporting methods:
- HAMi Device Plugin
  - Reports GPU resources to Kubernetes using the extended resource name nvidia.com/gpu.
  - Supports HAMi’s GPU resource management features, including custom vGPU splitting and scheduling.
  - Designed for complex resource management scenarios such as vGPU overcommitment and customized scheduling.
- Volcano vGPU Device Plugin
  - Reports vGPU resources using the extended resource name volcano.sh/vgpu-number.
  - Designed specifically for Volcano’s scheduling optimizations, supporting vGPU virtualization and distributed task scenarios.
  - Typically used in Volcano environments requiring finer-grained scheduling control.
- NVIDIA Official Device Plugin
  - Reports physical GPU resources using the extended resource name nvidia.com/gpu.
  - Provides basic GPU resource allocation functionalities.
  - Focuses on simple and stable GPU allocation scenarios, suitable for tasks that directly use physical GPUs.
Coexistence
- HAMi Device Plugin and NVIDIA Official Device Plugin: Should not coexist to avoid resource conflicts.
- HAMi Device Plugin and Volcano vGPU Device Plugin: Can theoretically coexist, but using only one is recommended to avoid conflicts.
- NVIDIA Official Device Plugin and Volcano vGPU Device Plugin: Can theoretically coexist, but mixed usage is not advised.
Why do Node Capacity and Allocatable show only nvidia.com/gpu and not nvidia.com/gpucores or nvidia.com/gpumem?
TL;DR
Device Plugins can only report a single resource type. GPU memory and compute information is stored as node annotations for use by the scheduler.
Design Constraints of Device Plugins
- Device Plugin interfaces (e.g., Registration and ListAndWatch) only allow reporting and managing a single resource type per plugin instance.
- This design simplifies resource association management but restricts plugins from reporting multiple resource metrics (e.g., GPU compute power and memory).
HAMi’s Implementation
-
HAMi stores detailed GPU resource information (e.g., compute power, memory, model) as node annotations for use by the scheduler.
-
Example annotation:
hami.io/node-nvidia-register: GPU-fc28df76-54d2-c387-e52e-5f0a9495968c,10,49140,100,NVIDIA-NVIDIA L40S,0,true:GPU-b97db201-0442-8531-56d4-367e0c7d6edd,10,49140,100,...
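A scheduler-side reader of this annotation could look roughly like the sketch below. The field meanings (UUID, split count, memory in MiB, core percentage, model, NUMA node, health) are inferred from the example above and are an assumption; NodeGPU and parse_register_annotation are hypothetical names, not HAMi APIs. Entries that do not have all seven fields are skipped rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class NodeGPU:
    uuid: str
    split_count: int
    memory_mib: int
    cores_percent: int
    model: str
    numa: int
    healthy: bool


def parse_register_annotation(value: str) -> list:
    """Split the annotation on ':' per device and on ',' per field."""
    devices = []
    for entry in value.split(":"):
        fields = entry.split(",")
        if len(fields) != 7:
            continue  # skip empty, truncated, or unknown entries instead of guessing
        uuid, count, mem, cores, model, numa, health = fields
        devices.append(NodeGPU(uuid, int(count), int(mem), int(cores),
                               model, int(numa), health == "true"))
    return devices


sample = "GPU-fc28df76-54d2-c387-e52e-5f0a9495968c,10,49140,100,NVIDIA-NVIDIA L40S,0,true:"
gpus = parse_register_annotation(sample)
print(gpus[0].model, gpus[0].memory_mib)
```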
Follow-Up
Why does the Node Capacity show volcano.sh/vgpu-number and volcano.sh/vgpu-memory when using volcano-vgpu-device-plugin?
- volcano-vgpu-device-plugin creates three independent Device Plugin instances, each registering with kubelet for volcano.sh/vgpu-number, volcano.sh/vgpu-memory, and volcano.sh/vgpu-cores resources respectively. After kubelet receives the registration, it automatically writes the resources into Capacity and Allocatable.
- The volcano.sh/vgpu-memory resource is subject to a Kubernetes extended resource quantity limit of 32,767. For GPUs with large memory (e.g., A100 80GB), configure the --gpu-memory-factor parameter to avoid exceeding the limit.
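The arithmetic behind choosing a factor can be sketched as follows. It assumes that each reported unit stands for factor MiB and that the 32,767 limit applies to the value reported for one GPU; how the plugin aggregates several GPUs on a node is not covered here. The helper names are illustrative.

```python
import math

MAX_QUANTITY = 32767  # limit quoted above


def min_memory_factor(total_mib: int) -> int:
    """Smallest integer factor so that total_mib / factor stays within the limit."""
    return max(1, math.ceil(total_mib / MAX_QUANTITY))


def reported_units(total_mib: int, factor: int) -> int:
    """Quantity reported when each unit stands for `factor` MiB."""
    return total_mib // factor


a100_mib = 80 * 1024  # 81920 MiB, above the limit with factor 1
factor = min_memory_factor(a100_mib)
print(factor, reported_units(a100_mib, factor))
print(reported_units(a100_mib, 10))  # a rounder factor such as 10 also fits
```

Under these assumptions, an 80 GB card needs a factor of at least 3. A round value such as 10 keeps the reported quantity well below the limit.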
Why don’t some domestic vendors require a runtime for installation?
Certain domestic vendors (e.g., Hygon, Cambricon) do not require a runtime because their DevicePlugin handles device discovery and mounting directly. In contrast, vendors like NVIDIA and Ascend rely on runtimes for environment configuration, device node mounting, and advanced functionality support.
TL;DR
If the official Device Plugin cannot meet specific requirements (e.g., insufficient information for advanced features), or if adapting it introduces complexity, HAMi implements its own Device Plugin.
HAMi's scheduler requires sufficient information from the Node to decode the corresponding GPU details, which can be provided via:
- Patching Node annotations.
- Reporting resources via the Device Plugin interface to kubelet.
- Directly patching the Node’s status.capacity and status.allocatable.
If the official Device Plugin cannot provide the required information, HAMi develops its own. For example:
- Ascend’s official Device Plugin requires a separate plugin for each card type. HAMi abstracts these card templates into a unified plugin for easier integration with the scheduler.
- NVIDIA requires custom implementations to support advanced features like compute and memory limits, overcommitment, and NUMA awareness, necessitating HAMi’s custom Device Plugin.
How does HAMi enforce GPU memory and compute limits - is it a kernel driver, hardware partition, or a library?
TL;DR
HAMi enforces limits using a user-space CUDA interception library (libvgpu.so, part of HAMi-core). It is not a kernel driver and does not use hardware partitioning. The library intercepts CUDA API calls inside the container before they reach the GPU driver.
When a container starts on a HAMi-managed node, the device plugin mounts libvgpu.so and a /etc/ld.so.preload file into the container via hostPath during the Allocate call. The /etc/ld.so.preload file contains a single line pointing to libvgpu.so. The Linux dynamic linker reads this file when any process starts inside the container and loads libvgpu.so first, before any other library. This achieves the same effect as LD_PRELOAD without modifying any environment variables. Every CUDA memory allocation call (cudaMalloc, cuMemAlloc, and related functions) is then intercepted before it reaches the NVIDIA driver. The library checks the remaining budget from the nvidia.com/gpumem annotation. If the allocation would exceed the limit, the call returns an out-of-memory error to the application.
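The budget check described above can be modeled in a few lines. This is a conceptual sketch in Python, not HAMi-core's code (which is a native library); MemoryBudget and its methods are hypothetical names, and the boolean result stands in for the out-of-memory error the application would receive.

```python
class MemoryBudget:
    """Tracks allocations against a fixed per-container limit."""

    def __init__(self, limit_mib: int):
        self.limit_mib = limit_mib
        self.used_mib = 0

    def alloc(self, size_mib: int) -> bool:
        # Reject before forwarding to the real allocator if the budget would be exceeded.
        if self.used_mib + size_mib > self.limit_mib:
            return False  # the application would see an out-of-memory error
        self.used_mib += size_mib
        return True

    def free(self, size_mib: int) -> None:
        self.used_mib = max(0, self.used_mib - size_mib)


budget = MemoryBudget(limit_mib=3000)  # e.g. nvidia.com/gpumem: 3000
print(budget.alloc(2000))  # True
print(budget.alloc(1500))  # False: 3500 MiB would exceed the 3000 MiB limit
budget.free(2000)
print(budget.alloc(1500))  # True
```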
Compute limits (nvidia.com/gpucores) use a token-bucket throttle inside the same library: compute calls are held until the slice’s compute share is available.
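A token bucket of this kind can be sketched as below. The rate, the capacity, and the cost of a call are hypothetical units chosen for illustration, and the clock is passed in explicitly so that the behavior is easy to follow; HAMi-core's actual accounting may differ.

```python
class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # the bucket starts full
        self.last = 0.0

    def try_consume(self, cost: float, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller is held back and retries later


# A 40% compute share modeled as 40 tokens per second (hypothetical units).
bucket = TokenBucket(rate_per_sec=40, capacity=40)
print(bucket.try_consume(40, now=0.0))    # True: uses the initial tokens
print(bucket.try_consume(20, now=0.125))  # False: only 5 tokens have been refilled
print(bucket.try_consume(20, now=0.5))    # True: 20 tokens are available by now
```

A burst is allowed up to the bucket capacity. After that, work proceeds only as fast as tokens are refilled, which caps the long-run share.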
This approach has two implications:
- No kernel or firmware changes are required. HAMi works on any standard Kubernetes node with NVIDIA drivers v440 or later, without needing MIG-capable hardware or privileged kernel modules.
- Applications that bypass the CUDA library are not covered. If an application calls the driver API directly, or runs in an environment where the /etc/ld.so.preload mount is not effective (for example Docker-in-Docker or when CUDA_DISABLE_CONTROL=true is set), enforcement does not apply. See the related question on gpumem limits not being enforced.
For a detailed diagram of the interception flow, see GPU Virtualization.
How does HAMi vGPU differ from NVIDIA MIG? When should I use each?
TL;DR
HAMi vGPU is a software-only, flexible partitioning scheme with no special hardware requirements. NVIDIA MIG is a hardware partitioning feature available only on MIG-capable data center GPUs from the Ampere generation onward (e.g., A100, A30, H100). Use HAMi vGPU for workloads that need flexible, dynamic allocation across any NVIDIA GPU. Use MIG when strict hardware-level isolation and guaranteed performance are required.
| Property | HAMi vGPU | NVIDIA MIG |
|---|---|---|
| Hardware requirement | Any NVIDIA GPU, driver v440+ | MIG-capable data center GPUs, Ampere or later (e.g., A100, A30, H100, H200) |
| Isolation mechanism | User-space library interception | Hardware engine partitioning |
| Memory enforcement | Soft (CUDA API level) | Hard (hardware-enforced) |
| Compute enforcement | Soft (throttle inside libvgpu.so) | Hard (separate SM partitions) |
| Partition granularity | 1 MiB memory, 1% compute | Fixed MIG profiles (e.g. 1g.10gb) |
| Dynamic reconfiguration | Yes, no node drain needed | Requires reconfiguring the MIG profile and restarting device plugin |
| Multi-tenant noise isolation | Best-effort | Strong |
HAMi also supports dynamic MIG (dynamic-mig), which uses mig-parted to reconfigure MIG profiles on demand and then schedules through HAMi’s scheduler. See Dynamic MIG Support.
Choose HAMi vGPU when:
- The GPU model does not support MIG
- Workloads need flexible memory sizes that do not map to fixed MIG profiles
- Dynamic repacking of GPU resources is needed without node drains
Choose MIG when:
- Strict hardware-level isolation between tenants is a compliance or SLA requirement
- The workload benefits from guaranteed, predictable compute throughput
Why does nvidia-smi inside my container show less memory than on the host?
TL;DR
This is expected behavior. HAMi replaces the driver’s memory reporting inside the container so that nvidia-smi shows only the allocated limit, not the physical GPU memory. The host still sees the full physical memory.
When libvgpu.so intercepts CUDA driver calls, it also intercepts the query functions that nvidia-smi uses to report total and free GPU memory (nvmlDeviceGetMemoryInfo, cuDeviceTotalMem, and related calls). These return the values derived from nvidia.com/gpumem rather than the physical card capacity.
This design is intentional: workloads that check available GPU memory before deciding how much to allocate (for example, vLLM’s memory profiling step) will see only their budget and size accordingly.
If the host’s nvidia-smi shows more memory than expected on a running pod, that is also expected - the host view shows physical memory, not virtual limits.
Why is my nvidia.com/gpumem limit not enforced - the container uses more memory than requested?
TL;DR
The most common causes are: CUDA_DISABLE_CONTROL=true is set in the container, the workload runs inside a nested container environment (Docker-in-Docker), the application bypasses the CUDA library and calls the GPU driver directly, or nvidia-container-runtime is not the default runtime on the node.
Cause 1: CUDA_DISABLE_CONTROL is set
Setting CUDA_DISABLE_CONTROL=true in the container environment disables the HAMi-core enforcement layer entirely. The container can then access the full physical GPU without restriction.
This variable is intended for debugging only. Remove it from production workloads that need memory limits.
Cause 2: Docker-in-Docker (DinD)
When a container runs another container runtime inside it (DinD), the inner container runtime creates a new root filesystem for inner containers. The /etc/ld.so.preload hostPath mount that the outer container has does not carry over to the filesystems of inner containers. The inner container’s CUDA calls go directly to the driver without passing through libvgpu.so.
HAMi enforcement does not apply inside DinD. This is a known limitation with no current workaround.
Cause 3: Direct driver API usage
Some workloads call the NVIDIA Management Library (NVML) or the CUDA Driver API directly, bypassing libvgpu.so. Examples include custom CUDA kernels that use driver-level allocation or monitoring tools that query NVML directly.
Cause 4: nvidia-container-runtime not set as default
If the container runtime on the node is not configured with nvidia-container-runtime as the default, the device plugin cannot inject libvgpu.so into the container environment. Verify the runtime configuration:
```shell
containerd config dump | grep default_runtime_name
```
The output must show nvidia. If it does not, follow the Prerequisites guide to reconfigure.
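For automated node checks, the same verification could be scripted. The sketch below parses text in the format printed by containerd config dump; the sample excerpt is hypothetical and heavily shortened, and default_runtime is an illustrative helper name.

```python
import re


def default_runtime(config_dump: str):
    """Return the default_runtime_name value from `containerd config dump` output."""
    match = re.search(r'^\s*default_runtime_name\s*=\s*"([^"]*)"',
                      config_dump, re.MULTILINE)
    return match.group(1) if match else None


# Hypothetical excerpt; real output contains many more keys.
sample = '''
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  snapshotter = "overlayfs"
'''
print(default_runtime(sample) == "nvidia")
```

On a node, the input text could come from subprocess.run(["containerd", "config", "dump"], capture_output=True, text=True).stdout.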
Does HAMi replace kube-scheduler or run alongside it?
TL;DR
HAMi runs alongside kube-scheduler as a scheduler extender. It does not replace kube-scheduler. All standard Kubernetes scheduling behavior is preserved.
HAMi deploys a hami-scheduler component that registers as an extender to the standard kube-scheduler. The extender adds two callbacks, Filter and Score:
- Filter: removes nodes that do not have enough vGPU resources to satisfy the pod’s request
- Score: ranks the remaining nodes using the configured policy (Binpack or Spread)
kube-scheduler still runs all built-in filters and priorities. HAMi’s extender runs after them. Pods that do not request any HAMi resource (nvidia.com/gpu, nvidia.com/gpumem, etc.) are never touched by the extender and follow the standard scheduling path.
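The Filter and Score steps can be pictured with a toy model. This sketch is not the hami-scheduler implementation: the cluster state, the per-GPU capacity, and the scoring formula are assumptions chosen only to show how Binpack and Spread lead to different node choices.

```python
def filter_nodes(nodes, request_mib):
    """Keep nodes that have at least one GPU with enough free memory."""
    return {name: gpus for name, gpus in nodes.items()
            if any(free >= request_mib for free in gpus)}


def score(gpus, policy, capacity_mib):
    """Higher is better. Binpack favors busy nodes, Spread favors idle ones."""
    used_ratio = 1 - sum(gpus) / (capacity_mib * len(gpus))
    return used_ratio if policy == "binpack" else 1 - used_ratio


def pick_node(nodes, request_mib, policy, capacity_mib=49140):
    feasible = filter_nodes(nodes, request_mib)
    if not feasible:
        return None  # no node can satisfy the request
    return max(feasible, key=lambda n: score(feasible[n], policy, capacity_mib))


# Free memory in MiB per GPU on each node (hypothetical cluster state).
cluster = {"node-a": [40000, 49140], "node-b": [8000, 12000], "node-c": [2000, 1000]}
print(pick_node(cluster, 6000, "binpack"))  # node-b: the busiest node that still fits
print(pick_node(cluster, 6000, "spread"))   # node-a: the least loaded node
```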
The MutatingWebhook sets schedulerName: hami-scheduler on pods that request HAMi resources. Pods without HAMi resource requests keep the default schedulerName and are not affected.
HAMi supports running multiple hami-scheduler replicas with leader election. See the configuration guide for Helm values that control scheduler deployment settings.
Does HAMi work with vLLM, and what are the known limitations for multi-GPU tensor parallelism?
TL;DR
HAMi works with vLLM for single-GPU and multi-GPU workloads. Multi-GPU tensor parallelism (tensor_parallel_size > 1) with vLLM versions greater than 0.18 requires HAMi v2.9.0 or later. Earlier versions had partial fixes but tensor parallelism initialization errors persisted in newer vLLM releases.
Single-GPU vLLM
Single-GPU vLLM with nvidia.com/gpumem works without any special configuration. The memory profiling step inside vLLM reads the memory limit from libvgpu.so and allocates accordingly.
Multi-GPU tensor parallelism
vLLM uses NCCL for cross-GPU communication in tensor parallel mode. Earlier HAMi versions had initialization errors when multiple processes inside a container shared CUDA device memory state files. These issues were progressively addressed across v2.7.x and v2.8.x, with full support for vLLM tensor parallelism on vLLM versions greater than 0.18 landing in v2.9.0.
If encountering NCCL initialization failures or Illegal device id segfaults with tensor parallelism, upgrade to HAMi v2.9.0 or later.
Running vLLM in a Volcano environment
vLLM across multiple pods in a Volcano job environment follows the same rules. Set tensor_parallel_size to the number of GPUs per pod, not the total across all pods. Inter-pod communication uses standard NCCL over the pod network (RDMA or TCP), not the HAMi vGPU layer.
vLLM’s --enforce-eager flag disables CUDA graph capture. Some HAMi versions have issues with CUDA graph capture due to shared memory layout differences. If encountering errors during graph capture, try --enforce-eager as a temporary workaround and check the release notes for the specific version.
For more context, see issues #1764 and #1853.
Is HAMi compatible with NVIDIA GPU Operator and DCGM metrics?
TL;DR
HAMi’s device plugin conflicts with the device plugin deployed by GPU Operator. Use HAMi’s device plugin instead of GPU Operator’s if GPU sharing is needed. DCGM-based metrics work independently and are not affected.
Device plugin conflict
GPU Operator installs its own nvidia-device-plugin DaemonSet. HAMi installs hami-device-plugin. Both report nvidia.com/gpu resources to kubelet. Running both on the same node causes resource reporting conflicts and unpredictable scheduling behavior.
Resolution: disable the NVIDIA device plugin component in GPU Operator by setting devicePlugin.enabled=false in the GPU Operator Helm values, then deploy HAMi’s device plugin normally.
```yaml
# GPU Operator values.yaml
devicePlugin:
  enabled: false
```
DCGM metrics
DCGM Exporter scrapes physical GPU metrics from the NVIDIA driver independently of the device plugin. It is not affected by HAMi’s libvgpu.so and continues to report physical-level counters (temperature, power, SM utilization, physical memory bandwidth) normally.
HAMi’s own metrics (per-container virtual memory usage, virtual core utilization) are exposed separately. See Prometheus and Grafana monitoring below.
How do I set up Prometheus and Grafana monitoring for HAMi vGPU metrics?
TL;DR
HAMi exposes per-container vGPU metrics through hami-device-plugin-monitor. Scrape it with Prometheus and use the bundled Grafana dashboard JSON at static/grafana/gpu-dashboard.json.
Metrics endpoint
The hami-device-plugin pod on each node exposes a metrics endpoint. The port is configurable via the devicePlugin.monitorPort Helm value (default: 31992).
Key metrics exposed:
| Metric name | Description |
|---|---|
| Device_memory_desc_of_container | Virtual GPU memory allocated to a container |
| Device_utilization_desc_of_container | GPU compute utilization reported per container |
| Device_memory_limit_of_container | Memory limit set for the container |
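As a rough example of consuming these metrics, the sketch below parses Prometheus text-format lines and derives a usage ratio from two of the metrics in the table. The sample scrape output is made up: the pod label name and the values are illustrative and may not match what the endpoint actually emits.

```python
import re

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')


def parse_samples(text):
    """Parse Prometheus text-format lines into (metric, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip comments and blank lines
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group(2)))
        samples.append((m.group(1), labels, float(m.group(3))))
    return samples


def usage_ratio(samples, pod):
    """Ratio of the two memory metrics from the table above for one pod."""
    used = limit = None
    for name, labels, value in samples:
        if labels.get("pod") != pod:
            continue
        if name == "Device_memory_desc_of_container":
            used = value
        elif name == "Device_memory_limit_of_container":
            limit = value
    return used / limit if used is not None and limit else None


sample = '''
# Made-up scrape output; label names and values are illustrative only.
Device_memory_desc_of_container{pod="trainer-0"} 1500
Device_memory_limit_of_container{pod="trainer-0"} 3000
'''
print(usage_ratio(parse_samples(sample), "trainer-0"))
```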
Prometheus scrape config
Add a scrape job to your Prometheus configuration:
```yaml
scrape_configs:
  - job_name: hami-device-plugin
    static_configs:
      - targets:
          - <node-ip>:31992
```
For Prometheus Operator, create a ServiceMonitor targeting the hami-device-plugin service on port 31992.
Grafana dashboard
A pre-built Grafana dashboard JSON is included in the repository at static/grafana/gpu-dashboard.json. Import it into Grafana via Dashboards > Import > Upload JSON file.
The dashboard shows per-node and per-container virtual GPU memory and compute usage alongside physical GPU counters. If DCGM Exporter is also deployed, the physical counters are populated automatically; otherwise, only the HAMi virtual metrics are available.
For a step-by-step walkthrough, see Grafana Dashboard.