HAMi vs MIG: 2 Weeks of Real Testing on H100 GPUs
The GPU Sharing Question That Started This
After my previous post about MIG setup gained traction on Reddit, the community asked a critical question: How does MIG compare to HAMi for real workloads?
I spent 2 weeks testing both technologies on H100 hardware with actual ML workloads. Here's what surprised me.
Three Things That Surprised Me
1. Synthetic Benchmarks Are Misleading
Expected: 4x performance difference based on synthetic benchmarks
Reality: 1.7x performance difference with real BERT training
Lesson: Always test with your actual workloads
2. Error Messages Matter More Than Performance Numbers
HAMi: Clear, actionable CUDA out-of-memory errors with specific usage details
MIG: Cryptic internal PyTorch assertions that waste debugging time
Impact: When things break at 3 AM, clear error messages save hours
3. MIG Operations Are an SRE Nightmare
HAMi: 30-second job restart for configuration changes
MIG: 15-minute node reboot cycle affecting all workloads
Reality: MIG's operational immaturity in Kubernetes makes simple changes painful
Real-World Performance Results
BERT Training: The Transformer Test
I trained BERT-base for 100 epochs with batches of 32 sequences on both technologies.
| Technology | Average Time per Batch | Real Performance Difference |
|---|---|---|
| HAMi 12GB | 35.3 ms | 70% faster |
| MIG 1g.12gb | 60.0 ms | Baseline |
Training a custom model overnight? HAMi finishes in 6 hours; MIG takes 10 hours. That difference isn't just time, it's cost too.
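The back-of-envelope math connecting per-batch latency to those overnight numbers, assuming roughly 600,000 batches in total (about 6,000 per epoch over the 100-epoch run; an assumption for illustration, not a measured count):
awk 'BEGIN {
  batches = 600000                                   # assumed total batch count (~6,000 per epoch x 100 epochs)
  printf "HAMi: ~%.1f hours\n", batches * 35.3 / 1000 / 3600
  printf "MIG:  ~%.1f hours\n", batches * 60.0 / 1000 / 3600
}'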
The consistency story:
# HAMi performance (consistent across epochs)
HAMi - Epoch 70: 32.5ms
HAMi - Epoch 80: 33.1ms
HAMi - Epoch 90: 33.6ms
# MIG performance (predictably stable)
MIG - Epoch 70: 58.7ms
MIG - Epoch 80: 58.8ms
MIG - Epoch 90: 59.3ms
HAMi delivered consistent performance with software time-slicing. MIG provided predictably stable performance through hardware isolation. Both approaches worked reliably, just at different speeds.
The Debugging Experience That Actually Matters
I deliberately triggered memory pressure to see how each technology handles failures.
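If you want to reproduce these failure modes, a minimal sketch is to keep allocating GPU memory until you blow past the pod's limit. This assumes a running training pod with PyTorch installed; the pod name is a placeholder, not the exact harness I used:
kubectl exec -it <training-pod> -- python -c "
import torch
buffers = []
while True:
    buffers.append(torch.empty(1024**3, dtype=torch.uint8, device='cuda'))  # 1 GiB per iteration
    print(f'allocated {len(buffers)} GiB')
"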
When HAMi Hits Memory Limits
[HAMI-core ERROR]: Device 0 OOM 12889096192 / 12582912000
Failed at 22 GB: CUDA out of memory. Tried to allocate 512.00 MiB.
GPU 0 has a total capacity of 11.72 GiB of which 220.00 MiB is free.
Including non-PyTorch memory, this process has 11.50 GiB memory in use.
Analysis: Clear error with exact memory usage, allocation attempt size, and available memory. Your data scientist knows exactly what to fix.
When MIG Hits Memory Limits
Failed at 21 GB: NVML_SUCCESS == r INTERNAL ASSERT FAILED at
"/opt/conda/conda-bld/pytorch_1708025847130/work/c10/cuda/CUDACachingAllocator.cpp":830,
please report a bug to PyTorch.
Analysis: Cryptic internal assertion with no actionable information. The on-call engineer gets paged at 3 AM, sees this error, and has no idea what to do. With HAMi, they know exactly what happened and how to fix it.
This error message quality alone makes HAMi worth considering for teams that value debugging efficiency over perfect isolation.
Multi-User Reality: How They Handle Contention
Test: 5 identical training jobs submitted simultaneously
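A minimal way to run this test is to stamp out five copies of the same Job manifest; the file and job names below are placeholders:
for i in 1 2 3 4 5; do
  sed "s/name: bert-training/name: bert-training-$i/" training-job.yaml | kubectl apply -f -
done
kubectl get jobs -w   # watch how each technology schedules and completes the five jobs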
HAMi Multi-User Behavior
- All jobs scheduled immediately
- Time-slicing managed fairly by scheduler lock
- Jobs completed within 5% variance of each other
- No job starvation observed
- Trade-off: Brief scheduling delays when many jobs compete for time slices
MIG Multi-User Behavior
- Jobs ran in complete hardware isolation
- Zero interference between workloads
- Identical performance regardless of other activity
- Trade-off: Fixed resource allocation means unused capacity can't be reclaimed
The reality: HAMi optimizes for maximum utilization through intelligent scheduling. MIG optimizes for predictable performance through hardware guarantees.
Operational Complexity: The SRE Perspective
Changing Resource Allocations
HAMi workflow:
kubectl delete job training-job
kubectl apply -f training-job-updated.yaml # New memory allocation
# Impact: 30-second job restart, other workloads unaffected
MIG workflow:
# Full node maintenance cycle required:
1. kubectl drain gpu-node-1 # Evacuate ALL workloads
2. Reconfigure MIG profiles # Modify hardware partitions
3. Reboot node # Apply changes (mandatory)
4. kubectl uncordon gpu-node-1 # Resume scheduling
# Impact: 15-minute downtime affecting the entire node
The SRE reality: MIG in Kubernetes feels operationally immature. Simple configuration changes require maintenance windows, node labeling, and complex coordination. HAMi treats GPU resources like standard Kubernetes resources.
Monitoring and Debugging
Common challenges (both technologies):
- DCGM metrics show node/instance-level data only
- Missing per-pod GPU utilization metrics
- Custom monitoring required for detailed workload insights (see the sketch after these lists)
HAMi advantages:
- Standard Kubernetes debugging workflows
- Clear error messages with actionable details
- Application logs contain useful GPU context
MIG advantages:
- Hardware isolation simplifies problem identification
- Per-instance metrics via nvidia-smi mig -lgip
- No cross-workload interference during troubleshooting
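For the custom monitoring gap mentioned above, one starting point is scraping dcgm-exporter directly. This sketch assumes the GPU Operator's default nvidia-dcgm-exporter service on port 9400; names may differ in your cluster:
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
sleep 2   # give the port-forward a moment to establish
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_FB_USED   # framebuffer memory in use, per GPU (and per MIG instance when enabled)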
Simple Decision Framework
Most teams should start with HAMi unless they have specific compliance requirements
Choose HAMi if:
- You run overnight training jobs (6 hours vs 10 hours matters)
- You debug GPU memory issues regularly (clear error messages save hours)
- You need to change GPU allocations frequently (30 seconds vs 15 minutes)
- You have trusted internal users (data science teams, research groups)
Choose MIG if:
- You have external users or compliance requirements (hardware isolation is non-negotiable)
- You prefer predictable performance over peak performance
- You can plan resource allocations in advance
- You have the operational overhead budget for complex reconfiguration workflows
Supporting Evidence: Synthetic Benchmarks
These controlled tests explain the theoretical differences but don't predict real-world performance
Memory Bandwidth Performance
| Configuration | GPU Internal (GB/s) |
|---|---|
| HAMi 12GB | 1,667 |
| MIG 1g.12gb | 202 |
Compute Performance
| Configuration | Matrix Multiplication (TFLOPS) |
|---|---|
| HAMi 12GB | 45.4 |
| MIG 1g.12gb | 4.8 |
Key insight: Synthetic benchmarks showed 8-9x performance differences, but real BERT training narrowed that to a 1.7x gap. That's exactly why testing with actual workloads is critical.
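If you want to generate comparable compute numbers yourself, a timed half-precision matrix multiplication is the usual probe. A minimal sketch (PyTorch assumed; this is not the exact script from the gpu-benchmark repo linked below):
python -c "
import time, torch
n = 8192
a = torch.randn(n, n, device='cuda', dtype=torch.float16)
b = torch.randn(n, n, device='cuda', dtype=torch.float16)
torch.cuda.synchronize()
t0 = time.time()
for _ in range(50):
    a @ b                                   # 2*n^3 FLOPs per matmul
torch.cuda.synchronize()
tflops = 2 * n**3 * 50 / (time.time() - t0) / 1e12
print(f'{tflops:.1f} TFLOPS')
"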
Getting Started: Test Both Approaches
Week 1: HAMi Evaluation
# Install HAMi
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm install hami hami-charts/hami -n kube-system
# Test with your actual training job
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-training-hami
spec:
  template:
    spec:
      restartPolicy: Never   # required for Jobs
      containers:
      - name: training
        image: your-training-image
        resources:
          limits:
            nvidia.com/gpu: 1
            nvidia.com/gpumem: 12000   # 12 GB slice, in MB
EOF
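A couple of sanity checks before trusting the numbers (component and node names may differ in your install):
kubectl get pods -n kube-system | grep -i hami      # HAMi scheduler and device plugin should be Running
kubectl describe node <gpu-node> | grep -i gpumem   # node should advertise nvidia.com/gpumem capacity
kubectl logs job/bert-training-hami                 # training output from the test job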
Week 2: MIG Evaluation
# Configure MIG profiles (prepare for operational complexity)
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      test-config:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.12gb": 2
EOF
kubectl label nodes gpu-node-1 nvidia.com/mig.config=test-config
# Run same training job for comparison
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-training-mig
spec:
  template:
    spec:
      restartPolicy: Never   # required for Jobs
      containers:
      - name: training
        image: your-training-image
        resources:
          limits:
            nvidia.com/mig-1g.12gb: 1
EOF
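Before comparing results, confirm the partitioning actually took effect (node and resource names follow the example above):
kubectl get node gpu-node-1 -o jsonpath='{.status.allocatable}'   # expect two nvidia.com/mig-1g.12gb resources
nvidia-smi mig -lgi                                               # run on the node: lists the created GPU instances
kubectl logs job/bert-training-mig                                # compare batch times against the HAMi run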
What to Measure
- Training time with your actual models (ignore synthetic benchmarks)
- Error handling experience during memory pressure
- Multi-user scenarios with your team's usage patterns
- Configuration change complexity and acceptable downtime
- Debugging workflows when things inevitably break
My Honest Recommendation
For most internal ML teams: Start with HAMi, keep it simple. The 70% performance improvement matters for iteration speed, the error messages save debugging time, and the operational model fits Kubernetes better.
For multi-tenant platforms: Although the operational complexity is a nightmare, I'd choose MIG. Hardware isolation and compliance guarantees are worth the overhead when serving external users.
Hybrid approach: Use HAMi for training clusters (performance-critical) and MIG for inference serving (isolation-critical).
Most importantly: Test with YOUR workloads. These results show what's possible, but your specific models, team dynamics, and operational requirements determine the right choice.
What's Next
I'm continuing to test these technologies in production-scale scenarios. Upcoming posts will cover:
- Multi-node GPU scheduling strategies
- Cost optimization for shared GPU infrastructure
- Production monitoring best practices
- Migration strategies between technologies
Resources
Reproduce these tests: github.com/kaskol10/gpu-benchmark
HAMi project: github.com/Project-HAMi/HAMi
MIG setup guide: NVIDIA MIG User Guide
Previous MIG post: Setting Up MIG on H100
Community discussion: Reddit r/kubernetes thread
Results based on 2 weeks of testing with H100 hardware. Performance will vary based on your workloads, cluster configuration, and operational requirements.