SKILL.md Frontmatter

name: rocm_vllm_deployment
description: Production-ready vLLM deployment on AMD ROCm GPUs. Combines environment auto-check, model parameter detection, Docker Compose deployment, health verification, and functional testing with comprehensive logging and security best practices.
version: 1.0.0
author: Alex He <heye_dev@163.com>
timeout: 3600s
platform: Linux (AMD GPU ROCm)
tags: LLM, Deployment, AMD, ROCm, Docker Compose, vLLM, Automation, EnvCheck, AutoRepair

ROCm vLLM Deployment Skill

Production-ready automation for deploying vLLM inference services on AMD ROCm GPUs using Docker Compose.

Features

  • Environment Auto-Check - Detects and repairs missing dependencies
  • Model Parameter Detection - Auto-reads config.json for optimal settings
  • VRAM Estimation - Calculates memory requirements before deployment
  • Secure Token Handling - Never writes tokens to compose files
  • Structured Output - All logs and test results saved per-model
  • Deployment Reports - Human-readable summary for each deployment
  • Health Verification - Automated health checks and functional tests
  • Troubleshooting Guide - Common issues and solutions

Environment Prerequisites

Recommended (for production): Add to ~/.bash_profile:

# HuggingFace authentication token (required for gated models)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Model cache directory (optional)
export HF_HOME="$HOME/models"

# Apply changes
source ~/.bash_profile

Not required for testing: The skill will proceed without these set:

  • HF_TOKEN: Optional — public models work without it; gated models fail at download with clear error
  • HF_HOME: Optional — defaults to /root/.cache/huggingface/hub

Environment Variable Detection

Priority Order:

  1. Explicit parameter (highest) — Provided in task/request (e.g., hf_token: "xxx")
  2. Environment variable — Already set in shell or from parent process
  3. ~/.bash_profile — Sourced to load variables
  4. Default value (lowest) — HF_HOME defaults to /root/.cache/huggingface/hub

If a variable is missing:

  • HF_TOKEN (conditional) — Continue without token (public models work; gated models fail at download with clear error)
  • HF_HOME (optional) — Warn and default to /root/.cache/huggingface/hub

Philosophy: Fail fast for configuration errors, fail at download time for authentication errors.
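
A minimal sketch of this resolution order in bash; the hf_token_param and hf_home_param names are hypothetical placeholders for values passed explicitly in the request:

# Remember values already present in the environment (priority 2)
env_token="${HF_TOKEN:-}"
env_home="${HF_HOME:-}"

# Priority 3: load ~/.bash_profile (may set HF_TOKEN / HF_HOME)
[ -f ~/.bash_profile ] && source ~/.bash_profile

# Priority 1 > 2 > 3 > 4: explicit parameter, then environment,
# then profile value, then the documented default
HF_TOKEN="${hf_token_param:-${env_token:-${HF_TOKEN:-}}}"
HF_HOME="${hf_home_param:-${env_home:-${HF_HOME:-/root/.cache/huggingface/hub}}}"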


Helper Scripts

Location: <skill-dir>/scripts/

check-env.sh

Validate and load environment variables before deployment.

Usage:

# Basic check (HF_TOKEN optional, HF_HOME optional with default)
./scripts/check-env.sh

# Strict mode (HF_HOME required, fails if not set)
./scripts/check-env.sh --strict

# Quiet mode (minimal output, for automation)
./scripts/check-env.sh --quiet

# Test with environment variables
HF_TOKEN="hf_xxx" HF_HOME="/models" ./scripts/check-env.sh

Exit Codes:

  • 0 — Environment check completed (variables loaded or defaulted)
  • 2 — Critical error (e.g., cannot source ~/.bash_profile)

Note: This script is optional. You can also directly run source ~/.bash_profile.


generate-report.sh

Generate human-readable deployment report after successful deployment.

Usage:

./scripts/generate-report.sh <model-id> <container-name> <port> <status> [model-load-time] [memory-used]

# Example:
./scripts/generate-report.sh \
  "Qwen-Qwen3-0.6B" \
  "vllm-qwen3-0-6b" \
  "8001" \
  "✅ Success" \
  "3.6" \
  "1.2"

Parameters:

  • model-id (required) — Model ID (with / replaced by -)
  • container-name (required) — Docker container name
  • port (required) — Host port for API endpoint
  • status (required) — Deployment status (e.g., "✅ Success")
  • model-load-time (optional) — Model loading time in seconds
  • memory-used (optional) — Memory consumption in GiB

Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md

Exit Codes:

  • 0 — Report generated successfully
  • 1 — Missing required parameters
  • 2 — Output directory not found

Integration: This script is automatically called in Phase 7 of the deployment workflow.


Input Schema

  • model_id (String, required) — HuggingFace model ID
  • docker_image (String, optional; default rocm/vllm-dev:nightly) — vLLM Docker image
  • tensor_parallel_size (Integer, optional; default 1) — Number of GPUs
  • port (Integer, optional; default 9999) — API server port
  • hf_home (String, optional; default ${HF_HOME} or /root/.cache/huggingface/hub) — Model cache directory
  • hf_token (Secret, conditional; default ${HF_TOKEN}) — HuggingFace token (optional for public models, required for gated models)
  • max_model_len (Integer, optional; default auto-detect) — Maximum sequence length
  • gpu_memory_utilization (Float, optional; default 0.85) — GPU memory utilization
  • auto_install (Boolean, optional; default true) — Auto-install dependencies
  • log_level (String, optional; default INFO) — Logging verbosity

Output Structure

All deployment artifacts MUST be saved to:

$HOME/vllm-compose/<model-id-slash-to-dash>/

Convert the model ID to a directory name by replacing / with - (a bash sketch follows the examples):

  • openai/gpt-oss-20b → $HOME/vllm-compose/openai-gpt-oss-20b/
  • Qwen/Qwen3-Coder-Next-FP8 → $HOME/vllm-compose/Qwen-Qwen3-Coder-Next-FP8/
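
A minimal sketch of the conversion using bash parameter expansion:

MODEL_ID="openai/gpt-oss-20b"
OUT_DIR="$HOME/vllm-compose/${MODEL_ID//\//-}"   # replace every / with -
mkdir -p "$OUT_DIR"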

Per-model directory structure:

$HOME/vllm-compose/<model-id>/
├── deployment.log          # Full deployment logs (stdout + stderr)
├── test-results.json       # Functional test results (JSON format)
├── docker-compose.yml      # Generated Docker Compose file
├── .env                    # HF_TOKEN environment (chmod 600, optional)
└── DEPLOYMENT_REPORT.md    # Human-readable deployment summary

File requirements:

  • deployment.log — Capture ALL container logs during deployment
  • test-results.json — Save API response from functional test request
  • DEPLOYMENT_REPORT.md — Generated in Phase 7
  • All three files MUST exist before marking deployment as complete

Execution Workflow

Phase 0: Environment Check & Auto-Repair

Step 0.1: Load Environment Variables

# Source ~/.bash_profile to load HF_HOME and HF_TOKEN
source ~/.bash_profile

If HF_HOME is not defined in ~/.bash_profile, it defaults to /root/.cache/huggingface/hub.

Step 0.2: Create Output Directory

  • Create: $HOME/vllm-compose/<model-id>/

Step 0.3: Initialize Logging

  • All output → $HOME/vllm-compose/<model-id>/deployment.log

Step 0.4: System Checks

  • Detect OS and package manager
  • Check Python, pip, huggingface_hub
  • Check Docker, docker compose
  • Check ROCm tools (rocm-smi/amd-smi)
  • Check GPU access (/dev/kfd, /dev/dri)
  • Check disk space (20GB minimum) — a combined check sketch follows
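
A minimal sketch of these checks in bash (standard tooling only; amd-smi is accepted where rocm-smi is absent, and the 20GB threshold matches the requirement above):

# Core tooling
command -v python3 >/dev/null || echo "MISSING: python3"
command -v docker  >/dev/null || echo "MISSING: docker"
docker compose version >/dev/null 2>&1 || echo "MISSING: docker compose plugin"
python3 -c "import huggingface_hub" 2>/dev/null || echo "MISSING: huggingface_hub"

# ROCm tools and GPU device nodes
command -v rocm-smi >/dev/null || command -v amd-smi >/dev/null || echo "MISSING: rocm-smi/amd-smi"
{ [ -e /dev/kfd ] && [ -d /dev/dri ]; } || echo "MISSING: GPU device nodes (/dev/kfd, /dev/dri)"

# Disk space (20GB minimum)
avail_gb=$(df -BG --output=avail "$HOME" | tail -1 | tr -dc '0-9')
[ "${avail_gb:-0}" -ge 20 ] || echo "LOW DISK: ${avail_gb}GB available, 20GB required"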

Phase 1: Model Download

Use HF_HOME from Phase 0 (environment variable or default):

# Download model into the HF cache (huggingface-cli honors $HF_HOME and
# creates $HF_HOME/hub/models--<org>--<model>/ itself)
huggingface-cli download <model_id>

# Or use snapshot_download via Python (cache_dir expects the hub cache directory):
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='<model_id>', cache_dir='$HF_HOME/hub')"

Authentication Handling:

  • Public model + no token — ✅ Download succeeds
  • Public model + token provided — ✅ Download succeeds
  • Gated model + no token — ❌ Download fails with "authentication required" error
  • Gated model + invalid token — ❌ Download fails with "invalid token" error
  • Gated model + valid token — ✅ Download succeeds

On Authentication Failure:

echo "ERROR: Model download failed - authentication required"
echo "This model requires a valid HF_TOKEN."
echo ""
echo "Please add to ~/.bash_profile:"
echo "  export HF_TOKEN=\"hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\""
echo "Then run: source ~/.bash_profile"
exit 1

  • Locate model path in HF cache: $HF_HOME/hub/models--<org>--<model-name>/
  • Log download progress to deployment.log

Phase 2: Model Parameter Detection

  • Read config.json from the model
  • Auto-detect: max_model_len, hidden_size, num_attention_heads, num_hidden_layers, vocab_size, dtype
  • Validate that the TP size divides the attention head count
  • Estimate the VRAM requirement (see the sketch below)
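
A minimal sketch of this phase, assuming the model is already in the HF cache. The parameter-count formula (12·L·h² + vocab·h) is a rough transformer approximation used here only for a ballpark weight-size estimate, not the skill's exact method:

CONFIG=$(find "$HF_HOME/hub/models--<org>--<model>/snapshots" -name config.json | head -1)
TP=1

python3 - "$CONFIG" "$TP" <<'EOF'
import json, sys

cfg = json.load(open(sys.argv[1]))
tp = int(sys.argv[2])

heads   = cfg["num_attention_heads"]
layers  = cfg["num_hidden_layers"]
hidden  = cfg["hidden_size"]
vocab   = cfg["vocab_size"]
max_len = cfg.get("max_position_embeddings", "unknown")
dtype   = cfg.get("torch_dtype", "unknown")

# Tensor parallel size must evenly divide the attention heads
if heads % tp:
    sys.exit(f"ERROR: tensor_parallel_size={tp} does not divide {heads} heads")

# Rough parameter count: 12*L*h^2 (transformer blocks) + vocab*h (embeddings)
params = 12 * layers * hidden**2 + vocab * hidden
bytes_per = 2 if dtype in ("bfloat16", "float16") else 4
print(f"max_model_len={max_len} dtype={dtype}")
print(f"estimated weights: {params * bytes_per / 2**30:.1f} GiB (plus KV cache overhead)")
EOF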

Phase 3: Docker Compose Configuration

Generate files in output directory:

  • docker-compose.yml → $HOME/vllm-compose/<model-id>/docker-compose.yml

    • Mount HF_HOME as volume (read-only for models)
    • NO hardcoded tokens in compose file
  • .env → $HOME/vllm-compose/<model-id>/.env (optional)

    • Contains: HF_TOKEN=<value>
    • Permissions: chmod 600
    • Only created if user explicitly requests persistent token storage

Volume mount example:

volumes:
  - ${HF_HOME}:/root/.cache/huggingface/hub:ro
  - /dev/kfd:/dev/kfd
  - /dev/dri:/dev/dri

Important: Docker Compose resolves ${HF_HOME} from the host environment at runtime, so run source ~/.bash_profile before invoking docker compose. A fuller compose sketch follows.
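
A sketch of a generated compose file, assuming the defaults above. The container name, model, and serve flags are illustrative, and GPU access is expressed via the compose devices key (same intent as the device mounts in the volume example):

services:
  vllm:
    image: rocm/vllm-dev:nightly
    container_name: vllm-qwen3-0-6b            # illustrative
    command: >
      vllm serve Qwen/Qwen3-0.6B
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.85
    environment:
      HF_TOKEN: ${HF_TOKEN}                    # resolved at runtime, never hardcoded
    ports:
      - "9999:8000"                            # host port 9999 -> vLLM's default 8000
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    ipc: host
    security_opt:
      - seccomp:unconfined
    volumes:
      - ${HF_HOME}:/root/.cache/huggingface/hub:ro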

Phase 4: Container Launch

Important: Before deploying, pull the latest image to ensure updates:

docker pull rocm/vllm-dev:nightly

Note: The default port is 9999. Before running docker compose, check that the port is free: ss -tlnp | grep :<port>. If it is in use, specify a different port in docker-compose.yml.

  • Pass HF_TOKEN at runtime: HF_TOKEN=$HF_TOKEN docker compose up -d (combined sketch below)
  • Wait for container initialization
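
A minimal launch sketch combining these steps, with a fallback to the next free port. The fallback loop is an illustrative addition and assumes the generated compose file maps "${PORT}:8000":

PORT=9999
docker pull rocm/vllm-dev:nightly

# Find a free host port, starting from the default
while ss -tlnp | grep -q ":${PORT} "; do
  echo "Port ${PORT} in use, trying $((PORT + 1))"
  PORT=$((PORT + 1))
done

# Pass the token at runtime; nothing is written into the compose file
HF_TOKEN="$HF_TOKEN" PORT="$PORT" docker compose up -d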

Phase 5: Health Verification

  • Check container status
  • Test the /health endpoint
  • Test the /v1/models endpoint (see the sketch below)
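
A minimal verification sketch that polls /health until the server is ready; the five-minute budget (60 tries × 5s) is an illustrative choice:

PORT=9999
BASE="http://localhost:${PORT}"

# Container running?
docker compose ps

# Poll /health until it returns HTTP 200
for _ in $(seq 1 60); do
  curl -sf "${BASE}/health" >/dev/null && break
  sleep 5
done

# List the served model
curl -s "${BASE}/v1/models"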

Phase 6: Functional Testing

  • Run a completion test via the /v1/chat/completions API (see the sketch below)
  • Save the response to: $HOME/vllm-compose/<model-id>/test-results.json
  • Verify the response contains a valid completion
  • Log deployment complete → append to deployment.log
  • Deployment is complete only when both files exist:
    • deployment.log
    • test-results.json
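
A sketch of the functional test against the OpenAI-compatible API; the model name and prompt are illustrative:

PORT=9999
OUT="$HOME/vllm-compose/<model-id>/test-results.json"

curl -s "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }' | tee "$OUT"

# Verify the response contains a non-empty completion
python3 -c "import json; r = json.load(open('$OUT')); assert r['choices'][0]['message']['content'], 'empty completion'; print('functional test OK')"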

Phase 7: Deployment Report

Generate human-readable deployment report using the helper script.

Step 7.1: Extract Deployment Metrics

# Parse deployment.log for metrics (head -1 guards against repeated matches)
MODEL_LOAD_TIME=$(grep -o "model loading took [0-9.]* seconds" deployment.log | head -1 | grep -o '[0-9.]*' || echo "N/A")
MEMORY_USED=$(grep -o "took [0-9.]* GiB memory" deployment.log | head -1 | grep -o '[0-9.]*' || echo "N/A")

Step 7.2: Generate Report

# Execute the report generation script
<skill-dir>/scripts/generate-report.sh \
  "<model-id>" \
  "<container-name>" \
  "<port>" \
  "<status>" \
  "$MODEL_LOAD_TIME" \
  "$MEMORY_USED"

# Example:
./scripts/generate-report.sh \
  "Qwen-Qwen3-0.6B" \
  "vllm-qwen3-0-6b" \
  "8001" \
  "✅ Success" \
  "3.6" \
  "1.2"

Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md

Report Contents:

  • Output structure verification (file checklist)
  • Deployment summary table (health, test, metrics)
  • Test results (request/response preview)
  • Environment configuration
  • Quick commands for operations

Completion Criteria:

  • DEPLOYMENT_REPORT.md exists in output directory
  • Report contains all required sections
  • All file checks show ✅

Security Best Practices

  1. Never commit tokens to version control — Add .env to .gitignore
  2. Use .env files with chmod 600 — Restrict access to the owner only (see the sketch after this list)
  3. Mask tokens in logs — Show only the first 10 chars: ${TOKEN:0:10}...
  4. Pass tokens at runtime — HF_TOKEN=$HF_TOKEN docker compose up -d
  5. Store tokens in ~/.bash_profile for production — HF_TOKEN is validated at download time and is required for gated models
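
A minimal sketch of creating the optional .env file with owner-only permissions:

# Create .env with restrictive permissions before the secret is written
OUT_DIR="$HOME/vllm-compose/<model-id>"
umask 177                                    # new files are created as 600
printf 'HF_TOKEN=%s\n' "$HF_TOKEN" > "$OUT_DIR/.env"
chmod 600 "$OUT_DIR/.env"                    # explicit, in case umask was looser

echo "Token written: ${HF_TOKEN:0:10}..."    # masked for logs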

Troubleshooting

Environment Variables

  • HF_TOKEN not set — Add export HF_TOKEN="hf_xxx" to ~/.bash_profile, then source ~/.bash_profile; or provide it via parameter
  • HF_HOME not set — Defaults to /root/.cache/huggingface/hub; for production, add export HF_HOME="/path" to ~/.bash_profile
  • ~/.bash_profile not found — Create ~/.bash_profile and add the environment variables
  • Changes not taking effect — Run source ~/.bash_profile or restart the terminal
  • HF_TOKEN provided but download still fails — The token may be invalid or lack access to the model; verify it at https://huggingface.co/settings/tokens

Model Download

  • Authentication required (gated model) — Set HF_TOKEN in ~/.bash_profile or provide it via parameter; ensure the token has access to the model
  • Model not found — Verify the model ID is correct (case-sensitive) and exists on HuggingFace
  • Download timeout — Check the network connection; large models may take time

Deployment

  • hf CLI not found — pip install huggingface_hub
  • Docker Compose fails — Use docker compose (no hyphen)
  • GPU access fails — Add the user to the render group: sudo usermod -aG render $USER
  • Port in use — Change the port parameter
  • OOM — Reduce gpu_memory_utilization

Cleanup

cd $HOME/vllm-compose/<model-id>
docker compose down

Status Check

Check deployment status and logs:

# View deployment directory
ls -la $HOME/vllm-compose/<model-id>/

# View live logs
tail -f $HOME/vllm-compose/<model-id>/deployment.log

# View test results
cat $HOME/vllm-compose/<model-id>/test-results.json

# Check container status
docker ps | grep <model-id>

# Verify environment variables
echo "HF_TOKEN: ${HF_TOKEN:0:10}..."
echo "HF_HOME: $HF_HOME"

Quick Start (Production)

Step 1: Add environment variables to ~/.bash_profile

# Required: HuggingFace token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Recommended: Custom model storage path (production)
export HF_HOME="/data/models/huggingface"

# Apply changes
source ~/.bash_profile

Step 2: Verify environment is ready

# Run the environment check script (loads variables and prints a summary)
<skill-dir>/scripts/check-env.sh

# Expected output:
# === Environment Ready ===
# Summary:
#   HF_TOKEN: hf_xxxxxx...
#   HF_HOME:  /data/models/huggingface

Step 3: Run deployment

# The skill will automatically:
# 1. Source ~/.bash_profile to load HF_HOME and HF_TOKEN
# 2. Use HF_TOKEN and HF_HOME from environment (or ~/.bash_profile, or defaults)
# 3. Proceed without token for public models
# 4. Fail at download time with clear error if gated model requires token

Version History

  • 1.0.0 — Initial release