Hi HN, I’m one of the authors of this post.
We’ve updated Docker Model Runner to support vLLM alongside the existing llama.cpp backend. The goal is to bridge the gap between local prototyping (often done with GGUF/llama.cpp) and high-throughput production (often done with Safetensors/vLLM) using a consistent Docker workflow.
Key technical details:
Auto-routing: The tool detects the model format. If you pull a GGUF model, it routes to llama.cpp. If you pull a Safetensors model, it routes to vLLM (see the pull/run example after this list).
API: It exposes an OpenAI-compatible API (/v1/chat/completions), so client code doesn't need to change based on the backend (curl sketch below).
Usage: It’s just docker model run ai/smollm2-vllm.
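To make the auto-routing concrete, here's a minimal sketch. ai/smollm2-vllm is the Safetensors model from above; ai/smollm2 as its GGUF-packaged counterpart is my assumption here, so substitute whichever GGUF model you actually use:

  # GGUF artifact -> Model Runner serves it with llama.cpp
  docker model pull ai/smollm2
  docker model run ai/smollm2 "Give me a one-line summary of vLLM."

  # Safetensors artifact -> Model Runner serves it with vLLM
  docker model pull ai/smollm2-vllm
  docker model run ai/smollm2-vllm "Give me a one-line summary of vLLM."

No flag is needed; the backend is picked from the artifact format.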
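And because the API is OpenAI-compatible, any OpenAI-style client works against it. A hedged curl example, assuming Model Runner is exposed on the host at localhost:12434 with the /engines prefix (that's the default on my setup; adjust the base URL to however it's exposed in yours):

  curl http://localhost:12434/engines/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "ai/smollm2-vllm",
      "messages": [
        {"role": "user", "content": "Say hello in one sentence."}
      ]
    }'

The same request works whether the model behind it is served by llama.cpp or vLLM, which is the point of keeping the API surface identical.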
Current Limitations:
Right now, the vLLM backend is optimized for x86_64 with Nvidia GPUs.
We are actively working on WSL2 support for Windows users and DGX Spark compatibility.
Happy to answer any questions about the integration or the roadmap!
https://www.docker.com/blog/docker-model-runner-integrates-v...