DEV Community

# vllm

Posts

I built an open-source alternative to Microsoft's KAITO that works on ANY Kubernetes cluster

Comments
2 min read
Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%

Comments
9 min read
KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

1
Comments
8 min read
Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run

4
Comments
14 min read
vLLM's V1 Release Fixes the Silent Killer in RL Training

Comments
2 min read
The 70B Threshold: How the RTX 5090 Rewrites the Home Lab Equation

Comments
8 min read
How RunPod FlashBoot Actually Works (4-Request Test)

1
Comments
10 min read
Rethinking Open Source Contribution in the Age of AI Agents, featuring vLLM Core Maintainer Roger Wang at MLSys'26

8
Comments 6
3 min read
Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?

Comments 1
5 min read
72B Parameters, Zero Quantization, One GPU: Benchmarking Qwen2-VL on AMD MI300X

Comments
13 min read
From one model to seven — what it took to make TurboQuant model-portable

Comments
3 min read
Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1

1
Comments
2 min read
Self-hosted Gemma 4 on TPU with vLLM, MCP, ADK, and Gemini CLI

26
Comments
16 min read
11-Second Time to First Token on a Healthy vLLM Server

1
Comments
5 min read
How to Run Gemma 4 Locally With Ollama, llama.cpp, and vLLM

2
Comments 1
9 min read