#vllm
I built an open-source alternative to Microsoft's KAITO that works on ANY Kubernetes cluster
GaeaRuiW
Jun 9
#kubernetes #vllm #devops #opensource
2 min read
Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%
Tech_Nuggets
Jun 7
#llm #ai #infrastructure #vllm
9 min read
KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break
Tech_Nuggets
Jun 6
#llm #ai #vllm #performance
1 reaction
8 min read
Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run
xbill
for Google Developer Experts
May 30
#googleantigravity #vllm #googlecloudrun #gemma4
4 reactions
14 min read
vLLM's V1 Release Fixes the Silent Killer in RL Training
Aamer Mihaysi
May 8
#vllm #machinelearning #python
2 min read
The 70B Threshold: How the RTX 5090 Rewrites the Home Lab Equation
Matthew Gladding
Apr 24
#model #memory #models #vllm
8 min read
How RunPod FlashBoot Actually Works (4-Request Test)
Sergey Shmakov
May 26
#runpod #flashboot #serverless #vllm
1 reaction
10 min read
Rethinking Open Source Contribution in the Age of AI Agents, featuring vLLM Core Maintainer Roger Wang at MLSys'26
Grace
May 21
#vllm #ai #machinelearning #llm
8 reactions
6 comments
3 min read
Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?
Thurmon Demich
May 20
#ollama #llamacpp #vllm #comparison
1 comment
5 min read
72B Parameters, Zero Quantization, One GPU: Benchmarking Qwen2-VL on AMD MI300X
Manikandan T
May 13
#vllm #rocm #mi300x #genai
13 min read
From one model to seven — what it took to make TurboQuant model-portable
Alberto Nieto
Apr 1
#python #vllm #gpu #triton
3 min read
Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1
Alberto Nieto
Mar 28
#python #vllm #gpu #containers
1 reaction
2 min read
Self-hosted Gemma 4 on TPU with vLLM, MCP, ADK, and Gemini CLI
xbill
for Google Developer Experts
Apr 28
#vllm #googleadk #tpu #gemini
26 reactions
16 min read
11-Second Time to First Token on a Healthy vLLM Server
Ingero Team
Apr 21
#vllm #observability #ebpf #mcp
1 reaction
5 min read
How to Run Gemma 4 Locally With Ollama, llama.cpp, and vLLM
Maksim Danilchenko
Apr 11
#gemma4 #ollama #llamacpp #vllm
2 reactions
1 comment
9 min read