Ask HN: What Does Your Self-Hosted LLM Stack Look Like in 2025?

17 points by anditherobot 2 days ago | 6 comments

bluejay2387 1 days ago [-]

2x 3090's running Ollama and VLLM... Ollama for most stuff and VLLM for the few models that I need to test that don't run on Ollama. Open Web UI as my primary interface. I just moved to Devstral for coding using the Continue plugin in VSCode. I use Qwen 3 32b for creative stuff and Flux Dev for images. Gemma 3 27b for most everything else (slightly less smart than Qwen, but its faster). Mixed Bread for embeddings (though apparently NV-Embed-v2 is better?). Pydantic as my main utility library. This is all for personal stuff. My stack at work is completely different and driven more by our Legal teams than technical decisions.

fazlerocks 2 days ago [-]

Running Llama 3.1 70B on 2x4090s with vLLM. Memory is a pain but works decent for most stuff.

Tbh for coding I just use the smaller ones like CodeQwen 7B. way faster and good enough for autocomplete. Only fire up the big model when I actually need it to think.

The annoying part is keeping everything updated, new model drops every week and half don't work with whatever you're already running.

runjake 1 days ago [-]

Ollama + M3 Max 36GB Mac. Usually with Python + SQLite3.

The models vary depending on the task. DeepSeek distilled has been a favorite for the past several months.

I use various smaller (~3B) models for simpler tasks.

v5v3 15 hours ago [-]

Ollama on a M1 MacBook pro but will be moving to a Nvidia GPU setup.

gabriel_dev 1 days ago [-]

Ollama + mac mini 24gb (inference)

xyc 1 days ago [-]

recurse.chat + M2 max Mac

2 days ago [-]