2x 3090s running Ollama and vLLM... Ollama for most stuff and vLLM for the few models I need to test that don't run on Ollama. Open WebUI as my primary interface. I just moved to Devstral for coding, using the Continue plugin in VSCode. I use Qwen 3 32b for creative stuff and Flux Dev for images. Gemma 3 27b for most everything else (slightly less smart than Qwen, but it's faster). Mixedbread for embeddings (though apparently NV-Embed-v2 is better?). Pydantic as my main utility library. This is all for personal stuff. My stack at work is completely different and driven more by our Legal teams than technical decisions.
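If it's useful, the Pydantic bit is mostly just validating JSON coming out of Ollama's local endpoint. A rough sketch (the schema and model tag here are placeholders, not anything official):

    # rough sketch of the Ollama + Pydantic glue for structured output
    import requests
    from pydantic import BaseModel

    class Caption(BaseModel):      # made-up schema, just for illustration
        title: str
        tags: list[str]

    def ask(prompt: str, model: str = "gemma3:27b") -> Caption:
        # Ollama's local HTTP API; stream=False returns a single JSON blob
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "format": "json", "stream": False},
            timeout=120,
        )
        r.raise_for_status()
        # validate the model's JSON against the schema instead of trusting it blindly
        return Caption.model_validate_json(r.json()["response"])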
fazlerocks 2 days ago [-]
Running Llama 3.1 70B on 2x4090s with vLLM. Memory is a pain but it works decently for most stuff.
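For anyone fighting the same memory wall: it's basically 4-bit AWQ weights plus tensor parallel across both cards, with a shorter context so the KV cache fits. Rough sketch, model id and numbers from memory so double-check them:

    # squeezing 70B into 2x24GB: quantized weights + tensor parallel
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # 4-bit AWQ build
        tensor_parallel_size=2,        # split the model across both 4090s
        gpu_memory_utilization=0.92,   # leave a little headroom or long prompts OOM
        max_model_len=8192,            # shorter context = smaller KV cache
    )

    outs = llm.generate(
        ["Summarize what tensor parallelism does in two sentences."],
        SamplingParams(max_tokens=256, temperature=0.2),
    )
    print(outs[0].outputs[0].text)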
Tbh for coding I just use the smaller ones like CodeQwen 7B. They're way faster and good enough for autocomplete. Only fire up the big model when I actually need it to think.
The annoying part is keeping everything updated; a new model drops every week and half of them don't work with whatever you're already running.
runjake 1 day ago [-]
Ollama + M3 Max 36GB Mac. Usually with Python + SQLite3.
The models vary depending on the task. DeepSeek distilled has been a favorite for the past several months.
I use various smaller (~3B) models for simpler tasks.
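The Python + SQLite3 part is nothing fancy, mostly just logging runs so I can compare models later. Roughly this (table layout and model tag are arbitrary):

    # log each prompt/response pair so different models can be compared later
    import sqlite3
    import requests

    db = sqlite3.connect("runs.db")
    db.execute("CREATE TABLE IF NOT EXISTS runs (model TEXT, prompt TEXT, answer TEXT)")

    def run(prompt: str, model: str = "deepseek-r1:14b"):  # a distilled DeepSeek tag
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=300,
        )
        answer = r.json()["response"]
        db.execute("INSERT INTO runs VALUES (?, ?, ?)", (model, prompt, answer))
        db.commit()
        return answer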
v5v3 15 hours ago [-]
Ollama on an M1 MacBook Pro, but I'll be moving to an Nvidia GPU setup.