What does Q4 vs Q8 mean for local LLMs?

Quantization compresses the model's weights from full precision (16-bit) down to fewer bits per weight. Q4 means roughly 4 bits per weight (smallest, fastest, slight quality loss). Q8 means 8 bits per weight (larger, slower, near-full quality). Q4_K_M is the modern sweet spot: about half the size of Q8 (roughly 75% smaller than 16-bit) with usually 5% or less quality loss. Use Q4 unless you're running benchmarks or have RAM to spare.

Best Local LLM by RAM Tier 2026: 8GB, 16GB, 24GB, 32GB, 64GB+ Picks

Q: What's the best local LLM for 16GB RAM?

Qwen 3 14B at Q4_K_M quantization is the best general pick for 16GB machines in 2026 — fits comfortably with room for context, runs at 15-25 tokens/sec on Mac M-series, and is genuinely good. Llama 3.3 8B at Q8 is the runner-up for tasks requiring more accuracy. Gemma 3 12B Q4 is the best multimodal option.

Q: Can I run a good LLM on 8GB RAM?

Yes, but with limits. Phi-4 mini at Q4 (about 2.5GB of weights) and Gemma 3 4B at Q4 are the realistic picks for 8GB systems. They're surprisingly capable for chat, summarization, and basic code, but they will not match cloud quality or 30B+ local models. If you have 8GB and want serious local AI, plan a hardware upgrade.

Q: What's the best LLM for 32GB RAM?

Qwen 3 32B is the standout at 32GB. At Q4_K_M it uses about 18-20GB and runs at 8-15 tokens/sec on Mac M-series, and the spare memory lets you step up to Q5_K_M for better quality. Llama 4 32B Q4 is the multimodal alternative. DeepSeek-Coder V2 33B Q4 is the coding specialist. All three are genuinely close to cloud-tier quality for most tasks.

Q: Do I need 64GB to run a 70B model?

Yes, if you want to run it comfortably. Llama 4 70B at Q4_K_M needs about 42GB, so 64GB lets you run it with room for context and other apps. 48GB works at Q3 with limited context. On Apple Silicon's unified memory, 64GB and 96GB are the realistic tiers for 70B-class models. On consumer GPUs you typically need a 48GB+ card or have to split the model across two GPUs.

The local-LLM ecosystem in 2026 is finally good enough that the question is no longer "can I run something useful?" — it's "what fits my machine?" The honest answer depends almost entirely on RAM (or VRAM if you're on a discrete GPU). This is the per-tier guide.

Short version:
8GB: Phi-4 mini or Gemma 3 4B (limited but usable).
16GB: Qwen 3 14B Q4 (the value sweet spot).
24GB: Qwen 3 32B Q4 (first taste of "good enough").
32GB: Qwen 3 32B Q5 or Llama 4 32B (cloud-comparable for most tasks).
64GB+: Llama 4 70B Q4 or Qwen 3 72B Q4 (genuinely replaces ChatGPT for many workflows).
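As a rough illustration, the short version can be written as a lookup table. This is only a sketch: the names TIER_PICKS and pick_for_ram are made up for this example, and the thresholds are simply the tier boundaries listed above.

```python
from typing import Optional

# Tier table transcribed from the short version above.
# Each entry: (minimum RAM in GB, suggested model and quantization).
# Ordered from the largest tier down so the first match is the best fit.
TIER_PICKS = [
    (64, "Llama 4 70B Q4 or Qwen 3 72B Q4"),
    (32, "Qwen 3 32B Q5 or Llama 4 32B"),
    (24, "Qwen 3 32B Q4"),
    (16, "Qwen 3 14B Q4"),
    (8, "Phi-4 mini or Gemma 3 4B"),
]


def pick_for_ram(ram_gb: float) -> Optional[str]:
    """Return the pick for the largest tier that fits, or None below 8GB."""
    for min_ram_gb, pick in TIER_PICKS:
        if ram_gb >= min_ram_gb:
            return pick
    return None


print(pick_for_ram(16))  # Qwen 3 14B Q4
print(pick_for_ram(48))  # Qwen 3 32B Q5 or Llama 4 32B
```

A machine that sits between tiers (48GB here) falls back to the tier below it, which matches how the rest of this guide treats in-between configurations.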

Quantization in one paragraph

Models are stored as numerical weights. Full precision is 16 bits per weight. Quantization compresses to fewer bits: Q8 (8 bits, larger, near-full quality), Q5 (smaller, slight loss), Q4_K_M (the modern sweet spot, about half the size of Q8 and ~75% smaller than 16-bit, with usually under 5% quality loss). Use Q4_K_M as the default unless you have RAM to spare or you're doing benchmark-sensitive work. Q3 and Q2 exist but quality drops noticeably.
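The size savings follow directly from the bit widths. The sketch below assumes nominal widths (exactly 16, 8, and 4 bits per weight); real formats such as Q4_K_M store a little extra for scaling metadata, so actual files come out somewhat larger. The function name size_reduction is made up for this example.

```python
def size_reduction(from_bits: float, to_bits: float) -> float:
    """Fraction of weight storage saved when moving from from_bits to to_bits per weight."""
    return 1.0 - to_bits / from_bits


print(size_reduction(16, 8))  # 0.5  -> 8-bit is half the size of 16-bit
print(size_reduction(16, 4))  # 0.75 -> 4-bit is 75% smaller than 16-bit
print(size_reduction(8, 4))   # 0.5  -> 4-bit is half the size of 8-bit
```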

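Rule of thumb for memory needed: weights in GB are roughly the parameter count in billions divided by 2 for Q4, and roughly equal to it for Q8. So Llama 4 70B at a nominal 4 bits needs ~35GB just for weights (Q4_K_M stores slightly more than 4 bits per weight, which is why the figure quoted below is about 42GB), plus 5–10GB for context, runtime, and OS overhead.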
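The rule of thumb is just parameters times bits divided by eight. A minimal sketch, assuming the 5–10GB overhead range quoted above. The function names are made up for this example, and the 4.8 bits per weight used for Q4_K_M is an assumed average, not a figure from this guide.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of parameters x bits / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


def total_gb_range(params_billions: float, bits_per_weight: float,
                   overhead_gb=(5.0, 10.0)):
    """Weights plus the low and high ends of the context/runtime/OS overhead."""
    w = weights_gb(params_billions, bits_per_weight)
    return (w + overhead_gb[0], w + overhead_gb[1])


print(weights_gb(70, 4))      # 35.0 (nominal Q4)
print(weights_gb(70, 8))      # 70.0 (Q8)
print(total_gb_range(70, 4))  # (40.0, 45.0)

# Assumed average of ~4.8 bits per weight for Q4_K_M: roughly 42GB of weights.
print(round(weights_gb(70, 4.8)))  # 42
```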

8GB tier — entry level (M1/M2 base, low-end PCs)

Pick: Phi-4 mini Q4 or Gemma 3 4B Q4

You can absolutely run an LLM on 8GB. You just won't run a great one. Phi-4 mini (Microsoft's distilled model, around 3.8B parameters) at Q4 takes about 2.5GB and is surprisingly capable for chat, summarization, and basic coding questions. Gemma 3 4B Q4 is similar with the bonus of multimodal (image input) support.

Expected speed: 25–40 tokens/sec on Mac M-series base chips. Faster than reading speed.

Honest weakness: these models are below cloud quality. They'll hallucinate facts more often. They struggle with multi-step reasoning. Use them for offline access, basic chat, and learning — not for serious work.

Don't bother with 7B models at Q4 on 8GB systems — the system overhead leaves no room for real context windows, and performance gets ugly.

16GB tier — the value sweet spot (M-series mid-range, RTX 4060 Ti 16GB)

Pick: Qwen 3 14B Q4_K_M

This is the tier where local LLMs become genuinely useful for everyday work. Qwen 3 14B at Q4 takes about 8GB, leaves room for a meaningful context window (32K-64K), and runs at 15–25 tokens/sec on Mac M-series. The model itself is excellent at general chat, code, multilingual tasks, and instruction-following. It's the tier that proves "local is a real option."

Strong runner-up: Llama 3.3 8B at Q8. Smaller model, fuller precision, very accurate for what it is. Better than Qwen 3 14B Q4 on tasks where accuracy matters more than capability ceiling.

Multimodal pick: Gemma 3 12B Q4. The best image-understanding option at this tier. Take a photo, ask it questions.

Coding specialist: Qwen 3 14B Coder Q4. The same Qwen base, fine-tuned for code. Better at code completion and explanation than the general model.

What you can't run well at 16GB: 30B+ models (70B included), or anything above 8B at Q8 (near-full precision).

24GB tier — first taste of "good enough" (M-series Pro, RTX 4090 24GB)

Pick: Qwen 3 32B Q4_K_M

24GB is where local LLMs start to feel competitive with cloud chat for most tasks. Qwen 3 32B at Q4 takes about 18GB, leaves room for context, and runs at 8–15 tokens/sec on Mac M-series. The quality jump from 14B to 32B is substantial — better reasoning, fewer hallucinations, longer-form coherence.

Mistral Small 3 (24B) Q4 is a strong alternative — about 14GB, faster, slightly less capable but very efficient.

Coding specialist: DeepSeek-Coder V2 33B Q4. Specifically designed for coding tasks; arguably the best open-weight code model in 2026 for this RAM tier.

Multimodal: Qwen 2.5-VL 32B Q4. Strong vision-language model, useful for image-heavy work.

What you still can't do well at 24GB: 70B models (even at Q3 the weights alone exceed 24GB), full-precision 30B models.

32GB tier — cloud-comparable for most tasks (M-series Pro/Max, dual GPUs)

Pick: Qwen 3 32B Q5_K_M or Llama 4 32B Q4

32GB is the tier where you can run 30B-class models at higher quantization (Q5 or Q6) for noticeably better quality, or run 32B models with very large context windows (128K+). For most everyday work (writing, coding, analysis, research), this tier is competitive with ChatGPT or Claude. You give up frontier reasoning capability and bleeding-edge multimodal, but the gap is small.
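Context is not free: the runtime keeps a key/value cache that grows with every token in the window. The sketch below uses the standard size formula for that cache. The layer count, KV-head count, and head dimension are hypothetical values for a 32B-class model, not taken from any model card, and the cache is assumed to be stored at 16 bits per value. Results are in GiB (2^30 bytes), which is close to the GB figures used elsewhere in this guide.

```python
def kv_cache_gib(n_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a single sequence.

    Each token stores one key vector and one value vector per layer,
    i.e. 2 * n_layers * n_kv_heads * head_dim values.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token_bytes / 2**30


# Hypothetical 32B-class configuration (assumed for illustration only).
cfg = dict(n_layers=64, n_kv_heads=8, head_dim=128)

print(kv_cache_gib(8 * 1024, **cfg))   # 2.0
print(kv_cache_gib(32 * 1024, **cfg))  # 8.0
```

Because the cache grows linearly with context length, a 128K window under these assumed settings would cost 16 times the 8K figure. Very long contexts may therefore need a quantized cache or more memory than the weights alone suggest.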

Llama 4 32B Q4 is the strongest multimodal option at this tier. Qwen 3 32B Q5 is the strongest text model.

Coding: DeepSeek-Coder V2 33B Q5. The Q5 version at 32GB is a meaningful step up over Q4.

What's possible but tight: Llama 4 70B at Q3. Runs but with limited context window and slower output. Workable if you only run the LLM and nothing else.
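The choice between Q4, Q5, and Q6 at a given tier comes down to which is the highest precision whose weights still fit once some memory is reserved for context and the OS. A minimal sketch, assuming nominal bit widths (real K-quant formats average slightly more) and a made-up default reserve of 8GB. QUANT_BITS and best_quant are names invented for this example.

```python
from typing import Optional

# Nominal bits per weight for each level (assumed round numbers).
QUANT_BITS = {"Q8": 8, "Q6": 6, "Q5": 5, "Q4": 4, "Q3": 3}


def best_quant(params_billions: float, ram_gb: float,
               reserve_gb: float = 8.0) -> Optional[str]:
    """Return the highest-precision level whose weights fit in ram_gb - reserve_gb."""
    budget_gb = ram_gb - reserve_gb
    # Try the highest precision first and fall back to smaller formats.
    for name, bits in sorted(QUANT_BITS.items(), key=lambda kv: -kv[1]):
        if params_billions * bits / 8 <= budget_gb:
            return name
    return None


print(best_quant(32, 32))                # Q6
print(best_quant(32, 24))                # Q4
print(best_quant(70, 32))                # None
print(best_quant(70, 32, reserve_gb=5))  # Q3
```

The last two lines show why a 70B model at this tier is tight: it fits only at Q3, and only if almost nothing else is using memory.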

64GB tier — the "drop your subscription" zone (M-series Max, M Ultra, 64GB+ systems)

Pick: Llama 4 70B Q4_K_M or Qwen 3 72B Q4_K_M

This is the tier where local LLMs are genuinely cloud-replacing for most users. Llama 4 70B at Q4 takes about 42GB. Add 8–12GB for context and OS overhead and you have a model that handles writing, coding, complex reasoning, and multimodal tasks at a level that actually competes with GPT-5.4 and Claude Sonnet 4.6 for many use cases.

Expected speed: 6–12 tokens/sec on Mac M-series Max/Ultra. Slower than cloud chat but acceptable for sustained work.
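To put those speeds in wall-clock terms, divide the response length by the generation rate. The 600-token response length below is an arbitrary example (a few paragraphs of text), and the function name is made up for this sketch. Prompt processing time is ignored.

```python
def seconds_to_generate(n_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate n_tokens at a steady rate, ignoring prompt processing."""
    return n_tokens / tokens_per_sec


print(seconds_to_generate(600, 6))   # 100.0
print(seconds_to_generate(600, 12))  # 50.0
```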

Qwen 3 72B Q4 is the alternative — excellent multilingual performance, slightly better at code than Llama 4 70B.

Specialist picks at this tier: DeepSeek V3 distilled at Q4 for the absolute best open-weight reasoning quality. Mixtral 8x22B Q4 for fast inference (sparse mixture of experts).

What's the ceiling? At 64GB you can't comfortably run the full DeepSeek V3 (671B sparse MoE, needs 192GB+) or Llama 4 405B (needs 200GB+). Those are 192GB-and-up territory.

96GB+ tier — workstation territory (M3 Ultra 96GB+, multi-GPU rigs)

Pick: DeepSeek V3 Q4 or Llama 4 70B Q8

At 96GB you start running flagship-class models at higher quantization. Llama 4 70B at Q8 (8 bits per weight, near-full precision) is genuinely indistinguishable from cloud quality on most tasks. DeepSeek V3 at Q4 (quantized but the full model) is the strongest open-weight reasoner available.

The 192GB+ tier (rare: an M2 Ultra with 192GB or a multi-GPU server) is where Llama 4 405B and the full DeepSeek V3 become possible. Niche but real.

Mac vs GPU — quick note

On Apple Silicon, the unified memory acts as both RAM and "VRAM" for the model. A 64GB M3 Max can run a 70B model that would require a $5,000+ GPU rig on a PC. This is why Macs are disproportionately popular for local LLMs in 2026 — the value math is excellent.

On NVIDIA GPUs, you're limited to VRAM (24GB on RTX 4090, 32GB on RTX 5090). For 70B models on PC, you typically need an RTX 6000 Ada (48GB) or split across two consumer GPUs.
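The GPU arithmetic can be sketched as a simple ceiling division. This assumes the model splits evenly across identical cards and that each card loses a fixed amount of VRAM to runtime buffers. The 2GB per-card figure is a placeholder, not a measured value, and gpus_needed is a name made up for this example.

```python
import math


def gpus_needed(model_gb: float, vram_per_gpu_gb: float,
                per_gpu_overhead_gb: float = 2.0) -> int:
    """Number of identical GPUs needed to hold model_gb of weights."""
    usable_gb = vram_per_gpu_gb - per_gpu_overhead_gb
    if usable_gb <= 0:
        raise ValueError("overhead exceeds available VRAM")
    return math.ceil(model_gb / usable_gb)


# A ~42GB model (the 70B Q4_K_M figure used in this guide):
print(gpus_needed(42, 24))  # 2 -> two 24GB cards
print(gpus_needed(42, 32))  # 2 -> two 32GB cards
print(gpus_needed(42, 48))  # 1 -> one 48GB card
```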

For most buyers in 2026: an M-series Mac with 32GB+ unified memory is the cheapest way to run serious local LLMs.

The verdict

Match your RAM tier to the right model — there's no universal best LLM, just the best for your hardware. Quantize to Q4_K_M unless you have RAM to spare. Pick Qwen 3 or Llama 4 at the tier that fits.

If you're shopping for a machine to run local LLMs in 2026, target 32GB minimum for "useful daily driver" and 64GB+ for "actually replaces my cloud subscription." On Mac, those are the M-series Pro 32GB or M-series Max 64GB tiers respectively.

FAQ

What's the best local LLM for 16GB RAM?

Qwen 3 14B Q4_K_M is the best general pick — runs at 15-25 tokens/sec on Mac M-series and is genuinely good. Llama 3.3 8B Q8 for accuracy, Gemma 3 12B Q4 for multimodal.

Can I run a good LLM on 8GB RAM?

Yes, but with limits. Phi-4 mini Q4 and Gemma 3 4B Q4 are the realistic picks: capable for chat and basic code, but they won't match cloud or 30B+ models.

What's the best LLM for 32GB RAM?

Qwen 3 32B at Q5 is the standout. Llama 4 32B Q4 is the multimodal alternative. DeepSeek-Coder V2 33B Q4 is the coding specialist.

Do I need 64GB to run a 70B model?

Yes, if you want to run it comfortably. Llama 4 70B Q4 needs about 42GB; 64GB lets you run it with room for context. On Mac unified memory the math is friendlier than on PC.

What does Q4 vs Q8 mean?

Quantization compresses model weights. Q4 means ~4 bits per weight (smallest, fastest, slight quality loss). Q8 is ~8 bits (larger, near-full quality). Q4_K_M is the modern default: about 75% smaller than 16-bit (half the size of Q8) with usually under 5% quality loss.

Need a frontend to run these models?

Read Local LLMs 2026: Ollama vs LM Studio vs Jan — pick the right runtime for your machine.