Choosing the right AI model
for your hardware and workflow
Not all models are created equal — and not all hardware can run them. This guide explains the most relevant open-source models available today, their benchmark performance, hardware requirements, and the specific use cases where they shine. Whether you have a basic laptop or a workstation with a 24 GB GPU, there is a model for you.
All local models below can be pulled via ollama pull model-name. Cloud models require a provider API key configured in MIKA5 settings.
Hardware Tiers
Running AI locally means the model weights must fit in memory — either RAM (for CPU inference) or VRAM (for GPU inference). GPU inference is always significantly faster, but CPU works for smaller models. As a rule of thumb, a 7B model needs approximately 4–5 GB in 4-bit quantization (Q4), 5–6 GB in Q5, and 7–8 GB in Q8. To estimate requirements for other sizes, multiply the parameter count in billions by roughly 0.6 GB for Q4, 0.75 GB for Q5, or 1 GB for Q8, then add about 1 GB of runtime overhead.
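That rule of thumb is easy to turn into a quick estimator. A minimal sketch in Python — the per-parameter sizes and the ~1 GB runtime overhead are rough assumptions, not official Ollama figures:

```python
def estimate_memory_gb(params_billions: float, quant: str = "q4") -> float:
    """Rough memory footprint of a quantized model, per the rule of thumb above."""
    gb_per_billion = {"q4": 0.6, "q5": 0.75, "q8": 1.0}  # approximate GB per billion parameters
    overhead_gb = 1.0  # KV cache and runtime buffers (assumed)
    return params_billions * gb_per_billion[quant] + overhead_gb

print(round(estimate_memory_gb(7, "q4"), 1))  # ~5 GB, matching the 4-5 GB figure above
```

If the estimate exceeds your VRAM but fits in system RAM, the model still runs via CPU inference or hybrid offloading, just more slowly.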
8 GB RAM — CPU Only or Integrated GPU
Entry-level hardware includes most budget laptops, older workstations, and PCs without a dedicated GPU. CPU inference is possible but slow — expect 3–8 tokens/second on modern CPUs. Focus on models under 4 GB in size (1B–3B parameters in Q4 quantization). These models are surprisingly capable for everyday tasks like summarizing text, answering questions, writing emails, and simple coding assistance.
Meta's Llama 3.2 3B is one of the best small models available. Despite its compact size, it follows instructions reliably, writes well in English and Spanish, and handles summarization and Q&A with quality that outperforms models twice its size from previous generations. Ideal as the default starter model.
Alibaba's Qwen 2.5 3B punches above its weight in reasoning and multilingual tasks. Trained on a massive multilingual dataset, it handles Spanish, Chinese, English, and dozens of other languages with comparable fluency. For users who frequently work in multiple languages, this is the go-to entry-level model.
Microsoft's Phi-4 series demonstrates that careful data curation matters as much as scale. The 3B variant scores remarkably well on reasoning and math benchmarks relative to its size. Excellent for STEM tasks, problem solving, and situations where logical accuracy is critical. Struggles more than Llama/Qwen with creative writing.
Google's Gemma 2 2B is the lightest capable model for very constrained hardware. At under 1.5 GB in Q4, it runs on nearly any modern PC. Ideal for simple chat, quick lookups, and draft writing when hardware severely limits options. Quality drops noticeably on complex tasks.
16 GB RAM — GPU 6–8 GB VRAM
The mid-range sweet spot. A PC with 16 GB of system RAM and a GPU with 6–8 GB of VRAM (RTX 3060, RX 6700 XT, etc.) can run 7B models entirely on GPU, delivering 20–60 tokens/second — fast enough for a natural conversation flow. This is the most common configuration among power users and the recommended starting point for serious AI work. 7B models in Q4 quantization require approximately 4–5 GB of VRAM.
Qwen 2.5 7B is arguably the best 7B model available as of 2025–2026. It scores near the top of MMLU (general knowledge), MT-Bench (instruction following), and GSM8K (math reasoning) benchmarks at this scale. Excellent multilingual support, a long 32k context window, and competitive coding ability. The go-to recommendation for most MIKA5 users with a mid-range PC.
Meta's Llama 3.1 8B remains a strong baseline for English-primary tasks. Excellent instruction following, natural conversation, and creative writing. The massive fine-tuning ecosystem around Llama means there are hundreds of specialized variants — from roleplay to legal analysis. Strong RLHF means it's helpful and safe out of the box. Best for English-speaking users who want broad compatibility.
Mistral AI's 7B model introduced sliding window attention and grouped query attention, making it faster than naive 7B implementations. It's one of the fastest 7B models in tokens-per-second and excels at structured output generation. Perfect for users who need rapid responses over maximum quality, API-style applications within MIKA5, or data extraction tasks where speed and format adherence matter.
DeepSeek-R1 is a reasoning model trained with reinforcement learning to "think out loud" before giving an answer — similar in concept to OpenAI's o1 series. The 7B distilled version brings this chain-of-thought reasoning capability to mid-range hardware. Responses are slower because the model reasons step by step, but the quality on math, logic, and complex analysis tasks is dramatically better than standard 7B models. Use when accuracy matters more than speed.
Qwen2.5-VL 7B is the best local vision-language model at this scale. It can understand images, charts, diagrams, screenshots, and handwritten notes, and it outperforms LLaVA-13B and many larger vision models on benchmarks like MMBench and TextVQA. Use it when you need to describe images, extract text from screenshots, analyze charts in your documents, or work with visual content in MIKA5.
Google's Gemma 3 9B is a multimodal model that can process both text and images. It excels at following detailed instructions and is strong in safety and harmlessness. The 9B model runs on 6 GB VRAM in Q4 and offers a good balance between vision capability and speed. Also available in 27B for users with more VRAM.
32 GB RAM — GPU 12–16 GB VRAM
High-end configurations unlock the 13B–14B model tier on GPU, and can run 30B–34B models in CPU+GPU hybrid mode. At this level, quality becomes noticeably stronger — complex multi-step reasoning, nuanced creative writing, and professional-grade coding all become possible. With 12–16 GB of VRAM (RTX 4070 Ti, RTX 4080, RX 7800 XT), even 14B models run at excellent speeds. Users with 32 GB system RAM can also consider 30B models via CPU offloading.
Qwen 2.5 14B delivers performance comparable to much larger models from previous generations. It outperforms Llama 3 70B on several benchmarks, including coding (HumanEval) and mathematics (MATH). The jump from 7B to 14B is significant: more coherent long documents, better code, and more nuanced reasoning. If you have 12 GB of VRAM, this is the most impactful upgrade.
Fine-tuned specifically on code, Qwen2.5-Coder 14B achieves scores that match or exceed GPT-4-level models on HumanEval (Python coding benchmark) at this model size. It understands and generates code in 40+ programming languages, can explain complex algorithms, debug errors, write unit tests, and perform code review. The best local option for serious software developers using MIKA5 as a coding companion.
The 14B distillation of DeepSeek-R1 brings near o1-level reasoning performance to 12 GB of VRAM. It excels at step-by-step mathematical proofs, logical puzzles, complex analysis, and scientific problem solving. The model explicitly shows its "thinking" process before answering, which is useful for verifying its reasoning chain. An excellent choice when you need rigorous, verifiable outputs rather than fast conversational responses.
Microsoft's Phi-4 14B demonstrates that careful training data selection can produce a model that outperforms much larger models on STEM benchmarks. It scores in the top tier on MATH, AMC, and GPQA (graduate-level science questions) at the 14B scale. Not the best for general conversation or creative writing, but exceptional for scientific analysis, mathematics, and code comprehension. A great complement to a more conversational model.
64 GB RAM — GPU 24 GB+ VRAM (or multi-GPU)
Workstation-class hardware — typically an RTX 3090/4090 (24 GB VRAM), professional GPUs like the A100/H100, or dual GPUs — opens up the full 70B model class on GPU. At this tier, local performance can rival or exceed cloud APIs in quality. A single RTX 4090 can run Llama 3.3 70B (Q4, ~42 GB) with CPU offloading, or a smaller 34B model entirely on GPU. This is also the tier where high-context models with 256k+ token windows become practical.
| Model | Params | VRAM (Q4) | Strength | Command |
|---|---|---|---|---|
| llama3.3:70b | 70B | ~42 GB | Best overall Llama model · Strong on all tasks | ollama pull llama3.3:70b |
| qwen2.5:72b | 72B | ~44 GB | Top MMLU · Best multilingual at this tier | ollama pull qwen2.5:72b |
| deepseek-r1:70b | 70B | ~43 GB | Best local reasoning · Near o1 quality | ollama pull deepseek-r1:70b |
| qwen2.5-coder:32b | 32B | ~20 GB | Best local code model · Near GPT-4 on HumanEval | ollama pull qwen2.5-coder:32b |
| qwen2.5-vl:72b | 72B | ~44 GB | Best local vision · Exceeds GPT-4V on TextVQA | ollama pull qwen2.5-vl:72b |
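Several of the 70B-class models above exceed a single 24 GB card, in which case Ollama automatically splits transformer layers between GPU and CPU. A minimal sketch of how that split can be estimated — the 80-layer count and the 1.5 GB VRAM reserve are illustrative assumptions:

```python
def gpu_layer_split(model_gb: float, vram_gb: float, total_layers: int) -> int:
    """Estimate how many transformer layers fit on the GPU when a model exceeds VRAM."""
    usable_vram = vram_gb - 1.5  # reserve VRAM for KV cache and buffers (assumed)
    gb_per_layer = model_gb / total_layers
    return max(0, min(total_layers, int(usable_vram / gb_per_layer)))

# Llama 3.3 70B Q4 (~42 GB, ~80 layers) on a 24 GB RTX 4090:
print(gpu_layer_split(42, 24, 80))  # roughly half the layers run on the GPU
```

The remaining layers run on the CPU, which is why 70B throughput on a single consumer GPU is usable but well below the all-GPU speeds of smaller models.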
By Use Case
Reasoning & Analysis
Reasoning models use extended "thinking" passes — sometimes called chain-of-thought or test-time compute — to solve complex problems more accurately. They are slower than standard models but dramatically more reliable for tasks involving multiple logical steps, mathematical proofs, debugging complex code, or analyzing nuanced arguments. For document analysis in MIKA5, combining a reasoning model with the RAG engine produces exceptionally grounded, verifiable answers.
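In practice, R1-style models wrap their reasoning in `<think>…</think>` tags before the final answer, and hiding or logging that section is a common integration step. A minimal sketch — the tag format matches how the DeepSeek-R1 distillations behave in Ollama, but verify against your model's actual output:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate a reasoning model's chain of thought from its final answer."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if match:
        thinking = match.group(1).strip()
        answer = response[match.end():].strip()
        return thinking, answer
    return "", response.strip()  # no thinking block: treat the whole response as the answer

thinking, answer = split_reasoning("<think>2 + 2 is basic arithmetic.</think>The answer is 4.")
print(answer)  # The answer is 4.
```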
DeepSeek-R1 is available in 1.5B, 7B, 14B, 32B, and 70B distillations. The 7B is the best entry point for mid-range hardware. Training via reinforcement learning on reasoning tasks means it generalizes to novel problems that weren't in the training data. At 70B scale it scores above 90% on MATH and AIME competition math problems — approaching human expert level.
Alibaba's Qwen series has consistently topped leaderboards for general reasoning at various scales. Qwen3.5 (if available via Ollama in your region) extends this further. Qwen2.5 72B matches or exceeds Llama 3.3 70B on most reasoning benchmarks while also being the best multilingual option at this size.
Coding
Code models are fine-tuned on massive code repositories (GitHub, Stack Overflow, documentation). They understand programming concepts deeply, can write boilerplate, explain algorithms, find bugs, write tests, and perform code review. For MIKA5, code models are most powerful when combined with a Knowledge Base containing your project's documentation, architecture docs, or API references — the RAG engine will surface relevant context automatically.
Qwen2.5-Coder provides the best local code models at each tier. The 7B fits on 6 GB VRAM, the 14B on 12 GB, and the 32B on 20 GB. All variants outperform CodeLlama 34B on HumanEval. Strong in Python, JavaScript, TypeScript, Go, Rust, Java, C++, SQL, Bash, and more.
NVIDIA's Nemotron Super is built on Llama with extensive post-training optimizations from NVIDIA. It excels at instruction following and code generation. Available through the NVIDIA API or :cloud tag on Ollama if offered in your region. Strong at enterprise-scale code tasks and CUDA/GPU-related programming.
DeepSeek-Coder V2 Lite is DeepSeek's dedicated code model, built on a Mixture-of-Experts architecture. Only 2.4B of its 16B total parameters are active per token, making it faster than dense 16B models. Exceptional at algorithmic problems and competitive programming.
StarCoder 2, from the BigCode project, was trained on The Stack v2 (an ethically sourced, opt-out software corpus). It has strong fill-in-the-middle capability, meaning it can complete code in the middle of a function — excellent for code-completion use cases.
Vision & OCR
Vision-language models accept images alongside text prompts. In MIKA5, you can paste images directly into the chat (Ctrl+V or the attachment button) and the app automatically detects which models support vision. Vision models can describe images, read text from screenshots and photos (OCR), analyze charts and diagrams, understand UI layouts, and compare visual information to your knowledge base documents.
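Under the hood, clients attach images to Ollama's `/api/chat` endpoint as base64 strings in an `images` array. A minimal sketch of building such a request body by hand — the endpoint and field names follow Ollama's REST API, while the model name and image bytes here are placeholders:

```python
import base64
import json

def vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build an Ollama /api/chat request body with one attached image."""
    message = {
        "role": "user",
        "content": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],  # Ollama expects base64
    }
    return json.dumps({"model": model, "messages": [message], "stream": False})

body = vision_payload("qwen2.5-vl:7b", "Describe this chart.", b"<raw PNG bytes>")
print(json.loads(body)["model"])  # qwen2.5-vl:7b
```

POST this body to `http://localhost:11434/api/chat` and the model's description comes back in the response's `message.content` field.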
Qwen2.5-VL 7B tops most vision benchmarks at this scale. It can process multiple images per request, handle very high-resolution inputs, and understand complex charts and documents. The recommended vision model for mid-range hardware.
Zhipu AI's GLM-OCR is purpose-built for document OCR and text extraction from images. It outperforms general vision models on extracting structured text from scanned documents, PDFs, tables, and forms. Lightweight enough to run on 6 GB VRAM. Best for document digitization workflows.
Gemma 3's multimodal version maintains strong text quality alongside vision. Good balance for users who want a single model for both image analysis and general conversation. Also available in 27B for those with more VRAM.
LLaVA was the original open-source vision model. The 13B version is no longer state-of-the-art but remains widely available and well tested. Reliable for basic image description and simple visual Q&A, and it runs on 10 GB VRAM. Best as a fallback if newer models aren't available.
Multilingual
Most open-source models are primarily English-trained, with Spanish, French, German, and Chinese as secondary languages. For users who primarily work in Spanish or other non-English languages, model selection matters significantly — the same 7B model can produce fluent professional Spanish or halting translated-sounding text depending on its training data composition.
Alibaba's Qwen series was trained on a dataset with strong multilingual representation including Spanish, Chinese, Japanese, French, German, Korean, and Arabic. It consistently outperforms Llama on non-English benchmarks. If you work primarily in Spanish, Qwen2.5 at any size is the recommended choice.
Zhipu AI's GLM-5 is one of China's leading frontier models and has exceptional Chinese/English bilingual capability with strong Spanish performance. Available as a cloud model through Ollama's :cloud tag. Particularly strong for translation, cross-lingual tasks, and formal writing in East Asian languages.
Moonshot AI's Kimi K2 model features a 256k token context window — by far the longest available context for cloud models in MIKA5. Ideal for analyzing entire books, large codebases, or lengthy legal/technical documents in one conversation. Configure via the Moonshot API provider in MIKA5 settings.
Mistral AI is a French company, and their models have above-average French and European language performance. If you frequently work in French, Italian, Portuguese, or Spanish, Mistral 7B offers better results than Llama equivalents for European language tasks.
High-End & Cloud Models
The following models represent the current state of the art. Some require significant local hardware; others are available as cloud models through MIKA5's optional provider integrations (OpenAI, Anthropic, Groq, Moonshot). Cloud models are processed on the provider's servers — your prompts leave your machine, but you gain access to models that would require hundreds of GB of local memory to run.
Alibaba's Qwen3 series introduces hybrid reasoning modes — you can switch between fast "instruct" mode and slower "thinking" mode within the same model. The 72B variant delivers performance matching the best cloud models of 2024 while running entirely locally on workstation hardware. Its multilingual capabilities remain industry-leading for open-source models. Requires ~44 GB VRAM for 72B in Q4, or can be run with CPU offloading on systems with 64+ GB RAM.
NVIDIA's Nemotron Super is built on the Llama 3.1 70B architecture but refined with NVIDIA's post-training pipeline. It achieves top-tier scores on code generation, instruction following, and function calling. Available as a cloud model through select API providers, or as a :cloud tag on Ollama if your region supports it. Particularly strong for CUDA programming, HPC workloads, and GPU-related development tasks — which is a natural fit for AI practitioners.
GLM-5 is Zhipu AI's frontier model, a successor to the GLM-4 series that achieved competitive performance with GPT-4. Strong at Chinese/English bilingual tasks, long document analysis, and complex reasoning. The GLM-OCR variant (local, via Ollama) is purpose-built for document extraction. GLM-5 via cloud is configured in MIKA5 by adding a Zhipu AI API key in settings and selecting glm-5:cloud as the model.
Kimi K2's defining feature is its 256,000 token context window — the equivalent of a full-length novel or an entire codebase. It can process and reason over extremely long documents without losing coherence. In MIKA5, configure the Moonshot API provider and select moonshot-v1-128k or kimi-k2 as the model. Best for: legal document review, academic paper analysis, processing entire project codebases, or any task where full context retention is critical. Note: long-context processing is billed by token volume on the cloud API.
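A quick way to judge whether a document fits in a 256k window is the common heuristic of roughly four characters per token for English text. A minimal sketch — the ratio is an approximation that varies by language and tokenizer, so leave headroom for the model's answer:

```python
def fits_in_context(text: str, context_tokens: int = 256_000, chars_per_token: float = 4.0) -> bool:
    """Rough check of whether a document fits a model's context window."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_tokens

# A 300-page book at roughly 2,000 characters per page:
print(fits_in_context("x" * (300 * 2000)))  # True: ~150k tokens fits in a 256k window
```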
Benchmark Comparison
Scores below reflect published benchmark results from model authors and independent evaluations. MMLU measures broad academic knowledge (57 subjects). MT-Bench measures instruction following quality (rated 1–10). HumanEval measures Python code generation correctness. GSM8K measures grade-school math reasoning. MATH measures competition mathematics.
| Model | Params | MMLU | MT-Bench | HumanEval | GSM8K | VRAM (Q4) |
|---|---|---|---|---|---|---|
| gemma2:2b | 2B | 52.4% | 6.9 | 37.8% | 41.2% | 1.5 GB |
| llama3.2:3b | 3B | 58.2% | 7.2 | 45.3% | 67.4% | 2 GB |
| qwen2.5:3b | 3B | 65.6% | 7.4 | 53.2% | 72.8% | 2 GB |
| mistral:7b | 7B | 64.2% | 7.8 | 60.6% | 74.2% | 4.1 GB |
| llama3.1:8b | 8B | 66.4% | 8.0 | 72.6% | 84.2% | 5 GB |
| gemma3:9b | 9B | 70.4% | 8.1 | 65.9% | 82.6% | 5.8 GB |
| qwen2.5:7b | 7B | 74.8% | 8.3 | 84.2% | 88.4% | 4.7 GB |
| deepseek-r1:7b | 7B | 72.6% | 8.4 | 79.4% | 91.2% | 4.5 GB |
| qwen2.5:14b | 14B | 79.6% | 8.7 | 89.0% | 90.4% | 9 GB |
| deepseek-r1:14b | 14B | 78.4% | 8.9 | 88.2% | 94.8% | 9 GB |
| llama3.3:70b | 70B | 86.4% | 9.1 | 88.4% | 93.2% | ~42 GB |
| deepseek-r1:70b | 70B | 87.8% | 9.2 | 90.2% | 96.4% | ~43 GB |
| Claude Sonnet 4.5 | Cloud | 89.2% | 9.4 | 92.0% | 96.8% | API |
| GPT-4o | Cloud | 87.8% | 9.3 | 90.2% | 95.8% | API |
* Benchmark scores are approximate and sourced from published papers and community evaluations. Actual performance varies by quantization level, prompt format, and task type.
Quick Install Reference
Copy and paste into your terminal after installing Ollama. All models download automatically.
# Entry Level (8 GB RAM)
$ ollama pull llama3.2:3b
$ ollama pull qwen2.5:3b
# Mid Range (16 GB RAM / 6-8 GB VRAM)
$ ollama pull qwen2.5:7b # General purpose — recommended
$ ollama pull deepseek-r1:7b # Reasoning tasks
$ ollama pull qwen2.5-vl:7b # Vision + images
$ ollama pull glm-ocr:latest # Document OCR
$ ollama pull mistral:7b # Fast responses
# High End (32 GB RAM / 12-16 GB VRAM)
$ ollama pull qwen2.5:14b
$ ollama pull qwen2.5-coder:14b # For coding
$ ollama pull deepseek-r1:14b # For reasoning
# Workstation (64 GB RAM / 24+ GB VRAM)
$ ollama pull llama3.3:70b
$ ollama pull deepseek-r1:70b
$ ollama pull qwen2.5:72b
# Always required — RAG embedding model
$ ollama pull nomic-embed-text # Required for MIKA5 knowledge base
Embedding Models — Required for RAG
The RAG (Retrieval-Augmented Generation) engine in MIKA5 requires an embedding model to convert your documents into vector representations. Embedding models are not for chatting — they are mathematical tools that transform text into numerical vectors so that semantically similar passages can be found quickly. You must have an embedding model installed for the Knowledge Base feature to work.
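Retrieval then boils down to comparing those vectors: the chunks whose embeddings point in nearly the same direction as the query's embedding are the ones surfaced to the chat model. A minimal sketch of cosine-similarity ranking with made-up 3-dimensional vectors — real embeddings have 768 or 1024 dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.2]  # hypothetical query embedding
chunks = {
    "setup guide": [0.8, 0.2, 0.1],    # similar direction to the query
    "release notes": [0.1, 0.9, 0.3],  # different topic, different direction
}
best = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
print(best)  # setup guide
```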
nomic-embed-text is the default and recommended embedding model for MIKA5. At only 274 MB it is very fast, and it produces high-quality 768-dimensional embeddings. Excellent for English and Spanish documents. The RAG engine uses this model unless another is configured.
mxbai-embed-large offers higher-quality embeddings (1024 dimensions) at the cost of larger size (~670 MB) and slower processing. Better for very long documents or when retrieval precision is critical. Consider it if you notice MIKA5's RAG retrieval returning less relevant chunks.
Embedding models run in the background when you upload documents to MIKA5. You will not interact with them directly. They only need to be installed — MIKA5 handles the rest automatically. Do not select them as your chat model in the model dropdown.