Model Guide
MIKA5 Official Guide · Updated March 2026

Choosing the right AI model
for your hardware and workflow

Not all models are created equal — and not all hardware can run them. This guide explains the most relevant open-source models available today, their benchmark performance, hardware requirements, and the specific use cases where they shine. Whether you have a basic laptop or a workstation with a 24 GB GPU, there is a model for you.

All local models below can be pulled via ollama pull model-name. Cloud models require a provider API key configured in MIKA5 settings.

Hardware Tiers

Running AI locally means the model weights must fit in memory — either RAM (for CPU inference) or VRAM (for GPU inference). GPU inference is always significantly faster, but CPU works for smaller models. As a rule of thumb, a 7B model needs approximately 4–5 GB in 4-bit quantization (Q4), 5–6 GB in Q5, and 7–8 GB in Q8. Requirements scale roughly linearly with parameter count, so estimate other sizes proportionally: a 14B model at Q4 needs about twice the memory of a 7B.
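
That rule of thumb can be turned into a quick calculation: weights take roughly half a byte per parameter at Q4 (one byte at Q8), plus overhead for the KV cache and runtime buffers. A sketch — the 1.2 overhead factor is an assumption, not an exact figure:

```shell
# Rough memory estimate for a quantized model:
#   size_GB ≈ params_in_billions x bytes_per_weight x 1.2 (runtime overhead)
# Q4 ≈ 0.5 bytes/weight, Q5 ≈ 0.625, Q8 ≈ 1.0
estimate_gb() {
  # $1 = model size in billions of parameters, $2 = bytes per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b * 1.2 }'
}

estimate_gb 7 0.5    # 7B at Q4  -> 4.2
estimate_gb 14 0.5   # 14B at Q4 -> 8.4
estimate_gb 70 0.5   # 70B at Q4 -> 42.0
```

The 70B estimate lands close to the ~42 GB figure quoted for llama3.3:70b in the workstation tier.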

Entry Level

8 GB RAM — CPU Only or Integrated GPU

Entry-level hardware includes most budget laptops, older workstations, and PCs with no dedicated GPU (or only an integrated one). CPU inference is possible but slow — expect 3–8 tokens/second on modern CPUs. Focus on models under 4 GB in size (1B–3B parameters in Q4 quantization). These models are surprisingly capable for everyday tasks like summarizing text, answering questions, writing emails, and simple coding assistance.

llama3.2:3b Local

Meta's Llama 3.2 3B is one of the best small models available. Despite its compact size, it follows instructions reliably, writes well in English and Spanish, and handles summarization and Q&A with quality that outperforms models twice its size from previous generations. Ideal as the default starter model.

~2 GB (Q4) · 6–10 tok/s CPU · 8k context
$ ollama pull llama3.2:3b
qwen2.5:3b Local

Alibaba's Qwen 2.5 3B punches above its weight in reasoning and multilingual tasks. Trained on a massive multilingual dataset, it handles Spanish, Chinese, English, and dozens of other languages with comparable fluency. For users who frequently work in multiple languages, this is the go-to entry-level model.

~2 GB (Q4) · 6–10 tok/s CPU · 32k context
$ ollama pull qwen2.5:3b
phi4:3b Local

Microsoft's Phi-4 series demonstrates that careful data curation matters as much as scale. The 3B variant scores remarkably well on reasoning and math benchmarks relative to its size. Excellent for STEM tasks, problem solving, and situations where logical accuracy is critical. Struggles more than Llama/Qwen with creative writing.

~2 GB (Q4) · 8–12 tok/s CPU · 16k context
$ ollama pull phi4:3b
gemma2:2b Local

Google's Gemma 2 2B is the lightest capable model for very constrained hardware. At under 1.5 GB in Q4, it runs on nearly any modern PC. Ideal for simple chat, quick lookups, and draft writing when hardware severely limits options. Quality drops noticeably on complex tasks.

~1.5 GB (Q4) · 10–15 tok/s CPU · 8k context
$ ollama pull gemma2:2b
Mid Range

16 GB RAM — GPU 6–8 GB VRAM

The mid-range sweet spot. A PC with 16 GB of system RAM and a GPU with 6–8 GB of VRAM (RTX 3060 Ti, RX 6600 XT, etc.) can run 7B models entirely on GPU, delivering 20–60 tokens/second — fast enough for a natural conversation flow. This is the most common configuration among power users and the recommended starting point for serious AI work. 7B models in Q4 quantization require approximately 4–5 GB of VRAM.
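
Before picking a tier, confirm what your machine actually has. On Linux, two quick checks — the first assumes an NVIDIA GPU and its driver's nvidia-smi tool (AMD users have rocm-smi instead):

```shell
# Total VRAM on an NVIDIA GPU (skipped if the NVIDIA driver is not installed)
command -v nvidia-smi >/dev/null && \
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# Total system RAM in gigabytes (Linux)
free -g | awk '/^Mem:/ { print $2 " GB RAM" }'
```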

qwen2.5:7b Local

Qwen 2.5 7B is arguably the best 7B model available as of 2025–2026. It scores near the top of the MMLU (general knowledge), MT-Bench (instruction following), and GSM8K (math reasoning) benchmarks at this scale. Excellent multilingual support, a long 32k context window, and competitive coding ability. The go-to recommendation for most MIKA5 users with a mid-range PC.

~4.7 GB (Q4) · 30–55 tok/s GPU · 32k context · Multilingual
$ ollama pull qwen2.5:7b
llama3.1:8b Local

Meta's Llama 3.1 8B remains a strong baseline for English-primary tasks. Excellent instruction following, natural conversation, and creative writing. The massive fine-tuning ecosystem around Llama means there are hundreds of specialized variants — from roleplay to legal analysis. Strong RLHF means it's helpful and safe out of the box. Best for English-speaking users who want broad compatibility.

~5.0 GB (Q4) · 25–50 tok/s GPU · 128k context
$ ollama pull llama3.1:8b
mistral:7b Local

Mistral AI's 7B model introduced sliding window attention and grouped query attention, making it faster than naive 7B implementations. It's one of the fastest 7B models in tokens-per-second and excels at structured output generation. Perfect for users who need rapid responses over maximum quality, API-style applications within MIKA5, or data extraction tasks where speed and format adherence matter.

~4.1 GB (Q4) · 40–70 tok/s GPU · 32k context · Very fast
$ ollama pull mistral:7b
deepseek-r1:7b Local · Reasoning

DeepSeek-R1 is a reasoning model trained with reinforcement learning to "think out loud" before giving an answer — similar in concept to OpenAI's o1 series. The 7B distilled version brings this chain-of-thought reasoning capability to mid-range hardware. Responses are slower because the model reasons step by step, but the quality on math, logic, and complex analysis tasks is dramatically better than standard 7B models. Use when accuracy matters more than speed.

~4.5 GB (Q4) · 15–30 tok/s GPU · 128k context · Chain-of-thought
$ ollama pull deepseek-r1:7b
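
Ollama's DeepSeek-R1 builds print the reasoning wrapped in <think>…</think> tags before the final answer. If you pipe model output through a script, a sed range delete keeps only the answer — sketched here on a hardcoded sample reply rather than live model output:

```shell
# A hardcoded reply in the R1 style: reasoning first, answer after
reply='<think>
The user asks for 12 times 12. 12 * 12 = 144.
</think>
12 x 12 = 144.'

# Delete every line from <think> through </think>, keeping the final answer
echo "$reply" | sed '/<think>/,/<\/think>/d'
# -> 12 x 12 = 144.
```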
qwen2.5-vl:7b Local · Vision

The best local vision-language model at the 7B scale. Qwen2.5-VL can understand images, charts, diagrams, screenshots, and handwritten notes. It outperforms LLaVA-13B and many larger vision models on benchmarks like MMBench and TextVQA. Use it when you need to describe images, extract text from screenshots, analyze charts in your documents, or work with visual content in MIKA5.

~5.5 GB (Q4) · 20–40 tok/s GPU · 128k context · Image input
$ ollama pull qwen2.5-vl:7b
gemma3:9b Local · Vision

Google's Gemma 3 9B is a multimodal model that can process both text and images. It excels at following detailed instructions and is strong in safety and harmlessness. The 9B model runs on 6 GB VRAM in Q4 and offers a good balance between vision capability and speed. Also available in 27B for users with more VRAM.

~5.8 GB (Q4) · 20–35 tok/s GPU · 32k context · Image input
$ ollama pull gemma3:9b
High End

32 GB RAM — GPU 12–16 GB VRAM

High-end configurations unlock the 13B–14B model tier on GPU, and can run 30B–34B models in CPU+GPU hybrid mode. At this level, quality becomes noticeably stronger — complex multi-step reasoning, nuanced creative writing, and professional-grade coding all become possible. With 16 GB of VRAM (RTX 4080, RX 7900 GRE), even 14B models run at excellent speeds. Users with 32 GB of system RAM can also consider 30B models via CPU offloading.

qwen2.5:14b Local

The 14B parameter Qwen 2.5 delivers performance comparable to much larger models from previous generations. It outperforms Llama 3 70B on several benchmarks including coding (HumanEval) and mathematics (MATH benchmark). The jump from 7B to 14B is significant: more coherent long documents, better code, and more nuanced reasoning. If you have 12 GB of VRAM, this is the most impactful upgrade.

~9 GB (Q4) · 15–30 tok/s GPU · 12 GB VRAM · 32k context
$ ollama pull qwen2.5:14b
qwen2.5-coder:14b Local · Code

Fine-tuned specifically on code, Qwen2.5-Coder 14B achieves scores that match or exceed GPT-4-level models on HumanEval (Python coding benchmark) at this model size. It understands and generates code in 40+ programming languages, can explain complex algorithms, debug errors, write unit tests, and perform code review. The best local option for serious software developers using MIKA5 as a coding companion.

~9 GB (Q4) · 15–30 tok/s GPU · 12 GB VRAM · 32k context · 40+ languages
$ ollama pull qwen2.5-coder:14b
deepseek-r1:14b Local · Reasoning

The 14B distillation of DeepSeek-R1 brings near o1-level reasoning performance to 12 GB of VRAM. It excels at step-by-step mathematical proofs, logical puzzles, complex analysis, and scientific problem solving. The model explicitly shows its "thinking" process before answering, which is useful for verifying its reasoning chain. An excellent choice when you need rigorous, verifiable outputs rather than fast conversational responses.

~9 GB (Q4) · 10–20 tok/s GPU · 12 GB VRAM · Chain-of-thought
$ ollama pull deepseek-r1:14b
phi4:14b Local · Code

Microsoft's Phi-4 14B demonstrates that careful training data selection can produce a model that outperforms much larger models on STEM benchmarks. It scores in the top tier on MATH, AMC, and GPQA (graduate-level science questions) at the 14B scale. Not the best for general conversation or creative writing, but exceptional for scientific analysis, mathematics, and code comprehension. A great complement to a more conversational model.

~9 GB (Q4) · 15–25 tok/s GPU · 12 GB VRAM · 16k context
$ ollama pull phi4:14b
Workstation

64 GB RAM — GPU 24 GB+ VRAM (or multi-GPU)

Workstation-class hardware — typically an RTX 3090/4090 (24 GB VRAM), professional GPUs like the A100/H100, or dual-GPU setups — opens up the full 70B model class. At this tier, local performance can rival or exceed cloud APIs in quality. A single RTX 4090 can run Llama 3.3 70B (Q4, ~42 GB) with CPU offloading, or a 32B–34B model entirely on GPU. This is also the tier where high-context models with 256k+ token windows become practical.

| Model | Params | VRAM (Q4) | Strength | Command |
| --- | --- | --- | --- | --- |
| llama3.3:70b | 70B | ~42 GB | Best overall Llama model · Strong on all tasks | ollama pull llama3.3:70b |
| qwen2.5:72b | 72B | ~44 GB | Top MMLU · Best multilingual at this tier | ollama pull qwen2.5:72b |
| deepseek-r1:70b | 70B | ~43 GB | Best local reasoning · Near o1 quality | ollama pull deepseek-r1:70b |
| qwen2.5-coder:32b | 32B | ~20 GB | Best local code model · Near GPT-4 on HumanEval | ollama pull qwen2.5-coder:32b |
| qwen2.5-vl:72b | 72B | ~44 GB | Best local vision · Exceeds GPT-4V on TextVQA | ollama pull qwen2.5-vl:72b |

By Use Case

Reasoning & Analysis

Reasoning models use extended "thinking" passes — sometimes called chain-of-thought or test-time compute — to solve complex problems more accurately. They are slower than standard models but dramatically more reliable for tasks involving multiple logical steps, mathematical proofs, debugging complex code, or analyzing nuanced arguments. For document analysis in MIKA5, combining a reasoning model with the RAG engine produces exceptionally grounded, verifiable answers.

deepseek-r1 · Top Pick

Available in 1.5B, 7B, 14B, 32B, and 70B. The 7B is the best entry point for mid-range hardware. Training via RL (reinforcement learning) on reasoning tasks means it generalizes to novel problems that weren't in training data. Scores above 90% on MATH and AIME competition math problems at 70B scale — approaching human expert level.

MATH: 92.3%
MMLU: 90.8%
HumanEval: 89.4%
qwen3.5 / qwen2.5 · Strong Alt

Alibaba's Qwen series has consistently topped leaderboards for general reasoning at various scales. Qwen3.5 (if available via Ollama in your region) extends this further. Qwen2.5 72B matches or exceeds Llama 3.3 70B on most reasoning benchmarks while also being the best multilingual option at this size.

MMLU: 87.3%
MT-Bench: 9.1/10
GSM8K: 91.6%

Coding

Code models are fine-tuned on massive code repositories (GitHub, Stack Overflow, documentation). They understand programming concepts deeply, can write boilerplate, explain algorithms, find bugs, write tests, and perform code review. For MIKA5, code models are most powerful when combined with a Knowledge Base containing your project's documentation, architecture docs, or API references — the RAG engine will surface relevant context automatically.

qwen2.5-coder:7b / 14b / 32b

The best local code models at each tier. The 7B fits on 6 GB VRAM, 14B on 12 GB, 32B on 20 GB. All variants outperform CodeLlama 34B on HumanEval. Strong in Python, JavaScript, TypeScript, Go, Rust, Java, C++, SQL, Bash, and more.

nemotron-3-super:cloud · Cloud via API

NVIDIA's Nemotron Super is built on Llama with extensive post-training optimizations from NVIDIA. It excels at instruction following and code generation. Available through the NVIDIA API or :cloud tag on Ollama if offered in your region. Strong at enterprise-scale code tasks and CUDA/GPU-related programming.

deepseek-coder-v2:16b · Local

DeepSeek's dedicated code model using a Mixture-of-Experts architecture. Only 2.4B parameters are active per token despite having 16B total, making it faster than dense 16B models. Exceptional at algorithmic problems and competitive programming.

starcoder2:15b · Local

BigCode's StarCoder 2, trained on The Stack v2 (an ethical, opt-out software corpus). Strong fill-in-the-middle capability, meaning it can complete code in the middle of a function — excellent for code completion use cases.

Vision & OCR

Vision-language models accept images alongside text prompts. In MIKA5, you can paste images directly into the chat (Ctrl+V or the attachment button) and the app automatically detects which models support vision. Vision models can describe images, read text from screenshots and photos (OCR), analyze charts and diagrams, understand UI layouts, and compare visual information to your knowledge base documents.

qwen2.5-vl:7b · Best 7B Vision

Tops most vision benchmarks at the 7B scale. Can process multiple images per request, handle very high-resolution images, and understand complex charts and documents. The recommended vision model for mid-range hardware.

glm-ocr:latest · OCR Specialist

Zhipu AI's GLM-OCR is purpose-built for document OCR and text extraction from images. It outperforms general vision models on extracting structured text from scanned documents, PDFs, tables, and forms. Lightweight enough to run on 6 GB VRAM. Best for document digitization workflows.

gemma3:9b · Vision + Chat

Gemma 3's multimodal version maintains strong text quality alongside vision. Good balance for users who want a single model for both image analysis and general conversation. Also available in 27B for those with more VRAM.

llava:13b · Classic

The original open-source vision model. LLaVA 13B is no longer state-of-the-art but remains widely available and well-tested. Reliable for basic image description and simple visual Q&A. Runs on 10 GB VRAM. Best as a fallback if newer models aren't available.

Multilingual

Most open-source models are primarily English-trained, with Spanish, French, German, and Chinese as secondary languages. For users who primarily work in Spanish or other non-English languages, model selection matters significantly — the same 7B model can produce fluent professional Spanish or halting translated-sounding text depending on its training data composition.

qwen2.5 · Best for ES/ZH/multilingual

Alibaba's Qwen series was trained on a dataset with strong multilingual representation including Spanish, Chinese, Japanese, French, German, Korean, and Arabic. It consistently outperforms Llama on non-English benchmarks. If you work primarily in Spanish, Qwen2.5 at any size is the recommended choice.

glm-5:cloud · Cloud via API

Zhipu AI's GLM-5 is one of China's leading frontier models and has exceptional Chinese/English bilingual capability with strong Spanish performance. Available as a cloud model through Ollama's :cloud tag. Particularly strong for translation, cross-lingual tasks, and formal writing in East Asian languages.

kimi-k2:cloud · Cloud via Moonshot API

Moonshot AI's Kimi K2 model features a 256k token context window — by far the longest available context for cloud models in MIKA5. Ideal for analyzing entire books, large codebases, or lengthy legal/technical documents in one conversation. Configure via the Moonshot API provider in MIKA5 settings.

mistral:7b · Best for FR/EU languages

Mistral AI is a French company, and their models have above-average French and European language performance. If you frequently work in French, Italian, Portuguese, or Spanish, Mistral 7B offers better results than Llama equivalents for European language tasks.

High-End & Cloud Models

The following models represent the current state of the art. Some require significant local hardware; others are available as cloud models through MIKA5's optional provider integrations (OpenAI, Anthropic, Groq, Moonshot). Cloud models are processed on the provider's servers — your prompts leave your machine, but you gain access to models that would require hundreds of GB of local memory to run.

Qwen3.5 / Qwen3 · Frontier · Local (Ollama)

Alibaba's Qwen3 series introduces hybrid reasoning modes — you can switch between fast "instruct" mode and slower "thinking" mode within the same model. The 72B variant delivers performance matching the best cloud models of 2024 while running entirely locally on workstation hardware. Its multilingual capabilities remain industry-leading for open-source models. Requires ~44 GB VRAM for 72B in Q4, or can be run with CPU offloading on systems with 64+ GB RAM.

$ ollama pull qwen3.5:72b # ~44 GB — requires high-end hardware
Nemotron-3-Super · Cloud · Code Specialist

NVIDIA's Nemotron Super is built on the Llama 3.1 70B architecture but refined with NVIDIA's post-training pipeline. It achieves top-tier scores on code generation, instruction following, and function calling. Available as a cloud model through select API providers, or as a :cloud tag on Ollama if your region supports it. Particularly strong for CUDA programming, HPC workloads, and GPU-related development tasks — which is a natural fit for AI practitioners.

GLM-5 (Zhipu AI) · Cloud · Multilingual

GLM-5 is Zhipu AI's frontier model, a successor to the GLM-4 series that achieved competitive performance with GPT-4. Strong at Chinese/English bilingual tasks, long document analysis, and complex reasoning. The GLM-OCR variant (local, via Ollama) is purpose-built for document extraction. GLM-5 via cloud is configured in MIKA5 by adding a Zhipu AI API key in settings and selecting glm-5:cloud as the model.

Kimi K2 (Moonshot AI) · Cloud · Moonshot API

Kimi K2's defining feature is its 256,000 token context window — the equivalent of a full-length novel or an entire codebase. It can process and reason over extremely long documents without losing coherence. In MIKA5, configure the Moonshot API provider and select moonshot-v1-128k or kimi-k2 as the model. Best for: legal document review, academic paper analysis, processing entire project codebases, or any task where full context retention is critical. Note: long-context processing is billed by token volume on the cloud API.
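
To put a 256,000-token window in perspective, a back-of-the-envelope conversion — the ~0.75 words-per-token ratio and ~300 words-per-page figure are rough English-language heuristics, and real ratios vary by tokenizer and language:

```shell
awk 'BEGIN {
  tokens = 256000
  words  = tokens * 0.75   # rough heuristic: ~0.75 English words per token
  pages  = words / 300     # ~300 words per printed page
  printf "%d words, ~%d pages\n", words, pages
}'
# -> 192000 words, ~640 pages
```

That is several full-length novels' worth of text in a single conversation.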

Benchmark Comparison

Scores below reflect published benchmark results from model authors and independent evaluations. MMLU measures broad academic knowledge (57 subjects). MT-Bench measures instruction following quality (rated 1–10). HumanEval measures Python code generation correctness. GSM8K measures grade-school math reasoning. MATH measures competition mathematics.

| Model | Params | MMLU | MT-Bench | HumanEval | GSM8K | VRAM (Q4) |
| --- | --- | --- | --- | --- | --- | --- |
| gemma2:2b | 2B | 52.4% | 6.9 | 37.8% | 41.2% | 1.5 GB |
| llama3.2:3b | 3B | 58.2% | 7.2 | 45.3% | 67.4% | 2 GB |
| qwen2.5:3b | 3B | 65.6% | 7.4 | 53.2% | 72.8% | 2 GB |
| mistral:7b | 7B | 64.2% | 7.8 | 60.6% | 74.2% | 4.1 GB |
| llama3.1:8b | 8B | 66.4% | 8.0 | 72.6% | 84.2% | 5 GB |
| gemma3:9b | 9B | 70.4% | 8.1 | 65.9% | 82.6% | 5.8 GB |
| qwen2.5:7b | 7B | 74.8% | 8.3 | 84.2% | 88.4% | 4.7 GB |
| deepseek-r1:7b | 7B | 72.6% | 8.4 | 79.4% | 91.2% | 4.5 GB |
| qwen2.5:14b | 14B | 79.6% | 8.7 | 89.0% | 90.4% | 9 GB |
| deepseek-r1:14b | 14B | 78.4% | 8.9 | 88.2% | 94.8% | 9 GB |
| llama3.3:70b | 70B | 86.4% | 9.1 | 88.4% | 93.2% | ~42 GB |
| deepseek-r1:70b | 70B | 87.8% | 9.2 | 90.2% | 96.4% | ~43 GB |
| Claude Sonnet 4.5 | Cloud | 89.2% | 9.4 | 92.0% | 96.8% | API |
| GPT-4o | Cloud | 87.8% | 9.3 | 90.2% | 95.8% | API |

* Benchmark scores are approximate and sourced from published papers and community evaluations. Actual performance varies by quantization level, prompt format, and task type. The recommended model for each hardware tier is called out in the sections above.

Quick Install Reference

Copy and paste into your terminal after installing Ollama. All models download automatically.

# Entry Level (8 GB RAM)

$ ollama pull llama3.2:3b

$ ollama pull qwen2.5:3b

# Mid Range (16 GB RAM / 6-8 GB VRAM)

$ ollama pull qwen2.5:7b # General purpose — recommended

$ ollama pull deepseek-r1:7b # Reasoning tasks

$ ollama pull qwen2.5-vl:7b # Vision + images

$ ollama pull glm-ocr:latest # Document OCR

$ ollama pull mistral:7b # Fast responses

# High End (32 GB RAM / 12-16 GB VRAM)

$ ollama pull qwen2.5:14b

$ ollama pull qwen2.5-coder:14b # For coding

$ ollama pull deepseek-r1:14b # For reasoning

# Workstation (64 GB RAM / 24+ GB VRAM)

$ ollama pull llama3.3:70b

$ ollama pull deepseek-r1:70b

$ ollama pull qwen2.5:72b

# Always required — RAG embedding model

$ ollama pull nomic-embed-text # Required for MIKA5 knowledge base

Embedding Models — Required for RAG

The RAG (Retrieval-Augmented Generation) engine in MIKA5 requires an embedding model to convert your documents into vector representations. Embedding models are not for chatting — they are mathematical tools that transform text into numerical vectors so that semantically similar passages can be found quickly. You must have an embedding model installed for the Knowledge Base feature to work.

nomic-embed-text Recommended

The default and recommended embedding model for MIKA5. Only 274 MB, very fast, and produces high-quality 768-dimensional embeddings. Excellent for English and Spanish documents. The RAG engine uses this model unless another is configured.

$ ollama pull nomic-embed-text
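
Once pulled, you can sanity-check the model against Ollama's local HTTP API — the endpoint and field names below are Ollama's (default port 11434), not MIKA5's, and the server must be running:

```shell
# Request an embedding vector from the local Ollama server (requires `ollama serve`)
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "MIKA5 stores documents as vectors"}'
# The response is JSON of the form {"embedding": [0.12, -0.08, ...]} with 768 numbers
```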
mxbai-embed-large

Higher-quality embeddings (1024 dimensions) at the cost of larger size (~670 MB) and slower processing. Better for very long documents or when retrieval precision is critical. Consider it if MIKA5's RAG retrieval returns less relevant chunks than expected.

$ ollama pull mxbai-embed-large

Embedding models run in the background when you upload documents to MIKA5. You will not interact with them directly. They only need to be installed — MIKA5 handles the rest automatically. Do not select them as your chat model in the model dropdown.
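
Under the hood, retrieval compares these vectors by cosine similarity: chunks whose vectors point in nearly the same direction as the query's vector are returned first. A toy illustration with hardcoded 3-dimensional vectors (real nomic-embed-text vectors have 768 dimensions):

```shell
# Cosine similarity of two space-separated vectors, computed in awk
cosine() {
  echo "$1|$2" | awk -F'|' '{
    n = split($1, a, " "); split($2, b, " ")
    for (i = 1; i <= n; i++) { dot += a[i]*b[i]; na += a[i]^2; nb += b[i]^2 }
    printf "%.2f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine "1 2 3" "2 4 6"    # same direction    -> 1.00
cosine "1 2 3" "3 -1 0"   # nearly orthogonal -> 0.08
```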