Swama vs Ollama: Why Apple Silicon Macs Deserve a Faster Local AI Runtime

The short answer: because you can, and now faster than ever. If you have an Apple Silicon Mac (M1 or later) with 16GB of RAM or more, you can run powerful LLMs locally without sending any data to the cloud. And with Swama you’ll get significantly better performance than with Ollama, because it’s built on Apple’s MLX framework, which is optimized specifically for Apple Silicon. Just download Swama, run the app, pick a model, and you’re done. If you’ve invested in a powerful Mac, why not put it to work at its full potential?

Beyond the practical appeal, there’s a more important business reason: data privacy. When you send queries to OpenAI, Anthropic, or Google, you’re essentially giving these companies access to your data and usage patterns. Their terms and conditions make it clear that this information can be used to train future models and inform their competitive strategies. While you may not be able to avoid these services entirely, you can be selective. Local LLMs allow you to keep sensitive or personal information on your machine, where it belongs. The rule is simple: don’t send anything to these providers that you wouldn’t want them to have a permanent copy of.

Why Swama Instead of Ollama?

If you’re already using Ollama, you might wonder why switch. Here’s the key difference: Swama is built from the ground up for Apple Silicon using Apple’s MLX framework. This means:

  • Faster inference speeds — MLX is optimized specifically for the unified memory architecture of M1/M2/M3/M4 chips
  • Native macOS experience — Written in pure Swift with a beautiful menu bar app
  • Better memory efficiency — Takes full advantage of Apple Silicon’s unified memory
  • Modern feature set — Built-in support for vision models (VLM), speech recognition (Whisper), text-to-speech, and embeddings out of the box

If you have an Intel Mac, stick with Ollama. But if you have Apple Silicon, Swama will squeeze more performance out of your hardware.

System Requirements

  • macOS 15.0 (Sequoia) or later
  • Apple Silicon (M1/M2/M3/M4)
  • 16GB RAM recommended (8GB minimum for smaller models)
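
Not sure which chip or macOS version you have? These standard macOS commands (nothing Swama-specific) will tell you:

uname -m                               # arm64 means Apple Silicon, x86_64 means Intel
sw_vers -productVersion                # macOS version; you want 15.0 or later
sysctl -n machdep.cpu.brand_string     # chip name, e.g. Apple M2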

Installation

Option 1: Homebrew (Recommended)

Open Terminal and run:

brew install swama
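
Once Homebrew finishes, a quick sanity check never hurts (ordinary shell commands, nothing Swama-specific). If swama isn’t found yet, the “Install Command Line Tool” step described below adds it to your PATH:

brew info swama     # confirms what Homebrew installed
which swama         # prints the CLI path once it is linked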

Option 2: Download the App

  1. Visit the Swama Releases page
  2. Download Swama.dmg from the latest release
  3. Double-click the DMG and drag Swama.app to your Applications folder
  4. Launch Swama from Applications or Spotlight

Note: On first launch, macOS may show a security warning. Go to System Settings > Privacy & Security, scroll down to the Security section, and click “Open Anyway”.

  5. Once running, click the Swama menu bar icon and select “Install Command Line Tool” to add the swama command to your PATH.

Running Your First Model

Open Terminal and run:

swama run qwen3 "Hello, how are you?"

That’s it! Swama will automatically download the model on first use (may take a few minutes) and start generating a response. No need to pull first — it just works.

For a smaller, faster model perfect for quick tasks:

swama run llama3.2-1b "What is the capital of France?"

To stop, press Ctrl+C.

RAM Configuration Recommendation

For most Apple Silicon Mac users, 16GB is the sweet spot. Here’s a quick guide:

  • 8GB RAM: llama3.2-1b, qwen3-1.7b, gemma3 (smaller models only)
  • 16GB RAM: qwen3, llama3.2, deepseek-r1-8b (most 7–8B models)
  • 32GB RAM: qwen3-30b, gemma3-27b (larger models)
  • 64GB+ RAM: qwen3-32b, deepseek-r1 (enterprise-grade models)

If you’re considering a Mac purchase, prioritize 16GB or more if you want to run useful local LLMs.
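
To see how much memory your Mac actually has, and how much headroom is free right now, you can use these built-in macOS tools (they are not part of Swama, and the memory_pressure output format may vary slightly between macOS versions):

echo "$(($(sysctl -n hw.memsize) / 1073741824)) GB installed"
memory_pressure | grep "free percentage"     # rough idea of current free memory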

Model Aliases — Keep It Simple

Swama uses friendly aliases instead of long HuggingFace URLs. Here are the most useful ones:

Language Models:

  • qwen3 — Qwen3 8B, great all-rounder (4.3 GB)
  • llama3.2 — Llama 3.2 3B (1.7 GB)
  • llama3.2-1b — Fastest small model (876 MB)
  • deepseek-r1-8b — Strong reasoning model (8.6 GB)

Vision Models (can understand images):

  • gemma3 — Gemma 3 4B vision model (3.2 GB)
  • qwen3-vl — Qwen3 Vision 4B (~4 GB)

Speech Recognition:

  • whisper-large — Highest accuracy transcription (1.6 GB)
  • whisper-small — Fast transcription (252 MB)

Common Commands

swama list                      # See installed models
swama pull qwen3                # Download a model
swama run qwen3 "Your prompt"   # Run inference
swama serve --port 28100        # Start API server
swama transcribe audio.wav      # Transcribe audio file

Start an API Server (OpenAI Compatible)

Swama can run as an API server compatible with the OpenAI format:

swama serve --host 0.0.0.0 --port 28100

You can then use it with any tool that supports OpenAI’s API:

curl -X POST http://localhost:28100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Any client or library that speaks the OpenAI API can point at this endpoint instead of api.openai.com; usually all you need to change is the base URL (and supply a placeholder API key).
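
As a quick check that the server is up, the standard OpenAI model-listing endpoint should also respond (this assumes Swama exposes /v1/models, which OpenAI-compatible servers generally do; the second line additionally assumes jq is installed):

curl -s http://localhost:28100/v1/models                       # raw JSON list of available models
curl -s http://localhost:28100/v1/models | jq -r '.data[].id'  # just the model IDs (requires jq)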

I’ll also take this chance to promote my new app, Molten. It’s available on the App Store for Mac, iPhone, and iPad, and it’s open source, so you can get it from our GitHub and compile it yourself if you prefer not to use the App Store. With Ollama I used to rely on Enchanted, which is a great client, but it doesn’t support Swama. (Ollama has its own API with some OpenAI compatibility, while Swama is more strictly OpenAI-compatible.) So I hacked on the Enchanted code, got it working with Swama and the Apple Foundation Model, released it on the App Store, and open sourced the result. The plan is to turn it into a private LLM and RAG app: drop a few documents into your own folders and it will build a local knowledge base out of them, with no privacy concerns.

Bonus Features Ollama Doesn’t Have

Vision Language Models — Analyze images locally:

swama run gemma3 "What's in this image?" -i /path/to/photo.jpg

Local Speech Recognition — Transcribe audio without cloud services:

swama transcribe meeting.wav --model whisper-large --language en

Text Embeddings — For building local RAG systems:

curl -X POST http://localhost:28100/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["Hello world"], "model": "mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ"}'

Other Common Reasons to Use Local LLMs Instead of Paid APIs

  • Privacy and data control — data never leaves your computer
  • Cost efficiency — no recurring subscription or API fees after initial hardware investment
  • Reduced latency and faster response times — no network round-trips to external servers
  • Customization and fine-tuning — full control over model behavior and optimization
  • Offline availability — works without internet connectivity
  • Data sovereignty and regulatory compliance — keeps processing within local boundaries
  • Technical control and experimentation — freedom to modify, debug, and iterate without restrictions
  • No dependency on vendor changes — immunity to API changes, pricing increases, or service discontinuation

Tips

  • If you previously used Ollama, Swama is a near drop-in replacement for the API: just point your apps to port 28100 instead. Swama sticks strictly to OpenAI’s /v1/ endpoint structure, so apps that expect the OpenAI API format work unchanged, whereas Ollama exposes its own native /api/ format alongside OpenAI-compatible endpoints.
  • The menu bar app makes it easy to see what’s running and manage models without touching the terminal.
  • For maximum speed, use quantized (4-bit) models, which is the default for most Swama model aliases.
  • Learn basic bash scripting to batch process multiple documents through your local LLM; there’s a small sketch right after this list.
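
Here’s a minimal sketch of that last tip, assuming a notes/ folder of .txt files (the folder name, prompt, and output layout are placeholders, not anything Swama-specific):

mkdir -p summaries
for f in notes/*.txt; do
  echo "Summarizing $f ..."
  swama run qwen3 "Summarize the following text in three bullet points: $(cat "$f")" \
    > "summaries/$(basename "$f" .txt).summary.txt"
done

Everything is stuffed into a single prompt here, so for long documents you’d want to chunk the text first to stay within the model’s context window.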

Running Swama locally puts you back in control of your data while letting your Apple Silicon hardware work at full speed. If you’ve got an M1 or newer Mac, this is the fastest way to run local LLMs today.