GitHub repo: https://github.com/streamlinecoreinitiative/MLX_Llama_TTS_MPS
This guide shows how to clone our repo, grab a 4-bit Llama-3 model, and add real-time text-to-speech (TTS), all accelerated by Metal Performance Shaders (MPS). No cloud calls, no Docker, no Rosetta.
Conversational AI powered by Meta-Llama-3-8B (quantised to 4 bits; ~6 GB RAM at runtime).
Kokoro TTS that speaks each reply through your Mac’s speakers almost immediately.
Everything runs natively on M-series GPUs via MLX (Apple’s open-source tensor library).
macOS 13 or newer
Any M-series chip (M1 → M4) with ≥ 16 GB RAM
Python 3.10+
Installed via Homebrew or pyenv
Command-line git
brew install git
git clone https://github.com/streamlinecoreinitiative/MLX_Llama_TTS_MPS.git
cd MLX_Llama_TTS_MPS
python -m venv .venv && source .venv/bin/activate
# Keep build tools fresh (once per machine)
pip install -U pip setuptools wheel cmake ninja
# 1 · Metal tensor backend
pip install -U mlx
# 2 · LLM helpers (quantise/chat/fine-tune)
pip install --no-cache-dir git+https://github.com/ml-explore/mlx-lm.git@main
# 3 · TTS wrapper (runs on the same MPS backend)
pip install --no-cache-dir git+https://github.com/Blaizzy/mlx-audio.git@main
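Before moving on, it's worth confirming that MLX actually sees the Metal GPU. This quick Python check (a minimal sanity test, not part of the repo) should print Device(gpu, 0) on any M-series Mac:
# Sanity check: confirm MLX dispatches to the Metal GPU
import mlx.core as mx

print(mx.default_device())   # expected: Device(gpu, 0)

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))
mx.eval(a @ b)               # MLX is lazy; eval() forces the GPU matmul
print("Metal matmul OK")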
Why MLX? It talks directly to Metal GPU kernels, so we avoid Rosetta emulation and get fast, battery-friendly inference.
Why Kokoro TTS? It’s tiny (82M parameters) and already ported to MLX, so voice synthesis lives on the same GPU.
# First launch downloads ≈5 GB of 4-bit weights (cached after)
mlx_lm.chat --model mlx-community/Meta-Llama-3-8B-Instruct-4bit
# Hit Ctrl-C once you see the interactive prompt; the weights are now cached.
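If you prefer scripting over the interactive chat, the same cached weights can be driven from Python via mlx-lm. This is a minimal sketch (the prompt and token budget are illustrative):
# One-shot generation against the cached 4-bit weights
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# Instruct models answer best when the chat template wraps the question
messages = [{"role": "user", "content": "Explain quantum entanglement in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=80))
With the weights cached, launch the assistant: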
python main.py
You’ll see:
=== MPS Voice Assistant (Streamline Core) ===
Type 'exit' to quit.
You:
Type a question (e.g. “Explain quantum entanglement in one sentence”).
Within about two seconds the assistant prints a concise answer of under 60 words and speaks it aloud.
Because MLX uses Apple Silicon’s unified memory, the LLM and the TTS model share one memory pool, so replies flow into speech synthesis without any copy overhead.
The script waits until each WAV chunk is fully written, then plays clips sequentially, eliminating overlap.
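Conceptually, the playback loop boils down to the pattern below (a simplified sketch, not the repo’s exact code). macOS’s built-in afplay blocks until a clip finishes, which is what guarantees the sequential, overlap-free playback:
# Sequential playback sketch: each clip plays to completion before the next starts
import subprocess
from pathlib import Path

def play_clips(clips: list[Path]) -> None:
    for clip in clips:
        subprocess.run(["afplay", str(clip)], check=True)  # blocks until done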
Change the voice: pass a different voice= value, e.g. voice="af_heart" (see the Kokoro voice list).
Switch to English prompts: set language="en" in the generate_audio() call and tweak the SYSTEM_PROMPT.
Use a faster model: e.g. mlx-community/Mistral-7B-Instruct-4bit loads by changing one line in main.py. All three tweaks are sketched together below.
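Put together, the relevant edits in main.py would look roughly like this. The parameter names follow the descriptions above; the reply variable is illustrative, so check the repo for the exact call sites:
# Illustrative main.py tweaks (verify names against the actual repo code)
MODEL_NAME = "mlx-community/Mistral-7B-Instruct-4bit"  # one-line model swap
SYSTEM_PROMPT = "You are a concise assistant. Answer in English, under 60 words."

generate_audio(
    text=reply,         # the LLM's reply text
    voice="af_heart",   # any entry from the Kokoro voice list
    language="en",      # switch TTS output to English
)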
Once the model files are cached, the assistant works offline—ideal for classrooms, remote clinics, or disaster sites where connectivity is limited.
For feedback, pull requests, or performance reports on other M-chips, open an issue on our GitHub repo. Together we can keep pushing responsible, device-native AI forward.
Updated for the Streamline Core Initiative educational site – June 2025.