GitHub repo: https://github.com/streamlinecoreinitiative/MLX_Llama_TTS_MPS
This guide shows how to clone our repo, grab a 4-bit Llama-3 model, and add real-time text-to-speech (TTS), all accelerated by Metal Performance Shaders (MPS). No cloud calls, no Docker, no Rosetta.
Conversational AI powered by Meta-Llama-3-8B (quantised to 4 bits; ~6 GB RAM at runtime).
Kokoro TTS that speaks each reply through your Mac’s speakers almost immediately.
Everything runs natively on M-series GPUs via MLX (Apple’s open-source tensor library).
macOS 13 or newer
Any M-series chip (M1 → M4) with ≥ 16 GB RAM
Python 3.10+
Installed via Homebrew or pyenv
Command-line git
brew install git
git clone https://github.com/streamlinecoreinitiative/MLX_Llama_TTS_MPS.git
cd MLX_Llama_TTS_MPS
python -m venv .venv && source .venv/bin/activate
# Keep build tools fresh (once per machine)
pip install -U pip setuptools wheel cmake ninja
# 1 · Metal tensor backend
pip install -U mlx
# 2 · LLM helpers (quantise/chat/fine-tune)
pip install --no-cache-dir git+https://github.com/ml-explore/mlx-lm.git@main
# 3 · TTS wrapper (runs on the same MPS backend)
pip install --no-cache-dir git+https://github.com/Blaizzy/mlx-audio.git@main
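Before moving on, it's worth confirming that MLX actually sees the Metal GPU. This quick Python check (a minimal sanity test, not part of the repo) should print Device(gpu, 0) on any M-series Mac:
# Sanity check: confirm MLX dispatches to the Metal GPU
import mlx.core as mx

print(mx.default_device())   # expected: Device(gpu, 0)

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))
mx.eval(a @ b)               # MLX is lazy; eval() forces the GPU matmul
print("Metal matmul OK")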
Why MLX? It talks directly to Metal GPU kernels, so we avoid Rosetta emulation and get fast, battery-friendly inference.
Why Kokoro TTS? It’s tiny (82M parameters) and already ported to MLX, so voice synthesis lives on the same GPU.
# First launch downloads ≈5 GB of 4-bit weights (cached after)
mlx_lm.chat --model mlx-community/Meta-Llama-3-8B-Instruct-4bit
# Hit Ctrl-C once you see the interactive prompt; the weights are now cached.
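If you prefer scripting over the interactive chat, the same cached weights can be driven from Python via mlx-lm. This is a minimal sketch (the prompt and token budget are illustrative):
# One-shot generation against the cached 4-bit weights
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# Instruct models answer best when the chat template wraps the question
messages = [{"role": "user", "content": "Explain quantum entanglement in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=80))
With the weights cached, launch the assistant: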
python main.py
You’ll see:
=== MPS Voice Assistant (Streamline Core) ===
Type 'exit' to quit.
You:
Type a question (e.g. “Explain quantum entanglement in one sentence”).
Within about two seconds the assistant prints a concise answer of under 60 words and speaks it aloud.
Because MLX uses Apple Silicon’s unified memory, the LLM and the TTS model share one memory pool, so replies flow into speech synthesis without any copy overhead.
The script waits until each WAV chunk is fully written, then plays clips sequentially, eliminating overlap.
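Conceptually, the playback loop boils down to the pattern below (a simplified sketch, not the repo’s exact code). macOS’s built-in afplay blocks until a clip finishes, which is what guarantees the sequential, overlap-free playback:
# Sequential playback sketch: each clip plays to completion before the next starts
import subprocess
from pathlib import Path

def play_clips(clips: list[Path]) -> None:
    for clip in clips:
        subprocess.run(["afplay", str(clip)], check=True)  # blocks until done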
Change the voice: pass a different voice= value, e.g. voice="af_heart" (see the Kokoro voice list).
Switch to English prompts: set language="en" in the generate_audio() call and tweak the SYSTEM_PROMPT.
Use a faster model: e.g. mlx-community/Mistral-7B-Instruct-4bit loads by changing one line in main.py. All three tweaks are sketched together below.
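Put together, the relevant edits in main.py would look roughly like this. The parameter names follow the descriptions above; the reply variable is illustrative, so check the repo for the exact call sites:
# Illustrative main.py tweaks (verify names against the actual repo code)
MODEL_NAME = "mlx-community/Mistral-7B-Instruct-4bit"  # one-line model swap
SYSTEM_PROMPT = "You are a concise assistant. Answer in English, under 60 words."

generate_audio(
    text=reply,         # the LLM's reply text
    voice="af_heart",   # any entry from the Kokoro voice list
    language="en",      # switch TTS output to English
)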
Once the model files are cached, the assistant works offline—ideal for classrooms, remote clinics, or disaster sites where connectivity is limited.
For feedback, pull requests, or performance reports on other M-chips, open an issue on our GitHub repo. Together we can keep pushing responsible, device-native AI forward.
Updated for the Streamline Core Initiative educational site – June 2025.