Agent skill
mlx
Running and fine-tuning LLMs on Apple Silicon with MLX. Use when working with models locally on Mac, converting Hugging Face models to MLX format, fine-tuning with LoRA/QLoRA on Apple Silicon, or serving models via HTTP API.
Install this agent skill to your Project
npx add-skill https://github.com/itsmostafa/llm-engineering-skills/tree/main/skills/mlx
SKILL.md
Using MLX for LLMs on Apple Silicon
MLX-LM is a Python package for running large language models on Apple Silicon, leveraging the MLX framework for optimized performance with unified memory architecture.
Table of Contents
- Core Concepts
- Installation
- Text Generation
- Interactive Chat
- Model Conversion
- Quantization
- Fine-tuning with LoRA
- Serving Models
- Best Practices
- References
Core Concepts
Why MLX
| Aspect | PyTorch on Mac | MLX |
|---|---|---|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |
MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.
Supported Models
MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community on Hugging Face for pre-converted models.
Installation
pip install mlx-lm
Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).
Text Generation
Python API
from mlx_lm import load, generate
# Load model (from HF hub or local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
# Generate text
response = generate(
model,
tokenizer,
prompt="Explain quantum computing in simple terms:",
max_tokens=256,
temp=0.7,
)
print(response)
Streaming Generation
from mlx_lm import load, stream_generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt = "Write a haiku about programming:"
for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
print(response.text, end="", flush=True)
print()
Batch Generation
from mlx_lm import load, batch_generate
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
prompts = [
"What is machine learning?",
"Explain neural networks:",
"Define deep learning:",
]
responses = batch_generate(
model,
tokenizer,
prompts,
max_tokens=100,
)
for prompt, response in zip(prompts, responses):
print(f"Q: {prompt}\nA: {response}\n")
CLI Generation
# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
--prompt "Explain recursion:" \
--max-tokens 256
# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--prompt "Write a poem about AI:" \
--temp 0.8 \
--top-p 0.95
Interactive Chat
CLI Chat
# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
Python Chat
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What's the capital of France?"},
]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
Model Conversion
Convert Hugging Face models to MLX format:
CLI Conversion
# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct \
-q # Quantize to 4-bit
# With specific quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
-q \
--q-bits 8 \
--q-group-size 64
# Upload to Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
-q \
--upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx
Python Conversion
from mlx_lm import convert
convert(
hf_path="meta-llama/Llama-3.2-3B-Instruct",
mlx_path="./llama-3.2-3b-mlx",
quantize=True,
q_bits=4,
q_group_size=64,
)
Conversion Options
| Option | Default | Description |
|---|---|---|
--q-bits |
4 | Quantization bits (4 or 8) |
--q-group-size |
64 | Group size for quantization |
--dtype |
float16 | Data type for non-quantized weights |
Quantization
MLX supports multiple quantization methods for different use cases:
| Method | Best For | Command |
|---|---|---|
| Basic | Quick conversion | mlx_lm.convert -q |
| DWQ | Quality-preserving | mlx_lm.dwq |
| AWQ | Activation-aware | mlx_lm.awq |
| Dynamic | Per-layer precision | mlx_lm.dynamic_quant |
| GPTQ | Established method | mlx_lm.gptq |
Quick Quantization
# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q
# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8
For detailed coverage of each method, see reference/quantization.md.
Fine-tuning with LoRA
MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.
Quick Start
# Prepare training data (JSONL format)
# {"text": "Your training text here"}
# or
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
--train \
--data ./data \
--iters 1000
# Generate with adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
--adapter-path ./adapters \
--prompt "Your prompt here"
Fuse Adapter into Model
# Merge LoRA weights into base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
--adapter-path ./adapters \
--save-path ./fused-model
# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
--adapter-path ./adapters \
--export-gguf
For detailed LoRA configuration and training patterns, see reference/fine-tuning.md.
Serving Models
OpenAI-Compatible Server
# Start server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080
# Use with OpenAI client
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256
}'
Python Client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
max_tokens=100,
)
print(response.choices[0].message.content)
Best Practices
-
Use pre-quantized models: Download from
mlx-communityon Hugging Face for immediate use -
Match quantization to your hardware: M1/M2 with 8GB: use 4-bit; M2/M3 Pro/Max: 8-bit for quality
-
Leverage unified memory: Unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower but works)
-
Use streaming for UX:
stream_generateprovides responsive output for interactive applications -
Cache prompt prefixes: Use
mlx_lm.cache_promptfor repeated prompts with varying suffixes -
Batch similar requests:
batch_generateis more efficient than sequential generation -
Start with 4-bit quantization: Good quality/size tradeoff; upgrade to 8-bit if quality issues
-
Fuse adapters for deployment: After fine-tuning, fuse adapters for faster inference without loading separately
-
Monitor memory with Activity Monitor: Watch memory pressure to avoid swap thrashing
-
Use chat templates: Always apply
tokenizer.apply_chat_template()for instruction-tuned models
References
See reference/ for detailed documentation:
quantization.md- Detailed quantization methods and when to use eachfine-tuning.md- Complete LoRA/QLoRA training guide with data formats and configuration
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
prompt-engineering
Crafting effective prompts for LLMs. Use when designing prompts, improving output quality, structuring complex instructions, or debugging poor model responses.
qlora
Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.
agents
Patterns and architectures for building AI agents and workflows with LLMs. Use when designing systems that involve tool use, multi-step reasoning, autonomous decision-making, or orchestration of LLM-driven tasks.
rlhf
Understanding Reinforcement Learning from Human Feedback (RLHF) for aligning language models. Use when learning about preference data, reward modeling, policy optimization, or direct alignment algorithms like DPO.
transformers
Loading and using pretrained models with Hugging Face Transformers. Use when working with pretrained models from the Hub, running inference with Pipeline API, fine-tuning models with Trainer, or handling text, vision, audio, and multimodal tasks.
pytorch
Building and training neural networks with PyTorch. Use when implementing deep learning models, training loops, data pipelines, model optimization with torch.compile, distributed training, or deploying PyTorch models.
Didn't find tool you were looking for?