Agent skill

awq-quantization

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

Stars 23,776
Forks 2,298

Install this agent skill to your Project

npx add-skill https://github.com/davila7/claude-code-templates/tree/main/cli-tool/components/skills/ai-research/optimization-awq

SKILL.md

AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

When to use AWQ

Use AWQ when:

  • Need 4-bit quantization with <5% accuracy loss
  • Deploying instruction-tuned or chat models (AWQ generalizes better)
  • Want ~2.5-3x inference speedup over FP16
  • Using vLLM for production serving
  • Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

Use GPTQ instead when:

  • Need maximum ecosystem compatibility (more tools support GPTQ)
  • Working with ExLlamaV2 backend specifically
  • Have older GPUs without Marlin support

Use bitsandbytes instead when:

  • Need zero calibration overhead (quantize on-the-fly)
  • Want to fine-tune with QLoRA
  • Prefer simpler integration

Quick start

Installation

bash
# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]

Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+

Load pre-quantized model

python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantize your own model

python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,      # Use zero-point quantization
    "q_group_size": 128,     # Group size (128 recommended)
    "w_bit": 4,              # 4-bit weights
    "version": "GEMM"        # GEMM for batch, GEMV for single-token
}

# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")

Timing: ~10-15 min for 7B, ~1 hour for 70B models.

AWQ vs GPTQ vs bitsandbytes

Feature AWQ GPTQ bitsandbytes
Speedup (4-bit) ~2.5-3x ~2x ~1.5x
Accuracy loss <5% ~5-10% ~5-15%
Calibration Minimal (128-1K tokens) More extensive None
Overfitting risk Low Higher N/A
Best for Production inference GPU inference Easy integration
vLLM support Native Yes Limited

Key insight: AWQ assumes not all weights are equally important. It protects ~1% of salient weights identified by activation patterns, reducing quantization error without mixed-precision overhead.

Kernel backends

GEMM (default, batch inference)

python
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}

GEMV (single-token generation)

python
quant_config = {
    "version": "GEMV"  # 20% faster for batch_size=1
}

Limitation: Only batch size 1, not good for large context.

Marlin (Ampere+ GPUs)

python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # 2x faster on A100/H100
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)

Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

ExLlamaV2 (AMD compatible)

python
config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)

HuggingFace Transformers integration

Direct loading

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")

Fused modules (recommended)

python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)

Note: Fused modules cannot combine with FlashAttention2.

vLLM integration

python
from vllm import LLM, SamplingParams

# vLLM auto-detects AWQ models
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)

Performance benchmarks

Memory reduction

Model FP16 AWQ 4-bit Reduction
Mistral 7B 14 GB 5.5 GB 2.5x
Llama 2-13B 26 GB 10 GB 2.6x
Llama 2-70B 140 GB 35 GB 4x

Inference speed (RTX 4090)

Model Prefill (tok/s) Decode (tok/s) Memory
Mistral 7B GEMM 3,897 114 5.55 GB
TinyLlama 1B GEMV 5,179 431 2.10 GB
Llama 2-13B GEMM 2,279 74 10.28 GB

Accuracy (perplexity)

Model FP16 AWQ 4-bit Degradation
Llama 3 8B 8.20 8.48 +3.4%
Mistral 7B 5.25 5.42 +3.2%
Qwen2 72B 4.85 4.95 +2.1%

Custom calibration data

python
# Use custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",       # Or custom list of strings
    max_calib_samples=256,       # More samples = better accuracy
    max_calib_seq_len=512        # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)

Multi-GPU deployment

python
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)

Supported models

35+ architectures including:

  • Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
  • Qwen: Qwen, Qwen2, Qwen2.5-VL
  • Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
  • Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

Common issues

CUDA OOM during quantization:

python
# Reduce batch size
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)

Slow inference:

python
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)

AMD GPU support:

python
# Use ExLlama backend
config = AwqConfig(bits=4, version="exllama")

Deprecation notice

AutoAWQ is officially deprecated. For new projects, consider:

Existing quantized models remain usable.

References

Expand your agent's capabilities with these related and highly-rated skills.

davila7/claude-code-templates

verl-rl-training

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

23,776 2,298
Explore
davila7/claude-code-templates

openrlhf-training

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

23,776 2,298
Explore
davila7/claude-code-templates

gguf-quantization

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

23,776 2,298
Explore
davila7/claude-code-templates

Claude Code Guide

Master guide for using Claude Code effectively. Includes configuration templates, prompting strategies "Thinking" keywords, debugging techniques, and best practices for interacting with the agent.

23,776 2,298
Explore
davila7/claude-code-templates

qdrant-vector-search

High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.

23,776 2,298
Explore
davila7/claude-code-templates

behavioral-modes

AI operational modes (brainstorm, implement, debug, review, teach, ship, orchestrate). Use to adapt behavior based on task type.

23,776 2,298
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results