Agent skills
rwkv-architecture

Agent skill

rwkv-architecture

RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.

View SKILL.md on GitHub Repository

Stars 23,776

Forks 2,298

Install this agent skill to your Project

npx add-skill https://github.com/davila7/claude-code-templates/tree/main/cli-tool/components/skills/ai-research/model-architecture-rwkv

SKILL.md

RWKV - Receptance Weighted Key Value

Quick start

RWKV (RwaKuv) combines Transformer parallelization (training) with RNN efficiency (inference).

Installation:

bash

# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade

# Install RWKV
pip install rwkv

Basic usage (GPT mode + RNN mode):

python

import os
from rwkv.model import RWKV

os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'  # Use CUDA kernel for speed

# Load model
model = RWKV(
    model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
    strategy='cuda fp16'
)

# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy())  # Logits

# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None)  # First 2 tokens
out, state = model.forward([1563], state)      # Next token
out, state = model.forward([310, 247], state)  # Last tokens
print(out.detach().cpu().numpy())  # Same logits as above!

Common workflows

Workflow 1: Text generation (streaming)

Efficient token-by-token generation:

python

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")

# Initial prompt
prompt = "The future of AI is"
state = None

# Generate token by token
for token in prompt:
    out, state = pipeline.model.forward(pipeline.encode(token), state)

# Continue generation
for _ in range(100):
    out, state = pipeline.model.forward(None, state)
    token = pipeline.sample_logits(out)
    print(pipeline.decode(token), end='', flush=True)

Key advantage: Constant memory per token (no growing KV cache)

Workflow 2: Long context processing (infinite context)

Process million-token sequences:

python

model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')

# Process very long document
state = None
long_document = load_document()  # e.g., 1M tokens

# Stream through entire document
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)

# State now contains information from entire 1M token document
# Memory usage: O(1) (constant, not O(n)!)

Workflow 3: Fine-tuning RWKV

Standard fine-tuning workflow:

python

# Training script
import pytorch_lightning as pl
from rwkv.model import RWKV
from rwkv.trainer import RWKVTrainer

# Configure model
config = {
    'n_layer': 24,
    'n_embd': 1024,
    'vocab_size': 50277,
    'ctx_len': 1024
}

# Setup trainer
trainer = pl.Trainer(
    accelerator='gpu',
    devices=8,
    precision='bf16',
    strategy='deepspeed_stage_2',
    max_epochs=1
)

# Train
model = RWKV(config)
trainer.fit(model, train_dataloader)

Workflow 4: RWKV vs Transformer comparison

Memory comparison (1M token sequence):

python

# Transformer (GPT)
# Memory: O(n²) for attention
# KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
# Example: 1M × 4096 × 24 × 2 = ~400GB (impractical!)

# RWKV
# Memory: O(1) per token
# State: hidden_dim × n_layers = 4096 × 24 = ~400KB
# 1,000,000× more efficient!

Speed comparison (inference):

python

# Transformer: O(n) per token (quadratic overall)
# First token: 1 computation
# Second token: 2 computations
# ...
# 1000th token: 1000 computations

# RWKV: O(1) per token (linear overall)
# Every token: 1 computation
# 1000th token: 1 computation (same as first!)

When to use vs alternatives

Use RWKV when:

Need very long context (100K+ tokens)
Want constant memory usage
Building streaming applications
Need RNN efficiency with Transformer performance
Memory-constrained deployment

Key advantages:

Linear time: O(n) vs O(n²) for Transformers
No KV cache: Constant memory per token
Infinite context: No fixed window limit
Parallelizable training: Like GPT
Sequential inference: Like RNN

Use alternatives instead:

Transformers: Need absolute best performance, have compute
Mamba: Want state-space models
RetNet: Need retention mechanism
Hyena: Want convolution-based approach

Common issues

Issue: Out of memory during training

Use gradient checkpointing and DeepSpeed:

python

trainer = pl.Trainer(
    strategy='deepspeed_stage_3',  # Full ZeRO-3
    precision='bf16'
)

Issue: Slow inference

Enable CUDA kernel:

python

os.environ["RWKV_CUDA_ON"] = '1'

Issue: Model not loading

Check model path and strategy:

python

model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16'  # Or 'cpu fp32' for CPU
)

Issue: State management in RNN mode

Always pass state between forward calls:

python

# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None)  # No context from tokens1!

# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state)  # Has context from tokens1

Advanced topics

Time-mixing and channel-mixing: See references/architecture-details.md for WKV operation, time-decay mechanism, and receptance gates.

State management: See references/state-management.md for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.

RWKV-7 improvements: See references/rwkv7.md for latest architectural improvements (March 2025) and multimodal capabilities.

Hardware requirements

GPU: NVIDIA (CUDA 11.6+) or CPU
VRAM (FP16):
- 169M model: 1GB
- 430M model: 2GB
- 1.5B model: 4GB
- 3B model: 8GB
- 7B model: 16GB
- 14B model: 32GB
Inference: O(1) memory per token
Training: Parallelizable like GPT

Performance (vs Transformers):

Speed: Similar training, faster inference
Memory: 1000× less for long sequences
Scaling: Linear vs quadratic

Resources

Paper (RWKV): https://arxiv.org/abs/2305.13048 (May 2023)
Paper (RWKV-7): https://arxiv.org/abs/2503.14456 (March 2025)
GitHub: https://github.com/BlinkDL/RWKV-LM ⭐ 12,000+
Docs: https://wiki.rwkv.com/
Models: https://huggingface.co/BlinkDL
Linux Foundation AI: Official project
Production: Microsoft Windows, Office integration, NeMo support

Maintainer

davila7 Core maintainer

Source details

Full Name: davila7/claude-code-templates
Branch: main
Path in repo: cli-tool/components/skills/ai-research/model-architecture-rwkv
License: MIT License
Topics: claude-code anthropic anthropic-claude claude

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

davila7/claude-code-templates

verl-rl-training

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

23,776 2,298

Explore

davila7/claude-code-templates

openrlhf-training

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

23,776 2,298

Explore

davila7/claude-code-templates

gguf-quantization

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

23,776 2,298

Explore

davila7/claude-code-templates

Claude Code Guide

Master guide for using Claude Code effectively. Includes configuration templates, prompting strategies "Thinking" keywords, debugging techniques, and best practices for interacting with the agent.

23,776 2,298

Explore

davila7/claude-code-templates

qdrant-vector-search

High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.

23,776 2,298

Explore

davila7/claude-code-templates

behavioral-modes

AI operational modes (brainstorm, implement, debug, review, teach, ship, orchestrate). Use to adapt behavior based on task type.

23,776 2,298

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

RWKV - Receptance Weighted Key Value

Quick start

Common workflows

Workflow 1: Text generation (streaming)

Workflow 2: Long context processing (infinite context)

Workflow 3: Fine-tuning RWKV

Workflow 4: RWKV vs Transformer comparison

When to use vs alternatives

Common issues

Advanced topics

Hardware requirements

Resources

Recommended Agent Skills

verl-rl-training

openrlhf-training

gguf-quantization

Claude Code Guide

qdrant-vector-search

behavioral-modes