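Performance Optimization

Core Questions

Where is the bottleneck? Is the optimization worth it?

Measure first, then optimize. Don't guess.

Optimization Priorities

1. Algorithm choice     (10x - 1000x)   ← biggest payoff
2. Data structures      (2x - 10x)
3. Fewer allocations    (2x - 5x)
4. Cache optimization   (1.5x - 3x)
5. SIMD / parallelism   (2x - 8x)

Warning: premature optimization is the root of all evil. Get the code working first, then optimize the hot spots.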
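
To make the first item of the priority list concrete, here is a minimal sketch of the same task solved with two algorithms. The function names are hypothetical and chosen only for this illustration; the actual speedup depends on input size and data.

```rust
use std::collections::HashSet;

/// O(n^2): compares every pair of elements.
pub fn has_duplicate_quadratic(items: &[u64]) -> bool {
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            if items[i] == items[j] {
                return true;
            }
        }
    }
    false
}

/// O(n) expected time: one pass with a hash set.
pub fn has_duplicate_hashed(items: &[u64]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    // `insert` returns false when the value was already present.
    items.iter().any(|x| !seen.insert(*x))
}
```

For a few dozen elements the quadratic version may well be faster, which is one more reason to measure before switching.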

Measurement Tools

Benchmark

bash

# Run the benchmarks defined in the crate
cargo bench
# The criterion crate adds statistical analysis to benchmarks
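
For a quick sanity check without extra dependencies, a hand-rolled timer can be sketched with the standard library alone. `average_time` is a hypothetical helper written for this example. It does no warm-up and no outlier handling, so it is not a substitute for criterion's statistics.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the average wall-clock time per call.
pub fn average_time<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the unused result.
        black_box(f());
    }
    start.elapsed() / iters
}
```

A call such as `average_time(1_000, || expensive(&input))` gives a rough per-call figure. Build with `--release`, otherwise the numbers say little about production behavior.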

Profiling

Tool	Purpose
`perf` / `flamegraph`	CPU flame graphs
`heaptrack`	Allocation tracking
`valgrind --tool=cachegrind`	Cache analysis
`dhat`	Heap allocation analysis

Common Optimization Techniques

1. Preallocate

rust

// ❌ Reallocates every time the capacity runs out
let mut vec = Vec::new();
for i in 0..1000 {
    vec.push(i);
}

// ✅ Preallocate when the size is known
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
    vec.push(i);
}
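
The difference between the two loops can be observed directly by watching the capacity change. This is a small sketch; `count_reallocations` is a hypothetical helper, and the exact growth pattern of `Vec` is an implementation detail that may change between Rust versions.

```rust
/// Pushes `n` integers and counts how many times the Vec had to grow.
pub fn count_reallocations(mut vec: Vec<u32>, n: u32) -> usize {
    let mut grows = 0;
    let mut last_capacity = vec.capacity();
    for i in 0..n {
        vec.push(i);
        // A capacity change means the buffer was reallocated.
        if vec.capacity() != last_capacity {
            grows += 1;
            last_capacity = vec.capacity();
        }
    }
    grows
}
```

Passing `Vec::with_capacity(1000)` and pushing 1000 items yields zero growth steps, while `Vec::new()` grows several times along the way.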

2. Avoid clone

rust

// ❌ Unnecessary clone
fn process(item: &Item) {
    let data = item.data.clone();
    // ...
}

// ✅ Borrow instead
fn process(item: &Item) {
    let data = &item.data;
    // ...
}

3. Batch Operations

rust

// ❌ One database call per item
for user_id in user_ids {
    db.update(user_id, status)?;
}

// ✅ A single batched update
db.update_all(user_ids, status)?;

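4. Small-Object Optimization

rust

// Use SmallVec for collections that are usually small
use smallvec::SmallVec;
let mut vec: SmallVec<[u8; 16]> = SmallVec::new();
// Up to 16 elements are stored inline, with no heap allocation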

5. Parallel Processing

rust

use rayon::prelude::*;
let sum: i32 = data
    .par_iter()
    .map(|x| expensive(x))
    .sum();
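
To show roughly what `par_iter` saves you from writing, here is a dependency-free sketch of the same map-and-sum using scoped threads from the standard library. This is a different technique from rayon: there is no work stealing, only fixed chunks. `parallel_sum` and its parameters are hypothetical names for this example.

```rust
use std::thread;

/// Sums `f(x)` over `data`, splitting the slice across up to `num_threads` threads.
pub fn parallel_sum(data: &[i32], num_threads: usize, f: fn(i32) -> i64) -> i64 {
    // One chunk per thread; chunks() panics on a size of 0, hence the max(1).
    let chunk_size = data.len().div_ceil(num_threads.max(1)).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&x| f(x)).sum::<i64>()))
            .collect();
        // Join every worker and add up the partial sums.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<i64>()
    })
}
```

With uneven per-item cost, fixed chunks leave some threads idle, which is the case where rayon's work stealing tends to pay off.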

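Anti-Patterns

Anti-pattern	Why it hurts	Better approach
Cloning to dodge lifetimes	Runtime cost	Design ownership properly
Boxing everything	Indirection cost	Prefer stack allocation
HashMap for small datasets	Too much overhead	Vec + linear search
String concatenation in a loop	Repeated reallocation; O(n²) if every iteration builds a new String	`String::with_capacity` + `push_str`, or a single `format!`
LinkedList	Cache-unfriendly	`Vec` or `VecDeque`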
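
As an illustration of the string-concatenation row, the output buffer can be sized once up front and then filled with `push_str`. This is a minimal sketch; `join_preallocated` is a hypothetical name.

```rust
/// Joins `parts` with `sep`, allocating the output buffer exactly once.
pub fn join_preallocated(parts: &[&str], sep: &str) -> String {
    // Exact final length: all parts plus one separator between neighbours.
    let total: usize = parts.iter().map(|p| p.len()).sum::<usize>()
        + sep.len() * parts.len().saturating_sub(1);
    let mut out = String::with_capacity(total);
    for (i, part) in parts.iter().enumerate() {
        if i > 0 {
            out.push_str(sep);
        }
        out.push_str(part);
    }
    out
}
```

For this particular task the standard library's slice `join` method already does the job; the sketch only shows the pattern for loops that build strings in other ways.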

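Troubleshooting Common Problems

Symptom	Likely cause	How to investigate
Memory keeps growing	Leaks, unbounded accumulation	heaptrack
High CPU usage	Algorithmic problem	flamegraph
Unstable response times	Allocation spikes	dhat
Low throughput	Serial processing	Parallelize with rayon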

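Optimization Checklist

Did you measure? Don't guess.
Have you confirmed the bottleneck?
Is the algorithm optimal?
Is the data structure appropriate?
Have you cut unnecessary allocations?
Can it be parallelized?
Is memory released promptly? (RAII)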

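Advanced Performance Optimization

The following applies to multithreaded, high-concurrency scenarios.

Why is multithreaded code sometimes slower?

Performance problems often hide where you can't see them.

False Sharing

Symptoms

rust

use std::sync::atomic::AtomicU64;

// Problem code: several AtomicU64 fields packed into one struct,
// so they are likely to share a cache line
struct ShardCounters {
    inflight: AtomicU64,
    completed: AtomicU64,
}

One CPU core sits above 90% for long periods
perf shows many LLC misses
Unusually many atomic RMW operations
Adding threads makes things slower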

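Diagnosis

bash

# Analyze with perf
perf stat -d ./my_program
# Look at LLC-load-misses; event names for locked instructions vary by CPU (see `perf list`)

# Flame graph
cargo flamegraph
# Look for atomic fetch_add hot spots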

Fix: Cache Line Padding

rust

use std::sync::atomic::AtomicU64;

// Give each field its own cache line (64 bytes is typical on x86-64)
#[repr(align(64))]
struct PaddedAtomicU64(AtomicU64);

struct ShardCounters {
    inflight: PaddedAtomicU64,
    completed: PaddedAtomicU64,
}

Verification

rust

// Benchmark the two layouts against each other
fn bench_naive() { /* several AtomicU64 side by side */ }
fn bench_padded() { /* each on its own cache line */ }
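
The two benchmark stubs could be filled in along the following lines. This is a sketch under stated assumptions: a 64-byte cache line, two threads, and hypothetical names (`NaiveCounters`, `Padded`, `PaddedCounters`, `hammer`) that do not come from the original code. Whether the padded layout is measurably faster depends on the CPU and on how the threads are scheduled.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Two counters that will usually share one cache line.
#[derive(Default)]
pub struct NaiveCounters {
    pub a: AtomicU64,
    pub b: AtomicU64,
}

/// Wrapper that forces 64-byte alignment (and therefore 64-byte size).
#[derive(Default)]
#[repr(align(64))]
pub struct Padded(pub AtomicU64);

/// The same two counters, each in its own 64-byte-aligned slot.
#[derive(Default)]
pub struct PaddedCounters {
    pub a: Padded,
    pub b: Padded,
}

/// Two threads each increment their own counter `iters` times;
/// returns the wall-clock time until both have finished.
pub fn hammer(a: &AtomicU64, b: &AtomicU64, iters: u64) -> Duration {
    let start = Instant::now();
    thread::scope(|s| {
        s.spawn(|| {
            for _ in 0..iters {
                a.fetch_add(1, Ordering::Relaxed);
            }
        });
        s.spawn(|| {
            for _ in 0..iters {
                b.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
    start.elapsed()
}
```

Call `hammer(&naive.a, &naive.b, n)` and `hammer(&padded.a.0, &padded.b.0, n)` with a large `n` in a release build and compare the two durations over several runs.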

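Lock Contention

Symptoms

rust

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// One globally shared HashMap; every thread contends for the same lock
let shared: Arc<Mutex<HashMap<String, usize>>> = Arc::new(Mutex::new(HashMap::new()));

A large share of time is spent in mutex lock/unlock
Adding threads lowers performance instead of raising it
System (kernel) time is a large share of CPU time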

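Fix: Per-Thread Local Counting

rust

use std::collections::HashMap;
use std::thread;

// Each thread counts into its own HashMap; the results are merged at the end
pub fn parallel_count(data: &[String], num_threads: usize) -> HashMap<String, usize> {
    // One chunk per thread; chunks() panics on a size of 0, hence the max(1)
    let chunk_size = data.len().div_ceil(num_threads.max(1)).max(1);

    // Scoped threads may borrow `data`; plain thread::spawn would require 'static
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut local: HashMap<&str, usize> = HashMap::new();
                    for key in chunk {
                        *local.entry(key.as_str()).or_insert(0) += 1;
                    }
                    local // return the local counts
                })
            })
            .collect();

        // Merge all local results
        let mut result = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *result.entry(k.to_owned()).or_insert(0) += v;
            }
        }
        result
    })
}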

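NUMA Awareness

Problem Scenario

rust

// Multi-socket server; memory ends up allocated on a remote NUMA node
let pool = ArenaPool::new(num_threads);
// Rayon's work stealing lets a task run on any thread,
// so cross-NUMA accesses pay a heavy remote-memory latency penalty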

Fix

rust

// 1. Bind to a NUMA node
let numa_node = detect_numa_node();
let pool = NumaAwarePool::new(numa_node);

// 2. Use one allocator everywhere (jemalloc)
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// 3. Avoid cloning objects across NUMA nodes
// Borrow directly instead of copying the data

Tools

bash

# Inspect the NUMA topology
numactl --hardware

# Bind to a NUMA node
numactl --cpunodebind=0 --membind=0 ./my_program

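Data Structure Optimization

HashMap vs Sharding

Scenario	Approach	Why
Highly concurrent writes	DashMap or sharding	Less lock contention
Read-heavy, few writes	RwLock<HashMap>	Readers don't block each other
Small datasets	Vec + linear search	HashMap has more overhead
Fixed keys	Enum + array	No hashing cost at all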
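
The "Enum + array" row can be sketched as follows: when the key set is known at compile time, the enum discriminant serves as the array index and no hashing takes place. `Metric` and `MetricCounters` are hypothetical names for this example.

```rust
/// Fixed set of keys known at compile time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Metric {
    Requests,
    Errors,
    Timeouts,
}

impl Metric {
    /// Number of variants; must be kept in sync with the enum by hand.
    pub const COUNT: usize = 3;
}

/// Counter table indexed by enum discriminant instead of a hashed key.
#[derive(Default)]
pub struct MetricCounters {
    counts: [u64; Metric::COUNT],
}

impl MetricCounters {
    pub fn add(&mut self, metric: Metric, n: u64) {
        self.counts[metric as usize] += n;
    }

    pub fn get(&self, metric: Metric) -> u64 {
        self.counts[metric as usize]
    }
}
```

The cost is rigidity: adding a key means adding a variant and updating `COUNT`, so this only fits key sets that rarely change.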

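Example: Read-Heavy, Few Writes

rust

use std::collections::HashMap;
use std::sync::RwLock;

// Many reads, few updates (ConfigValue must implement Clone)
struct Config {
    map: RwLock<HashMap<String, ConfigValue>>,
}

impl Config {
    pub fn get(&self, key: &str) -> Option<ConfigValue> {
        // std's read()/write() return a LockResult; unwrap() panics if the lock is poisoned.
        // With parking_lot's RwLock the guard is returned directly and no unwrap is needed.
        self.map.read().unwrap().get(key).cloned()
    }

    pub fn update(&self, key: String, value: ConfigValue) {
        self.map.write().unwrap().insert(key, value);
    }
}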

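Common Pitfalls at a Glance

Pitfall	Symptom	Fix
Adjacent atomic variables	False sharing	`#[repr(align(64))]`
Global Mutex	Lock contention	Local counting + merge
Cross-NUMA allocation	Remote memory access	NUMA-aware allocation
Frequent small allocations	Allocator pressure	Object pool
Dynamic string keys	Extra allocations	Use integer IDs instead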
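
The last row, replacing string keys with integer IDs, is commonly done with a string interner. The sketch below is one simple way to build one; `Interner`, `intern`, and `resolve` are hypothetical names, and this version stores each string twice for simplicity.

```rust
use std::collections::HashMap;

/// Maps each distinct string to a small integer ID, assigned on first sight.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl Interner {
    /// Returns the ID for `name`, allocating only the first time it is seen.
    pub fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.ids.insert(name.to_owned(), id);
        self.names.push(name.to_owned());
        id
    }

    /// Looks the original string back up from its ID.
    pub fn resolve(&self, id: u32) -> Option<&str> {
        self.names.get(id as usize).map(String::as_str)
    }
}
```

Hot paths can then key their maps or arrays by `u32` and only touch the strings at the boundary, for example when parsing input or printing results.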

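Performance Diagnostic Tools

Tool	Purpose
`perf stat -d`	CPU cycles, cache hit rates
`perf record -g`	Sampling for flame graphs
`valgrind --tool=cachegrind`	Cache analysis
`jemalloc profiling`	Memory allocation analysis
`numactl`	NUMA topology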
