gotgt/docs/PERFORMANCE_OPTIMIZATIONS.md
2026-03-14 11:45:35 +08:00


# Performance Optimizations for gotgt
This document describes the performance optimizations implemented for gotgt, focusing on NUMA-aware memory allocation and io_uring backend storage support.
## Overview
Two major performance optimizations have been implemented:
1. **NUMA-Aware Memory Allocation** - Optimizes memory access patterns on multi-socket systems
2. **io_uring Backend Storage** - Provides high-performance asynchronous I/O on Linux 5.1+
## 1. NUMA-Aware Memory Allocation
### What is NUMA?
Non-Uniform Memory Access (NUMA) is a memory design used in multi-processor systems where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).
### Implementation
The NUMA support is implemented in `pkg/util/numa/`:
- **Topology Detection** (`numa.go`, `numa_linux.go`): Automatically detects NUMA topology using `/sys/devices/system/node/` filesystem
- **NUMA-Local Buffer Pool** (`pool.go`): Provides buffer pools that allocate memory from local NUMA nodes
- **Thread Pinning** (`numa_linux.go`): Allows threads to be pinned to specific NUMA nodes
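
As an illustration of the topology detection step, each node directory under `/sys/devices/system/node/` exposes a `cpulist` file in the kernel's range format (for example `0-3,8-11`). The sketch below shows one way such a string could be expanded into CPU IDs. The function name `parseCPUList` is hypothetical and is not taken from the gotgt source.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel cpulist string such as "0-3,8-11" into
// individual CPU IDs. This is the format found in
// /sys/devices/system/node/node<N>/cpulist.
// Hypothetical helper for illustration only.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil // a memory-only node has no CPUs
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad cpulist entry %q: %w", part, err)
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad cpulist entry %q: %w", part, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for c := start; c <= end; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}
```

A caller would read the file with `os.ReadFile` and pass its contents as a string; the trailing newline is trimmed by the helper.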
### Key Components
#### NUMABufferPool
```go
pool := numa.NewNUMABufferPool(&numa.BufferPoolConfig{
	BufferSize:      256 * 1024, // 256KB buffers
	PerNodePoolSize: 1024,       // 1024 buffers per node
	EnableNUMA:      true,
})
buf := pool.Get() // Get buffer from local NUMA node
// use buffer...
pool.Put(buf) // Return buffer to pool
```
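
To make the idea of a per-node pool concrete, here is a minimal sketch of the bookkeeping side only: one bounded free list per NUMA node. It is not the gotgt implementation, and the names `nodePools`, `get`, and `put` are hypothetical. Plain `make([]byte, n)` does not control which node backs the memory, so a real pool would presumably need node-bound allocation (for example `mmap` plus `mbind`), which is left out here.

```go
package main

// nodePools keeps one bounded free list of fixed-size buffers per NUMA node.
// Hypothetical sketch: it models the pooling logic, not NUMA placement.
type nodePools struct {
	bufSize int
	free    []chan []byte // index is the NUMA node ID
}

func newNodePools(nodes, perNode, bufSize int) *nodePools {
	p := &nodePools{bufSize: bufSize, free: make([]chan []byte, nodes)}
	for i := range p.free {
		p.free[i] = make(chan []byte, perNode)
	}
	return p
}

// get returns a pooled buffer for the given node, or allocates a new one
// when that node's free list is empty.
func (p *nodePools) get(node int) []byte {
	select {
	case b := <-p.free[node]:
		return b
	default:
		return make([]byte, p.bufSize)
	}
}

// put hands buf back to the node's free list. Buffers that are too small,
// or that arrive when the list is already full, are dropped for the GC.
func (p *nodePools) put(node int, buf []byte) {
	if cap(buf) < p.bufSize {
		return
	}
	select {
	case p.free[node] <- buf[:p.bufSize]:
	default:
	}
}
```

Buffered channels are used here because they give a bounded, goroutine-safe free list without extra locking; `sync.Pool` would also work but does not bound the pool size.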
#### Thread Pinning
```go
// Pin the calling thread to NUMA node 0
numa.PinThreadToNode(0)
defer numa.UnpinThread()
// Or use RunOnNode for a function
numa.RunOnNode(0, func() {
	// This function runs on NUMA node 0
})
```
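
Affinity applies to OS threads, while the Go scheduler is free to move goroutines between threads, so a helper in the style of `RunOnNode` has to lock the goroutine to its thread before pinning. The sketch below shows that pattern in isolation. `runPinned` and its `pin` callback are hypothetical: `pin` stands in for whatever sets the affinity (for example a `sched_setaffinity(2)` wrapper) and is injected so the example stays portable. How gotgt does this internally is not shown in this document.

```go
package main

import "runtime"

// runPinned runs fn on a dedicated OS thread after applying pin to it.
// Hypothetical sketch: pin is a placeholder for the real affinity call.
func runPinned(pin func() error, fn func()) error {
	errc := make(chan error, 1)
	go func() {
		// Tie this goroutine to its OS thread. The lock is deliberately
		// never released: when a locked goroutine exits, the runtime
		// terminates the thread, so the modified affinity cannot leak
		// to unrelated goroutines.
		runtime.LockOSThread()
		if err := pin(); err != nil {
			errc <- err
			return
		}
		fn()
		errc <- nil
	}()
	return <-errc
}
```

The call blocks until `fn` has finished, and a failed `pin` is reported without running `fn`.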
### Performance Benefits
- Reduced memory latency by accessing local NUMA nodes
- Better cache utilization
- Reduced cross-socket traffic
- Predictable performance on multi-socket systems
### Configuration
Enable NUMA support in the configuration file:
```json
{
  "performance": {
    "enableNUMA": true,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
```
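
For readers wiring this into their own tooling, the following sketch decodes the `performance` block with `encoding/json` and fills in the documented defaults (1024 buffers per node, 256KB each) when a value is omitted. The struct and function names are assumptions made for illustration; only the JSON keys come from this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// performanceConfig mirrors the "performance" keys shown above.
// The Go type and field names are illustrative, not gotgt's own.
type performanceConfig struct {
	EnableNUMA         bool `json:"enableNUMA"`
	NUMABufferPoolSize int  `json:"numaBufferPoolSize"`
	NUMABufferSize     int  `json:"numaBufferSize"`
}

type config struct {
	Performance performanceConfig `json:"performance"`
}

// parseConfig decodes the JSON and applies the documented defaults to any
// buffer setting that was left out.
func parseConfig(data []byte) (config, error) {
	var c config
	if err := json.Unmarshal(data, &c); err != nil {
		return c, fmt.Errorf("parse config: %w", err)
	}
	if c.Performance.NUMABufferPoolSize == 0 {
		c.Performance.NUMABufferPoolSize = 1024 // buffers per node
	}
	if c.Performance.NUMABufferSize == 0 {
		c.Performance.NUMABufferSize = 256 * 1024 // bytes
	}
	return c, nil
}
```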
## 2. io_uring Backend Storage
### What is io_uring?
io_uring is a Linux kernel interface for asynchronous I/O that was introduced in Linux 5.1. It provides a highly efficient interface for submitting and completing I/O operations with minimal system call overhead.
### Benefits of io_uring
- Reduced system call overhead (batching of operations)
- Lower latency for I/O operations
- Higher throughput especially for high queue depth workloads
- Better CPU efficiency
### Implementation
The io_uring backend is implemented in `pkg/scsi/backingstore/iouring/`:
- **Async I/O Operations**: Read, Write, and Fsync using io_uring
- **Queue Management**: Configurable queue depth
- **Fallback Support**: Automatically falls back to regular I/O on older kernels
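
The fallback decision can be sketched as a kernel version check on the release string reported by `uname -r`. This is a hedged illustration with hypothetical function names, not the check gotgt actually performs. A version comparison is only a heuristic: a more robust probe would attempt `io_uring_setup` and fall back if the call fails.

```go
package main

import "fmt"

// kernelAtLeast reports whether a kernel release string such as
// "5.15.0-91-generic" is at least major.minor.
// Hypothetical helper for illustration only.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	var gotMajor, gotMinor int
	if _, err := fmt.Sscanf(release, "%d.%d", &gotMajor, &gotMinor); err != nil {
		return false, fmt.Errorf("unrecognized kernel release %q: %w", release, err)
	}
	return gotMajor > major || (gotMajor == major && gotMinor >= minor), nil
}

// chooseBackend returns "file" when io_uring was requested but the kernel
// is older than 5.1 or its release string cannot be parsed.
func chooseBackend(requested, release string) string {
	if requested != "iouring" {
		return requested
	}
	if ok, err := kernelAtLeast(release, 5, 1); err != nil || !ok {
		return "file"
	}
	return "iouring"
}
```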
### Usage
Enable io_uring in the storage configuration:
```json
{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "ioUringQueueDepth": 4096
    }
  ],
  "performance": {
    "enableIoUring": true,
    "ioUringQueueDepth": 4096
  }
}
```
### Backend Type Options
- `file` - Standard synchronous file I/O (default)
- `iouring` - io_uring-based asynchronous I/O (Linux 5.1+)
### Requirements
- Linux kernel 5.1 or later
- x86_64, ARM64, or other supported architectures
- O_DIRECT support recommended for best performance
## 3. Combined Configuration Example
For maximum performance, combine both NUMA and io_uring:
```json
{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "enableNUMA": true,
      "numaNode": 0,
      "ioUringQueueDepth": 4096
    }
  ],
  "iscsiportals": [
    {
      "id": 0,
      "portal": "192.168.1.100:3260"
    }
  ],
  "iscsitargets": {
    "iqn.2024-01.com.gotgt:fast-storage": {
      "tpgts": { "1": [0] },
      "luns": { "1": 1000 }
    }
  },
  "performance": {
    "enableNUMA": true,
    "enableIoUring": true,
    "ioUringQueueDepth": 4096,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
```
## 4. Performance Tuning Guide
### NUMA Tuning
1. **Determine NUMA Topology**:
```bash
numactl --hardware
lscpu | grep NUMA
```
2. **Align Network and Storage**:
- Ensure network interfaces are on the same NUMA node as the iSCSI process
- Place storage devices on the same NUMA node if possible
3. **Buffer Pool Sizing**:
- `numaBufferPoolSize`: Number of buffers per node (default: 1024)
- `numaBufferSize`: Size of each buffer (default: 256KB)
- Size based on expected concurrent I/O and I/O size
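
As a worked example of the sizing advice above, the sketch below estimates how many buffers a workload needs and how much memory fully populated pools reserve. Both helpers are hypothetical, and the estimate assumes the worst case in which all concurrent I/O lands on a single node.

```go
package main

// buffersNeeded estimates the per-node pool size for a number of
// concurrent I/Os of ioSize bytes, each split across bufferSize-byte
// buffers (rounded up). Hypothetical helper for illustration only.
func buffersNeeded(concurrentIOs, ioSize, bufferSize int64) int64 {
	perIO := (ioSize + bufferSize - 1) / bufferSize
	return concurrentIOs * perIO
}

// poolFootprintBytes returns the memory reserved when every per-node
// pool is fully populated.
func poolFootprintBytes(numaNodes, buffersPerNode, bufferSize int64) int64 {
	return numaNodes * buffersPerNode * bufferSize
}
```

With the defaults (1024 buffers of 262144 bytes), a two-node system reserves 2 x 1024 x 262144 bytes = 512 MiB. Serving 512 concurrent 1 MiB I/Os from 256 KiB buffers would need 2048 buffers on a node, double the default.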
### io_uring Tuning
1. **Queue Depth**:
- Higher queue depth = better throughput, higher latency
- Lower queue depth = lower latency, lower throughput
- Typical values: 128-4096 depending on workload
2. **I/O Size**:
- Match application I/O size for best efficiency
- Use direct I/O (O_DIRECT) to bypass page cache if appropriate
3. **System Limits**:
```bash
# Check current limits
ulimit -a
# Increase if needed by adding these entries to /etc/security/limits.conf:
#   * soft nofile 1048576
#   * hard nofile 1048576
```
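
The queue-depth trade-off listed under io_uring tuning follows from Little's law. As a rough rule, assuming the queue is kept full and the device is not already saturated:

$$
\text{IOPS} \approx \frac{\text{queue depth}}{\text{mean I/O latency}}
$$

For example, with an illustrative queue depth of 128 and a mean latency of 200 µs, the ceiling is about $128 / (200 \times 10^{-6}\,\text{s}) = 640{,}000$ IOPS. Once the device saturates, raising the queue depth further adds queueing latency without adding throughput.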
## 5. Benchmarking
Use the following tools to benchmark performance:
1. **fio** (Flexible I/O Tester):
```bash
fio --name=iscsi-test --ioengine=libaio --iodepth=32 \
--rw=randread --bs=4k --direct=1 --size=1G \
--filename=/dev/sdX
```
2. **iperf3** (for network bandwidth):
```bash
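# iperf3 needs its own server on the target host. To test on the iSCSI
# port, stop gotgt first, since both cannot listen on 3260 at once:
#   iperf3 -s -p 3260
iperf3 -c <target-ip> -p 3260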
```
3. **iscsi-perf** (if available from libiscsi)
## 6. Troubleshooting
### NUMA Issues
- Check if NUMA is available: `numa.Available()`
- Verify topology detection: Check logs for NUMA node count
- Thread pinning failures: Ensure sufficient privileges (CAP_SYS_NICE)
### io_uring Issues
- Kernel version check: `uname -r` (must be 5.1+)
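- io_uring availability: On Linux 6.6 and later, read `/proc/sys/kernel/io_uring_disabled`; `0` means io_uring is enabled, while a non-zero value restricts or disables it. The file's mere existence does not indicate availability, and it is absent on older kernels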
- Permission issues: Ensure user has appropriate file permissions
## 7. Future Enhancements
Potential future optimizations:
1. **DPDK Support** - Kernel-bypass networking for iSCSI
2. **SPDK Integration** - User-space NVMe driver support
3. **CPU Affinity Configuration** - Fine-grained CPU pinning
4. **Memory Interleaving** - Automatic memory interleaving policies
5. **Adaptive Buffer Sizing** - Dynamic buffer pool sizing based on workload
## References
- [io_uring by Jens Axboe](https://kernel.dk/io_uring.pdf)
- [NUMA FAQ](https://www.kernel.org/doc/html/latest/vm/numa.html)
- [iSCSI RFC 7143](https://tools.ietf.org/html/rfc7143)