# Performance Optimizations for gotgt

This document describes the performance optimizations implemented for gotgt, focusing on NUMA-aware memory allocation and io_uring backend storage support.

## Overview

Two major performance optimizations have been implemented:

1. **NUMA-Aware Memory Allocation** - Optimizes memory access patterns on multi-socket systems
2. **io_uring Backend Storage** - Provides high-performance asynchronous I/O on Linux 5.1+

## 1. NUMA-Aware Memory Allocation

### What is NUMA?

Non-Uniform Memory Access (NUMA) is a memory design used in multi-processor systems where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

### Implementation

The NUMA support is implemented in `pkg/util/numa/`:

- **Topology Detection** (`numa.go`, `numa_linux.go`): Automatically detects the NUMA topology by reading the `/sys/devices/system/node/` sysfs directory
- **NUMA-Local Buffer Pool** (`pool.go`): Provides buffer pools that allocate memory from the local NUMA node
- **Thread Pinning** (`numa_linux.go`): Allows threads to be pinned to specific NUMA nodes

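To make the topology detection step concrete, here is a minimal sketch of how NUMA nodes can be enumerated from sysfs. It is not the code in `numa_linux.go`; the names `parseNodeIDs` and `detectNodes` are hypothetical. It relies only on the fact that Linux exposes one `nodeN` directory per NUMA node under `/sys/devices/system/node/`.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// sysfsNodeDir is where Linux exposes one nodeN directory per NUMA node.
const sysfsNodeDir = "/sys/devices/system/node"

// parseNodeIDs (hypothetical helper) extracts node IDs from directory entry
// names such as "node0", ignoring unrelated entries like "possible".
func parseNodeIDs(names []string) []int {
	var ids []int
	for _, name := range names {
		if !strings.HasPrefix(name, "node") {
			continue
		}
		id, err := strconv.Atoi(strings.TrimPrefix(name, "node"))
		if err != nil || id < 0 {
			continue
		}
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

// detectNodes (hypothetical helper) lists the NUMA node IDs found under dir.
// A missing directory is returned as an error so the caller can fall back to
// treating the machine as a single node.
func detectNodes(dir string) ([]int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return parseNodeIDs(names), nil
}

func main() {
	ids, err := detectNodes(sysfsNodeDir)
	if err != nil || len(ids) == 0 {
		fmt.Println("NUMA topology not available; treating system as one node")
		return
	}
	fmt.Printf("detected %d NUMA node(s): %v\n", len(ids), ids)
}
```

Keeping the name parsing separate from the directory read makes the logic testable on machines without NUMA hardware.
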
### Key Components

#### NUMABufferPool

```go
pool := numa.NewNUMABufferPool(&numa.BufferPoolConfig{
	BufferSize:      256 * 1024, // 256KB buffers
	PerNodePoolSize: 1024,       // 1024 buffers per node
	EnableNUMA:      true,
})

buf := pool.Get() // Get buffer from local NUMA node
// use buffer...
pool.Put(buf) // Return buffer to pool
```

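The bookkeeping behind a per-node pool can be sketched as one bounded free list per NUMA node. This is an illustrative assumption, not the contents of `pool.go`: the type `nodePools` and its methods are hypothetical, and the sketch takes the node index as an explicit argument instead of detecting the caller's node. It also does not control where memory is physically placed, since Go's allocator is not NUMA-aware; it only shows how buffers can be kept separate per node.

```go
package main

import "fmt"

// nodePools (hypothetical) keeps one bounded free list per NUMA node.
type nodePools struct {
	bufSize int
	free    []chan []byte
}

// newNodePools creates free lists for the given number of nodes, each able
// to hold up to perNode idle buffers of bufSize bytes.
func newNodePools(nodes, perNode, bufSize int) *nodePools {
	p := &nodePools{bufSize: bufSize, free: make([]chan []byte, nodes)}
	for i := range p.free {
		p.free[i] = make(chan []byte, perNode)
	}
	return p
}

// Get returns an idle buffer from the node's free list, or allocates a new
// one when the list is empty.
func (p *nodePools) Get(node int) []byte {
	select {
	case buf := <-p.free[node]:
		return buf
	default:
		return make([]byte, p.bufSize)
	}
}

// Put hands buf back to the node's free list. Buffers that are too small,
// or that arrive when the list is full, are dropped for the GC to reclaim.
func (p *nodePools) Put(node int, buf []byte) {
	if cap(buf) < p.bufSize {
		return
	}
	select {
	case p.free[node] <- buf[:p.bufSize]:
	default:
	}
}

func main() {
	pools := newNodePools(2, 1024, 256*1024)
	buf := pools.Get(0)
	fmt.Println("buffer length:", len(buf))
	pools.Put(0, buf)
	fmt.Println("idle buffers on node 0:", len(pools.free[0]))
}
```

A bounded channel caps idle memory per node at `perNode * bufSize` bytes, which mirrors the `PerNodePoolSize` and `BufferSize` settings shown above.
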
#### Thread Pinning

```go
// Pin the OS thread running the current goroutine to NUMA node 0
numa.PinThreadToNode(0)
defer numa.UnpinThread()

// Or use RunOnNode for a function
numa.RunOnNode(0, func() {
	// This function runs on NUMA node 0
})
```

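Pinning applies to OS threads, while the Go scheduler is free to move goroutines between threads unless `runtime.LockOSThread` is called. The sketch below illustrates the two ingredients a pinning helper typically needs: the CPU list of a node (sysfs publishes it in files such as `/sys/devices/system/node/node0/cpulist`, in a format like `0-3,8-11`) and a goroutine locked to its thread. The helper names are hypothetical, and the affinity system call itself is left out to stay within the standard library; this is not a description of how `PinThreadToNode` is implemented.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// parseCPUList (hypothetical helper) expands a sysfs CPU list such as
// "0-3,8-11" into individual CPU IDs.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil // a memory-only node has no CPUs
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("bad CPU range %q", part)
		}
		for cpu := first; cpu <= last; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

// runOnLockedThread (hypothetical helper) runs fn on a goroutine that stays
// on one OS thread for the duration of the call. A real pinning helper would
// set the thread's CPU affinity right after LockOSThread; that step is
// omitted here.
func runOnLockedThread(fn func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		fn()
	}()
	<-done
}

func main() {
	cpus, err := parseCPUList("0-3,8-11\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	runOnLockedThread(func() {
		fmt.Println("would restrict this thread to CPUs", cpus)
	})
}
```
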
### Performance Benefits

- Lower memory latency, because buffers are served from the local NUMA node
- Better cache utilization
- Reduced cross-socket traffic
- More predictable performance on multi-socket systems

### Configuration

Enable NUMA support in the configuration file:

```json
{
  "performance": {
    "enableNUMA": true,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
```

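As an illustration of how the `performance` object above could be consumed, the following sketch decodes it with `encoding/json` and fills in the defaults this document lists (1024 buffers per node, 262144 bytes per buffer). The type and function names are hypothetical; only the JSON keys are taken from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Defaults as stated in the tuning guide of this document.
const (
	defaultPoolSize   = 1024
	defaultBufferSize = 256 * 1024
)

// PerformanceConfig (hypothetical type) mirrors the "performance" object.
type PerformanceConfig struct {
	EnableNUMA         bool `json:"enableNUMA"`
	NUMABufferPoolSize int  `json:"numaBufferPoolSize"`
	NUMABufferSize     int  `json:"numaBufferSize"`
}

// Config (hypothetical type) holds only the part of the file used here.
type Config struct {
	Performance PerformanceConfig `json:"performance"`
}

// parseConfig decodes the JSON, rejects negative sizes, and applies the
// documented defaults to settings that were left out.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	p := &cfg.Performance
	if p.NUMABufferPoolSize < 0 || p.NUMABufferSize < 0 {
		return Config{}, fmt.Errorf("NUMA buffer settings must not be negative")
	}
	if p.NUMABufferPoolSize == 0 {
		p.NUMABufferPoolSize = defaultPoolSize
	}
	if p.NUMABufferSize == 0 {
		p.NUMABufferSize = defaultBufferSize
	}
	return cfg, nil
}

func main() {
	const sample = `{"performance": {"enableNUMA": true, "numaBufferPoolSize": 1024}}`
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg.Performance)
}
```
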
## 2. io_uring Backend Storage

### What is io_uring?

io_uring is a Linux kernel interface for asynchronous I/O that was introduced in Linux 5.1. It uses a pair of ring buffers shared between user space and the kernel (a submission queue and a completion queue), so I/O operations can be submitted and completed with minimal system call overhead.

### Benefits of io_uring

- Reduced system call overhead (batching of operations)
- Lower latency for I/O operations
- Higher throughput, especially for high-queue-depth workloads
- Better CPU efficiency

### Implementation

The io_uring backend is implemented in `pkg/scsi/backingstore/iouring/`:

- **Async I/O Operations**: Read, Write, and Fsync using io_uring
- **Queue Management**: Configurable queue depth
- **Fallback Support**: Automatically falls back to regular I/O on older kernels

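One way a fallback decision like the one described above could be made is by comparing the running kernel's release string against 5.1. The sketch below is an assumption about the approach, not the backend's actual logic, and its function names are hypothetical. A version check is only a heuristic: io_uring can be unavailable on a new enough kernel (for example when blocked by a sysctl or a seccomp policy), so attempting the ring setup and falling back on error is the more robust test.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseKernelVersion (hypothetical helper) extracts the major and minor
// numbers from a release string such as "5.15.0-91-generic" or "6.1-rc3".
func parseKernelVersion(release string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimSpace(release), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unrecognized kernel release %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	// Keep only the leading digits of the minor field ("1-rc3" -> "1").
	digits := parts[1]
	for i, r := range digits {
		if r < '0' || r > '9' {
			digits = digits[:i]
			break
		}
	}
	if minor, err = strconv.Atoi(digits); err != nil {
		return 0, 0, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	return major, minor, nil
}

// kernelSupportsIOUring reports whether the release is Linux 5.1 or later.
func kernelSupportsIOUring(release string) bool {
	major, minor, err := parseKernelVersion(release)
	if err != nil {
		return false
	}
	return major > 5 || (major == 5 && minor >= 1)
}

func main() {
	// /proc/sys/kernel/osrelease holds the same string that `uname -r` prints.
	data, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release; using the file backend")
		return
	}
	if kernelSupportsIOUring(string(data)) {
		fmt.Println("kernel is new enough for io_uring")
	} else {
		fmt.Println("kernel predates 5.1; using the file backend")
	}
}
```
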
### Usage

Enable io_uring in the storage configuration:

```json
{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "ioUringQueueDepth": 4096
    }
  ],
  "performance": {
    "enableIoUring": true,
    "ioUringQueueDepth": 4096
  }
}
```

### Backend Type Options

- `file` - Standard synchronous file I/O (default)
- `iouring` - io_uring-based asynchronous I/O (Linux 5.1+)

### Requirements

- Linux kernel 5.1 or later
- x86_64, ARM64, or other supported architectures
- O_DIRECT support recommended for best performance

## 3. Combined Configuration Example

For maximum performance, combine both NUMA and io_uring:

```json
{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "enableNUMA": true,
      "numaNode": 0,
      "ioUringQueueDepth": 4096
    }
  ],
  "iscsiportals": [
    {
      "id": 0,
      "portal": "192.168.1.100:3260"
    }
  ],
  "iscsitargets": {
    "iqn.2024-01.com.gotgt:fast-storage": {
      "tpgts": { "1": [0] },
      "luns": { "1": 1000 }
    }
  },
  "performance": {
    "enableNUMA": true,
    "enableIoUring": true,
    "ioUringQueueDepth": 4096,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
```

## 4. Performance Tuning Guide

### NUMA Tuning

1. **Determine NUMA Topology**:
   ```bash
   numactl --hardware
   lscpu | grep NUMA
   ```

2. **Align Network and Storage**:
   - Ensure network interfaces are on the same NUMA node as the iSCSI process
   - Place storage devices on the same NUMA node if possible

3. **Buffer Pool Sizing**:
   - `numaBufferPoolSize`: Number of buffers per node (default: 1024)
   - `numaBufferSize`: Size of each buffer in bytes (default: 262144, i.e. 256 KiB)
   - Size both based on the expected number of concurrent I/Os and the I/O size

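When sizing the buffer pool, it helps to work out how much memory the settings can tie up. The sketch below computes an upper bound, assuming every node's pool is fully populated; the function name is hypothetical. With the defaults, each node can hold 1024 × 262144 bytes = 256 MiB, so a two-socket system can hold up to 512 MiB in pooled buffers.

```go
package main

import "fmt"

// poolFootprintBytes (hypothetical helper) returns the memory held when the
// pools of all nodes are completely full.
func poolFootprintBytes(nodes, buffersPerNode, bufferSize int) int64 {
	return int64(nodes) * int64(buffersPerNode) * int64(bufferSize)
}

func main() {
	// Two NUMA nodes with the default numaBufferPoolSize and numaBufferSize.
	total := poolFootprintBytes(2, 1024, 262144)
	fmt.Printf("%d bytes (%d MiB)\n", total, total>>20)
}
```
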
### io_uring Tuning

1. **Queue Depth**:
   - Higher queue depth: better throughput, but higher per-request latency
   - Lower queue depth: lower latency, but lower throughput
   - Typical values: 128-4096 depending on workload

2. **I/O Size**:
   - Match the application's I/O size for best efficiency
   - Use direct I/O (O_DIRECT) to bypass the page cache if appropriate

3. **System Limits**:
   ```bash
   # Check current limits
   ulimit -a

   # Increase if needed by adding these entries to /etc/security/limits.conf
   # (they are limits.conf entries, not shell commands):
   #   * soft nofile 1048576
   #   * hard nofile 1048576
   ```

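The queue depth trade-off described above follows from Little's Law: in steady state, the number of requests in flight equals throughput multiplied by mean latency. The sketch below applies that relation under the assumption that the queue is kept full and the device delivers a fixed number of I/O operations per second; the function name and the 200,000 IOPS figure are illustrative, not measurements of gotgt.

```go
package main

import (
	"fmt"
	"time"
)

// meanLatency (hypothetical helper) estimates the mean per-request latency
// for a full queue of the given depth at a sustained IOPS rate, using
// Little's Law: latency = requests in flight / throughput.
func meanLatency(queueDepth, iops int) time.Duration {
	if queueDepth <= 0 || iops <= 0 {
		return 0
	}
	return time.Duration(int64(queueDepth) * int64(time.Second) / int64(iops))
}

func main() {
	const iops = 200000 // illustrative device throughput
	for _, depth := range []int{32, 128, 1024, 4096} {
		fmt.Printf("depth %4d at %d IOPS -> %v\n", depth, iops, meanLatency(depth, iops))
	}
}
```

Once the device is saturated, raising the queue depth no longer increases throughput and only lengthens the time each request waits, which is why latency-sensitive workloads tend to use smaller depths.
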
## 5. Benchmarking

Use the following tools to benchmark performance:

1. **fio** (Flexible I/O Tester):
   ```bash
   fio --name=iscsi-test --ioengine=libaio --iodepth=32 \
       --rw=randread --bs=4k --direct=1 --size=1G \
       --filename=/dev/sdX
   ```

2. **iperf3** (for network bandwidth; run the iperf3 server on a port other than 3260, which the iSCSI portal already uses):
   ```bash
   iperf3 -s              # on the target host (default port 5201)
   iperf3 -c <target-ip>  # on the initiator host
   ```

3. **iscsi-perf** (if available from libiscsi)

## 6. Troubleshooting

### NUMA Issues

- Check if NUMA is available: `numa.Available()`
- Verify topology detection: Check logs for NUMA node count
- Thread pinning failures: Ensure sufficient privileges (CAP_SYS_NICE)

### io_uring Issues

- Kernel version check: `uname -r` (must be 5.1+)
- io_uring availability: If `/proc/sys/kernel/io_uring_disabled` exists (newer kernels only), check its value: `0` allows io_uring for all processes, `1` restricts it to privileged processes and members of the configured `io_uring_group`, and `2` disables it entirely
- Permission issues: Ensure the user has appropriate permissions on the backing file

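The `io_uring_disabled` check can be automated with a small helper. This sketch reads the sysctl file and translates its value according to the kernel's sysctl documentation; the function name is hypothetical, and the file is treated as optional because older kernels (before roughly Linux 6.6) do not provide it.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

const ioUringDisabledPath = "/proc/sys/kernel/io_uring_disabled"

// describeIOUringDisabled (hypothetical helper) explains the value of the
// kernel.io_uring_disabled sysctl.
func describeIOUringDisabled(raw string) string {
	switch strings.TrimSpace(raw) {
	case "0":
		return "io_uring is enabled for all processes"
	case "1":
		return "io_uring is disabled for unprivileged processes outside io_uring_group"
	case "2":
		return "io_uring is disabled for all processes"
	default:
		return "unrecognized io_uring_disabled value"
	}
}

func main() {
	data, err := os.ReadFile(ioUringDisabledPath)
	switch {
	case errors.Is(err, os.ErrNotExist):
		// Absence of the file says nothing about io_uring support itself.
		fmt.Println("sysctl not present; this kernel has no io_uring_disabled switch")
	case err != nil:
		fmt.Println("cannot read sysctl:", err)
	default:
		fmt.Println(describeIOUringDisabled(string(data)))
	}
}
```
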
## 7. Future Enhancements

Potential future optimizations:

1. **DPDK Support** - Kernel-bypass networking for iSCSI
2. **SPDK Integration** - User-space NVMe driver support
3. **CPU Affinity Configuration** - Fine-grained CPU pinning
4. **Memory Interleaving** - Automatic memory interleaving policies
5. **Adaptive Buffer Sizing** - Dynamic buffer pool sizing based on workload

## References

- [io_uring by Jens Axboe](https://kernel.dk/io_uring.pdf)
- [NUMA FAQ](https://www.kernel.org/doc/html/latest/vm/numa.html)
- [iSCSI RFC 7143](https://tools.ietf.org/html/rfc7143)