gotgt/docs/PERFORMANCE_OPTIMIZATIONS.md
2026-03-14 11:45:35 +08:00

Performance Optimizations for gotgt

This document describes the performance optimizations implemented for gotgt, focusing on NUMA-aware memory allocation and io_uring backend storage support.

Overview

Two major performance optimizations have been implemented:

  1. NUMA-Aware Memory Allocation - Optimizes memory access patterns on multi-socket systems
  2. io_uring Backend Storage - Provides high-performance asynchronous I/O on Linux 5.1+

1. NUMA-Aware Memory Allocation

What is NUMA?

Non-Uniform Memory Access (NUMA) is a memory design used in multi-processor systems where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

Implementation

The NUMA support is implemented in pkg/util/numa/:

  • Topology Detection (numa.go, numa_linux.go): Automatically detects the NUMA topology by reading the sysfs directory /sys/devices/system/node/
  • NUMA-Local Buffer Pool (pool.go): Provides buffer pools that allocate memory from local NUMA nodes
  • Thread Pinning (numa_linux.go): Allows threads to be pinned to specific NUMA nodes
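
As an illustration of the topology detection step, the following is a minimal sketch of how the node directories under /sys/devices/system/node/ could be read. It is not the code in pkg/util/numa/; the function names parseCPUList and detectTopology are hypothetical, and the sketch assumes each nodeN directory contains a cpulist file in the usual sysfs range format (for example 0-3,8-11).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseCPUList expands a sysfs CPU list such as "0-3,8-11" into CPU IDs.
// Hypothetical helper, not part of gotgt.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil // memory-only nodes have an empty CPU list
	}
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
			}
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

// detectTopology maps each NUMA node ID found under root to its CPU IDs.
// Hypothetical helper; root is a parameter so the function can be tested
// against a fake directory tree.
func detectTopology(root string) (map[int][]int, error) {
	dirs, err := filepath.Glob(filepath.Join(root, "node[0-9]*"))
	if err != nil {
		return nil, err
	}
	topo := make(map[int][]int)
	for _, dir := range dirs {
		id, err := strconv.Atoi(strings.TrimPrefix(filepath.Base(dir), "node"))
		if err != nil {
			continue // not a nodeN directory
		}
		data, err := os.ReadFile(filepath.Join(dir, "cpulist"))
		if err != nil {
			return nil, err
		}
		cpus, err := parseCPUList(string(data))
		if err != nil {
			return nil, err
		}
		topo[id] = cpus
	}
	return topo, nil
}

func main() {
	topo, err := detectTopology("/sys/devices/system/node")
	if err != nil {
		fmt.Println("topology detection failed:", err)
		return
	}
	fmt.Printf("found %d NUMA node(s): %v\n", len(topo), topo)
}
```

On a system without NUMA sysfs entries the sketch reports zero nodes, which a caller could treat as "NUMA unavailable".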

Key Components

NUMABufferPool

pool := numa.NewNUMABufferPool(&numa.BufferPoolConfig{
    BufferSize:      256 * 1024,  // 256 KiB buffers
    PerNodePoolSize: 1024,        // 1024 buffers per node
    EnableNUMA:      true,
})

buf := pool.Get()  // Get buffer from local NUMA node
// use buffer...
pool.Put(buf)      // Return buffer to pool
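
To make the idea of a per-node pool more concrete, here is a minimal sketch of the bookkeeping such a pool might use: one bounded free list per node. This is an assumption for illustration, not the NUMABufferPool implementation; the types nodePool and perNodePools are hypothetical, and the caller passes the node ID explicitly instead of having it detected. The sketch also does not place memory on a particular node; real NUMA placement needs an allocation made by a thread running on that node or an explicit memory policy.

```go
package main

import "fmt"

// nodePool is a hypothetical bounded free list for one NUMA node.
type nodePool struct {
	free chan []byte
	size int
}

// perNodePools holds one free list per node (hypothetical, not gotgt's type).
type perNodePools struct {
	nodes []nodePool
}

func newPerNodePools(numNodes, perNode, bufSize int) *perNodePools {
	p := &perNodePools{nodes: make([]nodePool, numNodes)}
	for i := range p.nodes {
		p.nodes[i] = nodePool{free: make(chan []byte, perNode), size: bufSize}
	}
	return p
}

// Get returns a buffer from the node's free list, allocating a new one
// when the list is empty.
func (p *perNodePools) Get(node int) []byte {
	n := &p.nodes[node]
	select {
	case b := <-n.free:
		return b
	default:
		return make([]byte, n.size)
	}
}

// Put hands a buffer back to the node's free list. Buffers that are too
// small, or that arrive when the list is full, are dropped for the GC.
func (p *perNodePools) Put(node int, b []byte) {
	n := &p.nodes[node]
	if cap(b) < n.size {
		return
	}
	select {
	case n.free <- b[:n.size]:
	default:
	}
}

func main() {
	pools := newPerNodePools(2, 1024, 256*1024)
	buf := pools.Get(0)
	fmt.Println("buffer length:", len(buf))
	pools.Put(0, buf)
}
```

A bounded channel keeps the per-node memory footprint capped at the configured pool size, at the cost of allocating fresh buffers under bursts.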

Thread Pinning

// Pin the current thread to NUMA node 0
numa.PinThreadToNode(0)
defer numa.UnpinThread()

// Or use RunOnNode for a function
numa.RunOnNode(0, func() {
    // This function runs on NUMA node 0
})
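
For readers unfamiliar with thread pinning in Go, the sketch below shows one way a RunOnNode-style helper could work on Linux: lock a goroutine to its OS thread, then restrict that thread to a set of CPUs with sched_setaffinity(2). It is an assumption about the general technique, not gotgt's implementation; runOnCPUs and cpuMask are hypothetical names, and the caller supplies the CPU IDs of the node (for example from the node's cpulist in sysfs).

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"syscall"
	"unsafe"
)

// cpuMask builds the bitmask expected by sched_setaffinity(2).
// Hypothetical helper; negative CPU IDs are ignored.
func cpuMask(cpus []int) []uint64 {
	highest := 0
	for _, c := range cpus {
		if c > highest {
			highest = c
		}
	}
	mask := make([]uint64, highest/64+1)
	for _, c := range cpus {
		if c >= 0 {
			mask[c/64] |= uint64(1) << uint(c%64)
		}
	}
	return mask
}

// runOnCPUs runs fn on a dedicated OS thread restricted to the given CPUs.
// Linux only. Hypothetical stand-in for a RunOnNode-style helper.
func runOnCPUs(cpus []int, fn func()) error {
	if len(cpus) == 0 {
		return errors.New("no CPUs given")
	}
	errc := make(chan error, 1)
	go func() {
		// Lock and never unlock: when this goroutine exits, the Go runtime
		// terminates the thread, so the narrowed affinity cannot leak to
		// other goroutines.
		runtime.LockOSThread()
		mask := cpuMask(cpus)
		// pid 0 means "the calling thread".
		_, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_SETAFFINITY,
			0, uintptr(len(mask)*8), uintptr(unsafe.Pointer(&mask[0])))
		if errno != 0 {
			errc <- errno
			return
		}
		fn()
		errc <- nil
	}()
	return <-errc
}

func main() {
	err := runOnCPUs([]int{0}, func() {
		fmt.Println("running on a thread pinned to CPU 0")
	})
	if err != nil {
		fmt.Println("pinning failed:", err)
	}
}
```

Without runtime.LockOSThread the Go scheduler is free to move the goroutine to another thread, so setting the affinity of "the current thread" alone would not be reliable.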

Performance Benefits

  • Lower memory latency, because buffers are served from the local NUMA node
  • Better cache utilization
  • Less cross-socket memory traffic
  • More predictable performance on multi-socket systems

Configuration

Enable NUMA support in the configuration file:

{
  "performance": {
    "enableNUMA": true,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
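
As a rough illustration of how this section might be consumed, the sketch below parses the performance block and fills in the defaults mentioned in the tuning guide (1024 buffers per node, 256 KiB per buffer). The struct and function names are hypothetical; only the JSON keys come from the example above, and gotgt's actual loader may behave differently.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// performanceConfig mirrors the NUMA keys of the "performance" section.
// The Go type itself is hypothetical.
type performanceConfig struct {
	EnableNUMA         bool `json:"enableNUMA"`
	NUMABufferPoolSize int  `json:"numaBufferPoolSize"`
	NUMABufferSize     int  `json:"numaBufferSize"`
}

type config struct {
	Performance performanceConfig `json:"performance"`
}

// loadPerformance decodes the section and applies defaults for unset values.
func loadPerformance(data []byte) (performanceConfig, error) {
	var c config
	if err := json.Unmarshal(data, &c); err != nil {
		return performanceConfig{}, err
	}
	p := c.Performance
	if p.NUMABufferPoolSize < 0 || p.NUMABufferSize < 0 {
		return performanceConfig{}, errors.New("buffer pool values must not be negative")
	}
	if p.NUMABufferPoolSize == 0 {
		p.NUMABufferPoolSize = 1024 // default buffers per node
	}
	if p.NUMABufferSize == 0 {
		p.NUMABufferSize = 256 * 1024 // default buffer size: 256 KiB
	}
	return p, nil
}

func main() {
	p, err := loadPerformance([]byte(`{"performance": {"enableNUMA": true}}`))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", p)
}
```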

2. io_uring Backend Storage

What is io_uring?

io_uring is a Linux kernel interface for asynchronous I/O that was introduced in Linux 5.1. It provides a highly efficient interface for submitting and completing I/O operations with minimal system call overhead.

Benefits of io_uring

  • Reduced system call overhead (batching of operations)
  • Lower latency for I/O operations
  • Higher throughput especially for high queue depth workloads
  • Better CPU efficiency

Implementation

The io_uring backend is implemented in pkg/scsi/backingstore/iouring/:

  • Async I/O Operations: Read, Write, and Fsync using io_uring
  • Queue Management: Configurable queue depth
  • Fallback Support: Automatically falls back to regular I/O on older kernels
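
The fallback decision can be sketched as a kernel version check. The code below is an assumption about how such a check might look, not the logic in pkg/scsi/backingstore/iouring/; parseKernelRelease and chooseBackend are hypothetical names. A version check is only a heuristic, since a kernel can be new enough and still have io_uring disabled, so attempting to set up a ring and falling back on error is the more robust test.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseKernelRelease extracts major and minor from a release string such
// as "5.15.0-91-generic". Hypothetical helper.
func parseKernelRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimSpace(release), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	// The minor field may carry a suffix (e.g. "8-rc1"); keep leading digits.
	digits := parts[1]
	notDigit := func(r rune) bool { return r < '0' || r > '9' }
	if i := strings.IndexFunc(digits, notDigit); i >= 0 {
		digits = digits[:i]
	}
	if minor, err = strconv.Atoi(digits); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// chooseBackend returns "iouring" only when it was requested and the
// kernel is 5.1 or later; otherwise it falls back to "file".
func chooseBackend(requested, release string) string {
	if requested != "iouring" {
		return "file"
	}
	major, minor, err := parseKernelRelease(release)
	if err != nil || major < 5 || (major == 5 && minor < 1) {
		return "file"
	}
	return "iouring"
}

func main() {
	data, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	fmt.Println("selected backend:", chooseBackend("iouring", string(data)))
}
```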

Usage

Enable io_uring in the storage configuration:

{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "ioUringQueueDepth": 4096
    }
  ],
  "performance": {
    "enableIoUring": true,
    "ioUringQueueDepth": 4096
  }
}

Backend Type Options

  • file - Standard synchronous file I/O (default)
  • iouring - io_uring-based asynchronous I/O (Linux 5.1+)

Requirements

  • Linux kernel 5.1 or later
  • x86_64, ARM64, or other supported architectures
  • O_DIRECT support recommended for best performance

3. Combined Configuration Example

For best performance on multi-socket Linux hosts, enable NUMA-aware allocation and the io_uring backend together:

{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "enableNUMA": true,
      "numaNode": 0,
      "ioUringQueueDepth": 4096
    }
  ],
  "iscsiportals": [
    {
      "id": 0,
      "portal": "192.168.1.100:3260"
    }
  ],
  "iscsitargets": {
    "iqn.2024-01.com.gotgt:fast-storage": {
      "tpgts": { "1": [0] },
      "luns": { "1": 1000 }
    }
  },
  "performance": {
    "enableNUMA": true,
    "enableIoUring": true,
    "ioUringQueueDepth": 4096,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
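
A configuration like this has cross-references that are easy to get wrong: every LUN in iscsitargets must point at a deviceID defined in storages. The sketch below checks that, along with the documented backendType values. It is a hypothetical validator written for illustration, with Go types that cover only the keys it needs; gotgt's own config handling may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types covering only the keys this check needs.
type storage struct {
	DeviceID    uint64 `json:"deviceID"`
	Path        string `json:"path"`
	BackendType string `json:"backendType"`
}

type target struct {
	LUNs map[string]uint64 `json:"luns"`
}

type config struct {
	Storages []storage         `json:"storages"`
	Targets  map[string]target `json:"iscsitargets"`
}

// validateConfig reports the first inconsistency it finds.
func validateConfig(data []byte) error {
	var c config
	if err := json.Unmarshal(data, &c); err != nil {
		return err
	}
	known := make(map[uint64]bool)
	for _, s := range c.Storages {
		if known[s.DeviceID] {
			return fmt.Errorf("duplicate deviceID %d", s.DeviceID)
		}
		known[s.DeviceID] = true
		switch s.BackendType {
		case "", "file", "iouring": // empty means the default, "file"
		default:
			return fmt.Errorf("deviceID %d: unknown backendType %q", s.DeviceID, s.BackendType)
		}
	}
	for name, t := range c.Targets {
		for lun, id := range t.LUNs {
			if !known[id] {
				return fmt.Errorf("target %s: LUN %s refers to undefined deviceID %d", name, lun, id)
			}
		}
	}
	return nil
}

func main() {
	sample := []byte(`{
	  "storages": [{"deviceID": 1000, "path": "/var/tmp/disk.img", "backendType": "iouring"}],
	  "iscsitargets": {"iqn.2024-01.com.gotgt:fast-storage": {"tpgts": {"1": [0]}, "luns": {"1": 1000}}}
	}`)
	if err := validateConfig(sample); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("config is consistent")
}
```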

4. Performance Tuning Guide

NUMA Tuning

  1. Determine NUMA Topology:

    numactl --hardware
    lscpu | grep NUMA
    
  2. Align Network and Storage:

    • Ensure network interfaces are on the same NUMA node as the iSCSI process
    • Place storage devices on the same NUMA node if possible
  3. Buffer Pool Sizing:

    • numaBufferPoolSize: Number of buffers per node (default: 1024)
    • numaBufferSize: Size of each buffer in bytes (default: 262144, i.e. 256 KiB)
    • Size based on expected concurrent I/O and I/O size
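
When sizing the pool it helps to work out the memory it can hold. The sketch below computes the footprint as buffers per node times buffer size, and multiplies by the node count for the total. The function names are hypothetical, and the calculation assumes every node's pool is filled to capacity.

```go
package main

import "fmt"

// poolBytesPerNode is the memory one node's pool holds when full.
func poolBytesPerNode(buffersPerNode, bufferSize int64) int64 {
	return buffersPerNode * bufferSize
}

// totalPoolBytes is the footprint across all NUMA nodes.
func totalPoolBytes(nodes, buffersPerNode, bufferSize int64) int64 {
	return nodes * poolBytesPerNode(buffersPerNode, bufferSize)
}

func main() {
	const mib = 1 << 20
	perNode := poolBytesPerNode(1024, 262144)
	total := totalPoolBytes(2, 1024, 262144)
	fmt.Printf("per node: %d MiB, two nodes: %d MiB\n", perNode/mib, total/mib)
}
```

With the defaults this comes to 256 MiB per node, so a two-socket host would reserve up to 512 MiB for buffers.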

io_uring Tuning

  1. Queue Depth:

    • Higher queue depth generally improves throughput but can increase per-request latency
    • Lower queue depth generally reduces per-request latency but limits throughput
    • Typical values: 128-4096 depending on workload
  2. I/O Size:

    • Match application I/O size for best efficiency
    • Use direct I/O (O_DIRECT) to bypass page cache if appropriate
  3. System Limits:

    # Check current limits
    ulimit -a
    
    # Increase if needed (in /etc/security/limits.conf)
    * soft nofile 1048576
    * hard nofile 1048576
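
The queue depth trade-off can be estimated with Little's law: in steady state, the number of requests in flight equals throughput multiplied by mean latency. The sketch below turns that into a rough IOPS ceiling for a given queue depth. It is a back-of-the-envelope aid, not a measurement; estimatedIOPS is a hypothetical name, and the estimate assumes the queue is kept full and that latency does not change with depth, which real devices rarely satisfy.

```go
package main

import "fmt"

// estimatedIOPS applies Little's law: IOPS = in-flight requests / latency.
// meanLatencyMicros is the mean per-request latency in microseconds.
func estimatedIOPS(queueDepth int, meanLatencyMicros float64) float64 {
	if meanLatencyMicros <= 0 {
		return 0
	}
	return float64(queueDepth) * 1e6 / meanLatencyMicros
}

func main() {
	for _, depth := range []int{1, 32, 128} {
		fmt.Printf("depth %3d at 200 us latency: about %.0f IOPS\n",
			depth, estimatedIOPS(depth, 200))
	}
}
```

If measured IOPS stop growing as depth increases, the device or network is saturated and further depth mostly adds latency.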
    

5. Benchmarking

Use the following tools to benchmark performance:

  1. fio (Flexible I/O Tester):

    fio --name=iscsi-test --ioengine=libaio --iodepth=32 \
        --rw=randread --bs=4k --direct=1 --size=1G \
        --filename=/dev/sdX
    
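  2. iperf3 (for network bandwidth; requires an iperf3 server on the target host, e.g. iperf3 -s -p 3260, with gotgt stopped so the port is free):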

    iperf3 -c <target-ip> -p 3260
    
  3. iscsi-perf (if available from libiscsi)

6. Troubleshooting

NUMA Issues

  • Check if NUMA is available: numa.Available()
  • Verify topology detection: Check logs for NUMA node count
  • Thread pinning failures: Ensure sufficient privileges (CAP_SYS_NICE)

io_uring Issues

  • Kernel version check: uname -r (must be 5.1+)
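  • io_uring availability: On kernels that expose /proc/sys/kernel/io_uring_disabled (Linux 6.6+), check its value: 0 means enabled, 1 restricts io_uring to privileged processes, 2 disables it for all processes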
  • Permission issues: Ensure user has appropriate file permissions
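
The io_uring sysctl check can be automated. The sketch below interprets the value of /proc/sys/kernel/io_uring_disabled, which exists on Linux 6.6 and later: 0 allows all processes, 1 allows only privileged processes (CAP_SYS_ADMIN or membership in the configured io_uring group), and 2 disables io_uring entirely. The function name ioUringAllowed is hypothetical, and the privileged flag is supplied by the caller instead of being detected.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// ioUringAllowed interprets the content of io_uring_disabled.
// privileged says whether the process counts as privileged for value 1.
func ioUringAllowed(raw string, privileged bool) (bool, error) {
	switch strings.TrimSpace(raw) {
	case "0":
		return true, nil
	case "1":
		return privileged, nil
	case "2":
		return false, nil
	default:
		return false, fmt.Errorf("unexpected io_uring_disabled value %q", raw)
	}
}

func main() {
	data, err := os.ReadFile("/proc/sys/kernel/io_uring_disabled")
	if os.IsNotExist(err) {
		// Older kernels do not have this sysctl, so it imposes no restriction.
		fmt.Println("sysctl not present; io_uring is not restricted by it")
		return
	}
	if err != nil {
		fmt.Println("cannot read sysctl:", err)
		return
	}
	allowed, err := ioUringAllowed(string(data), os.Geteuid() == 0)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("io_uring allowed by sysctl:", allowed)
}
```

Treating an effective UID of 0 as privileged is a simplification; a process can hold CAP_SYS_ADMIN without being root, or be root without it inside a container.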

7. Future Enhancements

Potential future optimizations:

  1. DPDK Support - Kernel-bypass networking for iSCSI
  2. SPDK Integration - User-space NVMe driver support
  3. CPU Affinity Configuration - Fine-grained CPU pinning
  4. Memory Interleaving - Automatic memory interleaving policies
  5. Adaptive Buffer Sizing - Dynamic buffer pool sizing based on workload
