this patch adds an non-RFC extenstion to the redirect login response which allows
a target to temporarily redirect not only to a different target address but also to
a different targetname.
This is needed to allow scenarious in active/active storage clusters where each
node had its own targetname, but maps the same volumes behind equal LUN ids.
For this non-RFC behaviour the environment variable LIBISCSI_ALLLOW_TARGETNAME_REDIRECT
has to be set.
Signed-off-by: Peter Lieven <pl@dlhnet.de>
this patch adds a read of VPD page 0x80 (unit serial number) after successful login.
The serial is then validated on secutive reconnects to avoid the accidental mismatch
of LUN ids if some kind of remapping appears between loss of connection and a later
reconnect.
An additional url parameter force_usn is added to enforce the usn right from the beginning.
If not set via url or the new iscsi_set_unit_serial_number function the usn is learned
at the first successful login.
Signed-off-by: Peter Lieven <pl@dlhnet.de>
Apple do not support spinlocks so we must remap to the much slower
mutexes instead if we think we are compiling for apple.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
This is the basic support for doing i/o in a separate worker thread.
It is still not threads safe but a start.
Now we need to protect all variables such as outqueue, waitpdu
and friends.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Add an abstraction for mutexts and threads
that handles both pthread api and native win32 api
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
When an iSCSI connection enters the reconnection phase, the backoff
time (next_reconnect) increases with reconnection retry_cnt. However,
if the client detects that the target has recovered before reaching
next_reconnect, calling iscsi_reconnect/iscsi_force_reconnect has no
any effect, making fast reconnection impossible.
This patch introduces an interface to reset next_reconnect, so that
the client can reset the backoff time upon detecting target recovery
and achieve faster reconnection.
Resolves: https://github.com/sahlberg/libiscsi/issues/428
Signed-off-by: raywang <honglei.wang@smartx.com>
Originally, we use this in scsi-lowlevel.c only, this works as static
function. It also could be used to dump ISCSI opcode, so move it into
common utils.h/utils.c.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
The existing implementation of iscsi_task_mgmt_lun_reset_async cancels
all tasks in ready-to-send and wait-for-completion queues.
If the ISCSI context has in-flight tasks for a different LUNs or
tasks that are not LUN-specific (such as NOPIN, NOPOUT), those tasks
are not supposed to be affected by the LUN reset.
Also, the tasks for the LUN being reset may have in-flight responses
not affected by a concurrent LUN reset; they have to be handled
accordingly.
This change cancels only the tasks for the LUN being reset if they are
in the ready-to-send queue ('outqueue'). The tasks in the wait-for-
completion queue should be cancelled on LUN reset completion.
For example:
iscsi_task_mgmt_lun_reset_async(iscsi, lun, lun_reset_cb, ctxt);
....
....
void lun_reset_cb(struct iscsi_context * iscsi, int status,
void * command_data, void * private_data)
{
// 'response' field per ISCSI spec rfc7143 section 11.6.1
uint8_t iscsi_response = *(uint8_t *)command_data;
if (iscsi_response == 0) {
// The LUN has been reset. No further replies are expected
// for in-flight tasks for that LUN. Explicitly cancelling
// the tasks in wait-for-completion queue.
for (.. scsi_task-s in flight ..) {
iscsi_scsi_cancel_task(iscsi, task);
}
} ...
}
Users need to check if a task is queued to send (not sent yet)
before invoking functions such as iscsi_task_mgmt_abort_task_async.
Otherwise, because task management requests are automatically
treated as "immediate", a request to abort a task is sent before
the task itself.
if (iscsi_scsi_is_task_in_outqueue(iscsi_, task_)) {
iscsi_scsi_cancel_task(iscsi_, task_);
} else {
iscsi_task_mgmt_abort_task_async(iscsi_, task_, AbortCb, context_);
}
If a connection attempt is hung, then iscsi_reconnect() won't do anything. This
makes sense if we'd just re-try to connect to the same target, but if (for
example) login redirect might point us to a different, healthy, target, it
should be possible to restart the full connection process on request.
Signed-off-by: John Levon <john.levon@nutanix.com>
C++ requires explicit conversions from a void to a non-void pointer.
Signed-off-by: Li Feng <fengli@smartx.com>
[ bvanassche: edited commit message, removed casts and changed the declaration
type ]
Instead of adding __attribute__((unused)) to unused arguments, add the
-Wno-unused-parameter compiler flag.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Instead of defining the macro _R_(), define __attribute__() as a macro for
compilers that do not support __attribute__(), namely Microsoft Visual
Studio.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Check if the residual data does not owerwrite existing data blocks has now
been added for all testing data to improve the uniformity of test runs,
increase test readability and remove the duplicate testing data records.
Looking at test_write10_residuals.c, test_write12_residuals.c and
test_write16_residuals.c tests the similarity of the testing scenario
can be found. There are several EDTL and SPDTL combinations which are
the same for different length write command tests. They form the core
of testing scenario of these commands. There aren't so much differences
in the way of testing these combinations itself either. It is possible
to move the main parameters describing the testing scenario into a
separate structure and move the scenario itself into a separate function.
It will increase the readability and reduce the duplicate code of these tests.
As iser_pdu->desc->data_dir is not initialised when sending a PDU.
The value remains what it was when it was used last time. Thus
a PDU could be considered to have data if it previously had and
might cause segmentation fault.
For example if a pdu is a reset task management task with no data
to transfer and the pdu is previously used as a read task. Thus
it would cause fault like below:
> struct scsi_iovector *iovector_in = &task->iovector_in;
0 0x00007ffff7bcb2d1 in iser_rcv_completion (rx_desc=0x555555b79e48, iser_conn=0x555555b573a0) at iser.c:1349
1 0x00007ffff7bcb53e in iser_handle_wc (wc=0x7fffffffdc00, iser_conn=0x555555b573a0) at iser.c:1426
2 0x00007ffff7bcb685 in cq_event_handler (iser_conn=0x555555b573a0) at iser.c:1468
3 0x00007ffff7bcb81b in cq_handle (iser_conn=0x555555b573a0) at iser.c:1516
4 0x00007ffff7bc8b28 in iscsi_iser_service (iscsi=0x555555b58710, revents=1) at iser.c:118
5 0x00007ffff7bc3862 in iscsi_service (iscsi=0x555555b58710, revents=1) at socket.c:1016
6 0x00007ffff7bc3f6c in event_loop (iscsi=0x555555b58710, state=0x7fffffffe000) at sync.c:71
7 0x00007ffff7bc4605 in iscsi_task_mgmt_sync (iscsi=0x555555b58710, lun=0, function=ISCSI_TM_LUN_RESET, ritt=4294967295, rcmdsn=0) at sync.c:281
8 0x00007ffff7bc46cf in iscsi_task_mgmt_lun_reset_sync (iscsi=0x555555b58710, lun=0) at sync.c:312
9 0x000055555555500d in iscsi_lun_reset_sync (iscsi=0x555555b58710) at iscsiclient_lun_reset.c:34
10 0x0000555555555680 in main (argc=7, argv=0x7fffffffe1c8) at iscsiclient_lun_reset.c:211
Signed-off-by: Hou Pu <houpu@bytedance.com>
The Information descriptor type is defined in SPC-5 (r17 4.4.2.2) and
may be used to provide the data offset on COMPARE_AND_WRITE miscompare.
Signed-off-by: David Disseldorp <ddiss@suse.de>
This patch is used to fix the following problems in the current connection
method:
1. iscsi_iser_connect() waits until the connection is established or failed,
and may block the caller for a long time.
2. Although there's a cm_thread handles communication events, but in fact it
has no effects after the connection is established.
3. Resources are not released properly after reconnection failed. And once we
try to reconnect again, the resources will leak permanently.
(see iscsi_reconnect()).
This patch eliminate cm_thread and handle communication events in the caller
thread.
Connection procedure:
1. Create a mock fd by eventfd() (or just use old_iscsi->fd while reconnecting),
and assign it to iscsi->fd.
2. Create communication event channel, make it non-blocking and dup the
notifier fd to iscsi->fd.
3. Handle communication events by iscsi_which_events()/iscsi_service() loop
until connection established or falied.
4. If connection is established successfully, dup the notifier fd of completion
queue (CQ) events to iscsi->fd.
5. Handle completion queue (CQ) events by iscsi_which_events()/iscsi_service()
loop.
The entire procedure is non-blocking.
After established, whenever iscsi_service() is called with revents=0 or
queue_pdu() is called with a NOP pdu, communication events will be checked.
When connection failed, iser transport cleanup itself before callbacks.
Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
Implement an allocator for allocating memory region of different lengths.
The allocator registers 4MB memory chunks as memory regions, and select a
free segment from one of them each time.
4KB is the minimum allocation unit, and free segments in the same chunk can be
merged into a larger free segment by the rule of buddy allocation. As a result,
size of allocated segments will be power of 2, this may waste some space but
produces less fragments.
In each chunk, a complete binary tree (which is actully an array) is used
to maintain free segments. Each node records the order of the largest segment
can be allocated from its subtree. Here's a miniature example.
A chunk with all segments free:
level 4 4(0x1)
level 3 3(0x2) 3(0x3)
level 2 2(0x4) 2(0x5) 2(0x6) 2(0x7)
level 1 1(0x8) 1(0x9) 1(0xa) 1(0xb) 1(0xc) 1(0xd) 1(0xe) 1(0xf)
After allocate a 16KB(order=3) memory region:
level 4 3(0x1)
level 3 0(0x2) 3(0x3)
level 2 2(0x4) 2(0x5) 2(0x6) 2(0x7)
level 1 1(0x8) 1(0x9) 1(0xa) 1(0xb) 1(0xc) 1(0xd) 1(0xe) 1(0xf)
It tooks 1 comparison to determine if a chunk can satisfy and at most 11
loops to find the leftmost free segment meets the requirments.
The value of each node is not more than 11, and a 8-bit integer is enough
to store it, so only 2048 bytes is required for each tree. And since the
entire tree is in a contiguous piece of memory and no rotations are needed,
it's far more efficient than self-balancing trees of the same size.
Different 4MB chunks are linked as a list, and the selection order is from
head to tail each time. If no existing chunks can satisfy the allocation,
the allocator will register another 4M chunk and add it to the tail.
+---------+ +---------+ +---------+
|4MB chunk| ---> |4MB chunk| ---> |4MB chunk|
+---------+ +---------+ +---------+
In most cases, smaller IOs can always get memory regions from the first or
second chunk and never traverse the list too much, and if we really send a lot
of large IOs, the cost of the traversal is rarely critical.
At last, obviously, the chunks can only allocate a maximum of 4MB memory region,
if a larger memory region is needed, the allocater registers/deregisters a
memory region directly regardless of buffer.
Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
Type 0x0A segment descriptors provide a 32-bit wide number-of-bytes
field, allowing for much larger copy offloads with a single segment
descriptor.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reduce the size of struct iscsi_context by reordering the members of this
data structure. Additionally, change the rdma_ack_timeout value from
'unsigned char' into 'uint8_t' to make it clear that this variable
represents an integer.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Since 2c1619edef61a03cb516efaa81750784c3071d10 for linux kernel and
55843c4ab8f559679d28c559cc4d681836be769b for rdma-core, rdma cma
supports RDMA_OPTION_ID_ACK_TIMEOUT. It's useful for RDMA out of
sequence case. Because this feature is added recently, we have to
check this in autogen.sh before building source code.
Depend on production enviroument, tunning rdma ack timeout could get
the best performance. Suggested by Bart and Ronnie, instead of using
a fixed timeout value, add two methods to set rdma ack timeout value.
1, add URL variable 'LIBISCSI_RDMA_ACK_TIMEOUT'. This could works
for a specified connection.
2, add env argument 'LIBISCSI_RDMA_ACK_TIMEOUT'. This works as a
common setting for all the connection of a process.
Test under different packet loss rate and different ack timeout, run
fio (iodepth=1) in a guest os, I got this result:
latency under packet loss rate 0.00001:
timeout 19: avg 170.22, pct99.9 215
timeout 10: avg 160.08, pct99.9 215
timeout 8 : avg 146.39, pct99.9 177
timeout 7 : avg 148.37, pct99.9 211
latency under packet loss rate 0.0001:
timeout 19: avg 949.23, pct99.9 306
timeout 10: avg 818.53, pct99.9 378
timeout 8 : avg 615.84, pct99.9 189
timeout 7 : avg 618.89, pct99.9 310
Base on this test report, setting ack timeout to 8(1048.576 usec) is
a good choice in my test enviroument.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Declare dynamically allocated strings as 'char *' instead of 'const char *'.
Remove the discard_const() macro. Do not test whether or not a pointer is
NULL before calling free() because it is allowed to pass NULL to free().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>