Thread (89 messages) 89 messages, 4 authors, 2021-02-22

Re: [PATCH 00/35] RFC: add "nvme monitor" subcommand

From: Martin Wilck <hidden>
Date: 2021-01-29 11:18:24

On Thu, 2021-01-28 at 17:14 -0800, Sagi Grimberg wrote:
quoted
From: Martin Wilck <redacted>

(Cover letter copied from 
https://github.com/linux-nvme/nvme-cli/pull/877)

This patch set adds a new subcommand **nvme monitor**. In this
mode,
**nvme-cli** runs continuously, monitors events (currently,
uevents) relevant
for discovery, and optionally autoconnects to newly discovered
subsystems.

The monitor mode is suitable to be run in a systemd service. An
appropriate
unit file is provided. As such, **nvme monitor** can be used as an
alternative
to the current auto-connection mechanism based on udev rules and
systemd
template units. If `--autoconnect` is active, **nvme monitor**
masks the
respective udev rules in order to prevent simultaneous connection
attempts
from udev and itself.
I think that a monitor daemon is a good path forward.
quoted
This method for discovery and autodetection has some advantages
over the
current udev-rule based approach:

  * By using the `--persistent` option, users can easily control
whether
    persistent discovery controllers for discovered transport
addresses should
    be created and monitored for AEN events. **nvme monitor**
watches known
    transport addresses, creates discovery controllers as required,
and re-uses
    existing ones if possible.
What does that mean?
In general, if the monitor detects a new host_traddr/traddr/trsvcid
tuple, it runs a discovery on it, and keeps the discovery controller
open if --persistent was given. On startup, it scans existing
controllers, and if it finds already existing discovery controllers,
re-uses them. These will not be shut down when the monitor exits.

This allows users fine-grained control about what discovery controllers
to operate persistently. Users who want all discovery controllers to be
persistent just use --persistent. Others can set up those that they
want to have (manually or with a script), and not use --persistent.

The background is that hosts may not need every detected discovery
controller to be persistent. In multipath scenarios, you may see more
discovery subsystems than anything else, and not everyone likes that.
That's a generic issue and unrelated to the monitor, but running the
monitor with --persistent creates discovery controllers that would
otherwise not be visible.

Hope this clarifies it.
quoted
  * In certain situations, the systemd-based approach may miss
events due to
    race conditions. This can happen e.g. if an FC remote port is
detected, but
    shortly after it's detection an FC relogin procedure is
necessary e.g. due to
    an RSCN. In this case, an `fc_udev_device` uevent will be
received on the
    first detection and handled by an `nvme connect-all` command
run from
    `nvmf-connect@.service`. The connection attempt to the rport in
question will
    fail with "no such device" because of the simultaneous FC
    relogin. `nvmf-connect@.service` may not terminate immediately,
because it
    attempts to establish other connections listed in the Discovery
Log page it
    retrieved. When the FC relogin eventually finishes, a new
uevent will be
    received, and `nvmf-connect@` will be started again, but *this
has no effect*
    if the previous `nvmf-connect@` service hasn't finished yet.
This is the
    general semantics of systemd services, no easy workaround
exists.  **nvme
    monitor** doesn't suffer from this problem. If it sees an
uevent for a
    transport address for which a discovery is already running, it
will queue the
    handling of this event up and restart the discovery after it's
finished.
While I understand the issue, this reason alone is an overkill for
doing 
this.
I agree. But it's not easy to fix the issue otherwise. In the customer
problem where we observed it, I worked around it by adding the udev
seqnum to the "instance name" of the systemd service, thus allowing
several "nvme connect-all" processes to run for the same transport
address simultaneously. But I don't think that would scale well;
the monitor can handle it more cleanly.
quoted
  * Resource consumption for handling uevents is lower. Instead of
running an
    udev worker, executing the rules, executing `systemctl start`
from the
    worker, starting a systemd service, and starting a separate
**nvme-cli**
    instance, only a single `fork()` operation is necessary. Of
course, on the
    back side, the monitor itself consumes resources while it's
running and
    waiting for events. On my system with 8 persistent discovery
controllers,
    its RSS is ~3MB. CPU consumption is zero as long as no events
occur.
What is the baseline with what we have today?
A meaningful comparsion is difficult and should be done when the
monitor functionality is finalized. I made this statement only to
provide a rough idea of the resource usage, not more.
quoted
  * **nvme monitor** could be easily extended to handle events for
non-FC
    transports.
Which events?
Network discovery, mDNS or the like. I haven't digged into the details
yet.
quoted
I've tested `fc_udev_device` handling for NVMeoFC with an Ontap
target, and
AEN handling for RDMA using a Linux **nvmet** target.

### Implementation notes

I've tried to change the exisiting **nvme-cli** code as little as
possible
while reusing the code from `fabrics.c`. The majority of changes in
the
existing code exports formerly static functions and variables, so
that they
are usable from the monitor code.
General comment, can you please separate out fixes/cleanups that are
not
related to the goal of this patchset?
Which ones are you referring to? 09 and 19? While these are minor
improvements to the existing code, I wouldn't say they qualify as fixes
or cleanups. They aren't necessary without adding the monitor code.

But yes, I can post all changes to existing code separately.
quoted
  *  When "add" uevents for nvme controller devices are received,
the
     controller is consistently not in `live` state yet, and
attempting to read
     the `subsysnqn` sysfs attribute returns `(efault)`. While this
should
     arguably be fixed in the kernel, it could be worked around in
user space
     by using timers or polling the `state` sysfs attribute for
changes.
This is a bug, what in the code causes this? nothing in controller
state
should prevent from this sysfs read from executing correctly...
I think it can be fixed by making nvme_sysfs_show_subsysnqn() fall back
to ctrl->opts->subsysnqn if ctrl->subsys is NULL.

I'll send a patch. Anyway, it'll take time until this is fixed
everywhere.
quoted
  * Parse and handle `discovery.conf` on startup.
This is a must I think, where do you get the known transport
addresses
on startup today?
There's a systemd service that runs "nvme connect-all" once during
boot. That exists today. I'm not sure if it should be integrated in the
monitor, perhaps it's good to keep these separate. People who don't
need the monitor can still run the existing service only, whereas for
others, the two would play together just fine.
quoted
  * Implement support for RDMA and TCP protocols.
What is needed for supporting them? Not sure I follow (I thought
you mentioned that you tested against linux nvmet-rdma?)
AENs over existing discovery controllers are supported for all
transports. But there's no support for discovery of new transports
except for NVMeoFC's "fc_udev_device" mechanism.

Regards
Martin





_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help