Thread (14 messages) 14 messages, 3 authors, 2021-12-17

Re: [PATCH 3/3] nvme: add KConfig options for debug features

From: Sagi Grimberg <sagi@grimberg.me>
Date: 2021-12-13 09:28:25

quoted
quoted
Add KConfig menu option to enable and disable gencounter debug
feature that uses config NVME_DEBUG_USE_CID_GENCTR.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
   drivers/nvme/host/Kconfig | 10 ++++++++++
   1 file changed, 10 insertions(+)
diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index dc0450ca23a3..dfa2609b7006 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -1,4 +1,14 @@
   # SPDX-License-Identifier: GPL-2.0-only
+menu "Debug (Enable driver debug features)"
+config NVME_DEBUG_USE_CID_GENCTR
+     bool "Enable command ID gen counter for spurious request
completion"
+     depends on NVME_CORE
+     help
+       The NVM Express driver will use generation counter
+       when calculating the command id. This is needed to debug the
+       spurious request completions coming from a buggy controller.
This is not just to debug - it is also to protect against such a
controller. What is the purpose of this config option anyways?
The main distributions will (as they should) enable it anyways...
I can rewrite the text and rename it to "driver features".
It is not a feature.
We are protecting against such a controller which is not stable
(buggy), i.e. it is doing things which it shouldn't be
doing at the first place.
Yes, this is exactly what we are doing.
Consider a case if controller is not
buggy then it adds instructions in the fast path which are not
needed at all.
We keep repeating this, if the controller assumes _anything_ on
the command identifier then it _is_ buggy, period.

Again, the fact that the controller is buggy is perfectly fine
really, that's why we have quirks.

What happened to your nvme-cli argument? that would make it
specific to that controller.

The host side cost is very cheap as shown before.
A controller(s) that is used in the production environment goes
through qualification process from vendors and from the consumers
to make sure they are stable, something like spurious completions
detection is a basic part of the qualification,
I don't think that assuming that a production controller does not
have any hidden bugs is a great practice, and controllers evolve during
their production lifetime.
hence we should
allow user to configure genctr than forcing additional
instructions in the fast path and keep this pattern for future
such cases.
I personally think that given that the worst-case option for such
a bug is possible data-corruption then we should not make it optional,
but I won't insist on it if others think otherwise.
Maybe I didn't understand, can you please explain what are the
benefits of having gen-counter where controller is stable?
The benefit is that the host code does not "assume" anything
about the controller.
I'll wait to send V2 if you can suggest any other way
(than kconfig something like module param, I'm not sure) so user
can configure genctr calculations, please let me know.
As I said, the nvme-cli skip-genctr argument is perfectly valid in my
mind.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help