Re: [PATCH 3/3] nvme: add KConfig options for debug features
From: Sagi Grimberg <sagi@grimberg.me>
Date: 2021-12-13 09:28:25
quoted
quoted
Add KConfig menu option to enable and disable gencounter debug feature that uses config NVME_DEBUG_USE_CID_GENCTR. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> --- drivers/nvme/host/Kconfig | 10 ++++++++++ 1 file changed, 10 insertions(+)diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index dc0450ca23a3..dfa2609b7006 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig@@ -1,4 +1,14 @@ # SPDX-License-Identifier: GPL-2.0-only +menu "Debug (Enable driver debug features)" +config NVME_DEBUG_USE_CID_GENCTR + bool "Enable command ID gen counter for spurious requestcompletion" + depends on NVME_CORE + help + The NVM Express driver will use generation counter + when calculating the command id. This is needed to debug the + spurious request completions coming from a buggy controller.This is not just to debug - it is also to protect against such a controller. What is the purpose of this config option anyways? The main distributions will (as they should) enable it anyways...I can rewrite the text and rename it to "driver features".
It is not a feature.
We are protecting against such a controller which is not stable (buggy), i.e. it is doing things which it shouldn't be doing at the first place.
Yes, this is exactly what we are doing.
Consider a case if controller is not buggy then it adds instructions in the fast path which are not needed at all.
We keep repeating this, if the controller assumes _anything_ on the command identifier then it _is_ buggy, period. Again, the fact that the controller is buggy is perfectly fine really, that's why we have quirks. What happened to your nvme-cli argument? that would make it specific to that controller. The host side cost is very cheap as shown before.
A controller(s) that is used in the production environment goes through qualification process from vendors and from the consumers to make sure they are stable, something like spurious completions detection is a basic part of the qualification,
I don't think that assuming that a production controller does not have any hidden bugs is a great practice, and controllers evolve during their production lifetime.
hence we should allow user to configure genctr than forcing additional instructions in the fast path and keep this pattern for future such cases.
I personally think that given that the worst-case option for such a bug is possible data-corruption then we should not make it optional, but I won't insist on it if others think otherwise.
Maybe I didn't understand, can you please explain what are the benefits of having gen-counter where controller is stable?
The benefit is that the host code does not "assume" anything about the controller.
I'll wait to send V2 if you can suggest any other way (than kconfig something like module param, I'm not sure) so user can configure genctr calculations, please let me know.
As I said, the nvme-cli skip-genctr argument is perfectly valid in my mind.