Thread: 22 messages, 4 authors, 2021-03-02

Re: [PATCH] nvme-tcp: Check if request has started before processing it

From: Keith Busch <kbusch@kernel.org>
Date: 2021-03-02 07:09:07
Also in: lkml

On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
On 3/1/21 5:05 PM, Keith Busch wrote:
On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
On 3/1/21 2:26 PM, Daniel Wagner wrote:
On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
Crashing is bad, silent data corruption is worse. Is there truly no
defense against that? If not, why should anyone rely on this?
If we receive a response for which we don't have a started request, we
know that something is wrong. Couldn't we just reset the connection
in this case? We don't have to pretend nothing has happened and
continue normally. This would avoid a host crash and would not create
(more) data corruption. Or am I just too naive?
This is actually a sensible solution.
Please send a patch for that.
Is a bad frame a problem that can be resolved with a reset?

Even if so, the reset doesn't indicate to the user if previous commands
completed with bad data, so it still seems unreliable.
We need to distinguish two cases here.
One is us receiving a frame with an invalid tag, leading to a crash.
This can be easily resolved by issuing a reset, as clearly the command was
garbage and we need to invoke error handling (which is reset).

The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag
which is _currently_ valid. This is a case which will fail _even now_, as we
have simply no way of detecting this.

So what again do we miss by fixing the first case?
Apart from a system which does _not_ crash?
I'm just saying each case is a symptom of the same problem. The only
difference from observing one vs the other is a race with the host's
dispatch. And since you're proposing this patch, it sounds like this
condition does happen on tcp compared to other transports where we don't
observe it. I just thought the implication that data corruption happens
is alarming.
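
For illustration only, the two cases discussed above can be modelled in a
small userspace sketch. This is not the nvme-tcp driver code or the patch
under review; every name here (sim_queue, sim_start_request,
sim_handle_response, QUEUE_DEPTH) is hypothetical, and the "reset" is just
a return value standing in for real error recovery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 4	/* hypothetical; kept small for illustration */

enum req_state { REQ_IDLE = 0, REQ_STARTED };

/* Hypothetical stand-in for a per-tag request slot. */
struct sim_request {
	enum req_state state;
};

struct sim_queue {
	struct sim_request reqs[QUEUE_DEPTH];
};

enum rx_action { RX_COMPLETE, RX_RESET };

/* Host side: mark a tag as in flight (the "dispatch" step). */
static void sim_start_request(struct sim_queue *q, uint16_t tag)
{
	assert(tag < QUEUE_DEPTH);
	q->reqs[tag].state = REQ_STARTED;
}

/*
 * Receive side: decide what to do with a response carrying 'tag'.
 *
 * Case 1 (detectable): the tag is out of range, or it names a slot
 * that was never started. Nothing sensible can be completed, so ask
 * for a reset instead of touching the request.
 *
 * Case 2 (not detectable here): the tag names a slot that is
 * currently started. The response is accepted even if it is a stale
 * duplicate meant for an earlier use of the same tag.
 */
static enum rx_action sim_handle_response(struct sim_queue *q, uint16_t tag)
{
	if (tag >= QUEUE_DEPTH)
		return RX_RESET;
	if (q->reqs[tag].state != REQ_STARTED)
		return RX_RESET;

	q->reqs[tag].state = REQ_IDLE;
	return RX_COMPLETE;
}
```

In this model a duplicate response for tag 1 yields RX_RESET only if it
arrives while the slot is idle. If the host has already called
sim_start_request() for tag 1 again, the same duplicate returns
RX_COMPLETE, which is the race with the host's dispatch described above.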

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme