Re: [dpdk-dev] [PATCH v2] telemetry: fix "in-memory" process socket conflicts
From: Bruce Richardson <hidden>
Date: 2021-09-29 15:24:15
On Wed, Sep 29, 2021 at 03:54:48PM +0100, Kevin Traynor wrote:
> On 29/09/2021 14:32, Bruce Richardson wrote:
>> On Wed, Sep 29, 2021 at 01:28:53PM +0100, Kevin Traynor wrote:
>>> Hi Bruce,
>>>
>>> On 24/09/2021 17:18, Bruce Richardson wrote:
>>>> When DPDK is run with --in-memory mode, multiple processes can run simultaneously using the same runtime dir. This leads to each process removing another process' telemetry socket as it started up, giving unexpected behaviour.
>>>>
>>>> This patch changes that behaviour to first check if the existing socket is active. If not, it's an old socket to be cleaned up and can be removed. If it is active, telemetry initialization fails and an error message is printed out giving instructions on how to remove the error; either by using file-prefix to have a different runtime dir (and therefore socket path) or by disabling telemetry if it is not needed.
>>>
>>> telemetry is enabled by default but it may not be used by the application. Hitting this issue will cause rte_eal_init() to fail which will probably stop or severely limit the application. So it could change a working application to a non-working one (albeit one that doesn't interfere with other process' sockets).
>>>
>>> Can it just print a warning that telemetry will not be enabled and continue so it's not returning an rte_eal_init failure?
>>
>> For a backported fix, yes, that would probably be better behaviour, but for the latest branch, I think returning error and having the user explicitly choose the resolution they want to occur is best. I'll see about doing a separate backport patch for 20.11.
>
> But this is a runtime message dependent on runtime environment. The user may not have access or know how to change eal parameters.
True. But on the other hand, this problem only occurs with non-default EAL parameters anyway, so someone must have configured this with the --in-memory flag.
> In the case where the application doesn't care about telemetry, they have gone from not having telemetry to rte_eal_init() failing, which probably has severe consequences.
Yes, I agree, which is why I would suggest that for any backport of this fix, the error be made non-fatal as you suggest. [Having looked into it, having it as a non-fatal error is rather awkward, so it may be best just left unfixed and the current behaviour documented as a known issue]. However, for any application being updated and rebuilt against 21.11, I would have thought it reasonable to flag this as an error, as any such application would require revalidation anyway.
> I could maybe agree if telemetry was disabled by default and the application had set the --telemetry flag indicating that they want/need it. As it is, it feels like it's possibly a worse outcome for the user.
Perhaps, but I believe the only case of there being an issue would be where:
1) a user who cannot modify the EAL parameters
2) runs an application which has been updated and rebuilt against 21.11
3) where that application is hard-coded to use in-memory mode and
4) has never been verified with two or more instances of that running?

Or am I missing something here?

Regards,
/Bruce
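
For readers following the thread, the "check if the existing socket is active" step from the patch description can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the patch: socket_is_active() and PROBE_SOCK_TYPE are hypothetical names, and the socket type is assumed to match whatever the listening process used.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Assumption: the probe must use the same socket type as the listener. */
#define PROBE_SOCK_TYPE SOCK_SEQPACKET

/*
 * Hypothetical helper, not a DPDK API.
 * Returns  1 if a live process is listening on the socket at `path`,
 *          0 if the path is free (a stale socket file is unlinked first),
 *         -1 on any other error.
 */
static int
socket_is_active(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd, ret, saved_errno;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;
	strcpy(addr.sun_path, path);

	fd = socket(AF_UNIX, PROBE_SOCK_TYPE, 0);
	if (fd < 0)
		return -1;

	/* A successful connect means some process still owns the socket. */
	ret = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);

	if (ret == 0)
		return 1;
	if (saved_errno == ENOENT)
		return 0;	/* nothing there at all */
	if (saved_errno == ECONNREFUSED) {
		/* File exists but nobody is listening: stale, clean it up. */
		unlink(path);
		return 0;
	}
	return -1;
}
```

A caller would then decide what a return value of 1 means: fail initialization, as proposed above for 21.11, or log a warning and continue without telemetry, as suggested for a backport. Note that the probe is inherently racy if two processes start at the same moment.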