Thread (36 messages) 36 messages, 4 authors, 2021-05-27

Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache

From: Slava Ovsiienko <hidden>
Date: 2021-05-07 10:14:52

Hi, Feifei

We should consider the locks in your scenario - it is crucial for the complete model description:

How agent_1 (in your terms) rebuilds global cache:

1a) lock()
1b) rebuild(global cache)
1c) update(dev_gen)
1d) wmb()
1e) unlock()

How agent_2 checks:

2a) check(dev_gen) (assume positive - changed)
2b) clear(local_cache)
2c) miss(on empty local_cache) - > eventually it goes to mr_lookup_caches()
2d) lock()
2e) get(new MR)
2f) unlock()
2g) update(local cache with obtained new MR)

Hence, even if 1c) becomes visible in 2a) before 1b) committed (say, due to out-of-order Arch) - the agent 2
would be blocked on 2d) and scenario depicted on your Fig2 would not happen (agent_2 will wait before step 3
till agent 1 unlocks after its step 5).

With best regards,
Slava
-----Original Message-----
From: Feifei Wang <redacted>
Sent: Friday, May 7, 2021 9:36> To: Slava Ovsiienko [off-list ref]; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
Cc: dev@dpdk.org; nd <redacted>; stable@dpdk.org; Ruifeng Wang
[off-list ref]; nd [off-list ref]
Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
cache

Hi, Slava

Thanks very much for your reply.
quoted
-----邮件原件-----
发件人: Slava Ovsiienko [off-list ref]
发送时间: 2021年5月6日 19:22
收件人: Feifei Wang [off-list ref]; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
抄送: dev@dpdk.org; nd [off-list ref]; stable@dpdk.org; Ruifeng Wang
[off-list ref]; nd [off-list ref]
主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
cache

Hi, Feifei

Sorry, I do not follow why we should get rid of the last (after
dev_gen update) wmb.
We've rebuilt the global cache, we should notify other agents it's
happened and they should flush local caches. So, dev_gen change should
be made visible to other agents to trigger this activity and the
second wmb is here to ensure this.
1. For the first problem why we should get rid of the last wmb and move it
before dev_gen updated, I think our attention is how the wmb implements
the synchronization between multiple agents.
					Fig1
----------------------------------------------------------------------------------------------
-------
Timeslot        		agent_1               		   agent_2
1          		rebuild global cache
2                       		wmb
3            		     update dev_gen ----------------------- load changed
dev_gen
4                                  			        	           rebuild local cache
----------------------------------------------------------------------------------------------
-------

First, wmb is only for local thread to keep the order between local write-
write :
Based on the picture above, for agent_1, wmb keeps the order that
rebuilding global cache is always before updating dev_gen.

Second, agent_1 communicates with agent_2 by the global variable
"dev_gen" :
If agent_1 updates dev_gen, agent_2 will load it and then it knows it should
rebuild local cache

Finally, agent_2 rebuilds local cache according to whether agent_1 has rebuilt
global cache, and agent_2 knows this information by the variable "dev_gen".
					Fig2
----------------------------------------------------------------------------------------------
-------
Timeslot        		agent_1               		   agent_2
1		        update dev_gen
2						      load changed dev_gen
3						          rebuild local cache
4        		    rebuild global cache
5			 wmb
----------------------------------------------------------------------------------------------
-------

However, in arm platform, if wmb is after dev_gen updated, "dev_gen" may
be updated before agent_1 rebuilding global cache, then agent_2 maybe
receive error message and rebuild its local cache in advance.

To summarize, it is not important which time other agents can see the
changed global variable "dev_gen".
(Actually, wmb after "dev_gen" cannot ensure changed "dev_gen" is
committed to the global).
It is more important that if other agents see the changed "dev_gen", they
also can know global cache has been updated.
quoted
One more point, due to registering new/destroying existing MR involves
FW (via kernel) calls, it takes so many CPU cycles that we could
neglect wmb overhead at all.
We just move the last wmb into the right place, and not delete it for
performance.
quoted
Also, regarding this:

 > > Another question suddenly occurred to me, in order to keep the  >
quoted
order that rebuilding global cache before updating ”dev_gen“, the  >
wmb should be before updating "dev_gen" rather than after it.
 > > Otherwise, in the out-of-order platforms, current order cannot be
kept.
quoted
it is not clear why ordering is important - global cache update and
dev_gen change happen under spinlock protection, so only the last wmb
is meaningful.
2. The second function of wmb before "dev_gen" updated is for
performance according to our previous discussion.
According to Fig2, if there is no wmb between "global cache updated" and
"dev_gen updated", "dev_gen" may update before global cache updated.

Then agent_2 may see the changed "dev_gen" and flush entire local cache in
advance.

This entire flush can degrade the performance:
"the local cache is getting empty and can't provide translation for other valid
(not being removed) MRs, and the translation has to look up in the global
cache, that is locked now for rebuilding, this causes the delays in data path on
acquiring global cache lock."

Furthermore, spinlock is just for global cache, not for dev_gen and local
cache.
quoted
To summarize, in my opinion:
- if you see some issue with ordering of global cache update/dev_gen
signalling,
  could you, please, elaborate? I'm not sure we should maintain an
order (due to spinlock protection)
- the last rte_smp_wmb() after dev_gen incrementing should be kept
intact
At last, for my view, there are two functions that moving wmb before
"dev_gen"
for the write-write order:
--------------------------------
a) rebuild global cache;
b) rte_smp_wmb();
c) updating dev_gen
--------------------------------
1. Achieve synchronization between multiple threads in the right way 2.
Prevent other agents from flushing local cache early to ensure performance

Best Regards
Feifei
quoted
With best regards,
Slava
quoted
-----Original Message-----
From: Feifei Wang <redacted>
Sent: Thursday, May 6, 2021 5:52
To: Slava Ovsiienko <redacted>; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
Cc: dev@dpdk.org; nd <redacted>; stable@dpdk.org; Ruifeng Wang
[off-list ref]; nd [off-list ref]
Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
Region cache

Hi, Slava

Would you have more comments about this patch?
For my sight, only one wmb before "dev_gen" updating is enough to
synchronize.

Thanks very much for your attention.


Best Regards
Feifei
quoted
-----邮件原件-----
发件人: Feifei Wang
发送时间: 2021年4月20日 16:42
收件人: Slava Ovsiienko [off-list ref]; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
抄送: dev@dpdk.org; nd [off-list ref]; stable@dpdk.org; Ruifeng
Wang
quoted
quoted
[off-list ref]; nd [off-list ref]
主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
Region
quoted
quoted
quoted
cache

Hi, Slava

I think the second wmb can be removed.
As I know, wmb is just a barrier to keep the order between write
and
write.
quoted
quoted
and it cannot tell the CPU when it should commit the changes.

It is usually used before guard variable to keep the order that
updating guard variable after some changes, which you want to
release,
have been done.
quoted
For example, for the wmb  after global cache update/before
altering dev_gen, it can ensure the order that updating global
cache before altering
dev_gen:
1)If other agent load the changed "dev_gen", it can know the
global cache has been updated.
2)If other agents load the unchanged, "dev_gen", it means the
global cache has not been updated, and the local cache will not be
flushed.
quoted
quoted
quoted
As a result, we use  wmb and guard variable "dev_gen" to ensure
the global cache updating is "visible".
The "visible" means when updating guard variable "dev_gen" is
known by other agents, they also can confirm global cache has been
updated in the meanwhile. Thus, just one wmb before altering
dev_gen can ensure
this.
quoted
Best Regards
Feifei
quoted
-----邮件原件-----
发件人: Slava Ovsiienko [off-list ref]
发送时间: 2021年4月20日 15:54
收件人: Feifei Wang [off-list ref]; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
抄送: dev@dpdk.org; nd [off-list ref]; stable@dpdk.org; Ruifeng
Wang
quoted
quoted
[off-list ref]; nd [off-list ref]; nd [off-list ref];
nd
quoted
quoted
quoted
[off-list ref]
主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
Region cache

Hi, Feifei

In my opinion, there should be 2 barriers:
 - after global cache update/before altering dev_gen, to ensure
the correct order
 - after altering dev_gen to make this change visible for other
agents and to trigger local cache update

With best regards,
Slava
quoted
-----Original Message-----
From: Feifei Wang <redacted>
Sent: Tuesday, April 20, 2021 10:30
To: Slava Ovsiienko <redacted>; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
Cc: dev@dpdk.org; nd <redacted>; stable@dpdk.org; Ruifeng
Wang
quoted
quoted
quoted
[off-list ref]; nd [off-list ref]; nd
[off-list ref];
quoted
quoted
nd
quoted
quoted
quoted
[off-list ref]
Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for
Memory Region cache

Hi, Slava

Another question suddenly occurred to me, in order to keep the
order that rebuilding global cache before updating ”dev_gen“,
the wmb should be before updating "dev_gen" rather than after it.
Otherwise, in the out-of-order platforms, current order cannot
be
kept.
quoted
quoted
quoted
quoted
Thus, we should change the code as:
a) rebuild global cache;
b) rte_smp_wmb();
c) updating dev_gen

Best Regards
Feifei
quoted
-----邮件原件-----
发件人: Feifei Wang
发送时间: 2021年4月20日 13:54
收件人: Slava Ovsiienko [off-list ref]; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
抄送: dev@dpdk.org; nd [off-list ref]; stable@dpdk.org;
Ruifeng
quoted
quoted
quoted
Wang
quoted
quoted
[off-list ref]; nd [off-list ref]; nd
[off-list ref]
quoted
quoted
quoted
quoted
quoted
quoted
主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
Region
quoted
quoted
quoted
cache

Hi, Slava

Thanks very much for your explanation.

I can understand the app can wait all mbufs are returned to
the memory pool, and then it can free this mbufs, I agree with
this.
quoted
quoted
quoted
quoted
quoted
quoted
As a result, I will remove the bug fix patch from this
series and just replace the smp barrier with C11 thread
fence. Thanks very much for your patient explanation again.

Best Regards
Feifei
quoted
-----邮件原件-----
发件人: Slava Ovsiienko [off-list ref]
发送时间: 2021年4月20日 2:51
收件人: Feifei Wang [off-list ref]; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]
抄送: dev@dpdk.org; nd [off-list ref]; stable@dpdk.org;
Ruifeng
quoted
quoted
quoted
Wang
quoted
quoted
[off-list ref]; nd [off-list ref]
主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for
Memory Region cache

Hi, Feifei

Please, see below

....
quoted
quoted
Hi, Feifei

Sorry, I do not follow what this patch fixes. Do we
have some issue/bug with MR cache in practice?
This patch fixes the bug which is based on logical
deduction, and it doesn't actually happen.
quoted
Each Tx queue has its own dedicated "local" cache for
MRs to convert buffer address in mbufs being
transmitted to LKeys (HW-related entity
handle) and the "global" cache for all MR registered
on the
device.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
AFAIK, how conversion happens in datapath:
- check the local queue cache flush request
- lookup in local cache
- if not found:
- acquire lock for global cache read access
- lookup in global cache
- release lock for global cache

How cache update on memory freeing/unregistering
happens:
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
- acquire lock for global cache write access
- [a] remove relevant MRs from the global cache
- [b] set local caches flush request
- free global cache lock

If I understand correctly, your patch swaps [a] and
[b], and local caches flush is requested earlier. What
problem does it
solve?
quoted
quoted
quoted
quoted
quoted
quoted
quoted
It is not supposed there are in datapath some mbufs
referencing to the memory being freed. Application
must ensure this and must not allocate new mbufs from
this memory regions
being freed.
quoted
quoted
quoted
quoted
quoted
Hence, the lookups for these MRs in caches should not
occur.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
For your first point that, application can take charge
of preventing MR freed memory being allocated to data path.

Does it means that If there is an emergency of MR
fragment, such as hotplug, the application must inform
thedata path in advance, and this memory will not be
allocated, and then the control path will free this
memory? If application  can do like this, I agree that
this bug
cannot happen.
quoted
Actually,  this is the only correct way for application to operate.
Let's suppose we have some memory area that application
wants to
free.
quoted
quoted
quoted
ALL references to this area must be removed. If we have
some mbufs allocated from this area, it means that we have
memory pool created
there.
quoted
quoted
What application should do:
- notify all its components/agents the memory area is
going to be freed
- all components/agents free the mbufs they might own
- PMD might not support freeing for some mbufs (for
example being sent and awaiting for completion), so app
should just wait
- wait till all mbufs are returned to the memory pool (by
monitoring available obj == pool size)

Otherwise - it is dangerous to free the memory. There are
just some mbufs still allocated, it is regardless to buf
address to MR translation. We just can't free the memory -
the mapping will be destroyed and might cause the
segmentation fault by SW or some HW issues on DMA access
to unmapped memory.  It is very generic safety approach -
do not free the memory that is still in
use.
quoted
quoted
quoted
quoted
quoted
Hence, at the moment of freeing and unregistering the MR,
there MUST BE NO any
mbufs in flight referencing to the addresses being freed.
quoted
No translation to MR being invalidated can happen.
quoted
quoted
For other side, the cache flush has negative effect -
the local cache is getting empty and can't provide
translation for other valid (not being removed) MRs,
and the translation has to look up in the global
cache, that is locked now for rebuilding, this causes
the delays in datapatch
on acquiring global cache lock.
quoted
So, I see some potential performance impact.
If above assumption is true, we can go to your second point.
I think this is a problem of the tradeoff between cache
coherence and
performance.
quoted
I can understand your meaning that though global cache
has been changed, we should keep the valid MR in local
cache as long as possible to ensure the fast searching speed.
In the meanwhile, the local cache can be rebuilt later
to reduce its waiting time for acquiring the global cache lock.

However,  this mechanism just ensures the performance
unchanged for the first few mbufs.
During the next mbufs lkey searching after 'dev_gen'
updated, it is still necessary to update the local cache.
And the performance can firstly reduce and then returns.
Thus, no matter whether there is this patch or not,  the
performance will jitter in a certain period of
time.
quoted
quoted
Local cache should be updated to remove MRs no longer valid.
But we just flush the entire cache.
Let's suppose we have valid MR0, MR1, and not valid MRX in
local
cache.
quoted
quoted
quoted
quoted
And there are traffic in the datapath for MR0 and MR1, and
no traffic for MRX anymore.

1) If we do as you propose:
a) take a lock
b) request flush local cache first - all MR0, MR1, MRX
will be removed on translation in datapath
c) update global cache,
d) free lock
All the traffic for valid MR0, MR1 ALWAYS will be blocked
on lock taken for cache update since point b) till point d).

2) If we do as it is implemented now:
a) take a lock
b) update global cache
c) request flush local cache
d) free lock
The traffic MIGHT be locked ONLY for MRs non-existing in
local cache (not happens for MR0 and MR1, must not happen
for MRX), and probability should be minor. And lock might
happen since
c) till
d)
- quite short period of time

Summary, the difference between 1) and 2)

Lock probability:
- 1) lock ALWAYS happen for ANY MR translation after b),
  2) lock MIGHT happen, for cache miss ONLY, after c)

Lock duration:
- 1) lock since b) till d),
  2) lock since c) till d), that seems to be  much shorter.
quoted
Finally, in conclusion, I tend to think that the bottom
layer can do more things to ensure the correct execution
of the program, which may have a negative impact on the
performance in a short time, but in the long run, the
performance will eventually
come back.
quoted
quoted
quoted
quoted
Furthermore, maybe we should pay attention to the
performance in the stable period, and try our best to
ensure the correctness of the program in case of
emergencies.

If we have some mbufs still allocated in memory being
freed
- there is nothing to say about correctness, it is totally
incorrect. In my opinion, we should not think how to
mitigate this incorrect behavior, we should not encourage
application developers to follow the wrong
approaches.
quoted
With best regards,
Slava
quoted
Best Regards
Feifei
quoted
With best regards,
Slava
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help