Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
From: Honnappa Nagarahalli <hidden>
Date: 2021-02-03 16:25:01
<snip>
Hi Feifei,quoted
Hi, Honnappa, Konstantin and Stephen Thanks very much for your attention of this patch. Based on your opinion, Ruifeng and I discuss about this and make a summary:__________________________________________________________ ____________quoted
___________________________________________ ___ main threadworker threadquoted
rte_eal_remote_launch: [ Honnappa focus ] To ensure f can load correct arg, arg store should before f lcore_config[worker_id].f = f; lcore_config[worker_id].arg =arg;quoted
wmb()? or store-relase on f?eal_thread_loop:quoted
pipeline_communication----------------------> pipeline_communicationquoted
if (lcore_config[lcore_id].f ==quoted
NULL)rte_panic("NULL functionquoted
pointer\n");fct_arg =quoted
lcore_config[lcore_id].arg;ret =quoted
lcore_config[lcore_id].f(fct_arg);__________________________________________________________ ____________quoted
___________________________________________ __ test_ring_stress: wmb()? [ Konstantin focus ]test_worker:quoted
Main thread can use wrk_cmd to Wrk_cmd =WRK_CMD_RUN;----------------------> Wrk_cmd == WRK_CMD_RUN;quoted
control multiple threads to start runningwmb()?quoted
at the same time as much as possiblering_dequeue;quoted
ring_enqueue;quoted
Wrk_cmd =WRK_CMD_STOP;----------------------> Wrk_cmd == WRK_CMD_STOP;quoted
__________________________________________________________ ____________quoted
___________________________________________ ___ rte_eal_wait_lcore:wmb()quoted
[ Honnappa focus ] lcore_config[lcore_id].state == FINISHED<--------------------- lcore_config[lcore_id].state =quoted
FINISHED Load-acquire and store-release are used on the variable “state” rmb();__________________________________________________________ ____________quoted
___________________________________________ ___ From the picture above, 1.First, for the underlying function rte_eal_remote_launch, Honnappa focuses on that, pipeline_communication cannot ensure ‘arg’ parameters is loaded correctly by the worker thread. This is because in weak memory order framework, maybe the main thread and worker thread firstly finish pipeline communication, and then the worker thread receive signal and execute the function ‘ f ’. However, it maybe load a wrong value of ‘arg’ due to that the main thread stores ‘arg’ after pipeline communication. So wmb or store_release isnecessary for ‘arg’.quoted
2.Second, for the upper-layer test_ring_stress, Konstantin foucese on that, Whether the main thread can use ‘wrk_cmd’ to control multiple threads to run at the same time as much as possible. Because rte_eal_remote_launch only can communicates with one worker thread at the same time. This means some worker thread maybe start working very early but other worker threads maybe need to wait a long time to start working if ‘wrk_cmd' is stored 'RUN' flag beforerte_remote_launch.quoted
At last, for unit test, this may cause that the test results are not stable. 3.Third, for rte_eal_wait_lcore, Honnappa focuses on that the ‘state’ as asynchronous bariable,quoted
we should add load-acquire and store-release on it. However, there have been rmb and wmb after and before ‘state’, So I’m not sure whetherwe should replace them.quoted
In summary, I think Honnappa and Konstantin have different concerns. For Honnappa, we can add wmb or store-release to ensure the ‘arg’ can be loaded correctly in rte_eal_remote_launch. For Konstantin, we can add wmb and rmb to ensure the main thread can control the worker Threads to run at the same time, and then make the test results more accurate in the ring_stress_test.Agree with both.
Thanks Feifei, understood. I am just trying to take a step back and see what kind of ordering guarantees rte_eal_remote_launch should provide so that we do not have to deal with adding additional barriers in the applications. For ex: if we can avoid the barriers around 'wrk_cmd' (kind of use cases) it will benefit all the applications.
quoted
Best Regards Feifeiquoted
-----邮件原件----- 发件人: Honnappa Nagarahalli [off-list ref] 发送时间: 2021年1月30日 9:24 收件人: Stephen Hemminger [off-list ref] 抄送: Ananyev, Konstantin [off-list ref]; FeifeiWangquoted
quoted
[off-list ref]; dev@dpdk.org; nd [off-list ref]; RuifengWangquoted
quoted
[off-list ref]; Honnappa Nagarahalli [off-list ref]; nd [off-list ref] 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test <snip>quoted
quoted
quoted
quoted
quoted
Hi Feifei,quoted
The variable "wrk_cmd" is a signal to control threads from running and stopping. When worker lcores load "wrk_cmd ==WRK_CMD_RUN",quoted
quoted
quoted
theystartquoted
running and when worker lcores load "wrk_cmd == WRK_CMD_STOP",theyquoted
stop. For the wmb in test_mt1, no storing operations must keep the order after storing "wrk_cmd". Thus the wmb isunnecessary.quoted
quoted
quoted
quoted
quoted
quoted
quoted
I think there is a bug in my original code, we should do smp_wmb() *before* setting wrk_cmd, not after: /* launch on all workers */ RTE_LCORE_FOREACH_WORKER(lc) { arg[lc].rng = r; arg[lc].stats = init_stat; rte_eal_remote_launch(test, &arg[lc], lc); } /* signal worker to start test */ + rte_smp_wmb(); wrk_cmd = WRK_CMD_RUN; - rte_smp_wmb(); usleep(run_time * US_PER_S); I still think we'd better have some synchronisation here. Otherwise what would prevent compiler and/or cpu to update wrk_cmd out of order (before _init_ phase is completed)? We probably can safely assume no reordering from the compiler here, as we have function calls straight before and after 'wrk_cmd =WRK_CMD_RUN;'quoted
quoted
But for consistency and easier maintenance, I still think it is better to have something here, after all it is not performance criticalpass.quoted
quoted
quoted
Agree that this is not performance critical. This is more about correctness (as usually people refer to code to understand the concepts). You can refer to video [1]. Essentially, the pthread_create has 'happens-before' behavior. i.e. all the memory operations before the pthread_create are visible to the newthread.quoted
quoted
quoted
The rte_smp_rmb() barrier in the thread function is not required as it reads thedata that was set before the thread was launched. rte_eal_remote_launch() doesn't call pthread_create(). All it does - updates global variable (lcore_config) and writes/reads to/from the pipe.Thanks for the reminder ☹ I think rte_eal_remote_launch and rte_eal_wait_lcore need to providebehavior similar to pthread_launch and pthread_join respectively.quoted
There is use of rte_smp_*mb in those functions as well. Those need to be fixedfirst and then look at these. Looks like you want __atomic_thread_fence() here.In the rte_eal_remote_launch case, all the memory operations before the API call need to be visible to the worker. If this is the only requirement, we can use the function pointer as the guard variable and use store-release. In the eal_thread_loop function we could do load-acquire on the function pointer. I do not think that there is a requirement to ensure that the memory operations after the API call do not happen before the worker thread starts running the function (As there is no guarantee on when the worker thread will run. If the main thread needs to know if the worker thread is running explicit hand-shaking needs to happen). The rte_eal_wait_lcore API needs to ensure that the memory operations in the worker are visible to the main. rte_eal_wait_lcore and eal_thread_loop are synchronizing using lcore_config[worker_id].state. I need to understand what else 'state' is used for. If there are no issues, we can do a store-release on'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.quoted
quoted
So, we do not have to use the __atomic_thread_fence.