Re: [PATCH v3 0/5] vhost: optimize enqueue
From: Jianbo Liu <hidden>
Date: 2016-09-26 05:39:06
On 26 September 2016 at 13:25, Wang, Zhihong [off-list ref] wrote:
-----Original Message-----
From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
Sent: Monday, September 26, 2016 1:13 PM
To: Wang, Zhihong <redacted>
Cc: Thomas Monjalon <redacted>; dev@dpdk.org; Yuanhan Liu [off-list ref]; Maxime Coquelin [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

On 25 September 2016 at 13:41, Wang, Zhihong [off-list ref] wrote:
-----Original Message-----
From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
Sent: Friday, September 23, 2016 9:41 PM
To: Jianbo Liu <redacted>
Cc: dev@dpdk.org; Wang, Zhihong <redacted>; Yuanhan Liu [off-list ref]; Maxime Coquelin [off-list ref]
....
This patch does help on ARM for small packets such as 64B ones, which actually shows the similarity between x86 and ARM in terms of the caching optimization in this patch.

My estimation is based on:

1. The last patch is for mrg_rxbuf=on, and since you said it helps perf, we can ignore it for now when we discuss mrg_rxbuf=off.
2. Vhost enqueue perf = Ring overhead + Virtio header overhead + Data memcpy overhead.
3. This patch helps small-packet traffic, which means it helps the ring + virtio header operations.
4. So, when you say perf drops when the packet size is larger than 512B, this is most likely caused by memcpy on ARM not working well with this patch.

I'm not saying glibc's memcpy is not good enough; it's just that this is a rather special use case. And since we see specialized memcpy + this patch give significantly better performance than other combinations on x86, we suggest hand-crafting a specialized memcpy for it.

Of course on ARM this is still just my speculation, and we need to either prove it or find the actual root cause.

It would be **REALLY HELPFUL** if you could help test this patch on ARM for the mrg_rxbuf=on cases, to see whether this patch is in fact helpful to ARM at all, since mrg_rxbuf=on is the more widely used case.

Actually it's worse than mrg_rxbuf=off.

I mean compare the perf of original vs. original + patch with mrg_rxbuf turned on. Is there any perf improvement?
Yes, orig + patch + on is better than orig + on, but orig + patch + on is worse than orig + patch + off.
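
For reference, the cost model quoted above (enqueue perf = ring overhead + virtio header overhead + data memcpy overhead) can be sketched roughly as below. This is only an illustration of the reasoning, not code from the patch: the struct, the function names, and the idea of treating memcpy as a simple per-byte cost are all assumptions made for the sketch, and any numbers fed into it would be hypothetical, not measurements from x86 or ARM.

```c
/* Hypothetical per-packet cost model for vhost enqueue.
 * All fields are in arbitrary "cycle" units chosen by the caller;
 * nothing here is measured data or part of the DPDK API. */
struct enqueue_cost {
    double ring;          /* fixed cost: ring (avail/used) operations */
    double virtio_hdr;    /* fixed cost: virtio header handling */
    double copy_per_byte; /* assumed linear memcpy cost per byte */
};

/* Total modeled cost of enqueuing one packet of pkt_len bytes. */
double total_cost(const struct enqueue_cost *c, unsigned int pkt_len)
{
    return c->ring + c->virtio_hdr + c->copy_per_byte * pkt_len;
}

/* Fraction of the total cost spent on the fixed part (ring + header).
 * In this model the fraction shrinks as pkt_len grows, which is why a
 * ring/header optimization would show up mostly on small packets. */
double fixed_share(const struct enqueue_cost *c, unsigned int pkt_len)
{
    return (c->ring + c->virtio_hdr) / total_cost(c, pkt_len);
}
```

With any positive per-byte copy cost, fixed_share() at 64B is larger than at 512B or 1518B, which matches the argument that larger packets are dominated by the memcpy term. The model says nothing about why memcpy itself might behave differently on ARM; that still needs to be measured.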