Thread: 141 messages, 7 authors, 2016-10-26

Re: [PATCH v3 0/5] vhost: optimize enqueue

From: Jianbo Liu <hidden>
Date: 2016-09-26 05:39:06

On 26 September 2016 at 13:25, Wang, Zhihong [off-list ref] wrote:
-----Original Message-----
From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
Sent: Monday, September 26, 2016 1:13 PM
To: Wang, Zhihong <redacted>
Cc: Thomas Monjalon <redacted>; dev@dpdk.org; Yuanhan
Liu [off-list ref]; Maxime Coquelin
[off-list ref]
Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

On 25 September 2016 at 13:41, Wang, Zhihong [off-list ref]
wrote:
-----Original Message-----
From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
Sent: Friday, September 23, 2016 9:41 PM
To: Jianbo Liu <redacted>
Cc: dev@dpdk.org; Wang, Zhihong <redacted>; Yuanhan Liu
[off-list ref]; Maxime Coquelin
[off-list ref]
....
This patch does help on ARM for small packets such as 64B ones, which
actually proves the similarity between x86 and ARM in terms of the
caching optimization in this patch.

My estimation is based on:

 1. The last patch is for mrg_rxbuf=on, and since you said it helps
    perf, we can ignore it for now when we discuss mrg_rxbuf=off

 2. Vhost enqueue perf =
    Ring overhead + Virtio header overhead + Data memcpy overhead

 3. This patch helps small packets traffic, which means it helps
    ring + virtio header operations

 4. So, when you say perf drops when the packet size is larger than
    512B, this is most likely caused by memcpy on ARM not working
    well with this patch
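
To make the reasoning in points 2 to 4 concrete, here is a minimal sketch of that cost model in C. The struct, the function names and every cycle count are hypothetical placeholders, not measurements from this thread. The only point is that the fixed per-packet part (ring + virtio header) dominates at 64B, while the per-byte memcpy part dominates at larger sizes.

```c
/* Hypothetical per-packet cost model for vhost enqueue.
 * All names and constants are illustrative assumptions,
 * not values measured on x86 or ARM. */
struct enqueue_cost {
    double ring_cycles;          /* fixed ring overhead per packet */
    double hdr_cycles;           /* fixed virtio header overhead per packet */
    double copy_cycles_per_byte; /* data memcpy cost per payload byte */
};

/* Invented example numbers, for illustration only. */
const struct enqueue_cost example_cost = { 40.0, 20.0, 0.25 };

/* Enqueue cost = ring overhead + virtio header overhead + data memcpy. */
double total_cycles(const struct enqueue_cost *c, unsigned int pkt_len)
{
    return c->ring_cycles + c->hdr_cycles
         + c->copy_cycles_per_byte * pkt_len;
}

/* Fraction of the per-packet cost that is spent in memcpy. */
double copy_share(const struct enqueue_cost *c, unsigned int pkt_len)
{
    return c->copy_cycles_per_byte * pkt_len / total_cycles(c, pkt_len);
}
```

With these made-up numbers, copy_share(&example_cost, 64) is well below one half and copy_share(&example_cost, 1518) is well above it. That is why a patch that only speeds up ring and header handling shows up mostly on small packets, and why a regression that appears only above 512B points at the copy.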

I'm not saying glibc's memcpy is not good enough, it's just that
this is a rather special use case. And since we see specialized
memcpy + this patch give significantly better performance than other
combinations on x86, we suggest hand-crafting a specialized memcpy
for it.
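
As a rough illustration of what "specialized" can mean here, the sketch below copies in fixed 16-byte steps so that each step has a compile-time constant size, and leaves only the tail to a variable-length memcpy. The function name copy_chunked and the 16-byte step are assumptions made for this example. This is not the code in the patch, not rte_memcpy, and not a tuned ARM implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical size-specialized copy, for illustration only.
 * Each 16-byte step has a constant size, which compilers can
 * usually expand inline instead of calling the generic memcpy. */
void copy_chunked(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 16) {
        memcpy(d, s, 16);   /* constant-size copy */
        d += 16;
        s += 16;
        n -= 16;
    }
    if (n > 0)
        memcpy(d, s, n);    /* variable-size tail */
}
```

Whether something like this beats glibc's memcpy on a given ARM core would have to be measured. It only shows the shape of the idea.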

Of course on ARM this is still just my speculation, and we need to
either prove it or find the actual root cause.

It would be **REALLY HELPFUL** if you could help test this patch on
ARM for the mrg_rxbuf=on case to see if this patch is in fact helpful
to ARM at all, since mrg_rxbuf=on is the more widely used case.

Actually it's worse than mrg_rxbuf=off.

I mean compare the perf of original vs. original + patch with
mrg_rxbuf turned on. Is there any perf improvement?

Yes, orig + patch + on is better than orig + on, but orig + patch + on
is worse than orig + patch + off.