Thread (37 messages) 37 messages, 7 authors, 2018-07-25

Re: [PATCH 08/21] udl-kms: avoid prefetch

From: Mikulas Patocka <mpatocka@redhat.com>
Date: 2018-06-06 15:46:20
Also in: dri-devel


On Wed, 6 Jun 2018, Alexey Brodkin wrote:
Hi Mikulas,

On Tue, 2018-06-05 at 11:30 -0400, Mikulas Patocka wrote:
quoted
On Tue, 5 Jun 2018, Alexey Brodkin wrote:
quoted
Hi Mikulas,

On Sun, 2018-06-03 at 16:41 +0200, Mikulas Patocka wrote:
quoted
Modern processors can detect linear memory accesses and prefetch data
automatically, so there's no need to use prefetch.
Not each and every CPU that's capable of running Linux has prefetch
functionality :)

Still read-on...
quoted
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/gpu/drm/udl/udl_transfer.c |    7 -------
 1 file changed, 7 deletions(-)

Index: linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c
=================================> > > > --- linux-4.16.12.orig/drivers/gpu/drm/udl/udl_transfer.c	2018-05-31 14:48:12.000000000 +0200
+++ linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c	2018-05-31 14:48:12.000000000 +0200
@@ -13,7 +13,6 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/fb.h>
-#include <linux/prefetch.h>
 #include <asm/unaligned.h>
 
 #include <drm/drmP.h>
@@ -51,9 +50,6 @@ static int udl_trim_hline(const u8 *bbac
 	int start = width;
 	int end = width;
 
-	prefetch((void *) front);
-	prefetch((void *) back);
AFAIK prefetcher fetches new data according to a known history... i.e. based on previously
used pattern we'll trying to get the next batch of data.

But the code above is in the very beginning of the data processing routine where
prefetcher doesn't yet have any history to know what and where to prefetch.

So I'd say this particular usage is good.
At least those prefetches shouldn't hurt because typically it
would be just 1 instruction if those exist or nothing if CPU/compiler doesn't
support it.
See this post https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_444336_&d=DwIBAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=lqdeeSSEes0GFDDl656e
ViXO7breS55ytWkhpk5R81I&m¥RaqJtvajFkM1hL7bOKD5jV7cpFfTvG2Y1cYCdBPd0&s=w0W8wFtAgENp8TE6RzdPGhdKRasJc_otIn08V0EkgrY&e= where they measured that 
prefetch hurts performance. Prefetch shouldn't be used unless you have a 
proof that it improves performance.

The problem is that the prefetch instruction causes stalls in the pipeline 
when it encounters TLB miss and the automatic prefetcher doesn't.
Wow, thanks for the link.
I didn't know about that subtle issue with prefetch instructions on ARM and x86.

So OK in case of UDL these prefetches anyways make not not much sense I guess and there's
something worse still, see what I've got from WandBoard Quad running kmscube [1] application
with help of perf utility:
--------------------------->8-------------------------
# Overhead  Command  Shared Object            Symbol 
# ........  .......  .......................  ........................................
#
    92.93%  kmscube  [kernel.kallsyms]        [k] udl_render_hline
     2.51%  kmscube  [kernel.kallsyms]        [k] __divsi3
     0.33%  kmscube  [kernel.kallsyms]        [k] _raw_spin_unlock_irqrestore
     0.22%  kmscube  [kernel.kallsyms]        [k] lock_acquire
     0.19%  kmscube  [kernel.kallsyms]        [k] _raw_spin_unlock_irq
     0.17%  kmscube  [kernel.kallsyms]        [k] udl_handle_damage
     0.12%  kmscube  [kernel.kallsyms]        [k] v7_dma_clean_range
     0.11%  kmscube  [kernel.kallsyms]        [k] l2c210_clean_range
     0.06%  kmscube  [kernel.kallsyms]        [k] __memzero
--------------------------->8-------------------------

That said it's not even USB 2.0 which is a bottle-neck but
computations in the udl_render_hline().


[1] https://cgit.freedesktop.org/mesa/kmscube/

-Alexey
Try this patch 
http://people.redhat.com/~mpatocka/patches/kernel/udl/udlkms-avoid-division.patch

It is doing a lot of divisions - and WandBoard has Cortex-A9, that doesn't 
have division instruction.

BTW. the framebuffer UDL driver (not the modesetting driver) has 
performance counters in sysfs. Their location depends on the system, you 
can find them with find /sys -name "*metrics*"

The file "metrics_reset" resets the counters, so you can measure if the 
prefetch instructions improve performance or not.

Mikulas
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help