Re: [PATCH v2 03/13] sh: remove unaligned access for sh4a

From: Arnd Bergmann <arnd@kernel.org>
Date: 2021-05-15 20:11:44
Also in: linux-arch, lkml

On Sat, May 15, 2021 at 5:36 PM John Paul Adrian Glaubitz
[off-list ref] wrote:

On 5/14/21 2:22 PM, Arnd Bergmann wrote:

quoted

My Renesas SH4-Boards actually run an sh4a-Kernel, not an sh4-Kernel:

root@tirpitz:~> uname -a
Linux tirpitz 5.11.0-rc4-00012-g10c03c5bf422 #161 PREEMPT Mon Jan 18 21:10:17 CET 2021 sh4a GNU/Linux
root@tirpitz:~>

So, if this change reduces performance on sh4a, I would rather not merge it.

It only makes a difference in very specific scenarios in which unaligned
accesses are done in a fast path, e.g. when forwarding network packet
at a high rate on a big-endian kernel (little-endian kernels wouldn't run into
this on IP headers). If you have a use case for this machine on which the
you can show a performance regression, I can add a patch on top to put
the optimized sh4a get_unaligned_le32() back. Dropping this patch
altogether would make the series much more complex because most of
the associated code gets removed in the end.

Hmm, okay. But why does code which sits below arch/sh have to be removed anyway?

I don't fully understand why it poses any maintenance burden/

What  I'm removing is the part that lets architectures override the
generic version.

quoted

As I mentioned, supporting "movua" in the compiler likely has a much
larger impact on performance, as it would also help in user space, and
it should improve the networking case on little-endian kernels by replacing
the four separate byte loads/shift pairs with a movua plus a byteswap.

The problem is that - at least in Debian - we use the sh4 baseline while the kernel
supports both sh4 and sh4a, so we can't use any of these instructions in userland at
the moment.

I tried building an sh7785lcr_defconfig with and without the patch,
and found that
the only affected files are:

- in-kernel nfs client
- crc32c/sha1/sha256 hash functions
- device probing for libata, scsi-core, scsi-disk, hid, r8168
  (should not matter after boot)
- msdos partition parsing

Any nfs client performance difference is probably not even measurable even
at gigabit ethernet speed.
I see that the hash functions are notably different, but I don't know if the
output from the new generic code is actually better or worse than the
original. If you do think this is important, please try the version from

https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git
unaligned-sh4a

against the version without the last change in that series. If you can find
a relevant test case that exercises it, you may want to add a custom
implementation of the hash functions as well.

       Arnd

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help