Re: [PATCH v2 03/13] sh: remove unaligned access for sh4a
From: Arnd Bergmann <arnd@kernel.org>
Date: 2021-05-15 20:11:44
Also in:
linux-arch, lkml
On Sat, May 15, 2021 at 5:36 PM John Paul Adrian Glaubitz [off-list ref] wrote:
On 5/14/21 2:22 PM, Arnd Bergmann wrote:quoted
quoted
My Renesas SH4-Boards actually run an sh4a-Kernel, not an sh4-Kernel: root@tirpitz:~> uname -a Linux tirpitz 5.11.0-rc4-00012-g10c03c5bf422 #161 PREEMPT Mon Jan 18 21:10:17 CET 2021 sh4a GNU/Linux root@tirpitz:~> So, if this change reduces performance on sh4a, I would rather not merge it.It only makes a difference in very specific scenarios in which unaligned accesses are done in a fast path, e.g. when forwarding network packet at a high rate on a big-endian kernel (little-endian kernels wouldn't run into this on IP headers). If you have a use case for this machine on which the you can show a performance regression, I can add a patch on top to put the optimized sh4a get_unaligned_le32() back. Dropping this patch altogether would make the series much more complex because most of the associated code gets removed in the end.Hmm, okay. But why does code which sits below arch/sh have to be removed anyway? I don't fully understand why it poses any maintenance burden/
What I'm removing is the part that lets architectures override the generic version.
quoted
As I mentioned, supporting "movua" in the compiler likely has a much larger impact on performance, as it would also help in user space, and it should improve the networking case on little-endian kernels by replacing the four separate byte loads/shift pairs with a movua plus a byteswap.The problem is that - at least in Debian - we use the sh4 baseline while the kernel supports both sh4 and sh4a, so we can't use any of these instructions in userland at the moment.
I tried building an sh7785lcr_defconfig with and without the patch, and found that the only affected files are: - in-kernel nfs client - crc32c/sha1/sha256 hash functions - device probing for libata, scsi-core, scsi-disk, hid, r8168 (should not matter after boot) - msdos partition parsing Any nfs client performance difference is probably not even measurable even at gigabit ethernet speed. I see that the hash functions are notably different, but I don't know if the output from the new generic code is actually better or worse than the original. If you do think this is important, please try the version from https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-sh4a against the version without the last change in that series. If you can find a relevant test case that exercises it, you may want to add a custom implementation of the hash functions as well. Arnd