[RFC] arm: use built-in byte swap function
From: nico@fluxnic.net (Nicolas Pitre)
Date: 2013-02-20 03:17:55
Also in:
lkml
On Tue, 19 Feb 2013, Kim Phillips wrote:
On Fri, 8 Feb 2013 22:16:47 -0500 Nicolas Pitre [off-list ref] wrote:quoted
Not only that, but in many cases the results are wildly different given the same config:quoted
imx_v6_v7_defconfig: 7637605 7636935 -670 lart_defconfig: 2922550 2926600 4050 mxs_defconfig: 11071139 11070893 -246The mxs_defconfig became much better while lart_defconfig regressed a lot.quoted
Haven't looked at why.Would be a good idea since this is rather weird and gcc could benefit from your findings.The following is next-20130207 built with Linaro gcc 4.7.1 [1], and before and after the diff at the bottom of this email (and with normalized linux version string sizes): lart_defconfig: 2752106 120864 56444 2929414 2cb306 vmlinux lart_defconfig: 2756092 120864 56444 2933400 2cc298 vmlinux #builtin-bswap mxs_defconfig: 5229115 280572 5569648 11079335 a90ea7 vmlinux mxs_defconfig: 5228969 280552 5569648 11079169 a90e01 vmlinux #builtin-bswap imx_v6_v7_defconfig: 6935025 356172 360648 7651845 74c205 vmlinux imx_v6_v7_defconfig: 6934091 356180 360648 7650919 74be67 vmlinux #builtin-bswap so builtin-bswap improved mxs and imx_v6_v7 but in lart, it _added_ 3986 bytes to .text -> not good. Getting a closer look at lart, bloat-o-meter says the code actually shrunk: add/remove: 7/1 grow/shrink: 11/19 up/down: 298/-356 (-58) function old new delta inet_abc_len - 96 +96 __bswapdi2 - 52 +52 __bswapsi2 - 32 +32 icmp_unreach 472 492 +20 xfrm_selector_match 988 1000 +12 fib_table_insert 2176 2188 +12 __kstrtab___bswapsi2 - 11 +11 __kstrtab___bswapdi2 - 11 +11 __ksymtab___bswapsi2 - 8 +8 __ksymtab___bswapdi2 - 8 +8 vermagic 51 57 +6 linux_banner 230 236 +6 xfrm_replay_check_esn 320 324 +4 xfrm_replay_check_bmp 200 204 +4 xfrm_replay_check 152 156 +4 static.tcp_parse_aligned_timestamp 80 84 +4 fib_table_delete 708 712 +4 cookie_v4_check 1316 1320 +4 tcp_tso_segment 728 724 -4 tcp_options_write 724 720 -4 ip_rt_ioctl 1152 1148 -4 fib_trie_seq_show 724 720 -4 crc32_be 448 444 -4 xfrm_stateonly_find 640 632 -8 tcp_finish_connect 276 268 -8 static.tcp_v4_send_ack 480 472 -8 __xfrm_state_lookup 356 348 -8 __xfrm_state_bump_genids 436 428 -8 __find_acq_core 1256 1248 -8 cookie_v4_init_sequence 272 260 -12 __xfrm_state_insert 616 600 -16 sys_swapon 2500 2480 -20 xfrm_state_find 2420 2396 -24 xfrm_hash_resize 1620 1596 -24 fib_route_seq_show 560 536 -24 fib_table_dump 704 676 -28 devinet_ioctl 1856 1796 -60 static.inet_abc_len 80 - -80 Comparing System.maps, .rodata starts at the same address: c020a000 R __start_rodata c020a000 R __start_rodata #builtin-bswap however, changes including the __bswap[sd]i2 implementation pushes the .rodata section size just over the 4KiB alignment boundary specified in arm/kernel/vmlinux.lds: no builtin_bswap: c028ffc4 R __stop___modver c0290000 R __end_rodata c0290000 R __start___ex_table with builtin_bswap: c0290068 R __stop___modver c0291000 R __end_rodata c0291000 R __start___ex_table So, AFAICT, that's why we see a total increase in .text for lart, and, looking at both numbers being a little less than 4KiB, I suspect the same with whatever happened with mxs above.
OK. At least we do have a plausible explanation now. The actual code being smaller should compensate for section alignment loss.
ok, so to avoid recursion, I've enforced a -O2 on bswapsdi2.o.
Not only recursion, but the horrible assembly output from -Os.
quoted hunk ↗ jump to hunk
Here's the new diff: changes from last diff: - enforce -O2 for bswapsdi2.o - fix building out-of-source treediff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 4265a26..5e8b735 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig@@ -58,6 +58,7 @@ config ARM select CLONE_BACKWARDS select OLD_SIGSUSPEND3 select OLD_SIGACTION + select ARCH_USE_BUILTIN_BSWAP help The ARM series is a line of low-power-consumption RISC chip designs licensed by ARM Ltd and targeted at embedded applications anddiff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile index c9865f6..8ef97c4 100644 --- a/arch/arm/boot/compressed/Makefile +++ b/arch/arm/boot/compressed/Makefile@@ -111,12 +111,12 @@ endif targets := vmlinux vmlinux.lds \ piggy.$(suffix_y) piggy.$(suffix_y).o \ - lib1funcs.o lib1funcs.S ashldi3.o ashldi3.S \ + lib1funcs.o lib1funcs.S ashldi3.o ashldi3.S bswapsdi2.o \ font.o font.c head.o misc.o $(OBJS) # Make sure files are removed during clean extra-y += piggy.gzip piggy.lzo piggy.lzma piggy.xzkern \ - lib1funcs.S ashldi3.S $(libfdt) $(libfdt_hdrs) + lib1funcs.S ashldi3.S bswapsdi2.o $(libfdt) $(libfdt_hdrs) ifeq ($(CONFIG_FUNCTION_TRACER),y) ORIG_CFLAGS := $(KBUILD_CFLAGS)@@ -158,6 +158,12 @@ ashldi3 = $(obj)/ashldi3.o $(obj)/ashldi3.S: $(srctree)/arch/$(SRCARCH)/lib/ashldi3.S $(call cmd,shipped) +# For __bswapsi2, __bswapdi2 +bswapsdi2 = $(obj)/bswapsdi2.o + +$(obj)/bswapsdi2.o: $(obj)/../../../../arch/$(SRCARCH)/lib/bswapsdi2.o + $(call cmd,shipped) +
I don't think you can get away with this. The decompressor code is compiled with -fpic and the main kernel is not. Most toolchains do mark object files with some flags to prevent the link of incompatible objects together (normally pic and non pic objects are not compatible even if in this very simple case that would not matter). Maybe you are able to link zImage successfully simply because no references to __bswap* needed to be resolved and therefore the linker didn't need to search/include that object?
quoted hunk ↗ jump to hunk
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile index af72969..dbee639 100644 --- a/arch/arm/lib/Makefile +++ b/arch/arm/lib/Makefile@@ -13,7 +13,7 @@ lib-y := backtrace.o changebit.o csumipv6.o csumpartial.o \ ashldi3.o ashrdi3.o lshrdi3.o muldi3.o \ ucmpdi2.o lib1funcs.o div64.o \ io-readsb.o io-writesb.o io-readsl.o io-writesl.o \ - call_with_stack.o + call_with_stack.o bswapsdi2.o mmu-y := clear_user.o copy_page.o getuser.o putuser.o@@ -45,3 +45,5 @@ lib-$(CONFIG_ARCH_SHARK) += io-shark.o $(obj)/csumpartialcopy.o: $(obj)/csumpartialcopygeneric.S $(obj)/csumpartialcopyuser.o: $(obj)/csumpartialcopygeneric.S + +CFLAGS_bswapsdi2.o = -O2
Please insert a small comment to explain why this is done. Adding some more elaborate explanation in the commit log would be good too. Nicolas