Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8
From: Joe Perches <hidden>
Date: 2018-07-25 15:34:00
Also in: linux-arm-kernel, linux-crypto, linux-devicetree, linux-iio, linux-pm, linux-wireless, linuxppc-dev, lkml, lvs-devel, netfilter-devel
Subsystem: checkpatch, the rest
Maintainers: Andy Whitcroft, Joe Perches, Linus Torvalds

On Wed, 2018-07-25 at 15:12 +0200, Arnd Bergmann wrote:
> On Wed, Jul 25, 2018 at 2:55 AM, Andrew Morton [off-list ref] wrote:
> > On Tue, 24 Jul 2018 17:13:20 -0700 Joe Perches [off-list ref] wrote:
> > > On Tue, 2018-07-24 at 14:00 -0700, Andrew Morton wrote:
> > > > On Tue, 24 Jul 2018 13:13:25 +0200 Arnd Bergmann [off-list ref] wrote:
> > > > > Almost all files in the kernel are either plain text or UTF-8
> > > > > encoded. A couple however are ISO_8859-1, usually just a few
> > > > > characters in C comments, for historic reasons.
> > > > > This converts them all to UTF-8 for consistency.
> > > []
> > > > Will we be getting a checkpatch rule to keep things this way?
> > >
> > > How would that be done?
> >
> > I'm using this, seems to work.
> >
> > 	if ! file $p | grep -q -P ", ASCII text|, UTF-8 Unicode text"
> > 	then
> > 		echo $p: weird charset
> > 	fi
>
> There are a couple of files that my version of 'file' incorrectly
> identified as something completely different, like:
>
> Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt: SemOne archive data
> Documentation/devicetree/bindings/rtc/epson,rtc7301.txt: Microsoft Document Imaging Format
> Documentation/filesystems/nfs/pnfs-block-server.txt: PPMN archive data
> arch/arm/boot/dts/bcm283x-rpi-usb-host.dtsi: Sendmail frozen configuration - version = "host";
> Documentation/networking/segmentation-offloads.txt: StuffIt Deluxe Segment (data) : gmentation Offloads in the Linux Networking Stack
> arch/sparc/include/asm/visasm.h: SAS 7+
> arch/xtensa/kernel/setup.c: , init=0x454c, stat=0x090a, dev=0x2009, bas=0x2020
> drivers/cpufreq/powernow-k8.c: TI-XX Graphing Calculator (FLASH)
> tools/testing/selftests/net/forwarding/tc_shblocks.sh: Minix filesystem, V2 (big endian)
> tools/perf/tests/.gitignore: LLVM byte-codes, uncompressed
>
> All of the above seem to be valid ASCII or UTF-8 files, so the check
> above will lead to false-positives, but it may be good enough as they
> are the exception, and may be bugs in 'file'.
>
> Not sure if we need to worry about 'file' not being installed.

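The file(1)-based test quoted above depends on that tool's content-type heuristics, which is where the false positives in the list come from. One possible alternative, sketched here as an assumption and not something proposed in the thread, is to ask iconv(1) whether each file decodes as UTF-8. The helper name is_valid_utf8 is hypothetical, and the sketch assumes an iconv that exits non-zero on an invalid input sequence.

```shell
#!/bin/sh
# Sketch only: is_valid_utf8 is a hypothetical helper, not an existing tool.
# Plain ASCII is a subset of UTF-8, so ASCII-only files pass as well.
is_valid_utf8() {
    # A UTF-8 to UTF-8 conversion fails on any byte sequence that is
    # not valid UTF-8; only the exit status is of interest here.
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Report every file named on the command line that fails the check.
for p in "$@"
do
    if ! is_valid_utf8 "$p"
    then
        echo "$p: not valid UTF-8"
    fi
done
```

Saved under a name of your choosing, it could be run as `sh check-utf8.sh $(git ls-files)`. This only swaps the dependency on 'file' for one on 'iconv', and it does not tell ISO_8859-1 apart from any other non-UTF-8 encoding.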
checkpatch works on patches so I think the test isn't really relevant.
It has to use the appropriate email header that sets the charset.

perhaps:
---
 scripts/checkpatch.pl | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 34e4683de7a3..57355fbd2d28 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2765,9 +2765,13 @@ sub process {
 # Check if there is UTF-8 in a commit log when a mail header has explicitly
 # declined it, i.e defined some charset where it is missing.
 		if ($in_header_lines &&
-		    $rawline =~ /^Content-Type:.+charset="(.+)".*$/ &&
-		    $1 !~ /utf-8/i) {
-			$non_utf8_charset = 1;
+		    $rawline =~ /^Content-Type:.+charset="?([^\s;"]+)/) {
+			my $charset = $1;
+			$non_utf8_charset = 1 if ($charset !~ /^utf-8$/i);
+			if ($charset !~ /^(?:us-ascii|utf-8|iso-8859-1)$/) {
+				WARN("PATCH_CHARSET",
+				     "Unpreferred email header charset '$charset'\n" . $herecurr);
+			}
 		}
 
 		if ($in_commit_log && $non_utf8_charset && $realfile =~ /^$/ &&
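
To see how the header match in the patch above treats a few sample Content-Type lines, its logic can be approximated outside checkpatch. The following is a rough shell sketch, not the checkpatch code itself: classify_charset and its output strings are hypothetical, and the sed expression only imitates the Perl regex. One deliberate difference is that the sketch lowercases the value before comparing. As posted, the second match in the patch has no `i` modifier, so a header that spells the charset as `UTF-8` would appear to trigger the warning.

```shell
#!/bin/sh
# Sketch only: classify_charset is a hypothetical name, not part of checkpatch.
# Output: "none" (no charset found), "ok" (utf-8), "non-utf8" (accepted, but
# the non-UTF-8 flag would be set), or "unpreferred: <charset>" (flag + warning).
classify_charset() {
    # Extract the charset value; the surrounding double quotes are optional.
    # Lowercasing is an assumption about the intent of the patch.
    charset=$(printf '%s\n' "$1" |
        sed -n 's/^Content-Type:.\{1,\}charset="\{0,1\}\([^[:space:];"]\{1,\}\).*/\1/p' |
        tr '[:upper:]' '[:lower:]')

    case "$charset" in
        '')                  echo "none" ;;
        utf-8)               echo "ok" ;;
        us-ascii|iso-8859-1) echo "non-utf8" ;;
        *)                   echo "unpreferred: $charset" ;;
    esac
}

classify_charset 'Content-Type: text/plain; charset="UTF-8"'
classify_charset 'Content-Type: text/plain; charset=iso-8859-1'
classify_charset 'Content-Type: text/plain; charset=windows-1252; format=flowed'
```

The three sample headers exercise the quoted form, the unquoted form, and a value followed by a further parameter. The old regex handled only the first of these, because it required double quotes around the value.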