Re: [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list

From: Jeff King <hidden>
Date: 2019-10-31 19:26:53

On Thu, Oct 31, 2019 at 11:38:14AM +0100, Johannes Schindelin wrote:

> On Thu, 31 Oct 2019, Doan Tran Cong Danh wrote:
>
> > On musl libc, ISO-2022-JP encoder is too eager to switch back to
> > 1 byte encoding, musl's iconv always switch back after every combining
> > character. Comparing glibc and musl's output for this command
> > $ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
> > 	iconv -f utf-8 -t ISO-2022-JP | xxd
> >
> > glibc:
> > 00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
> > 00000010: 0a                                       .
> >
> > musl:
> > 00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
> > 00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
> > 00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.
> >
> > Although musl iconv's output isn't optimal, it's still correct.
> >
> > From commit 7d509878b8, ("pretty.c: format string with truncate respects
> > logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
> > first, then format it and convert the message to the actual output
> > encoding on git commit --squash.
> >
> > Thus, t3900 is failing on Linux with musl libc.
> >
> > Reencode to utf-8 before arranging rebase's todo list.
>
> Since the re-encoded commit messages are only used for figuring out the
> relationships between the `fixup!`/`squash!` commits and their targets,
> but are not used in any of the lines that are written out, this change
> looks good to me.

I'm confused about a few things here, though. I agree with you that the
subjects here are only used for finding the fixup/squash relationships.
But I don't understand the musl connection.

Wouldn't failure to reencode here always be a problem? E.g., if I do:

  for encoding in utf-8 iso-8859-1; do
    # commit using the encoding
    echo $encoding >file && git add file
    echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
      git -c i18n.commitEncoding=$encoding commit -F -
    # and then fixup without it
    echo "$encoding fixed" >file && git add file
    git commit --fixup HEAD
  done
  
  GIT_EDITOR='echo; grep -v ^#' git rebase -i --root --autosquash

then the resulting todo-list output (on my glibc system) is:

  pick 3a5bace éñcödèd with utf-8
  fixup aa9f09c fixup! éñcödèd with utf-8
  pick 6e85d32 éñcödèd with iso-8859-1
  pick 3ceac05 fixup! éñcödèd with iso-8859-1

I.e., we don't actually match up the second pair, and I think we
probably ought to.
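
To illustrate why the second pair is missed, here is a minimal sketch that uses only iconv and od, not git itself. It assumes the fixup commit's message ends up in UTF-8 (it was made without i18n.commitEncoding), and the helper names hex, stored_bytes and wanted_bytes are made up for the illustration.

```shell
#!/bin/sh
# Hex-dump stdin so byte strings can be compared safely in any locale.
hex() { od -An -tx1 | tr -d ' \n'; }

subject='éñcödèd with iso-8859-1'

# Subject bytes as stored by the commit made with i18n.commitEncoding=iso-8859-1.
stored_bytes() { printf '%s' "$subject" | iconv -f utf-8 -t iso-8859-1; }
# Target bytes named by the "fixup!" subject (assumed here to be UTF-8).
wanted_bytes() { printf '%s' "$subject"; }

# Comparing the raw buffers, with no reencoding.
[ "$(stored_bytes | hex)" = "$(wanted_bytes | hex)" ] && raw=match || raw=mismatch

# Comparing after reencoding the stored subject from its declared encoding.
[ "$(stored_bytes | iconv -f iso-8859-1 -t utf-8 | hex)" = "$(wanted_bytes | hex)" ] &&
	reencoded=match || reencoded=mismatch

echo "raw: $raw, reencoded: $reencoded"
```

On a system whose iconv supports both encodings, this should report a mismatch for the raw bytes and a match once the stored subject has been reencoded to UTF-8.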

I guess the test in t3900 is less exotic; it uses the same encoding for
both commits. And it's just that "foo" and "fixup! foo" can (and do in
musl) end up encoded differently (because of the specific language,
and the vagaries of each iconv implementation).
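
A rough sketch of that effect, using the byte sequences from the two hex dumps quoted earlier in this message: the raw ISO-2022-JP bytes differ, but both decode to the same UTF-8 text. The function names are made up for the illustration, and it assumes the local iconv knows ISO-2022-JP.

```shell
#!/bin/sh
# Hex-dump stdin so byte strings can be compared safely in any locale.
hex() { od -An -tx1 | tr -d ' \n'; }

# Bytes copied from the glibc and musl hex dumps. Octal escapes are used
# because \x escapes are not portable in printf.
glibc_bytes() { printf '\033$B$O$l$R$[$U\033(B\n'; }
musl_bytes() {
	printf '\033$B$O\033(B\033$B$l\033(B\033$B$R\033(B'
	printf '\033$B$[\033(B\033$B$U\033(B\n'
}

# Reencode stdin from ISO-2022-JP to UTF-8 and hex-dump the result.
to_utf8() { iconv -f ISO-2022-JP -t UTF-8 | hex; }

[ "$(glibc_bytes | hex)" = "$(musl_bytes | hex)" ] && raw=same || raw=differ
[ "$(glibc_bytes | to_utf8)" = "$(musl_bytes | to_utf8)" ] && utf8=same || utf8=differ

echo "raw ISO-2022-JP bytes: $raw; after reencoding to UTF-8: $utf8"
```

So a byte-wise comparison of subjects only becomes reliable after both sides have been reencoded to a common encoding.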

Would we have similar problems in all of the other functions which use
get_commit_buffer() without reencoding? For instance if I do this:

  echo base >file && git add file && git commit -m base
  for encoding in utf-8 iso-8859-1; do
    echo $encoding >file && git add file
    echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
      git -c i18n.commitEncoding=$encoding commit -F -
  done
  git checkout -b side HEAD~2
  git cherry-pick master master^
  cat .git/sequencer/todo

then the resulting todo file has a mix of iso-8859-1 and utf-8.

It seems to me that we should always be working with the subjects in a
single encoding internally, and likewise outputting in that format
(which should probably be git_log_output_encoding(), for the instances
where we show it to the user).

I.e., we should always call logmsg_reencode() instead of
get_commit_buffer().

-Peff
