Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list

From: Jeff King <hidden>
Date: 2019-11-05 08:00:13

On Sat, Nov 02, 2019 at 08:02:15AM +0700, Danh Doan wrote:

> Anyway, if we're going to work with a single encoding internally,
> can we take the other extreme approach: reencode the commit message to
> utf-8 before writing the commit object? (Is there any codepoint in
> another encoding that can't be reencoded to utf-8?)

That's normally what we do. The only cases we're covering here are when
somebody has explicitly asked that the commit object be stored in
another encoding. Presumably they'd also be using a matching
i18n.logOutputEncoding in that case, in which case logmsg_reencode()
would be a noop. I think the only reasons to do that are:

  1. You're stuck on some legacy encoding for your terminal. But in that
     case, I think you'd still be better off storing utf-8 and
     translating on the fly, since whatever encoding you do store is
     baked into your objects for all time (so accept some slowness now,
     but eventually move to utf-8).

  2. Your preferred language is bigger in utf-8 than in some specific
     encoding, and you'd rather save some bytes. I'm not sure how big a
     deal this is, given that commit messages don't tend to be that big
     in the first place (compared to trees and blobs). And the zlib
     deflation on the result might help remove some of the redundancy,
     too.
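To put rough numbers on point 2, here is a minimal sketch that compares the size of one short message in utf-8 and in two legacy Japanese encodings, before and after zlib deflation. The sample message is made up for illustration, and Python's zlib module only stands in for the deflation Git applies to objects; this is not a measurement of real repositories.

```python
import zlib

# Hypothetical commit message, chosen only for illustration.
message = "日本語のコミットメッセージ: 文字コードの変換を修正する\n"

# The same text in utf-8 and in two legacy Japanese encodings.
utf8 = message.encode("utf-8")
sjis = message.encode("shift_jis")
eucjp = message.encode("euc_jp")

# Print raw and deflated sizes side by side for each encoding.
for name, raw in (("utf-8", utf8), ("shift_jis", sjis), ("euc_jp", eucjp)):
    deflated = zlib.compress(raw)
    print(f"{name:10} raw={len(raw):3} deflated={len(deflated):3}")
```

Kana and kanji take three bytes each in utf-8 and two in Shift_JIS or EUC-JP, so the raw legacy forms are smaller. On a message this short, the fixed overhead of the deflate stream can outweigh any savings, so the deflated column is mainly informative for longer messages.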

So I'd actually expect very few people to be using this feature at all
these days (which is part of why I would not be all that broken up if we
just fix the test and move on, if nobody is reporting real-world
problems).

> git-log and friends currently do a two-step conversion of the commit
> message (reencode to utf-8 first, then reencode again to
> get_log_output_encoding()). With this new approach, the first step is
> likely a noop (but must be kept for backward compatibility).

Interesting. Traditionally we did a single step conversion to the output
format, and it looks like most output formats still do that (i.e.,
everything in pretty_print_commit() except FMT_USERFORMAT, which is what
powers "--pretty=format:%s", etc).

The two-part user-format thing goes back to 7e77df39bf (pretty: two
phase conversion for non utf-8 commits, 2013-04-19). It does seem like
it would be cheaper to convert the format string into the output
encoding (it would need to be an ascii superset, but that's already the
case, since we expect to parse "author", etc out of the re-encoded
commit object). But again, I have trouble caring too much about the
performance of this case, as I consider it to be mostly legacy at this
point. But I also don't write in (say) Japanese, so maybe I'm being too
narrow-minded about whether people really want to avoid utf-8.
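As a toy model of the difference (not Git's actual code), the sketch below expands a "%s" placeholder two ways: once via utf-8 as an intermediate, and once by converting the format string itself into the output encoding. The helper names reencode(), subject(), two_phase() and single_step() are hypothetical, only "%s" is handled, and the sample message and format are made up.

```python
def reencode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes between encodings (stand-in for an iconv call)."""
    return data.decode(src).encode(dst)


def subject(raw_message: bytes) -> bytes:
    """Return the first line; byte-wise splitting assumes an ASCII superset."""
    return raw_message.split(b"\n", 1)[0]


def two_phase(raw: bytes, commit_enc: str, fmt_utf8: bytes, out_enc: str) -> bytes:
    # Phase 1: bring the commit message into utf-8 and expand the format there.
    utf8_msg = reencode(raw, commit_enc, "utf-8")
    expanded = fmt_utf8.replace(b"%s", subject(utf8_msg))
    # Phase 2: convert the expanded result into the output encoding.
    return reencode(expanded, "utf-8", out_enc)


def single_step(raw: bytes, commit_enc: str, fmt_utf8: bytes, out_enc: str) -> bytes:
    # Convert the message and the format string straight to the output encoding.
    out_msg = reencode(raw, commit_enc, out_enc)
    out_fmt = reencode(fmt_utf8, "utf-8", out_enc)
    return out_fmt.replace(b"%s", subject(out_msg))


# Hypothetical commit stored in EUC-JP, shown with a matching output encoding.
raw_commit = "変換を修正する\n\n本文\n".encode("euc_jp")
fmt = "件名: %s".encode("utf-8")

a = two_phase(raw_commit, "euc_jp", fmt, "euc_jp")
b = single_step(raw_commit, "euc_jp", fmt, "euc_jp")
print(a == b)
```

In the single-step variant the message itself never passes through utf-8; only the (usually short) format string is converted, which is where the potential saving would come from.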

-Peff
