Thread (20 messages), 5 authors, 2019-06-04

Re: git archive generates tar with malformed pax extended attribute

From: René Scharfe <hidden>
Date: 2019-05-25 13:27:18

On 24.05.19 at 10:13, Jeff King wrote:
On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:
I can't reproduce on Linux, with either GNU tar (1.30) or bsdtar 3.3.3
(from Debian's bsdtar package). What does your "tar --version" say?
bsdtar 2.8.3 - libarchive 2.8.3
Interesting. I wonder if there was a libarchive bug that was fixed
between 2.8.3 and 3.3.3.
Git does write a pax header with the commit id in it as a comment.
Presumably that's what it's complaining about (but it is not malformed
according to any tar I've tried). If you feed git-archive a tree rather
than a commit, that is omitted. What does:

  git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null

say? If it doesn't complain, then we know it's indeed the pax comment
field.
It also complains

  $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
  tar: Ignoring malformed pax extended attribute
  tar: Error exit delayed from previous errors.
Ah, OK. So it's not the comment field at all, but some other entry.
Some more context: I work at Sourcegraph.com. We mirror a lot of repos
from github.com. We usually interact with a working copy by running
git archive on it in our infrastructure. This is the first repository
that I have noticed which produces this error. An interesting thing to
note is the commit metadata contains a lot of non-ASCII text which was
my guess at what might be tripping up the tar creation.
Yeah, though the only thing that makes it into the tarfile is the actual
tree entries. I'd imagine the file content is not likely to be a source
of problems, as it's common to see binary gunk there. Most of the
filenames are pretty mundane, but this symlink destination is a little
funny:

  $ git archive ... | tar tvf - | grep nicovideo4as.swc
  lrwxrwxrwx root/root         0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024

That's not the full story, though. It is indeed a symlink in the
tree:

  $ git ls-tree -r HEAD libs/nicovideo4as.swc
  120000 blob ec3137b5fcaeae25cf67927068af116517683806	libs/nicovideo4as.swc

But the contents of that blob, which should be the destination filename,
are definitely not:

  $ git cat-file blob ec3137b5f | wc -c
  57804
  $ git cat-file blob ec3137b5f | xxd | head -1
  00000000: 504b 0304 1400 0800 0800 5069 694e 0000  PK........PiiN..

There's quite a bit more data there. And what tar showed us goes up to
the first NUL, which does not seem surprising.
That (the symlink target) is a ZIP file with the following contents:

 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   39733  Defl:N     3403  91% 2019-03-09 13:10 489e1be1  catalog.xml
   54131  Defl:N    54151   0% 2019-03-09 13:10 32f57322  library.swf
--------          -------  ---                            -------
   93864            57554  39%                            2 files

And link targets longer than 100 characters are encoded in an extended
Pax header.

(Usually symlink targets are paths, not file contents.)
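
To illustrate what such a header entry looks like: a pax extended header
record has the form "<length> <key>=<value>\n", where <length> is the
decimal size of the whole record, including the length digits themselves.
The following is a minimal Python sketch of that encoding, not Git's actual
code; the function name pax_record is made up for this example.

```python
def pax_record(key: bytes, value: bytes) -> bytes:
    """Build one pax extended header record: b"<len> <key>=<value>\\n".

    <len> counts the entire record, including its own digits, so it is
    computed by iterating until the value stops changing.
    """
    body = b" " + key + b"=" + value + b"\n"
    length = len(body)
    while True:
        total = len(str(length)) + len(body)
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body


# The value is stored as raw bytes, so an embedded NUL survives intact.
record = pax_record(b"linkpath", b"PK\x03\x04\x14\x00\x08\x00")
print(record)
```

Because the record carries an explicit length, nothing in the format itself
forces the value to end at a NUL byte.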
It's possible Git is doing the wrong thing on the writing side, but
given that newer versions of bsdtar handle it fine, I'd guess that the
old one simply had problems consuming poorly formed symlink filenames.
Git preserves symlink targets with embedded NULs in the repository and
in generated tar files.  Not sure if GNU tar and bsdtar truncating them
at the first NUL is a bug.  I'm also not sure if there is a platform
that would allow creating such a symlink in the file system, or how one
is supposed to use it.
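
As a rough illustration of the truncation, here is a Python sketch of what a
reader sees if it treats the target as a NUL-terminated C string. The helper
name c_string_view is hypothetical; the bytes are the first 16 bytes of the
blob from the xxd output quoted above.

```python
def c_string_view(target: bytes) -> bytes:
    """Return the part of target before the first NUL byte, if any."""
    return target.split(b"\0", 1)[0]


# First line of the xxd dump of blob ec3137b5f.
blob_start = bytes.fromhex("504b 0304 1400 0800 0800 5069 694e 0000")

truncated = c_string_view(blob_start)
print(truncated)
```

The five bytes that remain are the PK\003\004\024 that tar printed as the
link destination.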

We could truncate symlink targets at the first NUL as well in git
archive -- but that would be a bit sad, as the archive formats allow
storing the "real" target from the repo, with NUL and all.  We could
make git fsck report such symlinks.
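
A check along those lines could be sketched outside of Git as follows. This
is only an illustration in Python, not a proposal for the fsck code itself.
It assumes the input is the output of "git ls-tree -r -z <tree>" and that
read_blob is a caller-supplied function (hypothetical here) that returns the
contents of a blob, for example by running "git cat-file blob <id>".

```python
from typing import Callable, Iterator


def symlinks_with_nul(ls_tree_z: bytes,
                      read_blob: Callable[[str], bytes]) -> Iterator[str]:
    """Yield paths of symlink entries whose target contains a NUL byte.

    Each entry in "git ls-tree -r -z" output has the form
    b"<mode> <type> <object>\\t<path>" and is terminated by NUL.
    """
    for entry in ls_tree_z.split(b"\0"):
        if not entry:
            continue
        meta, path = entry.split(b"\t", 1)
        mode, _type, oid = meta.split(b" ")
        # Mode 120000 marks a symlink; its blob holds the link target.
        if mode == b"120000" and b"\0" in read_blob(oid.decode("ascii")):
            yield path.decode("utf-8", "surrogateescape")


# Made-up entries and blob contents, for demonstration only.
listing = (b"120000 blob aaa\tlibs/x.swc\0"
           b"100644 blob bbb\tREADME\0"
           b"120000 blob ccc\tlink\0")
blobs = {"aaa": b"PK\x03\x04\x14\x00", "bbb": b"hi\x00", "ccc": b"target"}
flagged = list(symlinks_with_nul(listing, blobs.__getitem__))
print(flagged)
```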

Can Unicode symlink targets contain NULs?  We wouldn't want to damage
them even if we decide to truncate.
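
One data point on that question, as a small Python sketch (the helper name
has_nul and the sample target are made up): UTF-8 produces a zero byte only
for the character U+0000 itself, while UTF-16 produces zero bytes for every
ASCII character. Whether any platform actually stores symlink targets in
such an encoding is left open here.

```python
def has_nul(text: str, encoding: str) -> bool:
    """Report whether text, encoded with encoding, contains a zero byte."""
    return b"\0" in text.encode(encoding)


target = "libs/ニコ.swc"
print(has_nul(target, "utf-8"))      # no zero bytes in UTF-8
print(has_nul(target, "utf-16-le"))  # ASCII characters carry a zero byte
```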

René