Thread (28 messages) 28 messages, 4 authors, 2017-09-08

Re: [PATCH 08/10] files_ref_store: use a transaction to update packed refs

From: Michael Haggerty <hidden>
Date: 2017-09-08 12:45:06

On 09/08/2017 09:38 AM, Jeff King wrote:
On Tue, Aug 29, 2017 at 10:20:32AM +0200, Michael Haggerty wrote:
quoted
First, the old code didn't obtain the `packed-refs` lock until
`files_transaction_finish()`. This means that a failure to acquire the
`packed-refs` lock (e.g., due to contention with another process)
wasn't detected until it was too late (problems like this are supposed
to be detected in the "prepare" phase). The new code acquires the
`packed-refs` lock in `files_transaction_prepare()`, the same stage of
the processing when the loose reference locks are being acquired,
removing another reason why the "prepare" phase might succeed and the
"finish" phase might nevertheless fail.
That means we're holding the packed-refs lock for a slightly longer
period. I think this could mean worse lock contention between otherwise
unrelated transactions over the packed-refs file. I wonder if the
lock-retry timeout might need to be increased to accommodate this. On
the other hand, it looks like we take it after getting the individual
locks, which I'd think would be the expensive part.
That was my thinking, yes. While the packed-refs lock is held, the
references being created/updated have their reflogs written and are
renamed into place. I don't see how that can be shortened without
compromising on correctness (in particular, that we want to process
creates/updates before deletions to try to preserve reachability as much
as possible during the transaction). As an added optimization, the
packed-refs lock is not acquired at all if no references are being deleted.
Are there other callers who take the packed-refs and individual locks in
the reverse order? I think git-pack-refs might. Could we "deadlock" with
pack-refs? There are timeouts so it would resolve itself quickly, but I
wonder if we'd have a case that would deadlock-until-timeout that
otherwise could succeed.
That's not true. `files_pack_refs()` does the following:

1. Lock the `packed-refs` file.

2. Start a packed ref-store transaction.

3. Iterate over the loose ref cache. For each reference that should
   be packed:
   * add it to the packed-refs transaction as an update to set it
     to the loose value (without specifying an old_sha1)
   * if pruning is on, schedule the loose reference to be pruned
     (in an internal data structure, without locking the file).

4. Commit the packed ref-store transaction.

5. Release the packed-refs lock.

6. For each to-prune loose ref:
   * lock the loose reference
   * verify that it still has the value that was just written to
     the `packed-refs` file
   * if so, delete the loose version of the reference
   * unlock the loose reference

The packed-refs file and loose references are never locked at the same
time during pack-refs, so no deadlock is possible.

But you are right to assume that it *should* be so. The algorithm
written above is a tiny bit unsafe (and has been for years). It is
possible, though admittedly very unlikely, for the following to happen
in the gap between steps 5 and 6:

1. One process updates the value of branch "foo" from A to B.
   (Remember that A is the value that was just written to the
   `packed-refs` file by step 4.)

2. `pack-refs` is run *again* (while the first `pack-refs` is out
   on its lunch break), writes value B to the `packed-refs` file
   for "foo", and deletes the loose version of "foo".

3. Yet *another* process changes "foo" back from B to A. This
   creates a loose version of "foo" with value A, which overwrites
   the packed version with value B.

4. The first `pack-refs` process resumes at step 6, sees that
   "foo" "still" has the value "A", so deletes the loose reference.

This would leave "foo" at the obsolete value "B" (i.e., the value
written to the `packed-refs` file for it by the nested `pack-refs` process.

I think that fixing this problem would require the `packed-refs` lock to
be held while `pack-refs` is pruning the loose references. But given how
unlikely that chain of events seems, and that fixing it would increase
contention on the `packed-refs` file and allow the deadlock that you
described, I lean towards leaving it as-is. Though admittedly,
contention over a loose reference lock could make the race more likely
to be hit.

I also just noticed that the `refs_to_prune` linked list is leaked here
(as has also been the case for many years). I'll fix that separately.
quoted
Second, the old code deleted the loose version of a reference before
deleting any packed version of the same reference. This left a moment
when another process might think that the packed version of the
reference is current, which is incorrect. (Even worse, the packed
version of the reference can be arbitrarily old, and might even point
at an object that has since been garbage-collected.)

Third, if a reference deletion fails to acquire the `packed-refs` lock
altogether, then the old code might leave the repository in the
incorrect state (possibly corrupt) described in the previous
paragraph.
And to think I had the hubris to claim a few weeks ago that we had
probably weeded out all of the weird packed-refs inconsistency and
race-condition bugs. :)
Haha, job security.

Michael
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help