Re: Question relate to collaboration on git monorepo

From: Emily Shaffer <hidden>
Date: 2022-09-20 18:53:49

On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu [off-list ref] wrote:

Hey, guys,

If two users of a git monorepo are working on different subprojects,
/project1 and /project2, using partial clone and sparse-checkout, and
user one pushes first, then user two must pull some blobs pushed by
user one before he can push too. I guess their repo size will
gradually increase with objects from other projects, so is there any
way to delete unnecessary blobs outside the working project (the
sparse-checkout filterspec), or to make git pull not fetch these
unnecessary blobs in the first place?

This is exactly what the combination of partial clone and sparse
checkout is for!

Dev A is working on project1/, and excludes project2/ from her sparse
filter; she also cloned with `--filter=blob:none`.
Dev B is working on project2/, and excludes project1/ from his sparse
filter; he is similarly using the `blob:none` partial clone filter.

Assuming everybody is contributing by direct push, and not using a
code review tool or something else which handles the push for them...
Dev A finishes first, and pushes.
Dev B needs to pull, like you say - but during that pull he doesn't
need to fetch the objects in project1, because they're excluded by the
combination of his partial clone filter and his sparse checkout
pattern. The pull needs to happen because there is a new commit which
Dev B's commit needs to treat as a parent, and so Dev B's client needs
to know the ID of that commit.
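
As a rough sketch of Dev B's setup, the snippet below builds a
throwaway two-project repository in a temporary directory, clones it
with `--filter=blob:none` and a sparse checkout of project2/, and then
counts the objects that history references but the clone does not
hold. The directory names (`seed`, `server.git`, `devb`) and file
contents are placeholders I made up, and I am assuming the serving
repository needs `uploadpack.allowFilter` enabled to honor the filter.

```shell
work=$(mktemp -d)
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com

# Throwaway "server" with two projects (stand-in for the real monorepo).
git init -q -b main "$work/seed"
mkdir "$work/seed/project1" "$work/seed/project2"
echo one > "$work/seed/project1/file1"
echo two > "$work/seed/project2/file2"
git -C "$work/seed" add .
git -C "$work/seed" commit -q -m init
git clone -q --bare "$work/seed" "$work/server.git"
# Assumption: the serving side has to allow partial-clone filters.
git -C "$work/server.git" config uploadpack.allowFilter true

# Dev B: blob-less clone, sparse checkout limited to project2/.
git clone -q --filter=blob:none --no-checkout \
  "file://$work/server.git" "$work/devb"
git -C "$work/devb" sparse-checkout set project2
git -C "$work/devb" checkout -q main

# Objects reachable from history but absent locally print with a leading "?".
missing=$(git -C "$work/devb" rev-list --objects --all --missing=print \
  | grep -c '^?' || true)
echo "objects missing from Dev B's clone: $missing"
```

Running the same `rev-list` command again after a `git pull` is one way
to check whether any project1 blobs were actually downloaded.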

The large number of interruptions to git push may be another
problem. If thousands of projects are in one monorepo, and
no one else has any code that would conflict with mine in any way,
do I still need to pull every time? Is there a way to make
improvements here?

The typical improvement people make here is to use some form of
automation or tooling to perform the push and merge for them. That
usually falls to the code review tool. Let's describe the history like
this: "S" is the source commit from which both A and B branched, and
"A" and "B" are the commits by their respective owners. Because of the
order of push, we want the final commit history to look like "S -> A
-> B". Dev A's local history looks like "S -> A" and Dev B's local
history looks like "S -> B".

If we're using the GitHub PR model, then GitHub may do merge commits
for us, and it creates those merge commits automatically at the time
someone pushes "Merge PR" (or whatever the button is called). So our
history probably looks like:
o  (merge B)
|   \
o   |  (merge A)
|\  |
| | B
| A |
| / /
S

In this case, neither A nor B needs to know about the other, because
the merge commit is being created by the code review tool.

With tooling like Gerrit, or other tooling that uses the rebase
strategy (rather than merge), pretty much the same thing happens -
both devs can push without knowing about each other's commits because the
review tool's automation performs the rebase (that is, the "git pull"
you described) for them.

But if you're not using tooling, yeah, Dev B needs to know which
commit should come before his own commit, so he needs to fetch latest
history, even though the only changes are from Dev A who was working
somewhere else in the monorepo.
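
Without a review tool, that fetch-then-push cycle can at least be
scripted. Here is a minimal sketch, assuming the remote is named
`origin` and a rebase workflow is acceptable; the helper name
`push_with_retry` and the default of three attempts are made up for
illustration.

```shell
# Hypothetical helper: retry "fetch, rebase, push" a few times.
# Usage: push_with_retry [branch] [attempts]
push_with_retry() {
  branch=${1:-main}
  attempts=${2:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Pick up whatever landed on the remote since the last fetch.
    git fetch origin || return 1
    # Replay local commits on top of it; give up on a real conflict.
    git rebase "origin/$branch" || { git rebase --abort; return 1; }
    # Succeeds unless someone else pushed in the meantime.
    git push origin "HEAD:$branch" && return 0
    i=$((i + 1))
  done
  return 1
}
```

Dev B would run `push_with_retry main` from inside his clone. The loop
only matters on a busy branch, where another push can land between the
fetch and the push.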

Here's an example of how two users constrain each other when they git push.

#!/bin/bash
# bash is required: the for ((...)) loops below are not POSIX sh.

rm -rf mono-repo
git init mono-repo -b main
(
  cd mono-repo
  mkdir project1
  mkdir project2
  for ((i=0;i<10;i++))
  do
    echo $i >> project1/file1
    echo $i >> project2/file2
  done
  git add .
  git commit -m init
)

rm -rf mono-repo.git
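git clone --bare mono-repo
# Allow --filter on the serving repo; without this the filter is ignored
# and the clones below end up as full clones.
git -C mono-repo.git config uploadpack.allowFilter true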

# user1
rm -rf m1
git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m1
(
  cd m1
  git sparse-checkout set project1
  git checkout main
  for ((i=0;i<10;i++))
  do
    echo "data1-$i" >> project1/file1
    git add .
    git commit -m "c1 $i"
  done
)

# user2
rm -rf m2
git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m2
(
  cd m2
  git sparse-checkout set project2
  git checkout main
  for ((i=0;i<10;i++))
  do
    echo "data2-$i" >> project2/file2
    git add .
    git commit -m "c2 $i"
  done
)

# user1 push
(
  cd m1
  git push
)

# user2 push failed, then pull user1's blob
(
  cd m2
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count1
  git push
  git -c pull.rebase=false pull --no-edit #no conflict
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count2
  diff blob_count1 blob_count2
)

--
ZheNing Hu
