Thread: 20 messages, 8 authors, 2025-10-09

Re: [PATCH v2] SubmittingPatches: add section about AI

From: Christian Couder <hidden>
Date: 2025-10-08 08:38:07

On Sat, Oct 4, 2025 at 12:20 AM brian m. carlson
[off-list ref] wrote:
On 2025-10-03 at 20:48:40, Elijah Newren wrote:
Would this mean that you wanted to ban contributions like d12166d3c8bb
(Merge branch 'en/docfixes', 2023-10-23), available on the list over
at https://lore.kernel.org/git/pull.1595.git.1696747527.gitgitgadget@gmail.com/
?   We don't need to go theoretical, I've already contributed such a
patch series before -- 2 years ago -- and it was merged.  Granted,
that was entirely documentation, and I called out the usage of AI in
the cover letter, and I manually checked every change (discarding many
of them) and split it into commits on my own, could easily explain any
change and why it was good, etc.  And I was upfront about all of it.

I think the main problem here is that we don't know the copyright
status of LLM outputs.

It's very unlikely that whatever is decided about the copyright status
of LLM outputs will fundamentally change copyright law. So for example
small changes, or changes where a human has been involved a lot, or
changes that are very specific, and so on, are very likely acceptable.

It is not uncommon for them to produce output
that reflects their training input and we see evidence of that in, for
instance, the New York Times lawsuit against OpenAI.

You might say something very similar about people contributing proprietary code:

"It is not uncommon to have people copy-paste some proprietary code
into an open source project and we see evidence of that in such and
such incidents."

So it's just fine to accept some degree of risk. We have to accept it
anyway. Saying "we will ban everything AI generated" will not make the
risk disappear either.

As I said, the situation is very unclear legally, with active litigation
in multiple countries, and we have to comply with pretty much every
country's laws in this situation.  Whether something is legal in the
United States, where you're located, is completely irrelevant to whether
it is legal in Canada, where I'm located, or Germany or the UK, where we
have other contributors.  We also have to consider whether it's legal in
all of the countries that Git is distributed in, which includes every
country in which Debian has a mirror[0], even countries under
international sanctions, such as Iran, Russia, and Belarus.

I don't quite agree with this. Theoretically if the official mirrors
are only in a few countries, then only the laws in these few countries
(+ US law as the Conservancy is US based) might be really legally
relevant for the project. Then it's the responsibility of
distributions or people cloning/downloading the software to check that
it's legal in the countries they distribute or clone/download it.

In practice we should pay attention a bit to make sure we don't create
obvious legal problems for too many people, but if some countries
decide to have laws that are too stupid and ban too many things, we
could decide that we should definitely not pay attention to those
laws.

It doesn't matter if the person using AI has indemnification, either,
since that only covers civil matters, and at least in the U.S. and
Canada, knowingly violating copyright is also a criminal offence.

The sign-off process is designed to clearly state that a person has the
ability to contribute code under the license and I don't think, as
things stand, it's possible to make that assertion with code or
documentation generated from an LLM except in very limited
circumstances.

I think in practice those "very limited circumstances" can cover a lot
of different things though. Do we really want to enter into a legal
debate over what
https://en.wikipedia.org/wiki/Sc%C3%A8nes_%C3%A0_faire means for
software for example? Or about allowing or disallowing translation of
documentation or commit messages based on the fact that the tools used
for translation use an LLM or not?

I have given a lot of examples of what is very likely acceptable.
Elijah has given a very good concrete example showing why we should
not outright ban AI too. If you think they are not good examples
please say so clearly. Otherwise I think you cannot keep saying that
they are related to "very limited circumstances".

I don't allow LLM-generated code in my personal projects
that require sign-off for that reason, and neither does QEMU[1].  I
don't think I could honestly assert either (a) or (b) in the DCO with
LLM-generated code because it's not clear to me whether "I have the
right to submit it under the…license."

To quote the QEMU policy:

  To satisfy the DCO, the patch contributor has to fully understand the
  copyright and license status of content they are contributing to QEMU. With AI
  content generators, the copyright and license status of the output is
  ill-defined with no generally accepted, settled legal foundation.

  Where the training material is known, it is common for it to include large
  volumes of material under restrictive licensing/copyright terms. Even where
  the training material is all known to be under open source licenses, it is
  likely to be under a variety of terms, not all of which will be compatible
  with QEMU's licensing requirements.

The QEMU policy was discussed in the previous version already.

I remember the SCO situation with Linux and how it really created a lot
of uncertainty with Linux because SCO created FUD around Linux licensing
and how that led to the DCO being created.  I am aware of the fact that
many open source contributors are very unhappy that their code has been
used to train LLMs without retaining credits and copyright notices or
honouring the license terms[2].

I don't think it's very relevant for your position on this. On the
contrary, if LLMs have been trained mostly with open source code, then
if they produce copyrighted output, that output is more likely to be
compatible with the GPL. It has even been suggested (and discussed in
this thread) that some AIs should be trained only with open source
material (for example MIT licensed material?) so that we could stop
worrying about including it. If that happens, there would be no reason
to outright ban AI generated content, right?

And I have spent many years working
with non-profits[3], where I have always been taught that we should
avoid even the appearance of impropriety.

Adding a section restricting AI use, even if it doesn't go as far as
you would like, is already a first step in the direction you want. If
this gets merged, you can always send patches on top to make it more
restrictive.

It may matter less what the situation actually ends up being legally
(although it could end up being quite bad) and more whether someone can
imply or suggest that Git is not being distributed in compliance with
the license or contains infringing code, which could effectively make it
undistributable because nobody wants to take that risk.  And litigation,
even if Git and its contributors are successful, can be extraordinarily
expensive.

There are already legal risks anyway (see above).