Re: [PATCH v2] SubmittingPatches: add section about AI
From: Christian Couder <hidden>
Date: 2025-10-08 08:38:07
On Sat, Oct 4, 2025 at 12:20 AM brian m. carlson [off-list ref] wrote:
On 2025-10-03 at 20:48:40, Elijah Newren wrote:
Would this mean that you wanted to ban contributions like d12166d3c8bb (Merge branch 'en/docfixes', 2023-10-23), available on the list over at https://lore.kernel.org/git/pull.1595.git.1696747527.gitgitgadget@gmail.com/ ? We don't need to go theoretical, I've already contributed such a patch series before -- 2 years ago -- and it was merged. Granted, that was entirely documentation, and I called out the usage of AI in the cover letter, and I manually checked every change (discarding many of them) and split it into commits on my own, could easily explain any change and why it was good, etc. And I was upfront about all of it.
I think the main problem here is that we don't know the copyright status of LLM outputs.
It's very unlikely that whatever is decided about the copyright status of LLM outputs will fundamentally change copyright law. So for example small changes, or changes where a human has been involved a lot, or changes that are very specific, and so on, are very likely acceptable.
It is not uncommon for them to produce output that reflects their training input and we see evidence of that in, for instance, the New York Times lawsuit against OpenAI.
You might say something very similar about people contributing proprietary code: "It is not uncommon to have people copy-paste some proprietary code into an open source project and we see evidence of that in such and such incidents." So it's just fine to accept some degree of risk. We have to accept it anyway. Saying "we will ban everything AI generated" will not make the risk disappear either.
As I said, the situation is very unclear legally, with active litigation in multiple countries, and we have to comply with pretty much every country's laws in this situation. Whether something is legal in the United States, where you're located, is completely irrelevant to whether it is legal in Canada, where I'm located, or Germany or the UK, where we have other contributors. We also have to consider whether it's legal in all of the countries that Git is distributed in, which includes every country in which Debian has a mirror[0], even countries under international sanctions, such as Iran, Russia, and Belarus.
I don't quite agree with this. Theoretically, if the official mirrors are only in a few countries, then only the laws in those few countries (+ US law as the Conservancy is US based) might be really legally relevant for the project. Then it's the responsibility of distributions or people cloning/downloading the software to check that it's legal in the countries where they distribute or clone/download it. In practice we should pay some attention to make sure we don't create obvious legal problems for too many people, but if some countries decide to have laws that are too stupid and ban too many things, we could decide that we should definitely not pay attention to those laws.
It doesn't matter if the person using AI has indemnification, either, since that only covers civil matters, and at least in the U.S. and Canada, knowingly violating copyright is also a criminal offence. The sign-off process is designed to clearly state that a person has the ability to contribute code under the license and I don't think, as things stand, it's possible to make that assertion with code or documentation generated from an LLM except in very limited circumstances.
I think in practice those "very limited circumstances" can cover a lot of different things though. Do we really want to enter into a legal debate over what https://en.wikipedia.org/wiki/Sc%C3%A8nes_%C3%A0_faire means for software for example? Or about allowing or disallowing translation of documentation or commit messages based on whether the tools used for translation use an LLM? I have given a lot of examples of what is very likely acceptable. Elijah has given a very good concrete example showing why we should not outright ban AI too. If you think they are not good examples, please say so clearly. Otherwise I think you cannot keep saying that they are related to "very limited circumstances".
I don't allow LLM-generated code in my personal projects that require sign-off for that reason, and neither does QEMU[1]. I don't think I could honestly assert either (a) or (b) in the DCO with LLM-generated code because it's not clear to me whether "I have the right to submit it under the…license." To quote the QEMU policy: To satisfy the DCO, the patch contributor has to fully understand the copyright and license status of content they are contributing to QEMU. With AI content generators, the copyright and license status of the output is ill-defined with no generally accepted, settled legal foundation. Where the training material is known, it is common for it to include large volumes of material under restrictive licensing/copyright terms. Even where the training material is all known to be under open source licenses, it is likely to be under a variety of terms, not all of which will be compatible with QEMU's licensing requirements.
The QEMU policy was discussed in the previous version already.
I remember the SCO situation with Linux and how it really created a lot of uncertainty with Linux because SCO created FUD around Linux licensing and how that led to the DCO being created. I am aware of the fact that many open source contributors are very unhappy that their code has been used to train LLMs without retaining credits and copyright notices or honouring the license terms[2].
I don't think it's very relevant for your position on this. On the contrary, if LLMs have been trained mostly with open source code, then if they produce copyrighted output, that output is more likely to be compatible with the GPL. It has even been suggested (and discussed in this thread) that some AIs should be trained only with open source material (for example MIT licensed material?) so that we could stop worrying about including it. If that happens, there would be no reason to outright ban AI generated content, right?
And I have spent many years working with non-profits[3], where I have always been taught that we should avoid even the appearance of impropriety.
Adding a section restricting AI use, even if it doesn't go as far as you would like, is already a first step in the direction you want. If this gets merged, you can always send patches on top to make it more restrictive.
It may matter less what the situation actually ends up being legally (although it could end up being quite bad) and more whether someone can imply or suggest that Git is not being distributed in compliance with the license or contains infringing code, which could effectively make it undistributable because nobody wants to take that risk. And litigation, even if Git and its contributors are successful, can be extraordinarily expensive.
There are already legal risks anyway (see above).