Thread: 16 messages, 7 authors, 2024-01-24

Re: [DISCUSS] Introducing Rust into the Git project

From: Elijah Newren <hidden>
Date: 2024-01-11 00:12:35

On Wed, Jan 10, 2024 at 12:18 PM Taylor Blau [off-list ref] wrote:
> Over the holiday break at the end of last year I spent some time
> thinking on what it would take to introduce Rust into the Git project.

I'm very happy to see this email.

> There is significant work underway to introduce Rust into the Linux
> kernel (see [1], [2]). Among their stated goals, I think there are a few
> which could be potentially relevant to the Git project:
>
>   - Lower risk of memory safety bugs, data races, memory leaks, etc.
>     thanks to the language's safety guarantees.
>
>   - Easier to gain confidence when refactoring or introducing new code
>     in Rust (assuming little to no use of the language's `unsafe`
>     feature).
>
>   - Contributing to Git becomes easier and accessible to a broader group
>     of programmers by relying on a more modern language.
>
> Given the allure of these benefits, I think it's at least worth
> considering and discussing how Rust might make its way into Junio's
> tree.

I think there are other benefits as well; I'll list them at the end of
the email to avoid side-tracking too much[6].

> I imagine that the transition state would involve some parts of the
> project being built in C and calling into Rust code via FFI (and perhaps
> vice-versa, with Rust code calling back into the existing C codebase).
> Luckily for us, Rust's FFI provides a zero-cost abstraction [3], meaning
> there is no performance impact when calling code from one language in
> the other.

I agree with the zero-cost abstraction, but there is a funny caveat
with measuring it if anyone is curious[7].
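
As a rough illustration of the C-calling-Rust direction discussed above, here is a minimal sketch of a Rust function exported with the C ABI. The name `git_rs_count_newlines` and its signature are hypothetical and do not exist in Git; the sketch only assumes that the C side passes a pointer and a length.

```rust
use std::os::raw::c_uchar;
use std::slice;

/// Hypothetical helper, not part of Git: counts '\n' bytes in a buffer
/// owned by the C caller.
///
/// A matching C prototype would be:
///     size_t git_rs_count_newlines(const unsigned char *buf, size_t len);
///
/// # Safety
/// `buf` must point to `len` readable bytes, or be NULL with `len == 0`.
// Note: the 2024 edition spells this attribute `#[unsafe(no_mangle)]`.
#[no_mangle]
pub unsafe extern "C" fn git_rs_count_newlines(buf: *const c_uchar, len: usize) -> usize {
    if buf.is_null() || len == 0 {
        return 0;
    }
    // SAFETY: the caller promises that `buf` points to `len` readable bytes.
    let bytes = unsafe { slice::from_raw_parts(buf, len) };
    bytes.iter().filter(|&&b| b == b'\n').count()
}
```

The exported symbol is an ordinary function in the resulting object code, so a C caller that declares the prototype above and links against the compiled Rust library reaches it with a plain function call. The raw-pointer handling is confined to one small `unsafe` block; the rest of the body is safe Rust.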

> Some open questions from me, at least to get the discussion going are:
>
>   1. Platform support. The Rust compiler (rustc) does not enjoy the same
>      widespread availability that C compilers do. For instance, I
>      suspect that NonStop, AIX, Solaris, among others may not be
>      supported.
>
>      One possible alternative is to have those platforms use a Rust
>      front-end for a compiler that they do support. The gccrs [4]
>      project would allow us to compile Rust anywhere where GCC is
>      available. The rustc_codegen_gcc [5] project uses GCC's libgccjit
>      API to target GCC from rustc itself.

Another alternative (as discussed at Git Merge when we were last
talking about Rust[8]), is requiring all Rust code to be optional for
now.  If we choose to go that route, I think that means that (a) for
existing components, we have both a Rust and a C implementation
available, and (b) for new components (e.g. new top-level commands
like git-replay), they can be Rust-only and those compiling without
Rust just don't get them.

>   2. Migration. What parts of Git are easiest to convert to Rust? My
>      hunch is that the answer is any stand-alone libraries, like
>      strbuf.h. I'm not sure how we should identify these, though, and in
>      what order we would want to move them over.

If we're happy to allow Rust, I'd like to rewrite git-replay in Rust
as a testcase.  It's almost certainly not "easiest", but I think it's
an interesting testcase because it's a new top-level command that
hasn't appeared in any release yet.  Further, it is currently only
designed for server-side usecases, so would likely not be affected by
more limited platform support.  (I haven't started on this; my
previous experiments were with diffcore-delta.)

> I'm curious to hear what others think about this. I think that this
> would be an exciting and worthwhile direction for the project. Let's
> see!

:-)

> Thanks,
> Taylor
>
> [1]: https://rust-for-linux.com/
> [2]: https://lore.kernel.org/rust-for-linux/20210414184604.23473-1-ojeda@kernel.org/
> [3]: https://blog.rust-lang.org/2015/04/24/Rust-Once-Run-Everywhere.html#c-talking-to-rust
> [4]: https://github.com/Rust-GCC/gccrs
> [5]: https://github.com/rust-lang/rustc_codegen_gcc

[6] Here are some additional benefits I see:

 - Parallel performance.  We avoid making things parallel in Git because
   debugging/maintaining/reviewing parallel code in C often isn't worth
   the squeeze.  Rust was designed to greatly reduce this effort (the
   whole "fearless concurrency" thing).

 - Single-threaded performance.  Multiple factors:

   - We had (and might still have) O(N^2) stuff in a lot of places in
     our codebase, because we tend to over-use arrays.  (e.g. with
     string_list, or with insertions and deletions into the index
     during a merge, etc.)

   - Relatedly, using hashes in C is quite onerous, to the point that
     we often simply avoid it.  I know I have, and I also know that
     even after I introduced strmap and tried to use it outside of
     merge-ort, I got pushback because "string hash-maps are not
     really typical for a C program. I'm sure they are the best choice
     for an advanced merge algorithm but they are not really necessary
     [here; let's use sorted arrays instead]..."  I then had to go
     through multiple rounds of responses and ended up reimplementing
     everything as suggested (before finally convincing others to just
     use the strmap implementation after all).

   - We use QSORT() which basically calls libc's qsort().  Due to the
     design of this function (where the comparator is a separate
     function call), it is slow.  When languages avoid making the
     comparator a separate function call, they can speed sorts up by a
     factor of 2 (or even by 3 when an unstable sort is good enough
     and the platform's qsort() is stable).

   - Difficulty of incorporating other libraries.  For example, our
     hashmap.[ch] make use of FNV, but picking something else is a big
     amount of effort.  Now, while FNV is faster than Rust's default
     of SipHash, cargo makes it easy to pull in alternatives like
     FnvHashMap or FxHashMap, which we can then use where it matters.
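
To make the hashing points above concrete, here is a minimal sketch of a string-keyed map whose hasher is swapped out through a type parameter. It hand-writes a 64-bit FNV-1a hasher so the example needs only the standard library; this is not necessarily the FNV variant used by Git's hashmap.[ch], and in practice a crate would supply the hasher instead. The names `Fnv1a`, `FnvMap`, and `count_paths` are made up for illustration.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// 64-bit FNV-1a, written out by hand so the example needs no external crate.
pub struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Same API as the default `HashMap`; only the hasher differs.
pub type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Counts how often each path occurs, in expected linear time, instead of
/// repeatedly searching or inserting into a sorted array.
pub fn count_paths<'a>(paths: &[&'a str]) -> FnvMap<&'a str, usize> {
    let mut counts = FnvMap::default();
    for &path in paths {
        *counts.entry(path).or_insert(0) += 1;
    }
    counts
}
```

Because the hasher is part of the map's type, switching back to the default SipHash (or to another hasher) is a one-line change to the `FnvMap` alias, and callers such as `count_paths` stay untouched.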

I'm also tempted to include bullet points for having a unit testing
framework built in, and potentially fewer platform-dependent issues
(e.g. forgetting to use STABLE_QSORT when required, because qsort
happens to be stable in some libc implementations; Rust defines the
stability of its sorts consistently across platforms), but I'm not
sure these additional advantages are big enough to merit a full
bullet point.
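
The sorting and unit-testing points can be sketched together. In the example below, the comparator is a closure that the compiler can inline into the sort, the choice between a stable and an unstable sort is explicit in the method name, and the test lives next to the code. The `Entry` type and the function names are hypothetical and are not Git data structures.

```rust
/// Hypothetical record type for illustration; not a Git data structure.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub path: String,
    pub stage: u8,
}

/// Stable sort by path: entries with equal paths keep their input order.
/// `slice::sort_by` is documented as stable on every platform.
pub fn sort_by_path_stable(entries: &mut [Entry]) {
    entries.sort_by(|a, b| a.path.cmp(&b.path));
}

/// Unstable sort by path: entries with equal paths may be reordered.
/// `slice::sort_unstable_by` sorts in place without allocating.
pub fn sort_by_path_unstable(entries: &mut [Entry]) {
    entries.sort_unstable_by(|a, b| a.path.cmp(&b.path));
}

// Built-in unit testing: compiled and run only by `cargo test`.
#[cfg(test)]
mod sort_tests {
    use super::*;

    #[test]
    fn stable_sort_keeps_equal_keys_in_input_order() {
        let mut v = vec![
            Entry { path: "b".into(), stage: 1 },
            Entry { path: "a".into(), stage: 2 },
            Entry { path: "b".into(), stage: 3 },
        ];
        sort_by_path_stable(&mut v);
        let stages: Vec<u8> = v.iter().map(|e| e.stage).collect();
        assert_eq!(stages, vec![2, 1, 3]);
    }
}
```

Whether this yields the speedups mentioned above depends on the data and the platform, so it would need to be measured rather than assumed.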

[7] If you ignore Rust for a moment, and simply divide your files into
different libraries (e.g. introducing a new.c file, moving some
functions to it, and then compiling new.c into a new library,
libnew.a, and linking both libgit.a and libnew.a into git), you can
sometimes measure some small performance differences.  At least, I
did.  What this scenario has to do with Rust is that if we start
moving some code to Rust, that will naturally likely result in a
different division of files into libraries.  Thus, for me to verify
that Rust did provide zero-cost abstractions with my experiments, in
order to compare the performance of my Rust changes, I had to compare
to a version of git where I split some functions out into a separate
library.  When I did that, the performance overhead was actually 0.
Otherwise, there was a tiny performance degradation in the particular
splitting I employed.  However, while splitting did give me a small
performance drop, it was completely outweighed by the performance
advantages I got elsewhere in the things I converted to Rust.

[8] https://lore.kernel.org/git/ZRrfN2lbg14IOLiK@nand.local/