Re: fsmonitor deadlock / macOS CI hangs
From: Koji Nakamaru <hidden>
Date: 2024-10-02 01:46:16
On Tue, Oct 1, 2024 at 4:46 AM Jeff King [off-list ref] wrote:
> I did some more digging on the hangs we sometimes see when running the
> test suite on macOS. I'm cc-ing Patrick as somebody who dug into this
> before, and Johannes as the only still-active person mentioned in the
> relevant code.
>
> For those just joining, you can reproduce the issue by running t9211
> with --stress on macOS. Some earlier notes are here:
>
>   https://lore.kernel.org/git/20240517081132.GA1517321@coredump.intra.peff.net/
>
> but the gist of it is that we end up with Git processes waiting to read
> from fsmonitor, but fsmonitor hanging.

I may have found the cause. fsmonitor_run_daemon_1() starts the fsevent
listener thread before with_lock__wait_for_cookie() is called:

	/*
	 * Start the fsmonitor listener thread to collect filesystem
	 * events.
	 */
	if (pthread_create(&state->listener_thread, NULL,
			   fsm_listen__thread_proc, state)) {
		ipc_server_stop_async(state->ipc_server_data);
		err = error(_("could not start fsmonitor listener thread"));
		goto cleanup;
	}
	listener_started = 1;

fsm_listen__thread_proc() starts the following:

	fsm_listen__loop(state);

which is defined as below for darwin:

	void fsm_listen__loop(struct fsmonitor_daemon_state *state)
	{
		struct fsm_listen_data *data;

		data = state->listen_data;

		pthread_mutex_init(&data->dq_lock, NULL);
		pthread_cond_init(&data->dq_finished, NULL);
		data->dq = dispatch_queue_create("FSMonitor", NULL);
		FSEventStreamSetDispatchQueue(data->stream, data->dq);
		data->stream_scheduled = 1;

		if (!FSEventStreamStart(data->stream)) {
			error(_("Failed to start the FSEventStream"));
			goto force_error_stop_without_loop;
		}
		data->stream_started = 1;
		...

Normally FSEventStreamStart() is called before
with_lock__wait_for_cookie() creates a cookie file, but this is not
guaranteed. We can reproduce the issue easily if we modify
fsm_listen__loop() as below:

--- a/compat/fsmonitor/fsm-listen-darwin.c
+++ b/compat/fsmonitor/fsm-listen-darwin.c
@@ -510,6 +510,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
 	FSEventStreamSetDispatchQueue(data->stream, data->dq);
 	data->stream_scheduled = 1;

+	sleep(1);
 	if (!FSEventStreamStart(data->stream)) {
 		error(_("Failed to start the FSEventStream"));
 		goto force_error_stop_without_loop;

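
To make the ordering problem concrete, here is a minimal single-threaded sketch of the two interleavings. It is only a model, not Git code: struct listener_model and the model_*() helpers are hypothetical names, and it assumes that an event for a cookie file created before the stream has started is never delivered to the listener, which is what the hang suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the listener; none of these names exist in Git. */
struct listener_model {
	bool stream_started; /* stands in for data->stream_started */
	int events_seen;     /* events delivered after the stream started */
};

/* Models the point where FSEventStreamStart() has succeeded. */
static void model_stream_start(struct listener_model *l)
{
	assert(l);
	l->stream_started = true;
}

/*
 * Models the cookie file being created. ASSUMPTION: the event is
 * only observed if the stream is already running.
 */
static void model_create_cookie(struct listener_model *l)
{
	assert(l);
	if (l->stream_started)
		l->events_seen++;
}

/* Returns 1 if the waiter would be woken, 0 if it would wait forever. */
static int model_cookie_observed(bool start_stream_first)
{
	struct listener_model l = { false, 0 };

	if (start_stream_first) {
		/* The usual ordering: listener is up before the cookie. */
		model_stream_start(&l);
		model_create_cookie(&l);
	} else {
		/* The ordering forced by the sleep(1) above. */
		model_create_cookie(&l);
		model_stream_start(&l);
	}
	return l.events_seen > 0;
}
```

Under that assumption, model_cookie_observed(true) reports the cookie as seen, while model_cookie_observed(false) does not, so the cookie waiter would never be woken in the second ordering.
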
Koji Nakamaru