Re: [PATCH v2] Add /proc/pid_gen

From: Daniel Colascione <hidden>
Date: 2018-11-21 23:21:56
Also in: linux-doc, lkml

On Wed, Nov 21, 2018 at 2:50 PM Andrew Morton [off-list ref] wrote:

On Wed, 21 Nov 2018 14:40:28 -0800 Daniel Colascione [off-list ref] wrote:

On Wed, Nov 21, 2018 at 2:12 PM Andrew Morton [off-list ref] wrote:

On Wed, 21 Nov 2018 12:54:20 -0800 Daniel Colascione [off-list ref] wrote:

Trace analysis code needs a coherent picture of the set of processes
and threads running on a system. While it's possible to enumerate all
tasks via /proc, this enumeration is not atomic. If PID numbering
rolls over during snapshot collection, the resulting snapshot of the
process and thread state of the system may be incoherent, confusing
trace analysis tools. The fundamental problem is that if a PID is
reused during a userspace scan of /proc, it's impossible to tell, in
post-processing, whether a fact that the userspace /proc scanner
reports regarding a given PID refers to the old or new task named by
that PID, as the scan of that PID may or may not have occurred before
the PID reuse, and there's no way to "stamp" a fact read from the
kernel with a trace timestamp.

This change adds a per-pid-namespace 64-bit generation number,
incremented on PID rollover, and exposes it via a new proc file
/proc/pid_gen. By examining this file before and after /proc
enumeration, user code can detect the potential reuse of a PID and
restart the task enumeration process, repeating until it gets a
coherent snapshot.

PID rollover ought to be rare, so in practice, scan repetitions will
be rare.
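
The retry loop described above could be sketched in userspace roughly as follows. This is an illustration only, not code from the patch: it assumes /proc/pid_gen (the file this patch proposes, which a given kernel may not provide) contains a single decimal integer, and the names read_pid_gen, scan_proc, coherent_snapshot, and max_attempts are hypothetical.

```python
import os
from typing import Callable, Dict, List


def read_pid_gen(path: str = "/proc/pid_gen") -> int:
    """Read the generation counter proposed by this patch.

    Assumption: the file holds one decimal integer. The file only
    exists on kernels carrying the patch.
    """
    with open(path) as f:
        return int(f.read().strip())


def scan_proc(proc_root: str = "/proc") -> Dict[int, List[int]]:
    """Map each PID to its thread IDs. This walk is not atomic by itself."""
    tasks: Dict[int, List[int]] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            tids = os.listdir(os.path.join(proc_root, entry, "task"))
        except OSError:
            continue  # the process exited while we were scanning
        tasks[int(entry)] = sorted(int(t) for t in tids if t.isdigit())
    return tasks


def coherent_snapshot(
    read_gen: Callable[[], int] = read_pid_gen,
    scan: Callable[[], Dict[int, List[int]]] = scan_proc,
    max_attempts: int = 10,
) -> Dict[int, List[int]]:
    """Rescan until the generation number is unchanged across a scan."""
    for _ in range(max_attempts):
        before = read_gen()
        snapshot = scan()
        if read_gen() == before:
            # No PID rollover during the scan, so no PID seen by the
            # scan can have been reused while it ran.
            return snapshot
    raise RuntimeError("PID generation kept changing; no coherent snapshot")
```

The reader and scanner are passed in as parameters so the loop can be exercised without the proposed file, for example with a fake generation counter. The max_attempts bound is an addition for the sketch; the commit message simply says to repeat until the snapshot is coherent.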

In general, tracing is a rather specialized thing.  Why is this very
occasional confusion a sufficiently serious problem to warrant addition
of this code?

I wouldn't call tracing a specialized thing: it's important enough to
justify its own summit and a whole ecosystem of trace collection and
analysis tools. We use it every day in Android. It's tremendously
helpful for understanding system behavior, especially in cases where
multiple components interact in ways that we can't readily predict or
replicate. Reliability and precision in this area are essential:
retrospective analysis of difficult-to-reproduce problems involves
puzzling over trace files and testing hypotheses, and when the trace
system itself is occasionally unreliable, the set of hypotheses to
consider grows. I've tried to keep the amount of kernel infrastructure
needed to support this precision and reliability to a minimum, pushing
most of the complexity to userspace. But we do need, from the kernel,
reliable process disambiguation.

Besides: things like checkpoint and restart are also non-core
features, but the kernel has plenty of infrastructure to support them.
We're talking about a very lightweight feature in this thread.

I'm still not understanding the seriousness of the problem.  Presumably
you've hit problems in real-life which were serious and frequent enough
to justify getting down and writing the code.  Please share some sob stories
with us!

The problem here is the possibility of confusion, even if it's rare.
Does the naive approach of just walking /proc and ignoring the
possibility of PID reuse races work most of the time? Sure. But "most
of the time" isn't good enough. It's not that there are tons of sob
stories: it's that without completely robust reporting, we can't rule
out the possibility that weirdness we observe in a given trace is
actually just an artifact from a kinda-sorta-working best-effort trace
collection system instead of a real anomaly in behavior. Tracing,
essentially, gives us deltas for system state, and without an accurate
baseline, collected via some kind of scan on trace startup, it's
impossible to use these deltas to robustly reconstruct total system
state at a given time. And this matters, because errors in
reconstruction (e.g., assigning a thread to the wrong process because
the IDs happen to be reused) can affect processing of the whole trace.
If it's 3am and I'm analyzing the lone trace from a dogfooder
demonstrating a particularly nasty problem, I don't want to find out
that the trace I'm analyzing ended up being useless because the
kernel's trace system is merely best effort. It's very cheap to be
100% reliable here, so let's be reliable and rule out sources of
error.

Which userspace tools will be using pid_gen?  Are the developers of
those tools signed up to use pid_gen?

I'll be changing Android tracing tools to capture process snapshots
using pid_gen, using the algorithm in the commit message.

Which other tools could use this and what was the feedback from their
developers?

I'm going to have Android's systrace and Perfetto use this approach.
Exactly how many tools signed up to use this feature do you need?

Those people are the intended audience and the
best-positioned reviewers so let's hear from them?

I'm writing plenty of trace analysis tools myself, so I'm part of this
intended audience. Other tracing tool authors have told me about
out-of-tree hacks for atomic process snapshots via ftrace events. This
approach avoids the necessity of these more-invasive hacks.
