Re: [PATCH 1/4] exec: inherit HWCAPs from the parent process

From: Andrei Vagin <hidden>
Date: 2026-03-28 00:21:39
Also in: linux-arm-kernel, linux-fsdevel, linux-mm, lkml

On Fri, Mar 27, 2026 at 9:06 AM Mark Rutland [off-list ref] wrote:

On Tue, Mar 24, 2026 at 03:19:49PM -0700, Andrei Vagin wrote:

quoted

Hi Mark and Will,

Thanks for the feedback. Please read the inline comments.

On Tue, Mar 24, 2026 at 3:28 AM Will Deacon [off-list ref] wrote:

quoted

On Mon, Mar 23, 2026 at 06:21:22PM +0000, Mark Rutland wrote:

quoted

On Mon, Mar 23, 2026 at 05:53:37PM +0000, Andrei Vagin wrote:

quoted

Introduces a mechanism to inherit hardware capabilities (AT_HWCAP,
AT_HWCAP2, etc.) from a parent process when they have been modified via
prctl.

To support C/R operations (snapshots, live migration) in heterogeneous
clusters, we must ensure that processes utilize CPU features available
on all potential target nodes. To solve this, we need to advertise a
common feature set across the cluster.

This patch adds a new mm flag MMF_USER_HWCAP, which is set when the
auxiliary vector is modified via prctl(PR_SET_MM, PR_SET_MM_AUXV).  When
execve() is called, if the current process has MMF_USER_HWCAP set, the
HWCAP values are extracted from the current auxiliary vector and stored
in the linux_binprm structure. These values are then used to populate
the auxiliary vector of the new process, effectively inheriting the
hardware capabilities.

The inherited HWCAPs are masked with the hardware capabilities supported
by the current kernel to ensure that we don't report more features than
actually supported. This is important to avoid unexpected behavior,
especially for processes with additional privileges.
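
The extract-and-mask step described above can be sketched in userspace terms. This is only an illustration under stated assumptions, not the patch's code: `struct auxv_entry`, `struct inherited_hwcaps` and `inherit_hwcaps()` are hypothetical names, and falling back to the kernel's own values when an entry is missing is an assumption as well.

```c
#include <assert.h>
#include <elf.h>      /* AT_NULL, AT_HWCAP, AT_HWCAP2 */
#include <stddef.h>

/* One auxiliary vector entry: a (type, value) pair. Hypothetical name. */
struct auxv_entry {
    unsigned long type;
    unsigned long value;
};

/* Hypothetical container for the values carried across execve(). */
struct inherited_hwcaps {
    unsigned long hwcap;
    unsigned long hwcap2;
};

/*
 * Walk an AT_NULL-terminated copy of the parent's auxv, pick out the
 * HWCAP entries, and mask them with what the running kernel supports so
 * the result never advertises more than the kernel itself would.
 * Assumption: a missing entry falls back to the kernel's own value.
 */
struct inherited_hwcaps
inherit_hwcaps(const struct auxv_entry *auxv,
               unsigned long kernel_hwcap, unsigned long kernel_hwcap2)
{
    struct inherited_hwcaps out = { kernel_hwcap, kernel_hwcap2 };

    assert(auxv != NULL);
    for (; auxv->type != AT_NULL; auxv++) {
        if (auxv->type == AT_HWCAP)
            out.hwcap = auxv->value & kernel_hwcap;
        else if (auxv->type == AT_HWCAP2)
            out.hwcap2 = auxv->value & kernel_hwcap2;
    }
    return out;
}
```

With a parent AT_HWCAP of 0xF0F and a kernel mask of 0x0FF, the child would see 0x00F: bits the parent set but the kernel does not support are dropped.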

At a high level, I don't think that's going to be sufficient:

* On an architecture with other userspace-accessible feature
  identification mechanisms (e.g. ID registers), userspace
  might read those. So you might need to hide stuff there too, and
  that's going to require architecture-specific interfaces to manage.

  It's possible that some code checks HWCAPs and others check ID
  registers, and mismatch between the two could be problematic.

* If the HWCAPs can be inherited by a more privileged task, then a
  malicious user could use this to hide security features (e.g. shadow
  stack or pointer authentication on arm64), and make it easier to
  attack that task. While not a direct attack, it would undermine those
  features.

I agree with Mark that only a privileged process should be able to mask
certain hardware features. Currently, PR_SET_MM_AUXV is guarded by
CAP_SYS_RESOURCE, but PR_SET_MM_MAP allows changing the auxiliary vector
without specific capabilities. This is definitely an issue. To address
this, I think we can consider introducing a new prctl command to enable
HWCAP inheritance explicitly.

quoted

Yeah, this looks like a non-starter to me on arm64. Even if it was
extended to apply the same treatment to the idregs, many of the hwcap
features can't actually be disabled by the kernel and so you still run
the risk of a task that probes for the presence of a feature using
something like a SIGILL handler or, perhaps more likely, assumes that
the presence of one hwcap implies the presence of another. And then
there are the applications that just base everything off the MIDR...

The goal of this mechanism is not to provide strict architectural
enforcement or to trap the use of hardware features; rather, it is to
provide a consistent discovery interface for applications. I chose the
HWCAP vector because it mirrors the existing behavior of running an
older kernel on newer hardware: while ID registers might report a
feature as physically present, the HWCAPs will omit it if the kernel
lacks support.

On arm64, the view of the ID registers that userspace gets *only*
exposes features that the kernel knows about, as userspace reads of
those registers are trapped+emulated by the kernel. On arm64 it's
not true to say that something appears in those but not the HWCAPs.

I understand that might be different on other architectures, and so
maybe this approach is sufficient on other architectures, but it is not
sufficient on arm64.

quoted

Applications are generally expected to treat HWCAPs as
the source of truth for which features are safe to use, even if the
underlying hardware is technically capable of more.

I'm fairly certain that there are arm64 applications (and libraries)
which check only the ID register values, and not the HWCAPs.
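
To make the two detection paths concrete, here is a rough sketch for one feature (AES on arm64). The `SKETCH_` constants repeat the arm64 HWCAP_AES and HWCAP_CPUID bit positions so that the code also compiles elsewhere, and the function names are hypothetical. Masking AT_HWCAP changes only the first path; the second still returns whatever view of the ID register the kernel emulates.

```c
#include <stdint.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */

/* arm64 bit positions, repeated here so the sketch builds on any arch. */
#define SKETCH_HWCAP_AES   (1UL << 3)
#define SKETCH_HWCAP_CPUID (1UL << 11)

/* Path 1: trust the auxiliary vector (only meaningful on arm64). */
int aes_via_hwcap(void)
{
    return (getauxval(AT_HWCAP) & SKETCH_HWCAP_AES) != 0;
}

/* Decode the AES field, bits [7:4] of ID_AA64ISAR0_EL1; non-zero means AES. */
int isar0_has_aes(uint64_t isar0)
{
    return ((isar0 >> 4) & 0xf) != 0;
}

/* Path 2: read the ID register directly. Returns -1 where unavailable. */
int aes_via_idreg(void)
{
#if defined(__aarch64__)
    uint64_t isar0;

    /* The kernel only emulates the read when it advertises HWCAP_CPUID. */
    if (!(getauxval(AT_HWCAP) & SKETCH_HWCAP_CPUID))
        return -1;
    __asm__ volatile("mrs %0, ID_AA64ISAR0_EL1" : "=r"(isar0));
    return isar0_has_aes(isar0);
#else
    return -1;
#endif
}
```

A library that calls only `aes_via_idreg()` would not notice a bit cleared from AT_HWCAP, which is the mismatch being discussed here.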

Architecturally, there are features which are detected via other
mechanisms (e.g. CHKFEAT), for which HWCAPs are also irrelevant. Even if
that happens to be ok today, there are almost certainly future uses that
will not be compatible with the scheme you propose.

I don't think we can say "applications must check the HWCAPs", when we
know that applications and libraries legitimately don't always do that.

quoted

Another significant advantage of using HWCAPs is that many
applications already rely on them for feature detection. This interface
allows these applications to work correctly "out-of-the-box" in a
migrated environment without requiring any userspace modifications.  I
understand that some apps may use other detection methods; however, there
is no guarantee that these applications will work correctly after
migration to another machine.

I think the existence of applications that detect features by other
(legitimate!) means implies that there's no guarantee that this feature
is useful and will remain useful going forwards.

For example, what do you plan to do if an application or library starts
doing something legitimate that causes it to become incompatible with
this scheme?

I don't want to be in a position where userspace is asked to steer clear
of legitimate mechanisms, or where architecture code suddenly has to
pick up a lot of complexity to make this work.

quoted

There's also kvm, which provides a roundabout way to query some features
of the underlying hardware.

You're probably better off using/extending the idreg overrides we have
in arch/arm64/kernel/pi/idreg-override.c so that you can make your
cluster of heterogeneous machines look alike.

IIRC, idreg-override/cpuid-masking usually works for an entire machine.
We actually need to have a mechanism that will work on a per-container
basis. Workloads inside one cluster can have different
migration/snapshot requirements. Some are pinned to a specific node,
others are never migrated, while others need to be migratable across a
cluster or even between clusters. We need a mechanism that can be
tunable on a per-container/per-process basis.

I think that's theoretically possible, BUT it will require substantially
more complexity, to address the issues that Will and I have mentioned. I
don't think people are very happy to pick up that complexity.

There are many other aspects that are going to be problematic for
heterogeneous migration. Even if you hide the HWCAP for a stateful
feature (e.g. SME), it might appear in one machine's signal frames (and
be mandatory there), but might not appear in another's, and so migration
might not work either way. Likewise, that state can appear via ptrace.

Hi Mark,

I understand all these points and they are valid. However, as I
mentioned, we are not trying to introduce a mechanism that will strictly
enforce feature sets for every container. While we would like to have
that functionality, as you and Will mentioned, it would require
substantially more complexity to address, and maintainers would be unlikely
to pick up that complexity. Even masking ID registers on a per-container
basis would introduce extra complexity that could make architecture
maintainers unhappy. There were a few attempts to introduce container
CPUID masking on x86_64 in the past.

In CRIU, we are not aiming to handle every possible workload. Our goal
is to target workloads where developers are ready to cooperate and
willing to make adjustments to be C/R compatible. The goal here is to
provide developers with clear instructions on what they can do to ensure
their applications are C/R compatible. When I say "workloads", I mean
this in a broad sense. A container might pack a set of tools with
different runtimes (Go, Java, libc-based). All these runtimes should
detect only allowed features.

Returning to the subject of this patchset: this series extends the role
of hwcaps. With this change, we would establish that hwcaps are the
"source of truth" for which features an application can safely use. Any
other features available on the current CPU would not be guaranteed to
remain available after migration to another machine.

After this discussion, I found that the current version missed one major
thing: there should be a signal indicating that hwcaps must be used for
feature detection. Since we will need to integrate this interface into
libc, Go, and other runtimes, they definitely should not rely just on
hwcaps by default, especially in the early stages. This can be solved
via the prctl command.  Libraries like libc would call
prctl(PR_USER_HWCAP_ENABLED). If this returns true, the runtime knows
that only the features explicitly listed in hwcaps should be used.
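
A minimal sketch of how a runtime might consume such a signal follows. The command name comes from the paragraph above, but it is not in any released kernel header: the numeric value below is a placeholder, and the helper names and the `probe_says_present` parameter are hypothetical.

```c
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <sys/prctl.h>  /* prctl */

/*
 * HYPOTHETICAL: proposed command name only. The number is a placeholder
 * chosen for this sketch and is not part of any kernel ABI.
 */
#ifndef PR_USER_HWCAP_ENABLED
#define PR_USER_HWCAP_ENABLED 0x48574350
#endif

/* Returns 1 if the kernel reports that hwcaps are authoritative, else 0. */
int user_hwcap_enabled(void)
{
    /* Kernels that do not know the command fail the call (returns -1). */
    return prctl(PR_USER_HWCAP_ENABLED, 0, 0, 0, 0) > 0;
}

/*
 * Decide whether a runtime may use a feature. probe_says_present stands
 * for whatever other detection the runtime already performs.
 */
int feature_usable(unsigned long hwcap_bit, int probe_says_present)
{
    int in_hwcap = (getauxval(AT_HWCAP) & hwcap_bit) != 0;

    if (user_hwcap_enabled())
        return in_hwcap;                    /* hwcaps are the only source of truth */
    return in_hwcap || probe_says_present;  /* existing behaviour is unchanged */
}
```

On a kernel without the command the first helper returns 0, so the runtime keeps its current detection logic.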

You are right, the controlled feature set will be limited to features
the kernel knows about. And yes, we would need to report CPU features in
hwcaps even if the kernel isn't directly involved in handling them.
Honestly, I am not certain if this is the "right" interface for that,
and I would be happy to consider other ideas. I understand that these
hwcaps will not work right out of the box, but we need a way to solve
this problem. Having a centralized API for CPU/kernel feature detection
seems like the right direction.

As for signal frame size and extended states like SVE/SME, we are aware
of this problem. However, it is partly mitigated by the fact that if
an application does not use some features, those states are not placed
in the signal frame. In the future, when we construct/reload a signal
frame, we could look at the feature set of the process and generate
a frame according to those features...

Thanks,
Andrei
