Thread (63 messages) 63 messages, 9 authors, 2024-07-06

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

From: David Woodhouse <dwmw2@infradead.org>
Date: 2024-03-08 12:33:35
Also in: linux-arm-kernel, linux-rtc, lkml, virtualization

On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
On 07.03.24 15:02, David Woodhouse wrote:
quoted
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
quoted
RFC v3 updates
--------------

This series implements a driver for a virtio-rtc device conforming to spec
RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
the PTP clock driver already present before.

This patch series depends on the patch series "treewide: Use clocksource id
for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
series on top of mainline.

Overview
--------

This patch series adds the virtio_rtc module, and related bugfixes. The
virtio_rtc module implements a driver compatible with the proposed Virtio
RTC device specification [1]. The Virtio RTC (Real Time Clock) device
provides information about current time. The device can provide different
clocks, e.g. for the UTC or TAI time standards, or for physical time
elapsed since some past epoch. 
Hm, should we allow UTC? If you tell me the time in UTC, then
(sometimes) I still don't actually know what the time is, because some
UTC seconds occur twice. UTC only makes sense if you provide the TAI
offset, surely? Should the virtio_rtc specification make it mandatory
to provide such?

Otherwise you're just designing it to allow crappy hypervisors to
expose incomplete information.
Hi David,

(adding virtio-comment@lists.oasis-open.org for spec discussion),

thank you for your insightful comments. I think I take a broadly similar
view. The reason why the current spec and driver is like this is that I
took a pragmatic approach at first and only included features which work
out-of-the-box for the current Linux ecosystem.

The current virtio_rtc features work similar to ptp_kvm, and therefore can
work out-of-the-box with time sync daemons such as chrony.

As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
well, I am afraid that

- in some (embedded) scenarios, the TAI clock may not be available

- crappy hypervisors will pass off the UTC clock as the TAI clock.

For the same reasons, I am also not sure about adding a *mandatory* TAI
offset to each readout. I don't know user-space software which would
leverage this already (at least not through the PTP clock interface). And
why would such software not go straight for the TAI clock instead?

How about adding a requirement to the spec that the virtio-rtc device
SHOULD expose the TAI clock whenever it is available - would this address
your concerns?
I think that would be too easy for implementors to miss, or decide not
to obey. Or to get *wrong*, by exposing a TAI clock but actually
putting UTC in it.

I think I prefer to mandate the tai_offset field with the UTC clock.
Crappy implementations will just set it to zero, but at least that
gives a clear signal to the guests that it's *their* problem to
resolve.



quoted
quoted
PTP clock interface
-------------------

virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
If both the Virtio RTC device and this driver have special support for the
current clocksource, time synchronization programs can use
cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
with single-digit ns precision is possible with a quiescent reference clock
(from the Virtio RTC device). This works even when the Virtio device
response is slow compared to ptp_kvm hypercalls.
Is PTP the right mechanism for this? As I understand it, PTP is a way
to precisely synchronize one clock with another. But in the case of
virt guests synchronizing against the host, it isn't really *another*
clock. It really is the *same* underlying clock. As the host clock
varies with temperature, for example, so does the guest clock. The only
difference is an offset and (on x86 perhaps) a mathematical scaling of
the frequency.

I was looking at this another way, when I came across this virtio-rtc
work.

My idea was just for the hypervisor to expose its own timekeeping
information — the counter/TSC value and TAI time at a given moment,
frequency of the counter, and the precision of both that frequency
(±PPM) and the TAI timestamp (±µs).

By putting that in a host/guest shared data structure with a seqcount
for lockless updates, we can update it as time synchronization on the
host is refined, and we can even cleanly handle live migration where
the guest ends up on a completely different host. It allows for use
cases which *really* care (e.g. timestamping financial transactions) to
ensure that there is never even a moment of getting *wrong* timestamps
if they haven't yet resynced after a migration.
I considered a similar approach as well, but integrating that with the
kernel timekeeping seemed too much effort for the first step. However,
reading the clock from user space would be much simpler.
Right. In fact my *first* use case was userspace, specifically in the
context of https://github.com/aws/clock-bound — but anything we design
for this absolutely has to be usable for kernel timekeeping too.

It's also critical to solve the Live Migration problem.

But is it so hard to integrate into the kernel timekeeping? My plan
would have given us effectively an infinite number of cross-reads of
the realtime clock vs. TSC. You don't have to actually read from a
virtio device; you just read the TSC and do the maths, using the values
in the shared memory region. Couldn't that be used to present a PTP
device to the guest kernel just the same as you do at the moment?

You could probably even simulate PPS with it. Typically with PPS we
have to catch the hardware interrupt and then read the TSC as soon as
possible thereafter. With this, you'd be able to *calculate* the TSC
value at the start of the next second, and wouldn't have to suffer the
real hardware latency :)
quoted
Now I'm trying to work out if I should attempt to reconcile with your
existing virtio-rtc work, or just decide that virtio-rtc isn't trying
to solve the actual problem that we have, and go ahead with something
different... ?
We are certainly interested into the discussed, say, "virtual timekeeper"
mechanism, which would also solve a lot of problems for us (especially if
it would be integrated with kernel timekeeping). Even without Linux kernel
timekeeping, the virtual timekeeper would be useful to us for guests with
simpler timekeeping, and potentially for user space applications.

Our current intent is to at first try to upstream the current (RFC spec v3)
feature set. I think the virtual timekeeper would be suitable as an
optional feature of virtio_rtc (with Virtio, this could easily be added
after initial upstreaming). It is also possible to have a virtio-rtc device
only implement the virtual timekeeper, but not the other clock reading
methods, if these are of no interest.
Yeah, that might make sense. I was thinking of a simple ACPI/DT device
exposing a page of memory and *maybe* an interrupt for when an update
happens. (With the caveat that the interrupt would always occur too
late by definition, so it's no substitute for using the seqlock
correctly in applications that *really* care and are going to get fined
millions of dollars for mis-timestamping their transactions.)

But using the virtio-rtc device as the vehicle for that shared memory
page is reasonable too. It's not even mutually exclusive; we could
expose the *same* data structure in memory via whatever mechanisms we
wanted.

One other thing to note is I think we're being very naïve about the TSC
on x86 hosts. Theoretically, the TSC for every vCPU might run at a
different frequency, and even if they run at the same frequency they
might be offset from each other. I'm happy to be naïve but I think we
should be *explicitly* so, and just say for example that it's defined
against vCPU0 so if other vCPUs are different then all bets are off. 

We *can* cope with TSC frequencies changing. Fundamentally, that's the
whole *point*; NTP calibrates itself as the underlying frequency does
change due to temperature changes, etc. — so a deliberate frequency
scaling, or even a live migration, are just a slightly special case of
the same thing.

One thing I have added to the memory region is a migration counter. In
the ideal case, guests will be happy just to use the hypervisor's
synchronization. But in some cases the guests may want to do NTP (or
PPS, PTP or something else) for themselves, to have more precise
timekeeping than the host. Even if the host is advertising itself as
stratum 16, the guest still needs to know of *migration*, because it
has to consider itself unsynchronized when that happens.

Attachments

  • smime.p7s [application/pkcs7-signature] 5965 bytes
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help