[RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot

From: David Woodhouse <dwmw2@infradead.org>
Date: 2026-06-19 00:33:39
Also in: lkml
Subsystem: networking drivers, ptp hardware clock support, ptp vmclock support, the rest, timekeeping, clocksource core, ntp, alarmtimer · Maintainers: Andrew Lunn, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Richard Cochran, David Woodhouse, Linus Torvalds, John Stultz, Thomas Gleixner

As far as I can tell, the only (remaining?) reason that CONFIG_NTP_PPS
doesn't work with NO_HZ_COMMON is that the real-time snapshots which
pps_get_ts() uses are not sufficiently accurate, so the phase
correction wouldn't work very well.

The inaccuracy happens because of the way the kernel's timekeeping
sawtooths around the 'ideal' time line, by choosing between adjacent
values of 'mult' and 'mult+1' from one tick to the next. But with a
tickless kernel, of course the correction *doesn't* happen each tick,
and the time reported as CLOCK_REALTIME diverges further from the
correct time.
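
To make the sawtooth concrete, here is a toy userspace model (not kernel
code). It assumes the ideal per-cycle multiplier is mult + frac/2^32, that
the clock can only run at mult or mult+1, and that the choice is made once
per update. peak_error() and its parameters are names invented for this
sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Toy model of the mult / mult+1 dither. 'err' is (ideal - actual) in
 * units of 2^-32 clock-shifted ns. Returns the largest |err| seen.
 */
static int64_t peak_error(uint64_t frac, uint64_t cycles_per_update,
			  unsigned int updates)
{
	int64_t err = 0, peak = 0;
	unsigned int i;

	for (i = 0; i < updates; i++) {
		/* Pick the dither once per update, as a tick would. */
		int64_t dither = err > 0 ? 1 : 0;

		/* Ideal gains frac per cycle; actual gains dither per cycle. */
		err += (int64_t)cycles_per_update *
		       ((int64_t)frac - dither * ((int64_t)1 << 32));

		if (llabs(err) > peak)
			peak = llabs(err);
	}
	return peak;
}
```

Calling peak_error(1ULL << 30, 1000, 1000) and then
peak_error(1ULL << 30, 15000, 1000) (hypothetical numbers: ideal is
mult + 0.25, updates every 1000 vs. 15000 cycles) shows the peak deviation
in this model growing in proportion to the span between updates.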

The thing is... since
https://lore.kernel.org/all/20260614144032.534706-1-dwmw2@infradead.org/
we know *precisely* how far from the truth our CLOCK_REALTIME value is,
and we can just put that information into the system_time_snapshot for
the caller to use as it sees fit. If the caller doesn't care about
monotonicity, it can just add the known 'error' to the snapshot.systime
value, and have a completely accurate snapshot even under nohz.
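
On the consumer side that amounts to a single addition. A minimal
userspace sketch, where struct snap_model and snap_ideal_ns() are
hypothetical stand-ins for the two snapshot fields and not kernel API:

```c
#include <stdint.h>

/* Hypothetical stand-in for the two snapshot fields a consumer would use. */
struct snap_model {
	int64_t systime;	/* ns, as reported by the snapshot */
	int64_t ntp_error;	/* ns, signed offset to the ideal line */
};

/*
 * Land on the ideal line. The result may disagree with an ordinary
 * (uncorrected) clock read taken around the same time, so this is only
 * for callers that don't need the two to be mutually monotonic.
 */
static int64_t snap_ideal_ns(const struct snap_model *s)
{
	return s->systime + s->ntp_error;
}
```

The offset is signed, so the corrected value can land either side of the
reported systime.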

If I run my vmclock reference test on a tickless kernel, I see the
kernel's timekeeping vary by ±15ns around the ideal. The correction
below clamps it back to the ±1ns that I see with a periodic tick.

I think that's enough to enable CONFIG_NTP_PPS too, right? I'll have to
revive the hack at
https://lore.kernel.org/all/87cb97d5a26d0f4909d2ba2545c4b43281109470.camel@infradead.org/
to test it...

Am I missing some other reason for the dependency? Aside from the phase
error, it *does* seem to work. The dependency on !NO_HZ goes all the
way back to the original introduction of hardpps support in commit
025b40abe7, which doesn't explain *why* it didn't work on tickless
kernels.

From: David Woodhouse <redacted>
Date: Fri, 19 Jun 2026 00:00:29 +0100
Subject: [PATCH] timekeeping: Extrapolate ntp_error into snapshots

ktime_get_snapshot_id() is a lockless reader: it interpolates the
clocksource forward from cycle_last at a fixed mult but never runs the
timekeeping accumulation, so tk->ntp_error is only current as of the
last update. Between updates the read accrues the per-cycle deviation
from the NTP-ideal rate; on a NO_HZ kernel that span can be many ticks,
widening the sawtooth between the snapshot's disciplined CLOCK_REALTIME
and the ideal NTP line. This is the obstacle to accurate in-kernel PPS,
which today depends on !NO_HZ_COMMON.

Carry that deviation in the snapshot as a signed nanosecond offset that
a consumer adds directly to ::systime to land on the ideal line. The
offset sums four terms in ns << NTP_SCALE_SHIFT before converting:

  - tk->ntp_error, the deviation as of the last update;
  - (cycle_delta * ntp_err_frac), the fractional-mult drift accrued
    since then (cycle_delta is at most a tick on a tickful kernel, but
    many ticks' worth under NO_HZ);
  - (cycle_delta * ntp_err_mult), subtracting the applied +1 mult dither
    over the same span;
  - the sub-nanosecond fraction dropped when ::systime was truncated to
    whole ns (low shift bits of the read, exact despite overflow).

Only the mono-based clocks (REALTIME/MONOTONIC/BOOTTIME) carry this; RAW
is undisciplined and AUX has its own discipline. The residual is then a
single clocksource cycle, the same bound as a tickful kernel.

NOT-FOR-UPSTREAM: also includes a temporary ptp_vmclock debug hack that
prints the offset and applies it to the returned timestamp, for
validating the field against the host vmclock reference under QEMU.

Signed-off-by: David Woodhouse <redacted>
Assisted-by: Kiro:claude-opus-4.8
---
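For anyone wanting to sanity-check the four-term sum outside the kernel,
a userspace model of the arithmetic might look like the sketch below.
Assumptions: struct tk_model and snapshot_ntp_error() are hypothetical
names; unsigned __int128 (a GCC/Clang extension) stands in for
mul_u64_u64_shr(); and right-shifting a negative value is arithmetic, as
it is with those compilers.

```c
#include <stdint.h>

#define NTP_SCALE_SHIFT 32

/* Hypothetical stand-in for the timekeeper fields the snapshot reads. */
struct tk_model {
	int64_t  ntp_error;		/* ideal - actual, ns << NTP_SCALE_SHIFT */
	uint32_t ntp_error_shift;	/* NTP_SCALE_SHIFT - shift */
	uint32_t ntp_err_mult;		/* 0 or 1: the +1 dither in effect */
	uint64_t ntp_err_frac;		/* fraction of ideal mult, scaled by 2^32 */
	uint32_t mult, shift;
	uint64_t xtime_nsec;		/* ns << shift */
};

static int64_t snapshot_ntp_error(const struct tk_model *tk,
				  uint64_t cycle_delta)
{
	uint32_t nes = tk->ntp_error_shift;
	/* Term 2: fractional-mult drift since the last update. */
	int64_t drift = (int64_t)(((unsigned __int128)cycle_delta *
				   tk->ntp_err_frac) >> 32);
	/* Term 3: what the applied +1 dither already contributed. */
	int64_t dither = (int64_t)(cycle_delta * tk->ntp_err_mult);
	/* Term 4: low 'shift' bits dropped when truncating to whole ns. */
	uint64_t dropped = (cycle_delta * tk->mult + tk->xtime_nsec) &
			   ((1ULL << tk->shift) - 1);
	/* Term 1: the deviation as of the last update. */
	int64_t err = tk->ntp_error;

	/* Multiply rather than shift, to avoid left-shifting a negative. */
	err += (drift - dither) * ((int64_t)1 << nes);
	err += (int64_t)(dropped << nes);

	/* Round to nearest ns (half rounds up); assumes arithmetic shift. */
	return (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}
```

With hypothetical values shift = 8 (so ntp_error_shift = 24), mult = 100,
ntp_err_frac = 1ULL << 31 and no dither applied, 1024 elapsed cycles give
a drift of 512 clock-shifted ns, i.e. 2 ns, which is what the model
returns.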
 drivers/ptp/ptp_vmclock.c           |  2 ++
 include/linux/timekeeper_internal.h |  6 ++++
 include/linux/timekeeping.h         |  9 +++++
 kernel/time/timekeeping.c           | 56 +++++++++++++++++++++++++++--
 4 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index c09ae06d7f68..37a9c8390055 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c

@@ -140,7 +140,9 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
 			ptp_read_system_prets(sts);
 			if (sts->pre_sts.cs_id == st->cs_id) {
 				cycle = sts->pre_sts.cycles;
+				sts->pre_sts.systime += sts->pre_sts.ntp_error;
 				sts->post_sts = sts->pre_sts;
+				pr_info("vmclock pre error %lld\n", sts->pre_sts.ntp_error);
 			} else if (sts->pre_sts.hw_csid == st->cs_id &&
 				   sts->pre_sts.hw_cycles) {
 				cycle = sts->pre_sts.hw_cycles;

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 5dc7f8bf2740..b487e7d925fe 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h

@@ -97,6 +97,11 @@ struct tk_read_base {
  * @ntp_error_shift:		Shift conversion between clock shifted nano seconds and
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
+ * @ntp_err_frac:		Fractional part of the per-cycle NTP-ideal mult that the
+ *				integer @mult truncates, as a fraction of 2^32 in
+ *				clock-shifted nanoseconds per cycle. Used to
+ *				extrapolate @ntp_error to an arbitrary cycle count in
+ *				the lockless snapshot readers (ktime_get_snapshot_id).
  * @cs_tick_adj:		Per-second adjustment handed to NTP via ntp_clear()
  *				accounting for the difference between the nominal
  *				NTP interval and the real time taken by the

@@ -187,6 +192,7 @@ struct timekeeper {
 	s64			ntp_error;
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
+	u64			ntp_err_frac;
 	s64			cs_tick_adj;
 	u32			skip_second_overflow;
 	s64			skew_delta;

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 984a866d293b..e53be1952021 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h

@@ -283,6 +283,14 @@ static inline bool ktime_get_aux_ts64(clockid_t id, struct timespec64 *kt) { ret
  *			which @cycles was derived
  * @systime:		The system time of the selected CLOCK ID
  * @monoraw:		Monotonic raw system time
+ * @ntp_error:		Signed nanosecond offset of @systime from the ideal
+ *			NTP-disciplined time at @cycles. Extrapolated to @cycles
+ *			(so it is exact even when many cycles have elapsed since the
+ *			last timekeeping update, e.g. on a NO_HZ kernel) and includes
+ *			the sub-nanosecond fraction dropped when @systime was
+ *			truncated to whole ns. A consumer lands on the ideal line by
+ *			adding @ntp_error directly to @systime. Only meaningful for
+ *			CLOCK_REALTIME/CLOCK_MONOTONIC/CLOCK_BOOTTIME.
  * @cs_id:		Clocksource ID
  * @hw_csid:		Clocksource ID of the underlying hardware counter for derived
  *			clocksources which implement the read_snapshot() callback.

@@ -295,6 +303,7 @@ struct system_time_snapshot {
 	u64			hw_cycles;
 	ktime_t			systime;
 	ktime_t			monoraw;
+	s64			ntp_error;
 	enum clocksource_ids	cs_id;
 	enum clocksource_ids	hw_csid;
 	unsigned int		clock_was_set_seq;

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index a67d2f27c73e..e319eca307ee 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c

@@ -407,6 +407,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr_mono.mult = clock->mult;
 	tk->tkr_raw.mult = clock->mult;
 	tk->ntp_err_mult = 0;
+	tk->ntp_err_frac = 0;
 	tk->skip_second_overflow = 0;
 	tk->skew_delta = 0;

@@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
 
 		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
 		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
+
+		/*
+		 * For the NTP-disciplined mono-based clocks, report how far
+		 * @systime is from the ideal NTP time at @now, in signed ns,
+		 * so a caller can land on the ideal line by adding it. Four
+		 * terms, summed in ns << NTP_SCALE_SHIFT before converting:
+		 *
+		 *  - tk->ntp_error, the deviation as of the last update;
+		 *  - (cycle_delta * ntp_err_frac), the fractional-mult drift
+		 *    accrued since then (cycle_delta is at most a tick on a
+		 *    tickful kernel, but many ticks' worth under NO_HZ);
+		 *  - (cycle_delta * ntp_err_mult), subtracting the applied +1
+		 *    mult dither over the same span;
+		 *  - the sub-ns fraction @systime dropped when the read was
+		 *    truncated to whole ns (low @shift bits, exact despite the
+		 *    multiply overflowing).
+		 *
+		 * RAW is undisciplined and AUX has its own discipline, so they
+		 * carry no ntp_error.
+		 */
+		if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
+		    clock_id == CLOCK_BOOTTIME) {
+			u32 nes = tk->ntp_error_shift;
+			u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
+					  tk->tkr_mono.mask;
+			s64 err = tk->ntp_error +
+				(((s64)mul_u64_u64_shr(cycle_delta,
+						       tk->ntp_err_frac, 32) -
+				  (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
+
+			err += (s64)((cycle_delta * tk->tkr_mono.mult +
+				      tk->tkr_mono.xtime_nsec) &
+				     ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
+			systime_snapshot->ntp_error =
+				(err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
+				NTP_SCALE_SHIFT;
+		} else {
+			systime_snapshot->ntp_error = 0;
+		}
 	} while (read_seqcount_retry(&tkd->seq, seq));
 
 	systime_snapshot->cycles = now;

@@ -2432,6 +2472,7 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
 	u64 ntp_tl = ntp_tick_length(tk->id);
 	s64 skew = ntp_get_skew_delta(tk->id);
+	u64 dividend;
 	u32 mult;
 
 	/*

@@ -2452,8 +2493,19 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 		 * scale it back up to the full per-tick rate for the mult bias.
 		 */
 		skew *= NTP_INTERVAL_FREQ;
-		mult = div64_u64((tk->ntp_tick + skew) >> tk->ntp_error_shift,
-				 tk->cycle_interval);
+		dividend = (tk->ntp_tick + skew) >> tk->ntp_error_shift;
+		mult = div64_u64(dividend, tk->cycle_interval);
+		/*
+		 * Stash the fractional part of the per-cycle ideal mult that
+		 * the integer @mult discards, scaled by 2^32, in clock-shifted
+		 * ns per cycle. The lockless snapshot readers use it to
+		 * extrapolate @ntp_error forward over the cycles accumulated
+		 * since the last tick (which on a NO_HZ kernel may be many
+		 * ticks' worth).
+		 */
+		tk->ntp_err_frac = div64_u64((dividend - (u64)mult *
+					      tk->cycle_interval) << 32,
+					     tk->cycle_interval);
 	}
 
 	/*

-- 
2.43.0

