Re: [PATCH v7] posix-timers: add clock_compare system call

From: Sagi Maimon <hidden>
Date: 2024-03-20 14:42:43
Also in: linux-arch, lkml, netdev

On Thu, Mar 14, 2024 at 8:08 PM Thomas Gleixner [off-list ref] wrote:

Sagi!

On Thu, Mar 14 2024 at 14:19, Sagi Maimon wrote:

quoted

On Thu, Mar 14, 2024 at 1:12 PM Thomas Gleixner [off-list ref] wrote:

quoted

On Thu, Mar 14 2024 at 11:05, Sagi Maimon wrote:

quoted

Some user space applications need to read a couple of different clocks.
Each read requires moving from user space to kernel space.
Reading each clock separately (syscall) introduces extra
unpredictable/unmeasurable delay. Minimizing this delay contributes to user
space actions on these clocks (e.g. synchronization etc).

I asked for a proper description of the actual problem several times now
and you still provide some handwaving blurb. Feel free to ignore me, but
then please don't be surprised if I ignore you too.

Nobody is ignoring your notes, and I address any notes given by any
maintainer in the most serious way.
As far as explaining the actual problem this is the best that I can,
but let me try to explain better:
We did many tests with different CPU loading and compared sampling the
same clock twice,
once in user space and once by using the system call.
We have noticed an improvement up to hundreds of nanoseconds while
using the system call.
Those results improved our ability to sync different PHCs

So let me express how I understand the problem - as far as I decoded it
from your writeups:

  Synchronizing two PHCs requires to read timestamps from both and
  correlate them. Currently this requires several seperate system calls.
  This is subject to unmeasurable delays due to system call overhead,
  preemption and interrupts which makes the correlation imprecise.

  Therefore you want a system call, which samples two clocks at once, to
  make the correlation more precise.

Right? For the further comments I assume this is what you are trying to
say and to solve

You are right.

So far so good, except that I do not agree with that reasoning at all:

   1. The delays are measurable and as precise as the cross time stamp
      mechanism (hardware or software based) allows.

Most of the PHCs do not support crosstime stamps.

   2. The system call overhead is completely irrelevant.

You are right in case of long preemption, but in other cases it is relevant.

   3. The time deltas between the sample points are irrelevant within a
      reasonable upper bound to the time delta between the two outer
      sample points.

See below

   4. The alledged higher precision is based on a guesstimate and not on
      correctness. Just because it behaves slightly better in testing
      does not make it any more correct.

See below

   5. The problem can be solved with maximal possible accuracy by using
      the existing PTP IOCTLs.

See below.

quoted

Also why does reading two random clocks make any sense at all? Your code
allows to read CLOCK_TAI and CLOCK_THREAD_CPUTIME_ID. What for?

Initially we needed to sync some different PHCs for our user space
application, that is why we came with this idea.
The first idea was an IOCTL that returned samples of several PHCs for
the need of synchronization.
Richard Cochran suggested a system call instead, which will add the
ability to get various system clocks, while this
implementation is more complex then IOCTL, I think that he was right,
for future usage.

Which future usage? We are not introducing swiss army knife interfaces
just because there might be an illusional use case somewhere in the
unspecified future.

Adding a system call needs a proper design and justification. Handwaving
future usage is not enough.

Documentation/process/adding-syscalls.rst is very clear about what is
required for a new system call.

quoted

This needs to be split up into:

     1) Infrastructure in posix-timers.c
     2) Wire up the syscall in x86
     3) Infrastructure in posix-clock.c
     4) Usage in ptp_clock.c

and not as a big lump of everything.

I know, but I think the benefit worth it

It's worth it because it makes review easier. It's well documented in
the process documentation that patches should do one thing and not a
whole lump of changes.

No problem it will be split into two different patches.

quoted

+             if (!error) {
+                     if (clock_b == CLOCK_MONOTONIC_RAW) {
+                             ts_b = ktime_to_timespec64(xtstamp_a1.sys_monoraw);
+                             ts_a1 = ktime_to_timespec64(xtstamp_a1.device);
+                             goto out;
+                     } else if (clock_b == CLOCK_REALTIME) {
+                             ts_b = ktime_to_timespec64(xtstamp_a1.sys_realtime);
+                             ts_a1 = ktime_to_timespec64(xtstamp_a1.device);
+                             goto out;
+                     } else {
+                             crosstime_support_a = true;

Right. If clock_b is anything else than CLOCK_MONOTONIC_RAW or
CLOCK_REALTIME then this is true.

quoted

+                     }
+             }

So in case of an error, this just keeps going with an uninitialized
xtstamp_a1 and if the clock_b part succeeds it continues and operates on
garbage.

On error  xtstamp_a1 will be taken again using clock_get_crosstimespec
so no one will be operating on garbage.

It will not, because crosstime_support_a == false. It will therefore
fall back to kc_a->clock_get_timespec(), no?

Sorry, I misread the code vs. using the uninitialized value, but this is
just unneccesary hard to follow.

quoted

+     if (crosstime_support_a)
+             error = kc_a->clock_get_crosstimespec(clock_a, &xtstamp_a2);
+     else
+             error = kc_a->clock_get_timespec(clock_a, &ts_a2);
+
+     if (error)
+             return error;

The logic and the code flow here are unreadable garbage and there are
zero comments what this is supposed to do.

I will add comments.
please no need to use negative words like "garbage" (not the first time),
please keep it professional and civilized.

Let me rephrase:

The code and the logic is incomprehensible unless I waste an unjustified
amount of time to decode it. Sorry, I don't have that time.

quoted

+     if (crosstime_support_a) {
+             ktime_a = ktime_sub(xtstamp_a2.device, xtstamp_a1.device);
+             ts_offs_err = ktime_divns(ktime_a, 2);
+             ktime_a = ktime_add_ns(xtstamp_a1.device, (u64)ts_offs_err);
+             ts_a1 = ktime_to_timespec64(ktime_a);

This is just wrong.

     read(a1);
     read(b);
     read(a2);

You _CANNOT_ assume that (a1 + ((a2 - a1) / 2) is anywhere close to the
point in time where 'b' is read. This code is preemtible and
interruptible. I explained this to you before.

Your explanation in the comment above the function is just wishful
thinking.

you explained it before, but still it is better then two consecutive
user space calls which are also preemptible
and the userspace to kernel context switch time is added.

It might be marginally better, but it is still just _pretending_ that it
does the right thing, is correct and better than the existing IOCTLs.

If your user space implementation has the same algorithm, then I'm
absolutely not surprised that the results are not useful. Why?

You simply cannot use the midpoint of the outer samples if you want to
have precise results if there is no guarantee that b was sampled exactly
in the midpoint of a1 and a2. A hardware implementation might give that
guarantee, but the kernel cannot.

But why using the midpoint in the first place?

There is absolutely no reason to do so because the sampling points a1, b
and a2 can be precisely determined with the precision of the cross time
stamp mechanism, which is best with a hardware based cross time stamp
obviously.

The whole point of ptp::info::getcrosststamp() is to get properly
correlated clock samples of

      1) PHC clock
      2) CLOCK_MONOTONIC_RAW
      3) CLOCK_REALTIME

So if you take 3 samples:

   get_cross_timestamp(a1);
   get_cross_timestamp(b);
   get_cross_timestamp(a2);

then each of them provides:

    - device time
    - correlated CLOCK_MONOTONIC_RAW
    - correlated CLOCK_REALTIME

Ergo the obvious thing to do is:

    d1 = b.sys_monoraw - a1.sys_monoraw;
    d2 = a2.sys_monoraw - a1.sys_monoraw;

    tsa = a1.device + ((a2.device - a1.device) * d1) / d2;

Which is maximaly precise under the assumption that in the time between
the sample points a1 and a2 neither the system clock nor the PCH clocks
are changing their frequency significantly. That is a valid assumption
when you put a reasonable upper bound on d2.

You are right.
In fact, we are running this calculation on a user space application.
We use the new system call to get pairs of mono and PHC and then run
that calculation in user space.
That is why the system call returns pairs of clock samples and not the
diff between them.

Even when the device does not implement getcrosststamp() then loop based
sampling like it is implemented in the PTP_SYS_OFFSET[_EXTENDED] IOTCLs
is providing reasonably accurate results to the extent possible.

Your algorithm is imprecise by definition and you can apply as much
testing as you want, it won't become magically correct. It's still a
guesstimate, i.e. an estimate made without using adequate or complete
information.

Now why a new syscall?

This can be done from user space with existing interfaces and the very
same precision today:

     ioctl(fda, PTP_SYS_OFFSET*, &a1);
     ioctl(fdb, PTP_SYS_OFFSET*, &b);
     ioctl(fda, PTP_SYS_OFFSET*, &a2);

     u64 d1 = timespec_delta_ns(b.sys_monoraw, a1.sys_monoraw);
     u64 d2 = timespec_delta_ns(a2.sys_monoraw, a1.sys_monoraw);
     u64 td = (timespec_delta_ns(a2.device, a1.device) * d1) / d2

     tsa = timespec_add_ns(a1.device, td);
     tsb = b.device;

with the extra benefit of:

     1) The correct CLOCK_REALTIME at that sample point,
        i.e. b.sys_realtime

     2) The correct CLOCK_MONOTONIC_RAW at that sample point,
        i.e. b.sys_monoraw

If PTP_SYS_OFFSET IOCTL returns sys_monoraw, then you are right, but
unfortunately the only IOCTL that returns sys_monoraw is
PTP_SYS_OFFSET_PRECISE (getcrosststamp)
And most of the drivers does not support it.

It works with PTP_SYS_OFFSET_PRECISE and PTP_SYS_OFFSET[_EXTENDED], with
the obvious limitations of PTP_SYS_OFFSET[_EXTENDED], which are still
vastly superior to your proposed (a2 - a1) / 2 guestimate which is just
reading the PCH clocks with clock_get_timespec().

It only works with PTP_SYS_OFFSET_PRECISE (which most of the NIC
drivers does not support and I take it under consideration in my
system call),
PTP_SYS_OFFSET_EXTENDED ioctl returns system time before, PHC, system
time after , and no monotic raw.

It is completely independent of the load, the syscall overhead and the
actual time delta between the sample points when you apply a reasonable
upper bound for d2, i.e. the time delta between the sample points a1 and
a2 to eliminate the issue that system clock and/or the PCH clocks change
their frequency significantly during that time. You'd need to do that in
the kernel too.

The actual frequency difference between PCH A and system clock is
completely irrelevant when the frequencies of both are stable accross
the sample period.

You might still argue that the time delta between the sample points a1
and a2 matters and is slightly shorter in the kernel, but that is a
non-argument because:

  1) The kernel implementation does not guarantee atomicity of the
     consecutive samples either. The time delta is just statistically
     better, which is obviously useless when you want to make
     guarantees.

  2) It does not matter when the time delta is slightly larger because
     you need a large frequency change of the involved clocks in the
     sample interval between the sample points a1 and a2 to make an
     actual difference in the resulting accuracy.

     A typical temperature drift of a high quality cyrstal is less than
     1ppm per degree Celsius and even if you assume that the overall
     system drift is 10ppm per degree Celsius then still the actual
     error for a bound time delta between the sample points a1 and a2 is
     just somewhere in the irrelevant noise, unless you manage to blow
     torch or ice spray your crystals during the sample interval.

     If your clocks are not stable enough then nothing can cure it and
     you cannot do high precision timekeeping with them.

You are right in case of long preemption (which still the system call
is better), but in other cases it is relevant.

So what is your new syscall solving that can't be done with the existing
IOCTLs other than providing worse precision results based on
guesstimates and some handwavy future use for random clock ids?

Nothing as far as I can tell, but I might be missing something important
here.

Few points to consider:
1) Most of the PHCs do not support cross time stamping.
2) Users can implement your suggested code in user space while using
the new system call to get pairs of mono and PHC
     . This is what we did already in user space.
3) User with less tight requirement will benefit high accuracy with
the new system call

Thanks,

        tglx
---
arch/x86/kernel/tsc.c:119: "Math is hard, let's go shopping." - John Stultz

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help