Thread (11 messages), 5 authors, 2023-03-24

Re: [PATCH v11 0/3] cachestat: a new syscall for page cache state of files

From: Andres Freund <hidden>
Date: 2023-03-15 19:15:15
Also in: linux-mm, lkml

Hi,

On 2023-03-15 13:09:34 -0400, Johannes Weiner wrote:
On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
A while ago I asked about the security implications - could cachestat()
be used to figure out what parts of a file another user is reading.
This also applies to mincore(), but cachestat() newly permits user A to
work out which parts of a file user B has *written* to.
The caller of cachestat() must have the file open for reading. If they
can read the contents that B has written, is the fact that they can
see dirty state really a concern?
Random idea: Only fill ->dirty/writeback if the fd is open for writing.

Secondly, I'm not seeing description of any use cases.  OK, it's faster
and better than mincore(), but who cares?  In other words, what
end-user value compels us to add this feature to Linux?
Years ago there was a thread about adding dirty bits to mincore(), I
don't know if you remember this:

https://lkml.org/lkml/2013/2/10/162

In that thread, Rusty described a usecase of maintaining a journaling
file alongside a main file. The idea for testing the dirty state isn't
to call sync but to see whether the journal needs to be updated.

The efficiency of mincore() was touched on too. Andres Freund (CC'd,
hopefully I got the email address right) mentioned that Postgres has a
usecase for deciding whether to do an index scan or query tables
directly, based on whether the index is cached. Postgres works with
files rather than memory regions, and Andres mentioned that the index
could be quite large.
This is still relevant, FWIW. And not just for deciding on the optimal query
plan, but also for reporting purposes. We can show the user how much IO each
part of a query has done, but that can end up being quite confusing because
we're not aware of how much of that IO was fulfilled by the page cache.
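
For reference, a rough sketch of what answering "how much of this file is in
the page cache" looks like with mincore() today. It has to map the whole file
first, which is the overhead mentioned above for large indexes. This is an
illustration only, not Postgres code; resident_pages() is a hypothetical
helper name. Note that recent kernels only report real residency from
mincore() if the caller owns the file or could open it for writing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of the file behind fd are
 * resident in the page cache. Stores the file's page count in *total_pages.
 * Returns the resident page count, or -1 on error.
 */
static long resident_pages(int fd, long *total_pages)
{
    struct stat st;
    long page = sysconf(_SC_PAGESIZE);

    assert(total_pages != NULL);
    if (page <= 0 || fstat(fd, &st) != 0)
        return -1;

    size_t npages = ((size_t)st.st_size + (size_t)page - 1) / (size_t)page;
    *total_pages = (long)npages;
    if (npages == 0)
        return 0; /* empty file: nothing to map */

    /* mincore() works on memory, so the file has to be mapped first. */
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* One status byte per page; bit 0 set means the page is resident. */
    unsigned char *vec = malloc(npages);
    long resident = -1;
    if (vec != NULL && mincore(map, (size_t)st.st_size, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
    }

    free(vec);
    munmap(map, (size_t)st.st_size);
    return resident;
}
```

The per-page vector and the mapping both scale with file size, and the result
says nothing about dirty or writeback state.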

Most recently, the database team at Meta reached out to us and asked
about the ability to query dirty state again. The motivation for this
was twofold. One was simply visibility into the writeback algorithm,
i.e. trying to figure out what it's doing when investigating
performance problems.

The second usecase they brought up was to advise writeback from
userspace to manage the tradeoff between integrity and IO utilization:
if IO capacity is available, sync more frequently; if not, let the
work batch up. Blindly syncing through the file in chunks doesn't work
because you don't know in advance how much IO they'll end up doing (or
how much they've done, afterwards.) So it's difficult to build an
algorithm that will reasonably pace through sparsely dirtied regions
without the risk of overwhelming the IO device on dense ones. And it's
not straight-forward to do this from the kernel, since it doesn't know
the IO headroom the application needs for reading (which is dynamic).
We ended up building something very roughly like that in userspace - each
backend tracks the last N writes, and once the number reaches a certain
limit, we sort and collapse the outstanding ranges and issue
sync_file_range(SYNC_FILE_RANGE_WRITE) for them. Different types of tasks have
different limits. Without that, latency in write-heavy workloads is ... not
good (to this day, but to a lesser degree than 5-10 years ago).
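
To make the sort-and-collapse step concrete, here is a minimal sketch. It is
not the actual Postgres implementation; struct pending_write,
collapse_ranges() and flush_pending() are hypothetical names, and the sketch
assumes a single file descriptor and ignores the per-task limits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical record of one tracked write: byte offset and length. */
struct pending_write {
    off_t offset;
    off_t nbytes;
};

static int cmp_by_offset(const void *a, const void *b)
{
    const struct pending_write *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort the tracked writes by offset and merge overlapping or adjacent
 * ranges in place. Returns the number of merged ranges left at the
 * front of the array.
 */
static size_t collapse_ranges(struct pending_write *w, size_t n)
{
    assert(w != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(w, n, sizeof(*w), cmp_by_offset);

    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        off_t end = w[out].offset + w[out].nbytes;
        if (w[i].offset <= end) {
            /* Overlaps or touches the current range: extend it if needed. */
            off_t new_end = w[i].offset + w[i].nbytes;
            if (new_end > end)
                w[out].nbytes = new_end - w[out].offset;
        } else {
            /* Gap: start a new merged range. */
            w[++out] = w[i];
        }
    }
    return out + 1;
}

/*
 * Collapse the tracked writes and ask the kernel to start writeback for
 * each merged range, without waiting for completion.
 * Returns 0 on success, -1 on the first failing call (errno is set).
 */
static int flush_pending(int fd, struct pending_write *w, size_t n)
{
    size_t merged = collapse_ranges(w, n);
    for (size_t i = 0; i < merged; i++) {
        if (sync_file_range(fd, w[i].offset, w[i].nbytes,
                            SYNC_FILE_RANGE_WRITE) != 0)
            return -1;
    }
    return 0;
}
```

Collapsing first keeps the number of syscalls proportional to the number of
distinct dirty regions rather than to the number of individual writes.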

Another query we get almost monthly is service owners trying to
understand where their memory is going and what's causing unexpected
pressure on a host. They see the cache in vmstat, but between a
complex application, shared libraries or a runtime (jvm, hhvm etc.)
and a myriad of host management agents, there is so much going on on
the machine that it's hard to find out who is touching which
files. When it comes to disk usage, the kernel provides the ability to
quickly stat entire filesystem subtrees and drill down with tools like
du. It sure would be useful to have the same for memory usage.
+1

Greetings,

Andres Freund