Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number

From: "Theodore Tso" <tytso@mit.edu>
Date: 2026-05-27 01:22:17
Also in: linux-f2fs-devel, linux-fsdevel, linux-mm, lkml

On Tue, May 26, 2026 at 09:52:40PM +0000, Jaegeuk Kim wrote:
> > It seems... surprising that the additional I/O operations are actually
> > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> > into why this is happening, and whether there is anything that can be
> > optimized below the file system?
>
> I can't tell the exact size tho, roughly it's between 1GB and
> 4GB. And, per lots of test results with various tunings, it turned
> out memory allocation speed was the culprit. If we use 4KB pages, we
> couldn't get the full bandwidth unless we set the biggest core
> running at the highest frequency.

OK, if we assume that the model file that you want to load is 2GB,
then the number of 4k pages that you need is a bit over half a million
(524288).  So if it takes 1 second without large folios (2 GB/s as you
stated above), and half a second with them (4 GB/s), then you're
basically saying that it was costing you half a second to allocate
524288 singleton pages.  And the whole point of this exercise is to
save that half second?
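
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 2 GiB model size is the assumption made in this message, the bandwidth figures are the ones quoted earlier in the thread (treated as GiB/s for round numbers), and the variable and function names are made up for illustration.

```python
PAGE_SIZE = 4 * 1024        # 4 KiB base page
MODEL_SIZE = 2 * 1024**3    # assumed 2 GiB model file


def load_seconds(size_bytes, gib_per_s):
    """Time to read size_bytes at a sustained rate of gib_per_s GiB/s."""
    return size_bytes / (gib_per_s * 1024**3)


pages = MODEL_SIZE // PAGE_SIZE                  # number of 4k pages needed
slow = load_seconds(MODEL_SIZE, 2)               # 4k pages, 2 GiB/s
fast = load_seconds(MODEL_SIZE, 4)               # large folios, 4 GiB/s
extra_ns_per_page = (slow - fast) / pages * 1e9  # implied per-page overhead

print(pages)                     # 524288
print(slow, fast)                # 1.0 0.5
print(round(extra_ns_per_page))  # 954
```

In other words, under these assumptions the half second works out to a bit under a microsecond of extra cost per 4k page.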

And I assume that these timings were taken using a performance core,
and part of the goal here is to be able to use an efficiency core
instead.

Did I get that right?

> > But the problem with using small folios is that if you want to
> > actually *use* the memory, unless you want to segment out the memory
> > so it can't be used for anything other than the AI models (e.g., by
> > using something like hugetlbfs) it's just going to break up the memory
> > into smaller folios.  So that's not actually going to *help* in actual
> > real life use cases.  It might help for your artificial benchmarks /
> > experiments, but in the real life case where Android applications are
> > running and fragmenting all of the device memory, the large folios
> > won't be available *anyway*.
>
> Agreed it's hard to get this done perfectly tho, as the best effort on this
> particular AI model case, I focused on two timings when loading the models:
> 1) right after device boot, 2) dynamic loading when required. To secure high
> order pages, for 1), I disabled the large folio consumed by EROFS, while for
> 2), I tried to call compact_memory before loading the model. In both cases,
> I could observe that we got a fair amount of large folios. Yes, not 100% tho.

If (1) is a common case in real life, the thing to do would be to grab
2GB of large folios early in the startup sequence, and then let
erofs do its thing --- and then at the end of the startup, right before you
load the model, you can release the 2GB worth of large folios.

(That being said, I'm guessing #1 is actually not that interesting,
since as a percentage of the time that it takes for an Android device
to start up, is adding an extra half-second *really* going to be
noticeable by the user?)

But case #2 is the much more challenging one.  If you don't
call compact_memory() you're going to burn half a second to allocate
the 4k pages, since the large folios won't be available.  But if you
*do* call compact_memory() in a production ROM, depending on how
fragmented the memory is and how much memory you have, calling
compact_memory() could take **minutes**.  So what's the point?

The bottom line is if it's right after device boot, there are simple
techniques that don't require hacking up f2fs.  But in the
demand-loaded case, calling compact_memory() is the last thing you'll
want to do.  You're better off either asking the mm to allocate the 4k
pages, or asking it to do whatever compaction it can to free up just
2GB worth of folios.  (Calling compact_memory() is overkill, and only
makes sense in the context of a benchmark / proof of concept demo.)
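
One way to make the trade-off above concrete is to look at how much free memory already sits in high-order blocks before deciding whether any compaction is worth attempting. A minimal sketch, assuming 4 KiB base pages and the usual /proc/buddyinfo layout (one column of free-block counts per order, starting at order 0); the function name, the sample text, and the order-9 (2 MiB) threshold are illustrative choices, not anything from the patch:

```python
PAGE_SIZE = 4096  # assumed base page size in bytes

# Made-up sample in the /proc/buddyinfo layout: free-block counts for
# orders 0..10 in a single zone.
SAMPLE = "Node 0, zone   Normal    100 50 20 10 5 4 3 2 1 1 1\n"


def free_bytes_at_or_above(buddyinfo_text, min_order):
    """Sum the free memory held in buddy blocks of order >= min_order."""
    total = 0
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        # Expected shape: Node <n>, zone <name> <count order 0> <count order 1> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        counts = [int(c) for c in fields[4:]]
        for order, count in enumerate(counts):
            if order >= min_order:
                total += count * (PAGE_SIZE << order)
    return total


# Order 9 is 2 MiB with 4 KiB pages: one order-9 block plus one order-10 block.
print(free_bytes_at_or_above(SAMPLE, 9))  # 6291456
```

On a real device the text would come from reading /proc/buddyinfo. If the result is far below the model size, the high-order memory simply isn't there, and the choice is between paying for the 4k allocations and paying for compaction.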

Either way, trying to get file systems to avoid using large folios in
the hopes that this will speed up large AI model loading.... doesn't
seem to make sense.

> > If the problem is fundamentally about making 2GB worth of large folios
> > available in a way that takes significantly less time than just
> > allocating the model using half-million 4k pages, that's the question
> > that we should be asking Matthew and the mm folks.  Which is why it
> > was too bad we didn't raise this issue at LSF/MM earlier this month.
>
> Indeed, I was off from LSF/MM for years due to various product issues, not
> related to F2FS tho. Let me make some effort to attend upcoming ones like LPC,
> if I can get the budget from company.

Next time, as a suggestion, feel free to raise the issue when the
LSF/MM CFP goes out, even if you don't think it's likely you will get
an invite.  Indeed, with a sufficiently interesting topic, that's the
way to *get* an invitation.  It will require breaking down the
technical requirements as you and I have done for the last few messages
on this thread.

Even if you can't attend LSF/MM due to time or budget reasons, there
are a number of your colleagues who are attending, who could raise the
question on your behalf.  I've been known to do that once or twice on
behalf of other Google teams.  But it does require that you approach
the usual LSF/MM suspects a good 2-3 months before the conference so
we can help you craft an appropriate response to the CFP.

Cheers,

					- Ted