Thread: 47 messages, 9 authors, 1d ago

Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number

From: "Theodore Tso" <tytso@mit.edu>
Date: 2026-05-26 13:43:13
Also in: linux-f2fs-devel, linux-fsdevel, linux-mm, lkml

On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> Background
> ----------
> The primary use case is accelerating AI model loading, which demands
> exceptionally high sequential read speeds. In our benchmarks on embedded
> systems:
>  - Using high-order page allocations allows the system to saturate the
>    Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
>    medium-to-low CPU frequencies.
>  - In contrast, standard small folios cap performance at 2 GB/s.

So you're interested in optimizing the I/O speeds.  And apparently, on
your hardware, the UFS controller has limits on scatter-gather entries
--- UFS seems to call this Physical Region Description (PRD) table
entries.  Per Gemini:

    1. PRD Segment & Length Limits
	
	Maximum PRD Entries: Hardware limits typically cap the number
	    of PRD entries (or segments) to 255 or 256 per transfer
	    request.
	
	Maximum Transfer Length: Each individual PRD entry typically
	    allows a maximum transfer size of (65,535 bytes) per segment.

    2. Host Controller Hardware Limits (UFSHCI)
    
	Transfer Queue Depth: A UFS controller supports a predefined
	    number of outstanding task request entries. This is often
	    hard-capped at 32 concurrent transfer requests (slots) by the
	    doorbell register array.
	
	Descriptor Pre-fetch: Some UFS host controllers are
	   pre-configured to pre-fetch multiple PRD entries sequentially
	   before requiring main memory reads.

Is this an accurate description of the limits that you are trying to
work with?  How much data are you trying to read?  Looking at Gemma 4
models, E2B is about 10GB or 3GB for the 4-bit quantized version.  E4B
is 15GB, or 5GB for the 4-bit quantized version.  Is that about right?
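
For scale, the sketch below turns those sizes into load times at the two
reported read rates.  The model sizes are the approximate figures guessed
at above, not measurements, and decimal gigabytes are assumed throughout.

```python
GB = 10**9  # decimal gigabytes (assumption)

# Approximate model sizes from the paragraph above (unconfirmed estimates).
MODEL_BYTES = {
    "E2B": 10 * GB,
    "E2B 4-bit": 3 * GB,
    "E4B": 15 * GB,
    "E4B 4-bit": 5 * GB,
}

# Sequential read rates reported in the quoted benchmark.
RATES = {"small folios": 2 * GB, "large folios": 4 * GB}


def load_seconds(size_bytes: int, bytes_per_second: int) -> float:
    """Time to read the whole file once at a constant rate."""
    return size_bytes / bytes_per_second


for name, size in MODEL_BYTES.items():
    slow = load_seconds(size, RATES["small folios"])
    fast = load_seconds(size, RATES["large folios"])
    print(f"{name:10s} {slow:5.2f}s at 2 GB/s, {fast:5.2f}s at 4 GB/s")
```

For the 4-bit E2B case this works out to 1.5 seconds versus 0.75 seconds,
so the absolute saving per load depends heavily on which model is
actually being read.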

It seems... surprising that the additional I/O operations are actually
throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
into why this is happening, and whether there is anything that can be
optimized below the file system?

> Problem Statement
> -----------------
> High-order pages become heavily fragmented and scarce shortly after
> device boot.  We cannot afford to deplete these limited resources on
> default filesystem operations using large folios. Instead, we need a
> mechanism to strictly prioritize and reserve high-order allocations
> for specific, critical payloads: specifically, large AI model files.

There's a fundamental assumption here, which is that the only use of
high order pages is the page cache.  This doesn't take into account
anonymous pages used by programs that aren't backed by files.  Nor does
it take into account kernel memory allocations.

But that being said, you seem to be assuming that you can reduce the
pressure on high order pages by only using large folios for these AI
model files.

But the problem with using small folios is that if you want to
actually *use* the memory, unless you want to segment out the memory
so it can't be used for anything other than the AI models (e.g., by
using something like hugetlbfs), it's just going to break up the memory
into smaller folios.  So that's not actually going to *help* in actual
real life use cases.  It might help for your artificial benchmarks /
experiments, but in the real life case where Android applications are
running and fragmenting all of the device memory, the large folios
won't be available *anyway*.
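
One way to check this on a running device is to look at /proc/buddyinfo,
which lists the number of free blocks per allocation order for each
memory zone.  The sketch below parses that format; the helper names and
the sample line are made up for illustration, and the choice of order 4
(64 KiB with 4 KiB pages) as the threshold is an assumption.

```python
def parse_buddyinfo(text: str) -> dict:
    """Map (node, zone) -> list of free block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal <count order 0> <order 1> ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_blocks_at_or_above(counts: list, min_order: int) -> int:
    """Number of free blocks whose order is at least min_order."""
    return sum(counts[min_order:])


def read_buddyinfo(path: str = "/proc/buddyinfo") -> dict:
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_buddyinfo(f.read())


# Made-up sample line in the /proc/buddyinfo format, not real device data.
SAMPLE = "Node 0, zone   Normal   4096 2048 512 64 8 1 0 0 0 0 0"
zones = parse_buddyinfo(SAMPLE)
print(free_blocks_at_or_above(zones[(0, "Normal")], 4))  # prints 9
```

Sampling this shortly after boot and again with a typical set of
applications running would show how quickly the high-order blocks
disappear in practice.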

> Q: Why is deregistering the inode number linked to inode deletion?
> A: We need the high-order allocation hint to persist even if the inode is
>  temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
>  list of hinted inode numbers. When a file is permanently deleted, its hint
>  becomes obsolete, requiring us to deregister it from the list to prevent memory
>  leaks or identifier reuse conflicts.

Assuming that the high-order allocation hint is a good thing, why not
just make it persistent?  e.g., just a *real* extended attribute
(which is more wasteful of space), or grab a flag in the on-disk f2fs
inode?  Then you don't need to have an in-memory list of hinted
inodes; instead, you can just have the Android package manager set
that flag indicating that you want that special treatment.  This is
all assuming that we need an explicit hint, though....
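
As a minimal sketch of the extended-attribute variant, seen from
userspace: the attribute name below is hypothetical and is not an
existing f2fs interface, and the kernel side that would consult it when
choosing folio sizes is not shown.  The point is only that the hint then
lives with the file, goes away when the file is deleted, and needs no
separate in-memory list.

```python
import os

# Hypothetical attribute name, chosen for illustration only.
HINT_XATTR = "user.large_folio_hint"


def set_hint(path: str) -> bool:
    """Mark a file as wanting large folios.  Returns False if the
    platform or filesystem does not support user xattrs."""
    try:
        os.setxattr(path, HINT_XATTR, b"1")
    except (OSError, AttributeError):
        return False
    return True


def has_hint(path: str) -> bool:
    """Check whether the hint is present on the file."""
    try:
        return os.getxattr(path, HINT_XATTR) == b"1"
    except (OSError, AttributeError):
        return False
```

Something like a package manager would call set_hint() once when it
installs the model file.  An on-disk inode flag would follow the same
pattern with less space overhead.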

> Massive AI model loading is a long-term architectural
> paradigm. Providing a targeted VFS/filesystem hint to optimize read
> bandwidth for specific large datasets is a highly practical,
> repeatable pattern that addresses a systemic bottleneck in embedded
> AI deployments.

It's really too bad you didn't propose this as an LSF/MM topic and
present it at a session in Zagreb two weeks ago.  That would have
been a much more upstream-friendly way of collaborating, and it might
have allowed the mm experts to give you some more dynamic, real-time
feedback.

Cheers,

					- Ted