Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: "Theodore Tso" <tytso@mit.edu>
Date: 2026-05-26 13:43:13
Also in:
linux-f2fs-devel, linux-fsdevel, linux-mm, lkml
On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> Background
> ----------
> The primary use case is accelerating AI model loading, which demands
> exceptionally high sequential read speeds. In our benchmarks on
> embedded systems:
> - Using high-order page allocations allows the system to saturate the
>   Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
>   medium-to-low CPU frequencies.
> - In contrast, standard small folios cap performance at 2 GB/s.
So you're interested in optimizing the I/O speeds. And apparently, on your hardware, the UFS controller has limits on scatter-gather entries
--- UFS seems to call these Physical Region Description (PRD) table entries. Per Gemini:
1. PRD Segment & Length Limits
Maximum PRD Entries: Hardware limits typically cap the number
of PRD entries (or segments) to 255 or 256 per transfer
request.
Maximum Transfer Length: Each individual PRD entry typically
allows a maximum transfer size of (65,535 bytes) per segment.
2. Host Controller Hardware Limits (UFSHCI)
Transfer Queue Depth: A UFS controller supports a predefined
number of outstanding task request entries. This is often
hard-capped at 32 concurrent transfer requests (slots) by the
doorbell register array.
Descriptor Pre-fetch: Some UFS host controllers are
pre-configured to pre-fetch multiple PRD entries sequentially
before requiring main memory reads.
Is this an accurate description of the limits that you are trying to
work with? How much data are you trying to read? Looking at Gemma 4
models, E2B is about 10GB or 3GB for the 4-bit quantized version. E4B
is 15GB, or 5GB for the 4-bit quantized version. Is that about right?
It seems... surprising that the additional I/O operations are actually
throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
into why this is happening, and whether there is anything that can be
optimized below the file system?
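
A rough sketch of the arithmetic behind that question, in Python. It assumes each folio is physically contiguous and that the PRD limits quoted above (256 entries per request, 65,535 bytes per entry) are accurate, which is unverified. The 3 GiB payload stands in for the 4-bit quantized model size mentioned above. Block-layer limits and request merging are ignored, so the numbers only indicate scale.

```python
# Back-of-the-envelope: how much data fits in one UFS transfer request
# when folios of a given size are mapped onto PRD (scatter-gather) entries.
# Both limits are the unverified figures quoted above, not spec values.
MAX_PRD_ENTRIES = 256        # assumed per-request PRD entry limit
MAX_SEGMENT_BYTES = 65_535   # assumed per-entry byte limit


def max_request_bytes(folio_bytes: int) -> int:
    """Largest single request when every folio needs its own PRD entries."""
    # A folio larger than one segment must be split across several entries.
    entries_per_folio = -(-folio_bytes // MAX_SEGMENT_BYTES)  # ceiling division
    folios_per_request = MAX_PRD_ENTRIES // entries_per_folio
    return folios_per_request * folio_bytes


def requests_needed(total_bytes: int, folio_bytes: int) -> int:
    """Number of transfer requests needed to read total_bytes sequentially."""
    return -(-total_bytes // max_request_bytes(folio_bytes))  # ceiling division


if __name__ == "__main__":
    model_bytes = 3 * 2**30  # 3 GiB, an assumed stand-in for the model size
    for folio in (4096, 32768, 65536):
        print(folio, max_request_bytes(folio), requests_needed(model_bytes, folio))
```

Under these assumptions, 4 KiB folios cap a request at 1 MiB, while 32 KiB folios allow 8 MiB, so the same read takes one eighth as many requests. Whether that request count alone explains a 2x bandwidth gap is exactly what would need to be measured.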
> Problem Statement
> -----------------
> High-order pages become heavily fragmented and scarce shortly after
> device boot. We cannot afford to deplete these limited resources on
> default filesystem operations using large folios. Instead, we need a
> mechanism to strictly prioritize and reserve high-order allocations
> for specific, critical payloads—specifically, large AI model files.
There's a fundamental assumption here, which is that the only use of
high order pages is the page cache. This doesn't take into account
anonymous pages used by programs that aren't backed by files. Nor
does it take into account kernel memory allocations.

But that being said, you seem to be assuming that you can reduce the
pressure on high order pages by only using large folios for these AI
model files. But the problem with using small folios is that if you
want to actually *use* the memory, unless you want to segment out the
memory so it can't be used for anything other than the AI models
(e.g., by using something like hugetlbfs), it's just going to break up
the memory into smaller folios.

So that's not actually going to *help* in real life use cases. It
might help for your artificial benchmarks / experiments, but in the
real life case where Android applications are running and fragmenting
all of the device memory, the large folios won't be available *anyway*.
> Q: Why is deregistering the inode number linked to inode deletion?
> A: We need the high-order allocation hint to persist even if the
>    inode is temporarily evicted from the VFS cache. To achieve this,
>    we maintain a tracking list of hinted inode numbers. When a file
>    is permanently deleted, its hint becomes obsolete, requiring us to
>    deregister it from the list to prevent memory leaks or identifier
>    reuse conflicts.
Assuming that the high-order allocation hint is a good thing, why not
just make it persistent? E.g., use a *real* extended attribute (which
is more wasteful of space), or grab a flag in the on-disk f2fs inode.
Then you don't need to have an in-memory list of hinted inodes;
instead, you can just have the Android package manager set that flag
indicating that you want that special treatment.

This is all assuming that we need an explicit hint, though....
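
To make the persistent-hint idea concrete, here is a minimal userspace sketch in Python of what a package manager could do with an ordinary extended attribute. The attribute name `user.large_folio_hint` and both helper functions are hypothetical and invented for illustration; f2fs defines no such attribute, and the kernel side that would read it is not shown.

```python
import os

# Hypothetical attribute name, chosen for illustration only.
# Neither f2fs nor the VFS interprets it today.
HINT_XATTR = "user.large_folio_hint"


def set_large_folio_hint(path: str) -> bool:
    """Tag a file with a persistent hint stored alongside its inode.

    Returns False when the hint cannot be stored, for example because the
    filesystem lacks user xattr support or the platform lacks os.setxattr.
    """
    try:
        os.setxattr(path, HINT_XATTR, b"1")
        return True
    except (OSError, AttributeError):
        return False


def has_large_folio_hint(path: str) -> bool:
    """Report whether the file carries the hint."""
    try:
        return os.getxattr(path, HINT_XATTR) == b"1"
    except (OSError, AttributeError):
        return False
```

Because the attribute is stored with the inode, it survives eviction from the inode cache and goes away when the file is deleted, so no separate list of inode numbers has to be kept in sync.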
> Massive AI model loading is a long-term architectural paradigm.
> Providing a targeted VFS/filesystem hint to optimize read bandwidth
> for specific large datasets is a highly practical, repeatable pattern
> that addresses a systemic bottleneck in embedded AI deployments.
It's really too bad you didn't propose this as an LSF/MM topic and
present it at a session at Zagreb two weeks ago. That would have been
a much more upstream-friendly way of collaborating, and it might have
allowed the mm experts to give you some more dynamic, real-time
feedback.

Cheers,

					- Ted