Thread (42 messages), 7 authors, 2015-12-05

Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

From: Minchan Kim <hidden>
Date: 2015-11-05 01:48:50
Also in: linux-mm, lkml

On Wed, Nov 04, 2015 at 05:29:57PM -0800, Andy Lutomirski wrote:
On Wed, Nov 4, 2015 at 4:56 PM, Minchan Kim [off-list ref] wrote:
On Wed, Nov 04, 2015 at 04:42:37PM -0800, Andy Lutomirski wrote:
On Wed, Nov 4, 2015 at 4:13 PM, Minchan Kim [off-list ref] wrote:
On Tue, Nov 03, 2015 at 07:41:35PM -0800, Andy Lutomirski wrote:
On Nov 3, 2015 5:30 PM, "Minchan Kim" [off-list ref] wrote:
Linux has no way to free pages lazily, while other OSes already
support this through madvise(MADV_FREE).

The gain is clear: under memory pressure, the kernel can discard freed
pages rather than swapping them out or triggering OOM.

Without memory pressure, userspace can reuse the freed pages without
additional overhead (e.g., page fault + allocation + zeroing).
[...]
How it works:

When the madvise syscall is called, the VM clears the dirty bit in the ptes
of the range. If memory pressure happens, the VM checks the dirty bit in the
page table; if it is still "clean", the page is a "lazyfree" page, so the VM
can discard it instead of swapping it out.  If there was a store to the page
before the VM picks it for reclaim, the dirty bit is set, so the VM swaps the
page out instead of discarding it.
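The dirty-bit rule described above can be sketched as a small userspace toy
model in C. This is only an illustration, not kernel code: `toy_pte`,
`toy_madv_free`, `toy_store` and `toy_reclaim_decision` are hypothetical names,
and the model tracks a single page with nothing but a dirty flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one page's pte state; not a kernel structure. */
struct toy_pte {
    bool dirty;   /* models the dirty bit in the page table entry */
};

enum toy_reclaim { TOY_DISCARD, TOY_SWAP_OUT };

/* Models madvise(MADV_FREE): clear the dirty bit, keep the page mapped. */
static void toy_madv_free(struct toy_pte *pte)
{
    assert(pte != NULL);
    pte->dirty = false;
}

/* Models a store to the page: the dirty bit gets set again. */
static void toy_store(struct toy_pte *pte)
{
    assert(pte != NULL);
    pte->dirty = true;
}

/* Models reclaim under memory pressure: a still-clean page is discarded. */
static enum toy_reclaim toy_reclaim_decision(const struct toy_pte *pte)
{
    assert(pte != NULL);
    return pte->dirty ? TOY_SWAP_OUT : TOY_DISCARD;
}
```

In this model, a page that is written after `toy_madv_free()` goes back to
being swapped out, which matches the "store before reclaim" case in the
description.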
What happens if you MADV_FREE something that's MAP_SHARED or isn't
ordinary anonymous memory?  There's a long history of MADV_DONTNEED on
such mappings causing exploitable problems, and I think it would be
nice if MADV_FREE were obviously safe.
It filters out VM_LOCKED|VM_HUGETLB|VM_PFNMAP, and filters out file-backed
vmas and MAP_SHARED with vma_is_anonymous.
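A rough sketch of that filter as a standalone C toy, assuming it amounts to
"reject the listed flags, then require an anonymous vma". The `TOY_VM_*` values
and `struct toy_vma` are hypothetical placeholders, not the kernel's
definitions, and the `anonymous` field merely stands in for the result of
vma_is_anonymous().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only; the real VM_* constants are in kernel headers. */
#define TOY_VM_LOCKED  (1u << 0)
#define TOY_VM_HUGETLB (1u << 1)
#define TOY_VM_PFNMAP  (1u << 2)

/* Toy stand-in for a vma; not the kernel's struct vm_area_struct. */
struct toy_vma {
    unsigned int flags;
    bool anonymous;   /* stands in for the result of vma_is_anonymous() */
};

/* Returns true if the toy model would accept MADV_FREE on this vma. */
static bool toy_madv_free_allowed(const struct toy_vma *vma)
{
    assert(vma != NULL);
    if (vma->flags & (TOY_VM_LOCKED | TOY_VM_HUGETLB | TOY_VM_PFNMAP))
        return false;
    /* File-backed and shared mappings fail the anonymous check. */
    return vma->anonymous;
}
```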
quoted
Does this set the write protect bit?
No.
quoted
What happens on architectures without hardware dirty tracking?  For
that matter, even on architectures with hardware dirty tracking, what
happens in multithreaded processes that have the dirty TLB state
cached in a different CPU's TLB?

Using the dirty bit for these semantics scares me.  This API creates a
page that can have visible nonzero contents and then can
asynchronously and magically zero itself thereafter.  That makes me
nervous.  Could we use the accessed bit instead?  Then the observable
The accessed bit is used by the aging algorithm for reclaim. In addition,
we support the clear_refs feature.
IOW, it can be reset at any time, so it's hard to use it as a marker for
lazy freeing at the moment.
That's unfortunate.  I think that the ABI would be much nicer if it
used the accessed bit.

In any case, shouldn't the aging algorithm be irrelevant here?  A
MADV_FREE page that isn't accessed can be discarded, whereas we could
hopefully just say that a MADV_FREE page that is accessed gets moved
to whatever list holds recently accessed pages and also stops being a
candidate for discarding due to MADV_FREE?
I meant that if we use the accessed bit as the indicator for a lazy-freeing
page, we could discard a valid page that was never hinted by MADV_FREE but
simply has no accessed bit set in the page table because of the aging algorithm.
Oh, is the rule that the anonymous pages that are clean are discarded
instead of swapped out?  That is, does your patch set detect that an
A page that was swapped out and then swapped in has a clean pte, and the swap
device holds valid data if the page hasn't been touched, so the VM discards the
page rather than swapping it out. Of course, the pte should then point to the
swap slot. If the VM decides to remove the page from the swap slot, the page
should be marked PG_dirty.
anonymous page can be discarded if it's clean and that the lack of a
dirty bit is the only indication that the page has been hit with
MADV_FREE?
No dirty bit and, strictly speaking, no PG_dirty,
because the page I mentioned above has a clean pte but will have PG_dirty.
If so, that seems potentially error prone -- I had assumed that pages
that were swapped in but not written since swap-in would also be
clean, and I don't see how you distinguish them.
I hope the above answers that.
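
One way to read the distinction drawn in the replies above is the following C
toy model. It is a sketch under assumptions, not the patch's logic: the struct,
the `in_swap` field and all `toy_*` names are hypothetical, and the model
assumes MADV_FREE leaves both the pte dirty bit and PG_dirty clear with no
swap copy behind the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state for one anonymous page; not a kernel structure. */
struct toy_anon_page {
    bool pte_dirty;   /* dirty bit in the page table entry */
    bool pg_dirty;    /* models the PG_dirty page flag */
    bool in_swap;     /* a valid copy still sits in a swap slot */
};

enum toy_action {
    TOY_DISCARD_CONTENTS,     /* lazyfree page: contents are simply lost */
    TOY_DROP_KEEP_SWAP_COPY,  /* clean swap-backed page: data survives in swap */
    TOY_WRITE_TO_SWAP         /* dirty page: must be written out first */
};

/* Swap-in of an untouched page: clean pte, swap slot still valid. */
static void toy_swap_in(struct toy_anon_page *p)
{
    assert(p != NULL);
    p->pte_dirty = false;
    p->pg_dirty = false;
    p->in_swap = true;
}

/* Releasing the swap slot marks the page PG_dirty, as stated in the reply. */
static void toy_release_swap_slot(struct toy_anon_page *p)
{
    assert(p != NULL);
    p->in_swap = false;
    p->pg_dirty = true;
}

/* Assumed MADV_FREE effect in this toy: no dirty state, no swap copy. */
static void toy_madv_free(struct toy_anon_page *p)
{
    assert(p != NULL);
    p->pte_dirty = false;
    p->pg_dirty = false;
    p->in_swap = false;
}

/* Reclaim decision: only a page clean in both senses can be dropped. */
static enum toy_action toy_reclaim(const struct toy_anon_page *p)
{
    assert(p != NULL);
    if (p->pte_dirty || p->pg_dirty)
        return TOY_WRITE_TO_SWAP;
    return p->in_swap ? TOY_DROP_KEEP_SWAP_COPY : TOY_DISCARD_CONTENTS;
}
```

In this reading, a swapped-in page whose swap slot has been released is never
mistaken for a lazyfree page, because PG_dirty forces it to be written out.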
--Andy