Re: [PATCH -mm v6 5/6] proc: add kpageidle file

From: Andres Lagar-Cavilla <hidden>
Date: 2015-07-10 19:11:01
Also in: cgroups, linux-mm, lkml

Hi Vladimir,

On Thu, Jul 9, 2015 at 6:19 AM, Vladimir Davydov [off-list ref] wrote:

Hi Andres,

On Wed, Jul 08, 2015 at 04:01:13PM -0700, Andres Lagar-Cavilla wrote:

quoted

On Fri, Jun 12, 2015 at 2:52 AM, Vladimir Davydov

[...]

quoted

@@ -275,6 +276,179 @@ static const struct file_operations proc_kpagecgroup_operations = {
 };
 #endif /* CONFIG_MEMCG */

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to page_referenced(), which is essential for
+ * idle page tracking. With such an indicator of user pages we can skip
+ * isolated pages, but since there are not usually many of them, it will hardly
+ * affect the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+       struct page *page;
+       struct zone *zone;
+
+       if (!pfn_valid(pfn))
+               return NULL;
+
+       page = pfn_to_page(pfn);
+       if (!page || !PageLRU(page))

Isolation can race in while you're processing the page, after these
checks. This is ok, but worth a small comment.

Agree, will add one.

quoted

+               return NULL;
+       if (!get_page_unless_zero(page))
+               return NULL;
+
+       zone = page_zone(page);
+       spin_lock_irq(&zone->lru_lock);
+       if (unlikely(!PageLRU(page))) {
+               put_page(page);
+               page = NULL;
+       }
+       spin_unlock_irq(&zone->lru_lock);
+       return page;
+}
+
+/*
+ * This function calls page_referenced() to clear the referenced bit for all
+ * mappings to a page. Since the latter also clears the page idle flag if the
+ * page was referenced, it can be used to update the idle flag of a page.
+ */
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+       unsigned long dummy;
+
+       if (page_referenced(page, 0, NULL, &dummy, NULL))

Because of pte/pmd_clear_flush_young* called in the guts of
page_referenced_one, an N byte write or read to /proc/kpageidle will
cause N * 64 TLB flushes.

Additionally, because of the _notify connection to mmu notifiers, this
will also cause N * 64 EPT TLB flushes (in the KVM Intel case, similar
for other notifier flavors, you get the point).

The solution is relatively straightforward: augment
page_referenced_one with a mode marker or boolean that determines
whether tlb flushing is required.

Frankly, I don't think that tlb flushes are such a big deal in the scope
of this feature, because one is not supposed to write to kpageidle that
often. However, I agree we'd better avoid overhead we can easily avoid,
so I'll add a new flag to differentiate between kpageidle and reclaim
path in page_referenced, as you suggested.

Yes, it's a performance optimization, but fairly critical. Once you
open up a user-space interface, it will take off. What prevents people
from writing a daemon that will scan the entire host space every N
seconds (N=10? 60? 90? 120?). That means tens or hundreds of millions
of individual TLB flushes, which will hurt performance.

The KVM issue is not minor.

quoted

For an access pattern tracker such as the one you propose, flushing is
not strictly necessary: the next context switch will take care. Too
bad if you missed a few accesses because the pte/pmd was loaded in the
TLB. Not so easy for MMU notifiers, because each secondary MMU has its
own semantics. You could arguably throw the towel in there, or try to
provide a framework (i.e. propagate the flushing flag) and let each
implementation fill the gaps.

quoted

+               /*
+                * We cleared the referenced bit in a mapping to this page. To
+                * avoid interference with the reclaimer, mark it young so that
+                * the next call to page_referenced() will also return > 0 (see
+                * page_referenced_one())
+                */
+               set_page_young(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+                             size_t count, loff_t *ppos)
+{
+       u64 __user *out = (u64 __user *)buf;
+       struct page *page;
+       unsigned long pfn, end_pfn;
+       ssize_t ret = 0;
+       u64 idle_bitmap = 0;
+       int bit;
+
+       if (*ppos & KPMMASK || count & KPMMASK)
+               return -EINVAL;
+
+       pfn = *ppos * BITS_PER_BYTE;
+       if (pfn >= max_pfn)
+               return 0;
+
+       end_pfn = pfn + count * BITS_PER_BYTE;
+       if (end_pfn > max_pfn)
+               end_pfn = ALIGN(max_pfn, KPMBITS);
+
+       for (; pfn < end_pfn; pfn++) {
+               bit = pfn % KPMBITS;
+               page = kpageidle_get_page(pfn);
+               if (page) {
+                       if (page_is_idle(page)) {
+                               /*
+                                * The page might have been referenced via a
+                                * pte, in which case it is not idle. Clear
+                                * refs and recheck.
+                                */
+                               kpageidle_clear_pte_refs(page);
+                               if (page_is_idle(page))
+                                       idle_bitmap |= 1ULL << bit;
+                       }
+                       put_page(page);
+               }
+               if (bit == KPMBITS - 1) {
+                       if (put_user(idle_bitmap, out)) {
+                               ret = -EFAULT;
+                               break;
+                       }
+                       idle_bitmap = 0;
+                       out++;
+               }
+       }
+
+       *ppos += (char __user *)out - buf;
+       if (!ret)
+               ret = (char __user *)out - buf;
+       return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,

Your reasoning for a host wide /proc/kpageidle is well argued, but I'm
still hesitant.

mincore() shows how to (relatively simply) resolve unmapped file pages
to their backing page cache destination. You could recycle that code
and then you'd have per process idle/idling interfaces. With the
advantage of a clear TLB flush demarcation.

Hmm, I still don't see how we could handle page cache that does not
belong to any process in the scope of sys_mincore.

You're correct.

I wasn't asking to use mincore, just pointing out an extant code
pattern that could get you beyond the concerns re unmapping (and which
can be implemented as a proc file).

My view is that the key pieces of infrastructure (the flags, the
interactions with page_referenced_one, clear_refs, mark_page_accessed)
your patchset brings along then can be reused in many ways. Michel
Lespinasse's kernel thread can reuse them, or proc/smaps can be
augmented (or a new proc entry), to get per process idle maps.

So /proc/kpageidle is fine with me, but not crazy appealing.

Besides, it'd be awkward to reuse sys_mincore for idle page tracking,
because we need two operations, set idle and check idle, while the
sys_mincore semantic implies only getting information from the kernel,
not vice versa.

Of course, we could introduce a separate syscall, say sys_idlecore, but
IMO it is not a good idea to add a syscall for such a specific feature,
which can be compiled out. I think a proc file suits better for the
purpose, especially counting that we have a bunch of similar files
(pagemap, kpageflags, kpagecount).

Anyway, I'm open for suggestions. If you have a different user API
design in mind, which in your opinion would fit better, please share.

quoted

+                              size_t count, loff_t *ppos)
+{
+       const u64 __user *in = (const u64 __user *)buf;
+       struct page *page;
+       unsigned long pfn, end_pfn;
+       ssize_t ret = 0;
+       u64 idle_bitmap = 0;
+       int bit;
+
+       if (*ppos & KPMMASK || count & KPMMASK)
+               return -EINVAL;
+
+       pfn = *ppos * BITS_PER_BYTE;
+       if (pfn >= max_pfn)
+               return -ENXIO;
+
+       end_pfn = pfn + count * BITS_PER_BYTE;
+       if (end_pfn > max_pfn)
+               end_pfn = ALIGN(max_pfn, KPMBITS);
+
+       for (; pfn < end_pfn; pfn++) {

Relatively straight forward to teleport forward 512 (or more
correctly: 1 << compound_order(page)) pages for THP pages, once done
with a THP head, and avoid 511 fruitless trips down rmap.c for each
tail.

Right, will fix.

quoted

+               bit = pfn % KPMBITS;
+               if (bit == 0) {
+                       if (get_user(idle_bitmap, in)) {
+                               ret = -EFAULT;
+                               break;
+                       }
+                       in++;
+               }
+               if (idle_bitmap >> bit & 1) {
+                       page = kpageidle_get_page(pfn);
+                       if (page) {
+                               kpageidle_clear_pte_refs(page);
+                               set_page_idle(page);

In the common case this will make a page both young and idle. This is
fine. We will come back to it below.

quoted

+                               put_page(page);
+                       }
+               }
+       }
+
+       *ppos += (const char __user *)in - buf;
+       if (!ret)
+               ret = (const char __user *)in - buf;
+       return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+       .llseek = mem_lseek,
+       .read = kpageidle_read,
+       .write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+       return true;
+}
+struct page_ext_operations page_idle_ops = {
+       .need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 static int __init proc_page_init(void)
 {
        proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);

[...]

quoted

@@ -798,6 +798,14 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
                pte_unmap_unlock(pte, ptl);

This is not in your patch, but further up in page_referenced_one there
is the pmd case.

So what happens on THP split? That was a leading question: you should
propagate the young and idle flags to the split-up tail pages.

Good catch! I completely forgot about THP slit. Will fix in the next
iteration.

quoted

        }

+       if (referenced && page_is_idle(page))
+               clear_page_idle(page);

Is it so expensive to just call clear without the test .. ?

This function is normally called from a relatively cold path - memory
reclaim, where we modify page->flags anyway, so I think it won't make
any difference if we drop this check.

quoted

+
+       if (page_is_young(page)) {
+               clear_page_young(page);

referenced += test_and_clear_page_young(page) .. ?

Yeah, that does look better.

quoted

+               referenced++;
+       }
+

Invert the order. A page can be both young and idle -- we noted that
closer to the top of the patch.

So young bumps referenced up, and then the final referenced value is
used to clear idle.

I don't think it'd work. Look, kpageidle_write clears pte references and
sets the idle flag. If the page was referenced it also sets the young
flag in order not to interfere with the reclaimer. When kpageidle_read
is called afterwards, it must see the idle flag set iff the page has not
been referenced since kpageidle_write set it. However, if
page_referenced was not called on the page from the reclaim path, it
will still be young no matter if it has been referenced or not and
therefore will always be identified as not idle, which is incorrect.

You're right. Thanks!
Andres

quoted

        if (referenced) {

At this point, if you follow my suggestion of augmenting
page_referenced_one with a mode indicator (for TLB flushing), you can
set page young here. There is the added benefit of holding the
mmap_mutex lock or vma_lock, which prevents reclaim, try_to_unmap,
migration, from exploiting a small window where page young is not set
but should.

Yeah, if we go with the page_referenced mode switcher you suggested
above, it's definitely worth moving set_page_young here.

Thank you for the review!

Vladimir

quoted

                pra->referenced++;
                pra->vm_flags |= vma->vm_flags;

diff --git a/mm/swap.c b/mm/swap.c
index ab7c338eda87..db43c9b4891d 100644
--- a/mm/swap.c
+++ b/mm/swap.c

@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
        } else if (!PageReferenced(page)) {
                SetPageReferenced(page);
        }
+       if (page_is_idle(page))
+               clear_page_idle(page);
 }
 EXPORT_SYMBOL(mark_page_accessed);



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help