Thread (67 messages) 67 messages, 4 authors, 2017-02-13

Re: [PATCHv6 07/37] filemap: allocate huge page in page_cache_read(), if allowed

From: Kirill A. Shutemov <hidden>
Date: 2017-02-13 15:17:51
Also in: linux-block, linux-fsdevel, linux-mm, lkml

On Thu, Feb 09, 2017 at 01:18:35PM -0800, Matthew Wilcox wrote:
On Thu, Jan 26, 2017 at 02:57:49PM +0300, Kirill A. Shutemov wrote:
quoted
Later we can add logic to accumulate information from shadow entires to
return to caller (average eviction time?).
I would say minimum rather than average.  That will become the refault
time of the entire page, so minimum would probably have us making better
decisions?
Yes, makes sense.
quoted
+	/* Wipe shadow entires */
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+			page->index) {
+		if (iter.index >= page->index + hpage_nr_pages(page))
+			break;
 
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
-		if (!radix_tree_exceptional_entry(p))
+		if (!p)
+			continue;
Just FYI, this can't happen.  You're holding the tree lock so nobody
else gets to remove things from the tree.  radix_tree_for_each_slot()
only gives you the full slots; it skips the empty ones for you.  I'm
OK if you want to leave it in out of an abundance of caution.
I'll drop it.
quoted
+		__radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL,
+				workingset_update_node, mapping);
I may add an update_node argument to radix_tree_join at some point,
so you can use it here.  Or maybe we don't need to do that, and what
you have here works just fine.
quoted
 		mapping->nrexceptional--;
... because adjusting the exceptional count is going to be a pain.
Yeah..
quoted
+	error = __radix_tree_insert(&mapping->page_tree,
+			page->index, compound_order(page), page);
+	/* This shouldn't happen */
+	if (WARN_ON_ONCE(error))
+		return error;
A lesser man would have just ignored the return value from
__radix_tree_insert.  I salute you.
quoted
@@ -2078,18 +2155,34 @@ static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page;
+	pgoff_t hoffset;
 	int ret;
 
 	do {
-		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
+		page = page_cache_alloc_huge(mapping, offset, gfp_mask);
+no_huge:
+		if (!page)
+			page = __page_cache_alloc(gfp_mask|__GFP_COLD);
 		if (!page)
 			return -ENOMEM;
 
-		ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL);
-		if (ret == 0)
+		if (PageTransHuge(page))
+			hoffset = round_down(offset, HPAGE_PMD_NR);
+		else
+			hoffset = offset;
+
+		ret = add_to_page_cache_lru(page, mapping, hoffset,
+				gfp_mask & GFP_KERNEL);
+
+		if (ret == -EEXIST && PageTransHuge(page)) {
+			put_page(page);
+			page = NULL;
+			goto no_huge;
+		} else if (ret == 0) {
 			ret = mapping->a_ops->readpage(file, page);
-		else if (ret == -EEXIST)
+		} else if (ret == -EEXIST) {
 			ret = 0; /* losing race to add is OK */
+		}
 
 		put_page(page);
If the filesystem returns AOP_TRUNCATED_PAGE, you'll go round this loop
again trying the huge page again, even if the huge page didn't work
the first time.  I would tend to think that if the huge page failed the
first time, we shouldn't try it again, so I propose this:
AOP_TRUNCATED_PAGE is positive, so I don't see how you avoid try_huge on
second iteration. Hm?
        struct address_space *mapping = file->f_mapping;
        struct page *page;
        pgoff_t index;
        int ret;
        bool try_huge = true;

        do {
                if (try_huge) {
                        page = page_cache_alloc_huge(gfp_mask|__GFP_COLD);
                        if (page)
                                index = round_down(offset, HPAGE_PMD_NR);
                        else
                                try_huge = false;
                }

                if (!try_huge) {
                        page = __page_cache_alloc(gfp_mask|__GFP_COLD);
                        index = offset;
                }

                if (!page)
                        return -ENOMEM;

                ret = add_to_page_cache_lru(page, mapping, index,
                                                        gfp_mask & GFP_KERNEL);
                if (ret < 0) {
                        if (try_huge) {
                                try_huge = false;
                                ret = AOP_TRUNCATED_PAGE;
                        } else if (ret == -EEXIST)
                                ret = 0; /* losing race to add is OK */
                } else {
                        ret = mapping->a_ops->readpage(file, page);
                }

                put_page(page);
        } while (ret == AOP_TRUNCATED_PAGE);

But ... maybe it's OK to retry the huge page.  I mean, not many
filesystems return AOP_TRUNCATED_PAGE, and they only do so rarely.
What about this:

	struct address_space *mapping = file->f_mapping;
	struct page *page;
	pgoff_t hoffset;
	int ret;
	bool try_huge = true;

	do {
		if (try_huge) {
			page = page_cache_alloc_huge(mapping, offset, gfp_mask);
			hoffset = round_down(offset, HPAGE_PMD_NR);

			/* Try to allocate huge page once */
			try_huge = false;
		}

		if (!page) {
			page = __page_cache_alloc(gfp_mask|__GFP_COLD);
			hoffset = offset;
		}

		if (!page)
			return -ENOMEM;

		ret = add_to_page_cache_lru(page, mapping, hoffset,
				gfp_mask & GFP_KERNEL);

		if (ret == -EEXIST && PageTransHuge(page)) {
			/* Retry with small page */
			put_page(page);
			page = NULL;
			continue;
		} else if (ret == 0) {
			ret = mapping->a_ops->readpage(file, page);
		} else if (ret == -EEXIST) {
			ret = 0; /* losing race to add is OK */
		}

		put_page(page);
	} while (ret == AOP_TRUNCATED_PAGE);

	return ret;

Anyway, I'm fine with the patch going in as-is.  I just wanted to type out
my review notes.

Reviewed-by: Matthew Wilcox <redacted>
Thanks!

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help