Re: [PATCHv6 07/37] filemap: allocate huge page in page_cache_read(), if allowed
From: Kirill A. Shutemov <hidden>
Date: 2017-02-13 15:17:51
Also in:
linux-block, linux-fsdevel, linux-mm, lkml
On Thu, Feb 09, 2017 at 01:18:35PM -0800, Matthew Wilcox wrote:
On Thu, Jan 26, 2017 at 02:57:49PM +0300, Kirill A. Shutemov wrote:quoted
Later we can add logic to accumulate information from shadow entires to return to caller (average eviction time?).I would say minimum rather than average. That will become the refault time of the entire page, so minimum would probably have us making better decisions?
Yes, makes sense.
quoted
+ /* Wipe shadow entires */ + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, + page->index) { + if (iter.index >= page->index + hpage_nr_pages(page)) + break; p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); - if (!radix_tree_exceptional_entry(p)) + if (!p) + continue;Just FYI, this can't happen. You're holding the tree lock so nobody else gets to remove things from the tree. radix_tree_for_each_slot() only gives you the full slots; it skips the empty ones for you. I'm OK if you want to leave it in out of an abundance of caution.
I'll drop it.
quoted
+ __radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL, + workingset_update_node, mapping);I may add an update_node argument to radix_tree_join at some point, so you can use it here. Or maybe we don't need to do that, and what you have here works just fine.quoted
mapping->nrexceptional--;... because adjusting the exceptional count is going to be a pain.
Yeah..
quoted
+ error = __radix_tree_insert(&mapping->page_tree, + page->index, compound_order(page), page); + /* This shouldn't happen */ + if (WARN_ON_ONCE(error)) + return error;A lesser man would have just ignored the return value from __radix_tree_insert. I salute you.quoted
@@ -2078,18 +2155,34 @@ static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask) { struct address_space *mapping = file->f_mapping; struct page *page; + pgoff_t hoffset; int ret; do { - page = __page_cache_alloc(gfp_mask|__GFP_COLD); + page = page_cache_alloc_huge(mapping, offset, gfp_mask); +no_huge: + if (!page) + page = __page_cache_alloc(gfp_mask|__GFP_COLD); if (!page) return -ENOMEM; - ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL); - if (ret == 0) + if (PageTransHuge(page)) + hoffset = round_down(offset, HPAGE_PMD_NR); + else + hoffset = offset; + + ret = add_to_page_cache_lru(page, mapping, hoffset, + gfp_mask & GFP_KERNEL); + + if (ret == -EEXIST && PageTransHuge(page)) { + put_page(page); + page = NULL; + goto no_huge; + } else if (ret == 0) { ret = mapping->a_ops->readpage(file, page); - else if (ret == -EEXIST) + } else if (ret == -EEXIST) { ret = 0; /* losing race to add is OK */ + } put_page(page);If the filesystem returns AOP_TRUNCATED_PAGE, you'll go round this loop again trying the huge page again, even if the huge page didn't work the first time. I would tend to think that if the huge page failed the first time, we shouldn't try it again, so I propose this:
AOP_TRUNCATED_PAGE is positive, so I don't see how you avoid try_huge on second iteration. Hm?
struct address_space *mapping = file->f_mapping;
struct page *page;
pgoff_t index;
int ret;
bool try_huge = true;
do {
if (try_huge) {
page = page_cache_alloc_huge(gfp_mask|__GFP_COLD);
if (page)
index = round_down(offset, HPAGE_PMD_NR);
else
try_huge = false;
}
if (!try_huge) {
page = __page_cache_alloc(gfp_mask|__GFP_COLD);
index = offset;
}
if (!page)
return -ENOMEM;
ret = add_to_page_cache_lru(page, mapping, index,
gfp_mask & GFP_KERNEL);
if (ret < 0) {
if (try_huge) {
try_huge = false;
ret = AOP_TRUNCATED_PAGE;
} else if (ret == -EEXIST)
ret = 0; /* losing race to add is OK */
} else {
ret = mapping->a_ops->readpage(file, page);
}
put_page(page);
} while (ret == AOP_TRUNCATED_PAGE);
But ... maybe it's OK to retry the huge page. I mean, not many
filesystems return AOP_TRUNCATED_PAGE, and they only do so rarely.
What about this:
struct address_space *mapping = file->f_mapping;
struct page *page;
pgoff_t hoffset;
int ret;
bool try_huge = true;
do {
if (try_huge) {
page = page_cache_alloc_huge(mapping, offset, gfp_mask);
hoffset = round_down(offset, HPAGE_PMD_NR);
/* Try to allocate huge page once */
try_huge = false;
}
if (!page) {
page = __page_cache_alloc(gfp_mask|__GFP_COLD);
hoffset = offset;
}
if (!page)
return -ENOMEM;
ret = add_to_page_cache_lru(page, mapping, hoffset,
gfp_mask & GFP_KERNEL);
if (ret == -EEXIST && PageTransHuge(page)) {
/* Retry with small page */
put_page(page);
page = NULL;
continue;
} else if (ret == 0) {
ret = mapping->a_ops->readpage(file, page);
} else if (ret == -EEXIST) {
ret = 0; /* losing race to add is OK */
}
put_page(page);
} while (ret == AOP_TRUNCATED_PAGE);
return ret;
Anyway, I'm fine with the patch going in as-is. I just wanted to type out my review notes. Reviewed-by: Matthew Wilcox <redacted>
Thanks! -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>