Re: [RFC PATCH 0/7] evacuate struct page from the block layer

From: Boaz Harrosh <hidden>
Date: 2015-03-24 09:41:37
Also in: linux-arch, linux-fsdevel, lkml

On 03/23/2015 05:19 PM, Rik van Riel wrote:

quoted

Michael Tsirkin and I have been doing some thinking about what
it would take to allocate struct pages per 2MB area permanently,
and allocate additional struct pages for 4kB pages on demand,
when a 2MB area is broken up into 4kB pages.

My thoughts as well, this need *not* be a huge evasive change. Is however
a careful surgery in very core code. And lots of sleepless scary nights
and testing to make sure all the side effects are wrinkled out.

Even the above IS a huge invasive change, and I do not see it
as much better than the work Dan and Matthew are doing.

You lost me again. Sorry for my slowness. The code I envision is not
invasive at all. Nothing is touched at all, except a few core places
at the page level.

The contract with Kernel stays the same:
	page_to_pfn, pfn_to_page, page_address (which is kmap_atomic in 64bit)
	virt_to_page, page_get/put and so on...

So none of the Kernel code need change at all. You were saying that we
might have a 2M page and on demand we can allocate a 4k page shove it down the
stack, which does not change at all, and once back from io, the 4k pages can be
freed and recycled for reuse with other IO. This is what I thought you said.

This is doable, and not that much work and for the life of me I do not see any
"invasive". (Yes a few core headers that make everything compile ;-))

That said I do not even think we need that (2M split to 4k on demand) we can even
do better and make sure 2M pages just work as is. It is very possible today
(Tested) to push a 2M page into a bio and write to a bdev. Yes lots of side
code will break, but the core path is clean. Let us fix that then.

(Need I send code to show you how a 2M page is written with a single
 bvec?)

quoted

If we want copy-less, we need a common memory descriptor career. Today this
is page-struct. So for me your above statement means:
	"still not convinced I care about copy-less pmem"

Otherwise you either enhance what you have today or devise a new
system, which means change the all Kernel.

We do not necessarily need a common descriptor, as much as
one that abstracts out what is happening. Something like a
struct bio could be a good I/O descriptor, and releasing the
backing memory after IO completion could be a function of the
bio freeing function itself.

Lost me again sorry. What backing memory. struct bio is already
an I/O descriptor which gets freed after use. How is that relevant
to pfn vs page ?

quoted

Lastly: Why does pmem need to wait out-of-tree. Even you say above that
machines with lots of DRAM can enjoy the HUGE-to-4k split. So why
not let pmem waist 4k pages like everyone else and fix it as above
down the line, both for pmem and ram. And save both ways.
Why do we need to first change the all Kernel, then have pmem. Why not
use current infra structure, for good or for worth, and incrementally
do better.

There are two things going on here:

1) You want to keep using struct page for now, while there are
   subsystems that require it. This is perfectly legitimate.

2) Matthew and Dan are changing over some subsystems to no longer
   require struct page. This is perfectly legitimate.

How is this legitimate when you need to Interface the [1] subsystems
under the [2] subsystem? A subsystem that expects pages is now not
usable by [2].

Today *All* the Kernel subsystems are [1] Period. How does it become
legitimate to now start *two* competing, do the same differently, abstraction,
in our kernel. We have two much diversity not to little.

I do not understand why either of you would have to object to what
the other is doing. There is room to keep using struct page until
the rest of the kernel no longer requires it.

So this is your vision "until the rest of the kernel no longer requires
pages" Really? Sigh, coming from other Kernels I thought pages were
a breeze of fresh air. I thought it was very clever. And BTW good luck
with that.

BTW: you have not solved the basic problem yet. for one pfn_kmap() given a
pfn what is its virtual address. would you like to loop through the
Kernel's range tables to look for the registered ioremap ? its a long
annoying loop. The page was invented exactly for this reason, to go through
the section object. And actually it is not that easy because if it is an ioremap
pointer it is in one list and if a page it is another way, and on top of
all this, it is ARCH dependent. And you are trashing highmem, because the
state and locks of that are at the page level. Not that I care about highmem
but I hate double coding. For god sake what do you guys have with poor old
pages, they were invented to exacly do this, abstract away management of a single
pfn-to-virt.
All I see is complains about page being 4K well it need not be. page can be any
size, and hell it can be variable size. (And no we do not need to add an extra size
member, all we need is the one bit)

Cheers
Boaz

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help