Thread (35 messages) 35 messages, 5 authors, 2020-10-21

Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)

From: Dan Williams <hidden>
Date: 2020-10-16 22:12:01
Also in: linux-nvme, lkml, nvdimm

On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
(nmeeramohide) [off-list ref] wrote:
On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig [off-list ref] wrote:
quoted
I don't think this belongs into the kernel.  It is a classic case for
infrastructure that should be built in userspace.  If anything is
missing to implement it in userspace with equivalent performance we
need to improve out interfaces, although io_uring should cover pretty
much everything you need.
Hi Christoph,

We previously considered moving the mpool object store code to user-space.
However, by implementing mpool as a device driver, we get several benefits
in terms of scalability, performance, and functionality. In doing so, we relied
only on standard interfaces and did not make any changes to the kernel.

(1)  mpool's "mcache map" facility allows us to memory-map (and later unmap)
a collection of logically related objects with a single system call. The objects in
such a collection are created at different times, physically disparate, and may
even reside on different media class volumes.

For our HSE storage engine application, there are commonly 10's to 100's of
objects in a given mcache map, and 75,000 total objects mapped at a given time.

Compared to memory-mapping objects individually, the mcache map facility
scales well because it requires only a single system call and single vm_area_struct
to memory-map a complete collection of objects.
Why can't that be a batch of mmap calls on io_uring?
(2) The mcache map reaper mechanism proactively evicts object data from the page
cache based on object-level metrics. This provides significant performance benefit
for many workloads.

For example, we ran YCSB workloads B (95/5 read/write mix)  and C (100% read)
against our HSE storage engine using the mpool driver in a 5.9 kernel.
For each workload, we ran with the reaper turned-on and turned-off.

For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
latency for reads by 39% and updates by 99%. For workload C, the reaper increased
throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
improvements are even more dramatic with earlier kernels.
What metrics proved useful and can the vanilla page cache / page
reclaim mechanism be augmented with those metrics?
(3) The mcache map facility can memory-map objects on NVMe ZNS drives that were
created using the Zone Append command. This patch set does not support ZNS, but
that work is in progress and we will be demonstrating our HSE storage engine
running on mpool with ZNS drives at FMS 2020.

(4) mpool's immutable object model allows the driver to support concurrent reading
of object data directly and memory-mapped without a performance penalty to verify
coherence. This allows background operations, such as LSM-tree compaction, to
operate efficiently and without polluting the page cache.
How is this different than existing background operations / defrag
that filesystems perform today? Where are the opportunities to improve
those operations?
(5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
convenient mechanism for controlling access to and managing the multiple storage
volumes, and in the future pmem devices, that may comprise an logical mpool.
Christoph and I have talked about replacing the pmem driver's
dependence on device-mapper for pooling. What extensions would be
needed for the existing driver arch?
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help