Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
From: Christoph Hellwig <hch@lst.de>
Date: 2019-06-28 13:38:43
Also in:
linux-nvme, linux-pci, linux-rdma, lkml
On Thu, Jun 27, 2019 at 12:00:35PM -0600, Logan Gunthorpe wrote:
quoted
It is not. (c) is fundamentally very different as it is not actually an operation that ever goes out to the wire at all, and which is why the actual physical address on the wire does not matter at all. Some interfaces like NVMe have designed it in a way that it the commands used to do this internal transfer look like (b2), but that is just their (IMHO very questionable) interface design choice, that produces a whole chain of problems.quoted
From the mapping device's driver's perspective yes, but from theperspective of a submitting driver they would be the same.
With your dma_addr_t scheme it won't be the same, as you'd need a magic way to generate the internal addressing and stuff it into the dma_addr_t. With a phys_addr_t based scheme they should basically be all the same.
Yes, you did suggest them. But what I'm trying to suggest is we don't
*necessarily* need the lookup. For demonstration purposes only, a
submitting driver could very roughly potentially do:
struct bio_vec vec;
dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
if (dist < 0) {
/* use regular memory */
vec.bv_addr = virt_to_phys(kmalloc(...));
vec.bv_flags = 0;
} else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
vec.bv_flags = BVEC_MAP_RESOURCE;
} else {
vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
vec.bv_flags = BVEC_MAP_BUS_ADDR;
}That doesn't look too bad, except..
-- And a mapping driver would roughly just do:
dma_addr_t dma_addr;
if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off)) {
/* case (c) */
/* program the DMA engine with bar and off */Why bother with that here if we could also let the caller handle that? pci_p2pdma_dist() should be able to trivially find that out based on provider_dev == mapping_dev.
The real difficulty here is that you'd really want all the above handled by a dma_map_bvec() so it can combine every vector hitting the IOMMU into a single continuous IOVA -- but it's hard to fit case (c) into that equation. So it might be that a dma_map_bvec() handles cases (a), (b1) and (b2) and the mapping driver has to then check each resulting DMA vector for pci_bus_addr_in_bar() while it is programming the DMA engine to deal with case (c).
I'd do it the other way around. pci_p2pdma_dist is used to find
the p2p type. The p2p type is stuff into the bio_vec, and we then:
(1) manually check for case (c) in driver for drivers that want to
treat it different from (b)
(2) we then have a dma mapping wrapper that checks the p2p type
and does the right thing for the rest.