Re: [RFC]: map 4K iommu pages even on 64K largepage systems.
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: 2006-10-24 02:22:38
On Mon, 2006-10-23 at 19:25 -0500, Linas Vepstas wrote:
Subject: [RFC]: map 4K iommu pages even on 64K largepage systems.

The 10Gigabit ethernet device drivers appear to be able to chew up all 256MB of TCE mappings on pSeries systems, as evidenced by numerous error messages:

  iommu_alloc failed, tbl c0000000010d5c48 vaddr c0000000d875eff0 npages 1

Some experimentation indicates that this is essentially because one 1500 byte ethernet MTU gets mapped as a 64K DMA region when the large 64K pages are enabled. Thus, it doesn't take much to exhaust all of the available DMA mappings for a high-speed card.
There is much to be said about using a 1500 MTU and no TSO on a 10G link :) But apart from that, I agree, we have a problem.
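The arithmetic behind the exhaustion can be sketched as follows. This is only a back-of-the-envelope illustration, assuming every mapping consumes exactly one IOMMU page; `max_mappings` is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Upper bound on the number of simultaneous single-page DMA mappings
 * that fit in a DMA window, assuming each mapping uses exactly one
 * IOMMU page. Hypothetical helper for illustration only.
 */
uint64_t max_mappings(uint64_t window_bytes, uint64_t iommu_page_bytes)
{
    return window_bytes / iommu_page_bytes;
}
```

With a 256MB window, 64K IOMMU pages allow at most 4096 such mappings, while 4K IOMMU pages allow 65536, so a driver that maps one page per 1500 byte buffer runs out sixteen times sooner in the 64K case.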
This patch changes the iommu allocator to work with its own unique, distinct page size. Although the patch is long, it's actually quite simple: it just #defines a distinct IOMMU_PAGE_SIZE and then uses this in all the places that matter. The patch boots on pseries, untested in other places. Haven't yet thought if this is a good long-term solution or not, whether this kind of thing is desirable or not. That's why it's an RFC. Comments?
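A rough sketch of the idea, not taken from the patch itself: the IOMMU gets its own page shift, independent of the CPU page size, and mapping lengths are rounded to that size. All `EX_`/`ex_` names and the chosen values (64K CPU pages, 4K IOMMU pages) are assumptions for illustration.

```c
#include <stdint.h>

/* Assumed for illustration: the CPU uses 64K pages, the IOMMU keeps 4K. */
#define EX_PAGE_SHIFT        16
#define EX_IOMMU_PAGE_SHIFT  12
#define EX_IOMMU_PAGE_SIZE   ((uint64_t)1 << EX_IOMMU_PAGE_SHIFT)
#define EX_IOMMU_PAGE_MASK   (~(EX_IOMMU_PAGE_SIZE - 1))

/*
 * Number of IOMMU pages needed to cover the byte range
 * [vaddr, vaddr + len). Hypothetical helper, not the kernel's code.
 */
uint64_t ex_iommu_num_pages(uint64_t vaddr, uint64_t len)
{
    /* Round the start down and the end up to IOMMU page boundaries. */
    uint64_t start = vaddr & EX_IOMMU_PAGE_MASK;
    uint64_t end = (vaddr + len + EX_IOMMU_PAGE_SIZE - 1) & EX_IOMMU_PAGE_MASK;

    return (end - start) >> EX_IOMMU_PAGE_SHIFT;
}
```

Under these assumptions a page-aligned 1500 byte buffer costs one 4K entry, and a full 64K CPU page costs sixteen entries instead of one.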
It's probably a good enough solution for RHEL, but we should do something different long term. There are a few things I have in mind:

 - We could have a page size field in the iommu_table and have the iommu allocator use that. Thus we can have a per iommu table instance page size. That would allow Geoff to deal with his crazy hypervisor by basically having one iommu table instance per device. It would also allow us to keep using large iommu page sizes on platforms where the system gives us more than a pinhole for iommu space :)

 - In the long run, I'm thinking about the interest in supporting two page sizes for the fine and coarse allocation regions of the table. We would need to get a bit more info from the HW backend to do that, but for example, on native Cell, we can have a page size per 256MB region, thus we could have the iommu space divided into 4k pages for small mappings and 64k pages for full page or more mappings.

So I reckon we should first audit and make sure your current patch works fine on everything as a crash-fix for 2.6.19 and backportable to RHEL. Then, we can implement my first option for 2.6.20 and possibly debate about the interest of my second option, unless somebody else comes up with better ideas of course :)

Ben.
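The first option (a page size carried by each table instance) might look roughly like the sketch below. The struct is a trimmed-down stand-in, not the kernel's iommu_table, and the field names `it_window_bytes` and `it_page_shift` as well as both helpers are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified table descriptor with a per-table page size. */
struct ex_iommu_table {
    uint64_t it_window_bytes;    /* size of the DMA window in bytes */
    unsigned int it_page_shift;  /* log2 of this table's IOMMU page size */
};

/* Number of entries the table holds at its own page size. */
uint64_t ex_table_entries(const struct ex_iommu_table *tbl)
{
    return tbl->it_window_bytes >> tbl->it_page_shift;
}

/* IOMMU pages needed to cover [vaddr, vaddr + len) in this table. */
uint64_t ex_table_num_pages(const struct ex_iommu_table *tbl,
                            uint64_t vaddr, uint64_t len)
{
    uint64_t size = (uint64_t)1 << tbl->it_page_shift;
    uint64_t start = vaddr & ~(size - 1);
    uint64_t end = (vaddr + len + size - 1) & ~(size - 1);

    return (end - start) >> tbl->it_page_shift;
}
```

Because the allocator reads the shift from the table rather than from a global #define, a platform with a small window could use a shift of 12 while one with plenty of iommu space keeps 16, without rebuilding the kernel.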