Re: Folios give an 80% performance win
From: Matthew Wilcox <willy@infradead.org>
Date: 2021-07-24 18:50:29
Also in: linux-fsdevel, lkml
On Sat, Jul 24, 2021 at 11:23:25AM -0700, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for
> > > > PostgreSQL. I know they use the page cache extensively, so it's
> > > > plausibly real. I'm a bit surprised that it has such good
> > > > locality, and the size of the win far exceeds my expectations.
> > > > We should probably dive into it and figure out exactly what's
> > > > going on.
> > >
> > > Since none of the other tested databases showed more than a 3%
> > > improvement, this looks like an anomalous result specific to
> > > something in postgres ... although the next biggest db: mariadb
> > > wasn't part of the tests so I'm not sure that's definitive.
> > > Perhaps the next step should be to test mariadb? Since they're
> > > fairly similar in domain (both full SQL) if mariadb shows this
> > > type of improvement, you can safely assume it's something in the
> > > way SQL databases handle paging and if it doesn't, it's likely
> > > fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy
> > user of the page cache. My understanding is that most databases use
> > direct IO and manage their own page cache, while PostgreSQL trusts
> > the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine since the
> flush_method is settable.

We're still not communicating well. I'm not talking about writes, I'm
talking about reads. Postgres uses the page cache for reads. InnoDB
uses O_DIRECT (afaict). See articles like this one:
https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

: The first and most obvious type of IO are pages reads and writes from
: the tablespaces. The pages are most often read one at a time, as 16KB
: random read operations. Writes to the tablespaces are also typically
: 16KB random operations, but they are done in batches. After every batch,
: fsync is called on the tablespace file handle.

(the current folio patch set does not create multi-page folios for
writes, only for reads)

I downloaded the mariadb source package that's in Debian, and from what
I can glean, it does indeed set O_DIRECT on data files in Linux, through
os_file_set_nocache().
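
To make the distinction concrete, here is a minimal sketch of the two read paths being contrasted. It is not code from PostgreSQL or MariaDB; the function names read_buffered() and read_direct() are hypothetical, and the 4096-byte alignment is an assumption (real code should query the device's logical block size). Linux only, and some filesystems reject O_DIRECT at open time.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment, not queried from the device


def read_buffered(path: str, length: int = BLOCK) -> bytes:
    """Plain read(): the kernel satisfies this through the page cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, length)
    finally:
        os.close(fd)


def read_direct(path: str, length: int = BLOCK) -> bytes:
    """O_DIRECT read: bypasses the page cache.

    The buffer, file offset and length must all be suitably aligned,
    so `length` is expected to be a multiple of BLOCK.
    """
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, length)  # anonymous mapping, page-aligned
        try:
            n = os.readv(fd, [buf])  # read straight into the aligned buffer
            return buf[:n]
        finally:
            buf.close()
    finally:
        os.close(fd)
```

A workload that reads through something like read_buffered() exercises the page cache on every call, which is where a page cache improvement would show up. One that reads through something like read_direct() and keeps its own buffer pool never touches that path for data files.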

> > Regardless of whether postgres is "doing something wrong" or not, do
> > you not think that an 80% performance win would exert a certain
> > amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to
> force me to answer, my experience with storage tells me that one test
> being 10x different from all the others usually indicates a problem
> with the benchmark test itself rather than a baseline improvement, so
> I'd wait for more data.

... or the two benchmarks use Linux in completely different ways such
that one sees a huge benefit while the other sees none. Which is what
you'd expect for a patchset that improves the page cache and a
benchmark that doesn't use the page cache.