Thread: 389 messages, 13 authors, 2021-08-21

Re: Folios give an 80% performance win

From: Andres Freund <hidden>
Date: 2021-07-24 18:46:15
Also in: linux-fsdevel, lkml

Hi,

On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for
> > > > PostgreSQL. I know they use the page cache extensively, so it's
> > > > plausibly real. I'm a bit surprised that it has such good
> > > > locality, and the size of the win far exceeds my
> > > > expectations.  We should probably dive into it and figure out
> > > > exactly what's going on.
> > > Since none of the other tested databases showed more than a 3%
> > > improvement, this looks like an anomalous result specific to
> > > something in postgres ... although the next biggest db: mariadb
> > > wasn't part of the tests so I'm not sure that's
> > > definitive.  Perhaps the next step should be to test
> > > mariadb?  Since they're fairly similar in domain (both full
> > > SQL) if mariadb shows this type of improvement, you can
> > > safely assume it's something in the way SQL databases handle paging
> > > and if it doesn't, it's likely fixing a postgres inefficiency.
> > I think the thing that's specific to PostgreSQL is that it's a heavy
> > user of the page cache.  My understanding is that most databases use
> > direct IO and manage their own page cache, while PostgreSQL trusts
> > the kernel to get it right.
> That's testable with mariadb, at least for the innodb engine since the
> flush_method is settable.
> > Regardless of whether postgres is "doing something wrong" or not,
> > do you not think that an 80% performance win would exert a certain
> > amount of pressure on distros to do the backport?
> Well, I cut the previous question deliberately, but if you're going to
> force me to answer, my experience with storage tells me that one test
> being 10x different from all the others usually indicates a problem
> with the benchmark test itself rather than a baseline improvement, so
> I'd wait for more data.
I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for WAL (but there was recent work in the area).

Is there a reason for fdatasync() of 16MB files to have got a lot faster? Or a chance that it could be broken?
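
One way to look at that question in isolation is to time fdatasync() directly on a 16MB file on each kernel. The following is only a minimal sketch and not PostgreSQL's WAL code: the function name `time_fdatasync`, the 8 kB write size, and the number of iterations are assumptions made for illustration, and `os.fdatasync` is only available on platforms that provide it (Linux does).

```python
import os
import statistics
import tempfile
import time

SEGMENT_SIZE = 16 * 1024 * 1024  # 16MB, the file size mentioned above
RECORD_SIZE = 8192               # hypothetical size of each small write


def time_fdatasync(path, commits=20):
    """Return the latency in seconds of each fdatasync() after a small write."""
    latencies = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Set the file length up front so the writes below never extend it.
        os.ftruncate(fd, SEGMENT_SIZE)
        os.fsync(fd)
        payload = b"x" * RECORD_SIZE
        for i in range(commits):
            # Buffered write into the page cache, then force it to storage.
            os.pwrite(fd, payload, (i * RECORD_SIZE) % SEGMENT_SIZE)
            start = time.perf_counter()
            os.fdatasync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        samples = time_fdatasync(os.path.join(tmp, "segment"))
    print(f"median fdatasync latency: {statistics.median(samples) * 1e6:.0f} us")
```

The temporary directory may live on tmpfs, where syncing is nearly free, so for a meaningful comparison the path should point at the filesystem and device that held the WAL in the benchmark. Latencies that look implausibly low for that device would suggest the sync is not reaching storage.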

Some improvement for read-only wouldn't surprise me, particularly if the OS/PG weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* TLB-miss heavy with 4k pages.
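
The page-size effect can be put in rough numbers. This is a back-of-the-envelope sketch under loudly stated assumptions: the 8 GiB working set and the 1536-entry TLB are hypothetical figures, not measurements from the benchmark, and the model simply treats uniformly random accesses as hitting the TLB in proportion to the share of pages it can hold.

```python
def tlb_coverage(working_set_bytes, page_bytes, tlb_entries):
    """Fraction of the working set's pages that fit in the TLB at once.

    With uniformly distributed accesses this approximates the TLB hit rate.
    """
    pages = -(-working_set_bytes // page_bytes)  # ceiling division
    return min(1.0, tlb_entries / pages)


if __name__ == "__main__":
    working_set = 8 * 2**30  # hypothetical 8 GiB of hot data
    tlb_entries = 1536       # hypothetical TLB size
    for label, page_bytes in (("4 KiB", 4 * 2**10), ("2 MiB", 2 * 2**20)):
        share = tlb_coverage(working_set, page_bytes, tlb_entries)
        print(f"{label} pages: {share:.4%} of the working set covered")
```

Under these assumed numbers, 4 KiB pages cover well under 0.1% of the working set while 2 MiB pages cover 37.5%, which is why a uniform access pattern is so sensitive to page size.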

Regards,
Andres