Re: [RFC PATCH] ext4: fix 50% disk write performance regression
From: Bill Fink <hidden>
Date: 2010-08-31 03:27:21
[adding linux-ext4 back in] On Mon, 30 Aug, Eric Sandeen wrote:
Bill Fink wrote:
On Mon, 30 Aug 2010, Eric Sandeen wrote:
Bill Fink wrote:
On Mon, 30 Aug 2010, Eric Sandeen wrote:
Bill Fink wrote:
On Mon, 30 Aug 2010, Ted Ts'o wrote:
On Sun, Aug 29, 2010 at 11:11:26PM -0400, Bill Fink wrote:
A 50% ext4 disk write performance regression was introduced in 2.6.32 and still exists in 2.6.35, although somewhat improved from 2.6.32. (Read performance was not affected.)
Thanks for reporting it. I'm going to have to take a closer look at why this makes a difference. I'm going to guess though that what's going on is that we're posting writes in such a way that they're no longer aligned or ending at the end of a RAID5 stripe, causing a read-modify-write pass. That would easily explain the write performance regression.
I'm not sure I understand. How could calling or not calling ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel) affect the write alignment? I was wondering if the locking being done in ext4_num_dirty_pages() could somehow be affecting the performance. I did notice from top that in the patched 2.6.35 kernel, the I/O wait time was generally in the 60-65% range, while in the unpatched 2.6.35 kernel, it was at a higher 75-80% range. However, I don't know if that's just a result of the lower performance, or a possible clue to its cause.
Using oprofile might also show you how much time is getting spent there.
The interesting thing is that we don't actually do anything in ext4_da_writepages() to assure that our writes are appropriately aligned and sized. We do pay attention to make sure they are aligned correctly in the allocator, but _not_ in the writepages code. So the fact that apparently things were well aligned in 2.6.32 seems to be luck... (or maybe the writes are perfectly aligned in 2.6.32; they're just much worse with 2.6.35, and with explicit attention paid to the RAID stripe size, we could do even better :-)
It was 2.6.31 that was good. The regression was in 2.6.32. And again, how does the write alignment get modified simply by whether or not ext4_num_dirty_pages() is called?
writeback is full of deep mysteries ... :)
If you could run blktraces on 2.6.32, 2.6.35 stock, and 2.6.35 with your patch, that would be really helpful to confirm my hypothesis. Is that something that wouldn't be too much trouble?
I'd be glad to if you explain how one runs blktraces.
Probably the easiest thing to do is to use seekwatcher to invoke blktrace, if it's easily available for your distro. Then it's just mount debugfs on /sys/kernel/debug, and:
# seekwatcher -d /dev/whatever -t tracename -o tracename.png -p "your dd command"
It'll leave tracename.* blktrace files, and generate a graph of the IO in the PNG file. (this causes an abbreviated trace, but it's probably enough to see what boundaries the IO was issued on)
Thanks for the info. How would you like me to send the blktraces? Even using bzip2 they're 2.6 MB. I could send them to you and Ted via private e-mail or I can hunt around and try and find somewhere I can post them. I'm attaching the PNG files (2.6.35 is unpatched and 2.6.35+ is patched).
Private email is fine I think, I don't mind a 2.6MB attachment and doubt Ted would either. :)
OK. It's two such attachments (2.6.35 is unpatched and 2.6.35+ is patched).
(fwiw I had to create *.0 and *.1 files to make blktrace happy, it didn't like starting with *.2 ...? oh well)
The .0 and .1 files were 0 bytes. There was also a 56 byte .3 file in each case. Did you also want to see those?
[sandeen@sandeen tmp]$ blkparse ext4dd-2.6.35-trace | tail -n 13
CPU2 (ext4dd-2.6.35-trace):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 249, 996KiB Writes Completed: 270,621, 30,544MiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 0
IO unplugs: 0 Timer unplugs: 0
Throughput (R/W): 13KiB/s / 409,788KiB/s
Events (ext4dd-2.6.35-trace): 270,870 entries
Skips: 0 forward (0 - 0.0%)
Input file ext4dd-2.6.35-trace.blktrace.2 added
[sandeen@sandeen tmp]$ blkparse ext4dd-2.6.35+-trace | tail -n 13
CPU2 (ext4dd-2.6.35+-trace):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 504, 2,016KiB Writes Completed: 246,500, 30,610MiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 0
IO unplugs: 0 Timer unplugs: 0
Throughput (R/W): 38KiB/s / 590,917KiB/s
Events (ext4dd-2.6.35+-trace): 247,004 entries
Skips: 0 forward (0 - 0.0%)
Input file ext4dd-2.6.35+-trace.blktrace.2 added
Ok not a -huge- difference in the overall stats, though the unpatched version did
do 24,000 more writes, which is 10% ...
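For what it's worth, the arithmetic behind that comparison can be redone from the two blkparse summaries. This is only a rough sketch: the numbers are copied from the "Writes Completed" and "Throughput" lines above, and the dictionary layout and function name are made up for illustration.

```python
# Totals copied from the two blkparse summaries in this mail.
runs = {
    "2.6.35": {"writes": 270_621, "mib": 30_544, "kib_per_s": 409_788},
    "2.6.35+": {"writes": 246_500, "mib": 30_610, "kib_per_s": 590_917},
}


def avg_write_kib(run):
    """Mean size of one completed write, in KiB."""
    return run["mib"] * 1024 / run["writes"]


# How many more writes the unpatched kernel issued for about the same data.
extra_writes = runs["2.6.35"]["writes"] - runs["2.6.35+"]["writes"]
extra_pct = 100 * extra_writes / runs["2.6.35+"]["writes"]

# Ratio of write throughput, patched over unpatched.
speedup = runs["2.6.35+"]["kib_per_s"] / runs["2.6.35"]["kib_per_s"]

for name, run in runs.items():
    print(f"{name}: {avg_write_kib(run):.1f} KiB per completed write")
print(f"extra writes in unpatched run: {extra_writes} ({extra_pct:.1f}%)")
print(f"patched / unpatched write throughput: {speedup:.2f}x")
```

The mean write size comes out slightly below 128 KiB (256 sectors) on the patched kernel and noticeably lower on the unpatched one, which fits the larger number of small writes in the unpatched trace.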
At the extremes in both cases there are 8-sector and 256-sector writes.
2.6.35:
nr wrt size
25256 8
1701 16
1646 24
297 248
...
232657 256
2.6.35+:
nr wrt size
4785 8
1732 16
357 24
...
50 248
237907 256
So not a huge difference in the distribution really, though there were
20,000 more 1-block (8-sector) writes in the unpatched version.
So a lot of 32-block writes turned into 1-block writes and smaller, I guess.
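A histogram like the ones above can be pulled out of blkparse text output with a few lines of Python. This is a minimal sketch, not the method actually used for the tables in this mail: it assumes blkparse's default per-event line layout (device, CPU, sequence, timestamp, PID, action, RWBS, then "sector + count"), counts completion ("C") events only, and the sample lines are synthetic rather than taken from the real traces.

```python
import re
from collections import Counter

# Assumed blkparse default line layout:
#   maj,min cpu seq timestamp pid action rwbs sector + nr_sectors [process]
LINE_RE = re.compile(
    r"^\s*\d+,\d+\s+\d+\s+\d+\s+[\d.]+\s+\d+\s+(?P<action>\S+)\s+"
    r"(?P<rwbs>\S+)\s+(?P<sector>\d+)\s+\+\s+(?P<count>\d+)"
)


def write_size_histogram(lines, action="C"):
    """Count write requests by size in 512-byte sectors for one action type."""
    hist = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        # Summary lines and non-matching events are skipped silently.
        if m and m.group("action") == action and "W" in m.group("rwbs"):
            hist[int(m.group("count"))] += 1
    return hist


# Synthetic sample lines, for illustration only.
sample = [
    "  8,64   2        1     0.000000000     0  C   W 2048 + 256 [0]",
    "  8,64   2        2     0.000412000     0  C   W 2304 + 256 [0]",
    "  8,64   2        3     0.000903000     0  C   W 2560 + 8 [0]",
    "  8,64   2        4     0.001100000     0  C   R 64 + 8 [0]",
]

hist = write_size_histogram(sample)
print("nr wrt  size")
for size, count in sorted(hist.items()):
    print(f"{count:>6}  {size}")
```

To run it on a real trace, something like `write_size_histogram(open("parsed.txt"))` on saved blkparse output should work, where parsed.txt is a hypothetical file name.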
To know for sure about alignment, what is your raid stripe unit, and is this
filesystem on a partition? If so at what offset?
Would the stripe unit be the Block size in the following?
[root@i7test7 linux-git]# hptrc -s1 query arrays
ID   Capacity(GB)   Type    Status   Block   Sector   Cache   Name
-------------------------------------------------------------------------------
1    500.00         RAID5   NORMAL   256k    512B     WB      RAID50_1
2    250.00         RAID5   NORMAL   256k    512B     WB      RAID500_1
3    2350.00        RAID5   NORMAL   256k    512B     WB      RAID5_1
The testing was done on the first 500 GB array.
And here's how the ext4 filesystem was created (using the entire device):
[root@i7test7 linux-git]# mkfs.ext4 /dev/sde
mke2fs 1.41.10 (10-Feb-2009)
/dev/sde is entire device, not just one partition!
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
30523392 inodes, 122070144 blocks
6103507 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3726 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
	102400000
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
-Thanks
-Bill
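If the 256k "Block" column really is the RAID stripe unit (that is still an open question at this point in the thread), the alignment arithmetic would look roughly like the sketch below. The number of data disks in the array is not given anywhere in this thread, so the value 4 used at the end is purely hypothetical, and the function names are made up for illustration.

```python
SECTOR_BYTES = 512               # "Sector" column of the hptrc output
STRIPE_UNIT_BYTES = 256 * 1024   # "Block" column, ASSUMED to be the stripe unit
FS_BLOCK_BYTES = 4096            # "Block size" reported by mke2fs


def stride_blocks(stripe_unit=STRIPE_UNIT_BYTES, fs_block=FS_BLOCK_BYTES):
    """Stripe unit expressed in filesystem blocks (what mke2fs calls stride)."""
    return stripe_unit // fs_block


def stripe_width_blocks(data_disks):
    """Full-stripe size in filesystem blocks for a given number of data disks."""
    return stride_blocks() * data_disks


def is_aligned(start_sector, nr_sectors, unit_bytes=STRIPE_UNIT_BYTES):
    """True if a request starts on a unit boundary and covers whole units."""
    unit_sectors = unit_bytes // SECTOR_BYTES
    return start_sector % unit_sectors == 0 and nr_sectors % unit_sectors == 0


print("stride:", stride_blocks(), "blocks")
print("stripe unit:", STRIPE_UNIT_BYTES // SECTOR_BYTES, "sectors")
# The largest requests in the traces are 256 sectors (128 KiB), which is
# half of an assumed 256 KiB stripe unit, so one request alone cannot
# cover a whole unit.
print("256-sector write at sector 0 covers a whole unit:", is_aligned(0, 256))
# Hypothetical example only: 4 data disks is NOT taken from this thread.
print("stripe width with 4 data disks:", stripe_width_blocks(4), "blocks")
```

Under that assumption the stride would be 64 blocks, whereas the mke2fs output above reports "Stride=0 blocks, Stripe width=0 blocks", i.e. no RAID geometry was passed at mkfs time. mke2fs accepts it through the stride and stripe-width extended options (-E).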
-Eric
I keep meaning to patch seekwatcher to color unaligned IOs differently, but without that we need the blktrace data to know if that's what's going on. It's interesting that the patched run is starting at block 0 while unpatched is starting further in (which would be a little slower at least). Was there a fresh mkfs in between?
No mkfs in between, and the original mkfs.ext4 was done without any special options. I am using the noauto_da_alloc option on the mount to work around another 9% performance hit between 2.6.31 and 2.6.32, introduced by 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10 (ext4: Fix the alloc on close after a truncate hueristic). That one only affected already existing files.
Thanks! -Eric