Re: [PATCH 4/9] raid5: log reclaim support
From: Shaohua Li <hidden>
Date: 2015-08-05 21:34:21
On Wed, Aug 05, 2015 at 01:43:30PM +1000, NeilBrown wrote:
On Wed, 29 Jul 2015 17:38:44 -0700 Shaohua Li [off-list ref] wrote:quoted
This is the reclaim support for raid5 log. A stripe write will have following steps: 1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to raid disks 2. hijack ops_run_io. stripe data/parity is appending to log disk 3. flush log disk cache 4. ops_run_io run again and do normal operation. stripe data/parity is written in raid array disks. raid core can return io to upper layer. 5. flush cache of all raid array disks 6. update super block 7. log disk space used by the stripe can be reused In practice, several stripes consist of an io_unit and we will batch several io_unit in different steps, but the whole process doesn't change. It's possible io return just after data/parity hit log disk, but then read IO will need read from log disk. For simplicity, IO return happens at step 4, where read IO can directly read from raid disks. Currently reclaim run every minute or out of space. Reclaim is just to free log disk spaces, it doesn't impact data consistency.Having arbitrary times lines "every minute" is a warning sign. "As soon as possible" and "Just it time" can both make sense easily. "every minute" needs more justification. I'll probably say more when I find the code.
The idea is if we reclaim periodically, recovery could scan less log space. It's insane recovery scans a 1T disk. As I said this is just to free disk spaces. It's not a signal we will lose data in minute interval. I can change the relaim to run every 1G reclaimable space for example.
quoted
+ r5l_move_io_unit_list(&log->running_ios, &log->io_end_ios, + IO_UNIT_IO_END); + r5l_move_io_unit_list(&log->io_end_ios, &log->stripe_end_ios, + IO_UNIT_STRIPE_END); + r5l_compress_stripe_end_list(log); + run_stripe = !list_empty(&log->io_end_ios); + spin_unlock(&log->io_list_lock); + + if (!run_stripe) + return; + + blkdev_issue_flush(r5l_bdev(log), GFP_NOIO, NULL); + + spin_lock(&log->io_list_lock); + list_for_each_entry(io, &log->io_end_ios, log_sibling) { + if (io->state >= IO_UNIT_STRIPE_START) + continue; + r5l_set_io_unit_state(io, IO_UNIT_STRIPE_START); + + while (!list_empty(&io->stripe_list)) { + sh = list_first_entry(&io->stripe_list, + struct stripe_head, log_list); + list_del_init(&sh->log_list); + set_bit(STRIPE_HANDLE, &sh->state); + release_stripe(sh);This code makes me a bit nervous. handle_stripe() can potentially be called on any stripe at any time. Here you are scheduling a call the handle_stripe() without obviously changing the state of the stripe. So whatever is going to happen now could potentially have happened before... is that safe?
I'm not fully sure, but it's ok in my test.
quoted
+ * move proper io_unit to reclaim list. We should not change the order. + * reclaimable/unreclaimable io_unit can be mixed in the list, we + * shouldn't reuse space of an unreclaimable io_unit + * */ + while (1) { + r5l_move_io_unit_list(&log->running_ios, &log->io_end_ios, + IO_UNIT_IO_END); + r5l_move_io_unit_list(&log->io_end_ios, &log->stripe_end_ios, + IO_UNIT_STRIPE_END); + while (!list_empty(&log->stripe_end_ios)) { + io = list_first_entry(&log->stripe_end_ios, + struct r5l_io_unit, log_sibling); + list_move_tail(&io->log_sibling, &list); + free += (io->log_end - io->log_start + + log->total_blocks) % log->total_blocks; + } + + if (free >= reclaim_target || (list_empty(&log->running_ios) && + list_empty(&log->io_end_ios) && + list_empty(&log->stripe_end_ios))) + break; + + if (!list_empty(&log->io_end_ios)) { + io = list_first_entry(&log->io_end_ios, + struct r5l_io_unit, log_sibling); + spin_unlock(&log->io_list_lock); + /* nobody else can delete the io, we are safe */ + r5l_kick_io_unit(log, io); + spin_lock(&log->io_list_lock); + continue; + } + + if (!list_empty(&log->running_ios)) { + io = list_first_entry(&log->running_ios, + struct r5l_io_unit, log_sibling); + spin_unlock(&log->io_list_lock); + /* nobody else can delete the io, we are safe */ + r5l_kick_io_unit(log, io); + spin_lock(&log->io_list_lock); + continue; + } + } + spin_unlock(&log->io_list_lock);Well, here we are with the important parts of the reclaim code... The main result of the above section is to possibly call r5l_flush_stripe_to_raid() a few times, and to wait until 'list' contains enough io_units to satisfy the requirement. As raid5d already calls r5l_flush_stripe_to_raid - which it really must to make sure that writes complete quickly - this really comes down to some book keeping and some waiting. Book keeping can be done as changes happen, and waiting is best not done at all. To be more specific: when an io_unit transitions to IO_UNIT_STRIPE_END it can immediately be removed from the list and if it was the first io_unit on the list, then the log_start can immediately be updated.
Ok, the original idea is to avoid holding the io_list_lock for every IO end/stripe end. Maybe over-designed, I'll fix this.
quoted
+ + if (list_empty(&list)) + return; + + r5l_flush_all_disks(log); + + /* super always point to last valid meta */ + last = list_last_entry(&list, struct r5l_io_unit, log_sibling); + r5l_write_super(log, r5l_block_to_sector(log, last->log_start));This bit flushes all the disks and then updates the metadata and writes it. As md_super_write already uses WRITE_FLUSH_FUA I don't think the extra flush is needed.
great.
I really think you should just update ->recovery_offset and set MD_CHANGE_DEVS (or similar) and let the update happen.
I think we should write super here. The reclaimed space might be reused immediately, we don't want to confuse recovery.
quoted
+ + mutex_lock(&log->io_mutex); + log->last_checkpoint = last->log_start; + log->last_cp_seq = last->seq; + mutex_unlock(&log->io_mutex); + wake_up(&log->space_waitq); + + while (!list_empty(&list)) { + io = list_first_entry(&list, struct r5l_io_unit, log_sibling); + list_del(&io->log_sibling); + r5l_free_io_unit(log, io); + } +}So I really think all of this can be done as-it-happens (the book-keeping) or asynchronously. There is no need to push something every minute.
Because we need flush raid disks cache, to avoid the overhead, we do batch operation. Once I changed the reclaim to run every specific reclaimable space, this should be ok. Thanks, Shaohua