Re: [PATCH] mm: implement swap prefetching
From: Andrew Morton <hidden>
Date: 2006-02-07 00:36:45
Also in:
lkml
Con Kolivas [off-list ref] wrote:
Andrew et al I'm resubmitting the swap prefetching patch for inclusion in -mm and hopefully mainline.
Resubmitting is good, thanks.
After you removed it from -mm there were some people that described the benefits it afforded their workloads. -mm being ever so slightly quieter at the moment please reconsider.
I wish I could convince myself this is sufficiently beneficial.. I've been running 2.6.15-rc2-mm2 on my main workstation (x86, 2G) since whenever. (Am lazy, haven't gotten around to upgrading that machine). It has swap prefetch. I can't say I noticed any difference, although I did turn it off in /proc a few reboots ago because it was irritating me for some reason which I forget (sorry). One thing about 2.6.15-rc2-mm2 is that the `so' and `si' columns in `vmstat' always read zero. I don't know whether that bug is due to the prefetch patch or not.
The amount prefetched in each group is configurable via the tunable in /proc/sys/vm/swap_prefetch. This is set to a value based on memory size. When laptop_mode is enabled it prefetches in ten times larger blocks to minimise the time spent reading.
That's incomprehensible, sorry. I think it'd be much clearer if the thing was called swap_prefetch_kbytes or swap_prefetch_mbytes or (worse) swap_prefetch_pages - putting the units in the name really helps clarify things. And if such a change is made, the internal variable should also be renamed. Right now it's "swap_prefetch", which sounds like a boolean.
+swap_prefetch + +This is the amount of data prefetched per prefetching interval when +swap prefetching is compiled in. The value means multiples of 128K, +except when laptop_mode is enabled and then it is ten times larger. +Setting it to 0 disables prefetching entirely.
What does "ten times larger" mean? If laptop_mode, this thing is in units of 1280 kbytes and if !laptop_mode it's in units of 128 kbytes? If so (or if not), this tunable is quite obscure and hard-to-understand. Can you find a way to make this more user-friendly?
+/* only used by prefetch externally */ +/* mm/swap_prefetch.c */ +extern void prepare_prefetch(void); +extern void add_to_swapped_list(unsigned long index); +extern void remove_from_swapped_list(unsigned long index); +extern void delay_prefetch(void);
I'd suggest that "prefetch" is too generic a term. We prefetch lots of things in the kernel. Please rename all globally-visible identifiers with s/prefetch/swap_prefetch/.
quoted hunk ↗ jump to hunk
===================================================================--- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc2-ck1/mm/swap_prefetch.c 2006-02-04 18:38:24.000000000 +1100@@ -0,0 +1,431 @@ +/* + * linux/mm/swap_prefetch.c + * + * Copyright (C) 2005 Con Kolivas + * + * Written by Con Kolivas <kernel@kolivas.org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/fs.h> +#include <linux/swap.h> +#include <linux/ioprio.h> +#include <linux/kthread.h> +#include <linux/pagemap.h> +#include <linux/syscalls.h> +#include <linux/writeback.h> + +/* Time to delay prefetching if vm is busy or prefetching unsuccessful */ +#define PREFETCH_DELAY (HZ * 5) +/* Time between attempting prefetching when vm is idle */ +#define PREFETCH_INTERVAL (HZ) + +/* sysctl - how many SWAP_CLUSTER_MAX pages to prefetch at a time */ +int swap_prefetch __read_mostly; + +struct swapped_root { + unsigned long busy; /* vm busy */ + spinlock_t lock; /* protects all data */ + struct list_head list; /* MRU list of swapped pages */ + struct radix_tree_root swap_tree; /* Lookup tree of pages */ + unsigned int count; /* Number of entries */ + unsigned int maxcount; /* Maximum entries allowed */ + kmem_cache_t *cache;
/* Of struct swapped_entry */
+};
+struct swapped_entry {
+ swp_entry_t swp_entry;
+ struct list_head swapped_list;
+};
+
+static struct swapped_root swapped = {
+ .busy = 0,
+ .lock = SPIN_LOCK_UNLOCKED,
+ .list = LIST_HEAD_INIT(swapped.list),
+ .swap_tree = RADIX_TREE_INIT(GFP_ATOMIC),
+ .count = 0,
+};Description of `busy' and `count'?
+static task_t *kprefetchd_task; + +/* Max mapped we will prefetch to */ +static unsigned long mapped_limit __read_mostly; +/* Last total free pages */ +static unsigned long last_free = 0; +static unsigned long temp_free = 0;
Unneeded initialisation.
+
+/*
+ * Accounting is sloppy on purpose. As adding and removing entries from the
+ * list happens during swapping in and out we don't want to be spinning on
+ * locks. It is cheaper to just miss adding an entry since having a reference
+ * to every entry is not critical.
+ */
+void add_to_swapped_list(unsigned long index)
+{
+ struct swapped_entry *entry;
+ int error;
+
+ if (unlikely(!spin_trylock(&swapped.lock)))
+ goto out;hm, spin_trylock() should internally do unlikely(), but it doesn't. (It's a bit of a mess, too).
+ if (swapped.count >= swapped.maxcount) {/* * <comment about LRU> */
+ entry = list_entry(swapped.list.next,
+ struct swapped_entry, swapped_list);
+ radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
+ list_del(&entry->swapped_list);
+ swapped.count--;
+ } else {
+ entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
+ if (unlikely(!entry))
+ /* bad, can't allocate more mem */
+ goto out_locked;
+ }
+
+ entry->swp_entry.val = index;
+
+ error = radix_tree_preload(GFP_ATOMIC);I suspect we don't need to preload here. We can handle radix_tree_insert() failure, so just go ahead and do it.
+static inline int high_zone(struct zone *zone)
+{
+ if (zone == NULL)
+ return 0;
+ return is_highmem(zone);
+}
+
+/*
+ * Find the zone with the most free pages, recheck the watermarks and
+ * then directly allocate the ram. We don't want prefetch to use
+ * __alloc_pages and go calling on reclaim.
+ */
+static struct page *prefetch_get_page(void)
+{
+ struct zone *zone = NULL, *z;
+ struct page *page = NULL;
+ struct zonelist *zonelist;
+ long most_free = 0;
+
+ for_each_zone(z) {
+ long free;
+
+ if (z->present_pages == 0)
+ continue;
+
+ /* We don't prefetch into DMA */
+ if (zone_idx(z) == ZONE_DMA)
+ continue;
+
+ free = z->free_pages;
+ /* Select the zone with the most free ram preferring high */
+ if ((free > most_free && (!high_zone(zone) || high_zone(z))) ||
+ (!high_zone(zone) && high_zone(z))) {
+ most_free = free;
+ zone = z;
+ }
+ }<stares at the above expression for three minutes> I think it'll always select ZONE_HIGHMEM no matter what. Users of 1G x86 boxes not happy.
+/*
+ * How many pages to prefetch at a time. We prefetch SWAP_CLUSTER_MAX *
+ * swap_prefetch per PREFETCH_INTERVAL, but prefetch ten times as much at a
+ * time in laptop_mode to minimise the time we keep the disk spinning.
+ */
+static inline unsigned long prefetch_pages(void)
+{
+ return (SWAP_CLUSTER_MAX * swap_prefetch * (1 + 9 * !!laptop_mode));
+}I don't think this should be done in-kernel. There's a nice script to start and stop laptop mode. We can make this decision in that script.
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(void)
+{
+ struct page_state ps;
+ unsigned long limit;
+ struct zone *z;
+ int ret = 0;
+
+ /* Purposefully racy and might return false positive which is ok */
+ if (__test_and_clear_bit(0, &swapped.busy))
+ goto out;
+
+ temp_free = 0;
+ /*
+ * Have some hysteresis between where page reclaiming and prefetching
+ * will occur to prevent ping-ponging between them.
+ */
+ for_each_zone(z) {
+ unsigned long free;
+
+ if (z->present_pages == 0)
+ continue;
+ free = z->free_pages;
+ if (z->pages_high * 3 > free)
+ goto out;
+ temp_free += free;
+ }
+
+ /*
+ * We check to see that pages are not being allocated elsewhere
+ * at any significant rate implying any degree of memory pressure
+ * (eg during file reads)
+ */
+ if (last_free) {
+ if (temp_free + SWAP_CLUSTER_MAX < last_free) {
+ last_free = temp_free;
+ goto out;
+ }
+ } else
+ last_free = temp_free;What is the actual threshold rate here? SWAP_CLUSTER_MAX/(how fast your CPU is)? Seems a bit vague?
+ get_page_state(&ps);
get_page_state() can be super-expensive. How frequently is this called?
+
+static int kprefetchd(void *__unused)
+{
+ set_user_nice(current, 19);
+ /* Set ioprio to lowest if supported by i/o scheduler */
+ sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+
+ do {
+ enum trickle_return prefetched;
+
+ try_to_freeze();
+
+ /*
+ * TRICKLE_FAILED implies no entries left - we do not schedule
+ * a wakeup, and further delay the next one.
+ */
+ prefetched = trickle_swap();
+ switch (prefetched) {
+ case TRICKLE_SUCCESS:
+ last_free = temp_free;This `last_free' thing is really confusing. It's central to the algorithms yet its name is largely meaningless. last_free *what*? It seems to mean "total number of free pages on the last prefetching pass", yes? Wanna think of a better name and a better comment for it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>