Thread (42 messages) 42 messages, 5 authors, 2012-09-27

Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

From: Mel Gorman <mgorman@suse.de>
Date: 2012-09-25 09:12:14
Also in: kvm, lkml, qemu-devel

On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
On Mon, 24 Sep 2012 10:39:38 +0100
Mel Gorman [off-list ref] wrote:
quoted
On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
quoted
Also, what has to be done to avoid the polling altogether?  eg/ie, zap
a pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?
The "something" event you are looking for is pages being freed or
allocated in the page allocator. A movable page being allocated in block
or a page being freed should clear the PB_migrate_skip bit if it's set.
Unfortunately this would impact the fast path of the alloc and free paths
of the page allocator. I felt that that was too high a price to pay.
We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned. 
That is true but that is a simple write (shared cache line but still) to
a struct zone. Worse, now that you point it out, that's pretty stupid. It
should be checking if the value is non-zero before writing to it to avoid
a cache line bounce.

Clearing the PG_migrate_skip in this path to avoid the need to ever pool is
not as cheap as it needs to

set_pageblock_skip
  -> set_pageblock_flags_group
    -> page_zone
    -> page_to_pfn
    -> get_pageblock_bitmap
    -> pfn_to_bitidx
    -> __set_bit
But that isn't on the "fast
path" really - it happens once per pcp unload. 
That's still an important enough path that I'm wary of making it fatter
and that only covers the free path. To avoid the polling, the allocation
side needs to be handled too. It could be shoved down into rmqueue() to
put it into a slightly colder path but still, it's a price to pay to keep
compaction happy.
Can we do something
like that?  Drop some hint into the zone without having to visit each
page?
Not without incurring a cost, but yes, t is possible to give a hint on when
PG_migrate_skip should be cleared and move away from that time-based hammer.

First, we'd introduce a variant of get_pageblock_migratetype() that returns
all the bits for the pageblock flags and then helpers to extract either the
migratetype or the PG_migrate_skip. We already are incurring the cost of
get_pageblock_migratetype() so it will not be much more expensive than what
is already there. If there is an allocation or free within a pageblock that
as the PG_migrate_skip bit set then we increment a counter. When the counter
reaches some to-be-decided "threshold" then compaction may clear all the
bits. This would match the criteria of the clearing being based on activity.

There are four potential problems with this

1. The logic to retrieve all the bits and split them up will be a little
   convulated but maybe it would not be that bad.

2. The counter is a shared-writable cache line but obviously it could
   be moved to vmstat and incremented with inc_zone_page_state to offset
   the cost a little.

3. The biggested weakness is that there is not way to know if the
   counter is incremented based on activity in a small subset of blocks.

4. What should the threshold be?

The first problem is minor but the other three are potentially a mess.
Adding another vmstat counter is bad enough in itself but if the counter
is incremented based on a small subsets of pageblocks, the hint becomes
is potentially useless.

However, does this match what you have in mind or am I over-complicating
things?
quoted
quoted
quoted
...

+static void reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	/*
+	 * Do not reset more than once every five seconds. If allocations are
+	 * failing sufficiently quickly to allow this to happen then continually
+	 * scanning for compaction is not going to help. The choice of five
+	 * seconds is arbitrary but will mitigate excessive scanning.
+	 */
+	if (time_before(jiffies, zone->compact_blockskip_expire))
+		return;
+	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}
What's the worst-case loop count here?
zone->spanned_pages >> pageblock_order
What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
Lets take an unlikely case - 128G single-node machine. That loop count
on x86-64 would be 65536. It'll be fast enough, particularly in this
path.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help