Re: [PATCH v2] memcg: add vmscan_stat
From: Andrew Bresticker <hidden>
Date: 2011-07-15 18:34:20
Subsystem:
control group - memory resource controller (memcg), memory management, memory management - mglru (multi-gen lru), memory management - reclaim, the rest · Maintainers:
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton, Linus Torvalds
I've extended your patch to track write-back during page reclaim: --- From: Andrew Bresticker <redacted> Date: Thu, 14 Jul 2011 17:56:48 -0700 Subject: [PATCH] vmscan: Track number of pages written back during page reclaim. This tracks pages written out during page reclaim in memory.vmscan_stat and breaks it down by file vs. anon and context (like "scanned_pages", "rotated_pages", etc.). Example output: $ mkdir /dev/cgroup/memory/1 $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes $ echo $$ > /dev/cgroup/memory/1/tasks $ dd if=/dev/urandom of=file_20g bs=4096 count=524288 $ cat /dev/cgroup/memory/1/memory.vmscan_stat ... written_pages_by_limit 36 written_anon_pages_by_limit 0 written_file_pages_by_limit 36 ... written_pages_by_limit_under_hierarchy 28 written_anon_pages_by_limit_under_hierarchy 0 written_file_pages_by_limit_under_hierarchy 28 Signed-off-by: Andrew Bresticker <redacted> --- include/linux/memcontrol.h | 1 + mm/memcontrol.c | 12 ++++++++++++ mm/vmscan.c | 10 +++++++--- 3 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4b49edf..4be907e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h@@ -46,6 +46,7 @@ struct memcg_scanrecord { unsigned long nr_scanned[2]; /* the number of scanned pages */ unsigned long nr_rotated[2]; /* the number of rotated pages */ unsigned long nr_freed[2]; /* the number of freed pages */ + unsigned long nr_written[2]; /* the number of pages written back */ unsigned long elapsed; /* nsec of time elapsed while scanning */ };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9bb6e93..5ec2aa3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c@@ -221,6 +221,9 @@ enum { FREED, FREED_ANON, FREED_FILE, + WRITTEN, + WRITTEN_ANON, + WRITTEN_FILE, ELAPSED, NR_SCANSTATS, };
@@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCANSTATS] = { "freed_pages", "freed_anon_pages", "freed_file_pages", + "written_pages", + "written_anon_pages", + "written_file_pages", "elapsed_ns", }; #define SCANSTAT_WORD_LIMIT "_by_limit"
@@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(unsignedlong *stats,
stats[FREED_ANON] += rec->nr_freed[0];
stats[FREED_FILE] += rec->nr_freed[1];
+ stats[WRITTEN] += rec->nr_written[0] + rec->nr_written[1];
+ stats[WRITTEN_ANON] += rec->nr_written[0];
+ stats[WRITTEN_FILE] += rec->nr_written[1];
+
stats[ELAPSED] += rec->elapsed;
}
@@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(structmem_cgroup *root_mem,
rec.nr_rotated[1] = 0;
rec.nr_freed[0] = 0;
rec.nr_freed[1] = 0;
+ rec.nr_written[0] = 0;
+ rec.nr_written[1] = 0;
rec.elapsed = 0;
/* we use swappiness of local cgroup */
if (check_soft) {diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fb1abd..f73b96e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c@@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(structlist_head *free_pages)
*/
static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
- struct scan_control *sc)
+ struct scan_control *sc, int file)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);@@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head*page_list, unsigned long nr_dirty = 0; unsigned long nr_congested = 0; unsigned long nr_reclaimed = 0; + unsigned long nr_written = 0; cond_resched();
@@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head*page_list, case PAGE_ACTIVATE: goto activate_locked; case PAGE_SUCCESS: + nr_written++; if (PageWriteback(page)) goto keep_lumpy; if (PageDirty(page))
@@ -958,6 +960,8 @@ keep_lumpy: free_page_list(&free_pages); list_splice(&ret_pages, page_list); + if (!scanning_global_lru(sc)) + sc->memcg_record->nr_written[file] += nr_written; count_vm_events(PGACTIVATE, pgactivate); return nr_reclaimed; }
@@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, structzone *zone, spin_unlock_irq(&zone->lru_lock); - nr_reclaimed = shrink_page_list(&page_list, zone, sc); + nr_reclaimed = shrink_page_list(&page_list, zone, sc, file); if (!scanning_global_lru(sc)) sc->memcg_record->nr_freed[file] += nr_reclaimed;
@@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, structzone *zone,
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
set_reclaim_mode(priority, sc, true);
- nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+ nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
}
local_irq_disable();
--
1.7.3.1
Thanks,
Andrew
On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:
On Tue, 12 Jul 2011 16:02:02 -0700 Andrew Bresticker [off-list ref] wrote:quoted
On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki < kamezawa.hiroyu@jp.fujitsu.com> wrote:quoted
This patch is onto mmotm-0710... got bigger than expected ;( == [PATCH] add memory.vmscan_stat commit log of commit 0ae5e89 " memcg: count the soft_limit reclaimin..."quoted
quoted
says it adds scanning stats to memory.stat file. But it doesn't because we considered we needed to make a concensus for such new APIs. This patch is a trial to add memory.scan_stat. This shows - the number of scanned pages(total, anon, file) - the number of rotated pages(total, anon, file) - the number of freed pages(total, anon, file) - the number of elaplsed time (including sleep/pause time) for both of direct/soft reclaim. The biggest difference with oringinal Ying's one is that this file can be reset by some write, as # echo 0 ...../memory.scan_stat Example of output is here. This is a result after make -j 6 kernel under 300M limit. [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat scanned_pages_by_limit 9471864 scanned_anon_pages_by_limit 6640629 scanned_file_pages_by_limit 2831235 rotated_pages_by_limit 4243974 rotated_anon_pages_by_limit 3971968 rotated_file_pages_by_limit 272006 freed_pages_by_limit 2318492 freed_anon_pages_by_limit 962052 freed_file_pages_by_limit 1356440 elapsed_ns_by_limit 351386416101 scanned_pages_by_system 0 scanned_anon_pages_by_system 0 scanned_file_pages_by_system 0 rotated_pages_by_system 0 rotated_anon_pages_by_system 0 rotated_file_pages_by_system 0 freed_pages_by_system 0 freed_anon_pages_by_system 0 freed_file_pages_by_system 0 elapsed_ns_by_system 0 scanned_pages_by_limit_under_hierarchy 9471864 scanned_anon_pages_by_limit_under_hierarchy 6640629 scanned_file_pages_by_limit_under_hierarchy 2831235 rotated_pages_by_limit_under_hierarchy 4243974 rotated_anon_pages_by_limit_under_hierarchy 3971968 rotated_file_pages_by_limit_under_hierarchy 272006 freed_pages_by_limit_under_hierarchy 2318492 freed_anon_pages_by_limit_under_hierarchy 962052 freed_file_pages_by_limit_under_hierarchy 1356440 elapsed_ns_by_limit_under_hierarchy 351386416101 scanned_pages_by_system_under_hierarchy 0 scanned_anon_pages_by_system_under_hierarchy 0 scanned_file_pages_by_system_under_hierarchy 0 rotated_pages_by_system_under_hierarchy 0 rotated_anon_pages_by_system_under_hierarchy 0 rotated_file_pages_by_system_under_hierarchy 0 freed_pages_by_system_under_hierarchy 0 freed_anon_pages_by_system_under_hierarchy 0 freed_file_pages_by_system_under_hierarchy 0 elapsed_ns_by_system_under_hierarchy 0 total_xxxx is for hierarchy management. This will be useful for further memcg developments and need to be developped before we do some complicated rework on LRU/softlimit management. This patch adds a new struct memcg_scanrecord into scan_control struct. sc->nr_scanned at el is not designed for exporting information. For example, nr_scanned is reset frequentrly and incremented +2 at scanning mapped pages. For avoiding complexity, I added a new param in scan_control which isforquoted
quoted
exporting scanning score. Signed-off-by: KAMEZAWA Hiroyuki <redacted> Changelog: - renamed as vmscan_stat - handle file/anon - added "rotated" - changed names of param in vmscan_stat. --- Documentation/cgroups/memory.txt | 85 +++++++++++++++++++ include/linux/memcontrol.h | 19 ++++ include/linux/swap.h | 6 - mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++-- mm/vmscan.c | 39 +++++++- 5 files changed, 303 insertions(+), 18 deletions(-) Index: mmotm-0710/Documentation/cgroups/memory.txt ===================================================================--- mmotm-0710.orig/Documentation/cgroups/memory.txt +++ mmotm-0710/Documentation/cgroups/memory.txt@@ -380,7 +380,7 @@ will be charged as a new owner of it. 5.2 stat file -memory.stat file includes following statistics +5.2.1 memory.stat file includes following statistics # per-memory cgroup local status cache - # of bytes of page cache memory.@@ -438,6 +438,89 @@ Note: file_mapped is accounted only when the memory cgroup is ownerofquoted
quoted
page cache.) +5.2.2 memory.vmscan_stat + +memory.vmscan_stat includes statistics information for memory scanningandquoted
quoted
+freeing, reclaiming. The statistics shows memory scanning information since +memory cgroup creation and can be reset to 0 by writing 0 as + + #echo 0 > ../memory.vmscan_stat + +This file contains following statistics. + +[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy] +[param]_elapsed_ns_by_[reason]_[under_hierarchy] + +For example, + + scanned_file_pages_by_limit indicates the number of scanned + file pages at vmscan. + +Now, 3 parameters are supported + + scanned - the number of pages scanned by vmscan + rotated - the number of pages activated at vmscan + freed - the number of pages freed by vmscan + +If "rotated" is high against scanned/freed, the memcg seems busy. + +Now, 2 reason are supported + + limit - the memory cgroup's limit + system - global memory pressure + softlimit + (global memory pressure not under softlimit is not handlednow)quoted
quoted
+ +When under_hierarchy is added in the tail, the number indicates the +total memcg scan of its children and itself. + +elapsed_ns is a elapsed time in nanosecond. This may include sleeptimequoted
quoted
+and not indicates CPU usage. So, please take this as just showing +latency. + +Here is an example. + +# cat /cgroup/memory/A/memory.vmscan_stat +scanned_pages_by_limit 9471864 +scanned_anon_pages_by_limit 6640629 +scanned_file_pages_by_limit 2831235 +rotated_pages_by_limit 4243974 +rotated_anon_pages_by_limit 3971968 +rotated_file_pages_by_limit 272006 +freed_pages_by_limit 2318492 +freed_anon_pages_by_limit 962052 +freed_file_pages_by_limit 1356440 +elapsed_ns_by_limit 351386416101 +scanned_pages_by_system 0 +scanned_anon_pages_by_system 0 +scanned_file_pages_by_system 0 +rotated_pages_by_system 0 +rotated_anon_pages_by_system 0 +rotated_file_pages_by_system 0 +freed_pages_by_system 0 +freed_anon_pages_by_system 0 +freed_file_pages_by_system 0 +elapsed_ns_by_system 0 +scanned_pages_by_limit_under_hierarchy 9471864 +scanned_anon_pages_by_limit_under_hierarchy 6640629 +scanned_file_pages_by_limit_under_hierarchy 2831235 +rotated_pages_by_limit_under_hierarchy 4243974 +rotated_anon_pages_by_limit_under_hierarchy 3971968 +rotated_file_pages_by_limit_under_hierarchy 272006 +freed_pages_by_limit_under_hierarchy 2318492 +freed_anon_pages_by_limit_under_hierarchy 962052 +freed_file_pages_by_limit_under_hierarchy 1356440 +elapsed_ns_by_limit_under_hierarchy 351386416101 +scanned_pages_by_system_under_hierarchy 0 +scanned_anon_pages_by_system_under_hierarchy 0 +scanned_file_pages_by_system_under_hierarchy 0 +rotated_pages_by_system_under_hierarchy 0 +rotated_anon_pages_by_system_under_hierarchy 0 +rotated_file_pages_by_system_under_hierarchy 0 +freed_pages_by_system_under_hierarchy 0 +freed_anon_pages_by_system_under_hierarchy 0 +freed_file_pages_by_system_under_hierarchy 0 +elapsed_ns_by_system_under_hierarchy 0 + 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy ofgroupsquoted
quoted
only. Index: mmotm-0710/include/linux/memcontrol.h ===================================================================--- mmotm-0710.orig/include/linux/memcontrol.h +++ mmotm-0710/include/linux/memcontrol.h@@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_ struct mem_cgroup *mem_cont, int active, int file); +struct memcg_scanrecord { + struct mem_cgroup *mem; /* scanend memory cgroup */ + struct mem_cgroup *root; /* scan target hierarchy root */ + int context; /* scanning context (see memcontrol.c)*/quoted
quoted
+ unsigned long nr_scanned[2]; /* the number of scanned pages */ + unsigned long nr_rotated[2]; /* the number of rotated pages */ + unsigned long nr_freed[2]; /* the number of freed pages */ + unsigned long elapsed; /* nsec of time elapsed while scanning*/quoted
quoted
+}; + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All "charge" functions with gfp_mask should use GFP_KERNEL or@@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup*mem,quoted
quoted
+ gfp_t gfp_mask, bool noswap, + structmemcg_scanrecordquoted
quoted
*rec); +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup*mem,quoted
quoted
+ gfp_t gfp_mask, bool noswap, + struct zone *zone, + struct memcg_scanrecord *rec, + unsigned long*nr_scanned);quoted
quoted
+ #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif Index: mmotm-0710/include/linux/swap.h ===================================================================--- mmotm-0710.orig/include/linux/swap.h +++ mmotm-0710/include/linux/swap.h@@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st /* linux/mm/vmscan.c */ extern unsigned long try_to_free_pages(struct zonelist *zonelist, intorder, gfp_t gfp_mask, nodemask_t*mask);quoted
quoted
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup*mem,quoted
quoted
- gfp_t gfp_mask, bool noswap); -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup*mem,quoted
quoted
- gfp_t gfp_mask, bool noswap, - struct zone *zone, - unsigned long*nr_scanned);quoted
quoted
extern int __isolate_lru_page(struct page *page, int mode, int file); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; Index: mmotm-0710/mm/memcontrol.c ===================================================================--- mmotm-0710.orig/mm/memcontrol.c +++ mmotm-0710/mm/memcontrol.c@@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list { static void mem_cgroup_threshold(struct mem_cgroup *mem); static void mem_cgroup_oom_notify(struct mem_cgroup *mem); +enum { + SCAN_BY_LIMIT, + SCAN_BY_SYSTEM, + NR_SCAN_CONTEXT, + SCAN_BY_SHRINK, /* not recorded now */ +}; + +enum { + SCAN, + SCAN_ANON, + SCAN_FILE, + ROTATE, + ROTATE_ANON, + ROTATE_FILE, + FREED, + FREED_ANON, + FREED_FILE, + ELAPSED, + NR_SCANSTATS, +}; + +struct scanstat { + spinlock_t lock; + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS]; + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS]; +};I'm working on a similar effort with Ying here at Google and so far we've been using per-cpu counters for these statistics instead of spin-lock protected counters. Clearly the spin-lock protected counters have less memory overhead and make reading the stat file faster, but our concern is that this method is inconsistent with the other memory stat files such /proc/vmstat and /dev/cgroup/memory/.../memory.stat. Is there any particular reason you chose to use spin-lock protected counters insteadofquoted
per-cpu counters?In my experience, if we do "batch" enouch, it works always better than percpu-counter. percpu counter is effective when batching is difficult. This patch's implementation does enough batching and it's much coarse grained than percpu counter. Then, this patch is better than percpu.quoted
I've also modified your patch to use per-cpu counters instead ofspin-lockquoted
protected counters. I tested it by doing streaming I/O from a ramdisk: $ mke2fs /dev/ram1 $ mkdir /tmp/swapram $ mkdir /tmp/swapram/ram1 $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1 $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096 $ mkdir /dev/cgroup/memory/1 $ echo 8m > /dev/cgroup/memory/1 $ ./ramdisk_load.sh 7 $ echo $$ > /dev/cgroup/memory/1/tasks $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done Where ramdisk_load.sh is: for ((i=0; i<=$1; i++)) do echo $$ >/dev/cgroup/memory/1/tasks for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m >/dev/zero;quoted
done & done Surprisingly, the per-cpu counters perform worse than the spin-lock protected counters. Over 10 runs of the test above, the per-cpu counters were 1.60% slower in both real time and sys time. I'm wondering if youhavequoted
any insight as to why this is. I can provide my diff against your patchifquoted
necessary.The percpu counte works effectively only when we use +1/-1 at each change of counters. It uses "batch" to merge the per-cpu value to the counter. I think you use default "batch" value but the scan/rotate/free/elapsed value is always larger than "batch" and you just added memory overhead and "if" to pure spinlock counters. Determining this "batch" threshold for percpu counter is difficult. Thanks, -Kame