Re: [PATCH V5] Eliminate task stack trace duplication.
From: Ying Han <hidden>
Date: 2011-08-29 07:08:20
On Fri, Aug 26, 2011 at 4:19 PM, Andrew Morton [off-list ref]wrote:
(I'm back!)
Thank you Andrew for the comments. Hmm, Looks like we still need some changes for this patch to get it merged into -mm and I might be able to jump into it sometime next week. :) --Ying
On Thu, 28 Jul 2011 18:25:59 -0700 Andrew Bresticker [off-list ref] wrote:quoted
The problem with small dmesg ring buffer like 512k is that only limitednumberquoted
of task traces will be logged. Sometimes we lose important informationonlyquoted
because of too many duplicated stack traces. This problem occurs whendumpingquoted
lots of stacks in a single operation, such as sysrq-T. This patch tries to reduce the duplication of task stack trace in thedumpquoted
message by hashing the task stack. The hashtable is a 32k pre-allocatedbufferquoted
during bootup. Then we hash the task stack with stack_depth 32 for eachstackquoted
entry. Each time if we find the identical task trace in the task stack,we dumpquoted
only the pid of the task which has the task trace dumped. So it is easyto backquoted
track to the full stack with the pid. [ 58.469730] kworker/0:0 S 0000000000000000 0 4 20x00000000quoted
[ 58.469735] ffff88082fcfde80 0000000000000046 ffff88082e9d8000ffff88082fcfc010quoted
[ 58.469739] ffff88082fce9860 0000000000011440 ffff88082fcfdfd8ffff88082fcfdfd8quoted
[ 58.469743] 0000000000011440 0000000000000000 ffff88082fcee180ffff88082fce9860quoted
[ 58.469747] Call Trace: [ 58.469751] [<ffffffff8108525a>] worker_thread+0x24b/0x250 [ 58.469754] [<ffffffff8108500f>] ? manage_workers+0x192/0x192 [ 58.469757] [<ffffffff810885bd>] kthread+0x82/0x8a [ 58.469760] [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10 [ 58.469763] [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112 [ 58.469765] [<ffffffff8141aed0>] ? gs_change+0xb/0xb [ 58.469768] kworker/u:0 S 0000000000000004 0 5 20x00000000quoted
[ 58.469773] ffff88082fcffe80 0000000000000046 ffff880800000000ffff88082fcfe010quoted
[ 58.469777] ffff88082fcea080 0000000000011440 ffff88082fcfffd8ffff88082fcfffd8quoted
[ 58.469781] 0000000000011440 0000000000000000 ffff88082fd4e9a0ffff88082fcea080quoted
[ 58.469785] Call Trace: [ 58.469786] <Same stack as pid 4> [ 58.470235] kworker/0:1 S 0000000000000000 0 13 20x00000000quoted
[ 58.470255] ffff88082fd3fe80 0000000000000046 ffff880800000000ffff88082fd3e010quoted
[ 58.470279] ffff88082fcee180 0000000000011440 ffff88082fd3ffd8ffff88082fd3ffd8quoted
[ 58.470301] 0000000000011440 0000000000000000 ffffffff8180b020ffff88082fcee180quoted
[ 58.470325] Call Trace: [ 58.470332] <Same stack as pid 4>The code looks OK(ish) to me. I'm still concerned that the implementation will miss lots of de-duplications because it is hashing random crud in the stack frame.quoted
Note: Non-x86 architectures will need to be updated since show_stack() now takes an additional argument.Well, we can't break all architectures. I can't think of a way to make the preprocessor convert show_stack(a, b) into show_stack(a, b, N) (this can be done in the other direction). So all I can think of is to rename x86 show_stack() to something else and do #define show_stack_something_else(a, b, c) show_stack(a, b) for other architectures. But on the other hand, why did the show_stack() interface get changed? show_stack() dumps a single tasks's stack, so top-level callers have no earthly reason to be passing the dup_stack_pid into show_stack(). dup_stack_pid is purely for many-task stackdumps. Also, the code as-is is pretty much useless for other architectures. The core changes in arch/x86/kernel/stacktrace.c look pretty generic - can we design and place this code so that all architectures can use it?quoted
The problem with small dmesg ring buffer like 512k is that only limitednumberquoted
of task traces will be logged. Sometimes we lose important informationonlyquoted
because of too many duplicated stack traces. This problem occurs whendumpingquoted
lots of stacks in a single operation, such as sysrq-T. This patch tries to reduce the duplication of task stack trace in thedumpquoted
message by hashing the task stack. The hashtable is a 32k pre-allocatedbufferquoted
during bootup. Then we hash the task stack with stack_depth 32 for eachstackquoted
entry. Each time if we find the identical task trace in the task stack,we dumpquoted
only the pid of the task which has the task trace dumped. So it is easyto backquoted
track to the full stack with the pid. ... +/* + * The implementation of stack trace dedup. It tries to reduce theduplicationquoted
+ * of task stack trace in the dump by hashing the stack trace. Thehashtable isquoted
+ * 32k pre-allocated buffer. Then we hash the task stack withstack_depthquoted
+ * DEDUP_MAX_STACK_DEPTH for each stack entry. Each time if an identicaltracequoted
+ * is found in the stack, we dump only the pid of previous task. So itis easyquoted
+ * to back track to the full stack with the pid. + */ +#define DEDUP_MAX_STACK_DEPTH 32 +#define DEDUP_STACK_HASH 32768 +#define DEDUP_STACK_ENTRIES (DEDUP_STACK_HASH/sizeof(struct task_stack)) +#define DEDUP_HASH_MAX_ITERATIONS 10It wouldn't hurt to document DEDUP_HASH_MAX_ITERATIONS (at least). But then, why does DEDUP_HASH_MAX_ITERATIONS exist? (below)quoted
+struct task_stack { + pid_t pid; + int len; + unsigned long hash; +}; + +static struct task_stack stack_hash_table[DEDUP_STACK_ENTRIES]; +static struct task_stack cur_stack; +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock); + +/* + * The stack hashtable uses linear probing to resolve collisions. + * We consider two stacks to be the same if their hash values andlengthsquoted
+ * are equal. + */ +static unsigned int stack_trace_lookup(void) +{ + int j; + int index; + unsigned int ret = 0; + struct task_stack *stack; + + index = cur_stack.hash % DEDUP_STACK_ENTRIES; + + for (j = 0; j < DEDUP_HASH_MAX_ITERATIONS; j++) { + stack = stack_hash_table + (index + j) %DEDUP_STACK_ENTRIES; (this would be more efficient if DEDUP_STACK_ENTRIES was a power of 2)quoted
+ if (stack->hash == 0) { + *stack = cur_stack; + ret = 0; + break; + } else { + if (stack->hash == cur_stack.hash && + stack->len == cur_stack.len) { + ret = stack->pid; + break; + } + } + } + if (j == DEDUP_HASH_MAX_ITERATIONS) + stack_hash_table[index] = cur_stack;Why stop there? Why not just append to stack_hash_table[]? When we first decide to do a multi-task stackdump, zero the index into the array. Each time a task is processed, look to see if it is unique and if so, add its task_stack to the end of the array. This may require adding a stacktrace_ops.start(). This could be done while moving stacktrace_ops (which advertises itself as a "Generic stack tracer"!) out of x86-specific code.quoted
+ memset(&cur_stack, 0, sizeof(cur_stack));Sane, but I'm not sure it's necessary.quoted
+ return ret; +} + ...Making this all arch-neutral is quite a bit of work, which you may not feel like undertaking, ho hum. Also, the lack of any documentation in that x86 code makes it unready for prime time.