Re: [PATCH V5] Eliminate task stack trace duplication.

From: Ying Han <hidden>
Date: 2011-08-29 07:08:20

On Fri, Aug 26, 2011 at 4:19 PM, Andrew Morton [off-list ref]wrote:

(I'm back!)

Thank you Andrew for the comments.

Hmm, Looks like we still need some changes for this patch to get it merged
into -mm and I might be able to jump into it sometime next week. :)

--Ying

On Thu, 28 Jul 2011 18:25:59 -0700
Andrew Bresticker [off-list ref] wrote:

quoted

The problem with small dmesg ring buffer like 512k is that only limited

number

quoted

of task traces will be logged. Sometimes we lose important information

only

quoted

because of too many duplicated stack traces. This problem occurs when

dumping

quoted

lots of stacks in a single operation, such as sysrq-T.

This patch tries to reduce the duplication of task stack trace in the

dump

quoted

message by hashing the task stack. The hashtable is a 32k pre-allocated

buffer

quoted

during bootup. Then we hash the task stack with stack_depth 32 for each

stack

quoted

entry. Each time if we find the identical task trace in the task stack,

we dump

quoted

only the pid of the task which has the task trace dumped. So it is easy

to back

quoted

track to the full stack with the pid.

[   58.469730] kworker/0:0     S 0000000000000000     0     4      2

0x00000000

quoted

[   58.469735]  ffff88082fcfde80 0000000000000046 ffff88082e9d8000

ffff88082fcfc010

quoted

[   58.469739]  ffff88082fce9860 0000000000011440 ffff88082fcfdfd8

ffff88082fcfdfd8

quoted

[   58.469743]  0000000000011440 0000000000000000 ffff88082fcee180

ffff88082fce9860

quoted

[   58.469747] Call Trace:
[   58.469751]  [<ffffffff8108525a>] worker_thread+0x24b/0x250
[   58.469754]  [<ffffffff8108500f>] ? manage_workers+0x192/0x192
[   58.469757]  [<ffffffff810885bd>] kthread+0x82/0x8a
[   58.469760]  [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10
[   58.469763]  [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112
[   58.469765]  [<ffffffff8141aed0>] ? gs_change+0xb/0xb
[   58.469768] kworker/u:0     S 0000000000000004     0     5      2

0x00000000

quoted

[   58.469773]  ffff88082fcffe80 0000000000000046 ffff880800000000

ffff88082fcfe010

quoted

[   58.469777]  ffff88082fcea080 0000000000011440 ffff88082fcfffd8

ffff88082fcfffd8

quoted

[   58.469781]  0000000000011440 0000000000000000 ffff88082fd4e9a0

ffff88082fcea080

quoted

[   58.469785] Call Trace:
[   58.469786] <Same stack as pid 4>
[   58.470235] kworker/0:1     S 0000000000000000     0    13      2

0x00000000

quoted

[   58.470255]  ffff88082fd3fe80 0000000000000046 ffff880800000000

ffff88082fd3e010

quoted

[   58.470279]  ffff88082fcee180 0000000000011440 ffff88082fd3ffd8

ffff88082fd3ffd8

quoted

[   58.470301]  0000000000011440 0000000000000000 ffffffff8180b020

ffff88082fcee180

quoted

[   58.470325] Call Trace:
[   58.470332] <Same stack as pid 4>

The code looks OK(ish) to me.  I'm still concerned that the implementation
will miss lots of de-duplications because it is hashing random crud in
the stack frame.

quoted

Note: Non-x86 architectures will need to be updated since show_stack()
now takes an additional argument.

Well, we can't break all architectures.

I can't think of a way to make the preprocessor convert show_stack(a,
b) into show_stack(a, b, N) (this can be done in the other direction).
So all I can think of is to rename x86 show_stack() to something else and
do

#define show_stack_something_else(a, b, c) show_stack(a, b)

for other architectures.

But on the other hand, why did the show_stack() interface get changed?
show_stack() dumps a single tasks's stack, so top-level callers have no
earthly reason to be passing the dup_stack_pid into show_stack().
dup_stack_pid is purely for many-task stackdumps.

Also, the code as-is is pretty much useless for other architectures.
The core changes in arch/x86/kernel/stacktrace.c look pretty generic -
can we design and place this code so that all architectures can use it?

quoted

The problem with small dmesg ring buffer like 512k is that only limited

number

quoted

of task traces will be logged. Sometimes we lose important information

only

quoted

because of too many duplicated stack traces. This problem occurs when

dumping

quoted

lots of stacks in a single operation, such as sysrq-T.

This patch tries to reduce the duplication of task stack trace in the

dump

quoted

message by hashing the task stack. The hashtable is a 32k pre-allocated

buffer

quoted

during bootup. Then we hash the task stack with stack_depth 32 for each

stack

quoted

entry. Each time if we find the identical task trace in the task stack,

we dump

quoted

only the pid of the task which has the task trace dumped. So it is easy

to back

quoted

track to the full stack with the pid.


...

+/*
+ * The implementation of stack trace dedup. It tries to reduce the

duplication

quoted

+ * of task stack trace in the dump by hashing the stack trace. The

hashtable is

quoted

+ * 32k pre-allocated buffer. Then we hash the task stack with

stack_depth

quoted

+ * DEDUP_MAX_STACK_DEPTH for each stack entry. Each time if an identical

trace

quoted

+ * is found in the stack, we dump only the pid of previous task. So it

is easy

quoted

+ * to back track to the full stack with the pid.
+ */
+#define DEDUP_MAX_STACK_DEPTH 32
+#define DEDUP_STACK_HASH 32768
+#define DEDUP_STACK_ENTRIES (DEDUP_STACK_HASH/sizeof(struct task_stack))
+#define DEDUP_HASH_MAX_ITERATIONS 10

It wouldn't hurt to document DEDUP_HASH_MAX_ITERATIONS (at least).

But then, why does DEDUP_HASH_MAX_ITERATIONS exist? (below)

quoted

+struct task_stack {
+     pid_t pid;
+     int len;
+     unsigned long hash;
+};
+
+static struct task_stack stack_hash_table[DEDUP_STACK_ENTRIES];
+static struct task_stack cur_stack;
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock);
+
+/*
+ * The stack hashtable uses linear probing to resolve collisions.
+ * We consider two stacks to be the same if their hash values and

lengths

quoted

+ * are equal.
+ */
+static unsigned int stack_trace_lookup(void)
+{
+     int j;
+     int index;
+     unsigned int ret = 0;
+     struct task_stack *stack;
+
+     index = cur_stack.hash % DEDUP_STACK_ENTRIES;
+
+     for (j = 0; j < DEDUP_HASH_MAX_ITERATIONS; j++) {
+             stack = stack_hash_table + (index + j) %

DEDUP_STACK_ENTRIES;

(this would be more efficient if DEDUP_STACK_ENTRIES was a power of 2)

quoted

+             if (stack->hash == 0) {
+                     *stack = cur_stack;
+                     ret = 0;
+                     break;
+             } else {
+                     if (stack->hash == cur_stack.hash &&
+                         stack->len == cur_stack.len) {
+                             ret = stack->pid;
+                             break;
+                     }
+             }
+     }
+     if (j == DEDUP_HASH_MAX_ITERATIONS)
+             stack_hash_table[index] = cur_stack;

Why stop there?  Why not just append to stack_hash_table[]?  When we
first decide to do a multi-task stackdump, zero the index into the
array.  Each time a task is processed, look to see if it is unique and
if so, add its task_stack to the end of the array.

This may require adding a stacktrace_ops.start().  This could be done
while moving stacktrace_ops (which advertises itself as a "Generic
stack tracer"!) out of x86-specific code.

quoted

+     memset(&cur_stack, 0, sizeof(cur_stack));

Sane, but I'm not sure it's necessary.

quoted

+     return ret;
+}
+

...

Making this all arch-neutral is quite a bit of work, which you may not
feel like undertaking, ho hum.  Also, the lack of any documentation in
that x86 code makes it unready for prime time.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help