Thread (3 messages) 3 messages, 3 authors, 2011-08-29

Re: [PATCH V5] Eliminate task stack trace duplication.

From: Ying Han <hidden>
Date: 2011-08-29 07:08:20

On Fri, Aug 26, 2011 at 4:19 PM, Andrew Morton [off-list ref]wrote:
(I'm back!)
Thank you Andrew for the comments.

Hmm, Looks like we still need some changes for this patch to get it merged
into -mm and I might be able to jump into it sometime next week. :)

--Ying
On Thu, 28 Jul 2011 18:25:59 -0700
Andrew Bresticker [off-list ref] wrote:
quoted
The problem with small dmesg ring buffer like 512k is that only limited
number
quoted
of task traces will be logged. Sometimes we lose important information
only
quoted
because of too many duplicated stack traces. This problem occurs when
dumping
quoted
lots of stacks in a single operation, such as sysrq-T.

This patch tries to reduce the duplication of task stack trace in the
dump
quoted
message by hashing the task stack. The hashtable is a 32k pre-allocated
buffer
quoted
during bootup. Then we hash the task stack with stack_depth 32 for each
stack
quoted
entry. Each time if we find the identical task trace in the task stack,
we dump
quoted
only the pid of the task which has the task trace dumped. So it is easy
to back
quoted
track to the full stack with the pid.

[   58.469730] kworker/0:0     S 0000000000000000     0     4      2
0x00000000
quoted
[   58.469735]  ffff88082fcfde80 0000000000000046 ffff88082e9d8000
ffff88082fcfc010
quoted
[   58.469739]  ffff88082fce9860 0000000000011440 ffff88082fcfdfd8
ffff88082fcfdfd8
quoted
[   58.469743]  0000000000011440 0000000000000000 ffff88082fcee180
ffff88082fce9860
quoted
[   58.469747] Call Trace:
[   58.469751]  [<ffffffff8108525a>] worker_thread+0x24b/0x250
[   58.469754]  [<ffffffff8108500f>] ? manage_workers+0x192/0x192
[   58.469757]  [<ffffffff810885bd>] kthread+0x82/0x8a
[   58.469760]  [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10
[   58.469763]  [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112
[   58.469765]  [<ffffffff8141aed0>] ? gs_change+0xb/0xb
[   58.469768] kworker/u:0     S 0000000000000004     0     5      2
0x00000000
quoted
[   58.469773]  ffff88082fcffe80 0000000000000046 ffff880800000000
ffff88082fcfe010
quoted
[   58.469777]  ffff88082fcea080 0000000000011440 ffff88082fcfffd8
ffff88082fcfffd8
quoted
[   58.469781]  0000000000011440 0000000000000000 ffff88082fd4e9a0
ffff88082fcea080
quoted
[   58.469785] Call Trace:
[   58.469786] <Same stack as pid 4>
[   58.470235] kworker/0:1     S 0000000000000000     0    13      2
0x00000000
quoted
[   58.470255]  ffff88082fd3fe80 0000000000000046 ffff880800000000
ffff88082fd3e010
quoted
[   58.470279]  ffff88082fcee180 0000000000011440 ffff88082fd3ffd8
ffff88082fd3ffd8
quoted
[   58.470301]  0000000000011440 0000000000000000 ffffffff8180b020
ffff88082fcee180
quoted
[   58.470325] Call Trace:
[   58.470332] <Same stack as pid 4>
The code looks OK(ish) to me.  I'm still concerned that the implementation
will miss lots of de-duplications because it is hashing random crud in
the stack frame.
quoted
Note: Non-x86 architectures will need to be updated since show_stack()
now takes an additional argument.
Well, we can't break all architectures.

I can't think of a way to make the preprocessor convert show_stack(a,
b) into show_stack(a, b, N) (this can be done in the other direction).
So all I can think of is to rename x86 show_stack() to something else and
do

#define show_stack_something_else(a, b, c) show_stack(a, b)

for other architectures.

But on the other hand, why did the show_stack() interface get changed?
show_stack() dumps a single tasks's stack, so top-level callers have no
earthly reason to be passing the dup_stack_pid into show_stack().
dup_stack_pid is purely for many-task stackdumps.

Also, the code as-is is pretty much useless for other architectures.
The core changes in arch/x86/kernel/stacktrace.c look pretty generic -
can we design and place this code so that all architectures can use it?

quoted
The problem with small dmesg ring buffer like 512k is that only limited
number
quoted
of task traces will be logged. Sometimes we lose important information
only
quoted
because of too many duplicated stack traces. This problem occurs when
dumping
quoted
lots of stacks in a single operation, such as sysrq-T.

This patch tries to reduce the duplication of task stack trace in the
dump
quoted
message by hashing the task stack. The hashtable is a 32k pre-allocated
buffer
quoted
during bootup. Then we hash the task stack with stack_depth 32 for each
stack
quoted
entry. Each time if we find the identical task trace in the task stack,
we dump
quoted
only the pid of the task which has the task trace dumped. So it is easy
to back
quoted
track to the full stack with the pid.


...

+/*
+ * The implementation of stack trace dedup. It tries to reduce the
duplication
quoted
+ * of task stack trace in the dump by hashing the stack trace. The
hashtable is
quoted
+ * 32k pre-allocated buffer. Then we hash the task stack with
stack_depth
quoted
+ * DEDUP_MAX_STACK_DEPTH for each stack entry. Each time if an identical
trace
quoted
+ * is found in the stack, we dump only the pid of previous task. So it
is easy
quoted
+ * to back track to the full stack with the pid.
+ */
+#define DEDUP_MAX_STACK_DEPTH 32
+#define DEDUP_STACK_HASH 32768
+#define DEDUP_STACK_ENTRIES (DEDUP_STACK_HASH/sizeof(struct task_stack))
+#define DEDUP_HASH_MAX_ITERATIONS 10
It wouldn't hurt to document DEDUP_HASH_MAX_ITERATIONS (at least).

But then, why does DEDUP_HASH_MAX_ITERATIONS exist? (below)
quoted
+struct task_stack {
+     pid_t pid;
+     int len;
+     unsigned long hash;
+};
+
+static struct task_stack stack_hash_table[DEDUP_STACK_ENTRIES];
+static struct task_stack cur_stack;
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock);
+
+/*
+ * The stack hashtable uses linear probing to resolve collisions.
+ * We consider two stacks to be the same if their hash values and
lengths
quoted
+ * are equal.
+ */
+static unsigned int stack_trace_lookup(void)
+{
+     int j;
+     int index;
+     unsigned int ret = 0;
+     struct task_stack *stack;
+
+     index = cur_stack.hash % DEDUP_STACK_ENTRIES;
+
+     for (j = 0; j < DEDUP_HASH_MAX_ITERATIONS; j++) {
+             stack = stack_hash_table + (index + j) %
DEDUP_STACK_ENTRIES;

(this would be more efficient if DEDUP_STACK_ENTRIES was a power of 2)
quoted
+             if (stack->hash == 0) {
+                     *stack = cur_stack;
+                     ret = 0;
+                     break;
+             } else {
+                     if (stack->hash == cur_stack.hash &&
+                         stack->len == cur_stack.len) {
+                             ret = stack->pid;
+                             break;
+                     }
+             }
+     }
+     if (j == DEDUP_HASH_MAX_ITERATIONS)
+             stack_hash_table[index] = cur_stack;
Why stop there?  Why not just append to stack_hash_table[]?  When we
first decide to do a multi-task stackdump, zero the index into the
array.  Each time a task is processed, look to see if it is unique and
if so, add its task_stack to the end of the array.

This may require adding a stacktrace_ops.start().  This could be done
while moving stacktrace_ops (which advertises itself as a "Generic
stack tracer"!) out of x86-specific code.
quoted
+     memset(&cur_stack, 0, sizeof(cur_stack));
Sane, but I'm not sure it's necessary.
quoted
+     return ret;
+}
+

...
Making this all arch-neutral is quite a bit of work, which you may not
feel like undertaking, ho hum.  Also, the lack of any documentation in
that x86 code makes it unready for prime time.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help