Thread (4 messages) 4 messages, 2 authors, 2009-01-31

Re: 2.6.28: All I/O, D-state: 3ware 9650SE-16ML/Intel/965/MTRR/PAT?

From: Justin Piszcz <hidden>
Date: 2009-01-31 17:37:58
Also in: linux-ide, linux-xfs, lkml

I think it was a coincidence when I tried to login before, I tried it 
again (after trying to re-produce the problem, which I had been able to 
do) and this time the mouse worked but the keyboard did not..even running
echo b > sysrq-trigger did not reboot the server..  Had to resort to 
manual method.

I am now on kernel 2.6.28.2 and will see if I can make it happen again.

Here are the relevant traces:
http://home.comcast.net/~jpiszcz/20090131/stacktrace-cpus.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-trigger-m.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-trigger-t.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-trigger-w.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-t.txt
http://home.comcast.net/~jpiszcz/20090131/xfs.txt

The stacktace for each CPU is interesting, see below, XFS related(?)

Can anyone offer any suggestions on what to look for or how to debug this
further?

$ df -i
/dev/sda1            2539035008 1571332 2537463676    1% /r1

Should I be mounting with -o inode64?

/dev/sda1 on /r1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)

[2604823.699927] SysRq : Show backtrace of all active CPUs
[2604823.699993] CPU0:
[2604823.699996]  ffffffff807d8f38 0000000000000046 ffff880228b516e8 0000000000000000
[2604823.700000]  ffff8801f4f93588 000000000000000e ffff8801a8843e98 ffffffff8046ff72
[2604823.700000]  ffff880228b516c0 ffffffff80272a8c ffff8801f4f935a8 ffff8801f4f93450
[2604823.700000] Call Trace:
[2604823.700000]  <IRQ>  [<ffffffff8046ff72>] showacpu+0x42/0x60
[2604823.700000]  [<ffffffff80272a8c>] generic_smp_call_function_interrupt+0x4c/0xf0
[2604823.700000]  [<ffffffff8023855f>] smp_call_function_interrupt+0x1f/0x40
[2604823.700000]  [<ffffffff8022a3bb>] call_function_interrupt+0x6b/0x70
[2604823.700000]  <EOI>  [<ffffffff802790d6>] find_get_pages+0x56/0xf0
[2604823.700000]  [<ffffffff802790b4>] find_get_pages+0x34/0xf0
[2604823.700000]  [<ffffffff80281cd7>] pagevec_lookup+0x17/0x20
[2604823.700000]  [<ffffffff803e8459>] xfs_cluster_write+0x99/0x170
[2604823.700000]  [<ffffffff803e8bdb>] xfs_page_state_convert+0x52b/0x6d0
[2604823.700000]  [<ffffffff803e9071>] xfs_vm_writepage+0x71/0x130
[2604823.700000]  [<ffffffff802849c0>] shrink_page_list+0x600/0x760
[2604823.700000]  [<ffffffff80284d9f>] shrink_list+0x27f/0x640
[2604823.700000]  [<ffffffff8040c9e1>] __up_write+0x21/0x150
[2604823.700000]  [<ffffffff8028095d>] determine_dirtyable_memory+0xd/0x20
[2604823.700000]  [<ffffffff802809cd>] get_dirty_limits+0x1d/0x290
[2604823.700000]  [<ffffffff803cd468>] xfs_iomap+0x188/0x310
[2604823.700000]  [<ffffffff802853a6>] shrink_zone+0x246/0x340
[2604823.700000]  [<ffffffff802861de>] try_to_free_pages+0x22e/0x390
[2604823.700000]  [<ffffffff80283330>] isolate_pages_global+0x0/0x270
[2604823.700000]  [<ffffffff8027f472>] __alloc_pages_internal+0x212/0x4e0
[2604823.700000]  [<ffffffff80281644>] __do_page_cache_readahead+0xf4/0x240
[2604823.700000]  [<ffffffff80281ae2>] ondemand_readahead+0x172/0x1f0
[2604823.700000]  [<ffffffff8027af39>] generic_file_aio_read+0x329/0x620
[2604823.700000]  [<ffffffff803f08da>] xfs_read+0x12a/0x280
[2604823.700000]  [<ffffffff802a3ed3>] do_sync_read+0xe3/0x130
[2604823.700000]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2604823.700000]  [<ffffffff80608577>] thread_return+0x3d/0x5c6
[2604823.700000]  [<ffffffff802a4aa8>] vfs_read+0xc8/0x170
[2604823.700000]  [<ffffffff802a4c53>] sys_read+0x53/0xa0
[2604823.700000]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2604823.700089] CPU2:
[2604823.700091]  ffff88022b8d3f38 0000000000000046 ffff880228b516e8 0000000000000002
[2604823.700094]  00007fff4e576050 00007fff4e576048 00007fff4e57605c ffffffff8046ff72
[2604823.700097]  ffff880228b516c0 ffffffff80272a8c 0000000000000000 ffff880100003eb0
[2604823.700099] Call Trace:
[2604823.700101]  <IRQ>  [<ffffffff8046ff72>] showacpu+0x42/0x60
[2604823.700105]  [<ffffffff80272a8c>] generic_smp_call_function_interrupt+0x4c/0xf0
[2604823.700108]  [<ffffffff8023855f>] smp_call_function_interrupt+0x1f/0x40
[2604823.700110]  [<ffffffff8022a3bb>] call_function_interrupt+0x6b/0x70
[2604823.700111]  <EOI>  [<ffffffff8040f426>] copy_user_generic_string+0x36/0x40
[2604823.700116]  [<ffffffff8025a7e2>] sys_rt_sigprocmask+0x62/0x120
[2604823.700119]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
p34:~#


On Sat, 31 Jan 2009, Justin Piszcz wrote:
I have opened a case with 3ware about this, but I do not believe the 3ware 
card is at fault, it appears something to do with X or the memory subsystem.
This is not easily reproducible and does not occur often. Maybe once or twice 
per month, if that, hopefully someone can make out what is happening in the 
trace below?

KERNEL:
# uname -a
Linux p34.internal.lan 2.6.28 #9 SMP Thu Jan 1 08:07:31 EST 2009 x86_64 
GNU/Linux

PROBLEM:
During I/O to the 3ware RAID (at least in the past 3-4 previous cases when it 
occured), the system stops all I/O to all processes that are have I/O to 
perform on the 3ware card.  I did a lot of analysis which is shown at the 
bottom of this e-mail.

ODDITY:
After all of the diagnostics I did below(!), what ended up "bringing" the 
system back to life?  I moved the mouse cursor on the local machine (I run X) 
and it uses the Intel graphics (Intel DG965WH motherboard), once I logged in, 
the ENTIRE system became responsive again, what the heck!?  All of the 
processes in D-state were now freed!  I have taken an echo t > 
/proc/sysrq-trigger, hopefully someone can shed some light on this?  It 
almost seems like the memory/MTRR(?) is stomping the other memory that is 
used by the system?

QUESTION: Why??
At the bottom of this e-mail (and also the 3ware ticket):
Request:-  31st January 2009 at 2:24
       Just as I was about to give up, I went to the machine, woke up the 
screensaver (moved mouse) and logged in and then EVERYTHING started working 
again??

NOTE:
After all of this, I re-ran the unrar, it worked fine.  My guess is X is 
doing
something very strange with the memory?  I can try running no X for awhile
and see if that gets rid of the problem?  Is that the recommended course of
action, are there other things I can try to debug this problem further? Has
anyone ever seen anything like this?

CONFIG:
http://home.comcast.net/~jpiszcz/20090131/config-2.6.28.txt

LOGS: (from Xorg.log)
(II) intel(0): EDID vendor "DEL", prod id 40984
(II) intel(0): Output VGA connected
(II) intel(0): Output VGA using initial mode 1680x1050
(II) intel(0): detected 512 kB GTT.
(II) intel(0): detected 7676 kB stolen memory. <- Is this of interest?
(==) intel(0): video overlay key set to 0x101fe
(==) intel(0): Will not try to enable page flipping
(==) intel(0): Triple buffering disabled
(==) intel(0): Intel XvMC decoder disabled
(==) intel(0): Using gamma correction (1.0, 1.0, 1.0)
(**) intel(0): Display dimensions: (430, 270) mm
(**) intel(0): DPI set to (99, 158)

KERNEL OPTIONS: (of interest?):
 | |    [*] MTRR (Memory Type Range Register) support                    | |
 | |    [*]   MTRR cleanup support                                       | |
 | |    (0)     MTRR cleanup enable value (0-1)                          | |
 | |    (1)     MTRR cleanup spare reg num (0-7)                         | |
 | |    [*]   x86 PAT support                                            | |

Should I consider disabling MTRR, x86 PAT support or both?  To see if the
problem recurs? Are there any other kernel options I have enabled that could
lead to this problem?

LSPCI:
00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 
02)
00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 
02)
00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated 
Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82P965/G965 HECI 
Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network 
Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI 
Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio 
Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 
(rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 
(rev 02)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 
(rev 02)
00:1c.3 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 4 
(rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 
(rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI 
Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HH (ICH8DH) LPC Interface 
Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port 
SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 
02)
01:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID (rev 01)
03:00.0 IDE interface: Marvell Technology Group Ltd. 88SE6101 single-port 
PATA133 interface (rev b1)
07:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet 
Controller (rev 02)
07:01.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet 
Controller

INTERRUPTS:
# cat /proc/interrupts
          CPU0       CPU1       CPU2       CPU3
 0:       8722          0          0          0   IO-APIC-edge      timer
 1:          2          0          0          0   IO-APIC-edge      i8042
 9:          0          0          0          0   IO-APIC-fasteoi   acpi
12:          3          0          0          0   IO-APIC-edge      i8042
16:   10253239   34561425   28715428   20097192   IO-APIC-fasteoi   3w-9xxx, 
uhci_hcd:usb3
18:       1532         38          0          0   IO-APIC-fasteoi 
ehci_hcd:usb1, uhci_hcd:usb7
19:      43319      66498          0          0   IO-APIC-fasteoi 
uhci_hcd:usb6
21:    1503630  282221348   34906481  251529888   IO-APIC-fasteoi 
uhci_hcd:usb4, eth1
22:    3041085   50898734   44446192  106189021   IO-APIC-fasteoi   HDA 
Intel, eth2
23:   13630119   27422086    1544853      13244   IO-APIC-fasteoi 
ehci_hcd:usb2, uhci_hcd:usb5
375:    3059314   15607219    8335846    3335608   PCI-MSI-edge      ahci
376:    2259672  275715231   18496746  107799784   PCI-MSI-edge      eth0
377:    2653268    4958739          0          0   PCI-MSI-edge 
i915@pci:0000:00:02.0
NMI:          0          0          0          0   Non-maskable interrupts
LOC: 2603771749 2610047685 2594237486 2589369293   Local timer interrupts
RES:    5511462    1894029    4550761    1766978   Rescheduling interrupts
CAL:      15747      15686      15135       6535   Function call interrupts
TLB:     749115     747241     962808     893832   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
MIS:          0

DETAILS (warning! very long):

Request:-  31st January 2009 at 1:45
       This problem has happened 3 or 4 times now, it is very strange, it 
occurs across several kernels, the one I am running now is (2.6.28).  It 
occurs across two schedulers (so it is unlikely the kernel); noop and 
deadline.

Essentially what happens is all I/O is blocked during decompression of a file 
and it stays that way:

Extracting from backup.r81

...         backup-20090101.img 
87%

Extracting from backup.r82

...         backup-20090101.img 
88%

Extracting from backup.r83

...         backup-20090101.img 
89%

Extracting from backup.r84

...         backup-20090101.img 
90%

Here, it is *stuck* at 90%:

 PID USER      PR  NI  VIRT SWAP  RES  SHR S %CPU %MEM   TIME P COMMAND
 348 user      20   0 18616  16m 1572 1108 R  100  0.0  21:59 0 unrar

It sits at 100%, this has happened during VERIFYs and also when the system is 
just plain idle.  It seems to be a very nasty bug because the process _never_ 
seems to complete.

When this bug occurs, it appears all I/O to the array is blocked:
$ dd if=/dev/zero of=create_file bs=1M count=1024

It just sits there, when I try to strace the process, I get nothing:
$ strace -p 770
Process 770 attached - interrupt to quit

I tried some things:

# /opt/3ware/9500/tw_cli  /c0 set verify=5
Setting Verify Rate on /c0 to 5 (Fastest I/O)...Done.

# /opt/3ware/9500/tw_cli  /c0 set verify=1
Setting Verify Rate on /c0 to 1 (Fastest Verify)...Done.

# /opt/3ware/9500/tw_cli  /c0 set verify=3
Setting Verify Rate on /c0 to 3 (Balanced Verify and I/O )...Done.

No difference.

In its current failure mode:
//p34> info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    VERIFYING      -       91%(A)  64K     12107.1   ON     ON u1 
SPARE     OK             -       -       -       931.505   -      ON


It is verifying, maybe it will 'unblock' the I/O when its done verifying?

As you can see, the box is stagnant during this problem:
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
5  0  83104  47396    136 6586588    0    0   580   442    3    0  1  1 98 
0
5  0  83104  47380    136 6586612    0    0     0     0 4143  129  0 25 75 
0
5  0  83104  47380    136 6586612    0    0     0     0 4063  147  0 25 75 
0
5  0  83104  47380    136 6586612    0    0     0     0 4052  139  0 25 75 
0
5  0  83104  47256    136 6586612    0    0     0     0 4137  199  0 25 75 
0
5  0  83104  47256    136 6586612    0    0     0     0 4916  160  0 25 75 
0

Here is an iostat on the device (/dev/sda):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.00    0.00   25.00    0.00    0.00   75.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00 
0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.00    0.00   25.06    0.00    0.00   74.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00 
0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.00    0.00   25.00    0.00    0.00   75.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00 
0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.00    0.00   25.06    0.00    0.00   74.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sda1              0.00     0.00    0.00    0.00     0.00     0.00     0.00 
0.00    0.00   0.00   0.00

I will let it sit for now and see if it "comes to" after the VERIFY is 
complete.  In the interim I have attached the tw_cli /c0 show diag output, 
but that looks normal as well, at least the most recent output, this looks 
like it could be a deeper bug than just something on the surface.

The 25% is probably 1 of the CPUs (quad core) pegged at 100% trying to get 
I/O and the controller is not allowing it!

This problem was *NEVER* experienced when I was not using the 3ware card, it 
only happened after I began using this card in my machine.

# /opt/3ware/9500/tw_cli  /c0 show all
/c0 Driver Version = 2.26.02.011
/c0 Model = 9650SE-16ML
/c0 Available Memory = 224MB
/c0 Firmware Version = FE9X 4.06.00.004
/c0 Bios Version = BE9X 4.05.00.015
/c0 Boot Loader Version = BL9X 3.08.00.001
/c0 Serial Number = [[ snip ]]
/c0 PCB Version = Rev 032
/c0 PCHIP Version = 2.00
/c0 ACHIP Version = 1.90
/c0 Number of Ports = 16
/c0 Number of Drives = 16
/c0 Number of Units = 2
/c0 Total Optimal Units = 2
/c0 Not Optimal Units = 0 /c0 JBOD Export Policy = off
/c0 Disk Spinup Policy = 1
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Controller Bus Type = PCIe
/c0 Controller Bus Width = 8 lanes
/c0 Controller Bus Speed = 2.5 Gbps/lane

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    VERIFYING      -       92%(A)  64K     12107.1   ON     ON u1 
SPARE     OK             -       -       -       931.505   -      ON

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     931.51 GB   1953525168    WD-...
p1     OK               u0     931.51 GB   1953525168    WD-...
p2     OK               u0     931.51 GB   1953525168    WD-...
p3     OK               u0     931.51 GB   1953525168    WD-...
p4     OK               u0     931.51 GB   1953525168    WD-...
p5     OK               u0     931.51 GB   1953525168    WD-...
p6     OK               u0     931.51 GB   1953525168    WD-...
p7     OK               u0     931.51 GB   1953525168    WD-...
p8     OK               u0     931.51 GB   1953525168    WD-...
p9     OK               u0     931.51 GB   1953525168    WD-...
p10    OK               u0     931.51 GB   1953525168    WD-...
p11    OK               u0     931.51 GB   1953525168    WD-...
p12    OK               u0     931.51 GB   1953525168    WD-...
p13    OK               u0     931.51 GB   1953525168    WD-...
p14    OK               u0     931.51 GB   1953525168    WD-...
p15    OK               u1     931.51 GB   1953525168    WD-...

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       255    28-Dec-2008

Once this is "over" then I can attach the tw_cli diag output right now its 
locking/blocking any I/O, I cannot even ssh to the host, even though the root 
partition/drives are not on the 3ware controller!!

root      8694  0.0  0.0      0     0 ?        D    Jan29   2:21 [pdflush]
root      2226  0.0  0.0  10100   404 ?        Ds   Jan01   0:14 
/sbin/syslogd -r

Lots of processes in D-state.


Request:-  31st January 2009 at 2:8
       Nope, its still LOCKED/blocked/etc:

# /opt/3ware/9500/tw_cli  info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK             -       -       64K     12107.1   ON     ON u1 
SPARE     OK             -       -       -       931.505   -      ON

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     931.51 GB   1953525168    WD-
p1     OK               u0     931.51 GB   1953525168    WD-
p2     OK               u0     931.51 GB   1953525168    WD-
p3     OK               u0     931.51 GB   1953525168    WD-
p4     OK               u0     931.51 GB   1953525168    WD-
p5     OK               u0     931.51 GB   1953525168    WD-
p6     OK               u0     931.51 GB   1953525168    WD-
p7     OK               u0     931.51 GB   1953525168    WD-
p8     OK               u0     931.51 GB   1953525168    WD-
p9     OK               u0     931.51 GB   1953525168    WD-
p10    OK               u0     931.51 GB   1953525168    WD-
p11    OK               u0     931.51 GB   1953525168    WD-
p12    OK               u0     931.51 GB   1953525168    WD-
p13    OK               u0     931.51 GB   1953525168    WD-
p14    OK               u0     931.51 GB   1953525168    WD-
p15    OK               u1     931.51 GB   1953525168    WD-

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       255    28-Dec-2008

Completely blocking all I/O..

p34:~# ps auxww | grep -i unrar
user       348 99.6  0.0  18616  1572 pts/19   R+   04:05  60:59 unrar x 
backup01.rar
root       865  0.0  0.0   4220   724 pts/29   S+   05:06   0:00 grep -i 
unrar
p34:~# kill -9 348
p34:~# p34:~# killall -9 dd
p34:~#

Still...... BLOCKED...



Request:-  31st January 2009 at 2:13
       I was able to get a call trace, when I tried to write it to a file 
though, it was BLOCKED, so this is from a free ssh session I had open:

[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff80264979>] autoremove_wake_function+0x9/0x30
[2578818.429166]  [<ffffffff8024597a>] __wake_up_common+0x5a/0x90
[2578818.429166]  [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
[2578818.429166]  [<ffffffff80264a83>] prepare_to_wait_exclusive+0x23/0x80
[2578818.429166]  [<ffffffff805dcad6>] unix_wait_for_peer+0x86/0xd0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff80575cd7>] memcpy_fromiovec+0x57/0x80
[2578818.429166]  [<ffffffff805dd170>] unix_dgram_sendmsg+0x450/0x5a0
[2578818.429166]  [<ffffffff8056d77b>] sock_sendmsg+0xcb/0x100
[2578818.429166]  [<ffffffff802af076>] do_path_lookup+0xd6/0x1a0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff802af15d>] kern_path+0x1d/0x40
[2578818.429166]  [<ffffffff8025408e>] current_fs_time+0x1e/0x30
[2578818.429166]  [<ffffffff802baa9d>] mnt_want_write+0x2d/0xa0
[2578818.429166]  [<ffffffff802bb637>] mntput_no_expire+0x27/0x140
[2578818.429166]  [<ffffffff805dccc2>] unix_find_other+0x1a2/0x200
[2578818.429166]  [<ffffffff8056cd62>] sockfd_lookup_light+0x22/0x90
[2578818.429166]  [<ffffffff8056db89>] sys_sendto+0x119/0x180
[2578818.429166]  [<ffffffff805dd5ba>] unix_dgram_connect+0xaa/0x250
[2578818.429166]  [<ffffffff8056d046>] sys_connect+0x86/0xe0
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] cron          S ffff88022aae7bc8     0   861   3044
[2578818.429166]  ffff8801005b9b88 0000000000000086 00007f68d6c77000 
ffffe2000453ab70
[2578818.429166]  ffffffff807cd000 ffff8801005b5260 ffff880211f8f160 
ffff8801005b5490
[2578818.429166]  000000032abe1b60 ffff8801005b9d18 ffff8801005b5490 
ffffffff8028b45b
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff8028b45b>] handle_mm_fault+0x2bb/0x7a0
[2578818.429166]  [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
[2578818.429166]  [<ffffffff80264a83>] prepare_to_wait_exclusive+0x23/0x80
[2578818.429166]  [<ffffffff805dcad6>] unix_wait_for_peer+0x86/0xd0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff80575cd7>] memcpy_fromiovec+0x57/0x80
[2578818.429166]  [<ffffffff805dd170>] unix_dgram_sendmsg+0x450/0x5a0
[2578818.429166]  [<ffffffff8056d77b>] sock_sendmsg+0xcb/0x100
[2578818.429166]  [<ffffffff802af076>] do_path_lookup+0xd6/0x1a0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff802af15d>] kern_path+0x1d/0x40
[2578818.429166]  [<ffffffff8025408e>] current_fs_time+0x1e/0x30
[2578818.429166]  [<ffffffff802baa9d>] mnt_want_write+0x2d/0xa0
[2578818.429166]  [<ffffffff802bb637>] mntput_no_expire+0x27/0x140
[2578818.429166]  [<ffffffff805dccc2>] unix_find_other+0x1a2/0x200
[2578818.429166]  [<ffffffff8056cd62>] sockfd_lookup_light+0x22/0x90
[2578818.429166]  [<ffffffff8056db89>] sys_sendto+0x119/0x180
[2578818.429166]  [<ffffffff805dd5ba>] unix_dgram_connect+0xaa/0x250
[2578818.429166]  [<ffffffff8056d046>] sys_connect+0x86/0xe0
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] cron          S 00000000ffffffff     0   862   3044
[2578818.429166]  ffff8801005bbb88 0000000000000086 00007f68d6c77000 
ffffe2000453ab70
[2578818.429166]  ffffffff807cd000 ffff8801005b4cf0 ffff88022b8950e0 
ffff8801005b4f20
[2578818.429166]  0000000129928420 0000000199ae5567 ffff8801005b4f20 
ffffffff8028b45b
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff8028b45b>] handle_mm_fault+0x2bb/0x7a0
[2578818.429166]  [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
[2578818.429166]  [<ffffffff80264a83>] prepare_to_wait_exclusive+0x23/0x80
[2578818.429166]  [<ffffffff805dcad6>] unix_wait_for_peer+0x86/0xd0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff80575cd7>] memcpy_fromiovec+0x57/0x80
[2578818.429166]  [<ffffffff805dd170>] unix_dgram_sendmsg+0x450/0x5a0
[2578818.429166]  [<ffffffff8056d77b>] sock_sendmsg+0xcb/0x100
[2578818.429166]  [<ffffffff802af076>] do_path_lookup+0xd6/0x1a0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff802af15d>] kern_path+0x1d/0x40
[2578818.429166]  [<ffffffff8025408e>] current_fs_time+0x1e/0x30
[2578818.429166]  [<ffffffff802baa9d>] mnt_want_write+0x2d/0xa0
[2578818.429166]  [<ffffffff802bb637>] mntput_no_expire+0x27/0x140
[2578818.429166]  [<ffffffff805dccc2>] unix_find_other+0x1a2/0x200
[2578818.429166]  [<ffffffff8056cd62>] sockfd_lookup_light+0x22/0x90
[2578818.429166]  [<ffffffff8056db89>] sys_sendto+0x119/0x180
[2578818.429166]  [<ffffffff805dd5ba>] unix_dgram_connect+0xaa/0x250
[2578818.429166]  [<ffffffff8056d046>] sys_connect+0x86/0xe0
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b


Request:-  31st January 2009 at 2:16
       These are the 'optimizations' I use on this array:

 echo 128 > /sys/block/$i/queue/max_sectors_kb   # 128 is default
 echo 512 > /sys/block/$i/queue/nr_requests      # 128 is default
 # echo 254 > /sys/block/$i/device/queue_depth   # 254 is default
 echo deadline > /sys/block/$i/queue/scheduler   # distribution dependent
 blockdev --setra 16384 /dev/$i                  # set readahead

In the past I had been able to kill the process(es) involved or they died 
eventually, however, this does not seem to be the case this time, I still 
cannot get my system back!

Request:-  31st January 2009 at 2:20
       Here is more of the echo t > sysrq-trigger:

[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff805de205>] unix_stream_sendmsg+0x3b5/0x3d0
[2578818.429166]  [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
[2578818.429166]  [<ffffffff80264b03>] prepare_to_wait+0x23/0x80
[2578818.429166]  [<ffffffff805ddc92>] unix_stream_recvmsg+0x432/0x5f0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff802aeb6c>] __link_path_walk+0xdbc/0xf20
[2578818.429166]  [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
[2578818.429166]  [<ffffffff802a3ed3>] do_sync_read+0xe3/0x130
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff802a4b3e>] vfs_read+0x15e/0x170
[2578818.429166]  [<ffffffff802a4c53>] sys_read+0x53/0xa0
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] sshd          S 00000000ffffffff     0 32669  32667
[2578818.429166]  ffff8801c4b15998 0000000000000086 ffffffff80755008 
ffffffff8040c2c0
[2578818.429166]  ffffffff807cd000 ffff88022b968700 ffff88022b8f2bb0 
ffff88022b968930
[2578818.429166]  0000000328057c18 0000000199b05a5c ffff88022b968930 
ffff8801c4b159a8
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff8040c2c0>] rb_insert_color+0x100/0x130
[2578818.429166]  [<ffffffff8026800f>] hrtimer_start_range_ns+0xdf/0x1b0
[2578818.429166]  [<ffffffff80267eb9>] hrtimer_try_to_cancel+0x39/0x80
[2578818.429166]  [<ffffffff80609415>] schedule_hrtimeout_range+0xb5/0x150
[2578818.429166]  [<ffffffff802678e0>] hrtimer_wakeup+0x0/0x30
[2578818.429166]  [<ffffffff802b35d9>] do_select+0x539/0x630
[2578818.429166]  [<ffffffff802b3d30>] __pollwait+0x0/0x130
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80575f57>] memcpy_toiovec+0x57/0x80
[2578818.429166]  [<ffffffff8060a529>] _spin_lock_bh+0x9/0x20
[2578818.429166]  [<ffffffff8056fcc3>] release_sock+0x13/0xa0
[2578818.429166]  [<ffffffff805ab7dc>] tcp_recvmsg+0x32c/0x940
[2578818.429166]  [<ffffffff8056f110>] sock_common_recvmsg+0x30/0x50
[2578818.429166]  [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
[2578818.429166]  [<ffffffff802b38cf>] core_sys_select+0x1ff/0x310
[2578818.429166]  [<ffffffff80240bb0>] do_page_fault+0x2f0/0x980
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff8026aa98>] getnstimeofday+0x48/0xc0
[2578818.429166]  [<ffffffff802308b9>] read_tsc+0x9/0x20
[2578818.429166]  [<ffffffff8026aa98>] getnstimeofday+0x48/0xc0
[2578818.429166]  [<ffffffff802b3c71>] sys_select+0x51/0x110
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] bash          S 00000000ffffffff     0 32670  32669
[2578818.429166]  ffff88022607de88 0000000000000082 ffff8800b773cc60 
00000000006c4e5c
[2578818.429166]  ffffffff807cd000 ffff8801006a8410 ffffffff8071a340 
ffff8801006a8640
[2578818.429166]  00000000006a8410 000000019977482f ffff8801006a8640 
ffffffff807258c8
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff8045be2b>] tty_check_change+0x11b/0x130
[2578818.429166]  [<ffffffff8025191e>] session_of_pgrp+0xe/0x50
[2578818.429166]  [<ffffffff8045c6f2>] tty_ioctl+0x8b2/0x920
[2578818.429166]  [<ffffffff80252872>] do_wait+0x272/0x350
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff802529e5>] sys_wait4+0x95/0xf0
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] sr            S 00000000ffffffff     0 32677  32670
[2578818.429166]  ffff88008f50fe88 0000000000000086 ffff88010ccebbb8 
00000000006c4e5c
[2578818.429166]  ffffffff807cd000 ffff880148a6f260 ffffffff8071a340 
ffff880148a6f490
[2578818.429166]  0000000048a6f260 000000019977482f ffff880148a6f490 
ffffffff807258c8
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff80252872>] do_wait+0x272/0x350
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff802529e5>] sys_wait4+0x95/0xf0
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] screen        S 00000000ffffffff     0 32683  32677
[2578818.429166]  ffff88020a9e5f68 0000000000000086 ffff88022ba8f160 
ffff88020a9e5f28
[2578818.429166]  ffffffff807cd000 ffff88022ba8f160 ffff88022b8950e0 
ffff88022ba8f390
[2578818.429166]  000000015afa2f00 0000000199b0f5e0 ffff88022ba8f390 
00007fff5afa5be2
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff80253955>] alarm_setitimer+0x35/0x70
[2578818.429166]  [<ffffffff8025a2e9>] sys_pause+0x19/0x30
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] sshd          S 00000000ffffffff     0 32684  10156
[2578818.429166]  ffff880198c4bbe8 0000000000000082 ffff880228808000 
ffff880198c4bb68
[2578818.429166]  ffffffff807cd000 ffff880100552090 ffff88022b8f2bb0 
ffff8801005522c0
[2578818.429166]  0000000398c4bb68 000000019977506f ffff8801005522c0 
ffff8800058e33c0
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff805de205>] unix_stream_sendmsg+0x3b5/0x3d0
[2578818.429166]  [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
[2578818.429166]  [<ffffffff80264b03>] prepare_to_wait+0x23/0x80
[2578818.429166]  [<ffffffff805ddc92>] unix_stream_recvmsg+0x432/0x5f0
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff802aeb6c>] __link_path_walk+0xdbc/0xf20
[2578818.429166]  [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
[2578818.429166]  [<ffffffff802a3ed3>] do_sync_read+0xe3/0x130
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff802a4b3e>] vfs_read+0x15e/0x170
[2578818.429166]  [<ffffffff802a4c53>] sys_read+0x53/0xa0
[2578818.429166]  [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2578818.429166] sshd          S 00000000ffffffff     0 32686  32684
[2578818.429166]  ffff880040dff998 0000000000000086 ffff880040dff9a8 
ffffffff8040c250
[2578818.429166]  ffffffff807cd000 ffff8801e96411e0 ffff88022b8f2bb0 
ffff8801e9641410
[2578818.429166]  0000000328057c18 0000000199b0a381 ffff8801e9641410 
ffff880040dff9a8
[2578818.429166] Call Trace:
[2578818.429166]  [<ffffffff8040c250>] rb_insert_color+0x90/0x130
[2578818.429166]  [<ffffffff8026800f>] hrtimer_start_range_ns+0xdf/0x1b0
[2578818.429166]  [<ffffffff80267eb9>] hrtimer_try_to_cancel+0x39/0x80
[2578818.429166]  [<ffffffff80609415>] schedule_hrtimeout_range+0xb5/0x150
[2578818.429166]  [<ffffffff802678e0>] hrtimer_wakeup+0x0/0x30
[2578818.429166]  [<ffffffff802b35d9>] do_select+0x539/0x630
[2578818.429166]  [<ffffffff802b3d30>] __pollwait+0x0/0x130
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80249e30>] default_wake_function+0x0/0x10
[2578818.429166]  [<ffffffff80575f57>] memcpy_toiovec+0x57/0x80
[2578818.429166]  [<ffffffff8060a529>] _spin_lock_bh+0x9/0x20
[2578818.429166]  [<ffffffff8056fcc3>] release_sock+0x13/0xa0
[2578818.429166]  [<ffffffff805ab7dc>] tcp_recvmsg+0x32c/0x940
[2578818.429166]  [<ffffffff8056f110>] sock_common_recvmsg+0x30/0x50
[2578818.429166]  [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
[2578818.429166]  [<ffffffff802b38cf>] core_sys_select+0x1ff/0x310
[2578818.429166]  [<ffffffff8045e7a1>] n_tty_read+0x371/0x870
[2578818.429166]  [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2578818.429166]  [<ffffffff80249e30>] default_wake_fun

Request:-  31st January 2009 at 2:24
       Just as I was about to give up, I went to the machine, woke up the 
screensaver (moved mouse) and logged in and then EVERYTHING started working 
again??
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help