Thread (7 messages), 2 authors, 2004-07-12

Re: ide-io.c, ide_do_request -- race condition?

From: Bartlomiej Zolnierkiewicz <hidden>
Date: 2004-07-10 20:01:37

On Saturday 10 of July 2004 21:25, Max T. Woodbury wrote:
Bartlomiej Zolnierkiewicz wrote:
Hi,

On Tuesday 06 of July 2004 00:51, Max T. Woodbury wrote:
(The fact that the machine runs other OSs without noticeable
problems is also an indication that the underlying hardware
is in working order.  Only the system software and disk
drive changed between the two setups and I have explained
why I do not think it is the disk drive.)
disk drive changed?  please explain
The drives in the Thinkpad 760 are mounted in caddies that can
be easily exchanged when the power is off.  I have three drives.
One runs the machine as a GPS, the second as a code development
Windows box and the third is my Linux code development machine.
I'm having a fair amount of trouble getting the Linux setup to do
what I want it to do.  Not only did the Linux install flake out,
but I still can't get the PCMCIA sockets working, but that's another
issue for another list and I haven't quite got enough information
on that set of problems to make a request for help useful...  In
order to get to the internet with Linux I have to use its docking
station.  No such problem with Windoze.  (Yeah, absolutely
disgusting but that's what's happening.)
Are you sure that 'Linux' disk is okay?
http://smartmontool.sf.net
__cli() there is just "paranoia" and it is gone in 2.6 kernels
That's not quite correct.  There is a check and a BUG() call to assure
that interrupts are disabled on entry in the 2.6 code I've seen.  If I
understand the new code correctly, you've replaced the single interrupt
disable call at the top of this routine by a bunch of similar calls
elsewhere before entering this routine.  That would make interrupt latency
worse, not better.
This is not correct - __cli() is really just "paranoia"; you may remove
it if you like and it shouldn't change anything (but we would like to know
if it does change something, i.e. fixes the fs corruption :-).

Please take a look at generic_unplug_device() in drivers/block/ll_rw_blk.c:

spin_lock_irq() disables IRQs
__generic_unplug_device() calls queue->request_fn (ide_do_request)
spin_unlock_irq() enables IRQs
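
The three steps above can be sketched in userspace C. This is only a model, not the kernel source: every mock_* name, the two boolean flags and the struct are hypothetical stand-ins, and the real functions take a lock or queue argument that is omitted here. The asserts in the request function play the role of the "interrupts are disabled on entry" check mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for kernel state; none of this is the kernel API. */
static bool irqs_enabled = true;   /* models the CPU interrupt-enable flag */
static bool queue_locked = false;  /* models the per-queue spinlock */
static int  requests_run = 0;      /* counts calls into the request function */

struct mock_queue {
    void (*request_fn)(struct mock_queue *q);  /* e.g. the IDE request handler */
};

/* Models spin_lock_irq(): take the lock and disable interrupts. */
static void mock_spin_lock_irq(void)
{
    irqs_enabled = false;
    queue_locked = true;
}

/* Models spin_unlock_irq(): drop the lock and enable interrupts. */
static void mock_spin_unlock_irq(void)
{
    queue_locked = false;
    irqs_enabled = true;
}

/* Stand-in for ide_do_request(): it relies on the caller's lockout. */
static void mock_ide_do_request(struct mock_queue *q)
{
    (void)q;
    assert(!irqs_enabled);  /* the caller must already have IRQs off */
    assert(queue_locked);   /* and must hold the queue lock */
    requests_run++;
}

/* Models the locked inner helper, which just calls the request function. */
static void mock_unplug_locked(struct mock_queue *q)
{
    q->request_fn(q);
}

/* Models generic_unplug_device(): lock, run the request function, unlock. */
static void mock_generic_unplug_device(struct mock_queue *q)
{
    mock_spin_lock_irq();
    mock_unplug_locked(q);
    mock_spin_unlock_irq();
}
```

In this model the request function never disables interrupts itself; the lockout is already in place when it is entered, which is the point being made about __cli().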

The only difference between 2.4 and 2.6 is that 2.4 is using
spin_lock_irqsave() / spin_unlock_irqrestore() variants.
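
The practical difference between the two variants can be shown with a small userspace model. This is a sketch under assumptions: the mock_* helpers and the single boolean flag are hypothetical, the real kernel calls also take a lock argument, and the saved flags are an unsigned long there rather than a bool. The plain _irq pair enables interrupts unconditionally on unlock, while the irqsave/irqrestore pair puts back whatever state the caller had.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the CPU interrupt-enable flag; not kernel state. */
static bool irqs_enabled = true;

/* Plain _irq pair: unlock always enables interrupts. */
static void mock_lock_irq(void)   { irqs_enabled = false; }
static void mock_unlock_irq(void) { irqs_enabled = true; }

/* irqsave/irqrestore pair: remember the state on entry, restore it on exit. */
static void mock_lock_irqsave(bool *flags)
{
    *flags = irqs_enabled;
    irqs_enabled = false;
}

static void mock_unlock_irqrestore(bool flags)
{
    irqs_enabled = flags;
}

/* Returns the interrupt state after a critical section using the _irq pair. */
static bool after_irq_variant(bool enabled_on_entry)
{
    irqs_enabled = enabled_on_entry;
    mock_lock_irq();
    mock_unlock_irq();
    return irqs_enabled;
}

/* Returns the interrupt state after a section using the irqsave pair. */
static bool after_irqsave_variant(bool enabled_on_entry)
{
    bool flags;

    irqs_enabled = enabled_on_entry;
    mock_lock_irqsave(&flags);
    mock_unlock_irqrestore(flags);
    return irqs_enabled;
}
```

The two only differ when the caller enters with interrupts already disabled; in both cases the request function itself runs with interrupts off.
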
bunch of code has been executing under interrupt lockout when
there was no need for the lockout.  Not a huge problem, just
strange.  Also, in 2.6, the lockout has to begin before the
routine is called which is why I said 2.6 was worse.
2.6 is much better - you have one spinlock per block queue while
in 2.4 you have one global spinlock (io_request_lock) for all
block requests.
Yep.  That's a little coarser than the model I was using on the never
completed OS design I did in the early 70s, but it is better than the
single global lock in 2.4 and way better than the design of many other
OSs I've waded into.  Still, you've got a complete interrupt lockout
in place at the top of this routine which has two bad effects: 1) the
interrupt latency is longer and 2) there is no one place to turn it off
any longer.
The 'lockout' happens earlier in both 2.4 and 2.6, in generic_unplug_device().
Thanks.  I was hoping to get your attention, but I did not want to
presume on your time, thus the post to linux-ide.  (If you don't
mind, linux-kernel is way too noisy.  I subscribed once a good while
ago and turned it off because I could not handle the volume of just
plain junk that gets posted to that list.  Linus must be some kind of
saint if he wades through all of it...)
well, I don't read everything and I guess Linus does the same 8)
I've been going through the linux-ide archives and noticed
that there have been a number of mystery fs corruption issues
that just disappeared.  This might be related.  There was also
a DMA problem that might have been relevant, but I know it does
not apply in this case since "hdparm" shows DMA turned off by
default on this machine.
dmesg output would be helpful, the same goes for lspci output
That is an important part of this issue.  Nothing shows in dmesg
until it is much too late.  The read errors get reported, but no
write errors.  There should be a 'printk' in 'ide_abort' and
'idedisk_abort' (I may have the routine names wrong, I'm doing
this from memory) but there isn't (I'll post a patch for that fix
if you want), so I can't tell if the problem is coming down from
the upper layers.  I also think there should be a 'printk' associated
with the posting of the immediate stop command.  (Again, this is from
memory.  I'll post a patch with all this if you want me to.  It will
not fix any problems, but might shed light.)
I believe that dmesg/lspci would be useful for me or other people reading
this because it allows us to know a bit more about this specific hardware
('Thinkpad 760' is really not enough).
Still, this is an important problem.  File system corruption is just not
something an OS should allow to happen unless the user does something
extreme.
Without more info we won't go further in solving this issue.

Bartlomiej