[PATCH] usb: ehci: make HC see up-to-date qh/qtd descriptor ASAP

From: Mark Salter <hidden>
Date: 2011-08-31 18:35:16
Also in: linux-omap

On Wed, 2011-08-31 at 13:19 -0500, Rob Herring wrote:

On 08/31/2011 12:51 PM, Will Deacon wrote:

On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote:

On Wed, 31 Aug 2011, Will Deacon wrote:

On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:

On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:

On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:

One question: why did this write buffer issue not happen on a UP ARMv7 platform, whose DMA buffer
is also uncached but bufferable?

Which CPU was on this platform?

Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
USB disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
nosmp on the command line, I see 20.3MB/s.

Can someone explain why nosmp would make such a difference?

Oh gawd, that's horrible. I have a feeling it's probably a separate issue
though, caused by:

omap_modify_auxcoreboot0(0x200, 0xfffffdff);

in boot_secondary for OMAP. Unfortunately I have no idea what that line is
doing because it ends up talking to the secure monitor.

Well, this issue is apparently affecting other ARMv7 implementations
too. In which case this code in arch/arm/mm/mmu.c could be responsible:

                if (is_smp()) {
                        /*
                         * Mark memory with the "shared" attribute
                         * for SMP systems
                         */
                        user_pgprot |= L_PTE_SHARED;
                        kern_pgprot |= L_PTE_SHARED;
                        vecs_pgprot |= L_PTE_SHARED;
                        mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S;
                        mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED;
                        mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S;
                        mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
                        mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
                        mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
                        mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
                        mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
                }

However I don't see the nosmp kernel argument having any effect on the 
result from is_smp().

Yes, the first thing that sprung to mind was the shared attribute, but like
you say, that doesn't seem to be affected by the nosmp command line
argument.

Another thing that Marc and I tried on OMAP4 was not bringing up the secondary
CPU during boot (by commenting out most of smp_init). In this case, I/O
performance was good until we tried to online the secondary CPU. The online
failed but after that the I/O performance was certainly degraded.

Was the SCU enabled at that point? One diff between nosmp boot and
offlining the 2nd core would be that the SCU remains enabled in the
latter case. I think the SCU does not get enabled for nosmp.

Do we really know which write buffer the data is sitting in? Some
experiments to only flush the L1 write buffer would be interesting.
Perhaps something executed on the 2nd core issues an mb() which doesn't help
under SMP because the other core's L1 write buffer is not flushed, but it
helps with nosmp because everything runs on one core and any occurrence of
an mb() will flush all data out. I wouldn't expect the behavior to be so
consistent though. Could it be something is not visible to the other
core rather than not visible to the EHCI controller?
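A toy C model of this hypothesis may make it more concrete. It assumes each core has a private write buffer and that a barrier drains only the buffer of the core executing it; the struct and function names (core_store, core_barrier, device_read) are hypothetical, and this is an illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not real kernel or hardware behavior. */
#define NCORES 2

struct core_wb {
    bool pending; /* a store is sitting in this core's write buffer */
    int value;    /* the buffered value */
};

struct system_model {
    struct core_wb wb[NCORES]; /* one private write buffer per core */
    int memory;                /* what the EHCI controller would read */
};

/* A CPU store lands in the issuing core's write buffer, not in memory. */
void core_store(struct system_model *s, int core, int value)
{
    assert(core >= 0 && core < NCORES);
    s->wb[core].pending = true;
    s->wb[core].value = value;
}

/* A barrier drains only the write buffer of the core that executes it. */
void core_barrier(struct system_model *s, int core)
{
    assert(core >= 0 && core < NCORES);
    if (s->wb[core].pending) {
        s->memory = s->wb[core].value;
        s->wb[core].pending = false;
    }
}

/* The device only ever sees what has reached memory. */
int device_read(const struct system_model *s)
{
    return s->memory;
}
```

Under this model, a barrier executed on core 1 leaves core 0's descriptor write invisible to the device, whereas with nosmp every barrier runs on the core that did the store and so always drains it.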

One experiment I did a few days ago was to pin processes and interrupts
to core#0 (except IPI and local timer). This didn't make any noticeable
difference.
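For reference, pinning a process to core 0 from user space can be done with sched_setaffinity(2); interrupts are steered separately by writing a CPU mask to /proc/irq/<N>/smp_affinity. A minimal sketch of the process side, assuming glibc (the helper names are made up for illustration):

```c
#define _GNU_SOURCE /* needed for cpu_set_t and the CPU_* macros in glibc */
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical helper: build a CPU mask containing only core 0. */
void build_core0_mask(cpu_set_t *set)
{
    assert(set != NULL);
    CPU_ZERO(set);
    CPU_SET(0, set);
}

/* Hypothetical helper: pin the calling thread (pid 0 means the caller)
 * to core 0. Children forked afterwards inherit the mask.
 * Returns 0 on success, or -1 with errno set, as sched_setaffinity() does. */
int pin_self_to_core0(void)
{
    cpu_set_t set;
    build_core0_mask(&set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

From a shell, `taskset -c 0 <command>` gives the same effect for a single workload without any code.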

My current understanding is that the writes are getting hung up in a
cache and not a write buffer. I am seeing delays of 10-15ms between
queuing the urb and getting an interrupt for urb completion. That
drops to a few hundred microseconds with the explicit flushing added
to the ehci driver. I don't see how any write buffer could hold data
that long without draining out on its own. What I see seems to suggest
that the memory is only coherent among the cores and not coherent for
CPU writes/device reads. Adding just a dsb() for the ehci flush does
not help. An outer_sync() is also necessary.
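The observation that dsb() alone does not help while dsb() plus outer_sync() does can be pictured with a two-stage toy model: a CPU-side stage that a dsb() drains, and an outer stage (e.g. the L2 controller) that only an outer sync drains to DRAM. This is an illustrative sketch, not the ARM kernel implementation; the struct and model_dsb(), model_outer_sync() are hypothetical stand-ins for the real dsb() and outer_sync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a descriptor write travelling from the CPU to DRAM.
 * Hypothetical names; this does not reflect real kernel internals. */
struct write_path {
    bool cpu_pending;   /* store held on the CPU side (e.g. L1 write buffer) */
    int  cpu_val;
    bool outer_pending; /* store held at the outer level (e.g. L2) */
    int  outer_val;
    int  dram;          /* value the EHCI controller would fetch */
};

/* A CPU store to the descriptor lands on the CPU side first. */
void cpu_store(struct write_path *p, int value)
{
    assert(p != NULL);
    p->cpu_pending = true;
    p->cpu_val = value;
}

/* Models dsb(): drains the CPU side, but only as far as the outer level. */
void model_dsb(struct write_path *p)
{
    assert(p != NULL);
    if (p->cpu_pending) {
        p->outer_pending = true;
        p->outer_val = p->cpu_val;
        p->cpu_pending = false;
    }
}

/* Models outer_sync(): drains whatever the outer level holds to DRAM. */
void model_outer_sync(struct write_path *p)
{
    assert(p != NULL);
    if (p->outer_pending) {
        p->dram = p->outer_val;
        p->outer_pending = false;
    }
}

/* The sequence that makes the write device-visible in this model. */
void model_flush_for_device(struct write_path *p)
{
    model_dsb(p);
    model_outer_sync(p);
}
```

In this model the device keeps reading the stale descriptor after model_dsb() until model_outer_sync() runs, which matches the reported behavior but does not by itself say whether the data is held in a cache or in the outer controller's buffer.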

--Mark
