Oops in guest after ioremap() on ARMv7

From: catalin.marinas@arm.com (Catalin Marinas)
Date: 2011-12-22 18:13:56

On Thu, Dec 22, 2011 at 04:38:23PM +0000, David Vrabel wrote:

On 22/12/11 14:49, Catalin Marinas wrote:

On Thu, Dec 22, 2011 at 12:08:07PM +0000, David Vrabel wrote:

When running the Linux kernel on the ARMv7 envelope model as a guest
under the Xen hypervisor there is an oops (see below for an example of
the page translation fault) when trying to access ioremap()'d memory.

The translation tables for userspace seem to be also affected.  The
program repeatedly faults with a translation fault on the same address.
Putting a flush_cache_all() after the call to handle_mm_fault() in
__do_page_fault() makes userspace work as well.

With the classic page tables, on A15 we need this patch:

http://git.kernel.org/?p=linux/kernel/git/cmarinas/linux.git;a=commitdiff_plain;h=27cbbe6b1e17fa0b954edd37e26d601bdd7766a6

But that's to do with TLBs rather than cache and it only shows on real
hardware rather than model.

The same kernel works fine when not running under the hypervisor.

It's a 3.2.0-rc5+ kernel with the two additional linux-arch-arm
branches: arm-arch/vexpress and arm-arch/arm-lpae.

Calling flush_cache_all() in flush_cache_vmap() makes it work.  What
isn't being correctly flushed?  I see that flush_pmd_entry() and
cpu_v7_set_pte_ext() already flush the L1 and L2 translation table
entries and I can't think of anything else that would need to be flushed
(unless the mapped virtual addresses need to be flushed as well?)

The "Barrier Litmus Tests and Cookbook" says that a TLB flush and a
branch predictor flush are required after a translation table entry
update.  This seems not to be done, but adding it didn't seem to help
(and using local_flush_tlb_all() in flush_cache_vmap() didn't help either).

I don't see anything in the hypervisor that could be causing this as the
fault is occurring at stage 1 and not stage 2 translation.

Interesting error, I don't have an immediate idea of what might be
wrong, just some questions.

What's the value of the VTCR register for this guest? Are the
translation table walks marked as cacheable? Also, are the page table
attributes Normal Cacheable in the stage 2 translation? The processor
chooses the more restrictive attribute between stage 1 and stage 2.
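
As a rough illustration of the "more restrictive attribute wins" rule, here is a
minimal sketch in Python. It is a simplification: it treats inner and outer
cacheability as one value and ignores shareability. The function name and the
tuple representation are hypothetical and only meant to show the ordering.

```python
# Simplified model of combining stage 1 and stage 2 memory attributes.
# Assumption: a higher rank means a more restrictive attribute, and the
# more restrictive of the two stages is the one that takes effect.
MEMORY_TYPE_RANK = {"normal": 0, "device": 1, "strongly-ordered": 2}
CACHEABILITY_RANK = {"write-back": 0, "write-through": 1, "non-cacheable": 2}


def combine_attributes(stage1, stage2):
    """Each stage is a (memory_type, cacheability) tuple.

    cacheability is only meaningful for "normal" memory and may be None
    for device or strongly-ordered memory.
    """
    mem_type = max(stage1[0], stage2[0], key=MEMORY_TYPE_RANK.__getitem__)
    if mem_type != "normal":
        # Device and strongly-ordered memory are never cached.
        return (mem_type, None)
    cacheability = max(stage1[1], stage2[1], key=CACHEABILITY_RANK.__getitem__)
    return (mem_type, cacheability)


# Stage 1 asks for write-back, stage 2 only allows non-cacheable.
print(combine_attributes(("normal", "write-back"), ("normal", "non-cacheable")))
```

So a guest mapping that is write-back at stage 1 would still end up
non-cacheable if the hypervisor's stage 2 entry says so.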

VTCR = 0x80002558 which is: Outer Shareable; Normal memory, outer
write-back write-allocate cacheable; Normal memory, inner write-back,
write-allocate cacheable.
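
For anyone wanting to check the decoding above, a small Python sketch follows.
The field positions are my reading of the ARMv7-A Virtualization Extensions
VTCR layout (T0SZ in bits [3:0], S in bit 4, SL0 in [7:6], IRGN0 in [9:8],
ORGN0 in [11:10], SH0 in [13:12]); the helper name and dictionary keys are
hypothetical.

```python
# Decode the translation-walk attribute fields of an ARMv7 VTCR value.
# Assumed layout: T0SZ[3:0], S[4], SL0[7:6], IRGN0[9:8], ORGN0[11:10], SH0[13:12].
CACHE_POLICY = {
    0b00: "non-cacheable",
    0b01: "write-back, write-allocate",
    0b10: "write-through",
    0b11: "write-back, no write-allocate",
}
SHAREABILITY = {
    0b00: "non-shareable",
    0b01: "reserved",
    0b10: "outer shareable",
    0b11: "inner shareable",
}


def decode_vtcr(value):
    t0sz = value & 0xF
    if (value >> 4) & 1:
        # The S bit is the sign extension of the 4-bit T0SZ field.
        t0sz -= 16
    return {
        "T0SZ": t0sz,
        "SL0": (value >> 6) & 0x3,
        "IRGN0": CACHE_POLICY[(value >> 8) & 0x3],
        "ORGN0": CACHE_POLICY[(value >> 10) & 0x3],
        "SH0": SHAREABILITY[(value >> 12) & 0x3],
    }


fields = decode_vtcr(0x80002558)
for name, meaning in fields.items():
    print(f"{name}: {meaning}")
```

With 0x80002558 this reports outer shareable with write-back, write-allocate
for both the inner and outer walk attributes, matching the description above.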

L3 TT entries for stage 2 have the following attributes:
Outer-Shareable; Normal, inner write-back cacheable; Normal, outer
write-back cacheable.

These look sensible to me.

They look fine (UP system). BTW, I assume that the hypervisor also
flushes the caches and TLBs for the stage 2 translation tables.

It could as well be a model bug but people are on holiday at the moment
(and I'm off shortly as well, until 3rd of January). Could you try to
disable the cacheability of the page table walks for both stage 1 (TTBRx
with classic page tables or TTBCR with LPAE) and stage 2 (VTCR)? Since
Linux does the correct cache flushing and I assume the hypervisor as
well, this may work around a possible model bug.
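
To make the suggested experiment concrete, here is a sketch in Python of which
bits would need clearing to mark the table walks non-cacheable in VTCR and in
an LPAE TTBCR. It assumes the IRGN0/ORGN0 fields sit in bits [9:8] and [11:10]
of both registers and IRGN1/ORGN1 in bits [25:24] and [27:26] of TTBCR; the
function names and the TTBCR sample value are hypothetical. The classic
(non-LPAE) TTBRx encoding is different and not covered here.

```python
# Masks for the table-walk cacheability fields (assumed bit positions).
IRGN0_MASK = 0x3 << 8
ORGN0_MASK = 0x3 << 10
IRGN1_MASK = 0x3 << 24  # TTBCR with LPAE only
ORGN1_MASK = 0x3 << 26  # TTBCR with LPAE only


def vtcr_noncacheable_walks(vtcr):
    """Return the VTCR value with stage 2 walks set to non-cacheable (0b00)."""
    return vtcr & ~(IRGN0_MASK | ORGN0_MASK) & 0xFFFFFFFF


def ttbcr_lpae_noncacheable_walks(ttbcr):
    """Return the LPAE TTBCR value with TTBR0 and TTBR1 walks non-cacheable."""
    mask = IRGN0_MASK | ORGN0_MASK | IRGN1_MASK | ORGN1_MASK
    return ttbcr & ~mask & 0xFFFFFFFF


# VTCR value reported earlier in this thread.
print(hex(vtcr_noncacheable_walks(0x80002558)))
# Hypothetical TTBCR value with EAE set and write-back walks for both halves.
print(hex(ttbcr_lpae_noncacheable_walks(0x85000500)))
```

The shareability and size fields are left alone, so only the walk
cacheability changes between the two runs being compared.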

-- 
Catalin
