Thread (87 messages) 87 messages, 12 authors, 2007-03-08

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

From: Thomas Gleixner <hidden>
Date: 2007-03-07 22:59:10
Also in: lkml

On Wed, 2007-03-07 at 14:05 -0800, Jeremy Fitzhardinge wrote:
Thomas Gleixner wrote:
quoted
This is tinkering of the best. My understanding of the paravirt
discussion at Kernel Summit was, that paravirt ops are exactly there to
prevent the above random hackery in the kernel and to allow _ALL_
hypervisors to interact via a sane interface inside of the kernel.
  
No, I don't think that was ever the intent.  The idea was to create a
new interface for things which don't currently have an interface in the
kernel, such as how to run the CPU in ring 1 and manage pagetable
updates.  But an important and explicit intent of the project was to use
existing kernel interfaces where possible, rather than try to make
pv_ops an monster all-encompassing interface.
Maybe I missunderstood. 

Still there is a difference between using existing kernel interfaces and
abusing them in a way which makes modifications to the core kernel code
hard and unmaintainable. See below.
Using the new time infrastructure was an explicit example of that.  We
anticipated that different hypervisors would have different ways of
doing time, but all would be easily accommodated by the
clocksource/events infrastructure, and so each would have its own
implementation for these interfaces.  From the kernel's perspective,
they're just another time device, and we manage to avoid making any core
kernel changes, or bloating the pv_ops interface.  It seems like a
natural use of the clock subsystem's design.
On the other hand we yet see things like:

        /* We use normal irq0 handler on cpu0. */
        time_init_hook();

Which is just reaching into the kernel code directly and does not handle
the clock event interrupt self contained. clockevents is not bound to
IRQ0 and this kind of hackery is exactly what we need to avoid in order
to get this maintainable.

Once this is used by paravirt implementations a change to the
mach-default implementation will break stuff left and right.

Also the whole LAPIC business is so horrible, that it hurts. The generic
interrupt layer is there since almost a year and we still see the crude
emulation of hardware and assumptions of irq0 setup all over the place.

We carefully need to define, which existing kernel interfaces are used /
hooked in which way.

If the paravirt implementations actually use the already available
abstractions in the way in which those abstractions are designed, then
we get into a maintainable design. If there are shortcomings on those
abstractions we need to fix them in a sane way or provide a _common_
workaround (e.g. 128 bit math back and forth library) without impacting
the main kernel code.

Looking at vmitimer.c and the number of hardcoded assumptions are
telling me, that we are heading in exactly the opposite direction.
quoted
You are just perverting the whole idea of a standartized
paravirtualization interface.

This things can be done for clocksources, clockevents, interrupts (the
generic irq code allows this) and probaly for a whole bunch of other
stuff.
  
Yes, exactly.  The entirety of the Xen support consists of not only an
implementation of the paravirt_ops interface, but also the Xen
clocksource and clockevents and the Xen irqchip.  My hope and intent is
that we can shrink the paravirt_ops interface in favour of using
existing generally useful kernel interfaces.
Yes, if they are used in a sane and self contained way without reaching
all over the place and expecting that those functions, which are not
part of the paravirt interfaces will work for ever.
quoted
The current paravirt interface is completely insane and will explode
into an unmaintainable nightmare within no time, if we keep accepting
that crap further.
  
No, that's exactly what we've been trying to avoid.

If we start patching in new paravirt_ops to deal with time, interrupts,
or whatever piece of functionality which already has a perfectly good
kernel interface, then we're just increasing the size of the pv_ops
interface, its entanglement with the rest of the system and the amount
of potential legacy stuff which gets dragged around as the interface
evolves.
You are not increasing the entanglement with the rest of the system,
when you use a self contained device on top of an existing core kernel
infrastructure, which has a paravirt backend. Quite the contrary, you
have one piece of virtual hardware which is connected to the kernel and
interacts with the various incarnations on the other side, which can as
well live inside the kernel code. Granted it is another level of
indirection, but I'd be happy to have only to deal with one of those
beasts.
As hardware gets better at supporting virtualization directly, we're
going to see more hybrid para- and fully- virtualized hypervisor
interfaces.  The result will be that more and more of paravirt_ops will
be implemented by the "native" versions of the functions; maybe at some
point the whole thing will evaporate away.

It's not a huge reach to expect the hardware vendors to get a clue about
time hardware (scratch that, of course it is, but we can always hope)
hehe. There is always hope, but reality is so frustrating :)
and come up with something that is directly usable from either an OS
running natively or from within a virtual machine.  In that case, I'm
sure you'd agree it would warrant a real clocksource/event
implementation.  
Yes
In the scheme I'm proposing, that's no big deal; you
just register the hardware driver, and that's that.  But what you're
proposing leaves this vestigial interface sitting in pv_ops, doing
nothing other than being redundant.
Fair enough.
My principle goal here is to get the Xen code into the kernel, and I'm
being pragmatic about it.  If you think having a xen_clocksource is an
absolute blocker to merging this stuff, then I'll add the interface to
pv_ops, and we'll work out how to wire all the hypervisors up underneath
that interface.  But I think it's precisely the wrong way to go from an
overall kernel perspective.
No it's not an absolute blocker, as long as we can take care, that the
number of incarnations is 

- designed to be shareable between hypervisors which have the same time
model
- common code like the 128 bit math is in a shared library
- self contained and not reaching out into core kernel code for no good
reason

Same goes for clock events, interrupts and other core facilities.

Thanks,

	tglx
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help