[PATCH v4 06/12] KVM: arm64: implement basic ITS register handlers

From: andre.przywara@arm.com (Andre Przywara)
Date: 2016-06-03 15:42:34
Also in: kvm, kvmarm

Hi,

I addressed some comments of yours in v5, just some rationale for the
others (sorry for the late reply on these!):

...

quoted

+static int vgic_mmio_read_its_typer(struct kvm_vcpu *vcpu,
+				    struct kvm_io_device *this,
+				    gpa_t addr, int len, void *val)
+{
+	u64 reg = GITS_TYPER_PLPIS;
+
+	/*
+	 * We use linear CPU numbers for redistributor addressing,
+	 * so GITS_TYPER.PTA is 0.
+	 * To avoid memory waste on the guest side, we keep the
+	 * number of IDBits and DevBits low for the time being.
+	 * This could later be made configurable by userland.
+	 * Since we have all collections in linked list, we claim
+	 * that we can hold all of the collection tables in our
+	 * own memory and that the ITT entry size is 1 byte (the
+	 * smallest possible one).

All of this is going to bite us when we want to implement migration,
specially the HW collection bit.

How so? Clearly we keep the number of VCPUs constant, so everything
guest visible stays the same even upon migrating to a much different
host? Or are we talking about different things here?

Let me rephrase this: How are you going to address HW collections from
userspace in order to dump them? Short of having memory tables, you cannot.

I was just hoping for addressing first things first. Having the
collection table entirely in the kernel makes things easier for now.

How is that save/restore story covered by a hardware ITS? Is it just
ignored since it's not needed/used there?

That being said I would be happy to change this when we get migration
support for the ITS register state and we have agreed on an API for
userland ITS save/restore.

quoted

+	 */
+	reg |= 0xff << GITS_TYPER_HWCOLLCNT_SHIFT;
+	reg |= 0x0f << GITS_TYPER_DEVBITS_SHIFT;
+	reg |= 0x0f << GITS_TYPER_IDBITS_SHIFT;
+
+	write_mask64(reg, addr & 7, len, val);
+
+	return 0;
+}

.....

quoted

+static int vgic_mmio_read_its_idregs(struct kvm_vcpu *vcpu,
+				     struct kvm_io_device *this,
+				     gpa_t addr, int len, void *val)
+{
+	struct vgic_io_device *iodev = container_of(this,
+						    struct vgic_io_device, dev);
+	u32 reg = 0;
+	int idreg = (addr & ~3) - iodev->base_addr + GITS_IDREGS_BASE;
+
+	switch (idreg) {
+	case GITS_PIDR2:
+		reg = GIC_PIDR2_ARCH_GICv3;

Are we leaving the lowest 4 bits to zero?

I couldn't stuff "42" in 4 bits, so: yes ;-)

The 4 bottom bits are "Implementation Defined", so you could fit
something in it if you wanted.

quoted

+		break;
+	case GITS_PIDR4:
+		/* This is a 64K software visible page */
+		reg = 0x40;

Same question.

Also, how about all the others PIDR registers?

I guess I was halfway undecided between going with the pure
architectural requirements (which only mandate bits 7:4 in PIDR2) and
complying with the recommendation later in the spec.
So I will just fill PIDR0, PIDR2 and the rest of PIDR2 as well.

quoted

+		break;
+	/* Those are the ID registers for (any) GIC. */
+	case GITS_CIDR0:
+		reg = 0x0d;
+		break;
+	case GITS_CIDR1:
+		reg = 0xf0;
+		break;
+	case GITS_CIDR2:
+		reg = 0x05;
+		break;
+	case GITS_CIDR3:
+		reg = 0xb1;
+		break;
+	}

Given that these values are directly taken from the architecture, and
seem common to the whole GICv3 architecture when implemented by ARM, we
could have a common handler for the whole GICv3 implementatuin. Not a
bit deal though.

I tried it, but it turned out to be more involved than anticipated, so I
shelved it for now. I can make a patch on top later.

quoted

Agreed.

quoted

+
+	write_mask32(reg, addr & 3, len, val);
+
+	return 0;
+}

....

quoted

+/*
+ * By writing to CWRITER the guest announces new commands to be processed.
+ * Since we cannot read from guest memory inside the ITS spinlock, we
+ * iterate over the command buffer (with the lock dropped) until the read
+ * pointer matches the write pointer. Other VCPUs writing this register in the
+ * meantime will just update the write pointer, leaving the command
+ * processing to the first instance of the function.
+ */
+static int vgic_mmio_write_its_cwriter(struct kvm_vcpu *vcpu,
+				       struct kvm_io_device *this,
+				       gpa_t addr, int len, const void *val)
+{
+	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+	struct vgic_its *its = &dist->its;
+	gpa_t cbaser = its_cmd_buffer_base(vcpu->kvm);
+	u64 cmd_buf[4];
+	u32 reg;
+	bool finished;
+
+	reg = mask64(its->cwriter & 0xfffe0, addr & 7, len, val);
+	reg &= 0xfffe0;
+	if (reg > its_cmd_buffer_size(vcpu->kvm))
+		return 0;
+
+	spin_lock(&its->lock);
+
+	/*
+	 * If there is still another VCPU handling commands, let this
+	 * one pick up the new CWRITER and process "our" new commands as well.
+	 */

How do you detect that condition? All I see is a massive race here, with
two threads processing the queue in parallel, possibly corrupting each
other's data.

Please explain why you think this is safe.

OK, here you go: Actually we are handling the command queue
synchronously: The first writer to CWRITER will end up here and will
iterate over all commands below. From the guests point of view it will
take some time to do the CWRITER MMIO write, but upon the writel
returning the queue is actually already empty again.
That means that any _sane_ guest will _never_ trigger the case where two
threads/VCPUs are handling the queue, because a concurrent write to
CWRITER without a lock in the guest would be broken by design.

That feels very fragile, and you're relying on a well behaved guest.

No, I don't rely on it, as mentioned in the next sentence.
I just don't optimize for that case, because in reality it should never
happen - apart from a malicious guest. Let me think about the security
implications, though (whether a misbehaved guest could hog a physical CPU).

quoted

However I am not sure we can rely on this, so I added this precaution
scheme:

quoted

The first writer takes our ITS (emulation) lock and examines cwriter and
creadr to see if they are equal. If they are, then the queue is
currently empty, which means we are the first one and need to take care
of the commands. We update our copy of cwriter (now marking the queue as
non-empty) and drop the lock. As we kept the queue-was-empty status in a
local variable, we now can start iterating through the queue. We only
take the lock briefly when really needed in this process (for instance
when one command has been processed and creadr is incremented).

If now a second VCPU writes CWRITER, we also take the lock and compare
creadr and cwriter. Now they are different, because the first thread is
still busy with handling the commands and hasn't finished yet.
So we set our local "finished" to true. Also we update cwriter to the
new value, then drop the lock. The while loop below will not be entered
in this thread and we return to the guest.
Now the first thread has handled another command, takes the lock to
increase creadr and compares it with cwriter. Without the second writer
it may have been finished already, but now cwriter has grown, so it will
just carry on with handling the commands.

A bit like someone washing the dishes while another person adds some
more dirty plates to the pile ;-)

I'm sorry, this feels incredibly complicated, and will make actual
debugging/tracing a nightmare. It also penalize the first vcpu that gets
to submitting a command, leading to scheduling imbalances.

Well, as mentioned that would never happen apart from that
malicious/misbehaved guest case - so I don't care so much about
scheduling imbalances _for the guest_. As said above I have to think
about how this affects the host cores, though.

And frankly I don't see how this is overly complicated (in the context
of SMP locking schemes in general) - apart from me explaining it badly
above. It's just a bit tricky, but guarantees a strict order of command
processing.
On another approach we could just qualify a concurrent access as illegal
and ignore it - on real hardware you would have no guarantee for this to
work either - issuing two CWRITER MMIO writes at the same time would
just go bonkers.

Why can't you simply turn the ITS lock into a mutex and keep it held,
processing each batch in the context of the vcpu that is writing to
GITC_CWRITER?

Frankly I spent quite some time to make this locking more fine grained
after some comments on earlier revisions - and I think it's better now,
since we don't hog all of the ITS data structures during the command
processing. Pessimistically one core could issue a big number of
commands, which would block the whole ITS.
Instead MSI injections happening in parallel now have a good chance of
iterating our lists and tables in between. This should improve scalability.

Thanks,
Andre.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help