Thread (39 messages) 39 messages, 4 authors, 2014-11-25

Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn

From: Wei Yang <hidden>
Date: 2014-11-21 01:46:06
Also in: linux-pci

On Thu, Nov 20, 2014 at 12:05:41PM -0700, Bjorn Helgaas wrote:
On Thu, Nov 20, 2014 at 03:20:57PM +0800, Wei Yang wrote:
quoted
On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
quoted
On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
quoted
From: Gavin Shan <redacted>

pci_dn is the extension of PCI device node and it's created from
device node. Unfortunately, VFs that are enabled dynamically by
PF's driver and they don't have corresponding device nodes, and
pci_dn. The patch refactors pci_dn to support VFs:

   * pci_dn is organized as a hierarchy tree. VF's pci_dn is put
     to the child list of pci_dn of PF's bridge. pci_dn of other
     device put to the child list of pci_dn of its upstream bridge.

   * VF's pci_dn is expected to be created dynamically when applying
     final fixup to PF. VF's pci_dn will be destroyed when releasing
     PF's pci_dev instance. pci_dn of other device is still created
     from device node as before.

   * For one particular PCI device (VF or not), its pci_dn can be
     found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
     or parent's list. The fast path (fetching pci_dn through PCI
     device instance) is populated during early fixup time.

Signed-off-by: Gavin Shan <redacted>
---
...
quoted
+struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent, *pdn;
+	int i;
+
+	/* Only support IOV for now */
+	if (!pdev->is_physfn)
+		return pci_get_pdn(pdev);
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return NULL;
+
+	pdn->flags |= PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return NULL;
+
+	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
+		pdn = add_one_dev_pci_info(parent, NULL,
+					   pci_iov_virtfn_bus(pdev, i),
+					   pci_iov_virtfn_devfn(pdev, i));
I'm not sure this makes sense, but I certainly don't know this code, so
maybe I'm missing something.

pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
and First VF Offset in the SR-IOV capability by sriov_init(), which is
called before add_dev_pci_info():

 pci_scan_child_bus
   pci_scan_slot
     pci_scan_single_device
pci_device_add
  pci_init_capabilities
    pci_iov_init(PF)
      sriov_init(PF, pos)
	pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
	iov->offset = offset
	iov->stride = stride

 pci_bus_add_devices
   pci_bus_add_device
     pci_fixup_device(pci_fixup_final)
add_dev_pci_info
  pci_iov_virtfn_bus
    return ... + sriov->offset + (sriov->stride * id) ...

But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
sriov_init() above.  We will change NumVFs to something different when a
driver calls pci_enable_sriov():

 pci_enable_sriov
   sriov_enable
     pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)

Now First VF Offset and VF Stride have changed from what they were when we
called pci_iov_virtfn_bus() above.
Oops, I see the ARI would affect those value, while missed the NumVFs also
would.

Let's look at the problem one by one.

1. The ARI capability.
===============================================================================
The kernel initialize the capability like this:

pci_init_capabilities()
	pci_configure_ari()
	pci_iov_init()
		iov->offset = offset
		iov->stride = stride

When offset/stride is retrieved at this point, the ARI capability is taken
into consideration.
PCI_SRIOV_CTRL_ARI is currently only changed at the time we enumerate the
PF, so I don't think this is really a problem.
quoted
2. The PF's NumVFs field
===============================================================================
2.1 Potential problem in current code
===============================================================================
First, is current pci code has some potential problem?

sriov_enable()
	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
	iov->offset = offset;
	iov->stride = stride;
	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
	virtfn_add()
		...
		virtfn->devfn = pci_iov_virtfn_devfn(dev, id);

The sriov_enable() retrieve the offset/stride then write the NumVFs. According
to the SPEC, at this moment the offset/stride may change. While I don't see
some code to retrieve and store those value again. And these fields will be
used in virtfn_add().

If my understanding is correct, I suggest to move the retrieve and store
operation after NumVFs is written.
Yep, it looks like the existing code has similar problems.  We might want
to add a simple function that writes PCI_SRIOV_NUM_VF, then reads
PCI_SRIOV_VF_OFFSET and PCI_SRIOV_VF_STRIDE and refreshes the cached values
in dev->sriov.

Then we'd at least know that virtfn_bus() and virtfn_devfn() return values
that are correct until the next NumVFs change.
Ok, I will write a function to wrap it.
quoted
2.2 The IOV bus range may not be correct in pci_scan_child_bus()?
===============================================================================
In current pci core, when enumerating the pci tree, we do like this:

pci_scan_child_bus()
	pci_scan_slot()
		pci_scan_single_device()
			pci_device_add()
				pci_init_capabilities()
					pci_iov_init()
	max += pci_iov_bus_range(bus);
		busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1);
	max = pci_scan_bridge(bus, dev, max, pass);

From this point, we see pci core reserve some bus range for VFs. This
calculation is based on the offset/stride at this moment. And do the
enumeration with the new bus number.

sriov_enable() could be called several times from driver to enable SRIOV, and
with different nr_virtfn. If each time the NumVFs written, the offset/stride
will change. This means we may try to address an extra bus we didn't reserve?
Or this means it is out of control?
This looks like a problem, too.  I don't have a good suggestion for fixing
it.
How about enumerating all possible NumVFs and select the maximum?

I am not sure what will happen if the FW sets a different number? So FW should
listen to kernel, right?

I could write a code with this logic and test, while I am afraid this will
break some platfrom in some case.
quoted
2.3 How can I reserve bus range in FW?
===============================================================================
This question comes from the previous one.

Based on my understanding, current pci core will rely on the bus number in HW
if pcibios_assign_all_busses() is not set. If we want to support those VFs
sits on different bus with PF, we need to reserve bus range and write the
correct secondary/subordinate in bridge. Otherwise, those VFs on different bus
may not be addressed.

Currently I am writing the code in FW to reserve the range with the same
mechanism in pci core. While as you mentioned the offset/stride may change
after sriov_enable(), I am confused whether this is the correct way.
If your firmware knows something about the device and can compute the
number of buses it will likely need, it can set up bridges with appropriate
bus number ranges, and Linux will generally leave those alone.
Yep, this is what I am trying to do.
I don't know the best way to figure out the number of buses, though.  It
seems like you almost need to experimentally set NumVFs and read the
resulting offset/stride, because I think it's really up to the device to
decide how to number the VFs.  Maybe pci_iov_bus_range() needs to do
something similar.
Got it, I will add this logic.
quoted
2.4 The potential problem for [Patch 08/18]
===============================================================================
According to the SPEC, the offset/stride will change after each
sriov_enable(). This means the bus/devfn will change after each
sriov_enable().

My current thought is to fix it up in virtfn_add(). If the total VF number
will not change, we could create those pci_dn at the beginning and fix the
bus/devfn at each time the VF is truely created.
By "fix it up," I assume you mean call an arch function that does the
pci_pdn setup you need.

It sounds reasonable to do this either in virtfn_add()/virtfn_remove() or
at the points where we write PCI_SRIOV_CTRL_VFE, i.e., in sriov_init(),
sriov_enable(), sriov_disable(), and sriov_restore_state().  From a
hardware point of view, the VFs exist whenever PCI_SRIOV_CTRL_VFE is set,
so it might be nice to have this setup connected to that.
If my understanding is correct, we could wrap up the configuration write/read
on PCI_SRIOV_CTRL and when it involves PCI_SRIOV_CTRL_VFE, do the fix up?
Bjorn
-- 
Richard Yang
Help you, Help me
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help