Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
From: Wei Yang <hidden>
Date: 2014-11-21 01:46:06
Also in:
linux-pci
On Thu, Nov 20, 2014 at 12:05:41PM -0700, Bjorn Helgaas wrote:
On Thu, Nov 20, 2014 at 03:20:57PM +0800, Wei Yang wrote:quoted
On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:quoted
On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:quoted
From: Gavin Shan <redacted> pci_dn is the extension of PCI device node and it's created from device node. Unfortunately, VFs that are enabled dynamically by PF's driver and they don't have corresponding device nodes, and pci_dn. The patch refactors pci_dn to support VFs: * pci_dn is organized as a hierarchy tree. VF's pci_dn is put to the child list of pci_dn of PF's bridge. pci_dn of other device put to the child list of pci_dn of its upstream bridge. * VF's pci_dn is expected to be created dynamically when applying final fixup to PF. VF's pci_dn will be destroyed when releasing PF's pci_dev instance. pci_dn of other device is still created from device node as before. * For one particular PCI device (VF or not), its pci_dn can be found from pdev->dev.archdata.firmware_data, PCI_DN(devnode), or parent's list. The fast path (fetching pci_dn through PCI device instance) is populated during early fixup time. Signed-off-by: Gavin Shan <redacted> --- ...quoted
+struct pci_dn *add_dev_pci_info(struct pci_dev *pdev) +{ +#ifdef CONFIG_PCI_IOV + struct pci_dn *parent, *pdn; + int i; + + /* Only support IOV for now */ + if (!pdev->is_physfn) + return pci_get_pdn(pdev); + + /* Check if VFs have been populated */ + pdn = pci_get_pdn(pdev); + if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF)) + return NULL; + + pdn->flags |= PCI_DN_FLAG_IOV_VF; + parent = pci_bus_to_pdn(pdev->bus); + if (!parent) + return NULL; + + for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) { + pdn = add_one_dev_pci_info(parent, NULL, + pci_iov_virtfn_bus(pdev, i), + pci_iov_virtfn_devfn(pdev, i));I'm not sure this makes sense, but I certainly don't know this code, so maybe I'm missing something. pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on pdev->sriov->stride and pdev->sriov->offset. These are read from VF Stride and First VF Offset in the SR-IOV capability by sriov_init(), which is called before add_dev_pci_info(): pci_scan_child_bus pci_scan_slot pci_scan_single_device pci_device_add pci_init_capabilities pci_iov_init(PF) sriov_init(PF, pos) pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0) pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset) pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride) iov->offset = offset iov->stride = stride pci_bus_add_devices pci_bus_add_device pci_fixup_device(pci_fixup_final) add_dev_pci_info pci_iov_virtfn_bus return ... + sriov->offset + (sriov->stride * id) ... But both First VF Offset and VF Stride change when ARI Capable Hierarchy or NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10). We set NumVFs to zero in sriov_init() above. We will change NumVFs to something different when a driver calls pci_enable_sriov(): pci_enable_sriov sriov_enable pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn) Now First VF Offset and VF Stride have changed from what they were when we called pci_iov_virtfn_bus() above.Oops, I see the ARI would affect those value, while missed the NumVFs also would. Let's look at the problem one by one. 1. The ARI capability. =============================================================================== The kernel initialize the capability like this: pci_init_capabilities() pci_configure_ari() pci_iov_init() iov->offset = offset iov->stride = stride When offset/stride is retrieved at this point, the ARI capability is taken into consideration.PCI_SRIOV_CTRL_ARI is currently only changed at the time we enumerate the PF, so I don't think this is really a problem.quoted
2. The PF's NumVFs field =============================================================================== 2.1 Potential problem in current code =============================================================================== First, is current pci code has some potential problem? sriov_enable() pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset); pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride); iov->offset = offset; iov->stride = stride; pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn); virtfn_add() ... virtfn->devfn = pci_iov_virtfn_devfn(dev, id); The sriov_enable() retrieve the offset/stride then write the NumVFs. According to the SPEC, at this moment the offset/stride may change. While I don't see some code to retrieve and store those value again. And these fields will be used in virtfn_add(). If my understanding is correct, I suggest to move the retrieve and store operation after NumVFs is written.Yep, it looks like the existing code has similar problems. We might want to add a simple function that writes PCI_SRIOV_NUM_VF, then reads PCI_SRIOV_VF_OFFSET and PCI_SRIOV_VF_STRIDE and refreshes the cached values in dev->sriov. Then we'd at least know that virtfn_bus() and virtfn_devfn() return values that are correct until the next NumVFs change.
Ok, I will write a function to wrap it.
quoted
2.2 The IOV bus range may not be correct in pci_scan_child_bus()? =============================================================================== In current pci core, when enumerating the pci tree, we do like this: pci_scan_child_bus() pci_scan_slot() pci_scan_single_device() pci_device_add() pci_init_capabilities() pci_iov_init() max += pci_iov_bus_range(bus); busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1); max = pci_scan_bridge(bus, dev, max, pass); From this point, we see pci core reserve some bus range for VFs. This calculation is based on the offset/stride at this moment. And do the enumeration with the new bus number. sriov_enable() could be called several times from driver to enable SRIOV, and with different nr_virtfn. If each time the NumVFs written, the offset/stride will change. This means we may try to address an extra bus we didn't reserve? Or this means it is out of control?This looks like a problem, too. I don't have a good suggestion for fixing it.
How about enumerating all possible NumVFs and select the maximum? I am not sure what will happen if the FW sets a different number? So FW should listen to kernel, right? I could write a code with this logic and test, while I am afraid this will break some platfrom in some case.
quoted
2.3 How can I reserve bus range in FW? =============================================================================== This question comes from the previous one. Based on my understanding, current pci core will rely on the bus number in HW if pcibios_assign_all_busses() is not set. If we want to support those VFs sits on different bus with PF, we need to reserve bus range and write the correct secondary/subordinate in bridge. Otherwise, those VFs on different bus may not be addressed. Currently I am writing the code in FW to reserve the range with the same mechanism in pci core. While as you mentioned the offset/stride may change after sriov_enable(), I am confused whether this is the correct way.If your firmware knows something about the device and can compute the number of buses it will likely need, it can set up bridges with appropriate bus number ranges, and Linux will generally leave those alone.
Yep, this is what I am trying to do.
I don't know the best way to figure out the number of buses, though. It seems like you almost need to experimentally set NumVFs and read the resulting offset/stride, because I think it's really up to the device to decide how to number the VFs. Maybe pci_iov_bus_range() needs to do something similar.
Got it, I will add this logic.
quoted
2.4 The potential problem for [Patch 08/18] =============================================================================== According to the SPEC, the offset/stride will change after each sriov_enable(). This means the bus/devfn will change after each sriov_enable(). My current thought is to fix it up in virtfn_add(). If the total VF number will not change, we could create those pci_dn at the beginning and fix the bus/devfn at each time the VF is truely created.By "fix it up," I assume you mean call an arch function that does the pci_pdn setup you need. It sounds reasonable to do this either in virtfn_add()/virtfn_remove() or at the points where we write PCI_SRIOV_CTRL_VFE, i.e., in sriov_init(), sriov_enable(), sriov_disable(), and sriov_restore_state(). From a hardware point of view, the VFs exist whenever PCI_SRIOV_CTRL_VFE is set, so it might be nice to have this setup connected to that.
If my understanding is correct, we could wrap up the configuration write/read on PCI_SRIOV_CTRL and when it involves PCI_SRIOV_CTRL_VFE, do the fix up?
Bjorn
-- Richard Yang Help you, Help me