Re: [Bug #14016] mm/ipw2200 regression

From: Andrew Morton <akpm@linux-foundation.org>
Date: 2009-08-26 14:46:45
Also in: linux-mm, linux-wireless, lkml

(cc IPW maintainers and mailing lists)

On Wed, 26 Aug 2009 10:37:49 +0100 Mel Gorman [off-list ref] wrote:

quoted hunk ↗ jump to hunk

On Wed, Aug 26, 2009 at 10:27:41AM +0200, Johannes Weiner wrote:

quoted

[Cc netdev]

On Wed, Aug 26, 2009 at 09:09:44AM +0300, Pekka Enberg wrote:

quoted

On Tue, Aug 25, 2009 at 11:34 PM, Rafael J. Wysocki[off-list ref] wrote:

quoted

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.30. __Please verify if it still should be listed and let me know
(either way).

Bug-Entry __ __ __ : http://bugzilla.kernel.org/show_bug.cgi?id=14016
Subject __ __ __ __ : mm/ipw2200 regression
Submitter __ __ __ : Bartlomiej Zolnierkiewicz [off-list ref]
Date __ __ __ __ __ __: 2009-08-15 16:56 (11 days old)
References __ __ __: http://marc.info/?l=linux-kernel&m=125036437221408&w=4

If am reading the page allocator dump correctly, there's plenty of
pages left but we're unable to satisfy an order 6 allocation. There's
no slab allocator involved so the page allocator changes that went
into 2.6.31 seem likely. Mel, ideas?

It's an atomic order-6 allocation, the chances for this to succeed
after some uptime become infinitesimal.  The chunks > order-2 are
pretty much exhausted on this dump.

64 pages, presumably 256k, for fw->boot_size while current ipw
firmware images have ~188k.  I don't know jack squat about this
driver, but given the field name and the struct:

	struct ipw_fw {
		__le32 ver;
		__le32 boot_size;
		__le32 ucode_size;
		__le32 fw_size;
		u8 data[0];
	};

fw->boot_size alone being that big sounds a bit fishy to me.

Agreed. While there are a low number of order-6 pages free in the page
allocation failure dump, there are not enough for watermarks to be
satisified. As it's atomic, there is little that can be done from a VM
perspective and it's the responsibility of the driver. I'm no driver expert
but I'll have a go at fixing it anyway.

My reading of this is that the firmware is being loaded from a workqueue and
I am failing to see any restriction on sleeping in the path. It would appear
that the driver just used the most convenient *_alloc_coherent function
available forgetting that it assumes GFP_ATOMIC. Can someone who does know
which way is up with a driver tell me why the patch below might not
work?

Bartlomiej, any chance you could give this a spin? Preferably, you'd
have preempt enabled and CONFIG_DEBUG_SPINLOCK_SLEEP on as well because
that combination will complain loudly if we really can't sleep in this
path.

=====
ipw2200: Avoid large GFP_ATOMIC allocation during firmware loading

ipw2200 uses pci_alloc_consistent() to allocate a large coherent buffer for
the loading of firmware which is an order-6 allocation of GFP_ATOMIC. At
system start-up time, this is not a problem. However, the firmware on the
card can get confused and the corrective action taken is to reload the
firmware and reinit the card. High-order GFP_ATOMIC allocations of this
type can and will fail when the system is already up and running.

As the firmware is loaded from a workqueue, it should be possible for
the driver to go to sleep. This patch converts the call of
pci_alloc_consistent() which assumes GFP_ATOMIC to dma_alloc_coherent()
which can specify its own flags.

The big downside with this patch is that it uses GFP_REPEAT to avoid the
driver unloading. There is potential that this will cause a reclaim
storm as the machine tries to find a free order-6 buffer. A suggested
alternative for the driver owner is in the comments.

Signed-off-by: Mel Gorman <redacted>
--- 
 drivers/net/wireless/ipw2x00/ipw2200.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ipw2x00/ipw2200.c b/drivers/net/wireless/ipw2x00/ipw2200.c
index 44c29b3..f2e251e 100644
--- a/drivers/net/wireless/ipw2x00/ipw2200.c
+++ b/drivers/net/wireless/ipw2x00/ipw2200.c

@@ -3167,7 +3167,19 @@ static int ipw_load_firmware(struct ipw_priv *priv, u8 * data, size_t len)
 	u8 *shared_virt;
 
 	IPW_DEBUG_TRACE("<< : \n");
-	shared_virt = pci_alloc_consistent(priv->pci_dev, len, &shared_phys);
+
+	/*
+	 * This is a whopping large allocation, in or around order-6 so
+	 * dma_alloc_coherent is used to specify the GFP_KERNEL|__GFP_REPEAT
+	 * flags. Note that this action means the system could go into a
+	 * reclaim loop until it cannot reclaim any more trying to satisfy
+	 * the allocation. It would be preferable if one buffer is allocated
+	 * at driver initialisation and reused when the firmware needs to
+	 * be reloaded, overwriting the existing firmware each time
+	 */
+	shared_virt = dma_alloc_coherent(
+			priv->pci_dev == NULL ? NULL : &priv->pci_dev->dev, 
+			len, &shared_phys, GFP_KERNEL|__GFP_REPEAT);
 
 	if (!shared_virt)
 		return -ENOMEM;

Of course, the risk of making a change like this is that we'll then go
and leave it there.

To fix this code properly we should, as you say, stop doing an order-6
allocation altogether.

And right now I think it's doing _two_ order-6 allocations:

	shared_virt = pci_alloc_consistent(priv->pci_dev, len, &shared_phys);

	if (!shared_virt)
		return -ENOMEM;

	memmove(shared_virt, data, len);

whoever allocated `data' is being obnoxious as well.

It is perhaps pretty simple to make the second (GFP_ATOMIC) allocation
go away.  The code is already conveniently structured to do this:

	do {
		chunk = (struct fw_chunk *)(data + offset);
		offset += sizeof(struct fw_chunk);
		/* build DMA packet and queue up for sending */
		/* dma to chunk->address, the chunk->length bytes from data +
		 * offeset*/
		/* Dma loading */
		rc = ipw_fw_dma_add_buffer(priv, shared_phys + offset,
					   le32_to_cpu(chunk->address),
					   le32_to_cpu(chunk->length));
		if (rc) {
			IPW_DEBUG_INFO("dmaAddBuffer Failed\n");
			goto out;
		}

		offset += le32_to_cpu(chunk->length);
	} while (offset < len);

what is the typical/expected value of chunk->length here?  If it's
significantly less than 4096*(2^6), could we convert this function to
use a separate DMAable allocation per fw_chunk?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help