MMC quirks relating to performance/lifetime.
From: Andrei Warkentin <hidden>
Date: 2011-02-11 22:33:42
Also in:
linux-mmc
On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann [off-list ref] wrote:
On Wednesday 09 February 2011 09:37:40 Linus Walleij wrote:
[Quoting verbatim so the original mail hits linux-mmc, this is very interesting!]
2011/2/8 Andrei Warkentin [off-list ref]:
Hi, I'm not sure if this is the best place to bring this up, but Russell's name is on a fair share of drivers/mmc code, and there does seem to be quite a bit of MMC-related discussion. Excuse me in advance if this isn't the right forum :-).

Certain MMC vendors (maybe even quite a few of them) use a pretty rigid buffering scheme when it comes to handling writes. There is usually a buffer A for random accesses, and a buffer B for sequential accesses. For certain Toshiba parts, it looks like buffer A is 8KB wide, with buffer B being 4MB wide, and all accesses larger than 8KB effectively equating to 4MB accesses. Worse, consecutive small (8KB) writes are treated as one large sequential access, once again ending up in buffer B, thus necessitating out-of-order writing to work around this.

It's more complex, but I now have a pretty good understanding of what the flash media actually do, after doing a lot of benchmarking. Most of my results so far are documented on https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey but I still need to write about the more recent discoveries.

What you describe as buffer A is the "page size" of the underlying flash. It depends on the size and brand of the NAND flash chip and can be anywhere between 2 KB and 16 KB for modern cards, depending on how they combine multiple chips and planes within the chips.

What you describe as buffer B is sometimes called an "erase block group" or an "allocation unit". This is the smallest unit that gets kept in a global lookup table in the medium and can be anywhere between 1 MB and 8 MB for cards larger than 4 GB, or as small as 128 KB (a single erase block) for smaller media, as far as I have seen. When you don't write full aligned allocation units, the card will eventually have to do garbage collection on the allocation unit, which can take a long time (many milliseconds).
Most cards have a third size, typically somewhere between 32 and 128 KB, which is the optimum size for writes. While you can do linear writes to the card in page size units (writing an allocation unit from start to finish), random access within the allocation unit will be much faster with larger writes.
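
To make the page / allocation unit distinction concrete, here is a minimal Python sketch that reports which allocation units a write overlaps and whether it covers them completely. The 8 KB page and 4 MB allocation unit are only the figures mentioned in this thread for one Toshiba part; the function names are made up for illustration and other cards will have different geometry.

```python
# Illustrative geometry only: values quoted in this thread for one Toshiba part.
PAGE_SIZE = 8 * 1024                 # "buffer A" (flash page size)
ALLOCATION_UNIT = 4 * 1024 * 1024    # "buffer B" (erase block group)


def touched_allocation_units(offset, length, au_size=ALLOCATION_UNIT):
    """Return the indices of the allocation units a write overlaps."""
    if length <= 0:
        return []
    first = offset // au_size
    last = (offset + length - 1) // au_size
    return list(range(first, last + 1))


def is_full_aligned_write(offset, length, au_size=ALLOCATION_UNIT):
    """True if the write starts on an AU boundary and covers whole AUs only."""
    return length > 0 and offset % au_size == 0 and length % au_size == 0


if __name__ == "__main__":
    # A page-sized write that straddles the boundary between AU 0 and AU 1.
    print(touched_allocation_units(ALLOCATION_UNIT - 4096, PAGE_SIZE))
    # The same write is partial, so the card may later have to garbage-collect.
    print(is_full_aligned_write(ALLOCATION_UNIT - 4096, PAGE_SIZE))
```

A write for which `is_full_aligned_write` is false leaves at least one allocation unit partially written, which is the case described above as eventually triggering garbage collection.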
What this means is decreased life span for the parts, and it also means a performance impact on small writes, but the first item is much more crucial, especially for smaller parts. As I've mentioned, probably more vendors are affected. How about a generic MMC_BLOCK quirk that splits (and optionally reorders) the requests? The thresholds would then be adjustable as module/kernel parameters based on manfid. I'm asking because I have a patch now, but it's ugly and hardcoded against a specific manufacturer.

It's not just MMC specific: USB flash drives, CF cards and even cheap PATA or SATA SSDs have the same patterns. I think this will need to be solved on a higher level, in the block device elevator code and in the file systems.
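
The splitting idea can be sketched in a few lines of Python. This is not the patch mentioned above, which is not shown in this mail; it is a hypothetical illustration in which the per-manufacturer threshold table, the function names, and the choice of reverse order as the "out-of-order" strategy are all assumptions.

```python
# Hypothetical per-manufacturer split thresholds, keyed by manfid.
# 0x000002 / 8 KB are the values discussed in this thread; nothing else is known.
SPLIT_THRESHOLD_BY_MANFID = {0x000002: 8 * 1024}


def split_request(offset, length, chunk):
    """Split one write into pieces that never cross a chunk-aligned boundary."""
    pieces = []
    end = offset + length
    while offset < end:
        boundary = (offset // chunk + 1) * chunk   # next chunk boundary
        step = min(end, boundary) - offset
        pieces.append((offset, step))
        offset += step
    return pieces


def plan_writes(manfid, offset, length, reorder=False):
    """Return the (offset, length) pieces to issue for one write request."""
    chunk = SPLIT_THRESHOLD_BY_MANFID.get(manfid)
    if chunk is None:
        return [(offset, length)]          # no quirk: issue the request as-is
    pieces = split_request(offset, length, chunk)
    # Assumption: issuing pieces in descending order is one way to keep the
    # card from seeing consecutive ascending small writes.
    return list(reversed(pieces)) if reorder else pieces


if __name__ == "__main__":
    print(plan_writes(0x000002, 0, 24 * 1024, reorder=True))
```

Whether reverse ordering is enough to keep a given card out of its sequential path would have to be confirmed by measurement on that card.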
There is a quirk API so that specific quirks can be flagged for certain vendors and cards, e.g. some Toshibas in this case. For example, grep the kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE. But as Russell says, this probably needs to be signalled up to the block layer to be handled properly. Why don't you post the code you have today as an RFC patch? I think many will be interested.

Yes, I agree, that would be good. Also, I'd be interested to see the output of 'head /sys/block/mmcblk0/device/*' on that card. I'm guessing that the manufacturer ID of 0x0002 is Toshiba, and these are indeed the worst cards that I have seen so far, because they cannot do random access within an allocation unit, and they cannot alternate writes between multiple allocation units (# open AUs linear is "1" in my wiki table), while most cards can handle at least two.

Andrei, I'm certainly interested in working with you on this. The point you brought up about the Toshiba cards being especially bad is certainly valid: even if we do something better in the block layer, we need to have a way to detect the worst-case scenario, so we can work around that.

Arnd
Arnd,

Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.

cid - 02010053454d3332479070cc51451d00
csd - d00f00320f5903ffffffffff92404000
erase_size - 524288
fwrev - 0x0
hwrev - 0x0
manfid - 0x000002
name - SEM32G
oemid - 0x0100
preferred_erase_size - 2097152
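
As a cross-check on the values above, the manfid, oemid and name fields can be recovered from the raw cid string. The Python sketch below assumes the CID layout in which byte 0 is the manufacturer ID, bytes 1-2 are reported as the OEM ID, and bytes 3-8 hold a six-character ASCII product name; the function name is made up for illustration, and other card types lay out the CID differently.

```python
def decode_mmc_cid(cid_hex):
    """Decode a few fields from a 128-bit MMC CID given as 32 hex digits.

    Assumed layout: byte 0 = manufacturer ID, bytes 1-2 = OEM ID,
    bytes 3-8 = six-character ASCII product name.
    """
    raw = bytes.fromhex(cid_hex)
    if len(raw) != 16:
        raise ValueError("CID must be 16 bytes (32 hex digits)")
    return {
        "manfid": raw[0],
        "oemid": int.from_bytes(raw[1:3], "big"),
        "name": raw[3:9].decode("ascii"),
    }


if __name__ == "__main__":
    fields = decode_mmc_cid("02010053454d3332479070cc51451d00")
    print("manfid - 0x%06x" % fields["manfid"])
    print("oemid - 0x%04x" % fields["oemid"])
    print("name - %s" % fields["name"])
    # preferred_erase_size expressed in erase_size units, from the values above
    print(2097152 // 524288)
```

Decoding the cid from this card gives the same manfid, oemid and name that sysfs reports, and the preferred erase size works out to four times erase_size.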