Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
From: Greg KH <hidden>
Date: 2010-09-29 02:54:31
Also in: linux-mm, lkml
On Tue, Sep 28, 2010 at 10:12:18AM -0500, Robin Holt wrote:
> On Tue, Sep 28, 2010 at 02:44:40PM +0200, Avi Kivity wrote:
> > On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
> > > This set of patches decouples the concept that a single memory
> > > section corresponds to a single directory in
> > > /sys/devices/system/memory/.  On systems with large amounts of
> > > memory (1+ TB) there are performance issues related to creating
> > > the large number of sysfs directories.  For a powerpc machine
> > > with 1 TB of memory we are creating 63,000+ directories.  This is
> > > resulting in boot times of around 45-50 minutes for systems with
> > > 1 TB of memory and 8 hours for systems with 2 TB of memory.  With
> > > this patch set applied I am now seeing boot times of 5 minutes or
> > > less.
> > >
> > > The root of this issue is in sysfs directory creation.  Every
> > > time a directory is created a string compare is done against all
> > > sibling directories to ensure we do not create duplicates.  The
> > > list of directory nodes in sysfs is kept as an unsorted list
> > > which results in this being an exponentially longer operation as
> > > the number of directories are created.
> > >
> > > The solution solved by this patch set is to allow a single
> > > directory in sysfs to span multiple memory sections.  This is
> > > controlled by an optional architecturally defined function
> > > memory_block_size_bytes().  The default definition of this
> > > routine returns a memory block size equal to the memory section
> > > size.  This maintains the current layout of sysfs memory
> > > directories as it appears to userspace to remain the same as it
> > > is today.
> >
> > Why not update sysfs directory creation to be fast, for example by
> > using an rbtree instead of a linked list.  This fixes an
> > implementation problem in the kernel instead of working around it
> > and creating a new ABI.
>
> Because the old ABI creates 129,000+ entries inside
> /sys/devices/system/memory with their associated links from
> /sys/devices/system/node/node*/ back to those directory entries.
>
> Thankfully things like rpm, hald, and other miscellaneous commands
> scan that information.

Really?  Why?  Why would rpm care about this?  hald is dead now so we
don't need to worry about that anymore, but what other commands/programs
read this information?

thanks,

greg k-h
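
The duplicate check described in the quoted cover letter (one string compare against every existing sibling each time a directory is created) can be modeled with a small C sketch. This is only a toy model under stated assumptions, not the kernel's sysfs code: count_dup_checks() and the "memory<N>" naming used here are hypothetical and exist purely to count comparisons.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 32

/*
 * Hypothetical model, not kernel code: create n sibling directories
 * named "memory0" .. "memory<n-1>".  Before adding each name, compare
 * it against every name already present, as an unsorted list forces.
 * Returns the total number of string comparisons performed.
 */
unsigned long long count_dup_checks(unsigned int n)
{
	char (*names)[NAME_LEN] = malloc((size_t)n * NAME_LEN);
	unsigned long long compares = 0;
	unsigned int i, j;

	/* malloc(0) may legitimately return NULL, so only insist for n > 0 */
	assert(names != NULL || n == 0);

	for (i = 0; i < n; i++) {
		char candidate[NAME_LEN];

		snprintf(candidate, sizeof(candidate), "memory%u", i);

		/* linear scan over all existing siblings */
		for (j = 0; j < i; j++) {
			compares++;
			if (strcmp(names[j], candidate) == 0)
				break;	/* duplicate; cannot happen with unique names */
		}

		strcpy(names[i], candidate);
	}

	free(names);
	return compares;
}
```

In this model, creating n unique directories costs n*(n-1)/2 comparisons, so 63,000 directories would need roughly 1.98 billion string compares. That growth is quadratic in the number of directories; a sorted structure such as the rbtree suggested in the thread would cut each duplicate check to O(log n).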