Thread: 7 messages, 5 authors, 2017-05-18

Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

From: Darrick J. Wong <hidden>
Date: 2017-05-14 04:25:12
Also in: linux-btrfs, linux-ext4, linux-fsdevel, linux-man, linux-xfs

On Sat, May 13, 2017 at 07:41:24PM -0600, Andreas Dilger wrote:
On May 10, 2017, at 11:10 PM, Eric Biggers [off-list ref] wrote:
On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
[cc btrfs, since afaict that's where most of the dedupe tool authors hang out]

On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
Theodore Ts'o [off-list ref] writes:
On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
1.) Privacy implications.  Say the filesystem is being shared between multiple
   users, and one user unpacks foo.tar.gz into their home directory, which
   they've set to mode 700 to hide from other users.  Because of this new
   ioctl, all users will be able to see every (inode number, size in blocks)
   pair that was added to the filesystem, as well as the exact layout of the
   physical block allocations which might hint at how the files were created.
   If there is a known "fingerprint" for the unpacked foo.tar.gz in this
   regard, its presence on the filesystem will be revealed to all users.  And
   if any filesystems happen to prefer allocating blocks near the containing
   directory, the directory the files are in would likely be revealed too.
Frankly, why are container users even allowed to make unrestricted ioctl
calls?  I thought we had a bunch of security infrastructure to constrain
what userspace can do to a system, so why don't ioctls fall under these
same protections?  If your containers are really that adversarial, you
ought to be blacklisting as much as you can.
Personally I don't find the presence of sandboxing features to be a very good
excuse for introducing random insecure ioctls.  Not everyone has everything
perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
about the filesystem ioctls, too, since they can be executed on any regular
file, without having to open some device node in /dev.

(And this actually does happen; the SELinux policy in Android, for example,
still allows apps to call any ioctl on their data files, despite all the effort
that has gone into whitelisting other types of ioctls.  Which should be fixed,
of course, but it shows that this kind of mistake is very easy to make.)
Unix/Linux has historically not been terribly concerned about trying
to protect this kind of privacy between users.  So for example, in
order to do this, you would have to call GETFSMAP continuously to track
this sort of thing.  Someone who wanted to do this could probably get
this information (and much, much more) by continuously running "ps" to
see what processes are running.

(I will note, wryly, that in the bad old days, when dozens of users
were sharing a one MIPS Vax/780, it was considered a *good* thing
that social pressure could be applied when it was found that someone
was running a CPU or memory hogger on a time sharing system.  The
privacy right of someone running "xtrek" to be able to hide this from
other users on the system was never considered important at all.  :-)
Not to mention someone running GETFSMAP in a loop will be pretty obvious
both from the high kernel cpu usage and the huge number of metadata
operations.
Well, only if that someone running GETFSMAP actually wants to watch things in
real-time (it's not necessary for all scenarios that have been mentioned), *and*
there is monitoring in place which actually detects it and can do something
about it.

Yes, PIDs have traditionally been global, but today we have PID namespaces, and
many other isolation features such as mount namespaces.  Nothing is perfect, of
course, and containers are a lot worse than VMs, but it seems weird to use that
as an excuse to knowingly make things worse...
Fortunately, the days of timesharing seem to be well behind us.  For
those people who think that containers are as secure as VM's (hah,
hah, hah), it might be that the best way to handle this is to have a mount
option that requires root access to this functionality.  For those
people who really care about this, they can disable access.
Or use separate filesystems for each container so that exploitable bugs
that shut down the filesystem can't be used to kill the other
containers.  You could use a torrent of metadata-heavy operations
(fallocate a huge file, punch every block, truncate file, repeat) to DoS
the other containers.
What would be the reason for not putting this behind
capable(CAP_SYS_ADMIN)?

What possible legitimate function could this functionality serve to
users who don't own your filesystem?
As I've said before, it's to enable dedupe tools to decide, given a set
of files with shareable blocks, roughly how many other times each of
those shareable blocks are shared so that they can make better decisions
about which file keeps its shareable blocks, and which file gets
remapped.  Dedupe is not a privileged operation, nor are any of the
tools.
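
To make the dedupe use case above concrete, here is a minimal sketch of how a tool might use per-extent share counts. The thread only says the counts help decide which file keeps its blocks and which gets remapped; the specific rule used here (keep the copy that is already shared the most) and the names `share_counts` and `pick_keeper` are assumptions for illustration, not taken from any real dedupe tool.

```python
def share_counts(mappings):
    """Count how many mappings reference each physical extent.

    mappings: iterable of (physical, length, owner) tuples, a simplified
    stand-in for the reverse-mapping records a GETFSMAP caller might see.
    """
    counts = {}
    for physical, length, _owner in mappings:
        key = (physical, length)
        counts[key] = counts.get(key, 0) + 1
    return counts


def pick_keeper(candidates, counts):
    """Pick the file whose extent is already shared the most.

    candidates: {filename: (physical, length)} for files with identical data.
    Ties are broken by filename so the result is deterministic.
    """
    return max(sorted(candidates), key=lambda name: counts.get(candidates[name], 0))


# Hypothetical data: three owners share the extent at 1000, one owns 5000.
mappings = [(1000, 8, 11), (1000, 8, 12), (1000, 8, 13), (5000, 8, 14)]
candidates = {"a.img": (1000, 8), "b.img": (5000, 8)}
keeper = pick_keeper(candidates, share_counts(mappings))
```

Under this assumed rule, the file sitting on the less-shared extent is the one that would be remapped, since that touches fewer existing sharers.
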
So why does the ioctl need to return all extent mappings for the entire
filesystem, instead of just the share count of each block in the file that the
ioctl is called on?
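
For context on the question above, a GETFSMAP request is keyed by device and physical range rather than by file. The sketch below builds a whole-filesystem query buffer and decodes returned records. It assumes the struct layouts from <linux/fsmap.h> as merged upstream and the asm-generic ioctl number encoding (as on x86 and arm64); the helper names are hypothetical, and this is an illustration rather than a reference client.

```python
import struct

# struct fsmap: device, flags (u32); physical, owner, offset, length (u64);
# followed by three reserved u64 fields.
FSMAP = struct.Struct("=II4Q3Q")
# struct fsmap_head without keys and records: iflags, oflags, count,
# entries (u32), then six reserved u64 fields.
FSMAP_HEAD = struct.Struct("=4I6Q")
# The fixed part of the request: the header plus the low and high keys.
HEAD_SIZE = FSMAP_HEAD.size + 2 * FSMAP.size

# _IOWR('X', 59, struct fsmap_head), assuming the asm-generic encoding.
FS_IOC_GETFSMAP = (3 << 30) | (HEAD_SIZE << 16) | (ord("X") << 8) | 59

U32_MAX = 0xFFFFFFFF
U64_MAX = 0xFFFFFFFFFFFFFFFF


def build_whole_fs_query(count):
    """Return a request buffer with room for `count` records and keys
    spanning every device, physical offset, owner, and file offset."""
    head = FSMAP_HEAD.pack(0, 0, count, 0, *([0] * 6))
    low = FSMAP.pack(0, 0, 0, 0, 0, 0, 0, 0, 0)
    # High key: key fields at their maximum; length and reserved stay zero.
    high = FSMAP.pack(U32_MAX, U32_MAX, U64_MAX, U64_MAX, U64_MAX, 0, 0, 0, 0)
    return bytearray(head + low + high + bytes(count * FSMAP.size))


def parse_records(buf):
    """Decode the fmh_entries records found after the header and keys."""
    entries = FSMAP_HEAD.unpack_from(buf, 0)[3]
    records = []
    for i in range(entries):
        fields = FSMAP.unpack_from(buf, HEAD_SIZE + i * FSMAP.size)
        # Keep (device, flags, physical, owner, offset, length).
        records.append(fields[:6])
    return records
```

On a filesystem that supports the call, the buffer would be passed as `fcntl.ioctl(fd, FS_IOC_GETFSMAP, buf)` using any open file descriptor on that filesystem, which is why the results are not limited to one file.
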
One possibility is that the ioctl() can return the mapping for all inodes
owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
than one if there is a reason to do so) with all the other allocated blocks
for inodes the user doesn't have permission to access?
Hmm, CAP_DAC_OVERRIDE/CAP_FOWNER?  That might be a reasonable set of
capabilities to grant access...
IMHO, this would allow a non-root user the main benefit of GETFSMAP, which
is trying to determine how fragmented their files are and/or how fragmented
the free space is, without leaking any information about file sizes, location,
or other information the user can't already get today in a less efficient
manner.

I don't know how hard this is to implement, but seems not impossible.
It's already implemented in both XFS and ext4. <cough>

File extents are marked as "owned" by "unknown".

Now, I suppose one could devise a scheme such that files that the caller
can open actually do get inode numbers returned, but ... that's more
engineering work, let's see if anyone asks for that (vs. asks for any of
the magic capability bits).
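
The visibility rule discussed above can be modelled roughly as follows. This is a toy model, not kernel code: the record format, the `UNKNOWN` label, and the choice to merge adjacent hidden extents (so per-file sizes do not leak) are assumptions made for illustration.

```python
UNKNOWN = "unknown"  # stand-in label; not the kernel's constant


def visible_records(records, privileged):
    """Filter reverse-mapping records for a caller.

    records: list of (physical, length, owner) sorted by physical, where
    owner is an inode number (int) or a special label such as "free".
    Privileged callers see everything. Other callers see file extents
    with the owner replaced by UNKNOWN, and contiguous UNKNOWN extents
    are merged into one record.
    """
    out = []
    for physical, length, owner in records:
        if not privileged and isinstance(owner, int):
            owner = UNKNOWN
        if (
            owner == UNKNOWN
            and out
            and out[-1][2] == UNKNOWN
            and out[-1][0] + out[-1][1] == physical
        ):
            prev_physical, prev_length, _ = out.pop()
            out.append((prev_physical, prev_length + length, UNKNOWN))
        else:
            out.append((physical, length, owner))
    return out


# Hypothetical layout: two files packed between two free regions.
layout = [(0, 8, "free"), (8, 4, 101), (12, 6, 102), (18, 2, "free")]
unprivileged_view = visible_records(layout, privileged=False)
privileged_view = visible_records(layout, privileged=True)
```

In this model an unprivileged caller can still measure free-space fragmentation, but the two files collapse into a single ten-block region with no inode numbers attached.
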

--D
Cheers, Andreas


