Linux Kernel VFSisms
Introduction
This is intended as a collection of helpful knowledge bits around Linus Kernel VFS internals. It mostly contains (hopefully) useful bits and pieces I picked up while working on the Linux kernel and talking to VFS maintainers or high-profile contributors.
ksys_close()
Should never be used. One of the major reasons being that it is too easy to get wrong.
On creating and installing new file descriptors
A file descriptor should only be installed past every possible point of
failure. Specifically for a syscall the file descriptor should be installed
right before returning to userspace.
Consider the function anon_inode_getfd()
. This functions creates and installs
a new file descriptor for a task. Hence, by the rule given above it should only
ever be called when the syscall cannot fail anymore in any other way then by
failing anon_inode_getfd()
.
For all other cases the rule is to reserve a file descriptor but defer the installation of the file descriptor past the last point of failure. Note, that installing an file descriptor itself is not an operation that can fail.
Back to the anonymous inode example: Instead of calling anon_inode_getfd()
callers who need a file descriptor before the last point of failure should
reserve a file descriptor, call anon_inode_getfile()
and then defer the
fd_install()
until after the last point of failure. Here is a concrete
example blessed by Al Viro:
if (clone_flags & CLONE_PIDFD) {
/* reserve a new file descriptor */
retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
if (retval < 0)
goto bad_fork_free_pid;
pidfd = retval;
/* get file to associate with file descriptor */
pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
O_RDWR | O_CLOEXEC);
if (IS_ERR(pidfile)) {
put_unused_fd(pidfd);
retval = ERR_PTR(pidfile);
goto bad_fork_free_pid;
}
get_pid(pid); /* held by pidfile now */
/* place file descriptor in buffer accessible for userspace */
retval = put_user(pidfd, parent_tidptr);
if (retval)
goto bad_fork_put_pidfd;
}
/* a lot more code that can fail somehow */
/* Let kill terminate clone/fork in the middle */
if (fatal_signal_pending(current)) {
retval = -EINTR;
goto bad_fork_cancel_cgroup;
}
/* past the last point of failure */
if (pidfile)
fd_install(pidfd, pidfile);
Setting i_nlink
after creating a directory in-kernel
When a new directory is created in a filesystem the inode needs to be
initialized. The new_inode
for the directoy needs to get a count of 2
for
(.
and ..
) the count of the parent_inode
of the parent directory needs to
be incremented. There are a few places in kernel where this is done like this:
inc_nlink(new_inode);
d_instantiate(dentry, new_inode);
inc_nlink(parent_inode);
fsnotify_mkdir(parent_inode, dentry);
But the preferred method of doing this is:
set_nlink(new_inode, 0);
d_instantiate(dentry, new_inode);
inc_nlink(parent_inode);
fsnotify_mkdir(parent_inode, dentry);
since new_inode
cannot be modified by someone else concurrently.