Hi, Nikolai Khechumov here.
We all use software and hardware with many security mechanisms inside and don’t notice them. But these mechanisms are not magical: they still live somewhere in code or schematics and trigger just when needed.
I’ve always been interested in how things work, which is one of the main reasons I’ve been passionate about Engineering, especially Security.
So, I’m starting the “Security’s Moving Parts” series.
My goal is to dig deeper, learn something new, understand specific processes under the hood, and pinpoint interesting places.
Today, I’d like to cover access control mechanisms in Unix/Linux. We’ve all seen the front part of the process — that RWX thing every file and directory has when we use the ls command. But how is it represented technically? When is it used and — the most interesting — who uses it?
Filesystem Representation
Storage disks have bits, and OSes have files.
Something in between should glue these two concepts together. So, a file system represents data on a disk as abstractions (files) and performs other essential operations on them.
Operating Systems obviously have some code to support this. There are many file systems, and it is more reasonable for an OS to have an abstraction layer to operate on any of them. Linux has such an abstraction called VFS or Virtual File System.
It allows system calls (such as open, read, write, and close) to interact with files in the same way, regardless of the underlying file system type.
So, VFS provides a unified interface to various file systems, abstracting how file data is stored.
Linux’s VFS defines the following key structures:
- inode: metadata of a file or a directory (size, ownership, permissions, timestamps, pointers to data blocks on a disk).
- dentry: a mapping between a filename and corresponding inode.
- file: a currently open file and its state (e.g., the position pointer and read/write flags).
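To make the dentry-to-inode mapping a bit more tangible, here is a minimal user-space sketch. It assumes a Linux system and uses the standard stat() call; the helper name inode_number_of is hypothetical and exists only for this illustration.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not a kernel or libc function):
 * returns the inode number behind a path, or 0 if the lookup fails.
 */
ino_t inode_number_of(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;       /* lookup failed; errno holds the reason */
    return st.st_ino;   /* the user-space view of the inode number */
}
```

Two different names can resolve to the same inode: on a typical system, inode_number_of("/") and inode_number_of("/.") return the same value, because both names lead to one and the same directory inode.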
The inode's definition lives in the source code of the Linux kernel.
In simplified form, an inode looks like this:
struct inode {
unsigned long i_ino; // Inode number (unique identifier)
umode_t i_mode; // File mode (type, permissions)
unsigned int i_nlink; // Number of hard links
uid_t i_uid; // User ID of the file's owner
gid_t i_gid; // Group ID of the file's group
loff_t i_size; // File size in bytes
struct timespec i_atime; // Time of last access
struct timespec i_mtime; // Time of last modification
struct timespec i_ctime; // Time of last status change
struct super_block *i_sb; // Pointer to the superblock for the file system
struct inode_operations *i_op; // File system-specific inode operations
struct file_operations *i_fop; // File operations (e.g., read, write, open)
// Additional fields may include:
// - Pointers to data block mappings
// - Extended attributes and ACLs
// - Caching and locking primitives
};
Permission information for a file is stored in a single i_mode field (16 bits) and encodes the permission composition for three types of actions (read, write, execute) for three types of actors (owner, group, others).
Some bits also describe an inode’s type (is it a file or a directory), and some special (and very interesting!) bits are out of scope for now.
Let’s break down the structure of these 16 bits:
 15 14 13 12 |  11  |  10  |   9    | 8 7 6 | 5 4 3 | 2 1 0
------------------------------------------------------------
  File Type  | SUID | SGID | Sticky | Owner | Group | Others
   (S_IF*)   |      |      |  Bit   | (rwx) | (rwx) | (rwx)
------------------------------------------------------------
The bits of interest are from 0 to 8 and are divided into three groups for each actor.
Every group of three bits indicates whether read (r), write (w), or execute (x) against the inode is allowed.
So, this is how the RWX permission abstraction is represented in the file system and described in the kernel code.
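The nine low bits can be decoded by hand. The sketch below is a user-space illustration, not kernel code; the helper name mode_to_rwx is hypothetical. It walks from bit 8 (owner read) down to bit 0 (others execute) and produces the familiar string that ls prints.

```c
#include <sys/types.h>

/*
 * Hypothetical helper: renders the nine permission bits of a mode
 * as a string such as "rwxr-xr--". The buffer must hold 10 bytes.
 */
void mode_to_rwx(mode_t mode, char out[10])
{
    static const char symbols[] = "rwxrwxrwx";

    for (int i = 0; i < 9; i++) {
        /* i = 0 maps to bit 8 (owner read), i = 8 maps to bit 0 (others execute) */
        mode_t bit = (mode_t)1 << (8 - i);

        out[i] = (mode & bit) ? symbols[i] : '-';
    }
    out[9] = '\0';
}
```

For example, the octal mode 0754 decodes to rwxr-xr--: the owner gets everything, the group can read and execute, and others can only read. The file type bits above bit 11 are simply ignored by this helper.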
Now, let’s get things moving.
The Syscalls
Let’s examine a simple operation: opening a file. Like any other OS, Linux has a system call for that with a pretty straightforward name: open. Any process can call it.
do_open()
The corresponding source code lives in the fs/open.c file and will help us understand the process.
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
if (force_o_largefile())
flags |= O_LARGEFILE;
return do_sys_open(AT_FDCWD, filename, flags, mode);
}
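From user space, this entry point is reached through the libc open() wrapper (depending on the libc version, the wrapper may actually issue openat, which converges on the same kernel path). Here is a minimal sketch of what a caller sees; the helper name try_open is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if the file could be opened,
 * otherwise the errno value reported by the kernel.
 */
int try_open(const char *path, int flags)
{
    int fd = open(path, flags);

    if (fd < 0)
        return errno;   /* e.g. EACCES when a permission check refused the request */
    close(fd);
    return 0;
}
```

A refusal from the permission checks described in this article surfaces here as EACCES, while a missing path surfaces as ENOENT before any RWX bits of the target are consulted.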
Following the chain of calls, we land on an internal kernel function, do_open, inside fs/namei.c, where the may_open() function is called:
/*
 * Handle the last step of open()
 */
static int do_open(struct nameidata *nd,
struct file *file, const struct open_flags *op)
{
// most of the actual code was removed for readability
int error;
error = may_open(idmap, &nd->path, acc_mode, open_flag);
if (!error && !(file->f_mode & FMODE_OPENED))
error = vfs_open(&nd->path, file);
if (!error)
error = security_file_post_open(file, op->acc_mode);
if (!error && do_truncate)
error = handle_truncate(idmap, file);
if (unlikely(error > 0)) {
WARN_ON(1);
error = -EINVAL;
}
return error;
}
may_open()
By the time this function runs, the file name has already been resolved into a path. The function takes the inode from that path's dentry and then calls the inode_permission function:
static int may_open(struct mnt_idmap *idmap, const struct path *path,
int acc_mode, int flag)
{
// Step 1: Getting the inode for the path through dentry
// path -> dentry -> inode
struct dentry *dentry = path->dentry;
struct inode *inode = dentry->d_inode;
int error;
// Step 2: Checking the inode permission against the idmap
error = inode_permission(idmap, inode, MAY_OPEN | acc_mode);
if (error)
return error;
// remaining checks removed for readability
return 0;
}
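An idmap here describes how the mount maps user and group IDs (it matters for idmapped mounts). The identity of the calling process itself comes from the current task's credentials.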
inode_permission() → do_inode_permission() → generic_permission()
The last function is where the RWX checks actually live.
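First of all, some basic and cheap permissions are checked in acl_permission_check():
- Does everyone already have the requested rights, with no ACLs to consult? If so → OK (a cheap shortcut in recent kernels).
- Is the current user (process) the owner? If so → the owner's rwx bits alone decide.
- Otherwise, the group or others bits are consulted.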
Then we finally see some familiar names and ifs. But once again, we are jumping into another abstraction.
Our next step is capable_wrt_inode_uidgid:
int generic_permission(struct mnt_idmap *idmap, struct inode *inode,
int mask)
{
int ret;
/*
* Do the basic permission checks.
*/
ret = acl_permission_check(idmap, inode, mask);
if (ret != -EACCES)
return ret;
// SOME CODE REMOVED
mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
if (mask == MAY_READ)
if (capable_wrt_inode_uidgid(idmap, inode,
CAP_DAC_READ_SEARCH))
return 0;
// SOME CODE REMOVED
return -EACCES;
}
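The basic check that generic_permission delegates to can be modeled in user space. The sketch below is a deliberate simplification and not the kernel's code: it ignores ACLs, supplementary groups, idmapped mounts, and capabilities. The function name rwx_check and the SKETCH_ constants are hypothetical, although the constant values mirror the kernel's MAY_EXEC, MAY_WRITE, and MAY_READ.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical names; the values mirror MAY_EXEC, MAY_WRITE and MAY_READ. */
#define SKETCH_MAY_EXEC  1
#define SKETCH_MAY_WRITE 2
#define SKETCH_MAY_READ  4

/*
 * Simplified model: pick exactly one class (owner, group or others)
 * and test the requested bits against that class only.
 */
int rwx_check(mode_t mode, uid_t file_uid, gid_t file_gid,
              uid_t uid, gid_t gid, int mask)
{
    unsigned int bits;

    if (uid == file_uid)
        bits = (mode >> 6) & 7;   /* owner class: bits 8..6 */
    else if (gid == file_gid)
        bits = (mode >> 3) & 7;   /* group class: bits 5..3 */
    else
        bits = mode & 7;          /* others class: bits 2..0 */

    /* Every requested bit must be present in the selected class. */
    return (mask & 7 & ~bits) ? -EACCES : 0;
}
```

One consequence of picking a single class: with mode 0077 the owner is denied even though group and others are allowed, because only the owner bits are consulted for the owner.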
capable_wrt_inode_uidgid()
This function is used to decide if the current process is allowed to perform an operation on an inode that may be “owned” by another user. The “wrt” in the name stands for “with respect to”: the check is made with respect to the user and group IDs stored in the inode. In our case, the capability being checked is CAP_DAC_READ_SEARCH.
It lives in kernel/capability.c:
bool capable_wrt_inode_uidgid(struct mnt_idmap *idmap,
const struct inode *inode, int cap)
{
struct user_namespace *ns = current_user_ns();
return ns_capable(ns, cap) &&
privileged_wrt_inode_uidgid(ns, idmap, inode);
}
Yes, we could also examine the calls to ns_capable and privileged_wrt_inode_uidgid, but the current depth seems enough for our specific task. These functions open up the massive ‘Capabilities’ mechanism, which I’d like to cover in a separate article some time 🙂
Going Back to the Surface
However, there is one more security mechanism we haven’t found in the pipeline — the Linux Security Module Hooks.
We’ve covered everything above the LSM hook block for now.
To discover the LSM entry, we have to move a bit back, to the inode_permission function.
int inode_permission(struct mnt_idmap *idmap,
struct inode *inode, int mask)
{
int retval;
retval = sb_permission(inode->i_sb, inode, mask);
if (retval)
return retval;
// WE EXAMINED THIS ROUTE
retval = do_inode_permission(idmap, inode, mask);
if (retval)
return retval;
// LSM HOOKS RUN HERE
return security_inode_permission(inode, mask);
}
We examined the route of do_inode_permission earlier, but it was just an intermediate step in the whole process.
In fact, this function represents the “DAC checks” square in the scheme above.
LSM hooks run at the very end of inode_permission, inside the security_inode_permission function.
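The staged shape of inode_permission can be sketched as a chain where the first denial wins. This is a user-space model with hypothetical names (check_fn, run_checks, and the two sketch_ stages); the stage bodies are made up purely for illustration and do not reflect any real policy.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical type: one stage returns 0 to allow or a negative errno to deny. */
typedef int (*check_fn)(int mask);

/* Illustrative stage: pretend the DAC bits allow everything. */
int sketch_dac_stage(int mask)
{
    (void)mask;
    return 0;
}

/* Illustrative stage: pretend a security module policy forbids writes (bit value 2). */
int sketch_lsm_stage(int mask)
{
    return (mask & 2) ? -EACCES : 0;
}

/* Runs the stages in order and stops at the first one that denies. */
int run_checks(const check_fn *stages, size_t count, int mask)
{
    for (size_t i = 0; i < count; i++) {
        int ret = stages[i](mask);

        if (ret)
            return ret;   /* first denial wins; later stages never run */
    }
    return 0;
}
```

With the stages ordered as DAC first and LSM second, a read request (mask 4) passes both, while a write request (mask 2) passes the DAC stage and is still refused by the LSM stage. Passing the RWX check is necessary but not sufficient.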
Summarizing
So, the access control process in Linux turns out to be (as always) complex and multi-staged.
- Its data lives behind the VFS abstraction, as permission bits stored inside inodes.
- That data is consulted during every system call that involves file access.
- The checks inside such a system call start with simple DAC checks and finish with LSM hooks.
Thank you for reading, and see you in the next article!