In order to support multiple filesystems, Linux contains a special kernel interface level called VFS (Virtual Filesystem Switch). This is similar to the vnode/vfs interface found in SVR4 derivatives (which originally came from the BSD and Sun implementations).
The Linux inode cache is implemented in a single file,
fs/inode.c, which consists of 977 lines of code. It
is interesting to note that not many changes have been made to it
in the last 5-7 years: one can still recognise some of the code
when comparing the latest version with, say, 1.3.42.
The structure of the Linux inode cache is as follows:
1. A global hashtable, inode_hashtable, where each inode is hashed
   by the value of the superblock pointer and the 32-bit inode
   number. Inodes without a superblock (inode->i_sb == NULL) are
   added to a doubly linked list headed by anon_hash_chain
   instead. Examples of anonymous inodes are sockets created by
   net/socket.c:sock_alloc(), by calling
   fs/inode.c:get_empty_inode().
2. A global in-use type list (inode_in_use), which contains valid
   inodes with i_count>0 and i_nlink>0. Inodes newly allocated by
   get_empty_inode() and get_new_inode() are added to the
   inode_in_use list.
3. A global unused type list (inode_unused), which contains valid
   inodes with i_count = 0.
4. A per-superblock dirty type list (sb->s_dirty), which contains
   valid inodes with i_count>0, i_nlink>0 and i_state & I_DIRTY.
   When an inode is marked dirty, it is added to the sb->s_dirty
   list if it is also hashed. Maintaining a per-superblock dirty
   list of inodes makes it possible to sync inodes quickly.
5. The inode cache proper: a SLAB cache called inode_cachep. As
   inode objects are allocated and freed, they are taken from and
   returned to this SLAB cache.
The type lists are anchored from inode->i_list, the hashtable from
inode->i_hash. Each inode can be on a hashtable and on one and only
one type list (in_use, unused or dirty).
All these lists are protected by a single spinlock:
inode_lock.
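The arrangement above can be modelled in user space. The following is only a sketch under stated assumptions: the toy_ names, the bucket count and the hash function are hypothetical and not taken from fs/inode.c, locking is omitted, and the dirty list is left out. It shows an inode hashed by (superblock pointer, inode number) and sitting on exactly one type list, moving between the in-use and unused lists as its reference count changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Circular doubly linked list in the spirit of include/linux/list.h. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

#define TOY_BUCKETS 64 /* hypothetical; the kernel sizes its table at boot */

struct toy_inode {
    struct list_head i_hash; /* linkage into one hash bucket */
    struct list_head i_list; /* linkage into exactly one type list */
    const void *i_sb;        /* owning superblock */
    unsigned long i_ino;
    int i_count;
};

static struct list_head toy_hashtable[TOY_BUCKETS];
static struct list_head toy_in_use, toy_unused;
static int toy_nr_unused;

static void toy_cache_init(void)
{
    for (int i = 0; i < TOY_BUCKETS; i++)
        list_init(&toy_hashtable[i]);
    list_init(&toy_in_use);
    list_init(&toy_unused);
    toy_nr_unused = 0;
}

/* Hypothetical hash mixing the superblock pointer and the inode number. */
static struct list_head *toy_bucket(const void *sb, unsigned long ino)
{
    return &toy_hashtable[((uintptr_t)sb / 64 + ino) % TOY_BUCKETS];
}

/* Hash a freshly allocated inode and put it on the in-use list. */
static void toy_insert(struct toy_inode *inode, const void *sb, unsigned long ino)
{
    inode->i_sb = sb;
    inode->i_ino = ino;
    inode->i_count = 1;
    list_add(&inode->i_hash, toy_bucket(sb, ino));
    list_add(&inode->i_list, &toy_in_use);
}

/* Look up by (sb, ino); a hit on an unused inode moves it back to in-use. */
static struct toy_inode *toy_get(const void *sb, unsigned long ino)
{
    struct list_head *head = toy_bucket(sb, ino);

    for (struct list_head *p = head->next; p != head; p = p->next) {
        struct toy_inode *inode =
            (struct toy_inode *)((char *)p - offsetof(struct toy_inode, i_hash));
        if (inode->i_sb != sb || inode->i_ino != ino)
            continue;
        if (inode->i_count++ == 0) {
            list_del(&inode->i_list);
            list_add(&inode->i_list, &toy_in_use);
            toy_nr_unused--;
        }
        return inode;
    }
    return NULL;
}

/* Drop a reference; the last one moves the inode to the unused list. */
static void toy_put(struct toy_inode *inode)
{
    assert(inode->i_count > 0);
    if (--inode->i_count == 0) {
        list_del(&inode->i_list);
        list_add(&inode->i_list, &toy_unused);
        toy_nr_unused++;
    }
}
```

In this model the unused counter plays the role of inodes_stat.nr_unused; in the kernel every one of these list manipulations would happen under inode_lock.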
The inode cache subsystem is initialised when the
inode_init() function is called from
init/main.c:start_kernel(). The function is marked
as __init, which means its code is thrown away later
on. It is passed a single argument: the number of physical pages
on the system. This is so that the inode cache can configure
itself depending on how much memory is available, i.e. create a
larger hashtable if there is enough memory.
The only statistic kept about the inode cache is the number of
unused inodes, stored in inodes_stat.nr_unused and
accessible to user programs via the files
/proc/sys/fs/inode-nr and
/proc/sys/fs/inode-state.
We can examine one of the lists from gdb running on a live kernel thus:
    (gdb) printf "%d\n", (unsigned long)(&((struct inode *)0)->i_list)
    8
    (gdb) p inode_unused
    $34 = 0xdfa992a8
    (gdb) p (struct list_head)inode_unused
    $35 = {next = 0xdfa992a8, prev = 0xdfcdd5a8}
    (gdb) p ((struct list_head)inode_unused).prev
    $36 = (struct list_head *) 0xdfcdd5a8
    (gdb) p (((struct list_head)inode_unused).prev)->prev
    $37 = (struct list_head *) 0xdfb5a2e8
    (gdb) set $i = (struct inode *)0xdfb5a2e0
    (gdb) p $i->i_ino
    $38 = 0x3bec7
    (gdb) p $i->i_count
    $39 = {counter = 0x0}
Note that we subtracted 8 (the offset of i_list within struct
inode, printed by the first command) from the address 0xdfb5a2e8
to obtain the address of the struct inode (0xdfb5a2e0),
according to the definition of the list_entry() macro
from include/linux/list.h.
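The same pointer arithmetic can be tried in user space. This is a minimal sketch, assuming a hypothetical cut-down struct toy_inode in which i_hash precedes i_list; the macro below is written with offsetof() and is meant to be equivalent in effect to the kernel's list_entry(), not a copy of it.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Same idea as list_entry() in include/linux/list.h, written with offsetof(). */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical cut-down inode: i_hash comes first, so i_list sits at an
 * offset of sizeof(struct list_head), which is 8 on a 32-bit machine. */
struct toy_inode {
    struct list_head i_hash;
    struct list_head i_list;
    unsigned long i_ino;
};

/* Recover the inode that embeds the given i_list node. */
static struct toy_inode *toy_inode_of(struct list_head *node)
{
    assert(node != NULL);
    return list_entry(node, struct toy_inode, i_list);
}
```

Given a pointer taken from a type list, toy_inode_of() steps back by the member offset, which is exactly the subtraction done by hand in the gdb session.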
To understand how the inode cache works, let us trace the lifetime of an inode of a regular file on an ext2 filesystem as it is opened and closed:
    fd = open("file", O_RDONLY);
    close(fd);
The open(2) system call is implemented in the
fs/open.c:sys_open function and the real work is
done by the fs/open.c:filp_open() function, which is
split into two parts:
1. open_namei(): fills in the nameidata structure containing the
   dentry and vfsmount structures.
2. dentry_open(): given a dentry and vfsmount, this function
   allocates a new struct file and links them together; it also
   invokes the filesystem-specific f_op->open() method which was
   set in inode->i_fop when the inode was read in open_namei()
   (which provided the inode via dentry->d_inode).
The open_namei() function interacts with the dentry
cache via path_walk(), which in turn calls
real_lookup(), which invokes the filesystem-specific
inode_operations->lookup() method. The role of
this method is to find the entry in the parent directory with the
matching name and then do iget(sb, ino) to get the
corresponding inode, which brings us to the inode cache. When
the inode is read in, the dentry is instantiated by means of
d_add(dentry, inode). While we are at it, note that
for UNIX-style filesystems which have the concept of an on-disk
inode number, it is the lookup method's job to map its endianness
to the current CPU format, e.g. if the inode number in the raw
(fs-specific) dir entry is in little-endian 32 bit format one
could do:
    unsigned long ino = le32_to_cpu(de->inode);
    inode = iget(sb, ino);
    d_add(dentry, inode);
So, when we open a file we hit iget(sb, ino),
which is really iget4(sb, ino, NULL, NULL), which
does the following:
1. It attempts to find an inode with matching superblock and inode
   number in the hashtable under protection of inode_lock. If the
   inode is found, its reference count (i_count) is incremented;
   if it was 0 prior to the increment and the inode is not dirty,
   it is removed from whatever type list (inode->i_list) it is
   currently on (it has to be the inode_unused list, of course)
   and inserted into the inode_in_use type list; finally,
   inodes_stat.nr_unused is decremented.
2. If the inode is currently locked, we wait until it is unlocked,
   so that iget4() is guaranteed to return an unlocked inode.
3. If the inode was not found in the hashtable, we call
   get_new_inode(), passing it the pointer to the place in the
   hashtable where the inode should be inserted.
4. get_new_inode() allocates a new inode from the inode_cachep
   SLAB cache, but this operation can block (GFP_KERNEL
   allocation), so it must drop the inode_lock spinlock which
   guards the hashtable. Since it has dropped the spinlock, it
   must retry searching for the inode in the hashtable afterwards;
   if it is found this time, it returns (after incrementing the
   reference by __iget) the one found in the hashtable and
   destroys the newly allocated one. If it is still not found in
   the hashtable, then the new inode we have just allocated is the
   one to be used; therefore it is initialised to the required
   values and the fs-specific sb->s_op->read_inode() method is
   invoked to populate the rest of the inode. This brings us from
   the inode cache back to the filesystem code - remember that we
   came to the inode cache when the filesystem-specific lookup()
   method invoked iget(). While the s_op->read_inode() method is
   reading the inode from disk, the inode is locked
   (i_state = I_LOCK); it is unlocked after the read_inode()
   method returns and all the waiters for it are woken up.
Now, let's see what happens when we close this file
descriptor. The close(2) system call is implemented in the
fs/open.c:sys_close() function, which calls
do_close(fd, 1), which rips (replaces with NULL) the
descriptor out of the process's file descriptor table and invokes
the filp_close() function, which does most of the work.
The interesting things happen in fput(), which
checks if this was the last reference to the file and, if so,
calls fs/file_table.c:_fput(), which calls
__fput(), which is where the interaction with dcache (and
therefore with the inode cache - remember dcache is a Master of
the inode cache!) happens. The fs/dcache.c:dput() does
dentry_iput(), which brings us back to the inode cache
via iput(inode), so let us understand
fs/inode.c:iput(inode):
1. If there is a filesystem-specific sb->s_op->put_inode() method,
   it is invoked immediately with no spinlocks held (so it can
   block).
2. The inode_lock spinlock is taken and i_count is decremented. If
   this was NOT the last reference to this inode then we simply
   check whether there are so many references to it that i_count
   could wrap around the 32 bits allocated to it; if so, we print
   a warning and return. Note that we call printk() while holding
   the inode_lock spinlock - this is fine because printk() can
   never block, therefore it may be called in absolutely any
   context (even from interrupt handlers!).
The work performed by iput() on the last inode
reference is rather complex so we separate it into a list of its
own:
1. If i_nlink == 0 (e.g. the file was unlinked while we held it
   open) then the inode is removed from the hashtable and from its
   type list; if there are any data pages held in the page cache
   for this inode, they are removed by means of
   truncate_all_inode_pages(&inode->i_data). Then the
   filesystem-specific s_op->delete_inode() method is invoked,
   which typically deletes the on-disk copy of the inode. If there
   is no s_op->delete_inode() method registered by the filesystem
   (e.g. ramfs) then we call clear_inode(inode), which invokes
   s_op->clear_inode() if registered and, if the inode corresponds
   to a block device, drops this device's reference count via
   bdput(inode->i_bdev).
2. If i_nlink != 0 then we check if there are other inodes in the
   same hash bucket and if there are none, then if the inode is
   not dirty we delete it from its type list and add it to the
   inode_unused list, incrementing inodes_stat.nr_unused. If there
   are inodes in the same hash bucket then we delete it from the
   type list and add it to the inode_unused list. If this was an
   anonymous inode (NetApp .snapshot) then we delete it from the
   type list and clear/destroy it completely.
The Linux kernel provides a mechanism for new filesystems to be written with minimum effort. The historical reasons for this are:
Let us consider the steps required to implement a filesystem
under Linux. The code to implement a filesystem can be either a
dynamically loadable module or statically linked into the kernel,
and the way it is done under Linux is very transparent. All that
is needed is to fill in a struct file_system_type
structure and register it with the VFS using the
register_filesystem() function as in the following
example from fs/bfs/inode.c:
    #include <linux/module.h>
    #include <linux/init.h>
    static struct super_block *bfs_read_super(struct super_block *, void *, int);
    static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);
    static int __init init_bfs_fs(void)
    {
        return register_filesystem(&bfs_fs_type);
    }
    static void __exit exit_bfs_fs(void)
    {
        unregister_filesystem(&bfs_fs_type);
    }
    module_init(init_bfs_fs)
    module_exit(exit_bfs_fs)
The module_init()/module_exit() macros ensure
that, when BFS is compiled as a module, the functions
init_bfs_fs() and exit_bfs_fs() turn
into init_module() and cleanup_module()
respectively; if BFS is statically linked into the kernel, the
exit_bfs_fs() code vanishes as it is
unnecessary.
The struct file_system_type is declared in
include/linux/fs.h:
    struct file_system_type {
        const char *name;
        int fs_flags;
        struct super_block *(*read_super) (struct super_block *, void *, int);
        struct module *owner;
        struct vfsmount *kern_mnt; /* For kernel mount, if it's FS_SINGLE fs */
        struct file_system_type * next;
    };
The fields thereof are explained thus:
name: the name of the filesystem. It appears in the
   /proc/filesystems file and is used as a key to find a
   filesystem by its name; this same name is used for the
   filesystem type in mount(2), and it should be unique: there can
   (obviously) be only one filesystem with a given name. For
   modules, name points into the module's address space and is not
   copied: this means cat /proc/filesystems can oops if the module
   was unloaded but the filesystem is still registered.
fs_flags: one or more (ORed) of the flags: FS_REQUIRES_DEV for
   filesystems that can only be mounted on a block device,
   FS_SINGLE for filesystems that can have only one superblock,
   FS_NOMOUNT for filesystems that cannot be mounted from
   userspace by means of the mount(2) system call: they can
   however be mounted internally using the kern_mount() interface,
   e.g. pipefs.
read_super: a pointer to the function that reads the superblock
   during the mount operation. This function is required: if it is
   not provided, the mount operation will fail, except in the
   FS_SINGLE case where it will Oops in get_sb_single(), trying to
   dereference a NULL pointer in fs_type->kern_mnt->mnt_sb with
   (fs_type->kern_mnt = NULL).
owner: pointer to the module that implements this filesystem.
   There is no need to set it manually, as the macro THIS_MODULE
   does the right thing automatically.
kern_mnt: for FS_SINGLE filesystems only. This is set by
   kern_mount() (TODO: kern_mount() should refuse to mount
   filesystems if FS_SINGLE is not set).
next: linkage into the singly-linked list headed by file_systems
   (see fs/super.c). The list is protected by the
   file_systems_lock read-write spinlock, and the functions
   register/unregister_filesystem() modify it by linking and
   unlinking the entry from the list.
The job of the read_super() function is to fill
in the fields of the superblock, allocate the root inode and
initialise any fs-private information associated with this
mounted instance of the filesystem. So, typically the
read_super() would do the following:
1. Read the superblock from the device specified via the
   sb->s_dev argument, using the buffer cache bread() function. If
   it anticipates reading a few more subsequent metadata blocks
   immediately then it makes sense to use breada() to schedule
   reading the extra blocks asynchronously.
2. Initialise sb->s_op to point to a struct super_operations
   structure. This structure contains filesystem-specific
   functions implementing operations like "read inode", "delete
   inode", etc.
3. Allocate the root inode and root dentry using d_alloc_root().
4. Set sb->s_dirt to 1 and mark the buffer containing the
   superblock dirty (TODO: why do we do this? I did it in BFS
   because MINIX did it...)
Under Linux there are several levels of indirection between
user file descriptor and the kernel inode structure. When a
process makes an open(2) system call, the kernel returns a
small non-negative integer which can be used for subsequent I/O
operations on this file. This integer is an index into an array
of pointers to struct file. Each file structure
points to a dentry via file->f_dentry. And each
dentry points to an inode via
dentry->d_inode.
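The chain of indirection can be written down as a small user-space model. This is only a sketch: the toy_ structures keep just the pointer that matters at each level, and the fixed table size TOY_NR_OPEN is a hypothetical simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Each level keeps only the link to the next one. */
struct toy_inode  { unsigned long i_ino; };
struct toy_dentry { struct toy_inode *d_inode; };
struct toy_file   { struct toy_dentry *f_dentry; };

#define TOY_NR_OPEN 8 /* hypothetical fixed table size for this model */

struct toy_files_struct {
    struct toy_file *fd[TOY_NR_OPEN]; /* indexed by file descriptor */
};

/* Follow descriptor -> file -> dentry -> inode; NULL if fd is not open. */
static struct toy_inode *toy_fd_to_inode(const struct toy_files_struct *files, int fd)
{
    assert(files != NULL);
    if (fd < 0 || fd >= TOY_NR_OPEN || files->fd[fd] == NULL)
        return NULL;
    return files->fd[fd]->f_dentry->d_inode;
}
```

Two descriptors that point at the same struct toy_file (as after dup(2)) reach the same inode, and so do two separate file structures that share a dentry.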
Each task contains a field tsk->files which is
a pointer to struct files_struct defined in
include/linux/sched.h:
    /*
     * Open file table structure
     */
    struct files_struct {
        atomic_t count;
        rwlock_t file_lock;
        int max_fds;
        int max_fdset;
        int next_fd;
        struct file ** fd; /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        fd_set close_on_exec_init;
        fd_set open_fds_init;
        struct file * fd_array[NR_OPEN_DEFAULT];
    };
The file->f_count is a reference count,
incremented by get_file() (usually called by
fget()) and decremented by fput() and
by put_filp(). The difference between
fput() and put_filp() is that
fput() does more of the work usually needed for regular
files, such as releasing flock locks, releasing the dentry, etc.,
while put_filp() only manipulates file table
structures, i.e. decrements the count, removes the file from the
anon_list and adds it to the free_list,
under protection of the files_lock spinlock.
The tsk->files can be shared between parent
and child if the child thread was created using the
clone() system call with CLONE_FILES
set in the clone flags argument. This can be seen in
kernel/fork.c:copy_files() (called by
do_fork()), which only increments
files->count if CLONE_FILES is set,
instead of copying the file descriptor table in the
time-honoured tradition of classical UNIX fork(2).
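The difference between sharing and copying the descriptor table can be sketched in user space. Everything here is a simplified model under stated assumptions: the toy_ names and the flag value TOY_CLONE_FILES are hypothetical, plain ints stand in for atomic counters, and the real copy_files() does considerably more (bitmaps, table growth, locking).

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CLONE_FILES 0x1 /* hypothetical flag value for this model */
#define TOY_NR_OPEN 8

struct toy_file { int f_count; };

struct toy_files_struct {
    int count;                        /* users of this descriptor table */
    struct toy_file *fd[TOY_NR_OPEN];
};

/* Share the table when TOY_CLONE_FILES is set, otherwise duplicate it. */
static struct toy_files_struct *toy_copy_files(unsigned long clone_flags,
                                               struct toy_files_struct *oldf)
{
    assert(oldf != NULL);

    if (clone_flags & TOY_CLONE_FILES) {
        oldf->count++;                /* one more task uses the same table */
        return oldf;
    }

    struct toy_files_struct *newf = malloc(sizeof(*newf));
    if (newf == NULL)
        return NULL;
    newf->count = 1;
    for (int i = 0; i < TOY_NR_OPEN; i++) {
        newf->fd[i] = oldf->fd[i];
        if (newf->fd[i] != NULL)
            newf->fd[i]->f_count++;   /* each open file gains a reference */
    }
    return newf;
}
```

With sharing, a descriptor opened by one task is immediately visible to the other; with copying, the two tables diverge after the fork while still referring to the same open files.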
When a file is opened, the file structure allocated for it is
installed into the current->files->fd[fd] slot and
the fd bit is set in the bitmap
current->files->open_fds. All this is done
under the write protection of the
current->files->file_lock read-write spinlock.
When the descriptor is closed, the fd bit is cleared in
current->files->open_fds and
current->files->next_fd is set equal to
fd as a hint for finding the first unused descriptor
the next time this process wants to open a file.
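The bitmap and the next_fd hint can be modelled as follows. This is a sketch only: the toy_ names and the 32-descriptor limit are hypothetical, locking is omitted, and the model makes one assumption beyond the text, namely that closing only ever lowers the hint, so that it never skips over a free slot.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_FDS 32 /* hypothetical limit: one bit per descriptor */

struct toy_files {
    uint32_t open_fds; /* bit n set means descriptor n is in use */
    int next_fd;       /* hint: no free descriptor exists below this */
};

/* Hand out the lowest free descriptor at or above the hint, or -1. */
static int toy_get_unused_fd(struct toy_files *files)
{
    for (int fd = files->next_fd; fd < TOY_MAX_FDS; fd++) {
        if (!(files->open_fds & (UINT32_C(1) << fd))) {
            files->open_fds |= UINT32_C(1) << fd;
            files->next_fd = fd + 1;
            return fd;
        }
    }
    return -1;
}

/* Clear the bit and remember the freed slot as the new starting point.
 * Assumption of this model: the hint is only lowered, never raised. */
static void toy_close_fd(struct toy_files *files, int fd)
{
    assert(fd >= 0 && fd < TOY_MAX_FDS);
    files->open_fds &= ~(UINT32_C(1) << fd);
    if (fd < files->next_fd)
        files->next_fd = fd;
}
```

The hint saves rescanning the low, usually busy, part of the bitmap on every open while still returning the lowest free descriptor.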
The file structure is declared in
include/linux/fs.h:
    struct fown_struct {
        int pid;         /* pid or -pgrp where SIGIO should be sent */
        uid_t uid, euid; /* uid/euid of process setting the owner */
        int signum;      /* posix.1b rt signal to be delivered on IO */
    };
    struct file {
        struct list_head f_list;
        struct dentry *f_dentry;
        struct vfsmount *f_vfsmnt;
        struct file_operations *f_op;
        atomic_t f_count;
        unsigned int f_flags;
        mode_t f_mode;
        loff_t f_pos;
        unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
        struct fown_struct f_owner;
        unsigned int f_uid, f_gid;
        int f_error;
        unsigned long f_version;
        /* needed for tty driver, and maybe others */
        void *private_data;
    };
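Let us look at the various fields of struct
file:
f_list: this field links the file structure onto one of the
   following lists: a) the sb->s_files list of all open files on
   this filesystem; if the corresponding inode is not anonymous,
   then dentry_open() (called by filp_open()) links the file into
   this list; b) fs/file_table.c:free_list, containing unused file
   structures; c) fs/file_table.c:anon_list; when a new file
   structure is created by get_empty_filp() it is placed on this
   list. All these lists are protected by the files_lock spinlock.
f_dentry: the dentry corresponding to this file. The dentry is
   found at nameidata lookup time by open_namei() (or rather
   path_walk() which it calls) but the actual file->f_dentry field
   is set by dentry_open() to contain the dentry thus found.
f_vfsmnt: the pointer to the vfsmount structure of the filesystem
   containing the file. This is set by dentry_open() but is found
   as part of nameidata lookup by open_namei() (or rather
   path_init() which it calls).
f_op: the pointer to file_operations, which contains various
   methods that can be invoked on the file. This is copied from
   inode->i_fop, which is placed there by the filesystem-specific
   s_op->read_inode() method during nameidata lookup. We will look
   at file_operations methods in detail later on in this section.
f_count: reference count manipulated by get_file/put_filp/fput.
f_flags: O_XXX flags from the open(2) system call, copied there
   (with slight modifications by filp_open()) by dentry_open()
   after clearing O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC - there is no
   point in storing these flags permanently since they cannot be
   modified by F_SETFL (or queried by F_GETFL) fcntl(2) calls.
f_mode: a combination of the userspace flags and mode, set by
   dentry_open(). The point of the conversion is to store read and
   write access in separate bits so one could do easy checks like
   (f_mode & FMODE_WRITE) and (f_mode & FMODE_READ).
f_pos: the current file position for the next read or write to the
   file. It is of type long long, i.e. a 64bit value.
f_owner: owner of file I/O, to receive asynchronous I/O
   notifications via the SIGIO mechanism (see
   fs/fcntl.c:kill_fasync()).
f_uid, f_gid: set to the user id and group id of the process that
   opened the file, when the file structure is created in
   get_empty_filp(). If the file is a socket, used by ipv4
   netfilter.
f_error: used by the NFS client to return write errors. It is set
   in fs/nfs/file.c and checked in
   mm/filemap.c:generic_file_write().
f_version: versioning mechanism for invalidating caches,
   incremented (using the global event) whenever f_pos changes.
private_data: private per-file data which can be used by
   filesystems or by device drivers, for example to tell multiple
   instances apart instead of relying on the classical minor
   number encoded in file->f_dentry->d_inode->i_rdev.
Now let us look at the file_operations structure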
which contains the methods that can be invoked on files. Let us
recall that it is copied from inode->i_fop where
it is set by s_op->read_inode() method. It is
declared in include/linux/fs.h:
    struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
    };
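owner: a pointer to the module that owns the subsystem in
   question. Only drivers need to set it to THIS_MODULE;
   filesystems can happily ignore it because their module counts
   are controlled at mount/umount time whilst the drivers need to
   control it at open/release time.
llseek: implements the lseek(2) system call. Usually it is omitted
   and fs/read_write.c:default_llseek() is used, which does the
   right thing (TODO: force all those who set it to NULL currently
   to use default_llseek - that way we save an if() in llseek()).
read: implements the read(2) system call. Filesystems can use
   mm/filemap.c:generic_file_read() for regular files and
   fs/read_write.c:generic_read_dir() (which simply returns
   -EISDIR) for directories here.
write: implements the write(2) system call. Filesystems can use
   mm/filemap.c:generic_file_write() for regular files and ignore
   it for directories here.
ioctl: implements driver- or filesystem-specific ioctls. Note that
   generic file ioctls like FIBMAP, FIGETBSZ, FIONREAD are
   implemented by higher levels so they never reach the
   f_op->ioctl() method.
open: called at open(2) time by dentry_open(). Filesystems rarely
   use this, e.g. coda tries to cache the file locally at open
   time.
flush: called at each close(2) of this file, not necessarily the
   last one (see the release() method below). The only filesystem
   that uses this is the NFS client, to flush all dirty pages.
   Note that this can return an error which will be passed back to
   the userspace process which made the close(2) system call.
release: called at the last close(2) of this file, i.e. when
   file->f_count reaches 0. Although defined as returning int, the
   return value is ignored by VFS (see
   fs/file_table.c:__fput()).
fsync: maps directly to the fsync(2)/fdatasync(2) system calls,
   with the last argument specifying whether it is fsync or
   fdatasync. Almost no work is done by VFS around this, except to
   map the file descriptor to a file structure (file = fget(fd))
   and down/up the inode->i_sem semaphore. The ext2 filesystem
   currently ignores the last argument and does exactly the same
   for fsync(2) and fdatasync(2).
fasync: called when file->f_flags & FASYNC changes.
lock: the filesystem-specific portion of the POSIX fcntl(2) file
   region locking mechanism. The only bug here is that, because it
   is called before the fs-independent portion
   (posix_lock_file()), if it succeeds but the standard POSIX lock
   code fails then it will never be unlocked on the fs-dependent
   level.
Under Linux, information about mounted filesystems is kept in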
two separate structures - super_block and
vfsmount. The reason for this is that Linux allows
mounting the same filesystem (block device) under multiple mount
points, which means that the same super_block can
correspond to multiple vfsmount structures.
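The one-to-many relationship can be pictured with a small model. This is a sketch under stated assumptions: the toy_ structures, the s_dev_name and mnt_dirname fields, and the singly linked chain are hypothetical simplifications (the kernel links the instances through list heads).

```c
#include <assert.h>
#include <stddef.h>

struct toy_vfsmount;

struct toy_super_block {
    const char *s_dev_name;         /* hypothetical: name of the block device */
    struct toy_vfsmount *s_mounts;  /* head of this superblock's mount chain */
};

struct toy_vfsmount {
    struct toy_super_block *mnt_sb; /* the superblock this instance exposes */
    const char *mnt_dirname;        /* hypothetical: where it is mounted */
    struct toy_vfsmount *mnt_next;  /* next instance of the same superblock */
};

/* Attach one more mounted instance to an existing superblock. */
static void toy_add_vfsmnt(struct toy_super_block *sb, struct toy_vfsmount *mnt,
                           const char *dir)
{
    assert(sb != NULL && mnt != NULL);
    mnt->mnt_sb = sb;
    mnt->mnt_dirname = dir;
    mnt->mnt_next = sb->s_mounts;
    sb->s_mounts = mnt;
}

/* Count how many mount points currently show this superblock. */
static int toy_count_mounts(const struct toy_super_block *sb)
{
    int n = 0;

    for (const struct toy_vfsmount *m = sb->s_mounts; m != NULL; m = m->mnt_next)
        n++;
    return n;
}
```

Mounting the same device twice adds a second toy_vfsmount but leaves the single toy_super_block, and hence the filesystem's in-memory state, shared between both mount points.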
Let us look at struct super_block first, declared
in include/linux/fs.h:
    struct super_block {
        struct list_head s_list; /* Keep this first */
        kdev_t s_dev;
        unsigned long s_blocksize;
        unsigned char s_blocksize_bits;
        unsigned char s_lock;
        unsigned char s_dirt;
        struct file_system_type *s_type;
        struct super_operations *s_op;
        struct dquot_operations *dq_op;
        unsigned long s_flags;
        unsigned long s_magic;
        struct dentry *s_root;
        wait_queue_head_t s_wait;
        struct list_head s_dirty; /* dirty inodes */
        struct list_head s_files;
        struct block_device *s_bdev;
        struct list_head s_mounts; /* vfsmount(s) of this one */
        struct quota_mount_options s_dquot; /* Diskquota specific options */
        union {
            struct minix_sb_info minix_sb;
            struct ext2_sb_info ext2_sb;
            ..... all filesystems that need sb-private info ...
            void *generic_sbp;
        } u;
        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct semaphore s_vfs_rename_sem; /* Kludge */
        /* The next field is used by knfsd when converting a (inode number based)
         * file handle into a dentry. As it builds a path in the dcache tree from
         * the bottom up, there may for a time be a subpath of dentrys which is not
         * connected to the main tree. This semaphore ensure that there is only ever
         * one such free path per filesystem. Note that unconnected files (or other
         * non-directories) are allowed, but not unconnected diretories.
         */
        struct semaphore s_nfsd_free_path_sem;
    };
The various fields in the super_block structure
are:
s_dev: for FS_REQUIRES_DEV filesystems, this is the i_dev of the
   block device. For others (called anonymous filesystems) this is
   an integer MKDEV(UNNAMED_MAJOR, i) where i is the first unset
   bit in the unnamed_dev_in_use array, between 1 and 255
   inclusive. See
   fs/super.c:get_unnamed_dev()/put_unnamed_dev(). It has been
   suggested many times that anonymous filesystems should not use
   the s_dev field.
s_lock: indicates whether the superblock is currently locked by
   lock_super()/unlock_super().
s_type: pointer to the struct file_system_type of the
   corresponding filesystem. The filesystem's read_super() method
   doesn't need to set it, as VFS fs/super.c:read_super() sets it
   for you if the fs-specific read_super() succeeds and resets it
   to NULL if it fails.
s_op: pointer to the super_operations structure, which contains
   fs-specific methods to read/write inodes etc. It is the job of
   the filesystem's read_super() method to initialise s_op
   correctly.
s_root: dentry of the filesystem's root. It is the job of
   read_super() to read the root inode from the disk and pass it
   to d_alloc_root() to allocate the dentry and instantiate it.
   Some filesystems spell "root" other than "/" and so use the
   more generic d_alloc() function to bind the dentry to a name,
   e.g. pipefs mounts itself on "pipe:" as its own root instead of
   "/".
s_dirty: a list of all dirty inodes. If an inode is dirty
   (inode->i_state & I_DIRTY) then it is on the
   superblock-specific dirty list, linked via inode->i_list.
s_files: a list of all open files on this superblock. It is used
   for deciding whether the filesystem can be remounted read-only,
   see fs/file_table.c:fs_may_remount_ro(), which goes through the
   sb->s_files list and denies remounting if there are files
   opened for write (file->f_mode & FMODE_WRITE) or files with
   pending unlink (inode->i_nlink == 0).
s_bdev: for FS_REQUIRES_DEV filesystems, this points to the
   block_device structure describing the device the filesystem is
   mounted on.
s_mounts: a list of all vfsmount structures, one for each mounted
   instance of this superblock.
The superblock operations are described in the
super_operations structure declared in
include/linux/fs.h:
    struct super_operations {
        void (*read_inode) (struct inode *);
        void (*write_inode) (struct inode *, int);
        void (*put_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*statfs) (struct super_block *, struct statfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);
    };
read_inode: reads the inode from the filesystem. It is only called
   from fs/inode.c:get_new_inode(), from iget4() (and therefore
   iget()). If a filesystem wants to use iget() then read_inode()
   must be implemented - otherwise get_new_inode() will panic.
   While the inode is being read it is locked
   (inode->i_state = I_LOCK). When the function returns, all
   waiters on inode->i_wait are woken up. The job of the
   filesystem's read_inode() method is to locate the disk block
   which contains the inode to be read, use the buffer cache
   bread() function to read it in, and initialise the various
   fields of the inode structure, for example inode->i_op and
   inode->i_fop, so that the VFS level knows what operations can
   be performed on the inode or the corresponding file.
   Filesystems that don't implement read_inode() are ramfs and
   pipefs. For example, ramfs has its own inode-generating
   function ramfs_get_inode(), with all the inode operations
   calling it as needed.
write_inode: writes the inode back to disk. It is similar to
   read_inode() in that it needs to locate the relevant block on
   disk and interact with the buffer cache by calling
   mark_buffer_dirty(bh). This method is called on dirty inodes
   (those marked dirty with mark_inode_dirty()) when the inode
   needs to be sync'd either individually or as part of syncing
   the entire filesystem.
delete_inode: called whenever both inode->i_count and
   inode->i_nlink reach 0. The filesystem deletes the on-disk copy
   of the inode and calls clear_inode() on the VFS inode to
   "terminate it with extreme prejudice".
put_super: called during umount(2) so that the filesystem can free
   any private information it holds about this mounted instance.
   Typically this would brelse() the block containing the
   superblock and kfree() any bitmaps allocated for free blocks,
   inodes, etc.
write_super: called when the superblock needs to be written back
   to disk. It should find the block containing the superblock
   (usually kept in the sb-private area) and
   mark_buffer_dirty(bh). It should also clear the sb->s_dirt
   flag.
statfs: implements the fstatfs(2)/statfs(2) system calls. Note
   that the pointer to struct statfs passed as argument is a
   kernel pointer, not a user pointer, so we don't need to do any
   I/O to userspace. If not implemented then statfs(2) will fail
   with ENOSYS.
clear_inode: called from the VFS-level clear_inode(). Filesystems
   that attach private data to the inode structure (via the
   generic_ip field) must free it here.
So, let us look at what happens when we mount an on-disk
(FS_REQUIRES_DEV) filesystem. The implementation of
the mount(2) system call is in
fs/super.c:sys_mount(), which is just a wrapper
that copies the options, filesystem type and device name for the
do_mount() function, which does the real work:
1. The filesystem module's reference count is incremented twice:
   once by do_mount() calling get_fs_type() and once by
   get_sb_bdev() calling get_filesystem() if read_super() was
   successful. The first increment is to prevent module unloading
   while we are inside the read_super() method and the second
   increment is to indicate that the module is in use by this
   mounted instance. Obviously, do_mount() decrements the count
   before returning, so overall the count only grows by 1 after
   each mount.
2. Since, in our case, fs_type->fs_flags & FS_REQUIRES_DEV is
   true, the superblock is initialised by a call to get_sb_bdev(),
   which obtains the reference to the block device and interacts
   with the filesystem's read_super() method to fill in the
   superblock. If all goes well, the super_block structure is
   initialised and we have an extra reference to the filesystem's
   module and a reference to the underlying block device.
3. A new vfsmount structure is allocated and linked to the
   sb->s_mounts list and to the global vfsmntlist list. The
   vfsmount field mnt_instances makes it possible to find all
   instances mounted on the same superblock as this one. The
   mnt_list field makes it possible to find all instances for all
   superblocks system-wide. The mnt_sb field points to this
   superblock and mnt_root has a new reference to the sb->s_root
   dentry.
As a simple example of a Linux filesystem that does not require
a block device for mounting, let us consider pipefs from
fs/pipe.c. The filesystem's preamble is rather
straightforward and requires little explanation:
    static DECLARE_FSTYPE(pipe_fs_type, "pipefs", pipefs_read_super,
        FS_NOMOUNT|FS_SINGLE);
    static int __init init_pipe_fs(void)
    {
        int err = register_filesystem(&pipe_fs_type);
        if (!err) {
            pipe_mnt = kern_mount(&pipe_fs_type);
            err = PTR_ERR(pipe_mnt);
            if (!IS_ERR(pipe_mnt))
                err = 0;
        }
        return err;
    }
    static void __exit exit_pipe_fs(void)
    {
        unregister_filesystem(&pipe_fs_type);
        kern_umount(pipe_mnt);
    }
    module_init(init_pipe_fs)
    module_exit(exit_pipe_fs)
The filesystem is of type FS_NOMOUNT|FS_SINGLE,
which means it cannot be mounted from userspace and can only have
one superblock system-wide. The FS_SINGLE flag also
means that it must be mounted via kern_mount() after
it is successfully registered via
register_filesystem(), which is exactly what happens
in init_pipe_fs(). The only bug in this function is
that if kern_mount() fails (e.g. because
kmalloc() failed in add_vfsmnt()) then
the filesystem is left registered although module initialisation
fails. This will cause cat /proc/filesystems to Oops.
(I have just sent a patch to Linus mentioning that although this
is not a real bug today, as pipefs can't be compiled as a module,
it should be written with the view that in the future it may
become modularised.)
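One way to avoid leaving the filesystem registered is to undo the registration when the in-kernel mount fails. The following user-space model only illustrates that unwinding pattern; it is not the patch mentioned above, and all toy_ functions and the failure switch are hypothetical stand-ins for register_filesystem(), unregister_filesystem() and kern_mount().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel state and calls involved. */
static bool toy_registered;
static bool toy_mount_should_fail;

static int toy_register_filesystem(void)
{
    assert(!toy_registered);
    toy_registered = true;
    return 0;
}

static void toy_unregister_filesystem(void)
{
    toy_registered = false;
}

/* Pretend kern_mount(): fails with -ENOMEM when told to. */
static int toy_kern_mount(void)
{
    return toy_mount_should_fail ? -ENOMEM : 0;
}

/* Initialisation that leaves nothing registered if the mount step fails. */
static int toy_init_pipe_fs(void)
{
    int err = toy_register_filesystem();

    if (err)
        return err;
    err = toy_kern_mount();
    if (err)
        toy_unregister_filesystem(); /* undo step one before reporting failure */
    return err;
}
```

The general rule illustrated here is that an initialisation function which fails should leave no trace of its earlier, successful steps.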
The result of register_filesystem() is that
pipe_fs_type is linked into the
file_systems list so one can read
/proc/filesystems and find "pipefs" entry in there
with "nodev" flag indicating that FS_REQUIRES_DEV
was not set. The /proc/filesystems file should
really be enhanced to support all the new FS_ flags
(and I made a patch to do so) but it cannot be done because it
would break all the user applications that use it. Although Linux
kernel interfaces change every minute (only for the better),
when it comes to userspace compatibility Linux is a very
conservative operating system, which allows many applications to
be used for a long time without being recompiled.
The result of kern_mount() is that:
1. A new unnamed (anonymous) device number is allocated by setting
   a bit in the unnamed_dev_in_use bitmap; if there are no more
   bits then kern_mount() fails with EMFILE.
2. A new superblock structure is allocated by means of
   get_empty_super(). The get_empty_super() function walks the
   list of superblocks headed by super_block and looks for an
   empty entry, i.e. s->s_dev == 0. If no such empty superblock is
   found then a new one is allocated using kmalloc() at GFP_USER
   priority. The maximum system-wide number of superblocks is
   checked in get_empty_super(), so if it starts failing, one can
   adjust the tunable /proc/sys/fs/super-max.
3. The filesystem-specific pipe_fs_type->read_super() method, i.e.
   pipefs_read_super(), is invoked, which allocates the root inode
   and root dentry sb->s_root, and sets sb->s_op to be
   &pipefs_ops.
4. Then kern_mount() calls add_vfsmnt(NULL, sb->s_root, "none"),
   which allocates a new vfsmount structure and links it into
   vfsmntlist and sb->s_mounts.
5. pipe_fs_type->kern_mnt is set to this new vfsmount structure
   and it is returned. The reason why the return value of
   kern_mount() is a vfsmount structure is that even FS_SINGLE
   filesystems can be mounted multiple times and so their
   mnt->mnt_sb will point to the same thing, which would be silly
   to return from multiple calls to kern_mount().
Now that the filesystem is registered and in-kernel mounted we
can use it. The entry point into the pipefs filesystem is the
pipe(2) system call, implemented in the arch-dependent
function sys_pipe(), but the real work is done by the
portable fs/pipe.c:do_pipe() function. Let us look
at do_pipe() then. The interaction with pipefs
happens when do_pipe() calls
get_pipe_inode() to allocate a new pipefs inode. For
this inode, inode->i_sb is set to pipefs'
superblock pipe_mnt->mnt_sb, the file operations
i_fop is set to rdwr_pipe_fops and the
number of readers and writers (held in
inode->i_pipe) is set to 1. The reason why there
is a separate inode field i_pipe, instead of keeping
it in the fs-private union, is that pipes and FIFOs
share the same code and FIFOs can exist on other filesystems,
which use other access paths within the same union; sharing it
would be very bad C and could work only by pure luck. So, yes,
2.2.x kernels work only by pure luck and will stop working as
soon as you slightly rearrange the fields in the inode.
Each pipe(2) system call increments a reference count
on the pipe_mnt mount instance.
Under Linux, pipes are not symmetric (bidirectional or STREAM
pipes), i.e. the two ends of the pipe have different
file->f_op operations - the
read_pipe_fops and write_pipe_fops
respectively. A write on the read side returns EBADF
and so does a read on the write side.
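The asymmetry is easy to observe from user space on Linux. This small sketch uses only the standard pipe(2), read(2) and write(2) calls; the helper name toy_wrong_direction_errno() is made up for the example, and the behaviour described is what the text above states for Linux pipes.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Try I/O in the wrong direction on a fresh pipe and report the errno.
 * Returns 0 if the call unexpectedly succeeded, -1 if pipe() itself failed. */
static int toy_wrong_direction_errno(int write_to_read_end)
{
    int fds[2];
    char c = 'x';
    ssize_t n;
    int saved;

    if (pipe(fds) != 0)
        return -1;
    assert(fds[0] >= 0 && fds[1] >= 0);

    errno = 0;
    if (write_to_read_end)
        n = write(fds[0], &c, 1); /* fds[0] is the read side */
    else
        n = read(fds[1], &c, 1);  /* fds[1] is the write side */
    saved = (n == -1) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

Both directions are expected to report EBADF, since each descriptor was opened for one kind of access only.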
As a simple example of an on-disk Linux filesystem, let us
consider BFS. The preamble of the BFS module is in
fs/bfs/inode.c:
    static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);
    static int __init init_bfs_fs(void)
    {
        return register_filesystem(&bfs_fs_type);
    }
    static void __exit exit_bfs_fs(void)
    {
        unregister_filesystem(&bfs_fs_type);
    }
    module_init(init_bfs_fs)
    module_exit(exit_bfs_fs)
A special fstype declaration macro
DECLARE_FSTYPE_DEV() is used, which sets
fs_type->fs_flags to FS_REQUIRES_DEV to
signify that BFS requires a real block device to be mounted
on.
The module's initialisation function registers the filesystem with VFS and the cleanup function (only present when BFS is configured to be a module) unregisters it.
With the filesystem registered, we can proceed to mount it,
which would invoke our fs_type->read_super()
method which is implemented in
fs/bfs/inode.c:bfs_read_super(). It does the
following:
1. set_blocksize(s->s_dev, BFS_BSIZE): since
   we are about to interact with the block device layer via the
   buffer cache, we must initialise a few things, namely set the
   block size and also inform VFS via fields
   s->s_blocksize and s->s_blocksize_bits.
2. bh = bread(dev, 0, BFS_BSIZE): we read block 0
   of the device passed via s->s_dev. This block
   is the filesystem's superblock.
3. The superblock is validated against the BFS_MAGIC
   number and, if valid, stored in the sb-private field
   s->su_sbh (which is really s->u.bfs_sb.si_sbh).
4. We allocate the inode bitmap using
   kmalloc(GFP_KERNEL) and clear all bits to 0 except
   the first two, which we set to 1 to indicate that we should
   never allocate inodes 0 and 1. Inode 2 is root and the
   corresponding bit will be set to 1 a few lines later anyway -
   the filesystem should have a valid root inode at mounting
   time!
5. We initialise s->s_op, which means
   that we can from this point invoke the inode cache via
   iget(), which results in
   s_op->read_inode() being invoked. This finds
   the block that contains the specified (by
   inode->i_ino and inode->i_dev)
   inode and reads it in. If we fail to get the root inode then we
   free the inode bitmap, release the superblock buffer back to the
   buffer cache and return NULL. If the root inode was read OK, then
   we allocate a dentry with name / (as becometh
   root) and instantiate it with this inode.
6. We go through the inodes on the filesystem, reading each one
   in order to set the corresponding bit in the inode bitmap. Each
   inode we read is returned to the inode cache via
   iput() - we don't hold a reference
   to it longer than needed.
7. If the filesystem was not mounted read-only, we mark the
   superblock buffer dirty and set the s->s_dirt flag
   (TODO: why do I do this? Originally, I did it because
   minix_read_super() did but neither minix nor BFS
   seems to modify the superblock in
   read_super()).
8. All is well, so we return the initialised superblock to the
   caller at VFS level, i.e. fs/super.c:read_super().
After the read_super() function returns
successfully, VFS obtains the reference to the filesystem module
via a call to get_filesystem(fs_type) in
fs/super.c:get_sb_bdev() and a reference to the
block device.
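The inode bitmap handling described above can be illustrated with a small userspace sketch. This is not the BFS code: calloc() stands in for kmalloc() plus clearing, and the names imap_alloc(), imap_set() and imap_test() are hypothetical helpers invented for this example. It shows only the idea of reserving inodes 0 and 1 at allocation time and marking further inodes as they are found.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Allocate a zeroed bitmap with room for nbits bits and reserve inodes
 * 0 and 1 by setting their bits. Returns NULL if allocation fails. */
unsigned long *imap_alloc(size_t nbits)
{
    size_t nwords = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
    unsigned long *map;

    assert(nbits >= 2);
    map = calloc(nwords, sizeof(*map));   /* all bits start at 0 */
    if (map == NULL)
        return NULL;
    map[0] |= 1UL << 0;                   /* inode 0 is never allocated */
    map[0] |= 1UL << 1;                   /* inode 1 is never allocated */
    return map;
}

/* Mark inode number `bit` as in use. */
void imap_set(unsigned long *map, size_t bit)
{
    map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Return 1 if inode number `bit` is marked in use, 0 otherwise. */
int imap_test(const unsigned long *map, size_t bit)
{
    return (int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}
```

A caller would allocate the map once at mount time, call imap_set(map, 2) once the root inode has been read, and release the map with free() on failure or unmount.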
Now, let us examine what happens when we do I/O on the
filesystem. We already examined how inodes are read when
iget() is called and how they are released on
iput(). Reading inodes sets up, among other things,
inode->i_op and inode->i_fop;
opening a file will propagate inode->i_fop into
file->f_op.
Let us examine the code path of the link(2) system
call. The implementation of the system call is in
fs/namei.c:sys_link():
1. The userspace names are copied into kernel space by means of
   the getname() function, which does the error
   checking.
2. These names are resolved using the
   path_init()/path_walk() interaction with dcache.
   The result is stored in the old_nd and nd
   structures.
3. If old_nd.mnt != nd.mnt then "cross-device
   link" EXDEV is returned - one cannot link between
   filesystems; in Linux this translates into: one cannot link
   between mounted instances of a filesystem (or, in particular,
   between filesystems).
4. A new dentry corresponding to nd is created by
   lookup_create().
5. The generic vfs_link() function is called, which
   checks if we can create a new entry in the directory and
   invokes the dir->i_op->link() method, which
   brings us back to the filesystem-specific
   fs/bfs/dir.c:bfs_link() function.
6. Inside bfs_link(), we check if we are trying
   to link a directory and if so, refuse with an EPERM
   error. This is the same behaviour as standard (ext2).
7. We attempt to add a new directory entry by calling
   bfs_add_entry(), which goes through all entries
   looking for an unused slot (de->ino == 0) and,
   when found, writes out the name/inode pair into the
   corresponding block and marks it dirty (at non-superblock
   priority).
8. If the directory entry was added successfully, we increment
   inode->i_nlink, update
   inode->i_ctime and mark this inode dirty as
   well as instantiating the new dentry with the inode.
Other related inode operations like
unlink()/rename() etc. work in a similar way, so not
much is gained by examining them all in detail.
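The slot search performed by bfs_add_entry() can be modelled in userspace as follows. This is a sketch under assumptions, not the kernel function: the directory is a plain in-memory array instead of buffer cache blocks, the structure demo_dirent and its 14-byte name field are assumed for illustration, and failures are reported as a bare -1 rather than a specific errno.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NAMELEN 14         /* assumed name field width for this sketch */

struct demo_dirent {
    uint16_t ino;               /* 0 marks an unused slot */
    char name[DEMO_NAMELEN];    /* zero-padded, not necessarily terminated */
};

/* Store (name, ino) in the first unused slot of dir[0..nslots-1].
 * Returns the slot index, or -1 if the name is too long or the
 * directory is full. */
int demo_add_entry(struct demo_dirent *dir, size_t nslots,
                   const char *name, uint16_t ino)
{
    size_t len = strlen(name);
    size_t i;

    assert(ino != 0);           /* 0 is reserved for "unused" */
    if (len > DEMO_NAMELEN)
        return -1;

    for (i = 0; i < nslots; i++) {
        if (dir[i].ino == 0) {
            memset(dir[i].name, 0, DEMO_NAMELEN);
            memcpy(dir[i].name, name, len);
            dir[i].ino = ino;
            return (int)i;
        }
    }
    return -1;
}
```

Note that adding a second name for the same inode number simply occupies another slot; in the real filesystem it is the caller that then increments the link count.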
Linux supports loading user application binaries from disk. More interestingly, the binaries can be stored in different formats and the operating system's response to programs via system calls can deviate from the norm (the norm being the Linux behaviour) as required, in order to emulate formats found in other flavours of UNIX (COFF, etc.) and also to emulate the system call behaviour of other flavours (Solaris, UnixWare, etc.). This is what execution domains and binary formats are for.
Each Linux task has a personality stored in its
task_struct (p->personality). The
currently existing (either in the official kernel or as addon
patch) personalities include support for FreeBSD, Solaris,
UnixWare, OpenServer and many other popular operating systems.
The value of current->personality is split into
two parts:
1. the high three bytes: bug emulation flags such as
   STICKY_TIMEOUTS, WHOLE_SECONDS, etc.
2. the low byte: the personality proper, a unique number.
By changing the personality, we can change the way the
operating system treats certain system calls, for example adding
STICKY_TIMEOUTS to
current->personality makes the select(2)
system call preserve the value of the last argument (timeout) instead
of storing the unslept time. Some buggy programs rely on buggy
operating systems (non-Linux) and so Linux provides a way to
emulate bugs in cases where the source code is not available and
so bugs cannot be fixed.
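The two parts of a personality value can be separated with simple masking. The sketch below assumes that the personality number occupies the low byte and that STICKY_TIMEOUTS is the bit 0x04000000; the DEMO_ constants and helper names are hypothetical and defined locally so that the example does not depend on any kernel header.

```c
#include <assert.h>

/* Values assumed for this sketch; the DEMO_ names are hypothetical. */
#define DEMO_PER_MASK        0x000000ffUL  /* low byte: personality proper */
#define DEMO_STICKY_TIMEOUTS 0x04000000UL  /* one bug-emulation flag bit */

/* Build a personality value from a base number and a set of flags. */
unsigned long demo_per_compose(unsigned long base, unsigned long flags)
{
    assert((base & ~DEMO_PER_MASK) == 0);  /* base must fit in the low byte */
    assert((flags & DEMO_PER_MASK) == 0);  /* flags live above the low byte */
    return base | flags;
}

/* Extract the personality number (low byte). */
unsigned long demo_per_base(unsigned long personality)
{
    return personality & DEMO_PER_MASK;
}

/* Extract the bug-emulation flags (everything above the low byte). */
unsigned long demo_per_flags(unsigned long personality)
{
    return personality & ~DEMO_PER_MASK;
}

/* Non-zero if the sticky-timeouts flag is set. */
int demo_has_sticky_timeouts(unsigned long personality)
{
    return (personality & DEMO_STICKY_TIMEOUTS) != 0;
}
```

Code that implements select(2) would then test the flag, along the lines of demo_has_sticky_timeouts(), to decide whether to leave the timeout argument untouched.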
An execution domain is a contiguous range of personalities implemented by a single module. Usually a single execution domain implements a single personality but sometimes it is possible to implement "close" personalities in a single module without too many conditionals.
Execution domains are implemented in
kernel/exec_domain.c and were completely rewritten
for 2.4 kernel, compared with 2.2.x. The list of execution
domains currently supported by the kernel, along with the range
of personalities they support, is available by reading the
/proc/execdomains file. Execution domains, except
the PER_LINUX one, can be implemented as dynamically
loadable modules.
The user interface is via the personality(2) system call,
which sets the current process' personality or returns the value
of current->personality if the argument is set to
the impossible personality 0xffffffff. Obviously, the behaviour of
this system call itself does not depend on personality.
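From userspace, the query form of the call looks as follows. This is a minimal sketch assuming a Linux system with the C library's sys/personality.h header; the wrapper name demo_query_personality() is made up for this example.

```c
#include <sys/personality.h>

/* Return the calling process's personality without changing it,
 * or -1 if the call fails. */
int demo_query_personality(void)
{
    /* 0xffffffff is the "impossible" value: query only, set nothing. */
    return personality(0xffffffffUL);
}
```

Since the query does not modify anything, calling it repeatedly should keep returning the same value.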
The kernel interface to execution domains registration consists of two functions:
1. int register_exec_domain(struct exec_domain
   *): registers the execution domain by linking it into the
   singly-linked list exec_domains under the write
   protection of the read-write spinlock
   exec_domains_lock. Returns 0 on success, non-zero
   on failure.
2. int unregister_exec_domain(struct exec_domain
   *): unregisters the execution domain by unlinking it
   from the exec_domains list, again using the
   exec_domains_lock spinlock in write mode. Returns
   0 on success.
The reason why exec_domains_lock is a read-write lock
is that only registration and unregistration requests modify the
list, whilst doing cat /proc/execdomains calls
kernel/exec_domain.c:get_exec_domain_list(), which needs
only read access to the list. Registering a new execution domain
defines a "lcall7 handler" and a signal number conversion map.
Actually, the ABI patch extends this concept of exec domain to
include extra information (like socket options, socket types,
address family and errno maps).
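The list manipulation behind registration and unregistration can be sketched in userspace. This is a model under assumptions, not the kernel code: all demo_ names are hypothetical, the specific error values (-EINVAL, -EBUSY) are choices made for this example since the text only promises "non-zero on failure", and the locking is indicated by comments rather than performed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_exec_domain {
    const char *name;
    unsigned char pers_low, pers_high;   /* personality range covered */
    struct demo_exec_domain *next;
};

static struct demo_exec_domain *demo_domains;   /* head of the list */

/* Link ep at the head of the list. A writer: the kernel would hold
 * the read-write lock in write mode around the list update. */
int demo_register_exec_domain(struct demo_exec_domain *ep)
{
    struct demo_exec_domain *p;

    if (ep == NULL)
        return -EINVAL;
    assert(ep->pers_low <= ep->pers_high);
    for (p = demo_domains; p != NULL; p = p->next)
        if (p == ep)
            return -EBUSY;               /* already registered */
    ep->next = demo_domains;
    demo_domains = ep;
    return 0;
}

/* Unlink ep from the list. Also a writer. */
int demo_unregister_exec_domain(struct demo_exec_domain *ep)
{
    struct demo_exec_domain **pp;

    for (pp = &demo_domains; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == ep) {
            *pp = ep->next;
            ep->next = NULL;
            return 0;
        }
    }
    return -EINVAL;                      /* not on the list */
}

/* Count entries. A pure reader, like the /proc listing: read mode
 * on the lock would be enough here. */
size_t demo_count_exec_domains(void)
{
    size_t n = 0;
    const struct demo_exec_domain *p;

    for (p = demo_domains; p != NULL; p = p->next)
        n++;
    return n;
}
```

The split between the two writers and the one reader is what makes a read-write lock a natural fit: any number of readers may walk the list concurrently as long as nobody is relinking it.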
The binary formats are implemented in a similar manner, i.e. a
singly-linked list formats is defined in fs/exec.c
and is protected by a read-write lock binfmt_lock.
As with exec_domains_lock, the
binfmt_lock is taken for reading on most occasions except
for registration/unregistration of binary formats. Registering a
new binary format enhances the execve(2) system call with
new load_binary()/load_shlib() functions as well as
the ability to core_dump(). The
load_shlib() method is used only by the old
uselib(2) system call while the load_binary()
method is called by the search_binary_handler() from
do_execve() which implements execve(2) system
call.
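The way a list of binary formats is consulted can be modelled with a few lines of userspace C. This is a sketch, not search_binary_handler() itself: the demo_ names are hypothetical, the two "loaders" only inspect magic bytes, and the convention that a format rejects an image by returning -ENOEXEC is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct demo_binfmt {
    const char *name;
    /* Returns 0 if this format accepts the image, -ENOEXEC otherwise. */
    int (*load_binary)(const unsigned char *image, size_t len);
    struct demo_binfmt *next;
};

static int demo_load_elf(const unsigned char *image, size_t len)
{
    return (len >= 4 && memcmp(image, "\177ELF", 4) == 0) ? 0 : -ENOEXEC;
}

static int demo_load_script(const unsigned char *image, size_t len)
{
    return (len >= 2 && image[0] == '#' && image[1] == '!') ? 0 : -ENOEXEC;
}

/* A two-element list: ELF first, then scripts. */
static struct demo_binfmt demo_script_format =
    { "script", demo_load_script, NULL };
static struct demo_binfmt demo_elf_format =
    { "elf", demo_load_elf, &demo_script_format };
static struct demo_binfmt *demo_formats = &demo_elf_format;

/* Offer the image to each format in turn. Returns the name of the
 * first format whose load_binary() accepts it, or NULL if none does. */
const char *demo_search_binary_handler(const unsigned char *image, size_t len)
{
    const struct demo_binfmt *fmt;

    assert(image != NULL);
    for (fmt = demo_formats; fmt != NULL; fmt = fmt->next)
        if (fmt->load_binary(image, len) == 0)
            return fmt->name;
    return NULL;
}
```

Registering a new format in this model would amount to linking another demo_binfmt into the list, after which the loop picks it up without any other change.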
The personality of the process is determined at binary format
loading by the corresponding format's load_binary()
method using some heuristics. For example, to identify UnixWare7
binaries one first marks the binary using the elfmark(1)
utility, which sets the ELF header's e_flags to the
magic value 0x314B4455, which is detected at ELF loading time, and
current->personality is set to PER_UW7. If this
heuristic fails, then a more generic one, such as treating ELF
interpreter paths like /usr/lib/ld.so.1 or
/usr/lib/libc.so.1 as indicating an SVR4 binary, is
used and personality is set to PER_SVR4. One could write a little
utility program that uses Linux's ptrace(2) capabilities
to single-step the code and force a running program into any
personality.
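The order of the two heuristics can be written down as a small decision function. This is an illustrative userspace sketch, not the ELF loader: the enum values stand in for the real PER_* constants, the function name is hypothetical, and the final fallback to the native Linux personality is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_UW7_MAGIC 0x314B4455u   /* e_flags value quoted in the text */

enum demo_personality { DEMO_PER_LINUX, DEMO_PER_SVR4, DEMO_PER_UW7 };

/* Guess a personality from the ELF e_flags field and the interpreter
 * path (interp may be NULL for a binary without an interpreter). */
enum demo_personality demo_guess_personality(uint32_t e_flags,
                                             const char *interp)
{
    /* An explicit mark in e_flags takes priority. */
    if (e_flags == DEMO_UW7_MAGIC)
        return DEMO_PER_UW7;

    /* Fallback: SVR4-style interpreter paths. */
    if (interp != NULL &&
        (strcmp(interp, "/usr/lib/ld.so.1") == 0 ||
         strcmp(interp, "/usr/lib/libc.so.1") == 0))
        return DEMO_PER_SVR4;

    /* Assumed default for this sketch: nothing matched. */
    return DEMO_PER_LINUX;
}
```

Checking the explicit mark first means a marked binary is classified the same way regardless of which interpreter it names.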
Once personality (and therefore
current->exec_domain) is known, the system calls
are handled as follows. Let us assume that a process makes a
system call by means of lcall7 gate instruction. This transfers
control to ENTRY(lcall7) of
arch/i386/kernel/entry.S because it was prepared in
arch/i386/kernel/traps.c:trap_init(). After
appropriate stack layout conversion, entry.S:lcall7
obtains the pointer to exec_domain from
current and then an offset of lcall7 handler within
the exec_domain (which is hardcoded as 4 in asm code
so you can't shift the handler field around in C
declaration of struct exec_domain) and jumps to it.
So, in C, it would look like this:
    static void UW7_lcall7(int segment, struct pt_regs * regs)
    {
            abi_dispatch(regs, &uw7_funcs[regs->eax & 0xff], 1);
    }
where abi_dispatch() is a wrapper around the
table of function pointers that implement this personality's
system calls uw7_funcs.
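The table lookup at the heart of such a handler can be tried out in userspace. The sketch below models only the indexing by the low byte of eax; it is not abi_dispatch(), and every name in it (demo_regs, demo_funcs, the two handlers and their slot numbers) is invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_regs { unsigned long eax, ebx; };   /* tiny stand-in for pt_regs */

typedef long (*demo_syscall_fn)(struct demo_regs *regs);

/* Two made-up handlers. */
static long demo_sys_answer(struct demo_regs *regs)
{
    (void)regs;
    return 42;
}

static long demo_sys_double(struct demo_regs *regs)
{
    return (long)regs->ebx * 2;
}

/* One slot per possible value of the low byte of eax; unlisted
 * slots are NULL. */
static demo_syscall_fn demo_funcs[256] = {
    [7]  = demo_sys_double,
    [20] = demo_sys_answer,
};

/* Pick the handler by the low byte of eax and call it. */
long demo_lcall7(struct demo_regs *regs)
{
    demo_syscall_fn fn;

    assert(regs != NULL);
    fn = demo_funcs[regs->eax & 0xff];   /* index is always within 0..255 */
    return fn ? fn(regs) : -ENOSYS;
}
```

Masking with 0xff keeps the index inside the 256-entry table no matter what the caller placed in the upper bits of eax.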