Every process under Linux is dynamically allocated a
struct task_struct structure. The maximum number of
processes which can be created on Linux is limited only by the
amount of physical memory present, and is equal to (see
kernel/fork.c:fork_init()):
/* * The default maximum number of threads is set to a safe * value: the thread structures can take up at most half * of memory. */ max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
which, on IA32 architecture, basically means
num_physpages/4. As an example, on a 512M machine,
you can create 32k threads. This is a considerable improvement
over the 4k-epsilon limit for older (2.2 and earlier) kernels.
Moreover, this can be changed at runtime using the
KERN_MAX_THREADS sysctl(2), or simply using procfs
interface to kernel tunables:
# cat /proc/sys/kernel/threads-max 32764 # echo 100000 > /proc/sys/kernel/threads-max # cat /proc/sys/kernel/threads-max 100000 # gdb -q vmlinux /proc/kcore Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'. #0 0x0 in ?? () (gdb) p max_threads $1 = 100000
The set of processes on the Linux system is represented as a
collection of struct task_struct structures which
are linked in two ways:
p->next_task and p->prev_task
pointers.The hashtable is called pidhash[] and is defined
in include/linux/sched.h:
/* PID hashing. (shouldnt this be dynamic?) */ #define PIDHASH_SZ (4096 >> 2) extern struct task_struct *pidhash[PIDHASH_SZ]; #define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
The tasks are hashed by their pid value and the above hashing
function is supposed to distribute the elements uniformly in
their domain (0 to PID_MAX-1). The
hashtable is used to quickly find a task by given pid, using
find_task_pid() inline from
include/linux/sched.h:
static inline struct task_struct *find_task_by_pid(int pid) { struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)]; for(p = *htable; p && p->pid != pid; p = p->pidhash_next) ; return p; }
The tasks on each hashlist (i.e. hashed to the same value) are
linked by p->pidhash_next/pidhash_pprev which are
used by hash_pid() and unhash_pid() to
insert and remove a given process into the hashtable. These are
done under protection of the read-write spinlock called
tasklist_lock taken for WRITE.
The circular doubly-linked list that uses
p->next_task/prev_task is maintained so that one
could go through all tasks on the system easily. This is achieved
by the for_each_task() macro from
include/linux/sched.h:
#define for_each_task(p) \ for (p = &init_task ; (p = p->next_task) != &init_task ; )
Users of for_each_task() should take
tasklist_lock for READ. Note that for_each_task() is
using init_task to mark the beginning (and end) of
the list - this is safe because the idle task (pid 0) never
exits.
The modifiers of the process hashtable or/and the process
table links, notably fork(), exit() and
ptrace(), must take tasklist_lock for
WRITE. What is more interesting is that the writers must also
disable interrupts on the local CPU. The reason for this is not
trivial: the send_sigio() function walks the task
list and thus takes tasklist_lock for READ, and it
is called from kill_fasync() in interrupt context.
This is why writers must disable interrupts while readers don't
need to.
Now that we understand how the task_struct
structures are linked together, let us examine the members of
task_struct. They loosely correspond to the members
of UNIX 'struct proc' and 'struct user' combined together.
The other versions of UNIX separated the task state information into one part which should be kept memory-resident at all times (called 'proc structure' which includes process state, scheduling information etc.) and another part which is only needed when the process is running (called 'u area' which includes file descriptor table, disk quota information etc.). The only reason for such ugly design was that memory was a very scarce resource. Modern operating systems (well, only Linux at the moment but others, e.g. FreeBSD seem to improve in this direction towards Linux) do not need such separation and therefore maintain process state in a kernel memory-resident data structure at all times.
The task_struct structure is declared in
include/linux/sched.h and is currently 1680 bytes in
size.
The state field is declared as:
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ #define TASK_RUNNING 0 #define TASK_INTERRUPTIBLE 1 #define TASK_UNINTERRUPTIBLE 2 #define TASK_ZOMBIE 4 #define TASK_STOPPED 8 #define TASK_EXCLUSIVE 32
Why is TASK_EXCLUSIVE defined as 32 and not 16?
Because 16 was used up by TASK_SWAPPING and I forgot
to shift TASK_EXCLUSIVE up when I removed all
references to TASK_SWAPPING (sometime in 2.3.x).
The volatile in p->state
declaration means it can be modified asynchronously (from
interrupt handler):
TASK_RUNNING and placing it
on the runqueue is not atomic. You need to hold the
runqueue_lock read-write spinlock for read in
order to look at the runqueue. If you do so, you will then see
that every task on the runqueue is in TASK_RUNNING
state. However, the converse is not true for the reason
explained above. Similarly, drivers can mark themselves (or
rather the process context they run in) as
TASK_INTERRUPTIBLE (or
TASK_UNINTERRUPTIBLE) and then call
schedule(), which will then remove it from the
runqueue (unless there is a pending signal, in which case it is
left on the runqueue).TASK_INTERRUPTIBLE, except it cannot be woken
up.wait()-ed for) by the parent
(natural or by adoption).TASK_INTERRUPTIBLE or
TASK_UNINTERRUPTIBLE. This means that when this
task is sleeping on a wait queue with many other tasks, it will
be woken up alone instead of causing "thundering herd" problem
by waking up all the waiters.Task flags contain information about the process states which are not mutually exclusive:
unsigned long flags; /* per process flags, defined below */ /* * Per process flags */ #define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */ /* Not implemented yet, only for 486*/ #define PF_STARTING 0x00000002 /* being created */ #define PF_EXITING 0x00000004 /* getting shut down */ #define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */ #define PF_SUPERPRIV 0x00000100 /* used super-user privileges */ #define PF_DUMPCORE 0x00000200 /* dumped core */ #define PF_SIGNALED 0x00000400 /* killed by a signal */ #define PF_MEMALLOC 0x00000800 /* Allocating memory */ #define PF_VFORK 0x00001000 /* Wake up parent in mm_release */ #define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
The fields p->has_cpu,
p->processor, p->counter,
p->priority, p->policy and
p->rt_priority are related to the scheduler and
will be looked at later.
The fields p->mm and
p->active_mm point respectively to the process'
address space described by mm_struct structure and
to the active address space if the process doesn't have a real
one (e.g. kernel threads). This helps minimise TLB flushes on
switching address spaces when the task is scheduled out. So, if
we are scheduling-in the kernel thread (which has no
p->mm) then its next->active_mm
will be set to the prev->active_mm of the task
that was scheduled-out, which will be the same as
prev->mm if prev->mm != NULL. The
address space can be shared between threads if
CLONE_VM flag is passed to the clone(2)
system call or by means of vfork(2) system call.
The fields p->exec_domain and
p->personality relate to the personality of the
task, i.e. to the way certain system calls behave in order to
emulate the "personality" of foreign flavours of UNIX.
The field p->fs contains filesystem
information, which under Linux means three pieces of
information:
This structure also includes a reference count because it can
be shared between cloned tasks when CLONE_FS flag is
passed to the clone(2) system call.
The field p->files contains the file
descriptor table. This too can be shared between tasks, provided
CLONE_FILES is specified with clone(2) system
call.
The field p->sig contains signal handlers and
can be shared between cloned tasks by means of
CLONE_SIGHAND.
Different books on operating systems define a "process" in different ways, starting from "instance of a program in execution" and ending with "that which is produced by clone(2) or fork(2) system calls". Under Linux, there are three kinds of processes:
The idle thread is created at compile time for the first CPU;
it is then "manually" created for each CPU by means of
arch-specific fork_by_hand() in
arch/i386/kernel/smpboot.c, which unrolls the
fork(2) system call by hand (on some archs). Idle tasks
share one init_task structure but have a private TSS structure,
in the per-CPU array init_tss. Idle tasks all have
pid = 0 and no other task can share pid, i.e. use
CLONE_PID flag to clone(2).
Kernel threads are created using kernel_thread()
function which invokes the clone(2) system call in kernel
mode. Kernel threads usually have no user address space, i.e.
p->mm = NULL, because they explicitly do
exit_mm(), e.g. via daemonize()
function. Kernel threads can always access kernel address space
directly. They are allocated pid numbers in the low range.
Running at processor's ring 0 (on x86, that is) implies that the
kernel threads enjoy all I/O privileges and cannot be pre-empted
by the scheduler.
User tasks are created by means of clone(2) or fork(2) system calls, both of which internally invoke kernel/fork.c:do_fork().
Let us understand what happens when a user process makes a
fork(2) system call. Although fork(2) is
architecture-dependent due to the different ways of passing user
stack and registers, the actual underlying function
do_fork() that does the job is portable and is
located at kernel/fork.c.
The following steps are done:
retval is set to
-ENOMEM, as this is the value which
errno should be set to if fork(2) fails to
allocate a new task structure.CLONE_PID is set in
clone_flags then return an error
(-EPERM), unless the caller is the idle thread
(during boot only). So, normal user threads cannot pass
CLONE_PID to clone(2) and expect it to
succeed. For fork(2), this is irrelevant as
clone_flags is set to SIFCHLD - this
is only relevant when do_fork() is invoked from
sys_clone() which passes the
clone_flags from the value requested from
userspace.current->vfork_sem is initialised (it is
later cleared in the child). This is used by
sys_vfork() (vfork(2) system call,
corresponds to clone_flags =
CLONE_VFORK|CLONE_VM|SIGCHLD) to make the parent sleep
until the child does mm_release(), for example as
a result of exec()ing another program or
exit(2)-ing.alloc_task_struct() macro. On x86 it is just a gfp
at GFP_KERNEL priority. This is the first reason
why fork(2) system call may sleep. If this allocation
fails, we return -ENOMEM.*p =
*current. Perhaps this should be replaced by a memcpy?
Later on, the fields that should not be inherited by the child
are set to the correct values.RLIMIT_NPROC soft
limit - if so, fail with -EAGAIN, if not,
increment the count of processes by given uid
p->user->count.-EAGAIN.p->did_exec = 0)p->swappable = 0)p->state = TASK_UNINTERRUPTIBLE (TODO: why is
this done? I think it's not needed - get rid of it, Linus
confirms it is not needed)p->flags are set according to
the value of clone_flags; for plain fork(2), this will
be p->flags = PF_FORKNOEXEC.p->pid is set using the
fast algorithm in kernel/fork.c:get_pid() (TODO:
lastpid_lock spinlock can be made redundant since
get_pid() is always called under big kernel lock
from do_fork(), also remove flags argument of
get_pid(), patch sent to Alan on 20/06/2000 -
followup later).do_fork() initialises
the rest of child's task structure. At the very end, the
child's task structure is hashed into the pidhash
hashtable and the child is woken up (TODO:
wake_up_process(p) sets p->state =
TASK_RUNNING and adds the process to the runq, therefore
we probably didn't need to set p->state to
TASK_RUNNING earlier on in
do_fork()). The interesting part is setting
p->exit_signal to clone_flags &
CSIGNAL, which for fork(2) means just
SIGCHLD and setting
p->pdeath_signal to 0. The
pdeath_signal is used when a process 'forgets' the
original parent (by dying) and can be set/get by means of
PR_GET/SET_PDEATHSIG commands of prctl(2)
system call (You might argue that the way the value of
pdeath_signal is returned via userspace pointer
argument in prctl(2) is a bit silly - mea culpa, after
Andries Brouwer updated the manpage it was too late to fix
;)Thus tasks are created. There are several ways for tasks to terminate:
func == 1
(this is Linux-specific, for compatibility with old
distributions that still had the 'update' line in
/etc/inittab - nowadays the work of update is done
by kernel thread kupdate).Functions implementing system calls under Linux are prefixed
with sys_, but they are usually concerned only with
argument checking or arch-specific ways to pass some information
and the actual work is done by do_ functions. So it
is with sys_exit() which calls
do_exit() to do the work. Although, other parts of
the kernel sometimes invoke sys_exit() while they
should really call do_exit().
The function do_exit() is found in
kernel/exit.c. The points to note about
do_exit():
schedule() at the end, which never
returns.TASK_ZOMBIE.current->pdeath_signal, if not 0.current->exit_signal, which is usually equal to
SIGCHLD.The job of a scheduler is to arbitrate access to the current
CPU between multiple processes. The scheduler is implemented in
the 'main kernel file' kernel/sched.c. The
corresponding header file include/linux/sched.h is
included (either explicitly or indirectly) by virtually every
kernel source file.
The fields of task structure relevant to scheduler include:
p->need_resched: this field is set if
schedule() should be invoked at the 'next
opportunity'.p->counter: number of clock ticks left to
run in this scheduling slice, decremented by a timer. When this
field becomes lower than or equal to zero, it is reset to 0 and
p->need_resched is set. This is also sometimes
called 'dynamic priority' of a process because it can change by
itself.p->priority: the process' static priority,
only changed through well-known system calls like
nice(2), POSIX.1b sched_setparam(2) or
4.4BSD/SVR4 setpriority(2).p->rt_priority: realtime priorityp->policy: the scheduling policy, specifies
which scheduling class the task belongs to. Tasks can change
their scheduling class using the sched_setscheduler(2)
system call. The valid values are SCHED_OTHER
(traditional UNIX process), SCHED_FIFO (POSIX.1b
FIFO realtime process) and SCHED_RR (POSIX
round-robin realtime process). One can also OR
SCHED_YIELD to any of these values to signify that
the process decided to yield the CPU, for example by calling
sched_yield(2) system call. A FIFO realtime process will
run until either a) it blocks on I/O, b) it explicitly yields
the CPU or c) it is preempted by another realtime process with
a higher p->rt_priority value.
SCHED_RR is the same as SCHED_FIFO,
except that when its timeslice expires it goes back to the end
of the runqueue.The scheduler's algorithm is simple, despite the great
apparent complexity of the schedule() function. The
function is complex because it implements three scheduling
algorithms in one and also because of the subtle
SMP-specifics.
The apparently 'useless' gotos in schedule() are
there for a purpose - to generate the best optimised (for i386)
code. Also, note that scheduler (like most of the kernel) was
completely rewritten for 2.4, therefore the discussion below does
not apply to 2.2 or earlier kernels.
Let us look at the function in detail:
current->active_mm == NULL then
something is wrong. Current process, even a kernel thread
(current->mm == NULL) must have a valid
p->active_mm at all times.tq_scheduler task queue, process it now. Task
queues provide a kernel mechanism to schedule execution of
functions at a later time. We shall look at it in details
elsewhere.prev and
this_cpu to current task and current CPU
respectively.schedule() was invoked from interrupt
handler (due to a bug) and panic if so.struct schedule_data
*sched_data to point to per-CPU (cacheline-aligned to
prevent cacheline ping-pong) scheduling data area, which
contains the TSC value of last_schedule and the
pointer to last scheduled task structure (TODO:
sched_data is used on SMP only but why does
init_idle() initialises it on UP as well?).runqueue_lock spinlock is taken. Note that we
use spin_lock_irq() because in
schedule() we guarantee that interrupts are
enabled. Therefore, when we unlock runqueue_lock,
we can just re-enable them instead of saving/restoring eflags
(spin_lock_irqsave/restore variant).TASK_RUNNING state, it is left alone; if it is in
TASK_INTERRUPTIBLE state and a signal is pending,
it is moved into TASK_RUNNING state. In all other
cases, it is deleted from the runqueue.next (best candidate to be scheduled) is set
to the idle task of this cpu. However, the goodness of this
candidate is set to a very low value (-1000), in hope that
there is someone better than that.prev (current) task is in
TASK_RUNNING state, then the current goodness is
set to its goodness and it is marked as a better candidate to
be scheduled than the idle task.goodness(), which
treats realtime processes by making their goodness very high
(1000 + p->rt_priority), this being greater
than 1000 guarantees that no SCHED_OTHER process
can win; so they only contend with other realtime processes
that may have a greater p->rt_priority. The
goodness function returns 0 if the process' time slice
(p->counter) is over. For non-realtime
processes, the initial value of goodness is set to
p->counter - this way, the process is less
likely to get CPU if it already had it for a while, i.e.
interactive processes are favoured more than CPU bound number
crunchers. The arch-specific constant
PROC_CHANGE_PENALTY attempts to implement "cpu
affinity" (i.e. give advantage to a process on the same CPU).
It also gives a slight advantage to processes with mm pointing
to current active_mm or to processes with no
(user) address space, i.e. kernel threads.Note that the we drop the
recalculate: { struct task_struct *p; spin_unlock_irq(&runqueue_lock); read_lock(&tasklist_lock); for_each_task(p) p->counter = (p->counter >> 1) + p->priority; read_unlock(&tasklist_lock); spin_lock_irq(&runqueue_lock); }
runqueue_lock before we recalculate. The reason
is that we go through entire set of processes; this can take
a long time, during which the schedule() could
be called on another CPU and select a process with goodness
good enough for that CPU, whilst we on this CPU were forced
to recalculate. Ok, admittedly this is somewhat inconsistent
because while we (on this CPU) are selecting a process with
the best goodness, schedule() running on another
CPU could be recalculating dynamic priorities.
next
points to the task to be scheduled, so we initialise
next->has_cpu to 1 and
next->processor to this_cpu. The
runqueue_lock can now be unlocked.next ==
prev) then we can simply reacquire the global kernel
lock and return, i.e. skip all the hardware-level (registers,
stack etc.) and VM-related (switch page directory, recalculate
active_mm etc.) stuff.switch_to() is architecture
specific. On i386, it is concerned with a) FPU handling, b) LDT
handling, c) reloading segment registers, d) TSS handling and
e) reloading debug registers.Before we go on to examine implementation of wait queues, we
must acquaint ourselves with the Linux standard doubly-linked
list implementation. Wait queues (as well as everything else in
Linux) make heavy use of them and they are called in jargon
"list.h implementation" because the most relevant file is
include/linux/list.h.
The fundamental data structure here is struct
list_head:
struct list_head { struct list_head *next, *prev; }; #define LIST_HEAD_INIT(name) { &(name), &(name) } #define LIST_HEAD(name) \ struct list_head name = LIST_HEAD_INIT(name) #define INIT_LIST_HEAD(ptr) do { \ (ptr)->next = (ptr); (ptr)->prev = (ptr); \ } while (0) #define list_entry(ptr, type, member) \ ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member))) #define list_for_each(pos, head) \ for (pos = (head)->next; pos != (head); pos = pos->next)
The first three macros are for initialising an empty list by
pointing both next and prev pointers to
itself. It is obvious from C syntactical restrictions which ones
should be used where - for example, LIST_HEAD_INIT()
can be used for structure's element initialisation in
declaration, the second can be used for static variable
initialising declarations and the third can be used inside a
function.
The macro list_entry() gives access to individual
list element, for example (from
fs/file_table.c:fs_may_remount_ro()):
struct super_block { ... struct list_head s_files; ... } *sb = &some_super_block; struct file { ... struct list_head f_list; ... } *file; struct list_head *p; for (p = sb->s_files.next; p != &sb->s_files; p = p->next) { struct file *file = list_entry(p, struct file, f_list); do something to 'file' }
A good example of the use of list_for_each()
macro is in the scheduler where we walk the runqueue looking for
the process with highest goodness:
static LIST_HEAD(runqueue_head); struct list_head *tmp; struct task_struct *p; list_for_each(tmp, &runqueue_head) { p = list_entry(tmp, struct task_struct, run_list); if (can_schedule(p)) { int weight = goodness(p, this_cpu, prev->active_mm); if (weight > c) c = weight, next = p; } }
Here, p->run_list is declared as struct
list_head run_list inside task_struct
structure and serves as anchor to the list. Removing an element
from the list and adding (to head or tail of the list) is done by
list_del()/list_add()/list_add_tail() macros. The
examples below are adding and removing a task from runqueue:
static inline void del_from_runqueue(struct task_struct * p) { nr_running--; list_del(&p->run_list); p->run_list.next = NULL; } static inline void add_to_runqueue(struct task_struct * p) { list_add(&p->run_list, &runqueue_head); nr_running++; } static inline void move_last_runqueue(struct task_struct * p) { list_del(&p->run_list); list_add_tail(&p->run_list, &runqueue_head); } static inline void move_first_runqueue(struct task_struct * p) { list_del(&p->run_list); list_add(&p->run_list, &runqueue_head); }
When a process requests the kernel to do something which is currently impossible but that may become possible later, the process is put to sleep and is woken up when the request is more likely to be satisfied. One of the kernel mechanisms used for this is called a 'wait queue'.
Linux implementation allows wake-on semantics using
TASK_EXCLUSIVE flag. With waitqueues, you can either
use a well-known queue and then simply
sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout,
or you can define your own waitqueue and use
add/remove_wait_queue to add and remove yourself
from it and wake_up/wake_up_interruptible to wake up
when needed.
An example of the first usage of waitqueues is interaction
between the page allocator (in
mm/page_alloc.c:__alloc_pages()) and the
kswapd kernel daemon (in
mm/vmscan.c:kswap()), by means of wait queue
kswapd_wait, declared in mm/vmscan.c;
the kswapd daemon sleeps on this queue, and it is
woken up whenever the page allocator needs to free up some
pages.
An example of autonomous waitqueue usage is interaction
between user process requesting data via read(2) system
call and kernel running in the interrupt context to supply the
data. An interrupt handler might look like (simplified
drivers/char/rtc_interrupt()):
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait); void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs) { spin_lock(&rtc_lock); rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS); spin_unlock(&rtc_lock); wake_up_interruptible(&rtc_wait); }
So, the interrupt handler obtains the data by reading from
some device-specific I/O port (CMOS_READ() macro
turns into a couple outb/inb) and then wakes up
whoever is sleeping on the rtc_wait wait queue.
Now, the read(2) system call could be implemented as:
ssize_t rtc_read(struct file file, char *buf, size_t count, loff_t *ppos) { DECLARE_WAITQUEUE(wait, current); unsigned long data; ssize_t retval; add_wait_queue(&rtc_wait, &wait); current->state = TASK_INTERRUPTIBLE; do { spin_lock_irq(&rtc_lock); data = rtc_irq_data; rtc_irq_data = 0; spin_unlock_irq(&rtc_lock); if (data != 0) break; if (file->f_flags & O_NONBLOCK) { retval = -EAGAIN; goto out; } if (signal_pending(current)) { retval = -ERESTARTSYS; goto out; } schedule(); } while(1); retval = put_user(data, (unsigned long *)buf); if (!retval) retval = sizeof(unsigned long); out: current->state = TASK_RUNNING; remove_wait_queue(&rtc_wait, &wait); return retval; }
What happens in rtc_read() is this:
rtc_wait
waitqueue.TASK_INTERRUPTIBLE
which means it will not be rescheduled after the next time it
sleeps.TASK_RUNNING, remove ourselves from the wait queue
and returnEAGAIN (which is the same as
EWOULDBLOCK)TASK_INTERRUPTIBLE then the scheduler could
schedule us sooner than when the data is available, thus
causing unneeded processing.It is also worth pointing out that, using wait queues, it is rather easy to implement the poll(2) system call:
static unsigned int rtc_poll(struct file *file, poll_table *wait) { unsigned long l; poll_wait(file, &rtc_wait, wait); spin_lock_irq(&rtc_lock); l = rtc_irq_data; spin_unlock_irq(&rtc_lock); if (l != 0) return POLLIN | POLLRDNORM; return 0; }
All the work is done by the device-independent function
poll_wait() which does the necessary waitqueue
manipulations; all we need to do is point it to the waitqueue
which is woken up by our device-specific interrupt handler.
Now let us turn our attention to kernel timers. Kernel timers
are used to dispatch execution of a particular function (called
'timer handler') at a specified time in the future. The main data
structure is struct timer_list declared in
include/linux/timer.h:
struct timer_list { struct list_head list; unsigned long expires; unsigned long data; void (*function)(unsigned long); volatile int running; };
The list field is for linking into the internal
list, protected by the timerlist_lock spinlock. The
expires field is the value of jiffies
when the function handler should be invoked with
data passed as a parameter. The running
field is used on SMP to test if the timer handler is currently
running on another CPU.
The functions add_timer() and
del_timer() add and remove a given timer to the
list. When a timer expires, it is removed automatically. Before a
timer is used, it MUST be initialised by means of
init_timer() function. And before it is added, the
fields function and expires must be
set.
Sometimes it is reasonable to split the amount of work to be performed inside an interrupt handler into immediate work (e.g. acknowledging the interrupt, updating the stats etc.) and work which can be postponed until later, when interrupts are enabled (e.g. to do some postprocessing on data, wake up processes waiting for this data, etc).
Bottom halves are the oldest mechanism for deferred execution of kernel tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism was added, called 'task queues', which will be the subject of next section.
Bottom halves are serialised by the
global_bh_lock spinlock, i.e. there can only be one
bottom half running on any CPU at a time. However, when
attempting to execute the handler, if global_bh_lock
is not available, the bottom half is marked (i.e. scheduled) for
execution - so processing can continue, as opposed to a busy loop
on global_bh_lock.
There can only be 32 bottom halves registered in total. The functions required to manipulate bottom halves are as follows (all exported to modules):
void init_bh(int nr, void (*routine)(void)):
installs a bottom half handler pointed to by
routine argument into slot nr. The
slot ought to be enumerated in
include/linux/interrupt.h in the form
XXXX_BH, e.g. TIMER_BH or
TQUEUE_BH. Typically, a subsystem's initialisation
routine (init_module() for modules) installs the
required bottom half using this function.void remove_bh(int nr): does the opposite of
init_bh(), i.e. de-installs bottom half installed
at slot nr. There is no error checking performed
there, so, for example remove_bh(32) will
panic/oops the system. Typically, a subsystem's cleanup routine
(cleanup_module() for modules) uses this function
to free up the slot that can later be reused by some other
subsystem. (TODO: wouldn't it be nice to have
/proc/bottom_halves list all registered bottom
halves on the system? That means global_bh_lock
must be made read/write, obviously)void mark_bh(int nr): marks bottom half in
slot nr for execution. Typically, an interrupt
handler will mark its bottom half (hence the name!) for
execution at a "safer time".Bottom halves are globally locked tasklets, so the question
"when are bottom half handlers executed?" is really "when are
tasklets executed?". And the answer is, in two places: a) on each
schedule() and b) on each interrupt/syscall return
path in entry.S (TODO: therefore, the
schedule() case is really boring - it like adding
yet another very very slow interrupt, why not get rid of
handle_softirq label from schedule()
altogether?).
Task queues can be though of as a dynamic extension to old bottom halves. In fact, in the source code they are sometimes referred to as "new" bottom halves. More specifically, the old bottom halves discussed in previous section have these limitations:
So, with task queues, arbitrary number of functions can be
chained and processed one after another at a later time. One
creates a new task queue using the
DECLARE_TASK_QUEUE() macro and queues a task onto it
using the queue_task() function. The task queue then
can be processed using run_task_queue(). Instead of
creating your own task queue (and having to consume it manually)
you can use one of Linux' predefined task queues which are
consumed at well-known points:
tq_timer tasks also run in
interrupt context and thus cannot block.tq_timer). Since the scheduler executed in the
context of the process being re-scheduled, the
tq_scheduler tasks can do anything they like, i.e.
block, use process context data (but why would they want to),
etc.IMMEDIATE_BH, so drivers can
queue_task(task, &tq_immediate) and then
mark_bh(IMMEDIATE_BH) to be consumed in interrupt
context.Unless a driver uses its own task queues, it does not need to
call run_tasks_queues() to process the queue, except
under circumstances explained below.
The reason tq_timer/tq_scheduler task queues are
consumed not only in the usual places but elsewhere (closing tty
device is but one example) becomes clear if one remembers that
the driver can schedule tasks on the queue, and these tasks only
make sense while a particular instance of the device is still
valid - which usually means until the application closes it. So,
the driver may need to call run_task_queue() to
flush the tasks it (and anyone else) has put on the queue,
because allowing them to run at a later time may make no sense -
i.e. the relevant data structures may have been freed/reused by a
different instance. This is the reason you see
run_task_queue() on tq_timer and
tq_scheduler in places other than timer interrupt
and schedule() respectively.
Not yet, will be in future revision.
Not yet, will be in future revision.
There are two mechanisms under Linux for implementing system calls:
Native Linux programs use int 0x80 whilst binaries from foreign flavours of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is historically misleading because it also covers lcall27 (e.g. Solaris/x86), but the handler function is called lcall7_func.
When the system boots, the function
arch/i386/kernel/traps.c:trap_init() is called which
sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to
the address of system_call entry from
arch/i386/kernel/entry.S.
When a userspace application makes a system call, the
arguments are passed via registers and the application executes
'int 0x80' instruction. This causes a trap into kernel mode and
processor jumps to system_call entry point in
entry.S. What this does is:
NR_syscalls (currently 256), fail with
ENOSYS error.tsk->ptrace &
PF_TRACESYS), do special processing. This is to support
programs like strace (analogue of SVR4 truss(1)) or
debuggers.sys_call_table+4*(syscall_number from
%eax). This table is initialised in the same file
(arch/i386/kernel/entry.S) to point to individual
system call handlers which under Linux are (usually) prefixed
with sys_, e.g. sys_open,
sys_exit, etc. These C system call handlers will
find their arguments on the stack where SAVE_ALL
stored them.schedule() is needed
(tsk->need_resched != 0), checking if there are
signals pending and if so handling them.Linux supports up to 6 arguments for system calls. They are
passed in %ebx, %ecx, %edx, %esi, %edi (and %ebp used
temporarily, see _syscall6() in
asm-i386/unistd.h). The system call number is passed
via %eax.
There are two types of atomic operations: bitmaps and
atomic_t. Bitmaps are very convenient for
maintaining a concept of "allocated" or "free" units from some
large collection where each unit is identified by some number,
for example free inodes or free blocks. They are also widely used
for simple locking, for example to provide exclusive access to
open a device. An example of this can be found in
arch/i386/kernel/microcode.c:
/* * Bits in microcode_status. (31 bits of room for future expansion) */ #define MICROCODE_IS_OPEN 0 /* set if device is in use */ static unsigned long microcode_status;
There is no need to initialise microcode_status
to 0 as BSS is zero-cleared under Linux explicitly.
/* * We enforce only one user at a time here with open/close. */ static int microcode_open(struct inode *inode, struct file *file) { if (!capable(CAP_SYS_RAWIO)) return -EPERM; /* one at a time, please */ if (test_and_set_bit(MICROCODE_IS_OPEN, µcode_status)) return -EBUSY; MOD_INC_USE_COUNT; return 0; }
The operations on bitmaps are:
nr in the bitmap pointed to by
addr.nr in the bitmap pointed to by
addr.nr (if set clear, if clear set) in the bitmap
pointed to by addr.nr and return the old bit
value.nr and return the old bit
value.nr and return
the old bit value.These operations use the LOCK_PREFIX macro, which
on SMP kernels evaluates to bus lock instruction prefix and to
nothing on UP. This guarantees atomicity of access in SMP
environment.
Sometimes bit manipulations are not convenient, but instead we
need to perform arithmetic operations - add, subtract, increment
decrement. The typical cases are reference counts (e.g. for
inodes). This facility is provided by the atomic_t
data type and the following operations:
atomic_t variable v.atomic_t variable v to integer
i.i to the value of atomic variable pointed
to by v.i from the value of atomic
variable pointed to by v.i from the value of
atomic variable pointed to by v; return 1 if the
new value is 0, return 0 otherwise.i to v and
return 1 if the result is negative. Return 0 if the result is
greater than or equal to 0. This operation is used for
implementing semaphores.Since the early days of Linux support (early 90s, this century), developers were faced with the classical problem of accessing shared data between different types of context (user process vs interrupt) and different instances of the same context from multiple cpus.
SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch was made to 1.3.37 in October the same year).
If the critical region of code may be executed by either
process context and interrupt context, then the way to protect it
using cli/sti instructions on UP is:
unsigned long flags; save_flags(flags); cli(); /* critical code */ restore_flags(flags);
While this is ok on UP, it obviously is of no use on SMP
because the same code sequence may be executed simultaneously on
another cpu, and while cli() provides protection
against races with interrupt context on each CPU individually, it
provides no protection at all against races between contexts
running on different CPUs. This is where spinlocks are useful
for.
There are three types of spinlocks: vanilla (basic),
read-write and big-reader spinlocks. Read-write spinlocks should
be used when there is a natural tendency of 'many readers and few
writers'. Example of this is access to the list of registered
filesystems (see fs/super.c). The list is guarded by
the file_systems_lock read-write spinlock because
one needs exclusive access only when registering/unregistering a
filesystem, but any process can read the file
/proc/filesystems or use the sysfs(2) system
call to force a read-only scan of the file_systems list. This
makes it sensible to use read-write spinlocks. With read-write
spinlocks, one can have multiple readers at a time but only one
writer and there can be no readers while there is a writer. Btw,
it would be nice if new readers would not get a lock while there
is a writer trying to get a lock, i.e. if Linux could correctly
deal with the issue of potential writer starvation by multiple
readers. This would mean that readers must be blocked while there
is a writer attempting to get the lock. This is not currently the
case and it is not obvious whether this should be fixed - the
argument to the contrary is - readers usually take the lock for a
very short time so should they really be starved while the writer
takes the lock for potentially longer periods?
Big-reader spinlocks are a form of read-write spinlocks heavily optimised for very light read access, with a penalty for writes. There is a limited number of big-reader spinlocks - currently only two exist, of which one is used only on sparc64 (global irq) and the other is used for networking. In all other cases where the access pattern does not fit into any of these two scenarios, one should use basic spinlocks. You cannot block while holding any kind of spinlock.
Spinlocks come in three flavours: plain, _irq()
and _bh().
spin_lock()/spin_unlock(): if you know
the interrupts are always disabled or if you do not race with
interrupt context (e.g. from within interrupt handler), then
you can use this one. It does not touch interrupt state on the
current CPU.spin_lock_irq()/spin_unlock_irq(): if you know
that interrupts are always enabled then you can use this
version, which simply disables (on lock) and re-enables (on
unlock) interrupts on the current CPU. For example,
rtc_read() uses
spin_lock_irq(&rtc_lock) (interrupts are
always enabled inside read()) whilst
rtc_interrupt() uses
spin_lock(&rtc_lock) (interrupts are always
disabled inside interrupt handler). Note that
rtc_read() uses spin_lock_irq() and
not the more generic spin_lock_irqsave() because
on entry to any system call interrupts are always enabled.spin_lock_irqsave()/spin_unlock_irqrestore():
the strongest form, to be used when the interrupt state is not
known, but only if interrupts matter at all, i.e. there is no
point in using it if our interrupt handlers don't execute any
critical code.The reason you cannot use plain spin_lock() if
you race against interrupt handlers is because if you take it and
then an interrupt comes in on the same CPU, it will busy wait for
the lock forever: the lock holder, having been interrupted, will
not continue until the interrupt handler returns.
The most common usage of a spinlock is to access a data structure shared between user process context and interrupt handlers:
spinlock_t my_lock = SPIN_LOCK_UNLOCKED; my_ioctl() { spin_lock_irq(&my_lock); /* critical section */ spin_unlock_irq(&my_lock); } my_irq_handler() { spin_lock(&lock); /* critical section */ spin_unlock(&lock); }
There are a couple of things to note about this example:
ioctl() (arguments and return values
omitted for clarity), must use spin_lock_irq()
because it knows that interrupts are always enabled while
executing the device ioctl() method.my_irq_handler() (again arguments omitted for
clarity) can use plain spin_lock() form because
interrupts are disabled inside an interrupt handler.Sometimes, while accessing a shared data structure, one must perform operations that can block, for example copy data to userspace. The locking primitive available for such scenarios under Linux is called a semaphore. There are two types of semaphores: basic and read-write semaphores. Depending on the initial value of the semaphore, they can be used for either mutual exclusion (initial value of 1) or to provide more sophisticated type of access.
Read-write semaphores differ from basic semaphores in the same way as read-write spinlocks differ from basic spinlocks: one can have multiple readers at a time but only one writer and there can be no readers while there are writers - i.e. the writer blocks all readers and new readers block while a writer is waiting.
Also, basic semaphores can be interruptible - just use the
operations down/up_interruptible() instead of the
plain down()/up() and check the value returned from
down_interruptible(): it will be non zero if the
operation was interrupted.
Using semaphores for mutual exclusion is ideal in situations where a critical code section may call by reference unknown functions registered by other subsystems/modules, i.e. the caller cannot know apriori whether the function blocks or not.
A simple example of semaphore usage is in
kernel/sys.c, implementation of
gethostname(2)/sethostname(2) system calls.
asmlinkage long sys_sethostname(char *name, int len) { int errno; if (!capable(CAP_SYS_ADMIN)) return -EPERM; if (len < 0 || len > __NEW_UTS_LEN) return -EINVAL; down_write(&uts_sem); errno = -EFAULT; if (!copy_from_user(system_utsname.nodename, name, len)) { system_utsname.nodename[len] = 0; errno = 0; } up_write(&uts_sem); return errno; } asmlinkage long sys_gethostname(char *name, int len) { int i, errno; if (len < 0) return -EINVAL; down_read(&uts_sem); i = 1 + strlen(system_utsname.nodename); if (i > len) i = len; errno = 0; if (copy_to_user(name, system_utsname.nodename, i)) errno = -EFAULT; up_read(&uts_sem); return errno; }
The points to note about this example are:
copy_from_user()/copy_to_user().
Therefore they could not use any form of spinlock here.Although Linux implementation of semaphores and read-write semaphores is very sophisticated, there are possible scenarios one can think of which are not yet implemented, for example there is no concept of interruptible read-write semaphores. This is obviously because there are no real-world situations which require these exotic flavours of the primitives.
Linux is a monolithic operating system and despite all the modern hype about some "advantages" offered by operating systems based on micro-kernel design, the truth remains (quoting Linus Torvalds himself):
... message passing as the fundamental operation of the
OS is just an exercise in computer science masturbation. It may
feel good, but you don't actually get anything DONE.
Therefore, Linux is and will always be based on a monolithic design, which means that all subsystems run in the same privileged mode and share the same address space; communication between them is achieved by the usual C function call means.
However, although separating kernel functionality into
separate "processes" as is done in micro-kernels is definitely a
bad idea, separating it into dynamically loadable on demand
kernel modules is desirable in some circumstances (e.g. on
machines with low memory or for installation kernels which could
otherwise contain ISA auto-probing device drivers that are
mutually exclusive). The decision whether to include support for
loadable modules is made at compile time and is determined by the
CONFIG_MODULES option. Support for module
autoloading via request_module() mechanism is a
separate compilation option (CONFIG_KMOD).
The following functionality can be implemented as loadable modules under Linux:
/proc and in devfs
(e.g. /dev/cpu/microcode vs
/dev/misc/microcode).There a few things that cannot be implemented as modules under Linux (probably because it makes no sense for them to be modularised):
Linux provides several system calls to assist in loading modules:
caddr_t create_module(const char *name, size_t
size): allocates size bytes using
vmalloc() and maps a module structure at the
beginning thereof. This new module is then linked into the list
headed by module_list. Only a process with
CAP_SYS_MODULE can invoke this system call, others
will get EPERM returned.long init_module(const char *name, struct module
*image): loads the relocated module image and causes the
module's initialisation routine to be invoked. Only a process
with CAP_SYS_MODULE can invoke this system call,
others will get EPERM returned.long delete_module(const char *name): attempts
to unload the module. If name == NULL, attempt is
made to unload all unused modules.long query_module(const char *name, int which, void
*buf, size_t bufsize, size_t *ret): returns information
about a module (or about all modules).The command interface available to users consists of:
Apart from being able to load a module manually using either
insmod or modprobe, it is also possible to have the
module inserted automatically by the kernel when a particular
functionality is required. The kernel interface for this is the
function called request_module(name) which is
exported to modules, so that modules can load other modules as
well. The request_module(name) internally creates a
kernel thread which execs the userspace command modprobe -s -k
module_name, using the standard
exec_usermodehelper() kernel interface (which is
also exported to modules). The function returns 0 on success,
however it is usually not worth checking the return code from
request_module(). Instead, the programming idiom
is:
if (check_some_feature() == NULL) request_module(module); if (check_some_feature() == NULL) return -ENODEV;
For example, this is done by
fs/block_dev.c:get_blkfops() to load a module
block-major-N when attempt is made to open a block
device with major N. Obviously, there is no such
module called block-major-N (Linux developers only
chose sensible names for their modules) but it is mapped to a
proper module name using the file /etc/modules.conf.
However, for most well-known major numbers (and other kinds of
modules) the modprobe/insmod commands know which real
module to load without needing an explicit alias statement in
/etc/modules.conf.
A good example of loading a module is inside the
mount(2) system call. The mount(2) system call
accepts the filesystem type as a string which
fs/super.c:do_mount() then passes on to
fs/super.c:get_fs_type():
static struct file_system_type *get_fs_type(const char *name) { struct file_system_type *fs; read_lock(&file_systems_lock); fs = *(find_filesystem(name)); if (fs && !try_inc_mod_count(fs->owner)) fs = NULL; read_unlock(&file_systems_lock); if (!fs && (request_module(name) == 0)) { read_lock(&file_systems_lock); fs = *(find_filesystem(name)); if (fs && !try_inc_mod_count(fs->owner)) fs = NULL; read_unlock(&file_systems_lock); } return fs; }
A few things to note in this function:
file_systems_lock taken for read (as we are not
modifying the list of registered filesystems).try_inc_mod_count() returned 0 then we consider it
a failure - i.e. if the module is there but is being deleted,
it is as good as if it were not there at all.file_systems_lock because what we
are about to do next (request_module()) is a
blocking operation, and therefore we can't hold a spinlock over
it. Actually, in this specific case, we would have to drop
file_systems_lock anyway, even if
request_module() were guaranteed to be
non-blocking and the module loading were executed in the same
context atomically. The reason for this is that the module's
initialisation function will try to call
register_filesystem(), which will take the same
file_systems_lock read-write spinlock for
write.file_systems_lock spinlock and try to locate the
newly registered filesystem in the list. Note that this is
slightly wrong because it is in principle possible for a bug in
modprobe command to cause it to coredump after it successfully
loaded the requested module, in which case
request_module() will fail even though the new
filesystem will be registered, and yet
get_fs_type() won't find it.When a module is loaded into the kernel, it can refer to any
symbols that are exported as public by the kernel using
EXPORT_SYMBOL() macro or by other currently loaded
modules. If the module uses symbols from another module, it is
marked as depending on that module during dependency
recalculation, achieved by running depmod -a command on
boot (e.g. after installing a new kernel).
Usually, one must match the set of modules with the version of
the kernel interfaces they use, which under Linux simply means
the "kernel version" as there is no special kernel interface
versioning mechanism in general. However, there is a limited
functionality called "module versioning" or
CONFIG_MODVERSIONS which allows to avoid recompiling
modules when switching to a new kernel. What happens here is that
the kernel symbol table is treated differently for internal
access and for access from modules. The elements of public (i.e.
exported) part of the symbol table are built by 32bit
checksumming the C declaration. So, in order to resolve a symbol
used by a module during loading, the loader must match the full
representation of the symbol that includes the checksum; it will
refuse to load the module if these symbols differ. This only
happens when both the kernel and the module are compiled with
module versioning enabled. If either one of them uses the
original symbol names, the loader simply tries to match the
kernel version declared by the module and the one exported by the
kernel and refuses to load if they differ.