
LKM Development Concepts & Considerations

I "recently" (hey, it's been busy) wrote a post outlining the creation of DrawBridge, a single packet authentication kernel module. It was a fun side project to work on, but what's arguably cooler is how many Operating System concepts from classes I've taken resurfaced and found real-world application in the code I was writing.

So if you're about to embark on writing your first kernel module here are some things you need to consider (with some code examples from DrawBridge):

Loading & Unloading:

The best thing about the Linux kernel is that it's open source, so we can verify our understanding of how things are implemented by browsing the source. Bootlin is an awesome resource if you're doing any LKM development. The bulk of the kernel-space module loading logic lives in linux/kernel/module.c. Typical module loading starts when a usermode process invokes the "init_module" system call. Looking up the syscall in the kernel source, we can see that it calls load_module:

SYSCALL_DEFINE3(init_module, void __user *, umod,
        unsigned long, len, const char __user *, uargs)
{
    int err;
    struct load_info info = { };

    err = may_init_module();
    if (err)
        return err;

    pr_debug("init_module: umod=%p, len=%lu, uargs=%p\n",
           umod, len, uargs);

    err = copy_module_from_user(umod, len, &info);
    if (err)
        return err;

    return load_module(&info, uargs, 0);
}

The load_module function allocates memory for the ELF object file, and performs signature verification. If the kernel is compiled with CONFIG_MODULE_SIG=y and the module fails signature verification, the kernel will be tainted:

#ifdef CONFIG_MODULE_SIG
    mod->sig_ok = info->sig_ok;
    if (!mod->sig_ok) {
        pr_notice_once("%s: module verification failed: signature "
                   "and/or required key missing - tainting "
                   "kernel\n", mod->name);
        add_taint_module(mod, TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
    }
#endif

Additionally, check_modinfo, which is called from within layout_and_allocate, performs further sanity checks such as vermagic and license verification. At the end of load_module, the do_init_module function actually starts the module; it can be referenced here. After the call to do_one_initcall, the return code of the module's initialization function is checked and the module state is set to MODULE_STATE_LIVE:

/* Start the module */
if (mod->init != NULL)
    ret = do_one_initcall(mod->init);
if (ret < 0) {
    goto fail_free_freeinit;
}
if (ret > 0) {
    pr_warn("%s: '%s'->init suspiciously returned %d, it should "
        "follow 0/-E convention\n"
        "%s: loading module anyway...\n",
        __func__, mod->name, ret, __func__);
    dump_stack();
}

/* Now it's a first class citizen! */
mod->state = MODULE_STATE_LIVE;

Unlike usermode processes, which always have at least one running thread, kernel modules don't automatically get a running execution context when loaded. When writing your kernel module you must provide at least two functions: an initialization function, which is called when the module is loaded into the kernel (the ret = do_one_initcall(mod->init); call above), and a cleanup function, which is called just before the module is removed with rmmod. A basic module initialization and unloading looks like the following:


#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

static int hello_init(void)
{
    printk(KERN_ALERT "Module loaded - Hello from the kernel.\n");
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_ALERT "Goodbye, module unloaded\n");
}

// Register the initialization and exit functions
module_init(hello_init);
module_exit(hello_exit);

// Declare a license so the kernel isn't tainted on load
MODULE_LICENSE("GPL");

When the module above is loaded, hello_init will execute, but after that nothing else runs until the module is unloaded and hello_exit is called. During initialization it is up to the programmer to create a new running thread, register callbacks (such as our Netfilter hook), or queue tasks to be performed by other threads. Scheduling deferred work can be done using workqueues or tasklets, which will be handled by kworker threads. A little background on deferred work APIs in the kernel can be found here. For DrawBridge, three separate initialization actions were performed:


  • Creating a listener thread.

  • Registering the Netfilter hook.

  • Registering a deferred work timer that periodically removes expired connections.

Since I covered registering the Netfilter hook in the last post and the timer API is pretty straightforward, I'll only cover thread creation here. If you want to dig into the timer code, there is a good article on it and the relevant DrawBridge code is located here.

Thread Creation & Cleanup:

To create a thread in the kernel, a call to kthread_create is made, which takes a function pointer, its argument, and a thread name. In this case our new thread will execute the listen function. Here is the relevant code snippet from nf_conntrack_knock_init:


// Start kernel thread raw socket to listen for triggers
raw_thread = kthread_create(&listen, NULL, MODULE_NAME);

// Check for an error pointer BEFORE taking a reference -
// calling get_task_struct on an ERR_PTR would be a bug
if(IS_ERR(raw_thread)) {
    printk(KERN_INFO "[-] Unable to start child thread\n");
    return PTR_ERR(raw_thread);
}

// Increment the reference counter - preserve the task_struct
// even after the thread exits
get_task_struct(raw_thread);

// Now it is safe to start the kthread -
// exiting from it doesn't destroy its task_struct
wake_up_process(raw_thread);

At this point you're probably wondering what the call to get_task_struct accomplishes. Because we created this thread, we want to be able to stop it when our module is unloaded, so we need a valid reference to the thread's task_struct. The functions get_task_struct and put_task_struct manage a task_struct's reference count, tsk->usage, which is an atomic integer declared in the task_struct definition:

atomic_t            usage;

The atomic operations on this variable via the two methods above can be referenced in linux/sched/task.h.

After we've increased the reference count with get_task_struct, it is safe to access that task_struct until the reference count hits zero. If you call put_task_struct and you hold the last reference, the kernel knows it can free the slab allocation backing the task_struct. Without increasing the reference count, if our child thread exits and we later attempt to access its task_struct, we could encounter undefined behavior and possibly crash our module or the kernel. Here is the corresponding thread cleanup in DrawBridge's cleanup function (the variable named raw_thread in the snippets is the task_struct pointer):


if(raw_thread) {

    err = kthread_stop(raw_thread); // stop the thread
    put_task_struct(raw_thread);    // decrease reference count
    raw_thread = NULL;
    printk(KERN_INFO "[*] Stopped counterpart thread\n");

} else {
    printk(KERN_INFO "[!] no kernel thread to kill\n");
}

The listen function being executed can be found in xt_listen.c here. Now that we've created a thread, we're responsible for managing it, which leads us to a discussion of scheduling.


Scheduling:

An important aspect of programming in the kernel is that you are responsible for letting the scheduler know that your thread is interruptible and can be preempted for a context switch. Kernel threads are typically given higher priority and allocated more CPU cycles than usermode programs. With great execution privilege comes great responsibility: if we never allow the scheduler to preempt our thread so other threads can execute, we can severely impact the performance of the operating system, especially on systems with limited cores.

In our case, for the listener thread we'll want the scheduler to allocate us a timeslice and return the thread to TASK_RUNNING when a network adapter receives a packet. Luckily for us, the Linux kernel team is on top of it and has implemented a convenient way to determine if the socket has received a packet:


/**
 *  skb_queue_empty - check if a queue is empty
 *  @list: queue head
 *
 *  Returns true if the queue is empty, false otherwise.
 */
static inline int skb_queue_empty(const struct sk_buff_head *list) {
    return list->next == (const struct sk_buff *) list;
}

As long as this receive queue is empty we will set our state to TASK_INTERRUPTIBLE and voluntarily invoke the scheduler with schedule_timeout. When the socket's receive queue is non-empty we exit the loop, set the thread state back to TASK_RUNNING, and remove ourselves from the wait queue.


add_wait_queue(&(sock)->sk->sk_wq->wait, &(recv_wait));

// Socket recv queue empty, set interruptible -
// release CPU and allow scheduler to preempt the thread
while(skb_queue_empty(&(sock)->sk->sk_receive_queue)) {

    set_current_state(TASK_INTERRUPTIBLE);
    schedule_timeout(2*HZ);

    // check exit condition
    if(kthread_should_stop()) {
        printk(KERN_INFO "[*] returning from child thread\n");
        sock_release(sock);
        kfree(pkt);
        do_exit(0);
    }

}

set_current_state(TASK_RUNNING);
remove_wait_queue(&(sock)->sk->sk_wq->wait, &(recv_wait));

Thread Synchronization:

One of the key shared structures in DrawBridge is the linked list I created to track connection states. Because both the Netfilter context and our listener thread would be reading and writing to this structure at the same time it was important to consider synchronization, otherwise we would run into concurrency issues and possibly some undefined behavior.

To ensure synchronization, I decided to use the RCU mechanism available in the kernel. As opposed to traditional mutual exclusion primitives like mutexes and semaphores, RCU is lock-free (for readers) and works via a publish/subscribe mechanism. By maintaining multiple versions of a structure, RCU is able to ensure concurrent read/write operations are coherent (no partial updates) and will wait for all readers to be done with a structure before freeing it. This ultimately allows RCU readers to use much lighter-weight synchronization than traditional lock-based approaches, which require excluding all readers up front so an updater can't change the object out from under them. With RCU we are optimizing for read/search access, which is good for DrawBridge because our Netfilter hook needs to parse the state list and make filtering decisions very quickly. The list only needs to be updated when a valid authentication packet comes in, which is much less frequent, so a synchronization method that favors readers is ideal.

The basic primitives of RCU allow a writer to "publish" a structure, giving readers the ability to "subscribe" and safely scan data even while that data is being modified concurrently. In DrawBridge one writer is the state_add method in xt_state.c, which uses the following code:

// acquire lock
spin_lock(&listmutex);

// add item
list_add_rcu(&(state->list), &((*head)->list));

// release lock
spin_unlock(&listmutex);

At this point you're probably noticing the mutex and saying "Hey, I thought this was lock-free!" It is for readers, but not for writers. We don't want multiple writers working concurrently, 1) because that's not what RCU was designed for and 2) because we don't want two new entries added to the head of the list simultaneously. This does not prevent list_add_rcu from executing concurrently with RCU readers, because our readers never take this lock. list_add_rcu is just a wrapper for the lower-level RCU primitive rcu_assign_pointer, which publishes the new "next" pointer for the head of the list (rcu_assign_pointer(list_next_rcu(prev), new);) so that it points to our new entry. All of this can be seen in linux/rculist.h. Let's see how our reader code "subscribes" to the state list by looking at the state_lookup method, which is called by our Netfilter hook when a new connection comes in:

rcu_read_lock();

list_for_each_entry_rcu(state, &(head->list), list) {

    if(state->type == 4 && state->src.addr_4 == src 
                    && state->port == port) {
        rcu_read_unlock();
        return 1;
    } else if (state->type == 6 
                    && ipv6_addr_cmp(&(state->src.addr_6), src_6) == 0 
                    && state->port == port) {
        rcu_read_unlock();
        return 1;
    } 
}
rcu_read_unlock();

The rcu_read_lock and rcu_read_unlock calls are not traditional locks. They mark the start and end of "RCU read-side critical sections", which are not permitted to block or sleep and are therefore guaranteed to have completed once a given CPU executes a context switch (in kernels built without CONFIG_PREEMPT). This is one mechanism RCU implementations use to verify reader completion without explicitly tracking active readers. The comments for list_for_each_entry_rcu explain that "this list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as list_add_rcu() as long as the traversal is guarded by rcu_read_lock()."

The downside to using RCU is that readers using list_for_each_entry_rcu may not immediately see a new element being added by list_add_rcu at the same time; we are only guaranteed that they will see a valid, well-formed list. So it is possible for state_lookup to miss the most recent update, but if we had to synchronize readers with writers our search performance would be much worse. Functionally this doesn't impact DrawBridge, as it's unlikely your authentication packet and your initial connection will be processed simultaneously.

Versioning:

Unfortunately, one of the headaches of writing kernel modules is supporting multiple kernel versions. When porting DrawBridge to 3.X kernels I encountered a lot of changes that needed to be accounted for. One function that changed significantly between versions was sock_recvmsg. If you look at this kernel commit you can see the "size" argument being dropped from sock_recvmsg for the 4.7-4.20 kernels, which means a module supporting both those versions and an older kernel would have to call sock_recvmsg in two different ways. Luckily I found what appears to be a more stable API call, kernel_recvmsg, as you can see in my commit diff.

However, this isn't always possible. Sometimes you just have to adjust your code depending on the kernel version. There is a handy preprocessor check for this that follows this format:

#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
    msg.msg_iocb = NULL;
    iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *)&iov, 1, len);
#else
    msg.msg_iov = (struct iovec *)&iov;
    msg.msg_iovlen = 1; /* one kvec entry */
#endif       

So if the kernel version is 3.19 or newer, we use the newer iov_iter_init method to initialize our struct msghdr's iov iterator, and if it's older we can just set the struct kvec iov directly. All of this is located in my ksocket_recieve method in xt_listen.c.

That's all I've got folks, hopefully all of this discussion is somewhat helpful. Happy coding!


References:

Professor Gavrilovska's Graduate Introduction to Operating Systems

LWN: Sleeping and Waking Up

LWN: What is RCU fundamentally?

USENIX: Reader-Writer-Locking/RCU Analogy