GeistHaus
log in · sign up

https://staticthinking.wordpress.com/feed/atom

atom
10 posts
Polling state
Status active
Last polled May 19, 2026 08:45 UTC
Next poll May 20, 2026 11:36 UTC
Poll interval 86400s
Last-Modified Sat, 18 Apr 2026 14:01:28 GMT

Posts

No AI Slop
Uncategorized
AI is a powerful tool and the kernel community uses AI. However, if an AI writes something for you and you can't be bothered to read it then don't send it. Just delete it.
Show full content

You have probably been directed here because someone suspects you have sent AI Slop.

AI is a powerful tool and the kernel community uses AI. However, if an AI writes something for you and you can’t be bothered to read it then don’t send it. Just delete it. As soon as people suspect you are sending AI Slop they will delete the whole email thread. No one has time for AI Slop.

Email Conversations: If someone asks you a question and you enter the question into a AI text generator, then don’t send us the AI answer. Just delete it. We already have AI ourselves so sending us the AI output is a waste of time and will only make everyone cross.

Bug Reports: AI bug reports are interesting and welcome. Please review them first before sending. If you don’t have the skills to review them, then ask someone on Kernel Newbies for help or something. If you can create a test program to demonstrate the bug that’s best. Connect the bug report to a specific commit and report it to the correct people. Do not send mass bug reports, because the sad truth is that no one reads them mass bug reports even when the bugs are real.

Code: Review your patches before sending them. When you send a patch upstream then dozens of people are forced to read it. If you can’t be bothered to read your code before sending it, then just delete it. See the kernel AI policy document for more details: https://lwn.net/Articles/1045440/

http://staticthinking.wordpress.com/?p=538
Extensions
Reviewing Magical Cleanup Functions
Uncategorizedkernellinuxprogramming
How to review Magical One Function Cleanup.
Show full content

I’ve previously blogged about how to write error handling. In that blog I mention an anti-pattern called One Err Style cleanup which has just one vaguely named error label handle all the cleanup:

err:
        free(foo);
        free(bar);
        return ret;

I also mentioned a sub category of this called One Magical Cleanup Function style which takes it a step further and hides all the cleanup in a separate function.

err:
        free_stuff(p);
        return ret;

The name is based on the time where someone told me to re-write my code using this style and that having just one cleanup function removes all the complexity as if by “magic”. Unfortunately in real life, this style of error handling looks simple but it is the most complex and most bug prone.

What static analysis does is it warns about suspicious code. I think a warning is useful if it points to a bug more than 50% of the time in new, unfixed code. These magical cleanup functions are buggy about 50% of the time so they’re worth reviewing and re-writing whenever you see them. This blog explains my review process.

Ideally, we would only free things which are allocated but with magical cleanup we need to handle resources which might be:

  • Non-existent such as “foo->bar->baz” where “foo->bar” is NULL
  • Uninitialized
  • NULL
  • Error pointer
  • Allocated
  • Freed

The other common bug with magical cleanup functions is that they are so complicated they lead to leaks.

Normally, I open a second text editor to keep track of my notes because you have to line up every allocation with a free. Let’s take a real would example. I have anonymized this example because these are mistakes that everyone makes and I don’t want to single out any specific driver.

fail:
        my_deinit_function(dev);

        return ret;

The deinit function looks like this:

void my_deinit_function(struct my_dev *dev)
{
        if (dev->input_dev)
                input_unregister_device(dev->input_dev);

        kfree(dev->priv_data);
        dev->priv_data = NULL;
        ...
}

On the first line, it assumes that if dev->input_dev is non-NULL we should call input_unregister_device(). Can it be uninitialized, an error pointer or freed? We need to look at where it’s assigned.

static int my_register_input_device(struct my_dev *dev)
{
	int ret;

        dev->input_dev = devm_input_allocate_device(dev->dev);
        if (!dev->input_dev)
                return -ENOMEM;

        ret = input_register_device(dev->input_dev);
        if (ret)
        	return ret;
        	
	return 0;
}

We can see it’s either NULL or valid. But it may or may not be registered successfully. What happens when we call input_unregister_device() on an unregistered device? Let’s take a look:

static void __input_unregister_device(struct input_dev *dev)
{
        struct input_handle *handle, *next;

        input_disconnect_device(dev);

        scoped_guard(mutex, &input_mutex) {
                list_for_each_entry_safe(handle, next, &dev->h_list, d_node)
                        handle->handler->disconnect(handle);
                WARN_ON(!list_empty(&dev->h_list));

                del_timer_sync(&dev->timer);
                list_del_init(&dev->node);

                input_wakeup_procfs_readers();
        }

        device_del(&dev->dev);
}

I don’t know what input_disconnect_device() does. I think that the list will be empty for unregistered input devices so that’s probably fine. But the thing which jumps out right away is the last line which calls device_del(&dev->dev). With reference counting it’s really important to know if you have bumped the reference or not so here the device_del() is a bug. We’re done. This error handling is buggy and will need to be re-written.

By the way, if you look at the my_deinit_function() again it frees pointers and then sets them to NULL.

        kfree(dev->priv_data);
        dev->priv_data = NULL;

Sometimes the idea with that is if you call the free function twice then the second time it will not result in a double free. However, here the dev->input_dev pointer was not set to NULL so it’s not consistent.

Another thing that you will see is code which sets flags to say when to free or not.

int my_alloc_function(struct foo *p)
{
	p->one = alloc();
	if (!p->one)
		return -ENOMEM;
	p->two = alloc();
	if (p->two) {
		free(p->one);
		return -ENOMEM;
	}
	p->registered = true;
	return 0;
}

int my_free(struct foo *p)
{
	if (!p->registered)
		return;
	free(p->two);
	free(p->one);
	p->registered = false;
}  

With this code, you can call my_free() from any point and it will do the right thing. It can even be called twice and it will work fine. However, eventually someone will add new code after the “p->registered = true;” assignment. People always add new code to the end of functions. That will lead to a bug because the p->registered variable means that *everything* has been registered successfully.

This blog has gone on long enough. I have only analyzed the first line of the cleanup function. We got a bit lucky because the first line was buggy but there were other bugs in the lines I replaced with an ellipsis.

We had to look at several different functions to really understand that first line of code. It wasn’t a magical experience. Writing error handling in the boring methodical way ends up being simpler in the end.

http://staticthinking.wordpress.com/?p=468
Extensions
Smatching an Rsync Bug
Uncategorizedcodingprogrammingtechnology
An investigation into whether Smatch would have prevented a recent security issue in rsync.
Show full content

LWN recently had an article about a buffer overflow in Rsync. LWN subscriber dskoll asked, “wouldn’t a static analyzer have caught this?” The blog attempts to answer that question.

Smatch is the only open source static checker which does cross function analysis so it’s the only one which can work here.

Smatch is generally only tuned for the kernel. In particular, in the kernel I taint some functions as handling user data or network data. User controlled data is checked more carefully and triggers more warnings. I haven’t done that for userspace.

Since I test Smatch on the kernel there are lots of important kernel functions which have special handling. Maybe I missed a security bug in that code and want to be more careful going forward. In terms of false positives, probably userspace code will have lots of false positives which I haven’t even thought about before. I have never tested Smatch on the rsync code.

Also this bug was a memcpy() overflow and the Smatch check for memcpy() overflows is sort of junk. It’s on my TODO list to re-write that.

But let’s try it. We’re going to take some shortcuts in this blog instead of doing it the proper way. I’m going to write down my findings as I go along.

$ cd rysnc_code/
$ ./configure CC=~/progs/smatch/devel/cgcc
$ CHECK="/home/dcarpenter/progs/smatch/devel/smatch --info " make io.o | tee out

Ugh! Where is the output? I expect the output to show up on stdout. It turns out there is a bug in Smatch, where it was writing the output to io.o and then the compiler immediately overwrites it. I’ve fixed that.

$ ~/progs/smatch/devel/smatch_data/db/create_db.sh out
$ smdb.py return_states read_sum_head | grep s2length | grep PARAM_SET

io.c | read_sum_head | 196 | | PARAM_SET | 1 | $->s2length | 0-64 |

This says that $->s2length is set to 0-64 which is what we wanted. Great. Now, let’s try run it on sender.o.

$ CHECK="~/progs/smatch/devel/smatch" make sender.o

Nope. Nothing. Let’s add some debug code to see what’s going wrong.

diff --git a/sender.c b/sender.c
index 3d4f052e9bdb..46245839b184 100644
--- a/sender.c
+++ b/sender.c
@@ -67,6 +67,7 @@ BOOL extra_flist_sending_enabled;
 /**
  * Receive the checksums for a buffer
  **/
+#include "/home/dcarpenter/progs/smatch/devel/check_debug.h"
 static struct sum_struct *receive_sums(int f)
 {
        struct sum_struct *s = new(struct sum_struct);
@@ -97,6 +98,8 @@ static struct sum_struct *receive_sums(int f)
 
        for (i = 0; i < s->count; i++) {
                s->sums[i].sum1 = read_int(f);
+               __smatch_buf_size(s->sums[i].sum2);
+               __smatch_implied(s->s2length);
                read_buf(f, s->sums[i].sum2, s->s2length);
 
                s->sums[i].offset = offset;

And re-run Smatch.

rm sender.o ; CHECK="~/progs/smatch/devel/smatch" make sender.o | tee out

sender.c:101 receive_sums() buf size: 's->sums[i]->sum2' 16 elements, 16 bytes
sender.c:102 receive_sums() implied: s->s2length = '0-64'

So it has the right information, but it doesn’t recognize read_buf() as a memcpy() function. Smatch has a way to automatically generate a list of paired buffer/length parameters. The generated file is here. The code to generate this list looks for functions which are called like frob(p, sizeof(*p)) and there are two read_buf() callers which do it that way. If I run smatch_scripts/gen_sizeof_param.sh to generate the file then it’s now listed as a memcpy() function. From here on out we will need to add -p=rsync when we’re running Smatch.

$ rm io.o ;CHECK="/home/dcarpenter/progs/smatch/devel/smatch --info " make io.o | tee out
$ ~/progs/smatch/devel/smatch_scripts/gen_sizeof_param.sh out -p=rsync
$ mv rsync.sizeof_param ~/progs/smatch/test/smatch_data/

(This is not the proper way to generate this file. I’m just doing the minimal stuff to find this exact bug in rsync).

Action Item 1: Using sizeof() to find memcpy functions mostly works, but it would be better to look at actual memcpy() calls as well.

But UGH!!! I doesn’t work. The check_memcpy.c code does not think the s->s2length variable is suspicious enough to trigger a warning.

static int get_the_max(struct expression *expr, sval_t *sval)
{
	struct range_list *rl;

	if (get_hard_max(expr, sval))
		return 1;
	if (!option_spammy)
		return 0;
	if (get_fuzzy_max(expr, sval))
		return 1;
	if (!get_user_rl(expr, &rl))
		return 0;
	*sval = rl_max(rl);
	return 1;
}

The problem is there are levels of certainty. If you see an “foo – 1” then you know the result it’s at least “UINT_MAX – 1”. But if you see code like “if (x > 10) {” that’s a different level of certainty about the actual limit upper bounds. Here the 0-64 is supposed to count for get_hard_max().

Action Item 2: The 0-64 PARAM_SET should have been recorded as a hard max. I don’t know how hard it is to fix this but I’ll investigate.

Action Item 3: We need to add annotations for the functions which handle untrusted data in userspace. (I’m not going to do this).

Action Item 4: The “s->sums[i].sum1” buffer is fixed at 16 bytes. It should automatically trigger a warning if the length is 0-64. I think the right thing to do is to break the check_memcpy.c into many small checks. One that looks for memcpy() into fixed size buffers. One that looks for user controlled lengths. Etc. The check_memcpy.c file has lots of code to handle false positives like we used to have variable length arrays with one element and so on. All those structs have been re-written in the kernel but I want to keep that code in Smatch. So I’m just going to:

cp check_memcpy_overflow.c check_memcpy_fixed_size.c

And also I won’t do any additional cleanups because I want the diff to be minimal for this blog. Here is the diff:

--- check_memcpy_overflow.c     2025-01-24 12:50:38.428530006 +0300
+++ check_memcpy_fixed_size.c   2025-01-24 12:51:40.876456026 +0300
@@ -249,14 +249,16 @@
        struct expression *dest;
        struct expression *limit;
        char *dest_name = NULL;
+       struct range_list *rl;
        sval_t needed;
        int has;

        dest = get_argument_from_call_expr(expr->args, limiter->buf_arg);
        limit = get_argument_from_call_expr(expr->args, limiter->limit_arg);
-       if (!get_the_max(limit, &needed))
+       if (!get_implied_rl(limit, &rl))
                return;
-       has = get_array_size_bytes_max(dest);
+       needed = rl_max(rl);
+       has = get_array_size_bytes(dest);
        if (!has)
                return;
        if (has >= needed.value)
@@ -287,7 +289,7 @@
                return;

        dest_name = expr_to_str(dest);
-       sm_error("%s() '%s' too small (%d vs %s)", fn, dest_name, has, sval_to_str(needed));
+       sm_error("fixed_size %s() '%s' too small (%d vs %s)", fn, dest_name, has, show_rl(rl));
        free_string(dest_name);
 }

@@ -371,7 +373,7 @@
        clear_token_alloc();
 }

-void check_memcpy_overflow(int id)
+void check_memcpy_fixed_size(int id)
 {
        register_funcs_from_file();
        register_ignored_structs_from_file();

There will be some duplicated warnings if both checks trigger. Which is fine. Here is the new warning:

$ rm sender.o ; CHECK="~/progs/smatch/test/smatch -p=rsync" make sender.o

sender.c:100 receive_sums() error: fixed_size read_buf() 's->sums[i]->sum2' too small (16 vs 0-64)

Let me cross Action Item 4 from my TODO list. I’ll test this check on the kernel over the weekend and push next week.

Action Item 5: Sometimes the flow analysis is not as obvious as it was here. For example, we could verify that all the lengths were correct and then do the memcpy() in another thread. Make a list of all the “valid” lengths for (struct sum_struct)->s2length. Here valid means that we saw a bounds condition like:

if (sum->s2length < 0 || sum->s2length > 64) {

The valid lengths are 0-64. Next when when we have a memcpy() check that the “valid” lengths match the actually valid lengths for the memcpy().

So the answer to dskoll’s original question “wouldn’t a static analyzer have caught this?” When I started writing this bug the answer was probably no, but by the end the answer was yes. Hooray!

http://staticthinking.wordpress.com/?p=475
Extensions
Static analysis and style
Uncategorized
Bloviations on the topic of static analysis and style debates.
Show full content

I gave a talk at Linaro Connect about static analysis in the kernel. I included a couple controversial slides about style. I did that deliberately because controversy is fun! But when you say something controversial, you should to be prepared for questions and feedback instead of just standing there like some kind of idiotic deer staring into the headlights of a 16 wheel truck.

The first question was whether “if (x < 0 || x >= 0) {” is really more readable than “if (x >= 10) {” and if so then should we ignore static checkers that warn about this. This was an opinion that I copied from Linus Torvalds. When Linus Torvalds explains with 100% confidence that checking for negatives improves readability, then everyone says, “Oh… Wow. Good point!” But when I say it, I can’t project enough confidence to convince anyone that I believe it myself. In the end, what I was trying to say about this was that I want to stay out of that debate. Delete the “x < 0” or don’t. Either way works. Static checkers should not print a warning when the code clearly works as intended.

Yes, I did say that you should ignore static checkers when they are wrong. That’s true. But also it’s true that you can’t fight checkpatch. Give in and give up. #HashTag #MotivationalSlogan

The other question was if I was in a position to enforce my dumb style opinions. The answer is that I’m not. That seemed like a very reassuring answer for everyone in the audience. This is similar to the first question about how I want avoid style debates. I always like to brag about how noble I am with regards to ignoring other people’s wrong style opinions.

But on reflection, ACTUALLY, I do wish that I could enforce some style rules!

The first category of rules is from my experience reviewing static checker warnings. There are some styles which are just more buggy than others and I wish people would avoid them.

The second rule that I’d make is “Don’t write code which looks buggy”. Static analysis finds code that looks buggy. It is so frustrating for me when the code looks totally buggy but actually it works. Bugs are easy to fix. Or when the bugs are complicated, they’re at least rewarding to fix. But code that looks buggy takes a long time to review and isn’t rewarding at all.

In my slides, I mention one kind of false positive where Smatch warns about missing error codes but it’s actually intentional. Sometimes people add a comment saying that /* this is a success path */ and that’s fine. Comments are good even. But another option would be to add a “ret = 0;” before the goto. People don’t like to do that because it’s redundant. Sure. Fine. But the compiler knows it’s redundant too and will remove it. It doesn’t affect the compiled code. With a “ret = 0;” the code is self documenting and the comment can be removed. Humans are happy, static checkers are happy, compilers are happy.

The third category of rules I’d like to create would be about writing code which is easy to analyze. I don’t necessarily have a clear idea what this is. But I find it ironic that there are people who refuse to re-write C code in a more readable format and yet these very same people love Rust. Love love love. That’s main thing which Rust is, is *very* *very* strict static analysis rules. We could many of the same things in C with static checker rules. For example, the Cake Project re-implements Rust rules in C. I’m not really convinced about Cake but the project is interesting and shows the kind of thing which is possible.

The example of strict rules in my Linaro Connect talk was that we could say you’re not allowed to add a scope based cleanup.h pointer to a list until you’ve called no_free_ptr() on the pointer. It’s not a bug to do that, but it’s so suspicious that we should treat it as a bug. We could make all kinds of rules like this if we wanted to. #Vague #HandWavy Part of the reason to make rules about the cleanup.h code is because that stuff is so new and we could enforce good hygiene before people get used to writing code in bad ways.

This third category of style rules means designing code for static analysis from the start. That’s probably not where we are yet. But I think as time goes on, that’s where we will go. Eventually.

http://staticthinking.wordpress.com/?p=443
Extensions
Sleeping in Atomic Warnings
Uncategorized
TL/DR; Smatch has a warning "warn: sleeping in atomic context". People sometimes want to know why they don't see these warnings when they run Smatch. It requires the cross function database. Generating the cross function database is really time consuming and probably not worth it. Instead try to reproduce the warning at runtime by enabling the CONFIG_DEBUG_ATOMIC_SLEEP=y option.
Show full content

TL/DR; Smatch has a warning “warn: sleeping in atomic context”. People sometimes want to know why they don’t see these warnings when they run Smatch. It requires the cross function database. Generating the cross function database is really time consuming and probably not worth it. Instead try to reproduce the warning at runtime by enabling the CONFIG_DEBUG_ATOMIC_SLEEP=y option.

Boring explanation follows:

Kernel locks can be divided into two classes, locks which can sleep such as mutexes and locks which don’t sleep such as spinlocks. Sleeping in this context means calling schedule() to run a different task. So if you try take a mutex and a different task is already holding the mutex, your task will go to sleep and let other stuff run. Periodically, your task will wake up and see if the lock is available. On the other hand, if you take spinlock, your task will just keep testing over and over to see if the lock is available yet and it won’t let anything else run.

One kind of bug is called sleeping in atomic bug where you take a spinlock and then sleep. Spinlocks are supposed to be fast and they’re not a time for going to sleep on the job. Sleeping will slow everything down. The name of this bug is “Sleeping in Atomic”.

There is a kind of deadlock bug associated with this bug. To be honest, this was much more of an issue in olden times before everyone had 8 cores in their phone. Imagine you have only 1 CPU and you’re holding a spinlock, then you go to sleep, then another process wakes up and it also wants the spinlock. The second spinlock will spin forever waiting for the first sleeping process to give up the lock. You will have to hold down the power button on your computer for 30 seconds and all your work will be lost.

These days, with multiple cores, I think maybe it’s less likely the two processes will end up on the same core. Or maybe the first sleeping process can be woken up on a different core? I’m not an expert on schedulers so I don’t know. But I feel like we used to hit these deadlocks more often back in the day.

Regardless, sleeping in atomic is a bug. Places which sleep call the might_sleep() function that checks if we are allowed to sleep and prints a stack trace if we are not. This checking is a bit slow so it’s not enabled by default but you can turn it on with CONFIG_DEBUG_ATOMIC_SLEEP=y

There is also a Smatch check for this bug. There are actually a few categories of code which are not allowed to sleep. Code that’s in hard IRQ context, places where IRQs are disabled, and when the preempt count is non-zero. (To make it simpler than it is, pretend that the preempt count is just the number of spinlocks you are holding). Smatch only tracks the preempt count. It turns out that tracking when we’re in hard IRQ is more difficult than I had assumed. Checking for when IRQs are disabled should be the same as tracking preempt count but I have not done this work yet (as I write this in May 2024 there is some pressure to add this code so I probably will).

Tracking the preempt count works like this.

The first part is check_preempt_info.c. Every return is either no change, preempt enabled, preempt disabled, or too complicated. Ignoring the “too complicated” returns is really magic. If we take a spinlock and then we drop it or we take two spinlocks, that’s all too complicated and Smatch ignores it. This cuts down on the false positives and particularly the complicated false positives. Some of these complicated functions get added back through manual annotations.

Then there is a separate module, check_preempt.c, that tracks whether a function is called with a non-zero preempt count. (To make it simple, we say that the preempt count is 1 max when it is called). Then it uses the information from the return states to increment or decrement the preempt count. The preempt count can go negative and that’s fine. It records in the caller_info table when we call a function and we have a preempt count that is greater than zero.

The next module, check_sleep_info.c, tracks when functions sleep. One sleeping function is mutex_lock(), but we don’t want the error to be printed inside mutex_lock(), we want the warning to show up in the function that calls mutex_lock(). So we track the functions which *always* sleep on every path. That way we can move the error message into a more useful location, closer to the lock.

The final module, check_scheduling_in_atomic.c, hooks into the sleeping code and whenever we call a sleeping function then it looks for if we have a positive preempt count and prints a warning.

Then to review the warnings, use smdb.py to see where the function is called with preempt disabled.

drivers/usb/dwc3/gadget.c:2666 dwc3_gadget_soft_disconnect() warn: sleeping in atomic context

$ smdb.py preempt dwc3_gadget_soft_disconnect
dwc3_suspend_common() <- disables preempt
-> dwc3_gadget_suspend()
   -> dwc3_gadget_soft_disconnect()

The code looks like this:

drivers/usb/dwc3/core.c dwc3_suspend_common()

2296 spin_lock_irqsave(&dwc->lock, flags);
2297 dwc3_gadget_suspend(dwc);
2298 spin_unlock_irqrestore(&dwc->lock, flags);
drivers/usb/dwc3/gadget.c dwc3_gadget_suspend()

4708 ret = dwc3_gadget_soft_disconnect(dwc);
drivers/usb/dwc3/gadget.c dwc3_gadget_soft_disconnect()

2663 if (dwc->ep0state != EP0_SETUP_PHASE) {
2664         reinit_completion(&dwc->ep0_in_setup);
2665
2666         ret = wait_for_completion_timeout(...

The dwc3_suspend_common() function takes a spinlock and in dwc3_gadget_soft_disconnect() we call wait_for_completion_timeout() which sleeps.

The if statement on line 2663 in dwc3_gadget_soft_disconnect() means that it will not sleep during the SETUP_PHASE. Perhaps the code in dwc3_suspend_common() is only called during setup phase? In this example, there are three functions in the call tree. Smatch loses a bit of context every time we cross a function boundary. So these sleeping in atomic warnings tend to have false positives.

You have to look at several functions so they are complicated to review manually as well. The patch which introduces the bug could have been anywhere in the in the call tree so it’s easy to blame the wrong person for introducing the bug. These warnings are a giant pain in the behind for me.

If the spinlock and the sleep are in different functions then you need to build the cross function database. It’s not hard to do that, but it takes a long time (around 5 hours depending on the system?). The command is smatch_scripts/build_kernel_data.sh.

The way the Smatch cross function database works is if we’re holding a spinlock then it adds that to the caller info table. But then when we rebuild the database a second time then we know about more functions which are called under spinlock. So we have to rebuild it over and over and each time it adds another branch to the call tree. We have to keep rebuilding it until the spinlock information and the sleep information meet. In the above example, we’d have to rebuild the database three times.

I rebuild my cross function database every night based on linux-next. Linux-next is a fast moving target so sometimes the data is slightly inconsistent, where Smatch thinks we’re holding a spinlock five functions away, but we’re not. It is annoying, but I’m used to dealing with that. In that situation there would be no “<- disables preempt” marker in the smdb.py output.

People sometimes want to verify that their fix silences the Smatch warning. Hopefully, this blog provides enough information to know the answer without actually going through the drudgery of actually reproducing the warning.

http://staticthinking.wordpress.com/?p=411
Extensions
return 0 is better than return ret
Uncategorized
There are a lot of people who write "return ret;" when they mean "return 0;" I feel like I'm the person who cares about this the most in the world, but hopefully after reading this blog you will notice it as well and it will annoy you.
Show full content

There are a lot of people who write “return ret;” when they mean “return 0;” I feel like I’m the person who cares about this the most in the world, but hopefully after reading this blog you will notice it as well and it will annoy you.

Most functions in the Linux kernel return zero on success and return negative error codes on failure. There are some functions which return negative error codes or a positive number on success. I have previously written that callers should check “if (ret) ” for functions which return zero and only use “if (ret < 0)” for functions which return positive values. But other people disagree. The sound subsystem in particular. Whatever. But if you’re going to always check for negatives please return a literal zero instead of “return ret;”

Consider this code:

ret = frob();
if (ret < 0)
return ret;

frob();
frob();
frob();

return ret;

It fills you with rage, right? If it had been written as “return 0;” it would have been immediately obvious. But the way it’s written, it could be any non-negative number like three or one hundred. Who knows? Now you have to scroll back to see the “ret = frob();” assignment and then you have to jump into the frob() function to see what that returns. Normally when you jump into frob(), it’s pretty obvious that it returns zero or negative error codes. But imagine if that function followed the same anti-pattern and you had to jump further up the call tree. Imagine it took you hours and hours to walk the call tree back until eventually you got so enraged you grabbed your computer monitor and smashed it on the floor. Imagine that! WOW! You need to talk to someone about your anger issues!

Quite often the code is obvious but annoying:

ret = frob();
if (ret) {
printf("error: frob failed\n");
return ret;
}

return ret;

This code works as intended but it still is less readable than “return 0;”.

I am an amateur snake enthusiast but there are some people who make their career around snakes. When the professionals go into the forest they spot things which the rest of us don’t see. They’ve trained their eyes and they understand the habitat. When the rest of see a snake, we see a green snake but a professional can identify the species from just a glance. “Did you see how dark its eyes are? It’s a Western Natal Green Snake.”

It’s the same for kernel developers. We look at code all day long. Things like a return 0; statement stand out to us and we’re subconsciously looking for it.

There is one time where return ret is especially annoying and that is this:

ret = short_cut();
if (!ret)
return ret;
frob();
frob();

What this code is saying is that if the short_cut() succeeds then we can return directly, but otherwise we have to do it the long way. But to the eye the short cut looks like an error path.

if (!ret)
123456789
return ret;
0123456789012345678

The code has 38 characters in it and 37 of them are saying “this is an error path” and one character is saying this is a success path. I review it every time someone adds code like this to the kernel because sometimes the “!” character was added unintentionally. Normally, bugs like this would be caught in testing but people don’t necessarily have the hardware to test properly. Or people say, “Oops. I did test this patch but I accidentally sent the wrong version.”

The thing is that when the code is obviously buggy, that takes very little time to address. It’s a one liner patch. The real frustration is when the code is ambiguous. You can stare at the code for a long time and then eventually you decide it works correctly. This is time consuming and there is no reward at the end.

Instead take a look at this code:

ret = short_cut();
if (!ret)
return 0;
frob();
frob();

When it’s written like this, it’s obvious that the author returned zero intentionally and that it’s a success path.

http://staticthinking.wordpress.com/?p=387
Extensions
When to use == 0
Uncategorized
This is no longer a debate in the Linux kernel, but I saw some code written in out of date style recently and it was a moment of reminiscing. You used to see code like this: You should never use == NULL or != NULL. This rule is enforced by checkpatch. Comparing against zero is […]
Show full content

This is no longer a debate in the Linux kernel, but I saw some code written in out of date style recently and it was a moment of reminiscing. You used to see code like this:

ret = frob();
if (ret != 0)
return ret;

You should never use == NULL or != NULL. This rule is enforced by checkpatch. Comparing against zero is more subtle because there are a few situations where == 0 and != 0 is appropriate.

One place where == comparisons with zero is appropriate is for NUL terminators. Always test against “if (*p == ‘\0’)”. Don’t use ‘if (!*p)” and especially don’t use “if (*p == 0)”. NUL terminators are really important and they are their own thing. Make them stand out. Don’t take short cuts.

Another time to use to use == comparisons is with enums. Don’t say if (!direction). Say if (direction == READ)”.

In the example code at the start of this blog, “ret” is not a number like 1, 2, 3, etc. “ret” means is there an error code or not? It should just be written as”if (ret) “. Some people think “if (ret != 0) is more readable. But then I used to tell them to make it even more readable by writing “if (ret != 0 != 0)”. I still remember this as being the wittiest thing I have ever said in my life and it’s the reason why this blog entry exists.

If you have a condition like “if (len == 0) “. That’s perfectly fine and readable. To be honest, “if (!len)” is also okay. In this situation a zero length has a special negative meaning of absence. Zero length is different from every other length. I would probably write the former rather than the latter but only a nutter would complain either way.

The place where I would complain is if there is no special meaning of absence. For example, the first element in an array is not different from the second element. It should be “if (index == 0) “. This is especially true if there are other tests on the same variable. Don’t write “if (!value) … else if (value == 1) … else if (value == 2). Use == 0, == 1 and == 2 consistently.

The final time to use == 0 is for strcmp() type functions. In those functions, you’re comparing two strings. The strcmp() functions return < 0 if the first parameter is less than the second. They return > 0 if the first parameter is greater than the second. And they return == 0 if the parameters are equal. Once you know that then “if (strcmp(a, b) == 0)” is very readable. Writing it as “if (!strcmp(a, b))” is so awkward.

http://staticthinking.wordpress.com/?p=378
Extensions
strcpy strncpy() strlcpy() and strscpy()
Uncategorized
The kernel has a number of strcpy() functions that copy a string from one pointer to another. This blog is a short guide. strcpy() is dangerous because it has no bounds checking and can lead to a buffer overflow. strncpy() will not overflow the destination buffer. But the problem is that if it has to […]
Show full content

The kernel has a number of strcpy() functions that copy a string from one pointer to another. This blog is a short guide.

strcpy() is dangerous because it has no bounds checking and can lead to a buffer overflow.

strncpy() will not overflow the destination buffer. But the problem is that if it has to truncate the string, then it will not add a NUL terminator at the end. The other thing to note about strncpy() is that it always writes n characters. If the src string doesn’t have n characters then it will fill the remaining parts of the string with zeroes. This feature is sometimes used in the kernel to prevent information leaks. In the kernel, when you allocate memory with kmalloc() then the old information is still there so you have to clear it out before you copy it to the user. strncpy() will copy over the whole buffer. (This is called zero padding).

I’m going to note here that although the problem with strncpy() is that it doesn’t put a NUL terminator on the end of the buffer, there are some places which want this behavior. There are buffers where we read to the first NUL character or up to 16 characters. I am not a huge fan of this behavior but it does exist. One piece of social advice that I can give is that you should not get into arguments with people who write code like this. These are not people motivated by reason.

strlcpy() will not overflow the destination buffer and it always adds a NUL terminator. However it doesn’t pad the string with zeros. But the main thing wrong with strlcpy() is that it does a strlen() on the src string. So if the src string is not NULL terminated then you can end up reading beyond the end of the buffer. If you get really unlucky then the memory beyond the end of the buffer might not be mapped and the kernel would Oops.

strscpy() and strscpy_pad() these functions don’t overflow, and they add a NUL terminator and the don’t do a strlen(src). The strscpy() function does not zero pad the string and the strscpy_pad() function does.

Obviously in new code, you should always use strscpy() and strscpy_pad(). Or you can use seq_buf_printf() or various other options. But don’t use strcpy(), strncpy() and strlcpy(). Eventually we want to get rid of these functions.

If you are updating code from the old strcpy() functions to the new functions then we want you to do some analysis. If you are converting from strcpy() then it’s simple to use strscpy() but you must analyze if the original code was buggy. Converting from strlcpy() to strscpy() is easy as well, but you must say if the src string is NUL terminated or not. Although I say that it’s easy to make this transition, actually keep your eye out for alternative potentially better ways like using snprintf() instead.

Converting from strncpy() is more tricky. You have to analyze if the resulting string could fit or if the NUL terminator was potentially left off. The other thing to decide is whether to use strscpy() or strscpy_pad(). Is the original destination buffer zeroed already? Does it get copied to user space? This must all be explained in the commit message.

So to recap here are the following questions you must ask when converting a an old strcpy(), strncpy() or strlcpy() function.

1) The destination buffer: How big is it? What is in the destination buffer before we start? Is the destination buffer zeroed already?
2) The source buffer: Where does it come from? Is it a string literal? Does it come from the user? Is it NUL terminated? How long is it?
3) The length: What length is it? Hopefully, it’s the length of the destination buffer but for strncpy() it might be “len – 1”. Is it longer than source buffer?

4) The return value: Very few callers check the return value, but if they do then the values are different.
5) The final string: Does the surrounding code assume that it is NUL terminated? Does the following code manually add a NUL terminator? Do we copy the string to the user?

So a typical commit messages might say something like these:

“Checkpatch complains that strscpy() is better than strcpy(). In this case, we are copying a string literal to a buffer which is 64 characters long. So strcpy() works fine, but let’s change it anyway to make checkpatch happy. (Don’t add a Fixes tag because it’s not a bug).”

“Checkpatch complains that strscpy() is better than strncpy(). This code manually adds a NUL terminator so it’s fine, but using strscpy() is cleaner. In this case, we are using strncpy() to copy data to &foo.str. We copy the &foo.str data to the user. The &foo.str already contains data from the user because of the copy_from_user() at the top of the function so it’s not an information leak if we use strscpy(). But let’s be more elegant than that and strscpy_pad() to zero out the rest of the buffer.”

“Checkpatch complains that strscpy() is better than strcpy(). In this case we are copying a 16 byte string into a 12 byte buffer so we need to make the destination buffer larger as well. (Add a Fixes tag, CC stable)”.

“We have been trying to get rid of strlcpy() functions. The problem with strlcpy() is that it does a strlen() on the source buffer which can cause a problem if the src buffer is not NUL terminated. In this case the src buffer is a string literal so that’s not an issue. When the src buffer is a string literal then strscpy() is a one for one replacement with strlcpy() so this change will not affect runtime.”

“Checkpatch complains that strscpy() is better than strlcpy(). The problem with strlcpy() is that it does a strlen() on the source buffer. That is a problem in this case because the src buffer comes from the user and it may not be NUL terminated. If the strlen() reads beyond the end of the buffer there is a chance that it will try read unmapped memory and cause a kernel Oops. (Add a Fixes tag, CC stable).

http://staticthinking.wordpress.com/?p=349
Extensions
Ignore old warnings
Uncategorized
I really believe in static checkers but sometimes static checker advice is wrong and harmful. For example, earlier in the week we made a buffer 16 bytes larger to silence checker false positive. Another time we got into a pointless argument about how to silence a different checker false positive. The best way to deal […]
Show full content

I really believe in static checkers but sometimes static checker advice is wrong and harmful. For example, earlier in the week we made a buffer 16 bytes larger to silence checker false positive. Another time we got into a pointless argument about how to silence a different checker false positive.

The best way to deal with false positives is just to mark them as old and ignore them. That way you get the best of both worlds. You get the useful warnings but you don’t have to worry about working around checker issues.

People have the idea that we should “Fix every warning” or have a “static checker clean code”. I don’t like this idea.

It limits me to only certain kinds of checks. I think even warnings with a false positive rate of 80% are useful if the other 20% are severe bugs such as security vulnerabilities. There are tons of Smatch checks which I can’t publish because they have too many false positives but they’re still useful.

Forcing static checker clean code puts a lot of burden on me to work around every false positive. Which is good. But at the same time, I only have so many hours in the day and trying to silence every false positive comes at the expense of other work I could be doing. I also just feel bad when I see people making their code worse because of a problem in Smatch.

People sometimes ask me to create a shared database of false positives. But really if it’s in the kernel and it’s an old warning and it’s in the allyesconfig for either x86 or ARM, then I have reviewed the warning. Just ignore everything old.

Smatch comes with a script to track new vs old bugs. I run Smatch on linux-next every day. Then I do new_bugs.pl smatch_warns.txt > err-list. I review the warnings with the `smatch_scripts/summarize_errs.sh err-list` script. Finally, I store the warnings using `new_bugs.pl –store err-list`.

If I patch a file, then that’s when I review all the warnings, both old and new.

The Intel Zero Day kbuild project follows the same system of only reporting new warnings but posting a list of old warnings for modified files. Everyone eventually comes up with this solution if they work with static checker warnings a lot.

http://staticthinking.wordpress.com/?p=341
Extensions
Writing a check for zero IRQ error codes
Uncategorized
In the earliest git release of the Linux kernel the platform_get_irq() function used to return zero on error. It’s hard for me to know why this is. I do think that there is a sense where unsigned int is “cleaner” for this purpose. From a practical perspective negative error codes are standard. Also are we […]
Show full content

In the earliest git release of the Linux kernel the platform_get_irq() function used to return zero on error. It’s hard for me to know why this is. I do think that there is a sense where unsigned int is “cleaner” for this purpose. From a practical perspective negative error codes are standard. Also are we more likely to need BIT(31) or IRQ number zero? In retrospect, probably we should have made it return a signed int.

Eventually in 2006, we changed platform_get_irq() to return -ENXIO on error in commit 305b3228f9ff ([PATCH] driver core: platform_get_irq*(): return -ENXIO on error). The reason for this change was there were new architectures created where zero was valid IRQ number. I have asked people about this and they say those architectures have been removed and zero is no longer a valid IRQ number and will never be again.

Let’s just take a moment to look at that commit. This is an artifact from a very different time in kernel development when we didn’t know what we were doing. The commit JUST BROKE EVERY SINGLE DRIVER!!! In those days we used to break things and then fix them before the kernel was released to the public. These days we might still occasionally break things, but only by mistake and not deliberately.

Also what a headache for backporting! The code would compile without complaining either way but in old kernels the correct check was “if (!irq) {” and in the new kernels the check was “if (irq < 0) {“. Developers were confused so it became quite common to write “if (irq <= 0) {” which is not correct but it works for everyone except for the people who have that weird 2006 architecture where IRQ zero is valid. These days no one backports to kernels from 2006 and everyone should check “if (irq < 0) {“.

Unfortunately, although platform_get_irq() and platform_get_irq_byname() were updated a number of other IRQ functions were not. For example, msi_get_virq() still returns zero. So we are left with a confusing legacy mix of returns.

This causes a number of bugs. 1) People don’t know whether to check for zero or negative. 2) Developers forget to set the error code. 3) Signedness bugs. Let’s discuss these in reverse order.

A lot of kernel code declares irq numbers unsigned int. For example, request_irq() does this. So for a long time, until we turned on the -Wtype-limits warning, checking “if (u32_irq < 0) {” was a very common bug. Earlier I mentioned that a common format is “if (irq <= 0) {” and the -Wtype-limits warning will not catch a signedness bug in this format. The Smatch check for unsigned error codes will detect this bug.

In the kernel the normal way to write error handling code is:

ret = frob();
if (ret)
        goto free_something;

But dealing with zero returns is more complicated.

irq = returns_zero_failure();
if (!irq) {
        ret = -EINVAL;
        goto free_something;
}

irq = returns_zero_or_negative_on_failure();
if (irq <= 0) {
        ret = irq ?: -EINVAL;
        goto free_something;
}

The “ret = ” step is just another thing to forget to do. The mixed case is even more complicated and easy to get wrong. These are common bugs. Smatch will detect certain cases of this bug but the situation could be improved.

Anyway, the problem I want to address is where people check for negatives instead of zero. Let’s register a table, and a return_implies_param_key_exact() hook. The only real information that we need from the table is the function names. It feels wrong to add all the unnecessary information but I did it in case I ever create a generic way to register a full table as a one liner.


static struct ref_func_info zero_table[] = {
        { "irq_of_parse_and_map", ZERO_ERROR, -1, "$", &int_zero, &int_zero, NULL },
        { "msi_get_virq", ZERO_ERROR, -1, "$", &int_zero, &int_zero, NULL },
};

static void returns_zero_error(struct expression *expr, const char *name, struct symbol *sym, void *data)
{
        set_state(my_id, name, sym, &fail);
}

Let’s make this check work across functions as well. Add ZERO_ERROR to the list of db types in smatch.h. We can re-use the returns_zero_error() function to handle selects. The only thing we need to add is a insert function:

static void return_info_callback(int return_id, char *return_ranges, struct expression *returned_expr, int param, const char *printed_name, struct sm_state *sm)
{
        if (param != -1 ||
            strcmp(printed_name, "$") != 0)
                return;
        if (!returned_expr || 
            implied_not_equal(returned_expr, 0))
                return;
        if (!slist_has_state(sm->possible, &fail))
                return;

        sql_insert_return_states(return_id,
                return_ranges, ZERO_ERROR,
                param, printed_name, "");
}

At this point, it occurred to me to warn about places which return a mix of zero and error codes for error. And I thought, I could just add an if statement to the function above:

if (holds_kernel_error_codes(returned_expr))
        sm_warning("mixed zero and negative");

Then I thought, maybe the insert hooks are only run when we need to insert something because we’re using –info or because the function is inlined? Testing shows that it does seem to work, but let’s do this properly and make a separate hook.

static void match_return(struct expression *expr)
{
        char *name;

        if (!expr || implied_not_equal(expr, 0))
                return;
        if (!expr_has_possible_state(my_id,
                        expr, &fail))
                return;
        if (!holds_kernel_error_codes(expr))
                return;

        name = expr_to_str(expr);
        sm_warning("both zero and negative are errors '%s'", name);
        free_string(name);
}

The only thing left is to make a warning hook. Okay checks are “if (irq == 0) {“, “if (irq != 0) {” and “if (irq <= 0) {“. The comparisons that are wrong are “if (if < 0) {” and “if (irq >= 0) {” but the “>= 0” comparison can be either signed or unsigned. The “< 0” could be unsigned too, but that already generates a warning so we don’t need to check for it.

static void match_condition(struct expression *expr)
{
        if (expr->type != EXPR_COMPARE ||
            !expr_is_zero(expr->right))
                return;
        if (expr->op != '<' &&
            expr->op != SPECIAL_GTE &&
            expr->op != SPECIAL_UNSIGNED_GTE)
                return;

        if (!expr_has_possible_state(my_id,
                        expr->left, &fail))
                return;

        sm_warning("zero is also failure");
}

Here is the check_kernel_irq_of_parse_and_map() function that ties it all together:

void check_kernel_irq_of_parse_and_map(int id)
{
        int i;

        if (option_project != PROJ_KERNEL)
                return;

        my_id = id;

        for (i = 0; i < ARRAY_SIZE(zero_table); 
             i++) {
                struct ref_func_info *info = &zero_table[i];
                return_implies_param_key_exact(
                        info->name,
                        *info->implies_start,
                        *info->implies_end,
                        returns_zero_error,
                        info->param,
                        info->key, info);
        }

        add_modification_hook(my_id,
                              &set_undefined);
        add_hook(&match_condition, CONDITION_HOOK);
        add_hook(&match_return, RETURN_HOOK);

        add_return_info_callback(my_id, 
                        return_info_callback);
        select_return_param_key(ZERO_ERROR, 
                        &returns_zero_error);
}

Thanks for reading to the end. I will post again after I get some results from this test.

http://staticthinking.wordpress.com/?p=330
Extensions