On online judging, part 1: the Linux sandbox

I've been operating a semi-large online judge for the better part of 4 years at the time of writing. People sometimes ask me how it works, so I figured it'd be a worthwhile task to document, in general, what goes into writing an online judge. So let's start with the basics: you need a sandbox to run user code in. Let's focus on Linux for now, since it's a significantly easier platform to target.

Requirements

When we started writing the DMOJ, there were two hard requirements:

Servers are expensive, so the sandbox must be able to run on "free" hosting providers. This means no root access, and typically ancient kernel versions.
The sandbox must be easy to expand to support more runtimes. Language-specific sandboxes are unacceptable, simply due to the effort of maintaining them.

Corollary: by being able to run without root access, you can easily push a "Folding@home"-type setup in the context of online judging: have users donate their machines to judge submissions. This cuts your costs, while making your community feel more involved in how your online judge is being run.

Problem: by throwing out root access and new kernels, you effectively eliminate the most popular choices for sandboxing. setuid(3) hacks, chroot(1) solutions, or cgroups(7) are out the window. But, you are left with a powerful tool that's more than sufficient for online judging purposes (but not for full-scale virtualization): ptrace(2).

A `ptrace(2)`-based sandbox

ptrace(2) is a very powerful Linux syscall. It's what allows fundamental development tools like gdb to operate. With ptrace(2), it's trivial to set up an interception mechanism for syscalls, which is exactly what we need to do to sandbox an application. The strace command is an easy way to see the usefulness of ptrace(2), as we'll see shortly.

Let's consider a simple Python script (which should not be allowed to run on an online judge, because it's dangerous!)

with open('file.txt', 'w') as f:
    f.write("I shouldn't be able to write files!")

Writing it to a file, say, bad.py, and running strace python bad.py produces a lot of output. If we specifically search for where file.txt appears in the output, we find the open(2) syscall was logged (emphasis mine):

open("file.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0666, st_size=0, ...}) = 0
fstat(3, {st_mode=S_IFREG|0666, st_size=0, ...}) = 0
write(3, "I shouldn't be able to write fil"..., 35) = 35

open(2) was called with file.txt as an argument; the syscall was successful and returned a new file descriptor, 3. Later, the write(2) syscall is used to write our test string to the file.

The beauty of ptrace(2) is that we can not only observe when syscalls happen, we can also take an active role and ask the kernel to notify us before a syscall happens. At that time, we can check to see if the syscall should be allowed to go through (in our example, it shouldn't), and if not, terminate the process.

This does add some overhead to processes, but in general runtime in programming problems is dominated by CPU-bound operations, not waiting for I/O to complete. Typically, the overhead incurred by sandboxing individual syscalls is < 10%, and can be subtracted from total runtime at the end.

`ptrace(2)` Sandboxing: Limitations

There are limitations with a ptrace(2) approach that make it unusable for implementing sandboxes for arbitrary applications. This doesn't affect an online judge, which deals with a very specific subset of applications, but they are worth noting nonetheless.

You have to maintain a syscall whitelist. For each runtime you wish to support, you must determine what syscalls they need access to, and whitelist those as appropriate. Some runtimes use some really crazy syscalls in really obscure ways, and you have to be prepared to handle that. Runtimes also update, in which case you may find yourself needing to allow more syscalls (this is where unit testing is crucial).
You need separate implementations for different platforms. You'll be digging into CPU registers to verify syscall parameters, so you'll need different mechanisms on x86, x64, x32, and ARM. Syscalls also differ between Linux and FreeBSD (FreeBSD has even more!) My collegue Guanzhong has published a more in-depth article on the difference between Linux and FreeBSD ptrace(2)-based sandboxing.
You can only run one thread. For security, you must not allow fork(2) or clone(2), otherwise you can have race conditions where, while your handler is checking the validity of a thread's syscall parameters, a second thread comes around and swaps them while you're still processing the first thread. If you must allow multithreading, you have to suspend all threads before checking syscall parameters — this absolutely kills multithreading performance for any real application.

`ptrace(2)` Sandboxing: Benefits

Apart from the previously-stated design goals of running without root as well as on old kernels, there are some cool things you can do with a ptrace(2) sandbox that you simply can't with anything else.

Redirect open(2) syscalls. Some programming contests require you to read from files, while others from standard input. It'd be nice if you only had to implement one of the above and support all problems — with a ptrace(2) sandbox, you can: simply override the open(2) syscall and return the file descriptor of standard input/output, as necessary. Sure, your users won't be able to fseek(3), but that's not a concern for legitimate problems.

`ptrace(2)` Sandboxing: in Hindsight

Is ptrace(2) the best choice for an arbitrary online judge? Maybe not. It's definitely better than a setuid(3) hack or chroot(1) maintainability nightmare, but cgroups(7) are a promising alternative if requiring root and new kernels is not a concern. It's what virtualization products like Docker use, and they're pretty powerful.

Is the difference between ptrace(2) and cgroups(7) big enough to consider switching from one to another? Probably not.

Bonus: Compilers Need to be Sandboxed Too!

I always start with this one whenever someone asks about "something cool" about online judging, but here I've put it at the end as a reward for reading this far ;)

It's a really bad idea to consider that compilers are sensible programs that will run without issues on any given input. That's blatantly false, and a bad compiler implementation can cripple your judge. Below are my top 3 favourite issues I've encountered. The fun part is that you can submit these (at your own discretion) to existing online judges, and many of them don't behave… nicely when given them as input.

Ah, the good ol' endless compile

Target: any gcc compiler

#include "/dev/random"

int main() {
    return 0;
}

Linear compile times

Target: any gcc compiler; bug tracker link

#include <bitset>
#include <climits>

std::bitset<UINT_MAX> justAVar;

int main()
{
   return 0;
}

OpenJDK 7 JVM bug

This one takes a bit more reading to understand, but it pays off. Here's the relevant DMOJ ticket as well as the associated IcedTea bug report filed.

Conclusion

Well, I hope you've enjoyed reading this far. Perhaps you've even learned something new! There's a lot more that goes into an online judge, and if time allows I'll be following up this post with more on the topic.