On online judging, part 5: optimizing ptrace filtering with seccomp

In part 1 of this series, I mentioned that the overhead of a pure ptrace-based sandbox is about 10%. In hindsight, this number is very optimistic — it can be as high as 50% for some workloads — but understanding why requires a bit of background on how the judge keeps track of submission time.

In this post, we'll discuss both submission time-keeping, and a simple but effective method to reduce sandboxing overhead using seccomp alongside ptrace.

Understanding the Problem

Any judge needs to keep track of how long a submission runs, to implement things like time limits.

A simple method of accounting for time spent in a submission involves continuously waiting for the process to be signalled, and keeping track of how long was spent waiting. This is more "fair" than simply timing how long it takes until the process exits, since it excludes the time spent filtering syscalls in the judge code.

total_time = 0
while True:
  start = time()
  wait for process to be signalled
  end = time()

  total_time += end - start
  if total_time > time_limit:
    kill process

  # If signal was for a syscall event (SIGTRAP) and not SIGWINCH, validate it
  resume process
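
For concreteness, here is a rough C sketch of the same loop; pid, time_limit, and validate_syscall are placeholders, and all error handling is omitted:

#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <time.h>

double total_time = 0;
for (;;) {
    struct timespec start, end;
    int status;

    clock_gettime(CLOCK_MONOTONIC, &start);
    waitpid(pid, &status, 0);               // block until the submission is signalled
    clock_gettime(CLOCK_MONOTONIC, &end);

    total_time += (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    if (total_time > time_limit)            // time_limit: placeholder
        kill(pid, SIGKILL);

    if (WIFEXITED(status) || WIFSIGNALED(status))
        break;

    if (WSTOPSIG(status) == SIGTRAP)        // syscall event, not our SIGWINCH
        validate_syscall(pid);              // hypothetical helper

    ptrace(PTRACE_SYSCALL, pid, NULL, 0);   // resume until the next syscall or signal
}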

We receive a signal for every syscall invocation when ptrace-ing, but in case a process goes a long time without making any, we also set up a task to periodically send SIGWINCH (a harmless signal that's normally ignored), just to force wait to return for time-keeping purposes.

while True:
  signal process with SIGWINCH
  sleep(0.1)
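
In C, this watchdog might simply be a thread in the judge, started with something like pthread_create(&tid, NULL, watchdog, &pid) before entering the wait loop; the 0.1-second interval is arbitrary:

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

// Runs in a separate thread of the judge; pid is the traced submission.
static void *watchdog(void *arg) {
    pid_t pid = *(pid_t *) arg;
    for (;;) {
        kill(pid, SIGWINCH);  // harmless: only exists to wake up the wait loop
        usleep(100000);       // 0.1 seconds, matching the pseudocode above
    }
    return NULL;
}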

Then, it's very simple to compute a naive measure of overhead: divide total_time by the total CPU time used by the process. Indeed, for most submissions, this figure will be less than 10% — but it doesn't tell the whole story. What we don't (and can't) easily measure is the overhead of the context switch from the submission to the tracer.
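
For instance, if the loop above uses wait4 instead of waitpid, the final call (the one that reaps the exited process) reports the submission's CPU time via rusage; a sketch, with total_time being the figure accumulated by the loop:

#include <sys/resource.h>
#include <sys/wait.h>

// wait4 behaves like waitpid, but also fills in resource usage.
struct rusage usage;
int status;
wait4(pid, &status, 0, &usage);

double cpu_time = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6   // user time
                + usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;  // system time
double overhead = total_time / cpu_time - 1;  // e.g. 0.1 means "10% overhead"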

Every time a context switch happens, the submission's performance suffers from the invalidation of the memory it was relying on to be cached. For instance, tight loops over multi-dimensional matrices are very common, so you'll often see user code that looks like this:

matrix = ... # 10005 by 10005 matrix

for i in range(10005):
  for j in range(10005):
    val = ... matrix[i][j] ... # Some computation using the indices
    if <some condition>:
      print(val)  # This will cause a `write` syscall

This code should be very easy for the processor to optimize: the memory accesses are predictable, and there would likely be very few TLB misses while it runs, since the translations for the pages backing matrix stay resident in the processor's TLB (a cache of page-table entries).

When we force a context switch, however, we greatly increase the likelihood of those cache entries being flushed by the time we resume the submission (on processors without process-context identifiers, e.g. pre-2010 Intel processors, we basically guarantee a TLB flush). So, when the process resumes, it bears the overhead of missing the cache on memory accesses that would have been hits if it were running without the tracer.

As a result, when comparing how long the process takes to execute with the tracer versus without it, we can (in some extreme cases) see slowdowns of 2x or more, despite the measured overhead still being reported as less than 10%.

A Brief Introduction to seccomp

Now that we've dissected the problem, let's talk about the solution: seccomp.

seccomp was initially introduced in kernel version 2.6.12 (2005), and allowed a one-way transition for a process to a secure state where it could only invoke the read, write, exit, and sigreturn syscalls. That may be useful for some secure computing projects, but it's nowhere near enough to run modern runtimes like Python. Thankfully, seccomp has since grown beyond that.

In 2012, as part of the 3.5 kernel, seccomp was extended to provide programmable BPF filters. With this new functionality, it's possible to write more expressive filters to implement syscall validation in the kernel, without trapping into ptrace. This LKML article provides a good, accessible overview of how seccomp BPF works.
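
For a concrete sense of what such a filter looks like, here's a hand-written sketch against the raw kernel interface, assuming x86-64 (a real filter should also check seccomp_data.arch); the rest of this post uses libseccomp, which generates an equivalent program:

#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

struct sock_filter filter[] = {
    // Load the syscall number from the seccomp_data the kernel hands the filter.
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    // Allow read and write outright (jump targets are relative instruction counts).
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 1, 0),
    // Anything else: stop the process and notify the ptrace tracer.
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);             // required to load a filter unprivileged
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);  // install the filter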

The key point is that we can add rules to seccomp to unconditionally allow very frequent, safe syscalls, and instruct it to hand control of the process over to a ptrace tracer when a dangerous syscall is encountered.

// In the tracer:
ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACESECCOMP /* instead of PTRACE_O_TRACESYSGOOD */);

// In the submission, before `exec`-ing (requires <seccomp.h> and linking with -lseccomp):
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRACE(0));      // default action: hand over to the ptrace tracer
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);   // always allow read
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);  // always allow write
seccomp_load(ctx);                                          // compile and install the BPF filter
seccomp_release(ctx);
execve(...);

With the above filter, read and write will always be unconditionally allowed, while e.g. open will cause a ptrace event and stop the process for inspection. That's just what we need!
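
On the tracer's side, that ptrace event shows up as a special stop after waitpid returns. A sketch of how the wait loop might detect it (handle_unsafe_syscall is a hypothetical helper):

#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

int status;
waitpid(pid, &status, 0);
if (WIFSTOPPED(status) &&
    status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
    // A syscall the filter didn't whitelist: inspect and validate it...
    handle_unsafe_syscall(pid);         // hypothetical helper
    // ...then let the process continue; the syscall runs unless we cancelled it.
    ptrace(PTRACE_CONT, pid, NULL, 0);
}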

Comparing ptrace and ptrace + seccomp Tracing

So far, we've established that ptrace is slow, and that seccomp can make things better.

However, their interfaces are somewhat different, so it's worth discussing how standard ptrace actions (like cancelling syscalls and changing their return values) can be implemented with seccomp-enabled ptrace.

Let's say a user submission wants to open a file, and we end up denying it because it's a file they shouldn't be accessing. We want open to fail with ENOENT, so that the user submission can respond accordingly. Here's roughly what happens under the hood with plain ptrace (a register-level sketch follows the list):

  • user submission calls open
  • [pre-syscall event transfers control to judge]
  • judge reads syscall number and arguments from registers; validates them
  • judge sets syscall number to something harmless and fast (getpid)
  • [judge resumes process]
  • kernel executes getpid
  • [post-syscall event transfers control to judge]
  • judge sets the return value register to -ENOENT (the kernel's convention for errors), thereby "cancelling" the open syscall
  • [judge resumes process]
  • user submission's open call returns with ENOENT
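
In code, that flow might look roughly like this on x86-64 (should_deny is a hypothetical helper; error handling omitted):

#include <errno.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>

// At the pre-syscall stop:
struct user_regs_struct regs;
int status;
ptrace(PTRACE_GETREGS, pid, NULL, &regs);
if (regs.orig_rax == __NR_open && should_deny(pid, &regs)) {  // should_deny: hypothetical
    regs.orig_rax = __NR_getpid;                              // run something harmless instead
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
}
ptrace(PTRACE_SYSCALL, pid, NULL, 0);    // run the (rewritten) syscall
waitpid(pid, &status, 0);                // wait for the post-syscall stop

// At the post-syscall stop:
ptrace(PTRACE_GETREGS, pid, NULL, &regs);
regs.rax = -ENOENT;                      // overwrite getpid's return value
ptrace(PTRACE_SETREGS, pid, NULL, &regs);
ptrace(PTRACE_SYSCALL, pid, NULL, 0);    // resume until the next syscall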

We can see that for every syscall the user submission performs, we trap and stop the process two times. This is not at all cache-friendly to the submission.

When tracing with seccomp-enabled ptrace, however, there are no separate pre- and post-syscall events; we are only notified once, before the syscall takes place. To support functionality like cancelling syscalls, seccomp-enabled ptrace allows the tracer to set the syscall number register to -1 at that stop, which instructs the kernel to skip the syscall entirely and return to the submission with the register set as-is (a sketch in code follows the list below).

  • user submission calls open
  • [pre-syscall event transfers control to judge]
  • judge reads syscall number and arguments from registers; validates them
  • judge sets syscall number to -1, and the return value register to -ENOENT
  • [judge resumes process]
  • user submission's open call returns with ENOENT
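
In code, again on x86-64, the entire exchange at the seccomp stop might look like this:

#include <errno.h>
#include <sys/ptrace.h>
#include <sys/user.h>

// At the seccomp-triggered stop, before the syscall has run:
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, NULL, &regs);
// ...validate regs.orig_rax and the argument registers...
regs.orig_rax = -1;                 // tell the kernel to skip the syscall entirely
regs.rax = -ENOENT;                 // the value open will appear to return
ptrace(PTRACE_SETREGS, pid, NULL, &regs);
ptrace(PTRACE_CONT, pid, NULL, 0);  // no post-syscall stop needed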

Here, we only have two context switches as opposed to four, but more importantly, we only run through this logic on unsafe syscalls like open. Syscalls that dominate a submission's lifespan, like read or write, never trigger seccomp to signal the process. That's a huge win for performance.

Conclusion

As of January 2019, the DMOJ sandbox takes a seccomp + ptrace approach, so as always you may peek at the source code to see how the ideas expressed in this post can be implemented in practice.

Empirically, speedups have been noticeable since upgrading the sandbox to use seccomp. They are particularly apparent for interactive tasks, which require frequent flushing of standard output, but can be felt to some extent across most problems.

There are further optimizations possible with this approach. For instance, the order in which rules are added to the filter matters, since they're evaluated in the order they were added, so there's an incentive to list the most common syscalls first (see the sketch below). But perhaps these will be the subject of a future post.
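
With libseccomp, one way to express that ordering hint is seccomp_syscall_priority, which asks the library to place a syscall's rules earlier in the generated filter. A minimal sketch, reusing the ctx from earlier:

// Priority ranges from 0 to 255; higher values are placed earlier in the generated program.
seccomp_syscall_priority(ctx, SCMP_SYS(read), 255);
seccomp_syscall_priority(ctx, SCMP_SYS(write), 254);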