On online judging, part 5: optimizing ptrace filtering with seccomp
In part 1 of this series, I mentioned that the overhead of a pure `ptrace`-based sandbox is about 10%. In hindsight, this number is very optimistic (it can be as high as 50% for some workloads), but understanding why requires a bit of background on how the judge keeps track of submission time. In this post, we'll discuss both submission time-keeping and a simple but effective method to reduce sandboxing overhead: using `seccomp` alongside `ptrace`.
Understanding the Problem
Any judge needs to keep track of how long a submission runs, to implement things like time limits.
A simple method of accounting for time spent in a submission involves continuously waiting for the process to be signalled, and keeping track of how long was spent `wait`-ing. This is more "fair" than strictly timing how long it takes until the process exits, since it excludes the time spent filtering syscalls in the judge code.
total_time = 0
while True:
    start = time()
    wait for process to be signalled
    end = time()
    total_time += end - start
    if total_time > time_limit:
        kill process
    # If signal was for a syscall event (SIGTRAP) and not SIGWINCH, validate it
    resume process
We receive signals for all syscall invocations when `ptrace`-ing, but in case the process doesn't make any for a long time, we also set up a task that periodically sends `SIGWINCH` (a harmless signal that's normally ignored) just to force `wait` to return for time-keeping purposes.
while True:
    signal process with SIGWINCH
    sleep(0.1)
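To make the pseudocode above a little more concrete, here is a rough C sketch of what the two loops might look like in a tracer. It is illustrative only: `child`, `time_limit_ns`, `poke_child`, and `trace_loop` are made-up names, error handling is omitted, and the real judge's code is considerably more involved.

#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static pid_t child;           /* pid of the traced submission (assumed set elsewhere) */
static int64_t time_limit_ns; /* configured time limit, in nanoseconds */

static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Periodically poke the child with SIGWINCH so the wait below returns
 * even if the submission makes no syscalls for a long time. */
static void *poke_child(void *arg) {
    (void)arg;
    for (;;) {
        kill(child, SIGWINCH);
        usleep(100 * 1000); /* 0.1 s, as in the pseudocode above */
    }
    return NULL;
}

/* The time-keeping wait loop from the pseudocode. */
static void trace_loop(void) {
    pthread_t poker;
    int64_t total_ns = 0;
    int status;

    pthread_create(&poker, NULL, poke_child, NULL);

    for (;;) {
        int64_t start = now_ns();
        if (waitpid(child, &status, 0) < 0 || WIFEXITED(status))
            break;
        total_ns += now_ns() - start;

        if (total_ns > time_limit_ns) {
            kill(child, SIGKILL);
            break;
        }

        /* If this stop was a syscall event (SIGTRAP) and not our
         * SIGWINCH poke, this is where the syscall would be validated. */
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
    }
}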
Then, it's very simple to compute a naive measure of overhead: compare `total_time` against the total CPU time used by the process, and see how much larger it is. Indeed, for most submissions, this figure will be less than 10%, but it doesn't tell the whole story. What we don't (and can't) easily measure is the overhead of the context switches from the submission to the tracer.
Every time a context switch happens, the submission's performance suffers because the data it was counting on being cached gets evicted. For instance, tight loops over multi-dimensional matrices are very common, so you'll often see user code that looks like this:
matrix = ... # 10005 by 10005 matrix
for i in range(10005):
    for j in range(10005):
        val = ... matrix[i][j] ... # Some computation using the indices
        if <some condition>:
            print(val) # This will cause a `write` syscall
This code should be very easy for the processor to optimize: the memory accesses are predictable, and there would likely be very few TLB misses while running, since the translations for the memory pages backing `matrix` would sit in the processor's TLB (a cache of page translations).
When we force a context switch, however, we greatly increase the likelihood of those cache entries being flushed by the time we resume the submission (on processors without process-context identifiers, e.g. pre-2010 Intel processors, we basically guarantee a TLB flush). So, when the process resumes, it bears the cost of cache misses on memory accesses that would have been hits had it been running without the tracer.
As a result, when comparing how long the process takes to execute without the tracer versus with it, we can see (in some extreme cases) the traced run being 2x or more slower, despite the overhead continuing to be reported as less than 10%.
A Brief Introduction to seccomp
Now that we've dissected the problem, let's talk about the solution: `seccomp`.
`seccomp` was initially introduced in kernel version 2.6.12 (2005), and allowed a one-way transition for a process into a secure state where it could only invoke the `read`, `write`, `exit`, and `sigreturn` syscalls. This may be useful for some secure computing projects, but it's not enough for running modern runtimes like Python. Thankfully, that's no longer all `seccomp` can do.
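For reference, entering that original "strict" mode is a single `prctl` call. Here's a minimal sketch (not something the judge uses, since as noted it's far too restrictive for real runtimes):

#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    /* One-way transition into strict mode: from here on, only read,
     * write, exit and sigreturn are permitted; any other syscall
     * kills the process with SIGKILL. */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
        return 1;

    const char msg[] = "write is still allowed\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);

    /* Use the raw exit syscall; exit(3) would call exit_group, which
     * strict mode does not allow. */
    syscall(SYS_exit, 0);
    return 0; /* not reached */
}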
In 2012, as part of the 3.5 kernel, `seccomp` was extended to provide programmable BPF filters. With this new functionality, it's possible to write more expressive filters to implement syscall validation in the kernel, without trapping into `ptrace`. This LKML article provides a good, accessible overview of how `seccomp` BPF works.
The key point is that we can add rules to `seccomp` to unconditionally allow very frequent, safe syscalls, and instruct it to hand control of the process over to a `ptrace` tracer when a dangerous syscall is encountered.
// In the tracer:
ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACESECCOMP /* instead of PTRACE_O_TRACESYSGOOD */);

// In the submission, before `exec`-ing:
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRACE(0)); // default action: raise a ptrace event
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_load(ctx);
seccomp_release(ctx);
execve(...);
With the above filter, `read` and `write` will always be allowed unconditionally, while e.g. `open` will cause a `ptrace` event and stop the process for inspection. That's just what we need!
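On the tracer's side, that seccomp stop surfaces as a ptrace event. Here's a rough, illustrative sketch of a wait loop that recognizes it; `tracer_loop` and `handle_seccomp_event` are made-up names, and error handling, the startup handshake with the child, and signal forwarding are all glossed over.

#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper that validates (and possibly rewrites) the
 * trapped syscall; sketches of what it could do appear in the next
 * section. */
void handle_seccomp_event(pid_t pid);

/* Rough sketch of the tracer's wait loop once the child has called
 * PTRACE_TRACEME, loaded the filter above, and exec'd. */
void tracer_loop(pid_t pid) {
    int status;

    /* Wait for the child's initial stop (the exact handshake with the
     * child is omitted here), enable seccomp events, then resume. */
    waitpid(pid, &status, 0);
    ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACESECCOMP);
    ptrace(PTRACE_CONT, pid, NULL, NULL);

    for (;;) {
        if (waitpid(pid, &status, 0) < 0 || WIFEXITED(status))
            break;

        if (WIFSTOPPED(status) &&
            status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
            /* A filtered syscall (e.g. open) hit the SCMP_ACT_TRACE
             * default action and stopped the child here. */
            handle_seccomp_event(pid);
        }
        /* Allowed syscalls (read, write) never stop the child at all. */
        ptrace(PTRACE_CONT, pid, NULL, NULL);
    }
}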
Comparing `ptrace` and `ptrace` + `seccomp` Tracing
So far, we've established that `ptrace` is slow, and that `seccomp` can make things better. However, their interfaces are somewhat different, so it's worth discussing how standard `ptrace` actions (like cancelling syscalls and changing their return values) can be implemented with `seccomp`-enabled `ptrace`.
Let's say a user submission wants to `open` a file, and we end up denying the call because it's a file they shouldn't be accessing. We want `open` to return `ENOENT`, so that the user submission can respond accordingly. Here's roughly what happens under the hood with plain `ptrace` (a rough C sketch follows the list):
- user submission calls `open`
- [pre-syscall event transfers control to judge]
- judge reads syscall number and arguments from registers; validates them
- judge sets syscall number to something harmless and fast (`getpid`)
- [judge resumes process]
- kernel executes `getpid`
- [post-syscall event transfers control to judge]
- judge sets return value register to `ENOENT`, thereby "cancelling" the `open` syscall
- [judge resumes process]
- user submission's `open` call returns with `ENOENT`
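In code, the two-stop dance above might look roughly like this on x86-64. This is a sketch, not the judge's actual implementation: `deny_open_plain_ptrace` is a made-up helper assumed to be called at the pre-syscall stop, and error handling is omitted.

#include <errno.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Deny an open() under plain ptrace tracing on x86-64. Called at the
 * pre-syscall (syscall-enter) stop for the open call. */
void deny_open_plain_ptrace(pid_t pid) {
    struct user_regs_struct regs;
    int status;

    /* Pre-syscall stop: rewrite the syscall into something harmless. */
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    regs.orig_rax = SYS_getpid;   /* was SYS_open */
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);

    /* Resume and wait for the post-syscall stop. */
    ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
    waitpid(pid, &status, 0);

    /* Post-syscall stop: fake the return value. */
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    regs.rax = -ENOENT;           /* what open() appears to return */
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);

    /* Resume; the submission's open() fails with ENOENT. */
    ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
}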
We can see that for every syscall the user submission performs, we trap and stop the process twice. This is not at all cache-friendly for the submission.
When tracing with `seccomp`-enabled `ptrace`, however, we do not have pre- and post-syscall events; we are only notified before the syscall takes place. To support functionality like cancelling syscalls, `seccomp` allows tracers to set the syscall number register to `-1` at the pre-syscall event. This instructs the kernel to skip the syscall, returning with the register set as-is. The flow becomes the following (again with a sketch in code after the list):
- user submission calls `open`
- [pre-syscall event transfers control to judge]
- judge reads syscall number and arguments from registers; validates them
- judge sets syscall number to `-1`, and return value register to `ENOENT`
- [judge resumes process]
- user submission's `open` call returns with `ENOENT`
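And the single-stop version, in the same illustrative style: `deny_open_seccomp` is a made-up helper assumed to be called when the seccomp trace stop for the `open` is reported, with error handling again omitted.

#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Deny the open() that triggered a seccomp trace stop, on x86-64.
 * There is only one stop: mark the syscall as skipped and set the
 * return value in the same breath. */
void deny_open_seccomp(pid_t pid) {
    struct user_regs_struct regs;

    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    regs.orig_rax = -1;      /* tell the kernel to skip the syscall */
    regs.rax = -ENOENT;      /* what open() appears to return */
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);

    /* One resume, and no second stop for this syscall. */
    ptrace(PTRACE_CONT, pid, NULL, NULL);
}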
Here, we only have two context switches as opposed to four, but more importantly, we only run through this logic on unsafe syscalls like `open`. Syscalls that dominate a submission's lifespan, like `read` or `write`, never trigger `seccomp` to signal the process. That's a huge win for performance.
Conclusion
As of January 2019, we've been taking a `seccomp` + `ptrace` approach in the sandbox for the DMOJ, so as always, you may peek at the source code to see how the ideas expressed in this post can be implemented in practice.
Empirically, speedups have been noticeable since upgrading the sandbox to use `seccomp`. They are particularly apparent for interactive tasks, which require frequent flushing of standard output, but can be felt to some extent across most problems.
There are further optimizations possible with this approach (for instance, the order in which rules are added to the filter matters, since they're evaluated in the order they were added, so there's an incentive to list the most common syscalls first), but perhaps they'll be the subject of a future post.