Sometimes, the kernel lies about process memory usage

Here's a short systems debugging story.

On dmoj.ca, we run user-submitted solutions to algorithmic programming problems against a set of input files, and judge their output for correctness. One metric by which solutions are ranked on our leaderboards is memory usage. A user recently reported that one of their submissions was shown as having consumed only 4 KiB of memory, despite allocating a 128 KiB array. How come?

This is a story about how sometimes, the kernel lies about memory usage — all in the name of performance.

To start with, here's the (slightly edited) submission in question, solving[1] this problem in Zig:

const std = @import("std");

pub fn main() !void {
    @setRuntimeSafety(false);
    const allocator: *std.mem.Allocator = std.heap.page_allocator;
    const input: []u8 = try allocator.alloc(u8, 131072);
    
    const stdin = std.io.getStdIn().inStream();
    const n = try stdin.readAll(input);

    var s: u32 = 0;
    var t: u32 = 0;
    for (input[0..n]) |b| {
        if (b == 's' or b == 'S') s += 1;
        if (b == 't' or b == 'T') t += 1;
    }

    const stdout = std.io.getStdOut().outStream();
    try stdout.writeAll(if (s >= t) "French" else "English");
}

Now, I don't actually know Zig, but this code seems to pretty clearly allocate a 128 KiB array on the heap.

An early thought I had was, "what if the array was only ever allocated virtual address space, and never faulted in its entirety because the input to the program is small?" Then I checked, and it turns out the input to this problem is quite large.
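
To make that hypothesis concrete, here's a minimal C sketch (my illustration, not part of the original debugging session): an anonymous mapping only starts counting toward RSS once its pages are actually written to.

/* Minimal sketch: with demand paging, an anonymous mapping costs no
 * physical memory until its pages are actually touched. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void print_vmrss(const char *label)
{
    /* Scan our own /proc/self/status for the VmRSS line. */
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    const size_t len = 128 * 1024;  /* same 128 KiB as the submission */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    print_vmrss("before touching:");  /* mapping exists, pages not yet faulted */
    memset(buf, 's', len);            /* fault in all 32 pages */
    print_vmrss("after touching: ");  /* should grow by roughly 128 KiB... */
    return 0;                         /* ...give or take the counter lag this post is about */
}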

The way the judge determines memory usage is by wait4(2)ing on the submission process until it exits, and then parsing the contents of /proc/${pid}/status for VmHWM[2] — the "high watermark RSS usage" of the process.
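
As a rough sketch of that last step (simplified, and the helper name is mine; the real judge also has to read the file before the child's /proc entry disappears), pulling VmHWM out of /proc/${pid}/status looks something like this:

/* Hypothetical helper: read VmHWM (reported by the kernel in kB) from
 * /proc/<pid>/status. Returns -1 if the field can't be read. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

long read_vmhwm_kb(pid_t pid)
{
    char path[64], line[256];
    long kb = -1;

    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmHWM:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}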

Anyway, enough beating around the bush, let's run some code in GDB.

$ zig build-exe --release-safe test.zig --name test
$ gdb ./test
(gdb) catch syscall exit_group
Catchpoint 1 (syscall 'exit_group' [231])
(gdb) run
Starting program: /tmp/test 

I set up a catchpoint on exit_group(2) so that we can inspect the process's state right before it exits.

Since the program asked for some input, I gave it some sample test data from the problem.

3
The red cat sat on the mat.
Why are you so sad cat?
Don't ask that.
^D

English

Catchpoint 1 (call to syscall exit_group), std.os.linux.exit_group (status=<optimized out>)
    at /opt/zig/lib/zig/std/os/linux.zig:556
556	   unreachable;

The program read the input, printed the answer (English), and hit our catchpoint on exit_group(2). Time to poke around and see what we can find.

First, we can confirm that VmRSS and VmHWM for this process indeed still say 4 KiB.

$ grep -E 'Vm(HWM|RSS)' /proc/9253/status 
VmHWM:	     4 kB
VmRSS:	     4 kB

That's certainly odd, but confirms what the user reported.

Back to GDB, where is this input array actually located?

(gdb) info proc mappings
process 9253
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x200000           0x202000     0x2000        0x0 /tmp/test
            0x202000           0x213000    0x11000     0x1000 /tmp/test
            0x213000           0x214000     0x1000    0x11000 /tmp/test
      0x7ffff7fd9000     0x7ffff7ff9000    0x20000        0x0 
      0x7ffff7ff9000     0x7ffff7ffd000     0x4000        0x0 [vvar]
      0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0 [vdso]
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]

Making an educated guess, it's the unnamed mapping at 0x7ffff7fd9000, since its size of 0x20000 bytes is exactly 128 KiB.

(gdb) x/32c 0x7ffff7fd9000
0x7ffff7fd9000:	51 '3'	10 '\n'	84 'T'	104 'h'	101 'e'	32 ' '	114 'r'	101 'e'
0x7ffff7fd9008:	100 'd'	32 ' '	99 'c'	97 'a'	116 't'	32 ' '	115 's'	97 'a'
0x7ffff7fd9010:	116 't'	32 ' '	111 'o'	110 'n'	32 ' '	116 't'	104 'h'	101 'e'
0x7ffff7fd9018:	32 ' '	109 'm'	97 'a'	116 't'	46 '.'	10 '\n' 87  'W' 104 'h'

Bingo, that's our sample input.

After staring at this for a few minutes, I had a bit of inspiration and took a look at /proc/${pid}/smaps, which reports per-segment information in more detail.

$ cat /proc/9253/smaps
...
7ffff7fd9000-7ffff7ff9000 rw-p 00000000 00:00 0 
Size:                128 kB
...
Rss:                 128 kB
...

Here, Rss is clearly being reported as 128 KiB for our allocation. Why is VmRSS not agreeing?

Armed with the knowledge that something funky is up with the RSS reporting, I turned to the time-honored tradition of grepping the kernel source for vague strings. In this case, searching for rss quickly brings up a likely culprit in lines 191-200 of mm/memory.c (as of kernel v5.10).

/* sync counter once per 64 page faults */
#define TASK_RSS_EVENTS_THRESH	(64)
static void check_sync_rss_stat(struct task_struct *task)
{
	if (unlikely(task != current))
		return;
	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
		sync_mm_rss(task->mm);
}
#else /* SPLIT_RSS_COUNTING */

"Sync counter once every 64 page faults", well, that'd do it. When, and why was this code added?

((v5.10)) $ git log -L191,+1:mm/memory.c
commit 34e55232e59f7b19050267a05ff1226e5cd122a5
Author: KAMEZAWA Hiroyuki <[email protected]>
Date:   Fri Mar 5 13:41:40 2010 -0800

    mm: avoid false sharing of mm_counter
    
    Considering the nature of per mm stats, it's the shared object among
    threads and can be a cache-miss point in the page fault path.
    
    This patch adds per-thread cache for mm_counter.  RSS value will be
    counted into a struct in task_struct and synchronized with mm's one at
    events.
    
    Now, in this patch, the event is the number of calls to handle_mm_fault.
    Per-thread value is added to mm at each 64 calls.
    
     rough estimation with small benchmark on parallel thread (2threads) shows
     [before]
         4.5 cache-miss/faults
     [after]
         4.0 cache-miss/faults
     Anyway, the most contended object is mmap_sem if the number of threads grows.
    
    [[email protected]: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
    Cc: Minchan Kim <[email protected]>
    Cc: Christoph Lameter <[email protected]>
    Cc: Lee Schermerhorn <[email protected]>
    Cc: David Rientjes <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
    Signed-off-by: Linus Torvalds <[email protected]>

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -125,0 +152,1 @@
+/* sync counter once per 64 page faults */

So, this is a performance optimization. Instead of having every thread contend on updates to the same global RSS counters, each thread maintains its own counters and only updates the global counters once every 64 page faults. From afar and without any deeper context, this seems reasonable.

Since Zig doesn't link libc in all its 1.8 MiB glory, our Zig program finishes executing having faulted fewer than 64 pages, and is therefore incorrectly reported in the global counters — and thus in VmHWM — as having faulted only a single 4 KiB page.

Three things stand out to me from this:

  1. There is no real workaround if you care about the HWM, like we do. /proc/${pid}/smaps provides accurate info for RSS, but not HWM (which only makes sense globally, not per-segment); see the sketch after this list.
  2. There is no way to turn this off, even at compile time. For getting accurate results on dmoj.ca, we're now running a patched kernel where we quite literally comment this code out. Maintaining patched kernels makes me sad.
  3. The global counters are not synced on thread exit. This one kind of sounds like a bug; one can imagine a program mmap(2)ing a large chunk of memory and spinning up many threads that each fault 63 pages before exiting. VmRSS and VmHWM sound like they'd be wildly off in this case.
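
Regarding the first point, here's roughly what the smaps-based workaround looks like for RSS (again a sketch of mine, not judge code): summing the per-mapping Rss fields gives a value derived from a page-table walk, sidestepping the lazily-synced counters behind VmRSS. There's just no equivalent for the high-water mark.

/* Sketch: compute an accurate RSS by summing the per-mapping Rss fields of
 * /proc/<pid>/smaps, which come from a page-table walk rather than from the
 * lazily-synced counters behind VmRSS. */
#include <stdio.h>
#include <sys/types.h>

long smaps_rss_kb(pid_t pid)
{
    char path[64], line[256];
    long total_kb = 0, kb;

    snprintf(path, sizeof(path), "/proc/%d/smaps", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Rss: %ld kB", &kb) == 1)
            total_kb += kb;
    fclose(f);
    return total_kb;
}

On kernels newer than 4.14, /proc/${pid}/smaps_rollup reports these sums directly, but it too has nothing resembling VmHWM.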

It turns out we're not the first to run into this inaccuracy. Prior to a patch from October 2020 titled "Document inaccurate RSS due to SPLIT_RSS_COUNTING", this behavior was totally undocumented. (The patch updates man 5 proc with a note regarding the inaccuracy in RSS accounting, but as of this writing there's still no mention in man 2 getrusage[3].)

The thread is worth a read, but in summary:

  • There are weird cases where the accounting can be off by more than 63 pages per thread; and
  • There is uncertainty among the kernel maintainers about whether the performance benefit of split counters is actually worth the resulting loss in accounting accuracy. That's a +1 from me, at least.

…and, that's all I got for today.

  1. This code does a "classic" competitive programming trick of pre-buffering the entire input in order to avoid calling read(2) more than necessary. System calls are expensive! 

  2. Why not just use the struct rusage *rusage populated by wait4(2), and grab ru_maxrss from it instead of parsing VmHWM out of /proc/${pid}/status? This is subtle enough that Guanzhong had to point it out while reading an early draft of this post, but ru_maxrss is reset on fork(2), while VmHWM is reset on exec(2). If we were to use ru_maxrss, the minimum possible memory usage reported by a submission would be that of the judge process at fork(2) time. The judge is written in Python, so this would be tens of megabytes. (A small sketch of the ru_maxrss route follows these footnotes.)

  3. Time permitting, I intend to send in a patch updating man 2 getrusage, and maybe another for syncing the counters on thread exit. Or maybe getrusage(2) is somehow fine? I haven't checked this.
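
For reference, here's a minimal sketch of the ru_maxrss route from footnote 2 (my illustration; /tmp/test stands in for the submission binary). On Linux, ru_maxrss is reported in kilobytes and covers everything from fork(2) onward, which is exactly why it doesn't work for us.

/* Sketch of the ru_maxrss route from footnote 2: wait4(2) fills in a
 * struct rusage for the child, but ru_maxrss covers the maximum RSS since
 * fork(2), not since exec(2), so it would include the judge's own memory. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        execl("/tmp/test", "test", (char *)NULL);  /* the submission binary */
        _exit(1);
    }

    int status;
    struct rusage ru;
    if (wait4(pid, &status, 0, &ru) < 0)
        return 1;
    printf("ru_maxrss: %ld kB\n", ru.ru_maxrss);  /* kilobytes on Linux */
    return 0;
}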