baq 22 hours ago [-]
Linux vm defaults are legit insane in 2026.
- system dies under memory pressure (regardless of swapping, actually not having swap makes it worse which should be common knowledge by now)
- system dies under disk pressure even if there are tons of free memory (this one is fun to diagnose)
- system can technically not die, but render itself useless (or worse) under memory pressure by the oom killer
- memory compression of any sort is not enabled
- ...
Both Windows and macOS do so much better out of the box for essentially any workload.
alexrp 14 hours ago [-]
> Both Windows and macOS do so much better out of the box for essentially any workload.
We test FreeBSD, Linux, macOS, NetBSD, OpenBSD, and Windows in Zig's CI fleet. Of these, Windows is the only OS that we've had to configure with swap double the size of physical RAM to not hit completely unjustifiable OOMs.
By "unjustifiable", I mean that we're not even close to actually running out of physical memory (let alone swap), but the MM seems to be doing a horrible job of making unused memory actually available to processes.
It's possible there's a relevant configuration knob here that we're just not aware of... but the point is, the default behavior does in fact suck.
0x1d7 3 hours ago [-]
Windows will autogrow the page file (*swap file is separate from a page file in Windows; swap is exclusively for UWP apps) as needed.
It sounds like the application wanted to allocate contiguous regions of memory when none were available. That's a typical indicator of an 'early' OOM condition.
LoganDark 13 hours ago [-]
Windows seems to work a lot better with a 16MB page file for whatever reason, just because it refuses to enable memory compression without it. Fucking stupid
baq 45 minutes ago [-]
It might be shocking but so does Linux (though 16MB won’t cut it) and all other OSes
psd1 8 hours ago [-]
My curiosity is piqued. What reason did you have to disable the page file?
LoganDark 2 hours ago [-]
128GB of system memory?
koverstreet 22 hours ago [-]
The mm people are increasingly hostile to any method of handling OOMs (like, just failing the allocation) besides the OOM killer - it's become very dominated by the hyperscalers and cloud vendors.
Working around mm nuttiness is a frequent source of frustration.
man8alexd 9 hours ago [-]
There is no point in managing memory allocations as they have little relation to actual memory usage. There are also other methods than the OOM killer to handle OOM, like process throttling using cgroups "memory.high" limits.
Kernel oomkiller is useless for the desktop. It only cares about kernel survival. There's user space oom options, like oomd, earlyoom ... but I'm not sure any heuristic really knows what user space program to clobber.
Resource control via cgroupsv2 sounds better for both desktop and server use cases, differing by what processes receive minimum resources. But IO isolation remains elusive. Some suggest a database of hardware capabilities, I wonder if a cheap estimator of throughput and latencies could do this dynamically.
Anyway, I'm not convinced user space should care about or rely on the kernel oomkiller. Well before it even considers the situation human interactivity with the system was lost.
psd1 8 hours ago [-]
> Well before it even considers the situation human interactivity with the system was lost.
"Are you logged on to DB1?"
"...yes?"
"What did you do, it just died"
This has happened multiple times
giancarlostoro 18 hours ago [-]
Yeah I noticed and was surprised after using EC2 that it didn't really have a sizable swap, so it would lock up if I used enough memory importing a db export. Was genuinely surprised.
pyinstallwoes 16 hours ago [-]
What are better defaults for 2026+?
baq 7 hours ago [-]
start with a couple gigs of swap, fixed size sub-gig dirty buffers, enable zram, enable an oom daemon with some sane default configuration like 'kill X, wayland, sshd and shells last and browsers and node processes first'
for some more user sanity huge ram consumers under memory pressure should be suspended, paged out as much as possible and the user notified (macOS does it right)
Bender 1 days ago [-]
They allude to this in the article but I would emphasize caution when using mode 2 especially if one has already adjusted overcommit ratios as one can prevent forks. Test this in a QA/Perf environment first, also testing the restart of all applications. Load test and do full QA tests before deploying to Production and even then when deploying to production I would just dynamically change the setting via app deployment scripts until confidence is high instead of putting it in the sysctl config files.
I've gone through this exercise in the past on much older kernels, which they cover as well. Personally, I ran into fewer issues by leaving overcommit at 0, dropping the overcommit ratio to 0, and setting the oom_score_adj for programs as high as 1000 if I wanted vmscan to leave them alone, and of course using the Red Hat formulas for setting vm.min_free_kbytes, vm.admin_reserve_kbytes, and vm.user_reserve_kbytes. And of course, be vigilant in disallowing app owners from using every last bit of memory.
Bender 1 days ago [-]
Correcting a rather significant typo: setting the oom_score_adj for programs as high as 1000 should be -1000 to be left alone. 1000 would make it a prime candidate for an OOM kill. Positive integers should be used on sacrificial superfluous programs. [1] As an example OpenSSH sets the sshd to -1000 by default.
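A minimal Python sketch of the sign convention described above: -1000 exempts a process from the OOM killer and +1000 makes it the preferred victim. The helper names are hypothetical; the only real interface used is the /proc/<pid>/oom_score_adj file. Lowering the value normally needs extra privileges (CAP_SYS_RESOURCE), so treat the write as something that may fail with a permission error.

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000  # process is never picked by the kernel OOM killer
OOM_SCORE_ADJ_MAX = 1000   # process is the preferred OOM victim


def oom_score_adj_path(pid="self"):
    """Return the procfs file that holds the adjustment for a pid."""
    return Path("/proc") / str(pid) / "oom_score_adj"


def validate_oom_score_adj(value):
    """Reject values outside the range the kernel accepts."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    return value


def set_oom_score_adj(value, pid="self"):
    """Write the adjustment; may raise PermissionError when lowering it."""
    oom_score_adj_path(pid).write_text(f"{validate_oom_score_adj(value)}\n")


if __name__ == "__main__":
    # Raising our own score (more killable) is allowed without privileges.
    set_oom_score_adj(500)
    print(oom_score_adj_path().read_text().strip())
```

Calling set_oom_score_adj(-1000, pid) for a critical daemon and a positive value for sacrificial helpers matches the correction above.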
I agree with the blog post's technical contents, but I feel we came across too strong in the title. For Ubicloud as a managed Postgres provider, we use strict memory overcommit. Our experience with operating Postgres at scale taught us that it's better to enable this than going with the defaults.
However, I can see many other scenarios, where using strict memory overcommit would have unanticipated side-effects. That's why Linux doesn't go with strict memory commit as its default.
furkansahin 1 days ago [-]
(Furkan, submitter) Hmm, I haven’t thought about that. I updated the title to better reflect Ubicloud Postgres' position.
dang 22 hours ago [-]
(previous title was "PostgreSQL and the OOM Killer: Why You Must Use Strict Memory Overcommit", if anyone is wondering.) Thanks for updating it here as well!
geraldwhen 1 days ago [-]
Is this an AI response?
furkansahin 1 days ago [-]
No.. But I have been using a lot of AI, recently. It might have impacted how I form my phrases? maybe?
sisve 21 hours ago [-]
Nah, you're doing great, you just reflect on your own position and adjust it. People are just (too) suspicious when people are not locked in their ways and are nice instead of hostile
leononame 1 days ago [-]
This has bitten me multiple times. The problem I have is that at work we deploy the application (written in Go) and PostgreSQL on the same machine. The backend app allocates a lot of virtual memory, and initially we had overcommit to 0 (heuristic). This caused crashes on big queries in PostgreSQL and we set it to 2. The whole system became a bit unstable because the backend would still allocate a lot of virtual memory and at some point we ran into errors when allocating.
For now, we have overcommit_ratio set to a value that is stable from experience, but there really seems to be no silver bullet. Go is very happy to allocate a lot of virtual memory, but so are most managed languages. The best solution would probably be to host the backend and the database on separate servers.
xyzzy_plugh 1 days ago [-]
I'm not sure if you are aware but there are relatively recent environment variables you can set to help contain Go memory to a fixed size.
GOMEMLIMIT works very well if you set it to around 90% of available memory as a rough heuristic. You should definitely profile your application to fine tune this number (e.g. if you link with C libraries that hold large memory pools then Go doesn't account for that) but also to identify sources of spikey/leaky allocations. For example, encoding/json is notorious for its inner sync.Pool hanging on to outsized buffers. There's usually a lot of low hanging fruit.
In my experience Go can be extremely stable in terms of memory footprint at both small (~O(1MiB)) and large (~O(256GiB)) scales, and it takes only a small amount of effort.
As far as GC languages go, it is by far the easiest to work with.
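As a rough illustration of the 90% heuristic, here is a small Python sketch that turns a memory budget into a GOMEMLIMIT string. The function names and the idea of launching the Go binary from a Python wrapper are assumptions for illustration only; GOMEMLIMIT itself accepts a byte count with an optional suffix such as MiB or GiB.

```python
import os

MIB = 1024 * 1024


def gomemlimit_value(available_bytes, fraction=0.9):
    """Format a soft memory limit as a GOMEMLIMIT string, rounded down to MiB."""
    if available_bytes <= 0:
        raise ValueError("available_bytes must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return f"{int(available_bytes * fraction) // MIB}MiB"


def env_with_gomemlimit(available_bytes, fraction=0.9, base_env=None):
    """Return a copy of the environment with GOMEMLIMIT set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GOMEMLIMIT"] = gomemlimit_value(available_bytes, fraction)
    return env


if __name__ == "__main__":
    # Assumed budget: 32 GiB reserved for the Go backend on a shared host.
    print(gomemlimit_value(32 * 1024**3))
```

The resulting environment could be passed to subprocess.run when starting the backend. Memory held by linked C libraries is not covered by the limit, so the fraction may need to be lower in that case.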
leononame 24 hours ago [-]
I was aware of GOMEMLIMIT, but it didn't cross my mind in this case, thanks for pointing it out. It could be really useful! I'll have to check our specific use case
fpoling 23 hours ago [-]
You can also use cgroup to set allocation policy for a particular program.
leononame 22 hours ago [-]
Interesting idea. How does that work in practice? If I've got 64GiB RAM, and PostgreSQL has 32GiB memory usage, and Go has 32GiB of memory.
If the database requests more memory, it gets ENOMEM, but if the backend app requests more memory, it does get some more because it can overcommit?
Sounds dangerous, if the go program then writes to the overcommitted memory, you'd still trigger the OOM killer, right?
man8alexd 21 hours ago [-]
cgroups have nothing to do with overcommit and memory allocation. They limit actual memory usage for a specific program or group of programs. If this program tries to use more memory than the cgroup memory limit, the program gets OOM killed.
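To make the distinction concrete, this Python sketch reads the cgroup v2 files that track and cap actual usage: memory.current, memory.high (throttling threshold) and memory.max (hard limit). The helper names and the example cgroup path are hypothetical. Nothing here touches overcommit settings, which stay system-wide.

```python
from pathlib import Path


def parse_cgroup_limit(text):
    """memory.high and memory.max hold either a byte count or the word 'max'."""
    text = text.strip()
    return None if text == "max" else int(text)


def headroom(current, limit):
    """Bytes left before the limit is reached, or None when unlimited."""
    return None if limit is None else limit - current


def read_memory_status(cgroup_dir):
    """Read usage and limits for one cgroup v2 directory."""
    d = Path(cgroup_dir)
    current = int((d / "memory.current").read_text())
    high = parse_cgroup_limit((d / "memory.high").read_text())
    hard = parse_cgroup_limit((d / "memory.max").read_text())
    return {
        "current": current,
        "high": high,
        "max": hard,
        "headroom_to_max": headroom(current, hard),
    }


if __name__ == "__main__":
    # Hypothetical path; the real location depends on the service manager.
    path = "/sys/fs/cgroup/system.slice/postgresql.service"
    if Path(path, "memory.current").exists():
        print(read_memory_status(path))
```

When usage crosses memory.high the group is throttled and reclaimed from; when it cannot stay under memory.max, a process in the group is OOM killed.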
leononame 11 hours ago [-]
Ok, got it, thanks. The wording in the comment I was responding to confused me and made me think it was possible to also change the overcommit behavior for a cgroup, not just the limits.
hilariously 1 days ago [-]
Yes, it would. Basically every serious database tries to allocate everything and more - back in the day we'd just allocate VMs on the machine even with the overhead because knowing it cannot leave its constraints and would work within them was worth the cost.
guenthert 1 days ago [-]
There are many reasons to use a dedicated host (or VM) for a DB server, but if only the accessible memory needs to be limited a container is the simpler, more efficient tool. That said, I would expect to be able to configure how much memory a DB process is allowed to allocate. I remember distinctly that PostgreSQL allows such. But of course both can be configured simultaneously, a belt-and-suspenders approach if you will.
Whether failed transactions are actually so much more desirable than an OOM-killed process isn't quite obvious, but it might be easier to troubleshoot.
mono442 21 hours ago [-]
The problem with disabling the memory overcommit is that then the RAM is wasted. That can be worked around with setting up swap but then the disk space is wasted.
man8alexd 20 hours ago [-]
People like to reinvent things that they are not aware of. Original BSDs used to use strict swap reservation - every anonymous memory page had to have an associated swap page. You had to have the swap 2x of RAM to allow large processes to fork - otherwise you would get an "out of swap" error. FreeBSD implemented overcommit around 2000, I think version 4.x or 5.x.
frollogaston 11 hours ago [-]
Why are programs even allocating memory that they don't use?
man8alexd 9 hours ago [-]
Because no one can predict the future, and they don't know how many resources they will need.
frollogaston 2 hours ago [-]
I mean they can just malloc when they need more, what am I missing? Unless this is about JVMs which might preallocate a ton for their heaps and not use it
10000truths 21 hours ago [-]
I'd be interested to see a Linux distribution whose entire shtick is to run well-behaved under a kernel with overcommit disabled. But it would be a huge undertaking. Besides the obvious issue with fork(), there are a lot of programs and libraries out there that implicitly rely on overcommit due to not checking malloc() for failure.
man8alexd 21 hours ago [-]
There is some kind of illusion or myth that strict overcommit solves memory management issues.
zbentley 11 hours ago [-]
Sigh. Malloc failure should have had to be trapped with a signal or something, not just a return status. I know, I know, threads and nesting handlers make that hard and historical precedent makes it impossible to retrofit, but I can still dream.
senderista 20 hours ago [-]
The proper way to handle OOM is to do what mature databases do: implement your own memory accounting, use only your own allocators integrated with the accounting system, and ensure that every allocation path can recover from OOM. Easier said than done.
man8alexd 20 hours ago [-]
MariaDB recently implemented memory PSI monitoring but failed with that in a curious way and disabled it afterwards by default. The failure is that under memory pressure, they flushed the entire InnoDB buffer pool.
zbentley 12 hours ago [-]
The issue is that there’s no generally correct behavior. Should a database under memory pressure stay up at all costs even if it becomes unusably slow (by e.g. nuking 99% of the buffer cache)? Or should it crash/failover hard with a likelihood of potential recovery afterwards, even if it technically could have stayed up? Something in between?
There’s no correct-in-general answers to those questions. This is a hard problem due to context dependence; that’s why there are so many knobs.
man8alexd 9 hours ago [-]
In this specific case, the correct behaviour would be to drop a part of the buffer pool until the memory pressure is gone. The context-dependent question is how much and how fast to drop. The current implementation drops to a single configurable level but I suspect it could have implemented better heuristics.
otterley 1 days ago [-]
I think this is also a good lesson on why it's best to isolate mission-critical services like databases on their own compute nodes.
adamors 1 days ago [-]
I read this article about 3 weeks ago when this bit me. Really great write-up, some tricky details.
man8alexd 22 hours ago [-]
Mode 0 (Heuristic) is described incorrectly. All this complex heuristic was removed almost a decade ago. Currently, the kernel refuses a single allocation that exceeds the physical memory. That is all.
The article ignores the proper modern solution to prevent OOM killing of critical processes - OOM Score Adjust.
Tuning CommitLimit manually is an archaic, imprecise, and error-prone way to handle memory limits, only suitable for single-process workloads that can handle ENOMEM properly. It completely ignores dynamic file page cache memory allocation. You still can get OOM if you get unusually high file activity. On the other hand, under low file activity, it wastes memory on the same page cache, because it can't be reclaimed without memory pressure, and memory pressure can't be created because workload hits ENOMEM earlier. Don't use strict overcommit.
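For reference, the CommitLimit being tuned in strict mode comes from a fixed formula in the kernel documentation: (total RAM - huge pages) * overcommit_ratio / 100 + swap. The Python sketch below reproduces that arithmetic from /proc/meminfo-style text. The sample numbers are made up, and it assumes vm.overcommit_kbytes is unset and only default-size huge pages are in use.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}; sizes are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def commit_limit_kb(meminfo, overcommit_ratio):
    """CommitLimit in kB for vm.overcommit_memory=2 with a percentage ratio."""
    hugetlb_kb = meminfo.get("HugePages_Total", 0) * meminfo.get("Hugepagesize", 0)
    usable_kb = meminfo["MemTotal"] - hugetlb_kb
    return usable_kb * overcommit_ratio // 100 + meminfo.get("SwapTotal", 0)


# Made-up sample: 64 GiB RAM, 2 GiB swap, no huge pages.
SAMPLE = """\
MemTotal:       67108864 kB
SwapTotal:       2097152 kB
HugePages_Total:       0
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    for ratio in (0, 50, 100):
        print(ratio, commit_limit_kb(info, ratio))
```

Note that the formula has no term for page cache, which is the gap the comment above points out.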
fdr 21 hours ago [-]
The key thing is Postgres does handle enomem well and does a nice rollback rather than crashing the server and entering crash recovery. It’s one of the few programs that does. Exceptions for the exception.
Even a revised heuristic that only spots large, individual allocations is not going to do the job.
Oom score adjust also doesn’t do the job: because the only interesting workload is Postgres, if a backend does a page fault that needs memory, who dies? Another sibling Postgres, almost certainly. Then postmaster does crash recovery, which most would rather avoid. High performance databases with distant checkpoints can take a while to come back up.
man8alexd 21 hours ago [-]
They do have sidecars like prometheus, node_exporter running alongside Postgres and they include them in their MemoryLimit calculations.
wongarsu 1 days ago [-]
For once, Microsoft's decision to just not do overcommit in Windows seems sensible
fpoling 22 hours ago [-]
Microsoft just followed VAX/VMS, which does not overcommit. And there is noise on the Linux mailing lists about implementing a process-builder pattern, which VAX had like 50 years ago…
man8alexd 23 hours ago [-]
Windows doesn't need to fork, and you can't fork a large process without overcommit.
layer8 18 hours ago [-]
One could rephrase the parent that Windows did the sensible thing by not using the fork model.
chiply314 1 days ago [-]
Nothing worse than memory management on Hyperscaler VMs which do not use Swap :|
Took k8s ages to get Swap support.
We lost something when we accepted that Hyperscalers just tell you to use more memory. It was shitty 5 years ago and even more so today, especially after the RAM price increases
ValdikSS 1 days ago [-]
My guess would be: it's because memory management before MGLRU was really not good and required different userspace solutions and tinkering. You either got killed by the OOM killer (no swap) or got into thrashing (swap).
And now, with PSI + MGLRU, the situation is much better, but there are still missing features/subsystems which would be nice to have. For example, there's no simple way to lock memory mlockall-style to ensure that a rarely used daemon doesn't face long cold-cache latency on its first access after a long idle time.
szmarczak 1 days ago [-]
I have disabled overcommit both on Windows and on Linux. I hate having random programs being killed.
Unfortunately, many programs commit 2x the memory they actually use. Often I see ~32GB committed and ~16GB resident.
sterwill 1 days ago [-]
Does this result in programs more frequently erroring/crashing because they can't allocate? I don't know how well many of the programs I frequently use on my desktop (Firefox, GNOME desktop, JVM + IntelliJ, Slack, etc.) handle allocation failures. I'm not sure they would do much better than crash, but I know the default OOM killer settings work well for me. About once a year a real runaway process (usually a throwaway program I'm working on) gets OOM-killed, and that's fine with me.
fpoling 22 hours ago [-]
Java made it possible to handle out-of-memory rather well, if one wanted to, even 25 years ago. Basically, one allocated a buffer at startup taking 5% of the memory the application was supposed to use and made all threads catch the OOM exception. When handling it, the buffer would be released, a GC would be forced, and a special flag would be set asking the app to cancel any memory-intensive tasks until enough memory was released for the buffer to be allocated again. It worked extremely well.
zbentley 11 hours ago [-]
Yeah, crash ballast is an extremely underrated tool.
It also works for the OOM killer: run a daemon with a child process that holds some fixed amount of memory. Adjust OOM scores of everything else on the system lower than the child. If the parent’s waitpid() returns due to an OOM kill, send an alert/shutdown nonessentials/sync buffers to disk and so on.
man8alexd 9 hours ago [-]
Maybe you should skip running a useless child process and just use PSI to monitor memory pressure.
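A minimal Python sketch of the PSI approach, assuming a kernel that exposes /proc/pressure/memory. The file has a "some" line and a "full" line, each with avg10/avg60/avg300 percentages and a cumulative total in microseconds. The 10% threshold and the function names are arbitrary placeholders, not recommended values.

```python
from pathlib import Path

PSI_MEMORY = Path("/proc/pressure/memory")


def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def under_pressure(psi, threshold=10.0):
    """True when tasks stalled on memory for >= threshold % of the last 10 s."""
    return psi["some"]["avg10"] >= threshold


if __name__ == "__main__":
    if PSI_MEMORY.exists():
        psi = parse_psi(PSI_MEMORY.read_text())
        print(psi["some"], "pressure!" if under_pressure(psi) else "ok")
```

A daemon could poll this in a loop and alert or shed load when under_pressure turns true, without needing a ballast child process.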
szmarczak 1 days ago [-]
> Does this result in programs more frequently erroring/crashing because they can't allocate?
I run Firefox, VSCodium with LSP, Discord, Signal and there's still space left for a game like CS2. I'm not a heavy user by any means.
> I'm not sure they would do much better than crash
I have yet to see a program that silently handles allocation failures and doesn't crash. These days everything is coded to crash if no memory :(
> About once a year a real runaway process (usually a throwaway program I'm working on) gets OOM-killed
In my case it killed system critical processes with no way to recover. With disabled overcommit, it freezes for a while (usually for a minute or two), I close some random program of my choosing and then see in Resource Monitor what's eating my ram.
__s 23 hours ago [-]
> I have yet to see a program that silently handles allocation failures and doesn't crash
Postgres handles allocation failures
man8alexd 23 hours ago [-]
disabling overcommit is trading OOM for random program crashes due to the inability to handle ENOMEM. It also wastes a lot of system memory.
I don't think it has overcommit at all, at least that's the default. That would be why you don't have Windows OOM killer stories.
senfiaj 21 hours ago [-]
I think Windows also uses something similar to Linux memory overcommit (maybe call it "lazy page allocation"), but probably all the available virtual memory is backed by the page file, the OS has limit for the allocation size and all the edge cases are handled well under low memory conditions. For example, think about stack memory. Megabytes of stack space can be reserved for each thread, but since programs rarely use the whole stack, it's not rational to immediately allocate all the physical pages, it will waste a lot of memory (especially for multi-threaded programs). The stack grows on demand when the program hits a guard page.
In short, Windows partially does the same lazy thing but unlike Linux with its optimistic overcommit, it is stricter about commit/backing-budget reservation.
0x1d7 18 hours ago [-]
Windows will never overcommit. All pages are backed by physical memory or backing store.
senfiaj 18 hours ago [-]
Yes, that was my point. But Windows might still allocate pages lazily (i.e. reserve them and actually allocate only when the program writes into them). The difference from Linux is that it will only reserve if the allocation has a reasonable size and the total committed allocation doesn't exceed certain safe limits / quotas, so it will be possible to safely swap on disk later under low memory conditions.
tredre3 1 days ago [-]
The reason you hear less about Windows's OOM killer is simply because it works well.
The Linux Kernel OOM killer kills random things. Userspace OOM killers are meant to improve this, and they work well in a server situation when you already know in advance what is likely to go haywire and what is safe to kill. But they don't work well on desktop (some of them are improving but it doesn't seem to be a priority).
The Windows OOM killer by comparison usually kills something sensible (i.e. the program that is actually using all the memory), and asks the user for permission before killing it (when possible). You do see a lot of memes of situations where it fails.
man8alexd 23 hours ago [-]
> The Linux Kernel OOM killer kills random things.
By default, the Linux kernel kills the largest process in the system (unless OOM adjust was applied).
0x1d7 18 hours ago [-]
Which, by default, is dumb for a presumably interactive system. Photoshop (or equiv) is going to be my "largest" process on a system. Because it's the one I'm interacting with.
Don't kill what I'm using.
man8alexd 18 hours ago [-]
You can always tell the system not to kill your Photoshop (or whatever else) by setting the OOM Score Adjust. This mechanism has existed for almost two decades and systemd has supported it for over a decade.
0x1d7 3 hours ago [-]
Yes, you can but the fact that I have to do it makes for a poor system.
senfiaj 21 hours ago [-]
I think Windows doesn't do OOM kills; it just fails fast for unreasonably large memory allocations. If the allocation succeeds, the committed virtual pages are backed (or, more precisely, guaranteed to be backed) by a page file.
nok22kon 1 days ago [-]
damn, good observation, when my data analysis python script goes wrong and allocates 24 GB of RAM on a 32 GB computer, it crashes (gets killed) with "out of memory" error. I've never seen something else getting killed
frollogaston 11 hours ago [-]
That's what I've seen Linux do too. What's different about how Windows chooses a process to kill?
0x1d7 3 hours ago [-]
Windows doesn't kill.
wongarsu 1 days ago [-]
Not overcommitting is Windows's default and only behavior
A memory allocator can implement overcommit, because you can separate reserving virtual memory and having it backed by physical memory into two different system calls. But from the point of view of the kernel, any time it promises to give you physical memory that memory is backed either by RAM or by space reserved in the swap file
senfiaj 21 hours ago [-]
As I understand it, Windows can also lazily allocate pages, but it does so only after making sure the memory budget is adequate in case of physical RAM pressure and that the allocation is guaranteed to be backed by the page file. But yeah, the Linux approach is really sloppy.
szmarczak 1 days ago [-]
Settings -> View advanced system settings -> Performance (Settings) -> Advanced -> Virtual memory (Change...) -> No paging file
tredre3 1 days ago [-]
That's disabling swap, not overcommit. Windows doesn't overcommit. It's one of the reasons why it handles low memory situations so much more gracefully than Linux.
0x1d7 1 days ago [-]
^
The purpose of the system commit limit and commit charge is to track all uses of these resources to ensure they are never overcommitted — that is, that there is never more virtual address space defined than there is space to store its contents, either in RAM or in backing store (on disk).
- Windows Internals, 7th Edition
0x1d7 1 days ago [-]
This is almost always a bad idea.
If no memory is available where a page file would make a difference, this leads to application crashes instead. A crash is (usually) worse than paging.
Certain applications, Photoshop being the historical example, will outright fail to run with no page file present.
szmarczak 1 days ago [-]
> this leads to application crashes instead
Same happens if the page file is full. In that case, why don't those programs use disk directly instead?
No such problem would've ever occurred if programs hadn't allocated more than they actually use.
toast0 1 days ago [-]
By default, windows uses an expandable page file.
Typically, performance drops enough that the user kills the program or reboots before the page file expands to fill the disk. And other threads here suggest there is something that will prompt users to kill programs in states like this.
> No such problem would've ever occurred if programs hadn't allocated more than they actually use.
That's part of the issue, but sometimes things do in fact use too much memory as well as allocate too much.
Another part of the issue is that few programs are built to handle allocation failures.
And then you have a metrics issue. There's not really a good metric to know when you're out of memory, other than performance collapse. If your applications don't use disk, it's not too hard; but when they do use disk, performance will collapse once there's insufficient memory to provide the disk caching needed. In my experience, adding a small swap and monitoring swap i/o can be pretty helpful, and a small swap doesn't tend to allow long thrashing when memory use grows. But that's not universal and everybody loves to hate swap these days.
quotemstr 23 hours ago [-]
> Typically, performance drops enough that the user kills the program or reboots before the page file expands to fill the disk. And other threads here suggest there is something that will prompt users to kill programs in states like this.
Not in the age of NVMe it doesn't. Swap is fast now. Plus, at least on Linux, you can put zswap in front of the regular swap and introduce an even faster level of memory hierarchy and thereby make page-outs even more profitable.
0x1d7 18 hours ago [-]
Swap is not fast. Faster than it was, yes, fast, no.
swinglock 21 hours ago [-]
Windows does memory compression too.
quotemstr 19 hours ago [-]
It's been a while since I've looked at it, but IIRC it doesn't model the compressed RAM as a special swap tier, which IMHO is a pretty elegant way to look at it.
0x1d7 1 days ago [-]
Your argument falls flat when a page file can be multi-GB and automatically grow. And if your application admin was competent, memory monitoring would be part of the application monitoring stack.
An application that grows in such a way (besides having backing stores for memory-mapped files, as well) will often perform so poorly that it requires addressing (adding RAM, looking for application faults, etc).
A page file is insurance, one that can last you much longer than available system memory.
szmarczak 1 days ago [-]
> memory monitoring would be part of the application monitoring stack
You don't need it if you have everything allocated upfront. TigerBeetle does this, everybody else can.
Using something like Rust is already a huge win when compared to shipping a browser or running Node.js.
> Your argument falls flat when a page file can be multi-GB and automatically grow
This doesn't solve the original issue and only masks the underlying problem.
0x1d7 18 hours ago [-]
> This doesn't solve the original issue and only masks the underlying problem.
You're moving goal posts. No, a page file doesn't solve the problem of a misbehaving application, but it does solve the problem of an app crash because no more VAS allocation can be made.
You should really dive into Windows Internals. Only misinformed gamers turn off page files.
IshKebab 22 hours ago [-]
There's so much great stuff here.
First, Linux's default memory management strategy is bonkers. OOM killing rarely actually works in my experience, at least on desktop. It takes ages to kick in and usually the system just freezes and you have to hard reboot. I've experienced this on every Linux system I've used, even my current one with 128GB of RAM and 64GB of swap, so don't say "it works for me". Windows and Mac do not have this issue at all, so clearly it's possible to do it better.
Has anyone tried using strict overcommit on desktop Linux?
Second, this bug is a great counterpoint to those annoying people who naysay Rust with "but not all bugs are memory safety bugs, what about logic bugs? huh?". Rust code would not have had this bug.
> On the modern desktop, where programmers don't care about failing malloc(), disabling overcommit is shooting yourself in the foot. As you can observe, the memory allocations start failing long before the memory is exhausted.
senfiaj 20 hours ago [-]
So it's a footgun because of a lot of low-quality software?
man8alexd 20 hours ago [-]
Is Firefox low quality?
senfiaj 19 hours ago [-]
Well, if Firefox also works on other OSes, it probably should gracefully handle failing allocations, shouldn't it?
If a memory allocation fails in strict mode then you'll get a null pointer returned and some kind of crash or panic (in code that doesn't handle it properly).
If it fails with the default mode the whole process will get killed by the OS. Is that really much better?
man8alexd 8 hours ago [-]
OOM is better. If a program doesn't handle ENOMEM properly, then its state is unpredictable and can lead to data corruption.
IshKebab 6 hours ago [-]
I don't think it's as simple as that. Killing a program can have unintended consequences too, e.g. corrupting files they were in the middle of writing.
grg0 21 hours ago [-]
What exactly does Rust solve here? Virtual memory is a hardware/OS feature.
IshKebab 21 hours ago [-]
The bug that is detailed in the article. Wouldn't have happened with Rust.
grg0 21 hours ago [-]
This is not a memory safety bug, but a bug resulting from a type coercion of int to bool. I don't know if Rust is stricter but your original statement was confusing.
IshKebab 9 hours ago [-]
Exactly my point. Lots of people think Rust only prevents memory safety bugs, but this is an example of a bug that isn't a memory safety bug but also wouldn't have happened in Rust.
They used an int with special meanings for negative/0/positive values. Very common in C, and not at all type safe (all meanings have the same type). In Rust you would use an enum or Result, it would be type safe and the refactoring mistake they made would have been a compile time error.
senfiaj 21 hours ago [-]
There was also a bug in the Linux kernel, or did I miss something?
IshKebab 21 hours ago [-]
Yes there was. It's detailed in the article.
senfiaj 20 hours ago [-]
But why wouldn't it have happened with Rust? Sorry, I can't find anything about Rust in the article. Or do you mean that if the Linux kernel were written in Rust, that stupid bool coercion would not have been possible?
grg0 19 hours ago [-]
It's the latter; Rust won't allow the int->bool coercion.
Though to be absolutely pedantic, !x is an int for x:int in C, there is no bool coercion involved; an if-statement takes an expression of any scalar value and evals to true on non-zero. Not that that helps to avoid introducing bugs anyway.
IshKebab 9 hours ago [-]
Because Rust programmers would have used an enum or Result.
Working around mm nuttiness is a frequent source of frustration.
Resource control via cgroups v2 sounds better for both desktop and server use cases, which differ in which processes receive minimum resources. But IO isolation remains elusive. Some suggest a database of hardware capabilities; I wonder if a cheap estimator of throughput and latencies could do this dynamically.
Anyway, I'm not convinced user space should care about or rely on the kernel OOM killer. Well before it even considers the situation, human interactivity with the system has been lost.
"Are you logged on to DB1?"
"...yes?"
"What did you do, it just died"
This has happened multiple times
For some more user sanity, huge RAM consumers under memory pressure should be suspended and paged out as much as possible, and the user should be notified (macOS does it right).
I've gone through this exercise in the past on much older kernels, which they cover as well. Personally, I ran into fewer issues by leaving overcommit at 0, dropping the overcommit ratio to 0, and setting oom_score_adj [1] for programs as low as -1000 if I wanted the OOM killer to leave them alone, and of course using the Red Hat formulas for setting vm.min_free_kbytes, vm.admin_reserve_kbytes, and vm.user_reserve_kbytes. And of course, be vigilant in disallowing app owners from using every last bit of memory.
[1] - https://man7.org/linux/man-pages/man5/proc_pid_oom_score_adj...
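For reference, adjusting the score is just a write to a per-process file. A minimal sketch follows, assuming the documented range of -1000 (exempt from the OOM killer) to 1000 (strongly preferred victim); the function name `set_oom_score_adj` and the `proc_dir` parameter are illustrative choices, not an existing API.

```rust
use std::{fs, io, path::Path};

/// Write `adj` to `<proc_dir>/oom_score_adj`.
/// `proc_dir` would normally be `/proc/<pid>` or `/proc/self`; it is a
/// parameter here so the sketch can be tried against a scratch directory.
fn set_oom_score_adj(proc_dir: &Path, adj: i32) -> io::Result<()> {
    // The kernel accepts -1000 (never pick this task) through
    // 1000 (strongly prefer this task); reject anything else early.
    if !(-1000..=1000).contains(&adj) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "oom_score_adj must be between -1000 and 1000",
        ));
    }
    fs::write(proc_dir.join("oom_score_adj"), adj.to_string())
}
```

On a real system, something like `set_oom_score_adj(Path::new("/proc/self"), 500)` should work for an unprivileged process, while lowering the value generally needs extra privileges (CAP_SYS_RESOURCE, as far as I know).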
I agree with the blog post's technical contents, but I feel we came across too strong in the title. For Ubicloud as a managed Postgres provider, we use strict memory overcommit. Our experience with operating Postgres at scale taught us that it's better to enable this than going with the defaults.
However, I can see many other scenarios, where using strict memory overcommit would have unanticipated side-effects. That's why Linux doesn't go with strict memory commit as its default.
For now, we have overcommit_ratio set to a value that is stable from experience, but there really seems to be no silver bullet. Go is very happy to allocate a lot of virtual memory, but so are most managed languages. The best solution would probably be to host the backend and the database on separate servers.
GOMEMLIMIT works very well if you set it to around 90% of available memory as a rough heuristic. You should definitely profile your application to fine-tune this number (e.g. if you link with C libraries that hold large memory pools then Go doesn't account for that) but also to identify sources of spiky/leaky allocations. For example, encoding/json is notorious for its inner sync.Pool hanging on to outsized buffers. There's usually a lot of low-hanging fruit.
In my experience Go can be extremely stable in terms of memory footprint at both small (~O(1MiB)) and large (~O(256GiB)) scales, and it takes only a small amount of effort.
As far as GC languages go, it is by far the easiest to work with.
If the database requests more memory, it gets ENOMEM, but if the backend app requests more memory, it does get some more because it can overcommit?
Sounds dangerous. If the Go program then writes to the overcommitted memory, you'd still trigger the OOM killer, right?
Whether failed transactions are actually so much more desirable than an OOM-killed process isn't quite obvious, but it might be easier to troubleshoot.
There are no correct-in-general answers to those questions. This is a hard problem due to context dependence; that’s why there are so many knobs.
The article ignores the proper modern solution to prevent OOM killing of critical processes - OOM Score Adjust.
Tuning CommitLimit manually is an archaic, imprecise, and error-prone way to handle memory limits, only suitable for single-process workloads that can handle ENOMEM properly. It completely ignores dynamic file page cache memory allocation. You still can get OOM if you get unusually high file activity. On the other hand, under low file activity, it wastes memory on the same page cache, because it can't be reclaimed without memory pressure, and memory pressure can't be created because workload hits ENOMEM earlier. Don't use strict overcommit.
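For readers unfamiliar with what is being tuned: in strict mode (vm.overcommit_memory=2), and when vm.overcommit_kbytes is unset, the kernel documents CommitLimit as (RAM minus huge pages) * overcommit_ratio / 100 + swap. A small sketch of that arithmetic over `/proc/meminfo`-style text follows; the function names `field` and `commit_limit_kb` are made up here, and the ratio is passed in rather than read from the sysctl.

```rust
/// Return the leading number of a `/proc/meminfo`-style field, e.g. the
/// value in kB for "MemTotal" or the page count for "HugePages_Total".
fn field(meminfo: &str, name: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let (key, rest) = line.split_once(':')?;
        if key.trim() != name {
            return None;
        }
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Estimate CommitLimit in kB for a given overcommit_ratio (percent).
/// Assumes vm.overcommit_kbytes is 0, so the ratio-based formula applies.
fn commit_limit_kb(meminfo: &str, overcommit_ratio: u64) -> Option<u64> {
    let mem_total = field(meminfo, "MemTotal")?;
    let swap_total = field(meminfo, "SwapTotal").unwrap_or(0);
    // Huge pages are reserved up front and excluded from the RAM term.
    let huge_kb = field(meminfo, "HugePages_Total").unwrap_or(0)
        * field(meminfo, "Hugepagesize").unwrap_or(0);
    Some(mem_total.saturating_sub(huge_kb) * overcommit_ratio / 100 + swap_total)
}
```

With the default ratio of 50, a machine with 16 GB of RAM and 4 GB of swap gets a limit of roughly 12 GB. Nothing in the formula looks at page cache, which is the imprecision the comment above is complaining about.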
Even a revised heuristic that only spots large, individual allocations is not going to do the job.
Oom score adjust also doesn’t do the job: because the only interesting workload is Postgres, if a backend does a page fault that needs memory, who dies? Another sibling Postgres, almost certainly. Then postmaster does crash recovery, which most would rather avoid. High performance databases with distant checkpoints can take a while to come back up.
It took k8s ages to get swap support.
We lost something when we accepted that hyperscalers just tell you to use more memory. It was shitty 5 years ago, and it is worse today, especially after the RAM price increases.
And now, with PSI + MGLRU, the situation is much better, but there are still missing features/subsystems that would be nice to have. For example, there's no simple way to lock memory mlockall-style to ensure that a rarely used daemon does not face a long no-cache latency on its first access after a long idle period.
Unfortunately, many programs commit 2x more memory than they actually use. Often I see ~32GB committed and ~16GB resident.
It also works for the OOM killer: run a daemon with a child process that holds some fixed amount of memory. Adjust OOM scores of everything else on the system lower than the child. If the parent’s waitpid() returns due to an OOM kill, send an alert/shutdown nonessentials/sync buffers to disk and so on.
I run Firefox, VSCodium with LSP, Discord, Signal and there's still space left for a game like CS2. I'm not a heavy user by any means.
> I'm not sure they would do much better than crash
I have yet to see a program that silently handles allocation failures and doesn't crash. These days everything is coded to crash if no memory :(
> About once a year a real runaway process (usually a throwaway program I'm working on) gets OOM-killed
In my case it killed system critical processes with no way to recover. With disabled overcommit, it freezes for a while (usually for a minute or two), I close some random program of my choosing and then see in Resource Monitor what's eating my ram.
Postgres handles allocation failures
https://unix.stackexchange.com/questions/797835/disabling-ov...
I don't think it has an option for that.
In short, Windows partially does the same lazy thing but unlike Linux with its optimistic overcommit, it is stricter about commit/backing-budget reservation.
The Linux Kernel OOM killer kills random things. Userspace OOM killers are meant to improve this, and they work well in a server situation when you already know in advance what is likely to go haywire and what is safe to kill. But they don't work well on desktop (some of them are improving but it doesn't seem to be a priority).
The Windows OOM killer by comparison usually kills something sensible (i.e. the program that is actually using all the memory), and asks the user for permission before killing it (when possible). You do see a lot of memes of situations where it fails.
By default, the Linux kernel kills the largest process in the system (unless OOM adjust was applied).
Don't kill what I'm using.
A memory allocator can implement overcommit, because you can separate reserving virtual memory and having it backed by physical memory into two different system calls. But from the point of view of the kernel, any time it promises to give you physical memory, that memory is backed either by RAM or by space reserved in the swap file.
If no memory is available where a page file would make a difference, this leads to application crashes instead. A crash is (usually) worse than paging.
Certain applications, Photoshop being the historical example, will outright fail to run with no page file present.
Same happens if the page file is full. In that case, why don't those programs use disk directly instead?
No such problem would've ever occurred if programs hadn't allocated more than they actually use.
Typically, performance drops enough that the user kills the program or reboots before the page file expands to fill the disk. And other threads here suggest there is something that will prompt users to kill programs in states like this.
> No such problem would've ever occurred if programs hadn't allocated more than they actually use.
That's part of the issue, but sometimes things do in fact use too much memory as well as allocate too much.
Another part of the issue is that few programs are built to handle allocation failures.
And then you have a metrics issue. There's not really a good metric to know when you're out of memory, other than performance collapse. If your applications don't use disk, it's not too hard; but when they do use disk, performance will collapse once there's insufficient memory to provide the disk caching needed. In my experience, adding a small swap and monitoring swap i/o can be pretty helpful, and a small swap doesn't tend to allow long thrashing when memory use grows. But that's not universal and everybody loves to hate swap these days.
Not in the age of NVMe it doesn't. Swap is fast now. Plus, at least on Linux, you can put zswap in front of the regular swap and introduce an even faster level of memory hierarchy and thereby make page-outs even more profitable.
An application that grows in such a way (besides having backing stores for memory-mapped files, as well) will often perform so poorly that it requires addressing (adding RAM, looking for application faults, etc).
A page file is insurance, one that can last you much longer than available system memory.
You don't need it if you have everything allocated upfront. TigerBeetle does this, everybody else can.
Using something like Rust is already a huge win when compared to shipping a browser or running Node.js.
> Your argument falls flat when a page file can be multi-GB and automatically grow
This doesn't solve the original issue and only masks the underlying problem.
You're moving goal posts. No, a page file doesn't solve the problem of a misbehaving application, but it does solve the problem of an app crash because no more VAS allocation can be made.
You should really dive into Windows Internals. Only misinformed gamers turn off page files.
First, Linux's default memory management strategy is bonkers. OOM killing rarely actually works in my experience, at least on desktop. It takes ages to kick in and usually the system just freezes and you have to hard reboot. I've experienced this on every Linux system I've used, even my current one with 128GB of RAM and 64GB of swap, so don't say "it works for me". Windows and Mac do not have this issue at all, so clearly it's possible to do it better.
Has anyone tried using strict overcommit on desktop Linux?
Second, this bug is a great counterpoint to those annoying people who naysay Rust with "but not all bugs are memory safety bugs, what about logic bugs? huh?". Rust code would not have had this bug.
> On the modern desktop, where programmers don't care about failing malloc(), disabling overcommit is shooting yourself in the foot. As you can observe, the memory allocations start failing long before the memory is exhausted.