I really dislike the use of spinlocks in postgres (and have been replacing a lot of uses over time), but it's not always easy to replace them from a performance angle.
On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake). Turns out that this increase in memory barriers causes regressions that are nontrivial to avoid.
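To make the difference concrete, here's a minimal C11 sketch of the two release paths (illustrative only, not postgres' actual code): on x86 the first compiles to a plain MOV with no fence, the second to a LOCK-prefixed XCHG.

```c
#include <stdatomic.h>

#define LOCKED  1u
#define WAITERS 2u

/* Spinlock release: a release store suffices; on x86 this is a plain
 * MOV, no LOCK prefix, no fence. */
static void spin_unlock(atomic_uint *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}

/* Futex-based release: we must atomically clear the lock AND learn
 * whether a waiter flagged itself, so we need a read-modify-write
 * (a LOCK-prefixed XCHG on x86, which is a full barrier).  Returns
 * nonzero if a futex(FUTEX_WAKE) call would be needed. */
static int futex_unlock(atomic_uint *lock)
{
    unsigned prev = atomic_exchange_explicit(lock, 0, memory_order_release);
    return (prev & WAITERS) != 0;
}
```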
Another difficulty is that most of the remaining spinlocks are just a single bit in a larger 8 byte atomic. Futexes still don't support anything but 4 bytes (we could probably get away with using it on a part of the 8 byte atomic with some reordering) and unfortunately postgres still supports platforms with no 8 byte atomics (which I think is supremely silly), and the support for a fallback implementation makes it harder to use futexes.
The spinlock triggering the contention in the report was just stupid and we only recently got around to removing it, because it isn't used during normal operation.
Edit: forgot to add that the spinlock contention is not measurable on much more extreme workloads when using huge pages. A 100GB buffer pool with 4KB pages doesn't make much sense.
Addendum big enough to warrant a separate post: The fact the contention is a spinlock, rather than a futex is unrelated to the "regression".
A quick hack shows the contended performance to be nearly indistinguishable with a futex based lock. Which makes sense: non-PI futexes don't transfer the scheduler slice to the lock owner, because they don't know who the lock owner is. Postgres' spinlocks use randomized exponential backoff, so they don't prevent the lock owner from getting scheduled.
Thus the contention is worse with PREEMPT_LAZY, even with non-PI futexes (which is what typical lock implementations are based on), because the lock holder gets scheduled out more often.
Probably worth repeating: This contention is due to an absurd configuration that should never be used in practice.
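For reference, the shape of such a randomized-exponential-backoff acquire looks roughly like this (a hand-wavy sketch with made-up constants, not postgres' actual s_lock.c):

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative spinlock acquire with randomized exponential backoff.
 * All constants are invented; postgres tunes its equivalents adaptively. */
static void backoff_spin_lock(atomic_uint *lock)
{
    unsigned delay_us = 1;
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
        /* spin on a relaxed read first, to avoid cache-line ping-pong
         * from repeated atomic RMWs */
        int spins = 0;
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0 &&
               ++spins < 1000)
            ;
        if (spins >= 1000) {
            /* randomized sleep: waiters don't wake in lockstep, and the
             * lock holder gets a chance to run and release the lock */
            usleep(rand() % delay_us + 1);
            if (delay_us < 1000)
                delay_us *= 2;      /* exponential growth, capped */
        }
    }
}
```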
> On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).
Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) a release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.
I think that, in common cache coherence protocols, this is kind of straightforward -- the unlock is a store-release, and as long as the cache line ends up being written locally, the hardware or ucode or whatever simply [1] needs to check whether a needs-notification flag is set in the same cacheline. Or the futex-wait operation needs to do a super-heavyweight barrier to synchronize with the releasing thread even though the releasing thread does not otherwise have any barrier that would do the job.
One nasty approach that might work is to use something like membarrier, but I'm guessing that membarrier is so outrageously expensive that this would be a huge performance loss.
But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain (non-LOCK) CMPXCHG to release the lock. The failure case would be that it reports success changing the value from 1 to 0 even though a waiter's LOCK CMPXCHG concurrently set bit 1, losing the wakeup. I don't know whether this can happen on Intel or AMD architectures.
I do expect that it would be nearly impossible to convince an x86 CPU vendor to commit to an answer either way.
(Do other architectures, e.g. the most recent ARM variants, have an RMW release operation that naturally does this? I've tried, and entirely failed AFAICT, to convince x86 HW designers to add lighter weight atomics.)
[0] Visible to the remote thread, but the kernel can easily mediate this, effectively for free.
[1] Famous last words. At least in ossified microarchitectures, nothing is simple.
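The protocol being proposed, spelled out in C11 (C can only express the LOCK-prefixed compare-exchange; the open question above is whether the unlock's RMW could drop the LOCK prefix, so treat this purely as a sketch of the word layout and the two sides):

```c
#include <stdatomic.h>

#define LOCKED 1u
#define WAITER 2u

/* Waiter side: having observed LOCKED, advertise ourselves before
 * sleeping; a real lock would then call futex_wait(word, LOCKED|WAITER). */
static void flag_waiter(atomic_uint *word)
{
    unsigned v = atomic_load_explicit(word, memory_order_relaxed);
    while ((v & LOCKED) &&
           !atomic_compare_exchange_weak_explicit(word, &v, v | WAITER,
                memory_order_relaxed, memory_order_relaxed))
        ;   /* v is reloaded by a failed CAS; retry while still locked */
}

/* Owner side.  Returns nonzero if futex_wake(word) would be needed.
 * The speculation above is about whether the fast-path CAS could be a
 * plain (non-LOCK) CMPXCHG without misreporting success. */
static int waiter_bit_unlock(atomic_uint *word)
{
    unsigned expected = LOCKED;
    if (atomic_compare_exchange_strong_explicit(word, &expected, 0,
            memory_order_release, memory_order_relaxed))
        return 0;                       /* fast path: nobody waiting */
    atomic_store_explicit(word, 0, memory_order_release);
    return 1;                           /* slow path: wake waiters */
}
```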
> > On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).
> Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) a release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.
Hah.
> ...
> But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain (non-LOCK) CMPXCHG to release the lock. The failure case would be that it reports success changing the value from 1 to 0 even though a waiter's LOCK CMPXCHG concurrently set bit 1, losing the wakeup. I don't know whether this can happen on Intel or AMD architectures.
I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release is just a store, whereas a futex release has to be an RMW operation.
I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line. Due to being a somewhat contended simple spinlock, often embedded on the same line as the to-be-protected data, it's common for the line not to be in modified ownership anymore at release.
> I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release is just a store, whereas a futex release has to be an RMW operation.
> I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line.
I don’t think so. The CPU is pretty good about hiding that kind of latency — reading a contended cache line and doing a correctly predicted branch shouldn’t stall anything after it.
That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty. And there are like ten open coded spin waits around the uses... you certainly have my empathy :)
This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?
> That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty.
Turns out to be pretty crucial for performance though... Not manipulating them with a single atomic leads to way way worse performance.
For quite a while it was a 32bit atomic, but I recently made it a 64bit one, to allow the content lock (i.e. protecting the buffer contents, rather than the buffer header) to be in the same atomic var. For one, that's nice for performance: it's e.g. very common to release a pin and a lock at the same time, and there are more fun perf things we can do in the future. But the real motivation was work on adding support for async writes - an exclusive locker might need to consume an IO completion for a write that's in flight and preventing it from acquiring the lock. And that was hard to do with a separate content lock and buffer state...
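To illustrate the kind of combined operation a single state word enables (a made-up bit layout, not the real BufferDesc encoding):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: low bits hold the pin refcount, high bits hold
 * flags and lock bits -- all in one atomically-updatable 64-bit word. */
#define BUF_REFCOUNT_ONE    1ull
#define BUF_REFCOUNT_MASK   0x3FFFFull           /* low 18 bits: pins */
#define BUF_HDR_LOCKED      (1ull << 40)         /* header spinlock bit */
#define BUF_CONTENT_LOCKED  (1ull << 41)         /* content lock bit */

/* Drop a pin and release the content lock in a single atomic RMW --
 * the common "release pin and lock at the same time" case.  The caller
 * must hold both a pin and the content lock.  Returns the new state. */
static uint64_t unpin_and_unlock(_Atomic uint64_t *state)
{
    return atomic_fetch_sub_explicit(state,
                                     BUF_REFCOUNT_ONE + BUF_CONTENT_LOCKED,
                                     memory_order_release)
           - (BUF_REFCOUNT_ONE + BUF_CONTENT_LOCKED);
}
```

With a separate lock word this would take two atomic operations (and two potentially contended cache lines) instead of one.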
> And there are like ten open coded spin waits around the uses... you certainly have my empathy :)
Well, nearly all of those are there to avoid needing to hold a spinlock, which, as lamented a lot around this issue, doesn't perform that well when really contended :)
We're on our way to barely ever needing the spinlock for the buffer header, which should then allow us to get rid of many of those loops.
> This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?
It'd be pretty nice to have. There are a lot of cases where one needs more lock state than one can really encode into 32 bits.
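In the meantime, the workaround alluded to earlier (futex-waiting on part of an 8 byte atomic) looks roughly like this hypothetical sketch; it assumes Linux and little-endian, with the wait-relevant bits arranged to live in the low 4 bytes:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Point the kernel's 4-byte futex machinery at the half of the 64-bit
 * state word that carries the bits waiters care about.  Assumes
 * little-endian: the low 32 bits are at the word's base address. */
static uint32_t *futex_half(_Atomic uint64_t *state)
{
    return (uint32_t *)state;
}

/* Wake up to n threads blocked in futex_wait() on the low half. */
static long wake_waiters(_Atomic uint64_t *state, int n)
{
    return syscall(SYS_futex, futex_half(state), FUTEX_WAKE, n,
                   (void *)0, (void *)0, 0);
}
```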
I'm quite keen to experiment with the rseq time slice extension stuff. Think it'll help with some important locks (which are not spinlocks...).
> Turns out to be pretty crucial for performance though...
I don't doubt it. I just meant nasty with respect to using futex() to sleep instead of spin, I was having some "fun" trying.
I can certainly see how pushing that state into one atomic would simplify things, I didn't really mean to question that.
> We're on our way to barely ever needing the spinlock for the buffer header, which should then allow us to get rid of many of those loops.
I'm cheering you on; I hadn't looked at this code before and it's been fun looking through some of the recent work on it.
> It'd be pretty nice to have. There are a lot of cases where one needs more lock state than one can really encode into 32 bits.
I've seen too much open coded spinning around 64-bit CAS in proprietary code, where it was a real demonstrable problem, and similar to here it was often not straightforward to avoid. I confess to some bias because of this experience ("not all spinlocks...") :)
I remember a lot of cases where FUTEX_WAIT64/FUTEX_WAKE64 would have been a drop-in solution, that seems compelling to me.
Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.
With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected.
So why does it happen only with hugepages? Is the extra overhead / TLB pressure enough to trigger the issue in some way? Or is it because the regular pages get swapped out (which hugepages can't be)?
I don't fully know, but I suspect it's just that due to the minor faults and TLB misses there is terrible contention on the spinlock, regardless of PREEMPT_LAZY, when using 4k pages (that's easily reproducible). Which is then made worse by preempting more with the lock held.
Cross building is possible, but it's rather useful to be able to test the software you just built... And often enough, tests take more resources than the build.
It's very heavily dependent on what your processes are doing. I've seen extreme cases where the gains of pinning were large (well over 2x when cooperative tasks were pinned to the same core), but that's primarily about preventing the CPU from idling long enough to enter deeper idle states.
FWIW, the article says "Frontends are somewhat insulated from this because they can use the largely stable C API." but that's not been my/our experience. There are parts of the API that are somewhat stable, but other parts (e.g. Orc) that change wildly.
I know, but even if it's not breaking promises, the constant stream of changes still makes it rather painful to utilize LLVM. Not helped by the fact that unless you embed LLVM, you have to deal with a lot of different LLVM versions out there...
FWIW eventual stability is a goal, but there's going to be more churn as we work towards full arbitrary program execution (https://www.youtube.com/watch?v=qgtA-bWC_vM covers some recent progress).
If you're looking for stability in practice: the ORC LLJIT API is your best bet at the moment (or sticking to MCJIT until it's removed).
The issue is more fundamental - if you have purely random keys, there's basically no spatial locality for the index data. Which means that for decent performance your entire index needs to be in memory, rather than just recent data. And it means that you have much bigger write amplification, since it's rare that the same index page is modified multiple times close-enough in time to avoid a second write.
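A toy model of the effect (invented numbers, and a real B-tree has inner pages too): count how many distinct leaf pages the last batch of inserts touches. With sequential keys that working set is tiny; with random keys it approaches the whole index, which is what drives both the memory requirement and the write amplification.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model: NKEYS keys map onto NKEYS/FANOUT leaf pages.  Count how
 * many distinct pages the last WINDOW inserts touch -- a rough proxy
 * for how much of the index must stay hot in memory, and for how often
 * a page can absorb several modifications before being written out. */
#define NKEYS  100000
#define FANOUT 100
#define WINDOW 1000
#define NPAGES (NKEYS / FANOUT)

static int pages_touched(int sequential)
{
    static char seen[NPAGES];
    memset(seen, 0, sizeof seen);
    int distinct = 0;
    for (int i = NKEYS - WINDOW; i < NKEYS; i++) {
        int key = sequential ? i : rand() % NKEYS;  /* random ~ UUIDv4 keys */
        int page = key / FANOUT;
        if (!seen[page]) {
            seen[page] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Sequential inserts touch WINDOW/FANOUT pages; random inserts touch most of the NPAGES pages in the same window.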
> Every time Postgres advice says to “schedule [important maintenance] during low traffic period” (OP) or “outside business hours”, it reinforces my sense that it’s not suitable for performance-sensitive data path on a 24/7/365 service and I’m not sure it really aims to be.
It's a question of resource margins. If you have regular and predictable windows of low resource utilization, you can afford to run closer to the sun during busy periods, deferring (and amortizing, to some degree) maintenance costs till later. If you have a 24/7/365 service, you need considerably higher safety margins.
Also, there's a lot of terrible advice on the internet, if you haven't noticed.
> (To be fair, running it like that for several years and desperately trying to make it work also gave me that feeling. But I’m kind of aghast that necessary operational maintenance still carries these caveats.)
To be fair, I find Oxide's continual low-info griping against postgres a bit tedious. There are plenty of weaknesses in postgres, but criticizing postgres based on 10+ year old experiences of running an, at the time, outdated postgres, on an outdated OS is just ... not useful? Like, would it be useful to criticize Oxide's lack of production hardware availability in 2021 or so?
What you describe is true and very important (more margin lets you weather more disruption), but it's not the whole story. The problem we had was queueing delays mainly due to I/O contention. The disks had the extra IOPS for the maintenance operation, but the resulting latency for all operations was higher. This meant overall throughput decreased when the maintenance was going on. The customer, finally accepting the problem, thought: "we'll just build enough extra shards to account for the degradation". But it just doesn't work like that. If the degradation is 30%, and you reduce the steady-state load on the database by 30%, that doesn't change the fact that when the maintenance is ongoing, even if the disks have the IOPS for the extra load, latency goes up. Throughput will still degrade. What they wanted was predictability but we just couldn't give that to them.
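A toy M/M/1 queueing calculation makes the "just add shards" fallacy visible (all numbers invented for illustration; real disk queues are messier):

```c
/* Mean time in an M/M/1 system: W = 1 / (mu - lam), where mu is the
 * service rate and lam the arrival rate (both in ops/sec).  Latency
 * blows up nonlinearly as lam approaches mu. */
static double mm1_wait(double mu, double lam)
{
    return 1.0 / (mu - lam);
}
```

With mu = 1000 ops/s: a shard running at lam = 700 sees W ~ 3.3 ms. Shed 30% of load (lam = 490) and add 200 ops/s of maintenance I/O: during maintenance W ~ 3.2 ms versus the shard's own quiet-time 2.0 ms. The disks "have the IOPS", yet latency during maintenance still jumps by roughly 65% over steady state, so throughput at a fixed latency target still degrades.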
> To be fair, I find Oxide's continual low-info griping against postgres a bit tedious. There are plenty of weaknesses in postgres, but criticizing postgres based on 10+ year old experiences of running an, at the time, outdated postgres, on an outdated OS is just ... not useful?
First, although I work at Oxide, please don't think I speak for Oxide. None of this happened at Oxide. It informed some of the choices we made at Oxide and we've talked about that publicly. I try to remember to include the caveat that this information is very dated (and I made that edit immediately after my initial comment above).
I admit that some of this has been hard for me personally to let go. These issues dominated my professional life for three very stressful years. For most of that time (and several years earlier), the community members we reached out to were very dismissive, saying either these weren't problems, or they were known problems and we were wrong for not avoiding them, etc. And we certainly did make mistakes! But many of those problems were later acknowledged by the community. And many have been improved -- which is great! What remains is me feeling triggered when it feels like users' pain is being casually dismissed.
I'm sorry I let my crankiness slip into the comment above. I try to leave out the emotional baggage. Nonetheless, I do feel like it's a problem that, intentionally or otherwise, a lot of the user base has absorbed the idea that it's okay for necessary database maintenance to significantly degrade performance because folks will have some downtime in which to run it.*
> First, although I work at Oxide, please don't think I speak for Oxide. None of this happened at Oxide. It informed some of the choices we made at Oxide and we've talked about that publicly. I try to remember to include the caveat that this information is very dated (and I made that edit immediately after my initial comment above).
I said Oxide, because it's come up so frequently and at such length on the Oxide podcast... Without that I probably wouldn't have commented here. It's one thing to comment on bad experiences, but at this point it feels more like bashing. And I feel like an open source focused company should treat other folks working on open source with a bit more, idk, respect (not quite the right word, but I can't come up with a better one right now).
I probably shouldn't have commented on this here. But I read the message after just having spent a Sunday morning looking into a problem, and I guess that made me more thin-skinned than usual.
> For most of that time (and several years earlier), the community members we reached out to were very dismissive, saying either these weren't problems, or they were known problems and we were wrong for not avoiding them, etc.
I agree that the wider community sometimes has/had the issue of excusing away postgres problems. While I try to avoid doing that, I certainly have fallen prey to that myself.
Leaving fandom-like stuff aside, there's an aspect of having been told over and over that we're doing xyz wrong and things would never work that way, and succeeding (to some degree) regardless. While ignoring some common wisdom has been advantageous, I think there's also plenty where we've just been high on our own supply.
> What remains is me feeling triggered when it feels like users' pain is being casually dismissed.
I don't agree that we have been "bashing" Postgres. As far as I can tell, Postgres has come up a very small number of times over the years: certainly on the CockroachDB episode[0] (where our experience with Postgres is germane, as it was very much guiding our process for finding a database for Oxide) and then again this year when we talked about our use of statemaps on a Rust async issue[1] (where our experience with Postgres was again relevant because it in part motivated the work that we had used to develop the tooling that we again used on the Rust issue).
I (we?) think Postgres is incredibly important, and I think we have properly contextualized our use of it. Moreover, I think it is unfair to simply deny us our significant experience with Postgres because it was not unequivocally positive -- or to dismiss us recounting some really difficult times with the system as "bashing" it. Part of being a consequential system is that people will have experience with it; if one views recounting that experience as showing insufficient "respect" to its developers, it will have the effect of discouraging transparency rather than learning from it.
I'm certainly very biased (having worked on postgres for way too long), so it's entirely plausible that I've over-observed and over-analyzed the criticism, leading to my description.
> I (we?) think Postgres is incredibly important, and I think we have properly contextualized our use of it. Moreover, I think it is unfair to simply deny us our significant experience with Postgres because it was not unequivocally positive -- or to dismiss us recounting some really difficult times with the system as "bashing" it. Part of being a consequential system is that people will have experience with it; if one views recounting that experience as showing insufficient "respect" to its developers, it will have the effect of discouraging transparency rather than learning from it.
I agree that criticism is important and worthwhile! It's helpful though if it's at least somewhat actionable. We can't travel back in time to fix the problems you had in the early 2010s... My experience of the criticism of the last years from the "oxide corner" was that it sometimes felt somewhat unrelated to the context and to today's postgres.
> if one views recounting that experience as showing insufficient "respect" to its developers
I should really have come up with a better word, but I'm still blanking on choosing a really apt word, even though I know it exists. I could try to blame ESL for it, but I can't come up with a good German word for it either... Maybe "goodwill". Basically believing that the other party is trying to do the right thing.
>> What remains is me feeling triggered when it feels like users' pain is being casually dismissed.
> Was that done in this thread?
Well, I raised a general problem around 24/7/365 use cases (rooted in my operational experience, reinforced by the more-current words that I was replying to and the OP) and you called it "tedious", "low-info griping". Yes, that seems pretty dismissive.
(Is it fair? Though I thought the podcast episodes were fairly specific, they probably glossed over details. They weren't intended to be about those issues per se. I did write a pretty detailed post though:
https://www.davepacheco.net/blog/2024/challenges-deploying-p...
(Note the prominent caveat at the top about the experience being dated.))
You also wrote:
> running an, at the time, outdated postgres, on an outdated OS
Yes, pointing to the fact that the software is old and the OS is unusual (it was never outdated; it was just not Linux) are common ways to quickly dismiss users' problems. If the problems had been fixed in newer versions, that'd be one thing. Many (if not all) of them hadn't been. But also: the reason we were running an old version was precisely that it was a 24/7/365 service and there was no way to update databases without downtime, especially replicated ones, nor a great way to mitigate risk (e.g., a mode for running the new software without updating the on-disk format so that you can go back if it's a disaster). This should be seen as a signal of the problem, not a reason to dismiss it (as I feel like you're doing here). As for the OS, I can only think of one major issue we hit that was OS-specific. (We did make a major misconfiguration related to the filesystem that certainly made many of our issues much worse.)
I get that it sucks to keep hearing about problems from years ago. All of this was on 9.2 - 9.6 -- certainly ancient today. When this comes up, I try to balance sharing my operational experience with the fact that it's dated by just explaining that it's dated. After all, all experience is dated. Readers can ignore it if they want, do some research, or folks in the PostgreSQL world can update me when specific things are no longer a problem. That's how I learned that the single-threaded WAL receiver had been updated, apparently in part because of our work: https://x.com/MengTangmu/status/1828665449850294518 (full thread: https://x.com/MengTangmu/status/1828665439234474350). I'll happily share these updates wherever I would otherwise share my gripes!