More

menaerus · 2026-04-10T08:05:19 1775808319

You're condescending for no valid reason and I will tell you that what you say is not correct. Models superseded "plumbing" tasks and went well into the engineering grounds a generation or two ago already. Evidence is plenty. We see models perfectly capable reasoning about the kernel code yet you're convinced that game engines are somewhat more special. Why? There're plenty of such examples where AI is successfully applied to hard engineering tasks (database kernels), and where it became obvious that the models are almost perfectly capable reasoning about that tbh quite difficult code. I think you should reevaluate your stance and become more humble.

jenniferhooley · 2026-04-10T14:19:57 1775830797

Link me the research on the hard engineering tasks they've done on database kernels, I'd love to see it, sounds interesting.

As long as people comment, "Only bad/stupid engineers hand-write code because LLMs are better in every way," and that's objectively not true in various engineering circles, I'll keep trolling them and being just as hyperbolic in the inverse because it amuses me. Don't take things too seriously on the internet; you'll have a bad time ;)

menaerus · 2026-04-07T06:50:03 1775544603

I am an experienced C++ developer, I know what happens in this particular case, but this type of minutiae are only interesting to the developers who have never had an actually hard problem to solve so it's a red flag to me as well. 10 years ago I would have thought differently but today I do not. High performance teams do not care about this stuff.

menaerus · 2026-04-06T09:25:59 1775467559

Contention doesn't exist in older kernel versions even with huge-pages disabled, no?

anarazel · 2026-04-06T12:22:12 1775478132

The contention does exist in older kernels and is quite substantial.

menaerus · 2026-04-06T14:35:40 1775486140

You said

> Maybe we should, but requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great.

... so that leaves me confused. My understanding is that the regression is triggered with the 7.0+ kernel and can be mitigated with huge pages turned on.

My question therefore was how come this regression hasn't been visible with huge pages turned off with older kernel versions? You say that it was but I can't find this data point.

anarazel · 2026-04-06T14:59:50 1775487590

> ... so that leaves me confused. My understanding is that the regression is triggered with the 7.0+ kernel and can be mitigated with huge pages turned on.

It gets a bit worse with preempt_lazy - for me just 15% percent or so - because the lock holder is scheduled out a bit more often. But it was bad before.

> My question therefore was how come this regression hasn't been visible with huge pages turned off with older kernel versions? You say that it was but I can't find this data point.

I mean it wasn't a regression before, because this is how it has behaved for a long time.

This workload is not a realistic thing that anybody would encounter in this form in the real world. Even without the contention - which only happens the first time the buffer pool is filled - you lose so much by not using huge pages with a 100gb buffer pool that you will have many other issues.

We (postgres and me personally) were concerned enough about potential contention in this path that we did get rid of that lock half a year ago (buffer replacement selection has been lock free for close to a decade, just unused buffers were found via a list protected by this lock).

But the performance gains we saw were relatively small, we didn't measure large buffer pools without huge pages though.

And at least I didn't test with this many connections doing small random reads into a cold buffer pool, just because it doesn't seem that interesting.

menaerus · 2026-04-06T09:23:05 1775467385

You're assuming that they ran the workload with huge-pages disabled unintentionally.

db48x · 2026-04-06T12:06:22 1775477182

No… I’m assuming that they didn’t use the same automation that creates RDS clusters for actual customers. No doubt that automation configures the EC2 nodes sanely, with hugepages turned on. Leaving them turned off in this benchmark could have been accidental, but some accident of that kind was bound to happen as soon as the tests use any kind of setup that is different from what customers actually get.

menaerus · 2026-04-06T14:41:21 1775486481

You're again assuming that having huge pages turned on always brings the net benefit, which it doesn't. I have at least one example where it didn't bring any observable benefit while at the same time it incurred extra code complexity, server administration overhead, and necessitated extra documentation.

scottlamb · 2026-04-07T15:28:43 1775575723

FYI: huge pages isn't just a system-wide toggle, but a variety of things you can do:

* explicit huge pages

* transparent huge pages system-wide default

* app-specific or even mapping-specific toggles

* various memory allocator settings to raise its effectiveness

It would be really surprising to me to see a workload for which it's optimal to not use huge pages anywhere on the system.

menaerus · 2026-04-08T05:59:20 1775627960

It is a system-wide toggle in a sense that it requires you to first enable huge-pages, and then set them up, even if you just want to use explicit huge pages from within your code only (madvise, mmap). I wasn't talking about the THP.

When you deploy software all around the globe and not only on your servers that you fully control this becomes problematic. Even in the latter case it is frowned upon by admins/teams if you can't prove the benefit.

Yes, there are workloads where huge-pages do not bring any measurable benefit, I don't understand why would that be questionable? Even if they don't bring the runtime performance down, which they could, extra work and complexity they incur is in a sense not optimal when compared to the baseline of not using huge-pages.

menaerus · 2026-04-06T09:12:01 1775466721

Huge pages and THP are not the same thing.

menaerus · 2026-04-03T08:32:45 1775205165

> I know git-bisect doesn’t support this

It does.

menaerus · 2026-04-01T08:55:57 1775033757

> they unfortunately don't care about code quality.

> It's a shame, because it's still the best coding agent, in my experience.

If it is the best, and if it delivers the value users are asking for, then why would they have an incentive to make further $$$ investments to make it of a "higher" quality if the value this difference could make is not substantial or hurts the ROI?

On many projects I found this "higher quality" not only to be false of delivering more substantial value but actually I found it was hurting the project to deliver the value that matters.

Maybe we are after all entering the era of SWE where all this bike-shedding is gone and only type of engineers who will be able to survive in it will be the ones who are capable of delivering the actual value (IME very few per project).

troupo · 2026-04-01T10:31:43 1775039503

Is this why they ran into a bug with people hitting usage limits even on very short sessions and had to cease all communications for over a day after a week of gaslighting users because they couldn't find the root cause in the "quality doesn't matter" code base?

Or that's why tgey had to buy bun with actual engineers to work on Claude Code to reduce memory peaks from 68 GB (yes, 68 gigabytes) to a "measely" 1.7? Because code quality doesn't matter?

Or that a year later they still cannot figure out how to render anything in the terminal without flickering?

The only reason people use Claude Code is because it's the only way to use Anthropic's heavily subsidized subscription. You get banned if you use it through other, better, tools.

menaerus · 2026-04-01T13:35:45 1775050545

Sure, now the only thing remaining is you convincing Anthropic that they're doing wrong. Or alternatively you change your perspective.

troupo · 2026-04-01T14:00:05 1775052005

"Windows is the world's most popular desktop consumer OS. Microsoft are doing everything right, and should never ever change. Who are we to criticise them"

Meanwhile I apparently need to change my persoective about this: https://news.ycombinator.com/item?id=47598488

menaerus · 2026-04-03T08:13:58 1775204038

My response is a bit more nuanced than your interpretation of "no, we should not care about the quality".

menaerus · 2026-03-30T17:56:09 1774893369

> Rust is quite capable of expression templates, as its iterator adapters prove.

AFAIU iterator adapters are not quite what expression templates are because they rely on the compiler optimizations rather than the built-in feature of the language, which enable you to do this without relying on the compiler pipeline.

aw1621107 · 2026-03-30T18:40:40 1774896040

I had always thought expression templates at the very least needed the optimizer to inline/flatten the tree of function calls that are built up. For instance, for something like x + y * z I'd expect an expression template type like sum<vector, product<vector, vector>> where sum would effectively have:

    vector l;
    product& r;
    auto operator[](size_t i) {
        return l[i] + r[i];
    }

And then product<vector, vector> would effectively have:

    vector l;
    vector r;
    auto operator[](size_t i) {
        return l[i] * r[i];
    }

That would require the optimizer to inline the latter into the former to end up with a single expression, though. Is there a different way to express this that doesn't rely on the optimizer for inlining?

menaerus · 2026-03-30T19:31:58 1774899118

Expression templates do not rely on optimizer since you're not dealing with the computations directly but rather expressions (nodes) through which you are deferring the computation part until the very last moment (when you have a fully built an expression of expressions, basically almost an AST). This guarantees that you get zero cost when you really need it. What you're describing is something keen of copy elision and function folding though inlining which is pretty much basics in any c++ compiler and happens automatically without special care.

aw1621107 · 2026-03-30T19:55:26 1774900526

> since you're not dealing with the computations directly but rather expressions (nodes) through which you are deferring the computation part until the very last moment (when you have a fully built an expression of expressions, basically almost an AST).

Right, I understand that. What is not exactly clear to me is how you get from the tree of deferred expressions to the "flat" optimized expression without involving the optimizer.

Take something like the above example for instance - w = x + y * z for vectors w/x/y/z. How do you get from that to effectively

    for (size_t i = 0; i < w.size(); ++i) {
        w[i] = x[i] + y[i] * z[i];
    }

without involving the optimizer at all?

menaerus · 2026-03-31T07:59:41 1774943981

The example is false because that's not how you would write an expression template for given computation so the question being how is it that the optimizer is not involved is also not quite set in the correct context so I can't give you an answer for that. Of course that the optimizer is generally going to be involved, as it is for all the code and not the expression templates, but expression templates do not require the optimizer in the way you're trying to suggest. Expression templates do not rely on O1, O2 or O3 levels being set - they work the same way in O0 too and that may be the hint you were looking for.

aw1621107 · 2026-03-31T11:53:12 1774957992

> The example is false because that's not how you would write an expression template for given computation

OK, so how would you write an expression template for the given computation, then?

> Expression templates do not rely on O1, O2 or O3 levels being set - they work the same way in O0 too and that may be the hint you were looking for.

This claim confuses me given how expression templates seem to work in practice?

For example, consider Todd Veldhuizen's 1994 paper introducing expression templates [0]. If you take the examples linked at the top of the page and plug them into Godbolt (with slight modifications to isolate the actual work of interest) you can see that with -O0 you get calls to overloaded operators instead of the nice flattened/unrolled/optimized operations you get with -O1.

You see something similar with Eigen [2] - you get function calls to "raw" expression template internals with -O0, and you need to enable the optimizer to get unrolled/flattened/etc. operations.

Similar thing yet again with Blaze [3].

At least to me, it looks like expression templates produce quite different outputs when the optimizer is enabled vs. disabled, and the -O0 outputs very much don't resemble the manually-unrolled/flattened-like output one might expect (and arguably gets with optimizations enabled). Did all of these get expression templates wrong as well?

[0]: https://web.archive.org/web/20050210090012/http://osl.iu.edu...

[1]: https://cpp.godbolt.org/z/Pdcqdrobo

[2]: https://cpp.godbolt.org/z/3x69scorG

[3]: https://cpp.godbolt.org/z/7vh7KMsnv

menaerus · 2026-03-31T14:17:21 1774966641

Look, I have just completed work on some high performance serialization library which avoids computing heavy expressions and temporary allocations all by using expression templates and no, optimization levels are not needed. The code works as advertised at O0 - that's the whole deal around it. If you have a genuine question you should ask one but please do not disguise so that it only goes to prove your point. I am not that naive. All I can say is that your understanding of expression templates is not complete and therefore you draw incorrect conclusions. Silly example you provided shows that you don't understand how expression template code looks like and yet you're trying to prove your point all over and over again. Also, most of the time I am writing my comments on my mobile so I understand that my responses sometime appear too blunt but in any case I will obviously not going to write, run or check the code as if I had been on my work. My comments here is not work, and I am not here to win arguments, but most of the time learn from other people's experiences, and sometimes dispute conclusions based on those experiences too. If you don't believe me, or you believe expression templates work differently, then so be it.

aw1621107 · 2026-03-31T15:53:39 1774972419

> If you have a genuine question you should ask one but please do not disguise so that it only goes to prove your point.

I think my question is pretty simple: "How does an optimizer-independent expression template implementation work?" Evidently the resources I've found so far describe "optimizer-dependent expression templates", and apparently none of the "expression template" implementations I've had reason to look at disabused me of that notion.

> My comments here is not work, and I am not here to win arguments, but most of the time learn from other people's experiences, and sometimes dispute conclusions based on those experiences too.

Sure, and I like to learn as well from the more knowledgeable/experienced folk here, but as much as I want to do so here I'm finding it difficult since there's precious little for me to go off of beyond basically just being told I'm wrong.

> If you don't believe me, or you believe expression templates work differently, then so be it.

I want to understand how you understand expression templates, but between the above and not being able to find useful examples of your description of expression templates I'm at a bit of a loss.

gpderetta · 2026-04-01T09:10:27 1775034627

Expression templates do AST manipulation of expressions at compile time. Let's say you have a complex matrix expression that naively maps to multiple BLAS operations but can be reduced to a single BLAS call. With expression templates you can translate one to the other, this is a static manipulation that does not depend on compiler level. What does depend on the compiler is whether the incidental trivial function calls to operators gets optimized away or not. But, especially with large matrices, the BLAS call will dominate anyway, so the optimization level shouldn't matter.

Of course in many cases the optimization level does matter: if you are optimizing small vector operators to simd inlining will still be important.

aw1621107 · 2026-04-01T15:00:30 1775055630

> With expression templates you can translate one to the other, this is a static manipulation that does not depend on compiler level.

How does that work on an implementation level? First thing that comes to mind is specialization, but I wouldn't be surprised if it were something else.

> What does depend on the compiler is whether the incidental trivial function calls to operators gets optimized away or not.

> Of course in many cases the optimization level does matter: if you are optimizing small vector operators to simd inlining will still be important.

Perhaps this is the source of my confusion; my uses of expression templates so far have generally been "simpler" ones which rely on the optimizer to unravel things. I haven't been exposed much to the kind of matrix/BLAS-related scenarios you describe.

gpderetta · 2026-04-01T15:41:03 1775058063

Partial specialization specifically. Match some patterns and covert it to something else. For example:

  struct F { double x; };
  enum Op { Add, Mul };
  auto eval(F x) { return x.x; }
  template<class L, class R, Op op> struct Expr;
  template<class L, class R> struct Expr<L,R,Add>{  L l; R r; 
    friend auto eval(Expr self) { return eval(self.l) + eval(self.r); } };
  template<class L, class R> struct Expr<L,R,Mul>{  L l; R r; 
    friend auto eval(Expr self) { return eval(self.l) * eval(self.r); } };
  template<class L, class R, class R2> struct Expr<Expr<L, R, Mul>, R2, Add>{   Expr<L,R, Mul> l; R2 r; 
    friend auto eval(Expr self) { return fma(eval(self.l.l), eval(self.l.r), eval(self.r));}};
  template<class L, class R>
  auto operator +(L l, R r) { return Expr<L, R, Add>{l, r}; } 
  template<class L, class R>
  auto operator *(L l, R r) { return Expr<L, R, Mul>{l, r}; } 

  double optimized(F x, F y, F z) { return eval(x * y + z); }
  double non_optimized(F x, F y, F z) { return eval(x + y * z); }

Optimized always generates a call to fma, non-optimized does not. Use -O1 to see the difference (will inline trivial functions, but will not do other optimizations). -O0 also generates the fma, but it is lost in the noise.

The magic happens by specifically matching the pattern Expr<Expr<L, R, Mul>, R2, Add>; try to add a rule to optimize x+y*z as well.

aw1621107 · 2026-04-01T23:23:39 1775085819

Hrm, OK, that makes sense. Thanks for taking the time to explain! Guessing optimizing x+y*z would entail something similar to the third eval() definition but with Expr<L, Expr<L2, R2, Mul>, Add> instead.

I think at this point I can see how my initial assertion was wrong - specialization isn't fully orthogonal to expression templates, as the former is needed for some of the latter's use cases.

Does make me wonder how far one could get with rustc's internal specialization attributes...

menaerus · 2026-03-30T17:46:49 1774892809

No, they are not.

aw1621107 · 2026-03-30T18:49:52 1774896592

They are both; there are things that Rust's macros can do metaprogramming-wise that C++ templates cannot do and vice-versa.

Rust's macros work on a syntactic level, so they are more powerful in that they can work with "normally" invalid code and perform token-to-token transformations (and in the case of proc macros effectively function as compiler extensions/plugins) and less powerful in that they don't have access to semantic information.

aldanor · 2026-03-30T22:17:35 1774909055

Incorrect.

menaerus · 2026-03-30T17:33:18 1774891998

Totally. It's funny how many people do not actually realize this and get stuck in cargo-cult mindset forever, no matter the years of experience.