> There is a huge mismatch between the assumptions of the C spec and actual machine code.
Right, which is why the kind of UB pedantry in the linked article is hurting and not helping. Cranky old man perspective here:
Folks: the fact that compilers will routinely exploit edge cases in undefined behavior in the language specification to miscompile obvious idiomatic code is a terrible bug in the compilers. Period. And we should address that by fixing the compilers, potentially by amending the spec if feasible.
But instead the community all wants to look smart by showing how much they understand about "UB" with blog posts and (worse) drive-by submissions to open source projects (with passive-aggressive sneers about code quality), so nothing gets better.
Seriously: don't tell people to shift and mask. Don't pontificate over compiler flags. Stop the masturbatory use of ubsan (though the tool itself is great). And start submitting bugs against the toolchain to get this fixed.
I agree, but the language of the standard very unambiguously lets them do it. Quoth X3.159-1988:
* Undefined behavior --- behavior, upon use of a nonportable or
erroneous program construct, of erroneous data, or of
indeterminately-valued objects, for which the Standard imposes no
requirements. Permissible undefined behavior ranges from ignoring the
situation completely with unpredictable results, to behaving during
translation or program execution in a documented manner characteristic
of the environment (with or without the issuance of a diagnostic
message), to terminating a translation or execution (with the issuance
of a diagnostic message).
In the past, compilers "behaved during translation or program execution in a documented manner characteristic of the environment", and now they've decided to "ignore the situation completely with unpredictable results". So yes, what gcc and clang are doing is hostile and dangerous, but it's legal. https://justine.lol/undefined.png So let's fix our code. The blog post is intended to help people do that.
No; I say we force the compiler writers to fix their idiotic assumptions instead of bending over backwards to please what's essentially a tiny minority. There's a lot more programmers who are not compiler writers.
The standard is really a minimum bar to meet, and what's not defined by it is left to the discretion of the implementers, who should be doing their best to follow the "spirit of C", which ultimately means behaving sanely. "But the standard allows it" should never be a valid argument --- the standard allows a lot of other things, not all of which make sense.
force the compiler writers to fix their idiotic assumptions instead of bending over backwards to please what's essentially a tiny minority
As far as I understand it, they do neither. Transforming an AST to any level of target code is not done by handcrafted recipes; instead it is fed into efficient abstract solvers which have these assumptions as an operational detail. E.g.:
p = &x;
if (p != &x) foo(); // optimized out
is not much different from
if (p == NULL) foo(); // optimized out
printf("%c", *p);
No assumption here is idiotic, because no single human was involved; it's just a class of constraints, which you'd have to scratch your head extensively to separate out properly (imagine telling a logic system that p is both 0 and not-0 when the 0-test is "explicit" and asking it to operate normally). Compiler writers do not format disks just to punish your UBs. Of course you can write a boring compiler that emits opcodes at face expr value, without most UBs being a problem. Plenty of those exist; why not just take one?
In your example, why should it optimise out the second case? Maybe foo() changed p so it's no longer null.
Compiler writers do not format disks just to punish your UBs.
IMHO if the compiler exploiting UB is leading to counterintuitive behaviour that's making it harder to use the language, the compiler is the one that needs fixing, regardless of whether the standard allows it. "But we wrote the compiler so it can't be fixed" just feels like a "but the AI did it, not me" excuse.
The address of p could have been taken somewhere earlier and stored in a global that foo accesses, or a similar path to that; and of course, p could itself be a global. Indeed, if the purpose of foo is to make p non-null and point to valid memory, then by optimising away that code you have broken a valid program.
If the compiler doesn't know if foo may modify p, then it can't remove the call. Even if it can prove that foo does not modify p, it still can't remove the call: foo may still have some other side-effects that matter (like not returning --- either longjmp()'ing elsewhere or perhaps printing an error message about p being null and exiting?), so it won't even get to the null dereference.
As a programmer, if I write code like that, I either intend for foo to be doing something to p to make it non-null, or if it doesn't for whatever reason, then it will actually dereference the null and whatever happens when that's attempted on the particular platform, happens. One of the fundamental principles of C is "trust the programmer". In other words, by trying to be "helpful" and second-guessing the intent of the code while making assumptions about UB, the compiler has completely broken the expectations of the programmer. This is why assumptions based on UB are stupid.
The standard allows this, but the whole intent of UB is not so compiler-writers can play language-lawyer and abuse programmers; things it leaves undefined are usually because existing and possible future implementations vary so widely that they didn't even try to consider or enumerate the possibilities (unlike with "implementation-defined").
But in fact compilers do regularly prove such things as, "this function call did not touch that local variable". Escape analysis is a term related to this.
I'm more of two minds about that other step, where the compiler goes like, "here in the printf call the p will be dereferenced, so it surely is non-null, so we silently optimize that other thing out where we consider the possibility of it being null".
Also @joshuamorton, couldn't the compiler at least print a warning that it removed code based on an assumption that was inferred by the compiler? I really don't know a lot about those abstract logic solver approaches, but it feels like it should be easy to do.
warning that it removed code based on an assumption that was inferred by the compiler
That would dump a ton of warnings from various macro/meta routines, which real-world C is usually peppered with. Not that it’s particularly hard to do (at the very least compilers know which lines are missing from debug info alone).
Yes, the assumption that p is non-null is idiotic. Also, the implicit assumption that foo will always return.
> no single human was involved
Humans implemented the compilers that use the spec adversarially, and humans lobby the standards committee not to fix the bugs.
> Of course you can write a boring compiler that emits opcodes at face expr value, without most UBs being a problem. Plenty of these, why not just take one
The majority of optimizations are harmless and useful, only a handful are idiotic and harmful. I want a compiler that has the good optimizations and not the bad ones.
For essentially every form of UB that compilers actually take advantage of, there's a real program optimization benefit. Are there any particular UB cases where you think the benefit isn't worth it, or it should be implementation-specific behavior instead of undefined behavior?
Most performance wins from UB come from removing code that someone wrote intentionally. If that code wasn't meant to be run, it shouldn't be written. If it was written, it should be run.
Now obviously there are lots of counter-examples for that. You can probably list ten in a minute. But it should be the guiding philosophy of compiler optimizations. If the programmer wrote some code, it shouldn't just be removed. If the program would be faster without that code, the programmer should be the one responsible for deciding whether the code gets removed or not.
MSVC and ICC have traditionally been far less keen on exploiting UB, yet are extremely competitive on performance (ICC in particular). That alone is enough evidence to convince me that UB is not the performance-panacea that the gcc/clang crowd think it is, and from my experience with writing Asm, good instruction selection and scheduling is far more important than trying to pull tricks with UB.
Get the teamsters and workers world party to occupy clang. You should fork C to restore the spirit of C and call it Spiritual C since we need a new successor to Holy C.
I read this, and go "yes, yes, yes", and then "NO!".
Shifts and ors really are the sanest and simplest way to express "assembling an integer from bytes". Masking is _a_ way to deal with the current C spec, which has silly promotion rules. Unsigned everything is more fundamental than signed.