Neat! My main side project starts from exactly the same base idea: that we need a Synthesis v2.
The big question is "what have we learned in the past thirty years of OS design that's applicable?" The big one is that capabilities are amazing in this context; if you squint, Synthesis quajects and capability objects line up pretty well semantically. I'd argue that xok would be a lot cleaner if it had a central VM concept instead of the three or so separate kernel VMs.
> The idea is that at file open time, the OS has a lot of information that it can use to generate highly specialized code, eliding code paths that are provably not going to execute.
So when you see the bytes surrounding the instruction pointer in your kernel OOPS message, good luck looking for them in the vmlinux disassembly.
One problem here is that in-kernel self-modifying code is very expensive on current CPUs due to the synchronization and possible flushing involved. There's flushing because the various levels of instruction caching are, in general, not coherent in hardware, and there's flushing because you can't turn writable memory into read-only executable memory without flushing the writable TLB entries.
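To make the cost concrete, here is a minimal userspace sketch of the write / icache-sync / W^X sequence a code generator has to go through (the template fragment is assumed to be handed in by the caller); in a kernel, the same steps turn into cross-CPU instruction-cache synchronization and TLB shootdowns, which is where the expense lives:

    /* Minimal sketch of the W^X dance a runtime code generator has to do.
     * The template fragment is an assumed input, not any real kernel API. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    void *emit_specialized(const uint8_t *template_code, size_t len)
    {
        /* 1. Get a writable page and copy the template into it. */
        uint8_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        memcpy(buf, template_code, len);

        /* 2. Make sure the instruction cache sees the new bytes
         *    (a no-op on x86, required on ARM and most others). */
        __builtin___clear_cache((char *)buf, (char *)buf + len);

        /* 3. Flip the page to read-only + executable; dropping the old
         *    writable mapping is what forces the TLB invalidation. */
        if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
            munmap(buf, len);
            return NULL;
        }
        return buf;
    }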
Honestly, seems like a big waste of time to me. Generating the code wouldn't be free, so you pay for that. IO, generally, is fairly expensive, so you are now weighing the costs of IO against the savings in instructions/reduced branches. So where are the savings? OS level mutexes?
Basically all modern CPUs are going to do a really good job at optimizing away most of the checks cooked into the code. Modern CPUs are really good at seeing "99% of the time, you take this branch" and "these 8 instructions can be run at the same time".
What they can't optimize away is accessing data off CPU. The speed of light is a killer here. In the best case, you are talking about ~1000 cycles if you hit main memory and way more if you talk to the disk (even an SSD). You can run down a ton of branches and run a bunch of code in the period of time you sit around waiting for data to arrive.
This wasn't true in 1992, when CPUs ran in the MHz range. But we are long past that phase.
About the only place this would seem to be applicable is in embedded systems looking to squeeze out performance while decreasing power usage. Otherwise, it feels like a waste for anything with a Raspberry Pi-class CPU.
I read the thesis, and it is incredible. Massalin developed Synthesis OS to run on Sun machines (using a 68030 if I'm not mistaken), and was able to run Sun OS executables on her OS, faster than Sun OS could run its own executables. Part of that was the specialization kept code and data close together to keep cache misses down. I just wish the code to Synthesis was available for study.
The partial specialization here is all about optimizing access to the data that lives off the CPU. Your drivers, and the protocol stacks above them, carry tons of associated data to make them generic and configurable. By JITing out optimized code you keep the generic source but get optimized hot paths, by embedding some of those constants in the instruction stream and simply never reading others. This also helps the I$ by skipping over all those if(bug that doesn't apply to my hardware) sections. But I think the biggest win is defining the user/kernel boundary in ways that are specific to your application. We've started to see some of that with neat bpf uses.
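A hand-written before/after illustration of what that buys (the struct, flags, and helpers below are invented for the example; the whole point of the JIT is that the second function would be emitted mechanically at open() time rather than written by hand):

    /* Hypothetical NIC configuration and helpers, stubbed for illustration. */
    struct nic_config {
        int mtu;
        int has_checksum_offload;
        int quirk_needs_alignment_fixup;   /* the "bug that doesn't apply" flag */
    };

    static int  fixup_alignment(const void *pkt, int len) { (void)pkt; return len; }
    static void sw_checksum(const void *pkt, int len)     { (void)pkt; (void)len; }
    static int  hw_submit(const void *pkt, int len)       { (void)pkt; (void)len; return 0; }

    /* Generic path: config loads and quirk branches on every packet. */
    int tx_generic(const struct nic_config *cfg, const void *pkt, int len)
    {
        if (len > cfg->mtu)
            return -1;
        if (cfg->quirk_needs_alignment_fixup)
            len = fixup_alignment(pkt, len);
        if (!cfg->has_checksum_offload)
            sw_checksum(pkt, len);
        return hw_submit(pkt, len);
    }

    /* What the specialized code would look like for a NIC with offload,
     * no quirks, and MTU 1500: the constant sits in the instruction stream
     * and the dead branches never make it into the I$. */
    int tx_specialized(const void *pkt, int len)
    {
        if (len > 1500)
            return -1;
        return hw_submit(pkt, len);
    }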
> The speed of light is a killer here. In the best case, you are talking about ~1000 cycles if you hit main memory
Nitpick: this isn't really a speed-of-light issue. The time taken for an electrical signal to make a round-trip between CPU and RAM on the same motherboard is closer to 10 clock cycles than 1000. Any latency beyond that is a result of delays within the CPU or memory modules themselves, not the time spent traveling between them.
CPU limitations relative to IO became a much bigger issue again with 10Gbps+ networking and NVMe. But ignoring that, there is the entirely separate issue of IPC, and anything else that causes traversals of the userspace/kernel boundary at a high rate relative to the time spent on external IO, which can be absolutely brutal on throughput.
The biggest speedups I've made on systems I've worked on over the last 20 years have usually come down to spotting patterns people don't think about that cause unnecessary context switches, because context switches are far more expensive than people realize.
Now, doing that is not going to be fixed by "just" JIT'ing the write syscall, but you have to start somewhere, and I think the more promising approach there would be to look at what parts of the kernel code you could actually JIT into user-space.
E.g. you obviously can't elide anything security-related that checks userspace, but consider when the kernel knows that you're writing to a pipe that only a single other process is reading from. Could you turn the write()/read() pairs into a buffer mapped directly into the reading process, replace the read() syscalls with operations purely on the userspace buffer when there is data there, and only require a context switch when one side outpaces the other (e.g. using futexes when the buffer is potentially full or empty)?
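A rough userspace sketch of what that generated fast path could look like, assuming the kernel has mapped a shared single-producer/single-consumer ring between the two processes (the struct, names, and sizes here are invented, and a real version needs EOF, poll integration, partial-write semantics, and more): the kernel is only entered, via a futex, when the ring is full or empty.

    /* Sketch only: invented shared-memory ring, futex used when one side
     * would otherwise block.  Not a drop-in replacement for pipes. */
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define RING_SIZE 65536u                 /* power of two */

    struct ring {
        _Atomic uint32_t head;               /* advanced by the writer */
        _Atomic uint32_t tail;               /* advanced by the reader */
        uint8_t          data[RING_SIZE];
    };

    static void futex_wait(_Atomic uint32_t *addr, uint32_t expected)
    {
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }

    static void futex_wake(_Atomic uint32_t *addr)
    {
        syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    /* "write()" that only enters the kernel when the ring is full. */
    size_t ring_write(struct ring *r, const void *buf, size_t len)
    {
        uint32_t head = atomic_load(&r->head);
        uint32_t tail;

        while ((head - (tail = atomic_load(&r->tail))) == RING_SIZE)
            futex_wait(&r->tail, tail);      /* full: sleep until a read */

        size_t space = RING_SIZE - (head - tail);
        size_t n = len < space ? len : space;
        for (size_t i = 0; i < n; i++)       /* byte copy keeps the sketch short */
            r->data[(head + i) & (RING_SIZE - 1)] = ((const uint8_t *)buf)[i];

        atomic_store(&r->head, head + n);
        futex_wake(&r->head);                /* wake a reader waiting on empty */
        return n;
    }

    /* "read()" that only enters the kernel when the ring is empty. */
    size_t ring_read(struct ring *r, void *buf, size_t len)
    {
        uint32_t tail = atomic_load(&r->tail);
        uint32_t head;

        while ((head = atomic_load(&r->head)) == tail)
            futex_wait(&r->head, head);      /* empty: sleep until a write */

        size_t avail = head - tail;
        size_t n = len < avail ? len : avail;
        for (size_t i = 0; i < n; i++)
            ((uint8_t *)buf)[i] = r->data[(tail + i) & (RING_SIZE - 1)];

        atomic_store(&r->tail, tail + n);
        futex_wake(&r->tail);                /* wake a writer waiting on full */
        return n;
    }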
Another case is things like user-space networking. If the user has sufficient privilege to write directly to the network anyway, why not have their networking code dynamically run in user space when it makes sense?
Yes, it won't be free, but the point of targeting IO-intensive code for this kind of thing is that a lot of the code paths are quite simple but tend to get executed at high frequency over and over again, so you can often afford to do a lot of work even for relatively modest speedups. Even more so in modern systems, where distributing the work over enough cores is often a challenge: any time you have a core idling, you have time to spend trying to JIT existing code paths with more expensive optimizations.
Any must reads for getting into kernel optimization? I'd like to be able to diagnose performance issues better and you seem to know the flow very well. Any advice on material to read or places to hang around would be appreciated!
Specifically kernel optimization isn't really my area. I've mostly approached it from the userspace side in that I've worked a lot on optimizing apps that often turn out to abuse syscalls.
From that side, I'd say looking at tracing tools like systrace is worthwhile, but also just plain strace (system calls) and ltrace (dynamic library calls). Running "strace -c -p [pid]" against a running process on Linux, waiting a bit and then hitting ctrl-c, or just "strace -p [pid]" (or dropping the "-p" and adding the command you want to trace), is hugely illuminating and will often tell you a lot about the code you're running.
You can usually do the same with "proper" profilers (like e.g. gprof), but strace has the benefit of focusing specifically on the syscall boundary, and being very easy to pull out. You can use it both for profiling (with -c) and for seeing the flow of syscalls (without -c). The latter is useful for code that gives opaque error messages, or to find out e.g. which files an application tries to open, and is invaluable for troubleshooting.
Specifically for performance issues, there is one thing that is very valuable to remember: syscalls are expensive. Very expensive. Code that does lots of small read() calls, for example, is almost certainly doing something stupid (like not doing userspace buffering). But in general, a brief run of strace -c and figuring out why the top calls occur so often and/or take so long very often leads to obvious fixes.
E.g. a far too common pattern, though it's getting better, is read()'ing a length indicator in an incoming packet of data, followed by a second read() to read the contents, instead of using a non-blocking socket and doing larger reads into a buffer. When the message flow is slow, it doesn't matter. When it's fast, the difference in throughput can be massive.
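For anyone who hasn't seen the pattern, a small illustration (invented names, and the buffered version assumes each message fits in the buffer): the first version costs two kernel round-trips per message, the second costs roughly one read() per buffer-full of messages.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Slow: two syscalls per length-prefixed message. */
    ssize_t read_message_naive(int fd, void *out, size_t outlen)
    {
        uint32_t len;
        if (read(fd, &len, sizeof len) != (ssize_t)sizeof len)   /* syscall #1 */
            return -1;
        if (len > outlen)
            return -1;                                           /* sanity check */
        return read(fd, out, len);                               /* syscall #2 */
    }

    /* Faster: drain whatever is available into a userspace buffer, then
     * parse complete messages out of it without touching the kernel. */
    struct msgbuf {
        uint8_t data[65536];
        size_t  used;
    };

    ssize_t read_message_buffered(int fd, struct msgbuf *b, void *out, size_t outlen)
    {
        for (;;) {
            if (b->used >= sizeof(uint32_t)) {
                uint32_t len;
                memcpy(&len, b->data, sizeof len);
                if (len > outlen)
                    return -1;
                if (b->used >= sizeof len + len) {     /* whole message is here */
                    memcpy(out, b->data + sizeof len, len);
                    b->used -= sizeof len + len;
                    memmove(b->data, b->data + sizeof len + len, b->used);
                    return len;
                }
            }
            ssize_t n = read(fd, b->data + b->used, sizeof b->data - b->used);
            if (n <= 0)
                return n;                              /* EOF or error */
            b->used += n;
        }
    }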
As an additional thing, even redundant security checks will be plenty fast as long as the data being checked is mostly unchanging and colocated with the code. (E.g. struct task in Linux.)
JITting the write syscall is not even close to what can be achieved when data currently gets copied, on average, five times before it reaches a driver. Mailbox-style IPC everywhere would help with that immensely, but devices tend to skip DMA and require sleeps or slow port writes to do anything due to hardware issues.
This will be more like the kernel as a JIT compiler instead of the current almost-FSM (finite state machine) model. Improved performance at the expense of (probably) correctness. Expect more of the current Intel-style bugs that are hard to debug and devastating. But implemented well this could be revolutionary for Ops etc. Self-tuning DBs exist and stuff like the JVM has been around for decades, this simply does that on a lower level.
This is a wonderful idea, and I hope many people start working on it right away.
Although Massalin has never published her code, according to my memory of her thesis, Synthesis's runtime code generation was mostly extremely simple, more like linking than what we think of as "code generation" — it copied a template method into the appropriate slot in the newly-generated quaject, then overwrote specific bytes in the generated code with pointers to the relevant callout (or, in some cases, the initial value of an instance variable for that quaject). Parts of the code that did not benefit from being specialized in this way were factored into ordinary functions the quaject method would just call.
This meant that only a small amount of code was generated for each quaject, and the runtime code generation was very nearly as fast as memcpy(), which meant that it was reasonable to use it on every quaject instantiation.
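To make that concrete, here's my guess at the shape of it in C (this is a reconstruction from my memory of the thesis, not Massalin's code, and the struct names are mine): instantiation really is just a copy plus a couple of patched slots.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct method_template {
        const uint8_t *code;            /* pre-compiled machine-code fragment */
        size_t         len;
        size_t         callout_offset;  /* where the callout pointer gets patched in */
        size_t         ivar_offset;     /* where the initial instance variable goes */
    };

    /* Instantiate one quaject method into 'dst': very nearly a memcpy. */
    void instantiate_method(uint8_t *dst, const struct method_template *t,
                            void *callout, uintptr_t ivar_initial)
    {
        memcpy(dst, t->code, t->len);
        memcpy(dst + t->callout_offset, &callout, sizeof callout);
        memcpy(dst + t->ivar_offset, &ivar_initial, sizeof ivar_initial);
        /* A real system would follow this with the icache-sync / W^X
         * steps discussed elsewhere in the thread before calling dst. */
    }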
Massalin also talked about applying some optimizations to the generated code, such as the constant-folding and dead-code removal John mentions, but I have the intuition that only a minority of quaject instantiations involved such more aggressive optimizations. Since she never published Synthesis, it's impossible to know for sure. (I'm not questioning her integrity or claiming that the impressive benchmarks reported in her dissertation are faked; I'm saying that we unfortunately can't see the exact mixture of interesting things you need to do to get those kickass benchmarks; so, like an insecure Intel CPU, I'm reduced to speculation.)
Later implementations inspired by Massalin's approach included Engler's VCODE (which, to my knowledge, has also never been published; Engler's PLDI paper cites Massalin in the second sentence of the abstract), which was used to implement Engler's `C, and GNU Lightning (inspired by Engler's published papers about VCODE), used in a number of modern JIT compilers.
I suspect that, by contrast, John's idea of using LLVM is inevitably going to have much higher overhead — if only from the CPU cache devastation brought about by any LLVM invocation — so will only be a win for much-longer-lived objects, where the large instantiation overhead can be amortized over a much larger number of invocations. An intermediate approach like Engler's `C might be more broadly applicable.
John suggests this early on in his "for deployment" comment, but I think that it's probably necessary for prototyping too, since the objective of the whole exercise would be to get an order-of-magnitude speedup, and the objective of the prototype would be to find out if that's a plausible result. A prototype that makes all your programs run slower due to LLVM wouldn't provide any useful evidence about that.
I asked Perry what he thought about the above, and he replied with this gem:
So you’re correct that the code generation was mostly “template
instantiation”. I think that was key to having calls like open()
function in reasonable time. I also suspect LLVM is a blunt instrument
for this work. That said, it would have been difficult for anyone but
Massalin to work with the approach in Synthesis. It was very much the
product of a person who was both intensely brilliant and completely
comfortable with writing “weird code” in the instruction set they were
working in.
So there’s then the question of how one can take the ideas from
Synthesis and make them a practical thing that ordinary programmers
could build and contribute to. And that is almost certainly going to
involve compiler tooling. As a prototype, making this work by using
LLVM is probably a good approach. Ultimately, I think that one is
going to have to do the magic at kernel build time and have something
fairly lightweight happen at runtime. But to figure out what that is,
one needs to play. And the easiest tools right now for playing
involve LLVM. If, for example, you can use LLVM successfully to
specialize a write call’s instruction path down an order of magnitude
or more, or to do similar things in the networking code, one can then
ask how to do this better.
There are, of course, a number of interesting paths towards playing
with this. I suspect that none of them end anywhere near where they
start. But the only way to see what might be possible with much better
tooling is to start, and you have to start somewhere.
BTW, I think the time is right, or even over-right, for this. Single
processor core performance is stalled out, and while in 1992 one could
just say “well, we’ll have another factor of ten performance
improvement in a few years, who needs the trouble”, that’s no longer
the case. Note that this argument also applies, to a considerable
extent, to other parts of the modern software ecosystem. When you
can’t just say “we could spend a couple of months optimizing this, but
the next generation of processors will be out by then”, things change.
Anyway, not sure if this answers your call for comments, but if you
are interested in specific areas around this, I’ve no shortage of
opinions. Many would say I have far too many opinions...
You can quote any subset or the entire thing. So long as you don’t
distort my intent I have no problem with people using my words that
way.
I also got the impression when reading the article that a full-blown high-level-language JIT is overkill for a kernel. A lighter template-instantiation approach is much better. The key is deriving the template fragments, constant patch tables, etc. from the fully featured source code. Doing this in a fully automatic fashion seems computationally hard and likely requires finely tuned heuristics to figure out which sections can be made optional depending on parameters. Humans could annotate the code, but that essentially creates a new and quite weird programming language. The compiler would effectively consider all possible combinations of compile-time configuration switches at once and emit instructions on how to instantiate a given configuration from machine-code templates.
All of this has devils and monsters lurking in the details, of course. I have seen systems where this kind of runtime code specialization would indeed be helpful even in user space. There is a lot of code where the innermost loop has conditionals based on some user input or program setting that cannot be determined at compile time. Code like this could profit a bunch from runtime specialization, too. Mapped into a programming language, this would probably look something like generics, but with parameters that need to be passed to a constructor at runtime to get something executable.
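A crude approximation of that, which you can already do without a JIT, is to compile the interesting variants ahead of time and let a "constructor" pick one from the runtime parameter (invented example; a real runtime specializer would generate the variant instead of merely selecting it):

    #include <stddef.h>

    typedef void (*scale_fn)(float *dst, const float *src, size_t n, float k);

    /* Generic inner loop: the setting is re-checked on every element. */
    void scale_generic(float *dst, const float *src, size_t n, float k,
                       int clamp_enabled)
    {
        for (size_t i = 0; i < n; i++) {
            float v = src[i] * k;
            if (clamp_enabled)
                v = v > 1.0f ? 1.0f : v;
            dst[i] = v;
        }
    }

    /* Pre-built specializations with the conditional hoisted out entirely. */
    void scale_noclamp(float *dst, const float *src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    void scale_clamp(float *dst, const float *src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++) {
            float v = src[i] * k;
            dst[i] = v > 1.0f ? 1.0f : v;
        }
    }

    /* The "constructor": runtime parameter in, something executable out. */
    scale_fn make_scaler(int clamp_enabled)
    {
        return clamp_enabled ? scale_clamp : scale_noclamp;
    }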
The problem lies in making this work together with optimizations. For example, how do you schedule instructions and allocate registers optimally when you don't know the exact sequence of instructions that will execute? Not getting that right costs you a bunch of performance. Then there are things like variants of the code that might be vectorized under certain conditions. You would need to find them in the compiler, emit the proper alternative code templates, and give the runtime a way to pick them up when appropriate. A full JIT could do all of that at runtime, but that is also expensive to run, and likely too big and buggy to exist in a stable and safe operating system kernel.
It gets harder when you want to do link time optimization. Without it, the approach is inferior to ahead of time optimization with some programmer input... (Degeneralization and devirtualization rather than specialization.)
It's fine if it's inferior to ahead-of-time optimization with programmer input, because you don't have time for a programmer to optimize the kernel code anew every time you spawn a process or open a pipe or a socket. It just needs to be faster than what the kernel is doing now when you call read() or write() or whatever, without being too much slower when you open the file (etc.)
Like, ahead-of-time optimization with programmer input might take a few hours to a few days, and that's not really an acceptable execution time for the open() system call.
Of course if you get some of the benefits of run time specialization, you also get some of the downside, i.e., overspecialization.
The heuristics for knowing when to stop specializing are a really hard part of these kinds of systems.
You could have a program that does a single write() on 100k different files. You could have thousands of processes each writing to a single file and bloating the I$ with useless copies of write.
Probably fifteen years ago I discussed runtime code generation with some computer architects, and they did indeed raise concerns about caching at the time. But now I am not convinced this is as much of a worry as people seem to think. Sure, if you were generating new code every time you opened a file and did a write, perhaps you would blow out the instruction cache with excessive amounts of freshly generated code. But if you waited to generate an optimized version of write until you see it is a "hot spot", it seems like you would probably be okay.
I can only speculate, so maybe someone who knows better can correct me, but I would be surprised if many kernel syscall implementations actually stay in cache after a few context switches. Especially with recent timing-channel mitigations. But I'm just guessing.
And even if there are architectural issues, with processors seeing less benefit from Moore's Law I would expect computer architects to be looking for ways to eke out more performance, such that they'd find solutions.
Novice question here: Applications that care about performance already do things like buffered reads, to minimize the overhead of system calls. Is there much value to optimizing read() in that case? Are there other system calls that are harder for applications to amortize?
Part of the value of reducing system call overhead is that it reduces the need to buffer reads; one of Massalin's key use cases was building a real-time audio processing system (on I think a 40-MHz dual-processor 68020 workstation) out of shell pipelines, for example to simulate what her voice would sound like after her gender transition, and real-time audio processing is terrible when there's a lot of buffering introducing latency. Only up to a few milliseconds is acceptable over the whole pipeline. On my Linux laptop the read() overhead is only about 300 ns, but lmbench says the pipe latency is 19000 ns, which adds up fast.
There's also the overall complexity cost: by pushing complexity into dozens of libraries and thousands of applications to compensate for unnecessary inefficiency in the kernel, the system can become more complex — harder to debug and harder to optimize — than if you just bite the bullet and optimize the kernel. And even then, you don't get back all the performance you lost — on Synthesis, unbuffered reads were actually faster than buffered reads on other operating systems, because you could avoid the extra copy through the userland buffer!
Assuming it's possible to optimize the kernel that much, of course. On a Netburst Pentium IV, you're fucked, and it still remains to be seen how bad the Intel speculative execution disaster ends up being.
You'd think so, but it's crazy how much code does not minimize system calls at all. Things like a read() to get the length of a string followed by a second read() to get the string (hopefully after sanity checks...) are rife. "strace -c" is my favorite first stop when something looks slower than it should, and it's just insanely frustrating how often it reveals poorly written code.
I'm working on something that could be applicable to this! There seem to be many areas that would benefit from a pluggable JIT engine, so I started building one: https://github.com/nitrousjit/nitrous/
It's designed to analyze and optimize existing codebases. So far it's produced some pretty decent speedups on the Python interpreter, though with a big drawback of very high compilation times due to using LLVM.
Let me know what you think or if you have performance-sensitive programs that you think would benefit from a low-investment JIT.
Yeah, I’m surprised that security wasn’t mentioned at all. The Linux kernel’s BPF JIT is incredibly restrictive on what it can interpret and produce: presumably a JIT like this would be much less constrained.
But in a sense, security is just part of the engineering required to make such a project a reality. No one is going to be deploying a kernel like this until it's proven, so perhaps the emphasis on security isn't needed for an initial blog post.
I think one interesting approach for this would be to look for cases where the kernel can leverage information it has to JIT code into userspace that allows the userspace process to reduce the number of context switches it needs into the kernel. Obviously that could not get as aggressive, as anything that requires security checks or access to information not accessible to the calling process would still need to enter the kernel.
E.g. is the other end of that file descriptor you're writing to another process on the same machine? Intercept the calls and replace them with operations directly on shared memory using a futex to coordinate, for example. Or is the process doing read()'s that are smaller than what is most efficient on the current system, and it's the only reader? JIT buffering into user space.
A lot of what you might consider it for could be done just fine in user space with sufficiently smart libraries that know how to handle a bunch of special cases, but the caveat is that you then depend on users knowing to use them, and my experience is that there is enough code out there doing tiny read()'s that I have no faith whatsoever that people will do a good job at even the very basics. Meanwhile, the kernel has a lot of information easily accessible that a userspace implementation might have to jump through hoops to find out.
The way I understood it is that it would only perform code generation for trusted kernel code, not for arbitrary code provided by the user. Doesn't this resolve most (all?) security concerns?
Not necessarily, because it means you need to be sure that the combination of the JIT, the trusted kernel code and user data will only ever result in safe code paths. E.g. consider a JIT that mistakenly optimizes away a bounds check in the original trusted code in certain cases where it is not safe to optimize it away.
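E.g. a tiny invented example of the hazard: 'len' comes from untrusted input, so even if every call the specializer has observed had len <= sizeof buf, folding the check away is never safe, whereas specializing on things the kernel itself controls would be.

    #include <stddef.h>
    #include <string.h>

    struct conn { char buf[256]; };

    int copy_in(struct conn *c, const char *user_data, size_t len)
    {
        /* This bound depends on user-controlled data and must survive
         * any specialization, no matter what values have been seen. */
        if (len > sizeof c->buf)
            return -1;
        memcpy(c->buf, user_data, len);
        return 0;
    }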
There is a lot of space between 'fully Turing complete' and 'useful' where you can allow certain patterns but not arbitrary code. This area gets discussed every ten years or so.
There was even something called Proof Carrying Code where you make the caller do some of the work up front.
I suspect the main advantage of a system like this (basically, userspace code is run in a software VM) would not be code synthesis (though I do think that is also useful), but the fact that there's no longer an expensive boundary between "user" and "kernel" code. A system call would simply be a regular function call; there would be no use of interrupt handling or manipulating/switching page tables or ultimately any use of an MMU.
Once this boundary is removed, we won't need things like userspace buffering, which exists to work around this cost at the expense of some odd user behaviour:
$ cat | grep a | cat
Ultimately, I do think that OSes will go in this direction. As mentioned in the article, there are currently things like eBPF in Linux which take advantage of this. I don't see why we couldn't have a kernel which supports both running VM code in "ring 0" but also supports a Linux-like VM mechanism for running computational code which would be compiled sub-optimally by the software VM. Presumably this functionality could be implemented directly into an existing OS, so it wouldn't necessitate some major switch.
Running userspace code in a kernel VM isn't what John is suggesting, and the experimental OSes that tried this tack have been hit rather badly by the Intel speculation security problems.
Webassembly would probably be an ideal executable format for a synthesis kernel. Safe to execute, easy to generate these days from all sorts of programming languages.
Plus, you can actually inline syscalls and whatnot into your compiled application at runtime.