HermiTux: A Binary-Compatible Unikernel (ssrg-vt.github.io)
145 points by externalreality on April 6, 2019 | 26 comments


I think they're on the right track from a usability standpoint in making tools that help build a minimal kernel by determining which system calls need to be included based on the application. Pulling in the correct dynamic libraries is also important.

I envision a system that can use static analysis to choose system calls, find libraries (ldd), and perhaps even pick between appropriate algorithms (thread schedulers), all based on the particular program it is going to run. In other words, kernels that optimize themselves for the program, much like compilers optimize programs for things like the target architecture. I think unikernels have great potential.
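
As a rough sketch of the "find libraries (ldd)" half of that idea (my own illustration, not an existing tool), something like this just asks the dynamic linker which shared objects a binary needs; a real tool would read the DT_NEEDED entries out of the ELF directly instead of shelling out:

    /* Sketch: list the shared objects a binary needs by running ldd.
     * A real tool would parse DT_NEEDED entries from the ELF itself. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <binary>\n", argv[0]);
            return 1;
        }

        char cmd[512];
        snprintf(cmd, sizeof cmd, "ldd %s", argv[1]);

        FILE *p = popen(cmd, "r");          /* run "ldd <binary>" */
        if (!p) {
            perror("popen");
            return 1;
        }

        char line[512];
        while (fgets(line, sizeof line, p)) /* each line names one shared object to bundle */
            fputs(line, stdout);

        pclose(p);
        return 0;
    }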

Also, for system calls that aren't security or performance concerns, why not continue to proxy them to the host kernel as HermitCore does?


Unfortunately a system like this can't work for every binary: system calls and dynamic libraries can't be statically inferred in all cases.


Could you please explain why they can't be?

It seems counterintuitive to me. You seem to be saying that static analysis cannot reach every branch of the program (and thus its underlying system libraries).

I'm surely missing something and would be grateful to anyone helping me out!


You can read Chiba's thesis, which explains exactly how they are able to detect the system calls made by the ELF using static analysis and control-flow analysis (the techniques all seem simple and ingenious; you can even try writing a little program yourself that uses their algorithm if you don't believe them). The thesis explains the edge cases and how they work around them. Remember, they have to handle all of the system calls or the program would suffer run-time errors.
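
As a heavily simplified illustration of that kind of detection (my own sketch, not the algorithm from the thesis), you can scan a binary for the x86-64 syscall instruction (0x0F 0x05) and remember the most recent "mov eax, imm32" (0xB8) seen before it; real analyses parse the ELF text section and follow control flow rather than scanning raw bytes:

    /* Naive sketch: report values that look like syscall numbers, i.e. the
     * most recent "mov eax, imm32" (0xB8 xx xx xx xx) seen before each
     * "syscall" instruction (0x0F 0x05). A real analysis would parse the
     * ELF text section and follow control flow rather than scan raw bytes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <binary>\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        fseek(f, 0, SEEK_SET);

        uint8_t *buf = malloc(len);
        if (!buf || fread(buf, 1, len, f) != (size_t)len) {
            fprintf(stderr, "read failed\n");
            return 1;
        }

        long last_mov_eax = -1;                        /* last imm32 loaded into eax */
        for (long i = 0; i + 1 < len; i++) {
            if (buf[i] == 0xB8 && i + 4 < len)         /* mov eax, imm32 */
                last_mov_eax = buf[i+1] | (buf[i+2] << 8) |
                               (buf[i+3] << 16) | ((long)buf[i+4] << 24);
            if (buf[i] == 0x0F && buf[i+1] == 0x05)    /* syscall */
                printf("possible syscall number: %ld\n", last_mov_eax);
        }

        free(buf);
        fclose(f);
        return 0;
    }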


dlopen() allows dynamic linking of symbols at run time, e.g. based on user input or local circumstances (CPU capabilities, the type of DB being connected to pulling in a different set of libs for different versions, etc.).

You could provide hints for these circumstances, or change a program to avoid the practice, but library resolution that is delayed until runtime, after the program has already started, is a real thing.
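
For example (a minimal sketch; the library and symbol names are made-up placeholders), the linker has no way of knowing at build time which of these objects will ever be loaded:

    /* Sketch: which shared object gets loaded is only decided at run time,
     * so a static scan of this binary cannot enumerate the libraries it may
     * pull in (or the syscalls those libraries make). The library and symbol
     * names below are placeholders. */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Pick a backend based on user input. */
        const char *lib = (argc > 1 && argv[1][0] == 'p')
                              ? "libbackend-postgres.so"
                              : "libbackend-sqlite.so";

        void *handle = dlopen(lib, RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        /* Resolve a symbol that only exists inside the chosen backend. */
        int (*connect_db)(const char *) =
            (int (*)(const char *))dlsym(handle, "connect_db");
        if (connect_db)
            connect_db("example");

        dlclose(handle);
        return 0;
    }

(Older glibc needs -ldl at link time.)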


It is easy to write a function that makes a syscall directly, not via well-known paths in libc or similar. The syscall number can then be a variable instead of a constant, and determining that value statically means solving the halting problem. QED.
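
A contrived sketch of what that looks like (using the libc syscall(2) helper rather than inline assembly, but the point is the same): the number only exists at run time, so no static scan of the binary can enumerate the calls.

    /* Contrived sketch: the syscall number is computed at run time, so no
     * static scan of this binary can tell which system call it will make. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(int argc, char **argv)
    {
        long nr = (argc > 1) ? atol(argv[1]) : SYS_getpid;  /* number comes from user input */
        long ret = syscall(nr);     /* direct syscall, not a named libc wrapper */
        printf("syscall(%ld) returned %ld\n", nr, ret);
        return 0;
    }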


Since a conservative approximation is still useful, the halting problem doesn't matter.

That is, say syscall 1 is definitely used, syscall 2 is definitely not used, and whether syscall 3 is used depends on the halting problem. Including only 1 and 3 in the unikernel is safe, and better than including all of 1, 2, and 3.


This works for trivial programs. Any moderately sized program I can imagine right now will result in an analysis that is equivalent to "this uses everything."


It's really nice work. I have to take issue with something in the paper [1]:

“First, hardware-enforced isolation such as the use of SGX or EPT is fundamentally stronger than software-enforced isolation (containers/software LibOS), as shown by the current trend of running containers within VMs for security”

I don’t think that’s true. High-assurance security used to demand that products use as many hardware defenses as possible. That was good for a while. Then the hardware started having serious problems we’d have to work around. Vendors seemed to be screwing up almost all of these mechanisms at some point, with the MMU and virtualization working better than average (they’re the most tested). Muen made the decision to rely on just one mechanism, to avoid multiple failure modes, with software written in SPARK Ada for verification of safety. It did better against Meltdown/Spectre than many. More problems keep coming in on the hardware side, including attacks focused on SGX.

I think their assumption is thoroughly refuted at this point for modern, complex hardware. I’d depend on the hardware as little as I can, using hardware as simple as I can get away with. Then, build the software as strong as possible around the most battle-tested mechanisms to reduce the number of changes needed. seL4 on ARM and Muen on x86 are examples. If one has a massive budget, then build hardware extensions that make securing software easier. Then do the same thing again, building highly assured kernels on those heavily verified hardware extensions.

[1] https://www.ssrg.ece.vt.edu/papers/vee2019.pdf


How does Docker use so little memory?

A better link: https://github.com/ssrg-vt/hermitux-kernel

The link in the article seems to point to apps that use the kernel?


Docker's memory usage for the container is going to be pretty close to that of the same process running without the namespace. For the hello world they're testing, it's going to be roughly the size of the binary plus libc. There's probably another MB or so hiding in Docker metadata, namespace metadata in the kernel, etc. that "docker stats" doesn't account for.


Depends.

Take the hello world example. Correct me if I'm wrong, but if one runs it 1000 times in parallel in a single namespace, it will be using close to 1x memory (i.e. like it was running once, assuming the vast majority of memory is shared). Run it 1000 times in parallel in 1000 Docker containers and its memory impact will be closer to 1000x (i.e. no sharing of memory going on).


It's unfortunate, but unikernels are going to remain the future, as opposed to the present, until they beat containers in the only two things that matter: ease of use and memory consumption.


> only two things that matter: ease of use and memory consumption

Those are seriously not the only two things that matter. Not by a long shot. Depending on what you are doing, there are other serious concerns like security, OS noise, and other performance issues. From the literature I've been reading, security is a huge concern when deploying containers. From experience, I can tell you that DevOps with containers can be a nightmare and a half, costing companies heavily.

Saying ease of use and memory footprint are all that matters is serious misinformation that no research, literature, or anecdote supports.

That said, ease of use, at least, is coming. There are some tools on the market right now that make unikernels fairly easy to use; Ops and BoxFuse come to mind.


I do share your vision, externalreality. I'm just saying, from an approximate but factual point of view, that unikernels won't see a fraction of containers' popularity until they can compete in the two areas I mentioned.


Agreed, agreed.


> From experience, I can tell you that DevOps with containers can be a nightmare and a half, costing companies heavily.

Why are containers any more dangerous/vulnerable/prone to leaks than deploying say... a standalone REST API not in a container?


> Why are containers any more dangerous/vulnerable/prone to leaks than deploying say... a standalone REST API not in a container?

It's not about one service, it's about n services running on the same resource-partitioned hardware. If one gets compromised, how likely is that to affect services running in other partitions? Containers: high (shared kernel). Unikernels: lower (separate kernel, hardware-supported isolation; almost like running n different physical machines).


It's not obvious, though. It may be less risk, but you're not eliminating interfaces with a VM; you're replacing them with new ones. With bare processes you've got shared system resources and the kernel to attack. With Docker you've got the kernel to attack. With VMs you've got virtual hardware drivers to attack.

We can play with risk estimation, but in practice both containers and VMs have already been affected by memory-sharing failures. We know that syscall/ioctl issues exist, and we know that virtualisation issues exist.


This is a great approach. The standard unikernel approach of cutting above libc means you have to be portable against musl (or another libc) and play with your build. Not the end of the world, but it doesn't seem necessary.


Very exciting, would like to see an orchestration layer working with this.


Congrats on the MS and getting on HN front page!


I've always wondered: what is the overhead of an OS (say, Linux) compared to a standalone binary? Maybe someone here could answer the question: what kind of differences would you see for nginx on HermiTux vs. Linux? Less memory usage, more requests? Also, what about a standalone web-server operating system that just implemented a TCP stack?

How would the performance and memory usage of those three things compare?


OSv benchmarked memcached: "An unmodified memcached running on OSv was able to handle about 20% more requests per second than the same memcached version on Linux. A modified memcached, designed to use OSv-specific network APIs, had nearly four times the throughput."

http://osv.io/benchmarks


Not a huge difference, I think; at a high level the work involved is exactly the same. But there are a few differences.

Raw system call overhead: this isn't a lot, but in an I/O-heavy application it shows up as a reasonable percentage of the work.
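
A rough way to see what that overhead means (my own sketch; the numbers depend heavily on the machine and on Meltdown/Spectre mitigations) is to time a cheap real syscall against a plain function call:

    /* Rough sketch of "raw system call overhead": time a cheap real
     * syscall (getpid via syscall(2)) against a plain function call.
     * Only meant to show the order of magnitude being discussed. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define N 1000000

    __attribute__((noinline)) static long dummy(void) { return 42; }

    static double ns_between(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
        struct timespec t0, t1;
        volatile long sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            sink += syscall(SYS_getpid);     /* a real kernel round trip each iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("syscall:       %.1f ns/iter\n", ns_between(t0, t1) / N);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            sink += dummy();                 /* an ordinary in-process call */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("function call: %.1f ns/iter\n", ns_between(t0, t1) / N);

        (void)sink;
        return 0;
    }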

General-purpose network machinery: things like kernel thread context switches for interrupts and networking likely wouldn't be there in the more bare-bones systems. Also, none of the firewall/packet filter/netgraph checks will be there. It's also possible that a single-application network service might revert to a pure polling model and avoid interrupt context switches entirely.

Unix interface: your standalone web server might be written to avoid the copying TCP read(), and use direct callbacks or thread handoffs instead of select/poll/kqueue.

Other system gunk: none of the usual daemons and their kernel state need to be there, and they won't be waking up and stealing cycles from your service. You should expect much tighter service latency.

Another factor at play here is that a lot of the general OS paths have had substantial work done on them (interrupt coalescing, thread scheduling, memory management, etc.). That kind of effort may not be replicated in the simple standalone version, so by throwing it away you're losing some of the good along with the bad (where the "bad" here is otherwise-useful generality, not something objectively bad).

As mentioned in the HermiTux writeup, startup time can be vanishingly small, if that's of importance to you.

Overall there should be some modest performance gains, but really not enough to drive adoption solely on that basis. For me the real win is radically simplified management.

Given that everyone is building/using management layers over OS instances in a cloud context, is it really useful to carry around that large amount of additional machinery to manage per-OS configuration? If I can trivially build and deploy new instances and wire them into my service, do I need Chef anymore? Do I need to fuss with user permissions in a world where no one is really logging in?

Why manage a firewall, which assumes that all sorts of random junk might be running and that we need to implement a policy for safety? With tools like this I can build a system which, by construction, only responds to requests for the service of interest.

Do I need to maintain a shell environment for monitoring and debuggability? Maybe, but effective distributed deployments already use distributed monitoring and tracing tools that are arguably more effective, even for general-purpose OS deployments.


I mean, ideally you could get rid of the performance hit from syscalls, since you could run the app and the bare-bones kernel in the same address space. You could also get rid of paging and virtual memory and have a very fast malloc implementation. read()/write() could also be super fast. In addition, you can get rid of the scheduler (assuming no threads) and a ton of other complexity. Also, the CPU cache lines would be hella consistent.
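
Conceptually (hypothetical names, not HermiTux's actual mechanism), "no syscall hit" just means a system call compiles down to an ordinary function call into a library OS living in the same address space, something like:

    /* Conceptual sketch only: uk_write() is a made-up stand-in for a
     * library-OS routine that lives in the same address space as the
     * application. There is no trap, no privilege switch, no separate
     * kernel: a "system call" is just a function call. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    /* Pretend this is the unikernel's console driver. */
    static ssize_t uk_write(int fd, const void *buf, size_t len)
    {
        (void)fd;
        return fwrite(buf, 1, len, stdout);   /* stand-in for poking the device */
    }

    /* The wrapper the unikernel would expose under the libc name. */
    static ssize_t my_write(int fd, const void *buf, size_t len)
    {
        return uk_write(fd, buf, len);        /* plain call, no syscall instruction */
    }

    int main(void)
    {
        const char msg[] = "hello from the same address space\n";
        my_write(1, msg, strlen(msg));
        return 0;
    }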

Of course, you’re throwing away a lot. But for certain applications (like HFT), the potential benefits seem very attractive.




