Natalie: An early-stage Ruby implementation that compiles to C++ (github.com/seven1m)
137 points by ksec on Dec 23, 2021 | 50 comments


Compiling to another language is always going to be very slow, and very frustrating, because one's expressiveness is limited to the expressiveness of the target language.

(Yes, I know, because C++ is Turing Complete, anything is expressible, but it's a matter of efficiency.)

For example, what if the language needed exceptions to work in a different way than the target does?

And these days, one can easily use existing backends like the Digital Mars one (Boost licensed!), the GNU one, or the LLVM one. Why punish yourself by emitting code in another language? Natalie already requires gcc or clang anyway.

I could never have gotten D to work if it emitted C code as output.


> I could never have gotten D to work if it emitted C code as output.

D goes even further. It also compiles C code :-/


C++, a more complex language in every dimension, was successfully compiled to C by the first C++ compiler, Cfront.

Dlang has its own native code generator? You have amazing chops, I am sure you could have emitted C code.

What are your thoughts on producing Wasm instead of C? Maybe Scheme is the perfect compiler target.

https://en.wikipedia.org/wiki/Cfront


cfront in the 80s was a compiler for a far simpler C++ than today's. And if you ever looked at the code it generated, you'd be horrified.

One major problem it had in generating C was the lack of COMDAT sections in the generated object file. This was needed for generating code for the same member function in multiple source files. C compilers would just stick the code into the text segment, resulting in multiple declaration errors from the linker.
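
To make the COMDAT point concrete, here's a rough sketch of the problem (my own example, not actual cfront output):

    // shape.h -- a member function defined in the class body is
    // implicitly inline, so every .cpp that includes this header
    // instantiates its own copy of Shape::area().
    struct Shape {
        int w, h;
        int area() { return w * h; }
    };

    // Lowered to plain C, each translation unit emits an ordinary
    // out-of-line definition, roughly:
    //
    //     int Shape_area(struct Shape* self) { return self->w * self->h; }
    //
    // Link two such object files together and the linker reports a
    // "multiple definition of Shape_area" error. A COMDAT section tells
    // the linker "these copies are identical, keep just one" -- something
    // plain C output had no way to express.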

I am not saying Bjarne was an inept programmer. He's actually brilliant; it was just a very, very hard problem he set for himself, and it's amazing cfront worked as well as it did.

The cfront on DOS also was very slow (imagine writing the generated source code out to floppies!) and crippled because it didn't have near and far pointers. Zortech C++ was the first native C++ compiler, which bypassed all these problems. ZTC++ simply vaulted over cfront by making C++ simple and fast on DOS. DOS at the time was where 90% of the programming action was, so this was no small thing.

My opinion, and that of a few others at the time, was that Zortech saved C++ from oblivion by having a usable implementation on DOS. You can see that by the traffic level in comp.lang.c++ at the time - it took off immediately after ZTC++ was released, and never looked back.


I always love your responses. I didn't know about COMDAT, but I did use the SAS compiler on the Amiga which was Cfront based. I didn't know enough to be horrified, but I often looked at the output because I was integrating the output with regular C code (interpreters, Arexx, etc). It was kinda like Hungarian Notation++ ;) I got to the point where it felt natural to read the name mangling.

The Amiga SAS/C compiler was also pretty slow. I can see how ZTC would be popular, skipping intermediates and not having to reinvoke the compiler.

How many problems from the 80s melted away because of the large amount of ram available in the 90s?


One problem that melted away was that the compiler's 3 separate passes could be merged into one!

Also, the amount of code in the compiler increased enormously from the 80's to today.


The dmd code generator was/is a C++ compiler backend too, and can trace its lineage back to the first compilers to make Cfront obsolete.

We could emit C code, but it's just worse when you need to get to the cutting edge. Even if you end up going through the same backend via C, you've now (say) lost devirtualization.
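
To make the devirtualization point concrete, a rough sketch (illustrative, not dmd's or anyone's actual output):

    #include <cstdio>

    struct Animal { virtual int legs() { return 4; } };
    struct Bird : Animal { int legs() override { return 2; } };

    int count() {
        Bird b;
        Animal* a = &b;
        // A backend that sees the class hierarchy can prove the dynamic
        // type is Bird, turn the virtual call into a direct call to
        // Bird::legs, and inline it down to the constant 2.
        return a->legs();
    }

    int main() { std::printf("%d\n", count()); }

Lowered to C, that call becomes an opaque jump through a hand-built vtable, something like (*a->vtbl->legs)(a), and the C compiler no longer has a class hierarchy to reason about, so in general the optimization is lost.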


>Compiling to another language is always going to be very slow

I'm curious as to why you think this is necessarily true? It's always going to be _slower_ than writing the target language directly, of course; that's just a tautology. But if I pick a target language with a really fast compiler, say Golang (judging purely by reputation, not actual experience), then I have a huge head start on other languages, right? Even though I'm slower than Golang, I'm faster than other languages because Golang is so much faster than other languages. This is not the case with C++, but that's just a special case.

I think the true comparison you're probably making here is not what I said above, but this: parsing a language, processing it, then writing it out to another language (which itself needs to be parsed and processed, etc.) is always slower than just emitting whatever final IR we eventually reach directly? But like I said above, if the target language is correctly chosen, the overhead doesn't matter much, and in return you get the benefits of using a source language as IR; I will get to that later in reply to another point you made.

>and very frustrating, because one's expressiveness is limited to the expressiveness of the target language.

But C++ is extremely expressive though, indeed it's _too_ expressive for humans to grok and control.

>For example, what if the language needed exceptions to work in a different way than the target does?

I'm not the brightest bulb on runtimes and exceptions so I will take your (admittedly brilliant, I'm a fan) word that this is actually an insurmountable problem.

But isn't this a time-precision tradeoff? If you don't compile to a language with exceptions, then you need to make exceptions work from scratch. Using a source language as the target, you get a working, heavily-debugged exception mechanism, but you're now constrained to what it can express.
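
For instance (a hand-wavy sketch with invented names, not Natalie's actual scheme), a compiler targeting C++ can piggyback its source language's raise/rescue directly on C++ exceptions:

    #include <cstdio>
    #include <stdexcept>

    // Hypothetical wrapper type the compiler emits for the source
    // language's exception objects.
    struct RubyError : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    int main() {
        try {                              // begin
            throw RubyError("oops");       //   raise "oops"
        } catch (const RubyError& e) {     // rescue => e
            std::puts(e.what());           //   puts e.message
        }                                  // end
    }

Stack unwinding, destructor calls, and the unwinder's interaction with the platform ABI all come for free; that's the "heavily-debugged mechanism" part.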

>Why punish yourself by emitting code in another language?

This is where the benefit of compiling-to-source that I mentioned above comes in: it's an extremely low-barrier-to-entry strategy. Even though VM bytecode formats and other IRs are specifically designed to be clean and abstract targets that wrap the ugly details of platforms/architectures, there is nothing cleaner and more abstract than a source language designed to be used by humans.

It's extremely attractive for new compiler writers to not have to learn yet another language and keep its (quite low-level) details in mind along with the source one; they can just specify an equivalent program in another source language they already know and get all of its toolchain for free.

This can backfire in cases like C and C++, where the languages are not actually clean at all and there are tons of special cases and undefined behaviour that most developers ignore. But again, that's just a special case; there is no reason the overall approach can't work with other languages.

>I could never have gotten D to work if it emitted C code as output

Like I said, it's an entry-level strategy. It works when you need something working and you need it fast; once you have something working, you can ditch the makeshift backend and create a proper one (hopefully without having relied on any implicit semantics of the previous backend). It's like an intermediate point between an honest-to-God compiler and a naive tree-walking interpreter; they are all points on the same tradeoff axis.


> I'm curious as to why you think this is necessarily true?

It's always going to be slower simply because it's two compilers rather than one. You're writing another file to disk, and reading it back in. The lexing and parsing has to be done over again.

C++ only works if it is a superset of your language. For example, suppose your language wants to trap integer overflow. C++ doesn't do that. Think of how you'd write a+b in C++ and check for overflow. It isn't pretty or efficient. Or suppose you wanted to do a computed goto. C++ doesn't have that, and it's not easy or efficient to rewrite it. (gcc has it as an extension.)
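
For illustration, the portable dance in standard C++ looks something like this (GCC and Clang do offer __builtin_add_overflow as an extension):

    #include <climits>
    #include <cstdio>
    #include <stdexcept>

    // Checked signed addition in portable C++: the pre-checks compile to
    // several compares and branches, versus the single add-then-branch-
    // on-overflow-flag a backend could emit directly.
    int checked_add(int a, int b) {
        if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
            throw std::overflow_error("integer overflow");
        return a + b;
    }

    int main() {
        std::printf("%d\n", checked_add(2, 3));  // prints 5
        // checked_add(INT_MAX, 1) would throw instead of wrapping.
    }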

Suppose you wanted to use the BCD arithmetic type in the x87. You're out of luck using C++. Or the 80-bit reals in the x87. You're out of luck with many C++ implementations, as they don't support that.

> if you don't compile to a language with exceptions then you need to make exceptions work from scratch.

Yes, not an easy task at all.

> there is no reason the overall approach can't work with other languages

You'll find, as a practical matter, that if you're using language X as the target of your language, it will inevitably constrain the semantics of your language to be that of X. You can't even do things like use a different function call ABI.

> It works when you need something working and you need it fast

I bet you'll get something working fast, but trying to get the last 25% working will consume much more time than if you used an existing, well-developed back end.


Andreas Kling (of SerenityOS) recently did a Natalie contribution video that I found interesting. https://www.youtube.com/watch?v=b4PZgvPYkP4


And if you'd like to see Tim (original author of Natalie) working on Natalie, he has a YouTube channel with lots of great content here: https://www.youtube.com/channel/UCgWip0vxtqu34rZrFeCpUow :^)


I'm genuinely curious: why would I use this over something like Crystal, which is pretty much Ruby, but native?

https://crystal-lang.org/


Ideally, access to the existing Ruby ecosystem.

Crystal makes sense in the context of writing stuff from scratch or wanting to contribute to the ecosystem.


I sincerely doubt a tool like this will give you access to the full Ruby ecosystem.

I definitely love the idea, though. I struggle with lower-level languages, although I consider myself fairly decent with both Python and JavaScript.


This one in particular probably not, since it is in early stages.

RubyMotion, on the contrary, is quite advanced and is the AOT compiler to go to when one wants to use Ruby for mobile development.

http://www.rubymotion.com/


RubyMotion is probably as good as it gets for AOT. Ruby depends on dynamic dispatch, and thankfully objc is very similar to the needs of a Ruby-like language.

Truffle might work even better as it’s able to recompile.

Any C++ port will likely need to reimplement half of the objc runtime to support all of Ruby. Not sure if clang's/gcc's objc support includes the runtime, but I’m imagining it would… so maybe it’s reusable that way.


Is RubyMotion something we can actually use to build non-toy apps in 2021? I thought it was dead, but perhaps it is not!


It is not dead. There are regular releases, a helpful community at slack.RubyMotion.com and training available at https://wndx.school (that last is mine)


Crystal is superficially similar, but it pretty much ends there, and many of the differences feel pretty arbitrary.


> which is pretty much Ruby

Crystal looks similar, but really has very different semantics, and so can't run existing Ruby.


I was going to suggest adding Natalie to Ruby Compiler List [1], turns out Natalie is already on the list and started in 2019!

[1] https://ruby-compilers.com


Yeah I need to do the deep dive on it!


Crystal is not “pretty much Ruby” unfortunately. Use it for more than 30 seconds and you realize “oh this isn’t ruby”


I always wonder why you would target another language with your compiler, instead of an IR.

What is the benefit of transpiling to C++ over using LLVM?

(not meant as a criticism, genuinely curious)


It is easier to deal with; going one step further requires dealing with more low-level coding and can be demotivating for some.

However, I would advise targeting an IR instead.

It doesn't need to be LLVM. If the purpose is only learning about compilers, do as follows:

1 - Create an IR, preferably stack-based, as those are quite easy to target

2 - Basic IR interpreter for testing the workflow (a tiny sketch of steps 1 and 2 appears after this list)

3 - With a macro assembler, convert the IR into machine code in a dumb way

Now you have a workable compiler, even if it won't win any prizes.
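
A minimal sketch of steps 1 and 2, with all opcode names made up here:

    #include <cstdio>
    #include <vector>

    // Step 1: a tiny stack-based IR.
    enum class Op { Push, Add, Mul, Print };
    struct Instr { Op op; int arg; };

    // Step 2: a dumb interpreter for it, just to test the workflow.
    void run(const std::vector<Instr>& code) {
        std::vector<int> stack;
        for (const auto& i : code) {
            switch (i.op) {
                case Op::Push: stack.push_back(i.arg); break;
                case Op::Add: { int b = stack.back(); stack.pop_back();
                                stack.back() += b; break; }
                case Op::Mul: { int b = stack.back(); stack.pop_back();
                                stack.back() *= b; break; }
                case Op::Print: std::printf("%d\n", stack.back()); break;
            }
        }
    }

    int main() {
        // (2 + 3) * 4
        run({{Op::Push, 2}, {Op::Push, 3}, {Op::Add, 0},
             {Op::Push, 4}, {Op::Mul, 0}, {Op::Print, 0}});
    }

Step 3 then replaces run() with a loop that emits the equivalent machine instructions instead of executing them.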

If it's still interesting, then proceed to improve the code generation in a proper way.


Ribbit is an R4RS implementation that is extremely compact. The video is only 15 minutes and worth a watch.

https://github.com/udem-dlteam/ribbit

https://www.iro.umontreal.ca/~feeley/papers/YvonFeeleyVMIL21...

https://www.youtube.com/watch?v=A3r0cYRwrSs


Thanks for sharing.


What's an IR?


Intermediate Representation.

A kind of high-level, assembly-like language used across compiler stages to make them more modular and easier to manipulate.

Some known examples are LLVM bitcode, GCC GIMPLE, Rust MIR, Swift SIL, TensorFlow MLIR, ...

Even bytecode formats like MSIL, JVM bytecode, WASM, or P-Code can be considered a form of IR.
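
As a tiny taste (hand-written and unoptimized, so real compiler output will differ), the C++ function below corresponds to LLVM IR roughly like this:

    // C++ source:
    int add(int a, int b) { return a + b; }

    // Roughly equivalent LLVM IR, in its textual form:
    //
    //     define i32 @add(i32 %a, i32 %b) {
    //       %sum = add i32 %a, %b
    //       ret i32 %sum
    //     }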


Oooh okay. Thank you


One reason is that C++ is a stable “API” whereas LLVM IR is a somewhat moving target.


For me, it's that it's almost guaranteed a C++ compiler is available for the platform I'm targeting. This could be for various reasons, but it's usually either:

1. The chip isn't supported by LLVM (which most new languages use)

2. The platform owner /requires/ the use of their C++-only SDK and will not approve any other compilers for use on their app store, so C++ becomes the "machine language" of that platform.


"It depends", as always :)

Some people may do it because it's more approachable to them (and/or others). Others may have language models/runtimes that align closely with the target language. Or they may want to "stand on the shoulders of giants", benefiting from the higher-level optimizations of the target language. Also, for example, LLVM still doesn't target as many architectures as gcc (though I'm not saying that's necessarily very relevant for most users).

That's just what comes to mind. I can't say for certain anything about this particular language though!


Most importantly, you get immediate startup: no need to parse and compile. You'll also need less memory.

And third, and less important, you can add more expensive offline optimizations that are too heavy at run-time, like escape analysis and inlining.

Especially for the static parts; the dynamic parts just call into the shared Ruby runtime.


> Most importantly, you get immediate startup: no need to parse and compile.

Compiling to an IR doesn't mean 'and then interpret or JIT it'. You can use an IR to compile to native code. GCC, Clang, Rust, etc. all use an IR, for example.


This is cool; it looks conceptually similar to Shedskin, a Python-to-C++ compiler.

https://shedskin.github.io/


I’m a little confused about this project. I’m trying to build the most complete list of programming languages out there (currently working on it via my favorites, if you want to have a look), and I’m trying to figure out if this qualifies.

Usually it’s pretty easy to get on my list: if you call your project a language I add it to the list.

But this one gives me a little pause, because it seems like this language is not distinct from Ruby at all. Rather, this is a straight-up Ruby -> C++ compiler.

Is it fair to call this a language rather than a compiler? To me, a language is more than syntax and semantics; it includes a library ecosystem, tooling, and community. Does Natalie aim to grow a community, or will it exist fully within the Ruby ecosystem?


In a strict CS sense, languages are defined by what they accept, and how they parse what they accept.

Given that this will only accept and correctly parse a subset of the Ruby language, in a strict CS sense, it accepts a different language to Ruby.

In time, this difference may shrink (as Natalie becomes more complete), become larger (as Ruby gets more fully featured) or diverge so that Natalie isn't a subset (e.g. if Natalie decides for some reason to parse some Ruby construct differently to Ruby).


This isn't the first time something like this has been done; for a while Facebook used a PHP to C++ compiler before switching to compiling it directly to native code via a JIT:

https://en.wikipedia.org/wiki/HipHop_for_PHP https://en.wikipedia.org/wiki/HHVM


> because it seems like this language is not distinct from Ruby at all

I don't think it claims to be - it's a compiler for Ruby.


The domain is Natalie-lang.org and the header on that page is “Natalie Programming Language”, so that’s where I’m a little confused. Maybe the goal isn’t to be a Ruby compiler but to transition into something more, so I was wondering if anyone had any idea.


And the text immediately underneath the header is “Natalie is a work-in-progress Ruby implementation, compiled to C++”.


In the JavaScript world this would be called a transpiler. Instead of compiling a newer version of JS into an older one, this compiles Ruby into a totally different language, C++. However, the Ruby interpreter (MRI) is written in C, so C++ is less far from it than, let's say, Java. I didn't check the code, but I wonder if they inlined some code from MRI.


It is a Ruby implementation if I got it right, not a Ruby-like language.

Not the first and probably not the last one (JRuby, Rubinius, MacRuby, etc.).


This heavily reminds me of Nuitka, a similar project for Python.


I was under the impression this can already be achieved via TruffleRuby compilation into a native image? Not that I've used it, but I thought this is doable and the process is well tested.


autoconf in 2021, instead of something more humane like CMake, Meson, Bazel, or build2. Why?


Why's it matter? It's solid, exists everywhere, and works, and contributors don't have to learn the $buildsystem of the month.


autoconf is *nix-only in practice, macro-based, verbose, and carries a significant legacy burden. I have suffered plenty of pain trying to build the few projects that still use it. In such cases it's often been faster to write my own build scripts for some other, newer build system.

CMake and Meson I've used across tens of different projects; there's solid online documentation, significant usage, a good underlying language, and I already have my toolchain files ready. Using autoconf instead of something more common and modern is likely to turn people off from contributing to or using your project, as they'll be unfamiliar with it and will have more trouble solving issues on their own.


Ugh, these names.



