Go-style coroutines will still be more efficient and more elegant than C++ corou...

pcwalton · on March 13, 2017

> Nishanov dismisses stackful coroutines out-of-hand because to be memory efficient one would need relocatable (i.e. growable) stacks. But that begs the question of how costly it would be to actually make the stack relocatable. I would assume that in a language like C++ where stack layout must already be recorded in detail to support exceptions, that efficiently relocatable stacks wouldn't be too difficult to implement.

No, this is a common misconception. Unwind tables only store enough information to be able to locate each object that has to be destroyed. It's an entirely different matter to move those objects, because a system that does that has to be able to find all outstanding pointers to an moved object in order to update them. It's legal, and ubiquitous, in C++ to hide pointers to objects in places that the compiler has no knowledge of. Thus moving GC as it's usually implemented is generally impossible (outside of very fragile and application-specific domains like Chrome's Oilpan).

The best you can do in an uncooperative environment like that of C++ is to allocate large regions of a 64-bit address space and page your stacks in on demand. (Note that this setup is gets awfully close to threads, which is not a coincidence—I think stackful coroutines and threads are essentially the same concept.)

mkup · on March 15, 2017

> I think stackful coroutines and threads are essentially the same concept.

I think "fibers" is more correct term here: "stackful coroutines and fibers are essentially the same concept".

In Win32, each process consists of threads, and each thread consists of fibers. Each fiber has its own stack. Userspace code can switch fibers with SwitchToFiber() API, kernel never switches fibers (kernel's unit of scheduling is entire thread).

Matthias247 · on March 13, 2017

Is there actually some profound research material or experience from larger projects on whether stackful [g|c]oroutines or stackless ones performs better? I'm wondering about this for quite some time and would be very interested in some more facts about this.

During the last years the stackless style in form of eventloops with promises, continuations and async/await sugar got a lot of traction. Often the authors argue that it's better for performance, since no stacks need to be allocated, switched or resized and the only overhead is the object which is allocated for the statemachines - which is true for sure. However these systems have still lots of dynamic allocations, often one or more for each point where a suspension happens. With goroutines in contrast there should be no need for dynamic allocations in general. And also less need for dynamic allocations in general, since lots of state can be put on the stack instead of being stored in various statemachine objects in the heap. I guess data locality might also be a plus for goroutines, instead of having state possibly spread all over the heap it's in a smaller region which is the goroutines stack. But of course the downside is that these stacks need to be allocated and sometimes resized (however I think for lots of goroutines that either won't happen or won't happen often).

dbaupp · on March 13, 2017

> However these systems have still lots of dynamic allocations, often one or more for each point where a suspension happens. And also less need for dynamic allocations in general, since lots of state can be put on the stack instead of being stored in various statemachine objects in the heap. I guess data locality might also be a plus for goroutines, instead of having state possibly spread all over the heap it's in a smaller region which is the goroutines stack

Rust's futures-rs library seems to only require a single allocation for a whole "task", rather than at every suspension point. This implies that that all of the relevant persistent-across-suspensions data is in one object on the heap, in a contiguous allocation that is reused for the whole processing (similar to a goroutine's stack).

> - None of the future combinators impose any allocation. When we do things like chain uses of and_then, not only are we not allocating, we are in fact building up a big enum that represents the state machine. (There is one allocation needed per “task”, which usually works out to one per connection.)

- https://aturon.github.io/blog/2016/08/11/futures/

Matthias247 · on March 14, 2017

It's certainly true that Rusts future model (which is readiness and not completion based) is the one that requires fewer allocations - e.g. compared to JS promises or C# Tasks. There seem to be some in the Task structure for parking/unparking. I don't have any idea how often these happen as I don't have a deep understanding of that code. Besides that I think most more complex scenarios will use boxed futures, as combining the raw futures yields more and more complex future types which at some point won't combine anymore unless there's some type erasure (through boxing) - or probably with intense effort and knowledge that the average user won't have. The comparable thing to lots of these combinators with goroutines would just be normal control flow (loops, conditions, etc.), which will not allocate. Then there's some things where I'm not certain about how they work out with Rusts approach: E.g. let's say I have some kind of parser with 20 input states that, and for the duration of each state I would need some temporary variable of 500bytes. With the stackful approach I would enter a new scope (maybe in form of a function call), put the temp variable and the stack, work with it in the blocking call, exit the stack and the memory is freed. My understanding is that with Rusts state machine approach where everything is allocated upfront I would need to reserve the 500bytes in each of the child futures which handles the step and the total future would combine those, which results in 20*500bytes. Or are these coalesced since the combined future can be represented as an enum/union?

wahern · on March 13, 2017

It's still using callbacks, though. So while there may be a single allocation, code conciseness suffers.

Using Rust futures' and_then, or_else, etc interfaces, try writing a parser that consumes data byte-by-byte, where reading each input byte might block. (Similar to using fgetc() in a C state machine.) It won't be very easy to understand, especially compared to threaded code.[1] Doing that with coroutines, even stackless coroutines, would be much more clear.

That's the problem with futures. If futures were so great, we'd use them for all control flow. There's a reason we don't, just like there's a reason we don't use purely functional languages everywhere. But at least purely functional languages are consistent, and usually have other features (like lexical binding and garbage collection) to make alternative control flow patterns easier and more concise.

Ask yourself this: if resizeable, relocatable stacks were a freebie (i.e. already provided, or because there was 0 desire to support a pre-existing calling convention and most of the instrumentation needed for moving stacks was preordained by exceptions), would you ever not implement stackful coroutines? The future pattern API could be easily implemented as a library for when it made sense. Decorating calls with "async" would become superflous, but people would still be free to do that. Message passing and consumer/producer inversions would become free and natural coding patterns. Stackful coroutines are clearly the more powerful abstraction. But implementing them is difficult, especially in the context of legacy environments (e.g. C runtimes) and so people compromise and often choose a less powerful abstraction.

[1] In reality you're likely to use a mismash of patterns and language features because futures will just be too intrusive of a construct for fine-grained control flow manipulation. All of that adds to the complexity of a language and to the solutions you implement in the language.

Back when C10K was first written, the big debate was between threaded vs asynchronous (implicitly callback based) code. In terms of performance asynchronous won the day, but I think the threading proponents were always correct that threaded code was more natural, more clear, and less buggy (ignoring preemptive multithreading). Even though I've spent over 15 years writing asynchronous I/O networking code, I think the threading proponents were correct. For control flow, threading is the best default model and should be the gold standard. Stackful coroutines are similar to threads in terms of how they relate to normal calling conventions and function types. When it comes to the theoretical code conciseness and performance possibilities that's no coincidence.

int_19h · on March 14, 2017

Once you have callbacks, it's not hard to wrap it in some nice syntax sugar that makes it look like regular code, more or less. Or did you mean something else?

Yes, I agree that it's a matter of "legacy environments". But the thing about it is that they will remain non-legacy for as long as the purported successor is not determined. The C model that everyone currently uses as the lowest common denominator persists precisely because of that - it's what everyone has been able to agree on so far. Beyond that, not so much. So Go-style coroutines may well be awesome, but it doesn't matter because of all the other languages that have incompatible models.

Futures, on the other hand, work for this purpose, because you can represent them even on C ABI level. Granted, it's a fairly expensive abstraction then, but it works. Look at WinRT, where futures are pervasive in platform APIs and code that consumes them, and it all works across .NET, C++ and JS, with a single future ABI underneath all three.

stymaar · on March 14, 2017

> Ask yourself this: if resizeable, relocatable stacks were a freebie (i.e. already provided, or because there was 0 desire to support a pre-existing calling convention and most of the instrumentation needed for moving stacks was preordained by exceptions), would you ever not implement stackful coroutines?

Don't stackful coroutines necessarily mean you need a runtime (with a GC) that is able to grow the stack ? Or am I missing something ?

slrz · on March 14, 2017

Why would you need a garbage collector?

To allow for stack growing, the compiler would insert a small number of additional instructions into function entries that either do nothing or call out to a runtime library function that handles stack growing.

Nothing conceptually new in here, that isn't already done by compilers to deal with exception handling, initialize variables or perform division on machines that don't do it natively.

Sure, it requires some runtime support but so does almost every C or C++ program in existence.

edit: Take a look at the split stack support implemented by GCC: https://gcc.gnu.org/wiki/SplitStacks

dbaupp · on March 14, 2017

FWIW, Rust used to use split stacks, as implemented by LLVM, but they had problems like unpredictable performance (if one happened happened to be unlucky and end up repeatedly calling functions across the split boundary, each of those calls ends up doing far more work than desireable). Relocating stacks is a whole other issue, and almost impossible in the model of languages like C/C++/Rust where there's no precise tracking of pointer locations: the information in exception handlers (approximately) tells you were things are, but nothing about where pointers to those things are.

slrz · on March 15, 2017

Yes, copying stacks are unsuitable for unrestricted C or C++ code. Segmented stacks might have problems (Go also switched away from them for the same hard-to-predict performance reason) but when you can't do stack moving I still think they're attractive.

lossolo · on March 13, 2017

> Goroutines are stackful, whereas in C++ you'll need to manually chain coroutines. That means a dynamic allocation and deallocation for _each_ call frame in the chain.

You can elide that heap allocation. Coroutines proposed for C++ are like functions, when you call function inside goroutine you also allocate function frame, as you allocate frame for coroutine.

Every gorutine has it's own stack which means you also allocate minimum of 2 KB for every gorutine. When stack grows over 2 KB you need to copy + reallocate in Go.

Coroutines are like functions, easily optimized by compilers which Gor showed. You can get coroutines inlined and state machines are not so easily optimized.

> At the end of the day, without stackful coroutines networking still won't be as elegant as in Go. And for maximum performance, state machines (e.g. using computed gotos) will still be useful. You'll either need to sacrifice code clarity and composition by explicitly minimizing chains, or you'll sacrifice performance by having deep coroutine chains.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p005...

EDIT: Citation from Gor discussing this in more detail:

"In an ordinary function, locals are stored in the registers and in the activation frame on the stack. Coroutines, in addition to the activation frame/registers, have a separate coroutine frame that stores things that need to persist across suspends. Activation frame is created every time coroutine is entered and torn down when it is suspended or completed. It looks as a normal function call and return. Coroutine frame is created the first time you activate a particular coroutine instance and persists across suspend until a control flow reaches the end of the function or hits a return statement.

In general, a call to a coroutine (with suspend points) is more expensive than a function call, since we need to allocate both the activation frame (sub rsp,X) and a coroutine frame (operator new). When the coroutine lifetime is fully enclosed in the lifetime of the caller we simply put the coroutine state as a temporary on the frame of the caller, thus avoiding heap allocation.

To summarize:

    suspend/resume = cost of the function call

    coroutine call/return >= const of the function call

"

wahern · on March 13, 2017

Coroutines proposed for C++ are indeed, "like functions". But they're not functions. It matters.

While the proposal will allow more optimization than the existing alternatives, they won't allow all the same optimizations possible as for stackful coroutines. For example, you can't inline a non-stackful coroutine that has independent state. But a stackful coroutines could be inlined. If all you care about is using coroutines as an interface to an async API, then it's not a big deal. But that avoids questions of how and at what cost you could make use of coroutines to implement the API. (You know, the _hard_ part!) And without experience using coroutines in languages like Lua, you can't even _imagine_ all the possible programming patterns you're missing out on.

As a general matter I think it's really short-sighted to add a language abstraction and only use as your frame of reference one niche usage pattern--namely, async completion. As an abstraction coroutines are _much_ more general than that, but they're unlikely to be used in those ways if they're not transparent--that is, if they're a leaky abstraction that constantly forces you to keep in mind various tradeoffs.

I don't see why the reality needs to be controversial. Stackful coroutines are just more powerful. Full stop. But that doesn't mean non-stackful are useless or shouldn't be added to C++. Just don't expect to see the same dividends that you could theoretically get from stackful coroutines, and definitely not the same dividends as from Goroutines. Don't forget that Go has first-class functions with lexical binding and garbage collection. All of those language features compound the benefits of stackful coroutines.

gpderetta · on March 13, 2017

> For example, you can't inline a non-stackful coroutine that has independent state

I'm a huge fan of stackful coroutines (having written multiple variants myself) and I have been lobbying hard for them. Having said that, stackless coroutines, as long as the next continuation is statically known or knowable, are trivially inlineable (they desugar to the equivalent of a switch statement).

lossolo · on March 13, 2017

I am not saying stackful coroutines are bad, I'am using Go on daily basis, I've used libmill stackfull coroutines in C etc. They have their usage cases in which they will be better than stackless coroutines, there are tradeoffs in both designs of coroutines. Stackless coroutines for my use case they are ideal (async IO) and can be easily optimized. I don't agree with what you wrote about using coroutines (and similar concepts) for async completion is niche. In most code bases I saw coroutines/tasks/gorutines etc are used mainly for async IO.

jstarks · on March 13, 2017

I thought I read somewhere that Gor's coroutines could combine allocations as a compiler optimization in some cases. In particular this seems like it would be useful when the compiler can statically recognize deep, unconditional branching.

I can't find a link for that right now, though. Do you know anything about this?

jevinskie · on March 14, 2017

I asked Gor if you can use LLVM's coroutines without a runtime. He answered in the affirmative (subject to some constraints), which I take as evidence, along with his presentations, that these optimizations do indeed take place.

https://www.reddit.com/r/cpp/comments/3p3uva/cppcon_2015_gor...

throwaway91111 · on March 13, 2017

Nit, that raises the question, you aren't begging anything.

LeanderK · on March 13, 2017

is this a limitation of C++ oder LLVM?

wahern · on March 13, 2017

It's a limitation of C++. The reason is to avoid breaking interoperability with C and the hosts' existing calling conventions.

They probably made the right decision. C++ is already heavy on dynamic allocation, so I don't really see it being a significant issue by itself.

But from a code composability standpoint there's a huge gulf between stackful and non-stackful coroutines. People don't see it because it's actually a rare feature supported by only a few mainstream languages (Lua, Go, Modula). It's hard to appreciate something you've never really used.

I like to think of stackful coroutines as similar to functions as first-class values. When you program in a language that supports functions as first-class values--especially with lexical binding--then function composition becomes much more natural. Similarly, when coroutines are stackful you'll use them more naturally, rather than relegate them to niche problems like async I/O. Put another way, when you don't need to decorate calls to coroutines, or when coroutines don't need to have a special type, coroutines become just like any other function. In particular, they can become blackboxes again, where you can use them as appropriate without the caller having to cooperate.

If all functions were first-class values in C++, then non-stackful coroutines would be a real loss for the language because it would break symmetry. Because C++ doesn't have first-class functions, it's already normal to juggle different kinds of functions or function-like objects, so programming patterns are unlikely to change very much. So in that sense it's not that much of a loss, but effectively a net gain.

I'd just like people to realize that not all coroutine implementations are the same. Many of the more elegant and seamless constructs just aren't possible with non-stackful coroutines, in the same way that some of the more elegant function composition patterns just aren't possible without functions as first-class values. And when writing languages from scratch it would be really nice for people to pay attention to these differences. Stackful coroutines are not easily added to a language design or implementation after the fact. It's why Python and JavaScript don't have them--because the existing implementations assume a fixed, contiguous C stack all over the code base, and refactoring those assumptions away is impractical, if not impossible without effectively a full rewrite.

gpderetta · on March 13, 2017

> It's a limitation of C++. The reason is to avoid breaking interoperability with C and the hosts' existing calling conventions.

Nothing to do with that. Stackful coroutines in C++ have existed for decades and are proven technology. In fact the push for stackless coroutines is because they are perceived to be better optimizable.

> They probably made the right decision.

no decision has been made, while MS proposal is in a more polished state, there is a lot of pushback and the other proposal is still under consideration.

> C++ is already heavy on dynamic allocation

what??

digler999 · on March 13, 2017

> whereas in C++ you'll need to manually chain coroutines.

Boost has stackful coroutines:

http://www.boost.org/doc/libs/1_63_0/libs/coroutine2/doc/htm...

wahern · on March 13, 2017

But without relocatable stacks you're incurring a large, fixed cost per coroutine.

Of course, with stackful coroutines you'll have a lot less coroutine objects. But the real issue is that you have to decide ahead of time how large you want to make the stack. Too large and you waste memory, too small and you'll blow the stack. In the context of individual projects the programmer can usually make the call without much trouble. But removing that kind of burden from the programmer is usually the primary reason you add such abstractions to the language standard. And it's why the typical execution stack is so huge (1+ megabytes) on most non-embedded systems (and even many embedded, network-facing systems).

gpderetta · on March 13, 2017

boost.coroutine can be optionally used with segmented stacks. Copyable stacks are a problem in C++ as objects can have custom copy semantics.

Anyway, large stacks (assuming you want at least 1 page per coroutine) do not waste memory, they only waste address space which is plentiful. If you want slim coroutines, there are the shallow generator style coroutines of the other proposal.

There is some hope that we will have the best of both worlds, stackful coruoutine semantics plus annotations to opt in to the stackless coroutines optimization.

wahern · on March 14, 2017

  There is some hope that we will have the best of both
  worlds, stackful coruoutine semantics plus annotations to
  opt in to the stackless coroutines optimization.

That would be pretty amazing. Are there proposals or committee discussions you could link to? If C++ got stackful coroutines I might finally make the switch from C.

medi11 · on March 15, 2017

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p053...

call/cc is the basic mechanism - useful to implement stackful coroutines and fibers (cooperative multi-tasking)