Game Programming Patterns: Event Queue (2014) (gameprogrammingpatterns.com)
264 points by tosh on Aug 21, 2020 | 123 comments


Reminds me of the excellent post about the architecture of multi-player in the original Age of Empires game:

https://www.gamasutra.com/view/feature/131503/1500_archers_o...

You had high-ping 28.8k internet, and you wanted to be able to have immersive battles in a world with hundreds of units controlled by several players. There's no way you could send real-time information about the status of every object in the game. So they didn't. They made the game simulation completely deterministic and then simply shipped user input around on the network. User input was processed on a slight (couple hundred milliseconds) delay (to allow for network transmission speeds) and executed simultaneously in "game time" on everybody's machine, so the simulation stayed in perfect sync. Voila, no worrying about lag.
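To make that concrete, here's a minimal Python sketch of such a lockstep loop. Everything in it is a hypothetical stand-in (the sim object, local_player, and the send_to_peers/receive_from_peers networking helpers), not AoE's actual code:

    INPUT_DELAY_TICKS = 4  # execute inputs a few ticks in the future (the "couple hundred ms")

    def run_lockstep(sim, local_player, max_ticks):
        scheduled = {}  # tick -> list of (player_id, command)
        for tick in range(max_ticks):
            # Schedule our own input for a future tick and ship it to everyone.
            cmd = local_player.poll_input()
            if cmd is not None:
                send_to_peers(local_player.id, tick + INPUT_DELAY_TICKS, cmd)
                scheduled.setdefault(tick + INPUT_DELAY_TICKS, []).append((local_player.id, cmd))

            # Collect remote inputs, each tagged with the tick it should run on.
            for player_id, exec_tick, remote_cmd in receive_from_peers():
                scheduled.setdefault(exec_tick, []).append((player_id, remote_cmd))

            # Apply everything due this tick in a canonical order, then step.
            for player_id, due_cmd in sorted(scheduled.pop(tick, []), key=lambda pc: pc[0]):
                sim.apply(player_id, due_cmd)
            sim.step()  # must be fully deterministic on every machine

Only the tiny commands cross the wire; because every peer applies the same commands on the same tick to the same deterministic sim, the full game state never has to be sent at all.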


Fighting games take this one step further through "rollback" netcode. Instead of delaying the game while waiting, the game predicts what the opponent's input is going to be. If it turns out when the input arrives that the prediction was wrong, the game state is rolled back, the correct input is applied, and the game fast-forwarded to where it was.

Here's an extensive writeup about the concept:

https://ki.infil.net/w02-netcode.html
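The core of the trick fits in a few lines. A minimal Python sketch, where the sim (with snapshot/restore and a deterministic step) and the per-tick input history are hypothetical stand-ins:

    def step_predicted(sim, history, tick, local_input):
        predicted = history.last_known_remote_input()   # naive prediction: repeat last input
        history.save(tick, sim.snapshot(), local_input, predicted)
        sim.step([local_input, predicted])              # don't wait for the network

    def on_remote_input(sim, history, input_tick, real_input):
        frame = history.get(input_tick)
        if frame.predicted_remote == real_input:
            return                                      # guessed right; nothing to fix
        sim.restore(frame.snapshot)                     # roll back to the mispredicted tick
        frame.predicted_remote = real_input             # correct the record
        for t in range(input_tick, history.current_tick()):
            f = history.get(t)
            sim.step([f.local_input, f.predicted_remote])  # fast-forward to the present

Since players hold a direction or button for many frames at a time, "repeat the last input" is right far more often than not, which is a big part of why rollback feels so much better than input delay.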


Many realtime online games use the same techniques today. For a very understandable walkthrough of the concept, I recommend Gabriel Gambetta's excellent 4-part series: https://gabrielgambetta.com/client-server-game-architecture....


Interestingly enough, your brain does similar things, playing games with temporal relationships and fixing it all back up after the fact in a way that you normally can't notice.

https://en.wikipedia.org/wiki/Chronostasis


I was talking to someone about a related topic a couple weeks ago: how the brain "fudges" the experience of time to compensate for the latency of perception. But I couldn't remember what it was called.

Thanks to the link above, I found the exact phrase: neural antedating/backdating. Ah, it's nice to have resolved that question.


Except Super Smash Bros, which has the worst netcode and online play possible. Nintendo is this weird combination of incompetent and brilliant that I don't really understand.


Nintendo's track record with online is truly bizarre.

They clearly have some phenomenal tech chops in the company. Iwata hand rolled a compression algorithm and Nintendo were in the internet game waaaaay early and had aspirations of being an ISP with their Famicom Modem in the 80s.

That said, their web infra just seems so janky and backwards - I've heard excuses like the friction wrt online gaming keeps it family friendly and helps disincentivize trolling.

Nintendo has also always been more focused on in-person social play, which works well in Japan, where console market penetration is very high, gaming in public is normal (both socially, and because public crime/mugging barely exists) and population density is high. Monhun (Monster Hunter culture) is a great example of a JP cultural gaming trend that didn't export nearly so well outside of Japan (it did ok-ish in Asia). [--aside: Must have been so ace to have monhun parties after work /sigh--]

If Nintendo could optimize their estore for UX (think Amazon or the Apple App Store / Google Play Store), I feel like they would be a money-printing machine.

Of Sony, MS, and Nintendo, Nintendo is the only pure games company, and those marketplaces are fantastic cash cows...


I remember, when Slippi (the community rollback netcode implementation for Melee) came out, hearing/reading the creator mention that Japanese fighting games don't bother with rollback because it's more work, and Japan is small enough and well-wired enough that latency is never a problem. Contrast with fighting games that come out of the US, which normally implement rollback.


Yeah, the Japanese seem to operate under the impression that the entire world lives in Japan. I'm a semi-pro Super Smash player and I've desperately wanted to play it online since, like, forever. But every version, the lag is unbelievable. It's always some p2p bullshit where if one person lags, everyone lags.

I remember when the Wii U Smash was coming under criticism for the unbearable lag, the developers released a statement in which they literally said: we tested the online play in development, and it works great. Are you fucking serious? How can they care so little?


It should be noted that Japan is not the only country that frequently disregards technical constraints, units of measurement and other cultural circumstances from outside one's own country or seaside metropolitan area.


I recall that Supreme Commander used the same technique, a fully deterministic sim run in lockstep by all multiplayer peers. For its era, it had very, very high scale for an RTS game (thousands of units and enormous game maps, the biggest of which took 15+ mins to send naval units across).

If I recall correctly, there was another benefit. You could save a multiplayer game by just having all peers dump the current game state to disk and resume it later. Since a big SupCom game could take HOURS this was a great feature.

However, there were 2 issues with SupCom's implementation.

1. It wasn't perfect, and sometimes, hours deep into a big game, you would get a "desync" error that was unrecoverable.

2. Everyone needed to be able to run the sim at the same speed. So the game could only progress at the speed of the slowest player's computer. And SupCom was VERY compute intensive.


> For its era, it had very, very high scale for an RTS game

> So the game could only progress at the speed of the slowest player's computer. And SupCom was VERY compute intensive.

It might surprise you to learn that these statements are still true! SupCom:FA (still playable by buying a copy and using the third-party FAF launcher; they have even fixed most of the desyncs, it's very stable these days) is probably still the largest-scale RTS out there (even Planetary Annihilation only feels larger). And it still requires a top-tier gaming rig to play.

The reason for this is pretty straightforward: CPUs actually haven't gotten much faster for well-programmed, cache-aware, tight-looped, CPU-bound operations in the last 15 years. And in fact there are arguments for why something like SupCom:FA would be difficult to make today given the industry's reliance on game engines. And games are still rarely threaded more than SupCom was (e.g. game logic thread, render thread), so until someone can get back to the low-level excellence of SupCom:FA and then design and implement a concurrent game logic loop for it that actually parallelizes well, I think it will remain the king of high-scale RTS for quite a while.

https://youtu.be/DEhp5eCwgSI?t=3304 (to see why this game is awesome)


Total Annihilation (unsurprisingly) had the same speed-limited-by-slowest-computer thing so I would imagine it was similar.

It always made me a bit sad that the comparatively low-scale Starcraft seemed to suck up so much of the space in the genre.


I had no idea that supcom worked that way! Thanks for the insight.

Your comment brings back fond memories of playing huge supcom games at a local LAN cafe. Those were some of the best times I've had gaming.


I couldn't count the number of hours I spent playing that game with friends. Never played anything else like it!


StarCraft 2, the modern "king of RTS games" so to speak, still uses that model! Determinism based on user input, plus a deliberate lag (~150ms?) in handling user input to allow time for player inputs to be received.


And the original Starcraft 1 was an early pioneer of the technique, fwiw. They even let you set the input delay in the options to account for lag.


Aren't we saying that all of the popular "Real Time Strategy" games are in fact turn-based, but there are 7 turns per second?


Hah, nope. A delay of 150ms doesn't mean that events only happen every 150ms. Also even if events were sent in 150ms chunks, the offset of time within that chunk would still matter.

With that said, most games do have a maximum tick rate, so in a very misleading way you could say they are turn-based ;)


If we keep a complete record of all player inputs we can also replay the exact game after the fact, a great feature that arises as a consequence of this design choice.


A great feature not only for players, but for developers as well. If every change to game state goes through an event system you can reproduce any bug or crash as long as you're careful about making sure that the event stream is recorded before the process is killed.
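A sketch of what that looks like in Python, assuming a fully deterministic sim object (hypothetical): recording is just appending events to a log, and replay is just re-feeding them.

    import json

    def record(path, events):
        # events: iterable of (tick, event) pairs; flush line-by-line so the
        # log survives the process being killed mid-game.
        with open(path, "w") as f:
            for tick, event in events:
                f.write(json.dumps({"tick": tick, "event": event}) + "\n")
                f.flush()

    def replay(path, sim):
        by_tick = {}
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                by_tick.setdefault(entry["tick"], []).append(entry["event"])
        tick = 0
        while by_tick:  # re-run the exact same simulation, bug and all
            for event in by_tick.pop(tick, []):
                sim.apply(event)
            sim.step()
            tick += 1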


This ended up being an enabling technology for e-sports too, didn't it?

The observer gets the replay log (cheap), and broadcasts the pixels out (expensive) to the world with very little impact on the game play for the active participants.


No one watches sc2 matches by loading up the game and running a live replay -- it's all video streaming/streamcasting. Casters do use a live (delayed) replay, letting them poke at any arbitrary object in the game, but a video stream of that is what gets broadcasted.

Low-bandwidth broadcast isn't really relevant to esport success, I think -- the "complete" replay aspect, even for just 1 viewer (the caster), however, might be.


I think you're in agreement with the post you're replying to.

You both say that the game state is sent to a non-player machine that creates the actual video stream seen by the viewers.


Sort of -- I'm saying that for streaming, this doesn't matter much; deterministic lockstep does little for the general viewership, and saves relatively little processing compared to recording the stream directly on the players' machines.

Much more notable is the impact it has on casting, where its presence is clearly felt -- casters are granted much more freedom, and can be much more thorough and descriptive, when complete replays are available (versus a simple dumb video).

We agree about the technology -- we disagree on its impact.


I built an implementation of this architecture in Python if you want to see how it works in code:

https://github.com/eamonnmr/OpenLockStep


Somewhat related, one of the developers discussed some of the issues he faced when working with the re-mastered "Age of Empires: Definitive Edition". https://richg42.blogspot.com/2018/02/some-lessons-learned-wh...


That is what you try to do with most games even today, partly to make it harder to cheat, and for various other reasons (easy game replays, it's the natural thing to do, etc.)

There are times gameplay needs don't allow that model to be perfect, but it is the norm.


Afaik it's only the norm in RTS games -- almost everything else is client/server, with the server providing (and having authority over) the full game state. In that case the client is just a view into the server simulation -- logic is moved to the client only so far as to enable faster input response through local interpolation (with provisions to undo it, if the server decides that actually wasn't allowed), and that's also where information may be leaked to enable cheating.

The main reason for the deterministic lockstep architecture is the cost of sending game state -- in a game with 10 player characters and some particles, it's sufficiently cheap. In a game with 10,000 player characters, not so much.

Determinism is a bitch to maintain, so afaik, no one tries to do so unless they must.


Nice! I used this approach for a turn-based game engine, but with the input processing lag it totally makes sense for real-time games as well!


If anyone is curious, one of the fastest multi-threaded queue implementations out there is the LMAX Disruptor.

https://github.com/LMAX-Exchange/disruptor

https://github.com/LMAX-Exchange/disruptor/wiki/Performance-...

I've started using a variant of this in my .NET Core projects and have found the performance to be astonishing.
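For the curious: the Disruptor's core idea is a preallocated ring buffer addressed by ever-increasing sequence numbers, so slots are reused instead of allocated and no locks are taken on the hot path. A toy single-producer/single-consumer version of just that idea in Python (the real thing is lock-free Java with memory-barrier subtleties this completely glosses over):

    class RingQueue:
        def __init__(self, capacity=1024):
            assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
            self.buffer = [None] * capacity  # preallocated once, slots reused forever
            self.mask = capacity - 1
            self.head = 0  # next sequence the producer will claim
            self.tail = 0  # next sequence the consumer will read

        def try_publish(self, item):
            if self.head - self.tail == len(self.buffer):
                return False  # ring full: producer spins/waits instead of allocating
            self.buffer[self.head & self.mask] = item
            self.head += 1    # publishing is just bumping a counter
            return True

        def try_consume(self):
            if self.tail == self.head:
                return None   # nothing published yet
            item = self.buffer[self.tail & self.mask]
            self.tail += 1
            return item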


Jonathan Blow warns against threaded queues in game development, as normally simulating your world isn't the bottleneck (rendering is) and it will just cause a fair bit of unexpected behavior/debugging


statements like this really need to be put into context. Maybe that's true for his games, but it's not necessarily true for all games. His latest game, The Witness, is a first-person puzzle game with zero non-player characters, no kinematics, and a variety of puzzles based around light and rendering. He designs games that don't have much to simulate and do have complicated rendering situations.

Meanwhile, Doom Eternal has no main or render thread at all, and instead uses a massively parallel jobs system. https://www.dsogaming.com/news/doom-eternal-does-not-have-a-...


I think the majority of game devs who would get advice from lectures/hacker news comments are probably making games on a small enough scope/scale that choosing a single threaded game logic engine is fairly reasonable. The people at Bethesda are the best of the best; this kind of reminds me of the fitness "advice" I was once given to not go on long jogs/runs because "all the best marathon runners are super skinny", but that only applies to world class marathoners, not dudes running a 5k on the weekend.


people who work in games also read HN, you know.


And said AAA devs aren't "ubermensch". They run a large gamut of specialties and skillsets - plenty of them will benefit from extra context.

And some of those small scale hobby/indie devs may later end up working for "the big leagues" as well, so the extra context can benefit them too.


tbf I said take advice from HN. I expect people deeply involved in games to have their own specialized info us mortals can't touch (or they just do their own testing)


I'd appreciate a link so I could hear his whole argument.

You can argue about when it is appropriate to use threads at all. But, if I'm going to use threads, I use a threaded queue for communication exclusively.


I wish X4 would go this route though, as it is entirely bottlenecked by simulation speed.


I imagine execution order/consistency is very valuable for 4X games, and most of the time, results are dependent on each other (who wins a battle may depend on the current status of an empire, which is dependent on the outcome of various planet level actions, for example). It'd probably be a very different game to have each action be stateless, could be a cool exercise though


Note that despite using the same two characters, X4 is definitely not a 4X game.


Whoops, my brain just ran right through the word. Looking at X4, it still looks like an incredibly busy game with a very busy gamestate


In some cases you have to. I'm working on a game and absolutely need separate threads for animation, rendering, and constructing meshes (it's kind of procedural - it's part 3D map renderer).


Sim games (eg SimCity, Sims) could spend more time on sim than graphics.


Indeed, a lot of games are actually single threaded.


In recent years there has been a trend shift away from this, at least in the AAA engines, towards a job system. This makes sense: you have a thread per core and you create jobs to “go wide” when you can. See for example Unity's Job System, or the GDC talk by Naughty Dog from a few years ago.

The big games will also prepare data for rendering in parallel (eg culling and sorting and whatnot, although much of this is also done on the GPU).

(Going by GDC talks, the rendering teardown articles and just what I see online from Unity/Unreal. I don’t work in games myself)
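A rough sketch of that "go wide" shape using Python's stdlib, with hypothetical update_entity/cull_for_render job bodies (and note that CPython's GIL limits true parallelism here; real engines do this with native worker threads):

    import os
    from concurrent.futures import ThreadPoolExecutor

    workers = ThreadPoolExecutor(max_workers=os.cpu_count())  # one worker per core

    def update_entity(entity):
        ...  # independent per-entity work: animation, AI, physics integration

    def cull_for_render(entities):
        ...  # a dependent phase that runs only after all updates have joined

    def update_frame(entities):
        # Fan out: independent updates become jobs instead of one big loop.
        updated = list(workers.map(update_entity, entities))
        # Join, then run the next phase (which could itself fan out again).
        return workers.submit(cull_for_render, updated).result()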


Which makes total sense because single thread performance is growing more slowly these days. Used to be you'd double every couple of years but today the midrange is only about 50% faster single threaded than it was in '16. Now if you count all the cores you're still seeing things more than double over that time. Compare these similarly priced CPUs from today and a few years ago: i5-6500 and i5-10500. The latter is maybe 30-40% faster single threaded but has more than double the parallel throughput.


It's true, but a lot of the main areas, like AI and physics, don't multi-thread well.


I don't know why AI shouldn't thread well, assuming there is more than one actor. As long as they are operating over an immutable view of the game state, each actor should be able to plan independently and enter its commands independently. Likewise, there are probably some tricks you can do with physics. And anyway in most games interactive physics is only done for a few objects in the game world, and those objects are often not interacting with each other, at least not physically. You could cluster the objects that can affect each other and then do each of them single-threaded.
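A hedged sketch of the AI half of that in Python (freeze/plan/apply are hypothetical names): plan in parallel against a frozen snapshot, then apply the resulting commands serially in a deterministic order.

    from concurrent.futures import ThreadPoolExecutor

    def ai_phase(actors, game_state, pool: ThreadPoolExecutor):
        snapshot = game_state.freeze()  # immutable view; planning only reads it
        commands = list(pool.map(lambda actor: actor.plan(snapshot), actors))
        # Mutation happens here, single-threaded and in a canonical order.
        for actor, command in sorted(zip(actors, commands), key=lambda ac: ac[0].id):
            game_state.apply(actor.id, command)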


> As long as they are operating over an immutable view of the game state

That's a big issue. There is a surprising amount of back and forth between objects in a single step of gameplay/AI.

And, generally gameplay code tends to be a big mess of wild and ever-changing requirements from gameplay designers, extreme time crunch and short term (1 game then burn it) goals. Ivory tower software architecture it is not...

Clustering physics into "islands" is common practice though.


I'm not a game dev but I do know how software can become a mess. I think engines that are used by many games have a chance to push good practices here.


I mean, not entirely, though the game logic often will be. Usually there is a fair bit of threading going on in the engine or in the graphics card driver.


Would be nice to see updated benchmarks against the C++ queues - this LMAX queue seems to give 20-25 million messages per second on Sandy Bridge, while the best 1P/1C C++ queue is around 250 million messages per second on a 9900K, and I doubt a 9900K is 10 times more performant than a 2600K.

> https://max0x7ba.github.io/atomic_queue/html/benchmarks.html


I am more concerned about worst-case latency than about message rates measured in the billions per second. The load those events will generate far exceeds the load incurred by creating and processing them on the same physical host, so I'll never be in a situation where 25 vs 250 million makes a difference.

I am also interested in the productivity and safety afforded by high-level languages in this arena. Dealing with memory and threading at the same time is not something I like to do at a low-level.


LMAX is optimizing latency, not throughput.


Can you give more context about your projects i.e. what makes them require a super high-performance queue?


The type of project I am using this for is a centralized client/server UI architecture where 100% of user events are submitted to the queue for processing. This allows for very high throughput user interfaces if you are doing clever things on the server WRT caching of prior-generated content for other events (i.e. all login attempts for the same region will get the same final view).

I found the abstraction this was originally developed for - processing of financial transactions with latency as the primary constraint - as an excellent analogue for UI event processing. Latency is also a huge concern when the user's eyeballs are in the loop.
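The caching idea, as a toy sketch (all names here are made up for illustration): events that would produce identical views share one rendered result.

    view_cache = {}

    def handle_event(event):
        key = (event.kind, event.region)          # e.g. all logins for one region collide
        if key not in view_cache:
            view_cache[key] = render_view(event)  # the expensive part, done once
        return view_cache[key]                    # every later event reuses the view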


And that uses Java...


If you control object pools yourself and don’t use GC, as the LMAX Disruptor does as far as I remember, Java can be blazingly fast.


Martin Thompson has basically made a career out of writing Java in the style of embedded C because he found enterprise customers that need the performance of embedded C but, being enterprise, insist on absolutely everything being Java.


calling external code from Java adds latency.


You have to make sure all dependencies don't use GC as well right?


Sure, but I'm assuming that if you're writing such high performance limited scope software like the LMAX disruptor, you have few dependencies (looking at their code, it appears that the disruptor code itself has no external dependencies and uses few of the standard library classes outside of NIO bytebuffers).


in LMAX-disruptor's case, they have no runtime dependencies: https://github.com/LMAX-Exchange/disruptor/blob/master/build...


Are you using https://github.com/disruptor-net/Disruptor-net library port or something else?


This is exactly what I am using.


Just curious, are you also making use of Span and Pipelines?


I haven't made much use of Span directly, but I do like using Pipelines for copying streams to other streams (i.e. building AspNetCore proxy abstractions).


Thanks!


One thing that makes me wonder is a company-wide event queue.

Just imagine all the things that would be possible to automate if all the events that happen inside a business environment were in a queue.

And don't even think about how great it would be if the same queue were open to outside providers.


That's called an ESB (enterprise service bus); a lot of companies have one, and there's a lot of software already in that space (IBM has quite a good market share there). Some companies do roll their own though; I've seen quite a few implementing ESBs on top of 0mq, for instance.

My previous company was doing trading and made the outgoing orders available on the ESB. It was quite handy, since all teams could subscribe to the trade topic and do their own processing: trade checks, accounting, risk, etc.

It's a bit old school though; I mainly see old corporations with that architecture. The thing is, it ages quite badly most of the time.

Fast forward 5 years, and every single team is publishing stuff on the ESB "cause it's super easy to publish and consume data from the ESB". Before you know it, you have a full team dedicated to scaling and maintaining the monster.
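At its heart the trade-topic pattern is just pub/sub. A toy in-process version in Python (a real ESB adds brokers, persistence, ACLs, and cross-language bindings on top, which is exactly where the trouble starts):

    from collections import defaultdict

    class Bus:
        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self.subscribers[topic].append(handler)

        def publish(self, topic, message):
            for handler in self.subscribers[topic]:
                handler(message)  # every interested team gets its own copy

    bus = Bus()
    bus.subscribe("trades", lambda t: print("risk check:", t))
    bus.subscribe("trades", lambda t: print("accounting:", t))
    bus.publish("trades", {"symbol": "XYZ", "qty": 100})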


Do you mean that these scaling problems happened because the ESB was misused? Or would they occur with any architecture (or architectures, if different teams used different instruments) just because of the sheer volume of data that needs to be processed by business logic?

Maybe I'm bad at phrasing this, but the point is: sometimes you have performance problems because you could write better code, or use better tech, or tweak requirements a little bit, or do something else a little smarter. But sometimes you have performance problems simply because you do a lot of actual work.


I think these problems occurred because an ESB is too easy to over-use.

Once you have it, it's very tempting to use it for everything. But of course the more you use it, the more features it needs to have.

Let me illustrate that with an example compiling some of the patterns I have come to see at companies that overly used their ESBs:

Say Bob is working at a trading company and sets up an ESB where he publishes the daily trades of the company.

Soon, the trading desk learns about that, and decides that it's actually cool to use this for trading: they just have to listen on the ESB, and send the trades accordingly. That's error pattern 1: making the ESB business critical.

The next day, the risk team learns about that ESB thing, and decides it's very handy to perform post-trading checks by just listening to the trades flowing on the ESB. So they set up a system that listens for trades on the ESB, checks that each trade is compliant with some limits, and sends another message on the ESB to let everyone know that this trade is validated. This is error pattern 2: cascading messages triggered from other messages.

The week after that, the security team learns about the ESB, and decides it's very insecure to let anyone see the trades of the company, so they start implementing access control on the ESB. This is error pattern 3: now you have an overly complex layer on top of the ESB to decide who can see what and who can publish what.

Rinse and repeat patterns 1, 2, and 3 for 5 years and here is the situation you end up with:

- The ESB is not the easy and handy system it was in the beginning. Since it has become the de facto standard for publishing information in the whole company, it has to support features for _all_ the company's use cases. There is access control to publish per topic, access control to listen per topic, and multiple bindings of varying quality for each technological stack/language that each team in the company is using. The company of course is not capable of / prepared for maintaining software of this scope, so the ESB is crippled with bugs that nobody can fix, because, you know, the infrastructure team cannot fix their groovy scripts using the ESB cause the guy that wrote them left. And the marketing team has some interns using the excel plug-in but they don't have time to rewrite them this year. The ESB is now partly unmaintained, because the company relied totally on it without having the capacity / willingness / foresight to understand how intricate it can be to update something that everyone uses.

- The ESB is now very slow, because it was so tempting to publish anything of any interest on it that everyone did. The problem is that the ESB is also critical for the company, so the whole flow of messages is now slow-moving and overflowing, requiring endless tuning and attempts at scaling it better. Of course 80% of the messages on the ESB are actually not listened to by anyone, but since nobody really knows who listens to the published messages, it's very tempting to just _not_ stop programs from publishing, ever, because god knows if some random team at the other end of the company might have a program reading those messages.

- You most likely now have an IT team dedicated to maintaining the ESB. They are squeezed and pressured by the business teams to keep the ESB fast and stable without requiring them to recode all the crap they plugged into it. On the other end, the other IT teams are pressuring them to update the ESB to support <place your language/stack here>. Of course the ESB team has no incentive to make any improvement whatsoever to the ESB, because that would definitely crash most of the crap the less technical teams of the company plugged into it. But the ESB team is the de facto guardian of the temple of the ESB, so everyone ends up frustrated by the situation.

---

I'm not sure I did a good job at explaining the various problems here, but basically, the one-size-fits-all approach that ESBs promote is often not a future-proof choice.

The reality is that you don't want your whole company coupled to a single system like that. Otherwise your system will only be as good as its worst user.


FWIW, I worked at a trading firm that had a very tidy and long-lived ESB that I found to be a joy to work with.

-but-

My experience there 100% confirms what you say about it not being something that scales well.

It was a relatively modestly sized firm by headcount, as trading firms go, and there was a corporate culture of absolutely intense inter-team and inter-departmental communication. One absolutely would not dream of subscribing to an event stream without first talking to the team that maintained the program that published it. Any changes to the event stream - both the grotty little details about what was being published and how, and the grotty little details about what was being consumed and how - would be preceded by a discussion among all the people who were working with it, to make sure that everyone involved continued to have a complete picture of the interactions involving it.

That level of communication, which I do believe was essential to the bus's long term success, just wouldn't scale to a large company. Nor could it have been maintained at a company that had a more relaxed attitude in general. Nor could it have been maintained at a company where programmers are allowed to believe that most of their time at the office can be spent with hands on a keyboard.


> That level of communication, which I do believe was essential to the bus's long term success, just wouldn't scale to a large company

What would though? An ESB could be a way to enforce a standard and slow down gung-ho devs & teams using ad hoc solutions for every problem. But say an ESB is not the solution -- what is? As the company grows, overhead and friction become more important. I'm not sure "every piece connecting to every other piece in whatever way" would help with scale, rather than compounding the problem...


IMO, there is no software solution, because it's not a software problem, it's a social problem.

As far as specific things to try go, domain-driven design is my personal favorite off-the-shelf mental framework for dealing with these sorts of things. Especially the concept of bounded contexts. Embrace Conway's Law; recognize it's not a criticism, per se, it's also a scaling strategy.


Fair enough. I agree partially: it's a social problem. Thing is, software engineering deals with social problems too: those related to development.

I've never seen DDD successfully used in any company I've worked in, but that's probably a shortcoming of my own experience. (Likewise, I've never seen TDD or Agile or lots of things people often mention in their blogs successfully used. Again, this is probably my own problem!).

addendum: to be fair I've never seen a completely working ESB either. Always a plan to build or deploy one, never the finished thing ;)


I doubt it's your own problem. The thing about methodologies like DDD and Agile is that they're not just a development practice. They're also (in my opinion much more importantly) frameworks for how the entire company interacts with the dev team.

I see them fail more often than not, too, and one thing that's consistent about every failed implementation I've witnessed first-hand is that non-developers who are involved in or influence product development weren't engaged with, bought into, or properly trained in the framework.


You did an extremely good job of explaining what the risks are.

But if everybody in the company is using it, maybe it is worth having.

It must be well thought out from the beginning, with things like proper ACLs and notification of reception (so we know what messages other people listen to), but overall it does not seem so bad.

In your example, the trading desk would either need an intern to communicate the trades or some software developed ad hoc. Similarly for the second example. The data needs to be moved from the trade to the risk analysis somehow; either custom software or some human needs to do it.

For the third pattern, I agree that it should be baked into the ESB, but you need ACLs anyway. Either you let all the people in the company see the trades or you have ACLs implemented somewhere.

I definitely can see the problems, but I can also see the benefits...


Thanks for the detailed explanation!

I'm not sure I understand why making the ESB business critical is an error. I mean, it's a key piece in the architecture of your system (say, like a database would be for an app that uses one). It makes sense that it's critical if it's central to your solution. It's also unsurprising if it requires a dedicated team to maintain and monitor, much like any piece of critical infrastructure would. Am I missing something?

What's the alternative? Multiple ad hoc, p2p, uncontrolled connections and streams between arbitrary components of your solution, of varying quality, many of which require maintenance of different kinds. This works for smaller software with fewer connections, but how does the effort scale as the system grows?


Don't get me wrong, buses/queues/pubsubs/etc systems are definitely a good thing, and can quite elegantly and efficiently solve cross processes/teams needs.

But in the case of this discussion, we should not forget the "E" in ESB. These aim at being company-wide.

In the example I wrote above, we could have a bus for flowing the trades, the risk department could use a database to store their check results, the devops could use an ad-hoc time series database for their metrics, the interns with excel could just read plain files exported from other systems, etc.

The key here is that technological diversity inside a company is not a bad thing. Sure, it does not seem as "unified", but in the long run, it gives each team its own little realm of responsibility, room for standalone technological improvements and tech stack switches, etc.

I guess overall there is a fine equilibrium to find between having a totally uniform stack used by everyone, and having each team using drastically different tools. ESBs can make you fall into an extreme without realising it, because at first, it may look like a grandiose unification plan.


> I guess overall there is a fine equilibrium to find between having a totally uniform stack used by everyone, and having each team using drastically different tools. ESBs can make you fall into an extreme without realising it, because at first, it may look like a grandiose unification plan.

This totally nails it on the head. The problem with enterprise service buses is the same problem that every major unification plan runs into. It's usually executives looking for one ring to rule them all. This misses all the complexity that is going on in the business and deprives the teams that are actually solving those problems of the independence to solve them the best way.


> The key here is that technological diversity inside a company is not a bad thing

Agreed! I love working in companies where there is margin for teams to choose their own tech (within reason). It's just that it's hard to know where to draw the line. At one company I worked at, the infrastructure team had a terribly difficult time developing company-wide solutions within acceptable deadlines because the tech and standards were all over the place, the result of a policy of "everything goes, as long as it keeps the company running". This works until you reach a point where some rule must be enforced company-wide... and then chaos ensues. Rules can be anything company-wide: tests, privacy, backups, automated checks, disaster recovery, any kind of compliance, monitoring, etc.

Agreed about your overall point though, and that the ESB is an extreme.


Why not both though? What makes an ESB more onerous than, say, providing a microservice for other departments to use?


Implicit vs. explicit interactions/contracts. Both between consumers & producers but also...

The features, constraints, goals, etc. expected out of the ESB substrate by the different use cases becomes a nightmare of responsibility/blame shifting. Durability and replay? That's the ESB's problem. Load balancing? Yep, ESB. But it's screwing up the ordering guarantees! Well, make the ESB 'smarter' and we can just keep punting all of our problems to someone else. Etc.


Agreed, but that's one side of the coin. The other is that these guarantees and constraints are clearly located in one place (and one team), instead of spread across many teams and implemented with varying maturity and seriousness. I've often seen this happen when separate teams are responsible for separate microservices, implemented in random technologies.

For this to work, the ESB must be acknowledged as the critical piece of the architecture, and the team responsible must be empowered enough.


Thanks for such a detailed answer! I've never worked with enterprise, so this is very interesting.

However my question remains: is the ESB the source of the problem here? If your company works "like that" -- with all the circumstances and limitations you mentioned -- would using different systems really bring a better outcome than an ESB?


I wonder whether these dynamics can explain many characteristics and issues of democracies and courts, as well as of totalitarian processes. In real life it is always a mix: the USA's version has a totally open public bus (Congress, the Senate, and the courts) but also a closed one (the three-letter agencies). China's is mostly closed, but open in its own way (with heavy sanctions). Just wondering.


Sounds like Kafka? (Only half joking)


It's not a joke. An ESB and event-streaming like in Kafka are very related concepts. Some details vary, but they share a lot of the same ideas and live in the same "problem space".


That's exactly the pitch for Kafka - using it as the data backbone/nervous system for a company.


Yes, BizTalk says hello.


That is exactly the premise behind Kafka and LinkedIn's (creator of Kafka) systems architecture. Here's an excellent article on how that architecture looks in practice: https://engineering.linkedin.com/distributed-systems/log-wha...


I also recommend this article! It's entertaining and informative, and a good primer on distributed logs. The reference at the end (which I won't spoil in this comment) made me chuckle.


This has been possible in the healthcare sector for many years with HL7 and, later, FHIR [1].

Essentially you have a standard protocol and a set of messages to exchange health-related information between different software systems, even if they come from different vendors. This allows (for example) a doctor with software A to ask for a test on a patient, a nurse to see it on software B, the sample to be sent to a laboratory which uses software C, and the result to be sent back to software A where the doctor can check it. Meanwhile the signed document has been stored on software D.

That said FHIR especially is more than that and allows different approaches.

[1] https://www.hl7.org/fhir/


HL7 is a shared messaging standard, much like XML or JSON. The shared standard does allow for communication between applications, but it's rather unruly.


I would say that HL7 is more like XML plus a set of XML schemas. In any case, you're right that it is only a messaging standard, but I have always seen it used in the context of having a central broker like Mirth passing messages around, so it is in my opinion a good example of what the parent poster was thinking about.

And yes, you're right, it is quite unruly (information being present in one field or another depending on the software and the like). In my experience luckily most of those issues could be handled at the broker level.


You're describing an Enterprise Service Bus, a concept that has been around and implemented for decades:

https://en.wikipedia.org/wiki/Enterprise_service_bus

Spoiler alert: it's not the beautiful, decoupled silver bullet it seems like. The concept makes a ton of sense, but good execution of it is difficult. The biggest issue I've run into using systems that use this pattern is throughput and latency. Most shops design and deploy their systems in such a way that as soon as there's a little bit of traffic beyond the norm, the whole thing gets clogged and backed up for hours.


Most of the responses are about ESB and similar, but see also the Linda language [0]. Using a shared associative memory called a tuplespace to coordinate between different processes. Each process can be run concurrently (on the same or separate computer systems) and respond as things appear in the memory by transforming the data, triggering actions, etc.

[0] https://en.wikipedia.org/wiki/Linda_(coordination_language)
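A toy tuplespace to show the shape of it (local and unsynchronized; real Linda implementations match and block across machines):

    class TupleSpace:
        def __init__(self):
            self.tuples = []

        def out(self, tup):
            # Linda's "out": publish a tuple into the shared space.
            self.tuples.append(tup)

        def take(self, pattern):
            # Linda's "in": remove and return the first tuple matching the
            # pattern, where None acts as a wildcard.
            for tup in self.tuples:
                if len(tup) == len(pattern) and all(
                    p is None or p == v for p, v in zip(pattern, tup)
                ):
                    self.tuples.remove(tup)
                    return tup
            return None

    space = TupleSpace()
    space.out(("job", 42, "render frame"))
    job = space.take(("job", None, None))  # any tuple shaped like a job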


ESBs and other abstractions are the only way many sorts of highly-complex businesses can be practically tied together and integrated.

Think about an automated factory with tens of thousands of tools, each with hundreds or thousands of events, all from differing vendors. The only rational way to tie this stuff together is with some sort of event-driven messaging architecture.


Visual effects production companies have a studio-wide queue, typically called "the queue", and entire film productions are created by digital artists and developers running jobs on their queue that create, polish or otherwise manipulate every frame of every shot of the film, plus trailers that never appear in the film at all.


I once had the idea for a global event queue, to enable new types of application interaction. Then twitter came out and I thought "oh well, that basically does what I was thinking of". Didn't quite turn out that way when they messed up the APIs, and then IFTTT did a better version of that concept...


I think that was the motivation behind the project-wide event-bus/queue that the Fedora distribution started to use:

https://lwn.net/Articles/608915/

https://fedmsg2.readthedocs.io/en/latest/

I remember having a play with it a couple of years ago, and it was nice to see the various messages flowing over it.


Enterprise Service Bus had this goal, and it mostly failed as a standard due to its complexity. Arguably, HATEOAS has a similar goal by making APIs universally discoverable.


A similar concept lives on in the form of event streaming and pub/sub.

I'm not sure how HATEOAS is related though? This is about publishing and subscribing to events (or streams of data), not about API discovery or "resources" as documents. Or did you mean something else?


Solid idea, worth thinking more about. A FIFO queue of tasks that anyone can grab from at any time.

Main drawback is that each task would have to have a proper description, context and definition of done, otherwise it's bound to fail. Once you have those, any tool would do. But usually those three things are the hard part, not the tool/technology.


That's how most job queues work (typically for processing periodic, asynchronous and heavy tasks, e.g. emails, ..)


Look into the Eiffel protocol if you want further examples besides Kafka and ESB.


I think you've described a service ticket management system.


Good article.

I’d like to add that similar queues are widely used in all multimedia, not just games.

This library https://github.com/Const-me/Vrmac/tree/master/VrmacVideo uses queues to coordinate 3 threads, one decoding video by pumping V4L2 kernel queues, another one decoding audio and playing it with ALSA, and the main thread rendering frames with GLES. Two of my queues are thin wrappers around mq_send / mq_receive Linux APIs.
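The same shape in portable Python, as a sketch: a bounded queue.Queue between a decoder thread and the render loop plays the role my mq_send/mq_receive wrappers play above (the frame source and present() are hypothetical):

    import queue
    import threading

    frames = queue.Queue(maxsize=8)   # bounded, so the decoder can't run ahead forever

    def decode_loop(decoded_frames):
        for frame in decoded_frames:  # hypothetical source of decoded video frames
            frames.put(frame)         # blocks when the renderer falls behind
        frames.put(None)              # sentinel marks end of stream

    def render_loop(present):
        while True:
            frame = frames.get()
            if frame is None:
                break
            present(frame)            # hypothetical: hand the frame to the GPU

    # threading.Thread(target=decode_loop, args=(source,), daemon=True).start()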


This is why I love game programming. You think that's what you're learning and then find out the same techniques are useful all over.


not to be overly pedantic, but events and messages mean pretty different things and have different applications. Same thing with queues and buses.

I suspect the author knows more about this than me but plays a little fast and loose in their explanations. I feel it's important to have discipline in our vocabulary so that people learning the material can participate in more advanced discussions as they grow.

That said, I really appreciate the overall content and using the context of games for broadly applicable concepts is a great way to promote them (and funny as generally commercial games have some of the loosest "structure" of any code I've seen).


> not to be overly pedantic, but events and messages mean pretty different things and have different applications. Same thing with queues and buses.

I agree with what you say. But I'd add the suggestion that this depends on the context to some extent. For example, in the world of hardware and electrical circuits, a "bus" and a "queue" have almost no connection at all. But in the world of enterprise software, an "enterprise service bus" is often thought of as basically "a bunch of queues and a message router". Of course that's imprecise as well, since it ignores orchestration... which just goes to show even more how fuzzy some of these terms are in practice.

> I feel it's important to have discipline in our vocabulary so that people learning the material can participate in more advanced discussions as they grow.

I generally agree, but I think this may be a case where "that ship has sailed." :-(


An aggregate of the previous posts:

https://news.ycombinator.com/item?id=23205511

This gets posted very often.

...it has some interesting content, to be fair, but I think there's not much left to rehash again.

It hasn’t changed since 2014.


I've seen the site before, but this is the first time I saw this particular topic linked, and it managed to highlight a solution to a problem I was having: how to handle a tutorial without putting a ton of hooks for it throughout the code. So I'm pretty grateful it was posted today.


> It hasn’t changed since 2014.

I'm pretty sure there are a lot of HN users today who weren't here in 2014.


ffs. Read the link before you post. Posted in 2014, 2014, 2014, 2014, 2013, 2017, 2018, 2019, 2020.

--> https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

Some kind of link from this gets posted every 2-3 months or so, since it came out. I appreciate you haven't seen it before... but.... this gets posted a lot here.

The previous (635 point) thread was literally 3 months ago (https://news.ycombinator.com/item?id=23203699).


Well, I've missed it. There are times I read HN on daily basis and times when I skip for weeks or even months, so I appreciate the reposting. Especially because it is an interesting read.

There are a lot of articles being posted and so even if there were a feature "here is what you missed" I would not have used it :)


For many of us, this is the first time we've seen it posted on HN.


It's the first time I've seen it


on a related topic, I have often wondered: does studying Discrete System/Event Simulation help with game design and implementation?

Does DES have some deep magical tricks, or is it just a plain common-sense type of discipline?


Overall this is a great description of queues and why they are used.


Anybody using the Actor Model for games these days?



