>Firstly, the video is low latency, which means that the time between you broadcasting and the time the video shows up on your viewers' screens can be as low as 2-3 seconds.
Excuse my ignorance, and I'm sure 2 seconds is probably an engineering feat, but I'm genuinely curious. What is it that prevents latency from going down to a few hundred ms (pretty much close to an IP round trip)?
1) If you want very low latency, any network jitter or delays will cause pauses on the viewer side and "skips" when the feed catches up after a brief dropout. This is fine for video chat, where a little blip doesn't interrupt the experience. For live streams with 30k+ viewers, it's pretty annoying and very noticeable if the audio cuts out or skips. A 2-3 second window is typically large enough to paper over any jitter or retransmits due to packet loss between the broadcaster and Twitch servers.
2) Transcoding can be done with very low latency, but it's harder to scale horizontally and uses more bandwidth than if you give yourself a few hundred ms of buffer. Larger buffers enable better compression. Transcoding is needed if you want to stream to mobile, web, etc. in multiple formats, bitrates, or resolutions.
3) Chunked HTTP content is much easier to serve than RTMP or WebRTC-style content. You can use nginx or drop your content on a low-cost CDN. The caveat is that chunking generally introduces latency unless you do something fancy such as streaming chunks as they're being written to disk.
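To make point 1 concrete, here's a toy Python simulation of how a playout buffer decides whether a delayed segment causes a stall. All the numbers are invented for illustration; real players are far more sophisticated.

```python
# Toy simulation: why a playout buffer papers over network jitter.
# Segments are produced every 1.0 s; network delay varies (jitter).
# A small buffer underruns (playback stalls); a 3 s buffer does not.
# All numbers here are illustrative, not Twitch's actual values.

def count_stalls(arrival_delays, buffer_seconds):
    """Count playback stalls given per-segment network delays (seconds).

    Segment i is produced at time i and scheduled to play at time
    i + buffer_seconds; it stalls if it arrives after its play time.
    """
    stalls = 0
    for i, delay in enumerate(arrival_delays):
        arrival_time = i + delay          # produced at t=i, delayed in transit
        play_time = i + buffer_seconds    # scheduled playout
        if arrival_time > play_time:
            stalls += 1
    return stalls

# Mostly 0.2 s delivery, with one 2.5 s hiccup (e.g. a retransmit after loss).
delays = [0.2, 0.2, 2.5, 0.2, 0.2]

print(count_stalls(delays, 0.5))  # small buffer: the hiccup causes a stall -> 1
print(count_stalls(delays, 3.0))  # 3 s buffer: hiccup absorbed -> 0
```

The tradeoff is exactly the one described above: the bigger the buffer, the more jitter you absorb, and the further behind live you sit.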
This definitely isn't ignorance, it's a very, very common question. The TL;DR on it is: cost.
The most cost-effective way of delivering video is using some form of HTTP streaming (like HLS or DASH). In a nutshell, the player downloads a manifest that tells it where to find chunks of video, which are downloaded and cached in normal, commodity CDNs. Everything is stateless and scales like any other form of HTTP download. To do all of this you end up needing to transcode the incoming stream. Every step in this process introduces latency; for reference, 20 seconds is perfectly normal HLS latency, so credit where credit's due, this is really impressive. One of my colleagues wrote about the state of low latency last year[1], and considered < 4 seconds to be "ultra low latency"...that's really rare, particularly among platforms.
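For a sense of what that manifest looks like, here's a rough Python sketch with an invented HLS media playlist. Classic HLS players buffer roughly three segments from the end of the playlist before starting, which is where the "20 seconds is normal" number comes from; this is a back-of-envelope illustration, not the spec's exact buffering model.

```python
# An invented (but valid-looking) HLS media playlist: 6 s segments.
playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:1000
#EXTINF:6.0,
seg1000.ts
#EXTINF:6.0,
seg1001.ts
#EXTINF:6.0,
seg1002.ts
"""

def min_startup_latency(m3u8_text, buffered_segments=3):
    """Rough latency floor: the player sits behind live by at least the
    total duration of the segments it buffers before starting playback."""
    durations = [float(line.split(":")[1].rstrip(","))
                 for line in m3u8_text.splitlines()
                 if line.startswith("#EXTINF:")]
    return sum(durations[-buffered_segments:])

print(min_startup_latency(playlist))  # 18.0 -- before encode/CDN delays
```

Shrinking the segments lowers that floor, which is essentially what the "ultra low latency" configurations do.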
You can get lower latency, of course, but typically that involves stateful connections. All of that cheap commodity stuff that comes with HTTP streaming above goes away, and scaling to a large number of viewers can get extremely costly (and operationally difficult).
mmcclure is way too polite to say so, but Amazon IVS is definitely not going to give you 2s latency.
Currently, IVS configured for "ultra low latency" is using HLS segments that are three seconds long. The client tries not to buffer more than one segment, so on a good network connection you'll see ~4 seconds of latency.
In theory, you could start playing the video while you're still downloading the first segment. That's how you'd get ~2s of latency. But the AWS player doesn't actually do that. And for good reason. These are TCP connections, so if there's any packet loss at all, you'll have to either buffer or skip the segment and change bitrates. Starting the video and then immediately buffering is a pretty poor user experience.
This is pretty easy to test. I just did, twice: streaming from OBS on my desktop and then directly from our compositing servers in the cloud. In both cases the latency was ~4 seconds.
> but typically that involves stateful connections [...] and scaling to a large number of viewers can get extremely costly (and operationally difficult)
This is why we built Pushpin, to make it easier to handle stateful connections at scale. The project is mostly intended for moving application data, but it does work for media streaming too. See our live MP3 demo [1]. The backend runs GStreamer in a loop to produce the audio, and has no awareness of client connections. Pushpin moves the bytes and knows nothing about audio or codecs.
You mean all those Twitch streamers see their chats more than 4 seconds after the relevant video has elapsed? And can you explain what you mean by “stateful”? Thanks
> all those Twitch streamers see their chats more than 4 seconds after the relevant video has elapsed
Yep! I think Twitch can sometimes be at 4 seconds or less for what it's worth, but yes, that delay is real. It's generally not really noticeable because the communication is totally async; the streamer is doing other stuff, finishing other thoughts, etc, then can get to messages as they see them.
> can you explain what you mean by “stateful”?
Sure! I was talking about what kind of connection is necessary to deliver the video. A lot of real-time video solutions require stateful connections to the client: once the client connects, the connection is kept open, and video data is streamed over that connection. Common examples on the web are WebSockets and WebRTC. This gets really expensive because you have to maintain a persistent connection with every single viewer, and it makes any kind of meaningful caching impossible (or extremely difficult).
Stateless connections, on the other hand, are most of your common HTTP requests. The client asks for a resource and gets it back. No prior connection setup is required; servers, networks, whatever can change between requests and everything will merrily chug along, which makes it much easier to scale.
Huh, thanks, I think I’m starting to get it. Is it that the connection-based model needs the data “copied” into each user’s stream on the server-side while the “stash the files in a bin” stateless model allows the networking hardware to cache this data somewhere in memory and just copy it on the fly onto different network links?
Stashing files (let's say each file is one second long and is just named with the Unix timestamp) makes things cost effective because now the server is just dropping files no. 1, 2, 3, 4... into a directory, and everyone is pulling them out in sequence the way any other file would be downloaded. This also lets you exploit the HTTP architecture: if you set Cache-Control: public headers on the files (which you can do happily because they'll never change), they'll be cached at lots of places along the way, like CDNs, the local ISP, the office network, etc. HTTPS blew out most of those caching benefits, but at least the CDNs can still cache the files at edges all over the world.
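A minimal sketch of that model using only Python's stdlib (file names and cache lifetimes are illustrative): a plain HTTP file server that marks the immutable segment files as publicly cacheable while keeping the playlist fresh.

```python
# Sketch: serve timestamp-named segment files with aggressive caching.
# Segments never change once written, so CDNs/intermediaries can cache
# them indefinitely; the playlist updates constantly, so it must not be.
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class SegmentHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        if self.path.endswith(".ts"):
            # Immutable segment: cache anywhere, essentially forever.
            self.send_header("Cache-Control", "public, max-age=31536000, immutable")
        else:
            # Playlist (or anything else): always revalidate.
            self.send_header("Cache-Control", "no-cache")
        super().end_headers()

# To run: ThreadingHTTPServer(("", 8080), SegmentHandler).serve_forever()
```

The encoder just keeps dropping `<timestamp>.ts` files into the served directory; everything downstream is ordinary HTTP.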
It's not just that it can be cached, but that it can use very standard existing infrastructure like HTTP CDNs, mobile browsers, etc. The limitation is that the audio/video is encoded as segments, each a few seconds long. Because of this it looks kind of like serial batch processing, with latency constraints based on the batch size (segment duration). This is in contrast to, say, WebRTC or RTMP, which are a lot closer to a multiplexed stream of data.
Because the protocol in question has support for caching through CDNs. If there are two viewers in the same city, you can reduce your global bandwidth needs by half. There are alternatives like WebRTC that don't support caching and allow sub-second latency, but they are also far more expensive.
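Back-of-envelope on the cost difference, with all numbers invented for illustration:

```python
# Origin egress for a live stream, with and without an HTTP cache in
# front. Every figure here is an illustrative assumption.
viewers = 30000
bitrate_mbps = 6                      # e.g. a 1080p stream
edges = 50                            # assumed number of CDN edge locations

no_cache = viewers * bitrate_mbps     # every viewer pulls from the origin
with_cache = edges * bitrate_mbps     # each edge pulls once, then fans out

print(no_cache, with_cache)           # 180000 Mbps vs 300 Mbps
```

With uncacheable per-viewer connections (WebRTC-style), you pay for something closer to the first number yourself; with cacheable HTTP segments, the CDN absorbs most of it.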