Over the past three weeks I've tried a few different conferencing solutions, including jitsi. I'll give it another try with this update.
My use case: I take weekly music lessons, and now they're virtual. The problem is that the DSP applied to the audio was designed for speech. If my teacher explains something and then plays an example on his bass, it usually sounds terrible, maybe even inaudible.
I send him pre-recorded MP3s of cover songs; ideally he could listen to one and I could comment in real time on the places where things could be improved. Instead, if he plays any music on his system, I hear nothing -- no music, no talk. It seems like the software thinks, "Hey, this participant is listening to non-conference audio, so I'll just mute him" (at least on Skype). I'd love a half-duplex audio button, so that none of the DSP shenanigans would be needed and a high-quality audio stream could be sent.
Firstly, thanks for your work on what is really a great project. Can we set stereo=1 in the SDP and also the bandwidth constraint? That would make it ideal for this use case.
For music-quality WebRTC you need three things: disabled audio processing, stereo=1 in the SDP, and a way to cap bandwidth usage so it doesn't saturate the available bandwidth and cause errors.
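For the second and third points, a minimal sketch of the usual SDP-munging approach, using the standard Opus fmtp parameters from RFC 7587 (`stereo`, `sprop-stereo`, `maxaveragebitrate`); the helper name and surrounding setup are mine, not any particular project's code:

```javascript
// Rewrite an SDP string so the negotiated Opus stream is stereo and
// capped at a target average bitrate. Apply this to the offer/answer
// SDP before calling setLocalDescription / setRemoteDescription.
function mungeOpusSdp(sdp, maxAvgBitrate = 128000) {
  // Find the payload type Opus was assigned, e.g. "a=rtpmap:111 opus/48000/2"
  const match = sdp.match(/a=rtpmap:(\d+) opus\/48000/i);
  if (!match) return sdp; // no Opus in this SDP, leave it untouched
  const pt = match[1];
  // Append the stereo and bitrate parameters to the matching fmtp line.
  const fmtpRe = new RegExp(`a=fmtp:${pt} (.*)`);
  return sdp.replace(fmtpRe, (_line, params) =>
    `a=fmtp:${pt} ${params};stereo=1;sprop-stereo=1;maxaveragebitrate=${maxAvgBitrate}`);
}
```

Note that `stereo=1` only tells the remote encoder the receiver would prefer stereo; the capture side still has to deliver a stereo signal for it to matter.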
Disabling video is also really the best thing to do when recording, for the same reason (bandwidth saturation), and Chromium will give you a much better experience. Safari and Firefox aren't quite there yet: Safari won't let you choose your output device and lacks some other useful features, and Firefox doesn't yet seem to allow stereo Opus -- though maybe that's changed since I tested. Microsoft Edge is now Chromium-based, so you're good to go.
Firefox has supported stereo Opus for a very long time (four years at least?). We know it works: it's used by medical professionals in their work, and they wrote us a message a few months ago thanking us for this feature, which apparently doesn't work in other browsers (according to them, though I do see open Chromium tickets).
Of course the whole chain has to be stereo, that goes without saying: the input signal is stereo, the negotiation was done in stereo, there's enough bandwidth (otherwise Opus falls back to mono), and playback happens on stereo hardware (but that's the easy part).
We hold regular flute meetings and play together. In this quarantine time we wanted to meet online, but if we all play at once, it seems I cannot hear everyone else at the same time. I guess it's as if everyone were shouting over everyone else -- which isn't an issue in a normal meeting, where usually only one person speaks at a time.
Will this also fix this issue? So everyone will be able to hear everyone?
You won't be able to play together because of latency.
You will think you are in time with someone, but you will react when you hear/see them on your screen, which is maybe .15 seconds after they actually made the sound/movement. And then they will hear/see your reaction .15 seconds later again.
If all participants have good internet and are geographically close it should theoretically be possible to have delay not much greater than rtt/2 for everybody.
With rtt < 20ms that should make musical performances possible. After all, sound only travels less than four meters in 10ms. So this is just like singing in a choir (with more visual delay - but that can be solved by having a conductor).
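That four-meters figure checks out; a quick back-of-the-envelope helper (constant and function names are mine) makes the choir analogy concrete:

```javascript
// How far apart could two musicians stand, acoustically, for a given
// one-way network delay? Sound in air covers roughly 343 m/s at ~20 °C.
const SPEED_OF_SOUND_M_PER_S = 343;

function equivalentDistanceMeters(oneWayDelayMs) {
  return SPEED_OF_SOUND_M_PER_S * (oneWayDelayMs / 1000);
}

// 10 ms one way is about 3.4 m -- roughly two rows apart in a choir.
console.log(equivalentDistanceMeters(10));
```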
Unfortunately I'm not aware of any software making that a practical reality, even with FTTH.
You're assuming that network latency is the only latency involved here, but a huge latency source is the audio codec. Opus adds ~20ms of latency, and it's the lowest-latency codec that's widely supported at the moment. You can see a comparison here: https://www.opus-codec.org/comparison/
There are all sorts of other latency that need to be taken into consideration too, and unfortunately in practice those do add up to live music being unplayable on pretty much any network.
There's a really interesting project called NINJAM https://www.cockos.com/ninjam/ which is designed for live music jam sessions. It flips this fundamental constraint on its head - instead of being real-time, it streams everyone else's output delayed by one bar (theoretically any interval >RTT I guess?). I haven't tried it, but it's a really cool idea.
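The arithmetic behind the one-bar delay is simple; this hypothetical sketch (my own naming, not NINJAM's actual code) shows the constraint on the delay interval:

```javascript
// NINJAM-style idea: don't play remote audio as it arrives; buffer it
// and start playback exactly one bar later, so everyone stays
// rhythmically aligned even over a high-latency link.
function barDelayMs(bpm, beatsPerBar = 4) {
  // One beat lasts 60000/bpm milliseconds; a bar is beatsPerBar beats.
  return beatsPerBar * (60000 / bpm);
}

// At 120 BPM in 4/4, remote streams are delayed by 2000 ms.
// This works as long as one bar comfortably exceeds RTT plus jitter.
console.log(barDelayMs(120));
```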
Of course there's a ton of other potential sources of delay that make my fantasy hard to achieve, probably already starting at the typical USB microphones (in headsets/cameras).
20ms rtt through e.g. opus on a loopback network interface is already decidedly non-trivial to achieve with "normal" hardware. When you do have low-latency devices, it becomes easy, but not everyone has those.
Musicians building digital audio workstations commonly have to replace the whole software stack to get audio latency down to an acceptable (<10ms) level: JACK instead of PulseAudio, a Linux kernel recompiled with custom options for low latency, other software reconfigured to use the JACK APIs, and so on. Sometimes they can't use standard audio hardware at all. (And remember that USB polling frequency is normally only 100 Hz: 10 ms worst-case by itself.)
Minimizing latency is certainly technically feasible, it's just hard for stupid reasons.
I haven't tried it yet, but sofasession.com seems optimized for this. Using wired Ethernet instead of WiFi can go a long way, from what I've heard. Has anyone here tried it?
Depends on the type of music: something slow and choral can easily tolerate high latencies, while something quick, rhythmic, and precise is much harder to deal with.
Has anyone tried Mumble for this? It's very low latency, but I can't find exactly how low. It of course also depends on your internet connection and other settings, but there's a base latency that comes from buffering the sound before sending. Mumble also has lots of settings for sound quality and different sound formats, so it might work for music if you try all the settings.
Mumble has a setting for the audio buffer size, and in fact it makes you set it during initial configuration. It works great, has low latency, and doesn't use much bandwidth (I hosted a server on a 1Mbps DSL connection for several people back in the day).
Latency is not that big a problem, I'd say. We play music where it doesn't matter that much, sometimes just holding one long tone for the length of everyone's breath.
I just would like to hear everybody at the same time, but what I hear is always one person's sound getting preference over others. Or sounds just alternate randomly based on the volume, I'd guess.
Musicians already deal with that kind of issue when doing particular kinds of performance (e.g. famously at Wagner's festival opera house, where the orchestra is in a deep pit below the singers).
That's not how it works. In fact there are multiple algorithms depending on the browser, it's not defined in the spec. The most used one currently would be AEC3 from Google, which is quite a bit more advanced than what you describe.
If the website doesn't want to offer a control to switch this on/off, I'm confident this can be done by a browser extension in no time (which would have the benefit to work for all websites).
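For reference, the standard MediaTrackConstraints such an extension (or the site itself) would force on `getUserMedia` look like this -- a sketch of the relevant flags, not any particular site's code:

```javascript
// Constraints for an unprocessed, music-friendly microphone track.
// echoCancellation, noiseSuppression, autoGainControl, and channelCount
// are all standard MediaTrackConstraints recognized by browsers.
const musicConstraints = {
  audio: {
    echoCancellation: false, // the main culprit for ducked/garbled music
    noiseSuppression: false,
    autoGainControl: false,
    channelCount: 2,         // ask for stereo capture if the device supports it
  },
  video: false,
};

// In a page:
// const stream = await navigator.mediaDevices.getUserMedia(musicConstraints);
```

An extension would wrap `navigator.mediaDevices.getUserMedia` and merge these flags into whatever constraints the site passes; the site's own UI never needs to know.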
padenot, although I am a programmer of sorts, I don't do web development, so I'm at a loss. Say I go to the Jitsi website (https://meet.jit.si/), type in my four-word passcode, and get a conference connection with my teacher. What you say with "instead of doing..." doesn't apply to me, because I don't do anything. It sounds like what you're describing is something the developer of that web page needs to do, and I, as a user, don't see any of that.
Audio processing is a risky move -- so hard to get right. We've been using https://team.video at work, and one thing I absolutely love about it is how they handle audio / muting.
When you're speaking, you don't have to wonder if others can hear you because your microphone pulses in green visually as you speak. If your audio isn't working it shows in yellow with no pulsing, and you and everyone else can see your audio is not flowing.
Also, if someone else forgot to mute and their kid is making a ruckus, you can just mute them. You don't have to wait for a moment to interject and ask them verbally, you can just go ahead and do it.
Or, when you see someone else in their video feed trying to speak up, but they forgot to unmute, you just unmute them. No everyone saying, "you're muted" over each other.
It takes a second to get used to the idea that everyone has all the power, but in practice it just makes everything go way smoother.
It's only scary in the same way that it's scary that anyone could kick you in the pants when you're walking down the street.
They could but they won't because we live in a society. Which is great because that means we don't have to walk around in steel suits to avoid getting kicked in the pants.
I choose to trust the people I work with every day. And then as a bonus, I don't have to hear people yelling "you're muted!" at one another. We just get on with it.
No. If you mute yourself, you are distancing yourself from a conversation. Unmuting someone is like secretly following them down the street into their house and listening in on them from behind their curtain. It is creepy and wrong and shouldn't be possible. Maybe they are having a fight with their spouse? You shouldn't be able to listen in on someone who muted themselves without them acknowledging it.
The only legitimate use case I see for this is if you are working with e.g. elderly people who have a hard time understanding the whole thing and even then it shouldn't be possible without them clicking on "Yes" explicitly.
The difference there is that kicking people on the street lands you in jail (likely not the first time, but if you do it repeatedly...), whereas remote unmuting would likely require stretching wiretap laws in unconventional ways just to get a judgment on whether it's even illegal (never mind everything else it brings with it).
Also, you're implying that one would only use this technology to communicate with people you work with every day. What about a meeting with outsiders/contractors/customers? You might not actually have those yourself, but someone usually has to do those.
I know we're getting far from the original point here, but I'm going to seize this opportunity anyway: the so-called "thin blue line" is not the reason society is able to function. We work as a collective /despite/ the presence of police and the prison industrial complex, not /because/ of them.
Exactly. In the case of a video call with your colleagues, everyone collectively manages mute states so that the group can be more productive.
Then, if one asshole starts unmuting maliciously, they get shunned real quick and then fired if they keep it up. We don't need to limit ourselves with draconian measures when social norms and expectations will already suffice.
This is why I was so opposed to Zoom when my company started adopting it a few years ago: the room dictator or whatever it's called can unmute you. (Maybe only if they had muted you in the first place.)
Here's something that a colleague passed along to a group of CS profs.
It's written by a music professor and geared toward using Zoom for music, but several of us Zoom newbies found it to be helpful more generally. He mentions the issue of disabling the speech-centric audio postprocessing.
Disclaimer: The author apparently makes his money selling eBooks, so you may want to skip through the several pages of promotional material at the beginning of his PDF to get to the good stuff.
Use TeamTalk[1]. If you need high audio quality, TT beats everything else you can find, except maybe very expensive software for radio stations. I've successfully used it to stream music and it works.
It's Teamspeak and Discord like, so you need to connect to a server, either public or self-hosted, join or create a channel, and then you will be able to talk to everyone on that channel. This is perfect for permanent communities where people just hang out, but works for one-offs too. It works on Windows, Linux, Mac, iOS and Android, no web access. The server is also available for Raspberry Pi. Half of it is open-source, but the core SDK needs a license if you're developing with it. The program itself is free, even for commercial use.
It uses Opus and lets you adjust the quality and processing, so you can get a lot out of it. We've been using it in our community for about 10 years now, including for radio broadcasting, and we haven't managed to find anything better since. I know of one local radio station and recording studio who successfully use it for remote work now.
To get the best experience, disable all audio processing in the preferences, on the sound system pane, so duplex mode, automatic gain control and noise reduction should be off. If you're on Windows, use Windows Audio Session as the backend for lowest latency.
Then, connect to a server, I use the German one for public stuff, as I'm close to it geographically and you don't need to register for it, but use whichever you want. After connecting, create a channel with application set to music, bitrate set to 150000 and channels set to stereo. Those are, at least, the parameters I use, and they work great. You can adjust the rest as you see fit.
There are some video and screensharing capabilities as far as I know, but I haven't used them. Audio is definitely the primary focus. If you need any assistance, my username here at gmail dot com is the way to go.
[1] bearware.dk for desktop, App Store and Google Play for iOS and Android.
ps. I'm not affiliated with the company in any way; it's just a tool I use daily and would recommend to anyone who knows their way around a computer. It definitely doesn't pass the grandma test, though.
"High"-bitrate lossy CBR compression is probably acceptable enough -- at least compared with a voice codec! MP3 at its maximum bitrate (320 kbit/s, i.e. 40 kB/s) is cheap to stream, doesn't have the security issues that variable-bitrate compression does, and preserves the audio "ok" (it does cut the highest frequencies, above roughly 20 kHz). No patent issues anymore, either.
Ogg Vorbis may be an even better option for all kinds of reasons, but MP3 is more universally recognized.
Awesome! I've been wondering how to do this, since I normally take calls in a quiet room with headphones, so there's no need for noise canceling. It would be nice if you could enable this on a per-call basis, though.
I work for a company that builds virtual classrooms based on WebRTC. Our customers are mostly business schools, but we have some music schools. For them we activate a different profile that disables all audio processing and selects the music profile of Opus (Opus is in fact two codecs in one, one aimed at speech, one at music). It would likely be very easy to do something like this in Jitsi Meet as well, since WebRTC has everything onboard. The tricky part is that you also need to disable echo cancellation, so everyone must be wearing headsets and so on.
This sounds really cool! What’s the company? (I also work on WebRTC-based classrooms, at Minerva - we haven’t looked outside of voice in the audio sphere though.)
For this use case, you need selective fidelity and shared control over a sampler, with each sample having low resolution video, high quality audio and an arbitrary number of tags or notations (with a time range).
Yes, but it is entirely unsuitable for real-time conversations. The only thing it works for is modal jams or something like 12 bar blues that loop the same fixed chord structure over and over.