I've noticed this too, and I have a few theories. Disclosure: I know a little about audio, and very little about generative audio AI.
First, perhaps the models are trained on relatively low-bitrate encodings. Just as image generators sometimes reproduce JPEG artifacts, we could be hearing the known high-frequency loss of low-bitrate encodings. Another idea: 'S' and 'T' sounds and similar sibilants are relatively broad-spectrum, not unlike white noise, and that kind of sound is known to be difficult to encode with lossy frequency-domain schemes. Perhaps these models work in a similar domain and are subject to similar constraints. Perhaps there's a trade-off here between low-pass filtering and "warbly" artifacts, and we're hearing a middle-ground compromise.
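To make the "broad-spectrum sounds are hard for frequency-domain codecs" idea concrete, here's a small sketch (my own illustration, not a claim about any particular codec or model): a pure tone concentrates nearly all its energy in a few FFT bins, while white noise — a rough stand-in for a sibilant — smears it across most of the spectrum, so there's no small set of coefficients a lossy coder can keep cheaply.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
sr = 44100  # assumed sample rate, just for the example

tone = np.sin(2 * np.pi * 440 * np.arange(n) / sr)  # pure 440 Hz tone
noise = rng.standard_normal(n)                      # white-noise stand-in for an 'S' sound

def bins_for_energy(x, fraction=0.99):
    """How many frequency bins hold `fraction` of the signal's spectral energy."""
    mag2 = np.abs(np.fft.rfft(x)) ** 2
    sorted_energy = np.sort(mag2)[::-1]            # largest bins first
    cumulative = np.cumsum(sorted_energy) / sorted_energy.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

# The tone needs only a handful of bins; the noise needs most of them.
print("tone:", bins_for_energy(tone))
print("noise:", bins_for_energy(noise))
```

The gap is dramatic: a coder that keeps the top few coefficients reproduces the tone almost perfectly, but doing the same to the noise-like signal throws away most of its energy, which is one plausible route to the artifacts described above.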
I don't know how it happens, but when I hear the "AI" sound in music, this is usually one of the first tells.
Inconsistent paywalls are a product now. The Times (the original) has shoehorned in some kind of AI paywall [1] that claims to maximise conversions by varying how much you get to see and for how long. It pissed me off because I was logged in to my subscription but it was blocking me anyway.
I think it means you write a spec from the implementation. Then you write a new implementation from the spec. You might go so far as to do the second part in a "clean" room.
Heh, the original being entirely vibe-coded got me thinking about an interesting problem: if you used the same model to generate a specification, then reset its state and passed that specification back to it for implementation, the resulting code would by design be very close to the original. With enough luck (or engineering), you could even get exactly the same files in some cases.
Does this still count as clean-room? And what if the model wasn't the exact same one, but one trained the same way on the same input material, which Anthropic never owned?
This is going to be a decade of very interesting, and probably often hypocritical lawsuits.
In a typical clean-room design, the person writing the new implementation is not supposed to have any knowledge of the original; they should only have knowledge of the specification.
If one person writes the spec from the implementation and then also writes the new implementation, it is not clean-room design.
I believe the argument is that LLMs are stateless. So if the session writing the code isn't the same session that wrote the spec, it's effectively a clean-room implementation.
There are other details of course (is the old code in the training data?) but I'm not trying to weigh in on the argument one way or the other.
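The "two stateless sessions" argument can be sketched like this (a toy illustration of mine, not any real API; `model_call` is a hypothetical stand-in whose output depends only on its own prompt, with no shared conversation state):

```python
def model_call(prompt: str) -> str:
    # Hypothetical stateless LLM call: output is a function of the prompt alone.
    return f"output({hash(prompt) & 0xFFFF})"

original = "secret original implementation"

# Session A: reads the original, emits only a specification.
spec = model_call("write a spec for: " + original)

# Session B: reads only the spec; the original never enters its context.
reimpl = model_call("implement this spec: " + spec)

# The thread's caveat still applies: both sessions share the same weights,
# which may have seen the original during training.
```

The structural point is that session B's input contains no trace of the original text, which is the analogy to the spec-only engineer in a human clean-room; whether shared training data breaks that analogy is exactly the open question above.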