If you scroll down a bit there's a wireframe of the skeleton that's actually being animated, and you'll notice it lacks bones for the fingers and possibly even the hands. That's why the hands hold that weird pose throughout all the examples.
My gut says that the quality could be rapidly improved without changing the underlying design at all.
The real issue with this, I think, is that motion capture for humans is already widely available and provides much higher fidelity and control than text. Unless I'm misreading the paper badly, this model was trained on exactly such data. Blending between multiple animations through motion capture is also well-understood.
So while the results are impressive, the practical gains seem very marginal. Perhaps the equivalents of "inpainting" (as mentioned in the text) and "style transfer" would be the big gain here? If we could use this to quickly retarget animations to different body plans (child, adult, space monster), or for smarter interpolation between human-authored keyframes, I could see that becoming a much-desired tool.
I dunno, as an amateur animator and game developer this would be a huge help to me. I have a first gen Perception Neuron suit, and I even wrote an addon for Blender that retargets the Neuron output to the Rigify rig.
But it's cumbersome to put on, take off, and operate, especially when working alone. While I'm in pretty good shape, there are heaps of movements (e.g. martial arts, swordplay, firing/reloading a gun, etc.) that would probably look silly if I performed them. I can see this being very handy for prototyping animations, at the very least.
Replacing finger bone positions is pretty trivial in Blender as well using the Pose Library feature, so the lack of finger data isn't that big a deal.
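If you'd rather script it than click through the Pose Library UI, here's a minimal bpy sketch of the same idea. It assumes a generated Rigify armature named "rig", Rigify's default finger bone names, and a hand pose you've stored as per-bone quaternions; all of those are assumptions about your setup, not anything from the paper or its code.

    import bpy
    from mathutils import Quaternion

    rig = bpy.data.objects["rig"]  # assumed armature object name

    # A stored hand pose as per-bone rotations (values here are placeholders).
    relaxed_hand = {
        "f_index.01.L": Quaternion((0.98, 0.2, 0.0, 0.0)),
        "f_middle.01.L": Quaternion((0.98, 0.2, 0.0, 0.0)),
        # ... remaining finger bones of the stored pose
    }

    # Stamp the hand pose onto every frame of the imported mocap clip.
    for frame in range(1, 251):  # example frame range
        bpy.context.scene.frame_set(frame)
        for name, quat in relaxed_hand.items():
            pbone = rig.pose.bones[name]
            pbone.rotation_mode = "QUATERNION"
            pbone.rotation_quaternion = quat
            pbone.keyframe_insert("rotation_quaternion", frame=frame)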
The reason there are likely no finger joints is that a lot of motion capture data doesn't capture anything beyond the wrist.
So if they're training on the standard motion capture corpora that are available, and even mixing in their own, they likely won't have finger data to work from.
If you look very closely, the model does have wrists. (Most noticeable on the model's left arm, which is the viewer's right.)
Either way, the shoulder and elbow joints don’t move much during rope skipping, and no matter how it’s recorded, the motion capture data that was used for training should reflect that.
My best guess is that the model has picked up on the tiny arm motions that are present in rope skipping and wildly exaggerated them for some reason.
Today the issue with traditional animation is that all humans end up with the same height and the same proportions. In The Sims games, for example, all adults are exactly the same height even though you can choose different body shapes. If you use mods to change height or leg length, things no longer line up, like sitting on a chair. It also means chairs, beds, tables, and doors all have to be the same height.
I'm not sure this diffusion approach is the answer, but what's needed is some smarter way to extrapolate existing animations to bodies and objects that are 10-20% different and still look natural.
Which image/video are you referring to? As far as I can tell, the colors are not used to represent race, but to convey different kinds of information. For example, in Figures 1 and 3 of the paper, the color indicates different points in time. In the video, the colors indicate different motion generation methods. I would not classify "orange", "blue" and "purple" as Caucasian, but if you want even more colors, you can have a look at the original paper, where color coding was used to differentiate between different skinning methods (Figure 2).
Now that lighting and polygon counts and motion capture have become so crazy realistic in AAA games, I find it's actually motion (outside of cutscenes) that snaps me out of believability the most. RDR2 is the best I've ever seen, but boy there are still a lot of clunkers.
I never imagined ML or stable diffusion would be the answer here, but now I wonder if it will be? So much stable diffusion stuff has so far felt to me like just toys to play with, but modeling movement seems like it could make a gigantic difference in videogames and animation generally.
I wonder if you could just simulate a skeleton with realistic maximum torque values for the joints, factoring in the mass and center of mass of each segment. Obviously that doesn't tell you "how" to move each limb, but it could at least be used to train a network GAN-style so that the motion looks realistic. I feel like physically impossible movements are among the things that break immersion the most, and that's something you can model with a physics engine to put boundaries on a model trained on motion capture data.
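As a very rough sketch of the "physically impossible" check: estimate the torque a joint trajectory implies and flag frames that exceed a per-joint limit. The limits, inertias, and the single-joint tau = I * alpha model below are made-up assumptions for illustration (a real check would use published biomechanical values and proper inverse dynamics), not anything from the paper.

    import numpy as np

    # Assumed per-joint torque limits (N*m) and segment inertias (kg*m^2).
    TORQUE_LIMIT_NM = {"elbow": 70.0, "knee": 250.0}
    SEGMENT_INERTIA = {"elbow": 0.08, "knee": 0.35}

    def implausible_frames(joint_angles, joint, fps=60):
        """joint_angles: (T,) array of one joint's angle in radians."""
        dt = 1.0 / fps
        ang_acc = np.gradient(np.gradient(joint_angles, dt), dt)  # rad/s^2
        torque = SEGMENT_INERTIA[joint] * ang_acc                 # tau = I * alpha
        return np.where(np.abs(torque) > TORQUE_LIMIT_NM[joint])[0]

    # Example: a synthetic elbow trajectory sampled at 60 fps.
    elbow = np.deg2rad(30 * np.sin(np.linspace(0, 8 * np.pi, 240)))
    print(len(implausible_frames(elbow, "elbow")), "frames exceed the elbow limit")

The same count could then feed a penalty term, or a hard rejection, when scoring generated clips against the mocap training data.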
Try to find the email addresses of the authors of the scientific papers that go along with the models; there's usually a good chance that someone will answer if your request is reasonable.