Hacker News: computerbuster's comments

JPEG XL is mainly based on unique image-specific research, but you're right to say a lot of the techniques are compatible with videos in theory (the XYB color space comes to mind). AVIF is an AV1 OBU in an image-specific container, and required a lot of image-specific engineering to make AV1's tools useful for images; see libaom's tune "iq", and the same in SVT-AV1. The compression gains translated when engineering effort went into creating bespoke implementations, and the same may happen for LLMs if I had to guess.


Existing tools really just need to do a better job keeping up.


They do. It must be some pretty bad tools that still don't support AVIF.


Very cool use case! 4:4:4 support + 10-bit color make AVIF very compelling here.


I'm a big fan of JPEG XL, but even its most dedicated fans have given up the argument that it is the best for compression efficiency. AVIF's generational leap took place in August 2024 with Tune Still Picture in SVT-AV1-PSY, so much so that Google integrated it into their own encoder and has done very impressive work optimizing it further for the human visual system. JPEG XL's strongest quality is its feature set; lossless JPEG recompression, for example, is really incredible.


>but even its most dedicated fans have given up the argument that it is the best for compression efficiency....

I will need to double-check. I think even after PSY, JPEG XL still excels at BPP 1.0+, and that is 85% of all images served according to Chrome. AVIF is still winning below BPP 0.8.

Anyway, we only have to wait a few more months to see AV2. Let's hope they finally have everything ready.


Hi, author here – the README covers this in the Performance section: https://github.com/gianni-rosato/fssimu2?tab=readme-ov-file#...

If you run the `validate.py` script available in the repo, you should see correlation numbers similar to what I've pre-tested & made available in the README: fssimu2 achieves 99.97% linear correlation with the reference implementation's scores.

fssimu2 is still missing some functionality (like ICC profile reading), but the goal was to produce a production-oriented implementation that is just as useful while being much faster (for example, the lower memory footprint and speed improvements make fssimu2 a lot more useful in a target-quality loop). For research-oriented use cases where the exact SSIMULACRA2 score is desirable, the reference implementation is a better choice. It is worth evaluating which case applies to you; an implementation with 99.97% correlation is likely just as useful if you are doing quality benchmarks, target quality, or anything else where SSIMULACRA2's correlation to subjective human ratings matters more than exact agreement with the reference.


Thank you for clarifying this; it was a misread on my side. The overall percentage deviation from the reference implementation is marginal, but the mere existence of 'validate.py' made it look to me like the scores must match exactly.


Quick follow-up from the original SSIMULACRA2 author:

> The error will be much smaller than the error between ssimu2 and actual subjective quality, so I wouldn't worry about it.


The past two years have seen significant advancements in video and image compression, particularly with the maturation of the SVT-AV1 video encoder and improvements to AVIF image compression. These developments, coupled with faster and more accessible developer tools, have made it easier to produce high-quality compressed media. I have a lot of optimism for the future with AV2 and the potential for further community-driven innovation in open-source compression technology.


This is an incredibly robust solution to a really pressing problem for a lot of individuals/orgs who want to use/deploy reasonably powerful LLMs without paying through the nose for hardware. Others have mentioned the hyperscalers have solutions that make some amount of sense (Azure confidential computing, AWS nitro enclaves) but if you read a bit more about Tinfoil, it is clear they want to operate with far less explicit user trust (and thus much better security). This team is setting the standard for provably private LLM inference, and to me, it makes other solutions seem half-baked by comparison. Props to this talented group of people.


Another resource on the same topic: https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...

As I'm seeing in the comments here, the usefulness of handwritten SIMD ranges from "totally unclear" to "mission critical". I'm seeing a lot on the "totally unclear" side, but not as much on the "mission critical" side, so I'll talk a bit about that.

FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.

dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.

While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten & compiler-generated SIMD can be up to 50% in some cases, so it is important.

I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.


One of the fun things about dav1d is that since it’s written in assembly, they can use their own calling convention. And it can differ from method to method, so they have very few stack stores and loads compared to what a compiler will generate following normal platform calling conventions.


I'm curious why there are even function calls in time-critical code, shouldn't just about everything be inlined there? And if it's not time-critical, why are we interested in the savings from a custom calling convention?


Binary size was a concern, so excessive inlining was undesirable.

And don't forget that any asm-optimized variant always has a C fallback for generic platforms lacking a hand-optimized version, and that fallback is also used to verify the asm-optimized variant via checkasm. It might not be linked into your binary/library (the linker eliminates it because it's never used), but the code exists nonetheless.


hm, fair enough. IIRC JPEG XL was a few hundred KB of SIMD code for the four or so different targets/ISAs, including the generic fallback, but I can believe video codecs are larger.


Function calls are very fast (unless there's really a lot of parameter copying/saving-to-stack) and if you can re-use a chunk of code from multiple places, you'll reduce pressure on the instruction cache. Inlining is not always ideal.


Perhaps the use cases are different (heavily data-parallel), but FWIW I do not remember many cases where we were frontend bound, so icache hasn't been a concern.


Codecs often have many redundant ways of doing the same thing, which are chosen on the basis of which one uses the fewest bits, for a specific piece of data. So you can't inline them as you don't know ahead of time which will be used.


Cache misses hurt.


Doesn’t this just make it harder to maintain ports to other architectures though?


For what's written in assembly, lack of portability is a given. The only exceptions would presumably be high level entry points called to from C, etc. If you wanted to support multiple targets, you have completely separate assembly modules for each architecture at least. You'd even need to bifurcate further for each SIMD generation (within x64, for example).


Yes, but on projects like that, ease of maintenance is a secondary priority when compared to performance or throughput.


There have indeed been bugs caused by amd64 assembly that assumed the Unix calling convention being used in Windows builds, causing data corruption. You have to be careful.


SIMD instructions are already architecture dependent


I'm also in the mission-critical camp, with perhaps an interesting counterpoint. If we're focusing on small details (or drowning in incidental complexity), it can be harder to see algorithmic optimizations. Or the friction of changing huge amounts of per-platform code can prevent us from escaping a local minimum.

Example: our new matmul outperforms a well-known library for LLM inference, sometimes even if it uses AMX vs our AVX512BF16. Why? They seem to have some threading bottleneck, or maybe it's something else; hard to tell with a JIT involved.

This would not have happened if I had to write per-platform kernels. There are only so many hours in the day. Writing a single implementation using Highway enabled exploring more of the design space, including a new kernel type and an autotuner able to pick not only block sizes, but also parallelization strategies and their parameters.

Perhaps in a second step, one can then hand-tune some parts, but I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.


> I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.

It should be obvious that both are pursued independently whenever it makes sense. The idea that one should precede the other or is more important than the other is simply untrue.


How can tuning be independent of devising the algorithm?

Are you really suggesting writing a variant of a kernel, tuning it to the max, then discovering a new and different way to do it, and then discarding the first implementation? That seems like a lot of wasted effort.


What does Zig offer in the way of builtin SIMD support, beyond overloads for trivial arithmetic operations? 90% of the utility of SIMD is outside of those types of simple operations. I like Zig, but my understanding is you have to reach for CPU specific builtins for the vast majority of cases, just like in C/C++.

GCC and Clang support the vector_size attribute and overloaded arithmetic operators on those "vectorized" types, and a LOT more besides -- in fact, that's how intrinsics like _mm256_mul_ps are implemented: `#define _mm256_mul_ps(a,b) (__m256)((v8sf)(a) * (v8sf)(b))`. The utility of all of that is much, much greater than what's available in Zig.


Zig ships LLVM's internal generic SIMD stuff, which is fairly common for newish systems languages. If you want dynamic shuffles or even moderately exotic things like maddubs or aesenc then you need to use LLVM intrinsics for specific instructions or asm.


I’m also wondering what “built in” even means. Many languages have SIMD, vector, matrix, quaternion types and the like as part of the standard library, but not necessarily as their own keywords. By this metric, C#/.NET and Java have SIMD.


Java's Panama Vectors are a work in progress and are far from being competitive with .NET's implementation of SIMD abstractions, which is mostly on par with Zig, Swift and Mojo.

You can usually port existing SIMD algorithms from C/C++/Rust to C# with few changes retaining the same performance, and it's practically impossible to do so in Java.

I feel like C veterans often don't realize how unnecessarily ceremonious platform-specific SIMD code is given the progress in portable abstractions. Unless you need an exotic instruction that does not translate across architectures and/or common patterns nicely, there is little reason to have a bespoke platform-specific path.


We in FFmpeg need all the instructions, and we often need to do register allocation by hand.


Absolutely fair! FFmpeg does fall into the category of scenarios where skipping to the very last mile optimizations is reasonable. And thank you for your work on FFmpeg!

Most code paths out there aren't like that however and compilers are not too bad at instruction selection nowadays (you'd be right to mention that they sometimes have odd regressions, I've definitely seen that being a problem in LLVM, GCC and RyuJIT).


I'm primarily writing "general-purpose" code (especially parsers and formatters) rather than code that does the same math operation on a big array, so it's usually not reasonable to even use the same approach to the problem with different vector extensions :(


Even in the latter case, different approaches are often required. For an 8x8 byte block difference, SSE2 prefers horizontal accumulation (PSADBW) while ARM64 prefers vertical (UABAL). It's noticeably suboptimal if you try abstracting across these with generic primitives.


Exactly!


So on point. We do _a lot_ of handwritten SIMD on the other side (encoders) as well, for similar reasons. In addition, on the encoder side it's often necessary to structure the problem so you can perform things like early elimination of loops, and especially loads. Compilers simply cannot generate autovectorized code that does those kinds of things.


AvifHash leverages the power of AVIF to create image placeholders that are both compact and efficient.

This Proof of Concept shows promising results: at 27 characters, AvifHash outperforms BlurHash https://blurha.sh/ (using 4x3 components) in quality and detail retention. At a similar quality, BlurHash needs 54 (5x5) to 76 characters (6x6 components).

Given that AVIF decoding is done by the web engine, AvifHash is very small: the entire demo page (including parsing and re-hydration code) is only 2.3 kB gzipped.


A deep dive into programming a freestanding QOI encoder using the Zig programming language.

