> Why is map so much faster? I am not sure. I suspect with the map() option the Rust compiler figures out it can avoid allocations altogether by simply writing over the original vector, while with the loop it can't. Or maybe it's using SIMD? I tried to look in the compiler explorer but I'm not competent enough yet to figure it out. Maybe someone else can explain!
Yep, it's due to SIMD -- in the assembly for `using_map`, you can spot pcmpeqd, movdqu, and psubd, while `using_loop` doesn't have any of these.
It's not only SIMD. In contrast to many other languages (though not all), the compiler is working with code here, not an arbitrary function pointer. In essence, JS and the like are operating with this:
let result: Vec<i32> = list.into_iter().map::<_, Box<dyn Fn...>>(Box::new(transform)).collect()
Rust is able to inline the transform code right into the loop, which then becomes available for SIMD etc.
Rust further brings really nice ergonomics and comprehensive type inference to the equation, which makes it feel like writing C#/JS/whatever. JITted languages could detect and elide creating and immediately using a function pointer, but I don't think that any do.
Edit: these languages may not always allocate (specifically if nothing is captured in a closure), but the core concept remains: they erase the type, which means that they also erase the function body.
JS, Java and C# have vastly different implementation details each.
Java uses type erasure for generics. .NET uses generic monomorphization for struct-typed generic arguments and method body sharing with virtual/interface dispatch for class-typed generic arguments (the types are never erased).
Moreover, non-capturing lambdas do not allocate, and also get speculatively inlined by the JIT behind a guard. It's a bit limited but works quite well in production applications. You can also write struct-based iterators in C#. The main limitation is the lack of full HM type inference, which means a less convenient API where you can't convince the compiler to infer the full type signature.
One of the current limitations of C# is that lambdas are of type Func<T1...Tn, TResult>, and calls through them are virtual. So unless the JIT emits a guarded devirtualization path, you cannot specialize over them like over Fns in Rust, which are part of the monomorphized generic signature. Various performance-oriented libraries sidestep this by implementing a "value delegate" pattern, where you constrain an argument over an interface implementation of an invoke-like method -- basically doing higher-order functions via struct implementations.
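For contrast, a minimal Rust sketch of the two dispatch strategies being compared (the generic version is roughly what the value-delegate pattern approximates; `apply_dyn` is the closest analogue to calling through a `Func<T, TResult>`):

```rust
// Monomorphized: a dedicated copy of `apply` is compiled for each closure
// type, so the call is static and can be inlined.
fn apply<F: Fn(i32) -> i32>(f: F, x: i32) -> i32 {
    f(x)
}

// Type-erased: the closure sits behind a vtable, so every call is virtual.
fn apply_dyn(f: &dyn Fn(i32) -> i32, x: i32) -> i32 {
    f(x)
}

fn main() {
    let double = |x| x * 2;
    assert_eq!(apply(double, 21), 42);
    assert_eq!(apply_dyn(&double, 21), 42);
}
```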
Java also deserves a mention here, because OpenJDK is capable of inlining shallow streams -- the Stream API is moderately to significantly slower than LINQ, but it's not terribly slow in absolute terms.
With all that, in recent versions LINQ has started to approach the performance of Rust iterators, especially on large sequences, where access to faster allocations and heavy pooling of the underlying buffers when collecting to an array or a list allows for very efficient hot paths. LINQ also does quite a bit of "flattening" internally, so chaining various operators does not necessarily add an extra layer of dispatch.
Lastly, F# is capable of lambda inlining together with the function accepting it at IL level at build time and does so for various iterator expressions like Array.map, .iter and similar. You access this via `inline` bindings and `[<InlineIfLambda>]`-annotated parameters. It is also possible to implement your own zero-cost-ish iterators with computation expressions. If JIT/ILC improves at propagating exact types through struct fields in the upcoming release, it will be able to inline F# lambdas even if expansion does not happen at IL level: https://github.com/dotnet/runtime/issues/110290
NB: auto-vectorization is extremely fragile even with LLVM and kicks in only in simple scenarios; the moment you have a side effect the compiler cannot reason about, it stops working.
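A hedged illustration of the kind of thing that breaks it (whether the first loop actually vectorizes depends on the exact compiler version and flags):

```rust
// Pure element-wise arithmetic: LLVM can typically turn this into
// packed SIMD instructions (e.g. psubd on x86).
fn pure(v: &mut [i32]) {
    for x in v.iter_mut() {
        *x -= 1;
    }
}

// The same loop with a side effect the compiler can't reason about:
// the output must be produced in order, so auto-vectorization is off the table.
fn with_side_effect(v: &mut [i32]) {
    for x in v.iter_mut() {
        println!("{x}");
        *x -= 1;
    }
}
```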
Java’s type erasure means that generic type information is not available at runtime.
C# lambdas: although non-capturing lambdas do not allocate, capturing lambdas do.
"calls through them are virtual" is due to the underlying implementation of delegates in .NET.
Well, yes, but "delegates" as a term is not often used in other languages, so I did not mention them for simplicity's sake.
For what it's worth - the real issue in C# is not even the virtual calls but the way Roslyn caches lazily allocated non-capturing lambda instances. It does so in a compiler-unfriendly way due to questionable design decisions inside Roslyn.
Luckily, this has a high chance of changing in .NET 10. Ideally, by the time it releases, the compiler will both understand Roslyn's caching pattern better and be able to stack-allocate non-escaping lambda closure instances.
Lambdas capturing 'this' inside instance methods of the object they refer to do not allocate either.
SIMD is part of it, but the original guess is also correct, and that effect is bigger!
using_map is faster because it's not allocating: it's re-using the input array. That is, it is operating on the input `v` value in place, equivalent to the sketch below.
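A minimal sketch of that in-place version, assuming the transform subtracts one from each element (which is what the psubd in the assembly suggests):

```rust
// Hypothetical stand-in for the article's function: same signature,
// but mutating the input buffer instead of allocating a new one.
fn using_map_equivalent(mut v: Vec<i32>) -> Vec<i32> {
    for x in v.iter_mut() {
        *x -= 1;
    }
    v
}
```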
Even when passing the array as a borrow instead of a clone [1], map still auto-vectorizes and performs the new allocation in one go, avoiding the bounds checks and possible calls to grow_one.
It seems `map` should have less restrictive semantics (specifically around ordering) than `for`; does that allow more optimization? I don't know much about Rust internals.
Reading the godbolt, it looks like for the push loop LLVM is unable to remove the `grow_one` capacity check after every push. Because of this, the Vec could possibly reallocate after every push, meaning it can't auto-vectorize.
It's a little bit surprising to me that LLVM can't eliminate the grow_one check. It looks like there's a test ensuring it's not needed in the easier case of vec.push(vec.pop()) [0]. With the iterator, the optimization is handled in the standard library using specialization and TrustedLen [1].
That's not accurate: it can be used while consuming an Iterator and, depending on the implementation, be used to guide the consumer at runtime. The stdlib likely is not doing this, but the API very much allows advanced behavior. We used this, e.g., in a university course to guide algorithm choice for operators in part of a query engine.
SpecFromIterNested is a specialization trait for alloc::vec::Vec's FromIterator which handles both the TrustedLen and ordinary Iterator scenarios.
For an ordinary Iterator, it calls next() once to check whether the Iterator is already done; if it is, we can just give back a Vec::new(), since that's exactly what was needed. Otherwise, it consults the hint's low estimate and pre-allocates enough capacity on that basis, unless it's lower than Vec's own guess of the minimum worthwhile initial capacity.
For Iterators which impl TrustedLen (i.e., they promise they know exactly how many items they yield), it instead checks the upper end of the hint to see if it's None; if it is, the iterator knows it's too big to store in memory, and we should panic. Otherwise, we can just Vec::with_capacity:
let v: Vec<_> = (0..23456).collect();
... will just give you a Vec with the 23456 values from zero to 23455 inclusive, it won't waste time growing that Vec because it knows from the outset that there are going to be exactly 23456 items in the Vec.
If I recall correctly, there was actually some unstable specialisation in the std library that allowed reusing the backing storage if you do an `into_iter()` and then a `collect()`.
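A sketch of the pattern in question (the reuse is an unstable internal specialization, not a documented guarantee, so don't rely on it for correctness):

```rust
fn double_all(v: Vec<i32>) -> Vec<i32> {
    // Input and output element types have the same size and alignment,
    // so the collect() may reuse v's original allocation instead of
    // making a new one.
    v.into_iter().map(|x| x * 2).collect()
}
```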
Is there a reason why the loop and map don't result in the exact same code?
It looks like the code does exactly the same thing, and like something the optimizer could catch. Is it because of potential side effects? If not, maybe there is a ticket to open somewhere, if there isn't one already.
Not only SIMD but also size hints, which avoid bounds checks and enable preallocation or even allocation reuse. These are often missing when using for loops.
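For example, the size hint on a mapped range is exact, which is what lets collect allocate once up front:

```rust
fn main() {
    let it = (0..1000).map(|x| x + 1);
    // Map forwards the range's exact size hint...
    assert_eq!(it.size_hint(), (1000, Some(1000)));
    // ...so collect() can allocate the Vec in a single step.
    let v: Vec<i32> = it.collect();
    assert_eq!(v.len(), 1000);
}
```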
The author's fold example is unfair. They could've just called accumulator.extend() and then returned the accumulator inside the fold for a fair apples-to-apples comparison. Just mark the accumulator as mut.
Furthermore, I'd use with_capacity in both cases: `Vec::with_capacity(list_of_lists.iter().map(|l| l.len()).sum())`
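Putting both suggestions together, a sketch of the fairer version (assuming `list_of_lists` is a `Vec<Vec<i32>>`; the article's element type may differ):

```rust
fn flatten_fold(list_of_lists: Vec<Vec<i32>>) -> Vec<i32> {
    let total: usize = list_of_lists.iter().map(|l| l.len()).sum();
    list_of_lists
        .into_iter()
        .fold(Vec::with_capacity(total), |mut accumulator, list| {
            // extend() appends in place; no clone of the accumulator needed.
            accumulator.extend(list);
            accumulator
        })
}
```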
I'm currently doing Advent of Code in Scala 3 to learn a bit of the language.
The for expression does weird things there. It can be used as a flat map, as it can iterate over multiple iterators at once and yield a value for each combination.
What's bitten me multiple times so far is that when you iterate over a Set or a Map, the result of the for expression is also a Set or a Map.
And since you have access to the keys during iteration, some iterations, if they return the same map key or the same set value, can get silently overwritten by others.
I don't remember having this problem in Rust because there I had to be very intentional about iterators.
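In Rust the same collapse can happen, but only if you explicitly collect into a map, which is hard to do by accident -- a small illustrative sketch, not from the article:

```rust
use std::collections::HashMap;

fn main() {
    let m = HashMap::from([(1, "a"), (2, "b")]);
    // Mapping both entries onto the same key: later insertions silently
    // overwrite earlier ones, just like the Scala for expression over a Map.
    let collapsed: HashMap<i32, &str> = m.into_iter().map(|(_k, v)| (0, v)).collect();
    assert_eq!(collapsed.len(), 1);
}
```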
The other thing that bit me was that arithmetic on Int overflows silently, but that's apparently a Java thing, which made me wonder how Java is an enterprise language.
Otherwise, Scala 3 is a superb experience. The syntax is ultra-flexible, and the local extensibility of everything -- plus access to things from the context where your code is defined, and even from the context where it's running -- is magical.
Why does fallible_flatten_fold have that accumulator.clone() in it? You're cloning the in-progress vector only to throw away the original; it's extremely wasteful and completely unnecessary. Just declare the accumulator as `mut`.
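One way to write the fallible version without the clone, assuming a per-list transform that returns a Result (the names here are illustrative, not the article's):

```rust
// Hypothetical fallible transform, standing in for the article's.
fn transform(list: Vec<i32>) -> Result<Vec<i32>, String> {
    Ok(list)
}

fn fallible_flatten_fold(list_of_lists: Vec<Vec<i32>>) -> Result<Vec<i32>, String> {
    list_of_lists
        .into_iter()
        .try_fold(Vec::new(), |mut accumulator, list| {
            // `?` works here because the closure itself returns a Result.
            accumulator.extend(transform(list)?);
            Ok(accumulator)
        })
}
```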
Dear author (and others who write things like this): it looks like you put considerable time into the code comparisons and benchmarks. Why not publish the source code for them? Benefits include:
1. The comparisons, as you've written them up, will probably get stale fast. It would be nice to be able to re-run them.
2. Some of the examples are surprising. Why? Any bugs? Some kind of weirdness? Readers would want to explore.
3. Readers can see how you did the benchmarks. We can put more/less stock in them. Hopefully you used `criterion` or similar, with warm ups, etc.
Without looking too much at the generated assembly in https://godbolt.org/z/fKEPaTdTv, it seems that using_map uses SIMD instructions (movdqu and psubd) in the main loop, while using_loop doesn't, so that could indeed explain the performance difference.
Maybe I'm dumb, but I can't see how the code in the "Errors and map" section can compile. "transform_list" returns a Result<>, yet "result" is just a Vec. I thought you always need to wrap it with Ok()? Is that a new nightly feature?
It's interesting to note that the performance of a for loop versus the functional-style mechanism varies by language. In Java there is a performance penalty (possibly shrunk since Java 8) for using the FP idioms, while in Rust they end up much faster.
That surprises me; I'd expect the FP version to be at least as fast, with the potential to be much faster. With the for loop, you're saying "run against this value, then run against the next one, then the next one, then...". If the compiler can't prove the iterations are free of side effects, it has to run each one in order before moving on to the next.
`.map(...)` implies "I don't care about ordering, and therefore you don't need to, either", freeing the compiler to schedule the loops in a more optimal order, or in parallel or with SIMD, or any other optimization that lets it get the job done as fast as possible. I'm sure someone will come up with an example, but I can't personally think of any way where a for-loop's semantics would let a clever compiler write faster code than the equivalent map.
Streams (FP in Java) are slower than for loops for a couple of reasons. A big one is that Java doesn't natively support real closures or lambdas. It does have syntax for them, but that is just syntactic sugar for a class with a single method under the hood. So streams end up doing a lot of object allocation and creating garbage for the fake closures.
Also, streams operate on objects, so they have to be on the heap; you can't use them with primitives on the stack. With autoboxing, the JVM may play some tricks with a list of Integer objects really being primitives on the stack, but I would never count on it.
As for SIMD, Java isn't going to parallelize anything automatically. You need to tell it to run the stream in parallel, which will split it across threads. Java doesn't have lightweight threads like coroutines.
I know lightweight threads are on the roadmap and maybe available in Java 21 or newer. I know real closures have been considered, but I don't know if that's gone anywhere. It's hard to do a quick search, because we got "closures" in Java 8, so there's a lot of noise.
And as a caveat, I am most familiar with Java 17 (and older). I expect we'll look at moving to Java 21 (current LTS) next year.
The big differences are:
1. Rust closures are by-value structs, whereas Java closures are heap objects.
2. Rust generics are monomorphized, whereas Java type-erases them, which means lots of virtual-call overhead when passing a closure to a generic function.
Sometimes, if the Java JIT manages to inline absolutely everything, it can optimize away these overheads. But in practice, Rust FP gets optimized a lot more reliably than Java FP.
The other problem with parallel streams is that they're badly implemented. Threads for parallel streams are pulled from a single thread pool shared across the whole application. So if you have multiple parallel streams in an application that's already inherently multithreaded (e.g. a web service), you end up with severe resource contention that makes the parallelism work poorly, if at all, and can end up deadlocking your app because all the threads are in use somewhere else. There's a workaround for it, but it requires some ugly boilerplate code.
> It does have syntax for them, but that is just syntactic sugar for a class with a single method under the hood. So streams end up doing a lot of object allocation and creating garbage for the fake closures.
In Rust, a closure is really just a struct that implements up to three closure traits, each of which provides a single method. So from that side of things, what Java is doing for them isn't inherently different from Rust.
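A rough sketch of what a capturing Rust closure desugars to (the real desugaring uses the Fn traits, which can't be implemented by hand on stable; this is the moral equivalent):

```rust
// `let add_n = move |x: i32| x + n;` is conceptually this:
struct AddN {
    n: i32, // captured by value; the whole closure is a plain struct
}

impl AddN {
    // stands in for the FnOnce/FnMut/Fn call methods
    fn call(&self, x: i32) -> i32 {
        x + self.n
    }
}
```

The difference is where it lives: this struct sits on the stack and gets monomorphized into its caller, while Java's equivalent single-method object lives on the heap behind an interface.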
It's interesting -- so, logically, `map` in Rust does imply an ordering. The closure is a `FnMut`, i.e. a callback which may mutably capture values, causing external side effects. And it is guaranteed that, if externally visible, mutations will be done in order.
But `FnMut` is just the most general thing you can pass in. In reality, most callbacks are pure functions that don't alter mutable state at all. With Rust's monomorphization and aggressive inlining, LLVM can figure out that there's no mutation going on and optimize accordingly.
There is a wrinkle here, which is that capturing variables mutably is one of two ways a function can have side effects in Rust. The other way is via interior mutability, through UnsafeCell [1], or, more commonly, a wrapper around it like Mutex or RefCell. In that case as well, Rust guarantees that function calls to map are done in order. Luckily, because UnsafeCell is the root of all interior mutability, the compiler can simply track whether an UnsafeCell is transitively involved.
If you're wondering where the humble `print!` comes in -- well, it clearly has side effects. But it acquires a global lock on standard output each time it's called [2], so UnsafeCell is involved.
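A tiny illustration of that ordering guarantee, with a mutably captured counter:

```rust
fn main() {
    let mut calls = 0;
    let v: Vec<i32> = (1..=3)
        .map(|x| {
            calls += 1; // externally visible mutation: guaranteed in order
            x * calls
        })
        .collect();
    assert_eq!(v, vec![1, 4, 9]); // 1*1, 2*2, 3*3
}
```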
Readability is subjective. I personally find fold almost always more readable than a for loop when the accumulator variable has a simple type. Merely seeing fold already tells me several things: it will iterate over the entire collection without early exits like "break" in a loop, and the data dependency between iterations is funneled through a single variable.
I find it slightly difficult to read when the accumulator variable actually has multiple parts, like a complicated tuple. It's worse when part of the accumulator is a bool indicating whether it's finished; that's just a poor emulation of "break" in a for loop.
AFAIR, I've mostly only used fold when doing maths not covered by the standard sum or product. Fold is similar to map-reduce, but it's just one expression.
Yeah I kind of wish there was a way to still use `?` inside map/filter/etc. lambdas. Error handling with that functional style is generally way more awkward than for loops, but also it's often more elegant in other ways (e.g. Rayon).
I think Ruby has some kind of feature that works like that, but IIRC it looked less foot-gun and more foot-bazooka. Does anyone know of any languages that solve this problem elegantly?
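For what it's worth, Rust partially has this already: `?` works inside a closure as long as the closure itself returns a Result, and collecting an iterator of Results into a `Result<Vec<_>, _>` short-circuits on the first error:

```rust
use std::num::ParseIntError;

fn main() {
    let inputs = ["1", "2", "oops", "4"];
    // Result implements FromIterator, so this stops at the first Err.
    let parsed: Result<Vec<i32>, ParseIntError> =
        inputs.iter().map(|s| s.parse::<i32>()).collect();
    assert!(parsed.is_err());
}
```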
I think what you want is a short-circuiting behavior.
I don't see a problem with the status quo: if map is able to short-circuit, that means it runs in order and has side effects.
If your "map" can cause the whole function to return, then just write a for-loop.
I don't even know if it should return from the whole function or just stop the mapping.