> Why is map so much faster? I am not sure. I suspect with the map() option the Rust compiler figures out it can avoid allocations altogether by simply writing over the original vector, while with the loop it can't. Or maybe it's using SIMD? I tried to look in the compiler explorer but I'm not competent enough yet to figure it out. Maybe someone else can explain!
Yep, it's due to SIMD -- in the assembly for `using_map`, you can spot pcmpeqd, movdqu, and psubd, while `using_loop` doesn't have any of these.
It's not only SIMD. In contrast to many other languages (though not all), the compiler is working with code here, not an arbitrary function pointer. In essence, JS and the like are operating with this:
let result: Vec<i32> = list.into_iter().map::<_, Box<dyn Fn...>>(Box::new(transform)).collect()
Rust is able to inline the transform code right into the loop, which then becomes available for SIMD etc.
Rust further brings really nice ergonomics and comprehensive type inference to the equation, which makes it feel like writing C#/JS/whatever. JITted languages could detect and elide creating and immediately using a function pointer, but I don't think that any do.
Edit: these languages may not always allocate (specifically if nothing is captured in a closure), but the core concept remains: they erase the type, which means that they also erase the function body.
JS, Java and C# have vastly different implementation details each.
Java uses type erasure for generics. .NET uses generic monomorphization for struct-typed generic arguments and method body sharing with virtual/interface dispatch for class-typed generic arguments (the types are never erased).
Moreover, non-capturing lambdas do not allocate, and also get speculatively inlined by the JIT behind a guard. It's a bit limited but works quite well in production applications. You can also write struct-based iterators in C#. The main limitation is the lack of full HM type inference, which means a less convenient API where you can't convince the compiler to infer the full type signature.
One of the current limitations of C# is that lambdas are of type Func<T1...Tn, TResult>, and calls through them are virtual. So unless the JIT emits a guarded devirtualization path, you cannot specialize over them like over Fns in Rust, which are part of the monomorphized generic signature. Various performance-oriented libraries sidestep this by implementing a "value delegate" pattern, where you constrain an argument over an interface implementation of an invoke-like method -- basically doing higher-order functions via struct implementations.
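For contrast, a minimal Rust sketch of the two dispatch strategies being compared (the generic version is roughly what the value-delegate pattern approximates; `apply_dyn` is the closest analogue to calling through a `Func<T, TResult>`):

```rust
// Monomorphized: a dedicated copy of `apply` is compiled for each closure
// type, so the call is static and can be inlined.
fn apply<F: Fn(i32) -> i32>(f: F, x: i32) -> i32 {
    f(x)
}

// Type-erased: the closure sits behind a vtable, so every call is virtual.
fn apply_dyn(f: &dyn Fn(i32) -> i32, x: i32) -> i32 {
    f(x)
}

fn main() {
    let double = |x| x * 2;
    assert_eq!(apply(double, 21), 42);
    assert_eq!(apply_dyn(&double, 21), 42);
}
```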
Java also deserves a mention here, because OpenJDK is capable of inlining shallow streams -- the Stream API is moderately to significantly slower than LINQ, but it's not terribly slow in absolute terms.
With all that, in recent versions LINQ has started to approach the performance of Rust iterators, especially on large sequences, where access to faster allocations and heavy pooling of the underlying buffers when collecting to an array or a list allows for very efficient hot paths. LINQ also does quite a bit of "flattening" internally, so chaining various operators does not necessarily add an extra layer of dispatch.
Lastly, F# is capable of lambda inlining together with the function accepting it at IL level at build time and does so for various iterator expressions like Array.map, .iter and similar. You access this via `inline` bindings and `[<InlineIfLambda>]`-annotated parameters. It is also possible to implement your own zero-cost-ish iterators with computation expressions. If JIT/ILC improves at propagating exact types through struct fields in the upcoming release, it will be able to inline F# lambdas even if expansion does not happen at IL level: https://github.com/dotnet/runtime/issues/110290
NB: auto-vectorization is extremely fragile even with LLVM and kicks in only in simple scenarios; the moment you have a side effect the compiler cannot reason about, it stops working.
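A hedged illustration of the kind of thing that breaks it (whether the first loop actually vectorizes depends on the exact compiler version and flags):

```rust
// Pure element-wise arithmetic: LLVM can typically turn this into
// packed SIMD instructions (e.g. psubd on x86).
fn pure(v: &mut [i32]) {
    for x in v.iter_mut() {
        *x -= 1;
    }
}

// The same loop with a side effect the compiler can't reason about:
// the output must be produced in order, so auto-vectorization is off the table.
fn with_side_effect(v: &mut [i32]) {
    for x in v.iter_mut() {
        println!("{x}");
        *x -= 1;
    }
}
```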
Java’s type erasure means that generic type information is not available at runtime.
C# lambdas: although non-capturing lambdas do not allocate, capturing lambdas do.
"calls through them are virtual" is due to the underlying implementation of delegates in .NET.
Well, yes, but "delegates" as a term is not often used in other languages, so I did not mention them for simplicity's sake.
For what it's worth - the real issue in C# is not even the virtual calls but the way Roslyn caches lazily allocated non-capturing lambda instances. It does so in a compiler-unfriendly way due to questionable design decisions inside Roslyn.
Luckily, this has a high chance of changing in .NET 10. Ideally, by the time it releases, the compiler will both understand Roslyn's caching pattern better and be able to stack-allocate non-escaping lambda closure instances.
Lambdas capturing 'this' inside instance methods of the object they refer to do not allocate either.
SIMD is part of it, but the original guess is also correct, and that effect is bigger!
using_map is faster because it's not allocating: it's re-using the input array. That is, it is operating on the input `v` value in place, equivalent to the sketch below.
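A minimal sketch of that in-place version, assuming the transform subtracts one from each element (which is what the psubd in the assembly suggests):

```rust
// Hypothetical stand-in for the article's function: same signature,
// but mutating the input buffer instead of allocating a new one.
fn using_map_equivalent(mut v: Vec<i32>) -> Vec<i32> {
    for x in v.iter_mut() {
        *x -= 1;
    }
    v
}
```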
Even when passing the array as a borrow instead of a clone [1], map still auto-vectorizes and performs the new allocation in one go, avoiding the bounds checks and possible calls to grow_one.
It seems `map` should have less restrictive semantics (specifically around ordering) than `for`; does that allow more optimization? I don't know much about Rust internals.
Reading the godbolt, it looks like for the push loop LLVM is unable to remove the `grow_one` capacity check after every push. Because of this, the Vec could possibly reallocate after every push, meaning it can't auto-vectorize.
It's a little bit surprising to me that LLVM can't eliminate the grow_one check. It looks like there's a test ensuring it's not needed in the easier case of vec.push(vec.pop()) [0]. With the iterator, the optimization is handled in the standard library using specialization and TrustedLen [1].
That's not accurate: it can be used while consuming an Iterator and, depending on the implementation, be used to guide the consumer at runtime. The stdlib likely is not doing this, but the API very much allows advanced behavior. We used this, e.g., in a university course to guide algorithm choice for operators in part of a query engine.
SpecFromIterNested is a specialization trait for alloc::vec::Vec's FromIterator which handles both the TrustedLen and ordinary Iterator scenarios.
For an ordinary Iterator, it calls next() once to check whether the Iterator is already done; if it is, we can just give back a Vec::new(), since that's exactly what was needed. Otherwise, it consults the hint's low estimate and pre-allocates enough capacity on that basis, unless it's lower than Vec's own guess of the minimum worthwhile initial capacity.
For Iterators which impl TrustedLen (i.e., they promise they know exactly how many items they yield), it instead checks the upper end of the hint to see if it's None; if it is, the iterator knows it's too big to store in memory, and we should panic. Otherwise, we can just Vec::with_capacity:
let v: Vec<_> = (0..23456).collect();
... will just give you a Vec with the 23456 values from zero to 23455 inclusive, it won't waste time growing that Vec because it knows from the outset that there are going to be exactly 23456 items in the Vec.
If I recall correctly, there was actually some unstable specialisation in the std library that allowed reusing the backing storage if you do an `into_iter()` and then a `collect()`.
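A sketch of the pattern in question (the reuse is an unstable internal specialization, not a documented guarantee, so don't rely on it for correctness):

```rust
fn double_all(v: Vec<i32>) -> Vec<i32> {
    // Input and output element types have the same size and alignment,
    // so the collect() may reuse v's original allocation instead of
    // making a new one.
    v.into_iter().map(|x| x * 2).collect()
}
```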
Is there a reason why the loop and map don't result in the exact same code?
It looks like the code does exactly the same thing, and like something the optimizer could catch. Is it because of potential side effects? If not, maybe there is a ticket to open somewhere, if there isn't one already.
Not only SIMD but also size hints, which avoid bounds checks and enable preallocation or even allocation reuse. These are often missing when using for loops.
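For example, the size hint on a mapped range is exact, which is what lets collect allocate once up front:

```rust
fn main() {
    let it = (0..1000).map(|x| x + 1);
    // Map forwards the range's exact size hint...
    assert_eq!(it.size_hint(), (1000, Some(1000)));
    // ...so collect() can allocate the Vec in a single step.
    let v: Vec<i32> = it.collect();
    assert_eq!(v.len(), 1000);
}
```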
The author's fold example is unfair. They could've just called accumulator.extend() and then returned the accumulator inside the fold for a fair apples-to-apples comparison. Just mark the accumulator as mut.
Furthermore, I'd use with_capacity in both cases: `Vec::with_capacity(list_of_lists.iter().map(|l| l.len()).sum())`
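Putting both suggestions together, a sketch of the fairer version (assuming `list_of_lists` is a `Vec<Vec<i32>>`; the article's element type may differ):

```rust
fn flatten_fold(list_of_lists: Vec<Vec<i32>>) -> Vec<i32> {
    let total: usize = list_of_lists.iter().map(|l| l.len()).sum();
    list_of_lists
        .into_iter()
        .fold(Vec::with_capacity(total), |mut accumulator, list| {
            // extend() appends in place; no clone of the accumulator needed.
            accumulator.extend(list);
            accumulator
        })
}
```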
I'm currently doing Advent of Code in Scala 3 to learn a bit of the language.
The for expression does weird things there. It can be used as a flat map, as it can iterate over multiple iterators at once and yield a value for each combination.
What's bitten me multiple times so far is that when you iterate over a Set or a Map, the result of the for expression is also a Set or a Map.
And since you have access to the keys during iteration, some iterations, if they return the same map key or the same set value, can get silently overwritten by others.
I don't remember having this problem in Rust because there I had to be very intentional about iterators.
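In Rust the same collapse can happen, but only if you explicitly collect into a map, which is hard to do by accident -- a small illustrative sketch, not from the article:

```rust
use std::collections::HashMap;

fn main() {
    let m = HashMap::from([(1, "a"), (2, "b")]);
    // Mapping both entries onto the same key: later insertions silently
    // overwrite earlier ones, just like the Scala for expression over a Map.
    let collapsed: HashMap<i32, &str> = m.into_iter().map(|(_k, v)| (0, v)).collect();
    assert_eq!(collapsed.len(), 1);
}
```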
The other thing that bit me was that arithmetic on Int overflows silently, but that's apparently a Java thing, which made me wonder how Java is an enterprise language.
Otherwise, Scala 3 is a superb experience. The syntax is ultra-flexible, and the local extensibility of everything -- plus access to things from the context where your code is defined, and even from the context where it's running -- is magical.
Why does fallible_flatten_fold have that accumulator.clone() in it? You're cloning the in-progress vector only to throw away the original; it's extremely wasteful and completely unnecessary. Just declare the accumulator as `mut`.
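One way to write the fallible version without the clone, assuming a per-list transform that returns a Result (the names here are illustrative, not the article's):

```rust
// Hypothetical fallible transform, standing in for the article's.
fn transform(list: Vec<i32>) -> Result<Vec<i32>, String> {
    Ok(list)
}

fn fallible_flatten_fold(list_of_lists: Vec<Vec<i32>>) -> Result<Vec<i32>, String> {
    list_of_lists
        .into_iter()
        .try_fold(Vec::new(), |mut accumulator, list| {
            // `?` works here because the closure itself returns a Result.
            accumulator.extend(transform(list)?);
            Ok(accumulator)
        })
}
```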
Dear author (and others who write things like this): it looks like you put considerable time into the code comparisons and benchmarks. Why not publish the source code for them? Benefits include:
1. The comparisons, as you've written them up, will probably get stale fast. It would be nice to be able to re-run them.
2. Some of the examples are surprising. Why? Any bugs? Some kind of weirdness? Readers would want to explore.
3. Readers can see how you did the benchmarks. We can put more/less stock in them. Hopefully you used `criterion` or similar, with warm ups, etc.
Without looking too much at the generated assembly in https://godbolt.org/z/fKEPaTdTv, it seems that using_map uses SIMD instructions (movdqu and psubd) in the main loop, while using_loop doesn't, so that could indeed explain the performance difference.
Maybe I'm dumb, but I can't see how the code in the "Errors and map" section can compile. "transform_list" returns a Result<>, yet "result" is just a Vec. I thought you always need to wrap it with Ok()? Is that a new nightly feature?
It's interesting to note that the performance of a for loop versus the functional-style mechanism varies by language. In Java there is a performance penalty (possibly shrunk since Java 8) for using the FP idioms, while in Rust they end up much faster.
That surprises me; I'd expect the FP version to be at least as fast, with the potential to be much faster. With the for loop, you're saying "run against this value, then run against the next one, then the next one, then...". If the compiler can't prove the iterations are free of side effects, it has to run each one in order before moving on to the next.
`.map(...)` implies "I don't care about ordering, and therefore you don't need to, either", freeing the compiler to schedule the loops in a more optimal order, or in parallel or with SIMD, or any other optimization that lets it get the job done as fast as possible. I'm sure someone will come up with an example, but I can't personally think of any way where a for-loop's semantics would let a clever compiler write faster code than the equivalent map.
Streams (FP in Java) are slower than for loops for a couple of reasons. A big one is that Java doesn't natively support real closures or lambdas. It does have syntax for them, but that is just syntactic sugar for a class with a single method under the hood. So streams end up doing a lot of object allocation and creating garbage for the fake closures.
Also, streams operate on objects, so they have to be on the heap; you can't use them with primitives on the stack. With autoboxing, the JVM may play some tricks with a list of Integer objects really being primitives on the stack, but I would never count on it.
As for SIMD, Java isn't going to parallelize anything automatically. You need to tell it to run the stream in parallel, which will split it across threads. Java doesn't have lightweight threads like coroutines.
I know lightweight threads are on the roadmap and maybe available in Java 21 or newer. I know real closures have been considered, but I don't know if that's gone anywhere. It's hard to do a quick search, because we got "closures" in Java 8, so there's a lot of noise.
And as a caveat, I am most familiar with Java 17 (and older). I expect we'll look at moving to Java 21 (current LTS) next year.
The big differences are:
1. Rust closures are by-value structs, whereas Java closures are heap objects.
2. Rust generics are monomorphized, whereas Java type-erases them, which means lots of virtual-call overhead when passing a closure to a generic function.
Sometimes, if the Java JIT manages to inline absolutely everything, it can optimize away these overheads. But in practice, Rust FP gets optimized a lot more reliably than Java FP.
The other problem with parallel streams is that they're badly implemented. Threads for parallel streams are pulled from a single thread pool shared across the whole application. So if you have multiple parallel streams in an application that's already inherently multithreaded (e.g. a web service), you end up with severe resource contention that makes the parallelism work poorly, if at all, and can end up deadlocking your app because all the threads are in use somewhere else. There's a workaround for it, but it requires some ugly boilerplate code.
> It does have syntax for them, but that is just syntactic sugar for a class with a single method under the hood. So streams end up doing a lot of object allocation and creating garbage for the fake closures.
In Rust, a closure is really just a struct that implements up to three closure traits, each of which provides a single method. So from that side of things, what Java is doing for them isn't inherently different from Rust.
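A rough sketch of what a capturing Rust closure desugars to (the real desugaring uses the Fn traits, which can't be implemented by hand on stable; this is the moral equivalent):

```rust
// `let add_n = move |x: i32| x + n;` is conceptually this:
struct AddN {
    n: i32, // captured by value; the whole closure is a plain struct
}

impl AddN {
    // stands in for the FnOnce/FnMut/Fn call methods
    fn call(&self, x: i32) -> i32 {
        x + self.n
    }
}
```

The difference is where it lives: this struct sits on the stack and gets monomorphized into its caller, while Java's equivalent single-method object lives on the heap behind an interface.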
It's interesting -- so, logically, `map` in Rust does imply an ordering. The closure is a `FnMut`, i.e. a callback which may mutably capture values, causing external side effects. And it is guaranteed that, if externally visible, mutations will be done in order.
But `FnMut` is just the most general thing you can pass in. In reality, most callbacks are pure functions that don't alter mutable state at all. With Rust's monomorphization and aggressive inlining, LLVM can figure out that there's no mutation going on and optimize accordingly.
There is a wrinkle here, which is that capturing variables mutably is one of two ways a function can have side effects in Rust. The other way is via interior mutability, through UnsafeCell [1], or, more commonly, a wrapper around it like Mutex or RefCell. In that case as well, Rust guarantees that function calls to map are done in order. Luckily, because UnsafeCell is the root of all interior mutability, the compiler can simply track whether an UnsafeCell is transitively involved.
If you're wondering where the humble `print!` comes in -- well, it clearly has side effects. But it acquires a global lock on standard output each time it's called [2], so UnsafeCell is involved.
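A tiny illustration of that ordering guarantee, with a mutably captured counter:

```rust
fn main() {
    let mut calls = 0;
    let v: Vec<i32> = (1..=3)
        .map(|x| {
            calls += 1; // externally visible mutation: guaranteed in order
            x * calls
        })
        .collect();
    assert_eq!(v, vec![1, 4, 9]); // 1*1, 2*2, 3*3
}
```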
Readability is subjective. I personally find fold almost always more readable than a for loop when the accumulator variable has a simple type. Merely seeing fold already tells me several things: it will iterate over the entire collection without early exits like "break" in a loop, and the data dependency between iterations is funneled through a single variable.
I find it slightly difficult to read when the accumulator variable actually has multiple parts, like a complicated tuple. It's worse when part of the accumulator is a bool indicating whether it's finished; that's just a poor emulation of "break" in a for loop.
AFAIR, I've mostly only used fold when doing maths not covered by the standard sum or product. Fold is similar to map-reduce, but it's just one expression.
Yeah I kind of wish there was a way to still use `?` inside map/filter/etc. lambdas. Error handling with that functional style is generally way more awkward than for loops, but also it's often more elegant in other ways (e.g. Rayon).
I think Ruby has some kind of feature that works like that, but IIRC it looked less foot-gun and more foot-bazooka. Does anyone know of any languages that solve this problem elegantly?
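For what it's worth, Rust partially has this already: `?` works inside a closure as long as the closure itself returns a Result, and collecting an iterator of Results into a `Result<Vec<_>, _>` short-circuits on the first error:

```rust
use std::num::ParseIntError;

fn main() {
    let inputs = ["1", "2", "oops", "4"];
    // Result implements FromIterator, so this stops at the first Err.
    let parsed: Result<Vec<i32>, ParseIntError> =
        inputs.iter().map(|s| s.parse::<i32>()).collect();
    assert!(parsed.is_err());
}
```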
I think what you want is a short-circuiting behavior.
I don't see a problem with the status quo: if map is able to short-circuit, that means it runs in order and has side effects.
If your "map" can cause the whole function to return, then just write a for-loop.
I don't even know if it should return from the whole function or just stop the mapping.