> Why is map so much faster? I am not sure. I suspect with the map() option the Rust compiler figures out it can avoid allocations altogether by simply writing over the original vector, while with the loop it can't. Or maybe it's using SIMD? I tried to look in the compiler explorer but I'm not competent enough yet to figure it out. Maybe someone else can explain!
Yep, it's due to SIMD -- in the assembly for `using_map`, you can spot pcmpeqd, movdqu, and psubd, while `using_loop` doesn't have any of these.
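For reference, the pair being compared presumably looks something like this (my reconstruction, not the article's exact code; the `- 1` transform is a guess based on `psubd` being a packed subtract):

```rust
// Hypothetical reconstruction of the two functions under discussion.
fn using_loop(v: Vec<i32>) -> Vec<i32> {
    let mut result = Vec::new();
    for x in v {
        result.push(x - 1);
    }
    result
}

fn using_map(v: Vec<i32>) -> Vec<i32> {
    v.into_iter().map(|x| x - 1).collect()
}
```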
It's not only SIMD. In contrast to many other languages (though not all), the Rust compiler is working with the actual closure body here, not an arbitrary function pointer. In essence, JS and the like are operating with this:
```
let result: Vec<i32> = list.into_iter().map::<_, Box<dyn Fn...>>(Box::new(transform)).collect()
```
Rust is able to inline the transform code right into the loop, which then becomes available for SIMD etc.
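A minimal sketch of the difference (function names are mine):

```rust
// Generic version: the closure's concrete type is part of the signature, so
// the compiler monomorphizes this function per closure and can inline the
// closure body straight into the loop.
fn map_generic<F: Fn(i32) -> i32>(list: Vec<i32>, f: F) -> Vec<i32> {
    list.into_iter().map(f).collect()
}

// Erased version: the closure sits behind a fat pointer, so every element
// costs an indirect call -- roughly the boxed shape shown above.
fn map_erased(list: Vec<i32>, f: Box<dyn Fn(i32) -> i32>) -> Vec<i32> {
    list.into_iter().map(|x| f(x)).collect()
}
```

Only the first shape hands LLVM a loop body it can actually vectorize.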
Rust further brings really nice ergonomics and comprehensive type inference to the equation, which makes it feel like writing C#/JS/whatever. JITted languages could detect and elide the pattern of creating and then immediately calling a function pointer, but I don't think any of them do.
Edit: these languages may not always allocate (specifically if nothing is captured in a closure), but the core concept remains: they erase the type, which means that they also erase the function body.
JS, Java and C# have vastly different implementation details each.
Java uses type erasure for generics. .NET uses generic monomorphization for struct-typed generic arguments and method body sharing with virtual/interface dispatch for class-typed generic arguments (the types are never erased).
Moreover, non-capturing lambdas do not allocate, and also get speculatively inlined by the JIT behind a guard. It's a bit limited but works quite well in production applications. You can also write struct-based iterators in C#. The main limitation is the lack of full HM type inference, which means a less convenient API where you can't convince the compiler to infer the full type signature.
One of the current limitations of C# is that lambdas are of type Func<T1...Tn, TResult> - calls through them are virtual. So unless JIT emits a guarded devirt path - you cannot specialize over them like over Fns in Rust which are part of the monomorphized generic signature. Various performance-oriented libraries sidestep this by implementing "value delegate" pattern where you constrain an argument over an interface implementation of an invoke-like method. Basically doing the higher order functions via struct implementations.
Java here also deserves a mention because OpenJDK is capable of inlining shallow streams -- the Stream API is moderately to significantly slower than LINQ, but it's not terribly slow in absolute terms.
With all that, in recent versions LINQ has started closing in on the performance of Rust iterators, especially on large sequences, where access to faster allocations and heavy pooling of the underlying buffers when collecting to an array or a list allows for very efficient hot paths. LINQ also does quite a bit of "flattening" internally, so chaining various operators does not necessarily add an extra layer of dispatch.
Lastly, F# is capable of lambda inlining together with the function accepting it at IL level at build time and does so for various iterator expressions like Array.map, .iter and similar. You access this via `inline` bindings and `[<InlineIfLambda>]`-annotated parameters. It is also possible to implement your own zero-cost-ish iterators with computation expressions. If JIT/ILC improves at propagating exact types through struct fields in the upcoming release, it will be able to inline F# lambdas even if expansion does not happen at IL level: https://github.com/dotnet/runtime/issues/110290
NB: auto-vectorization is extremely fragile even with LLVM and kicks in only in simple scenarios; the moment you have a side effect the compiler cannot reason about, it stops working.
Java’s type erasure means that generic type information is not available at runtime.
C# lambdas: although non-capturing lambdas do not allocate, capturing lambdas do.
"calls through them are virtual" is due to the underlying implementation of delegates in .NET.
Well, yes, but "delegate" as a term is not often used in other languages, so I did not mention it for simplicity's sake.
For what it's worth - the real issue in C# is not even the virtual calls but the way Roslyn caches lazily allocated non-capturing lambda instances. It does so in a compiler-unfriendly way due to questionable design decisions inside Roslyn.
Luckily, this has a high chance of changing in .NET 10. Ideally, by the time it releases, the compiler will both understand Roslyn's caching pattern better and be able to stack-allocate non-escaping lambda closure instances.
Lambdas capturing 'this' inside instance methods of the object they refer to do not allocate either.
The SIMD part is true, but the original guess is also correct, and that effect is bigger!
using_map is faster because it's not allocating: it's re-using the input array. That is, it operates on the input `v` value in place.
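In effect, the optimizer turns it into something like this in-place version (a sketch; I'm assuming a simple `x - 1` transform, which may not match the article's exact code):

```rust
// Hypothetical hand-written equivalent of what using_map optimizes to:
// mutate the input buffer in place and hand the same allocation back,
// so no new allocation ever happens.
fn using_map_in_place(mut v: Vec<i32>) -> Vec<i32> {
    for x in v.iter_mut() {
        *x -= 1;
    }
    v
}
```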
Even when passing the array as a borrow instead of a clone[1], map still auto-vectorizes and performs the new allocation in one go, avoiding the bounds check and possible calls to grow_one
It seems `map` should have less restrictive semantics (specifically ordering) than `for`, does that allow more optimization? I don't know much about Rust internals.
Reading the Godbolt output, it looks like for the push loop LLVM is unable to remove the `grow_one` capacity check after every push. Because of this, the Vec could possibly reallocate after every push, meaning the loop can't auto-vectorize.
It's a little bit surprising to me that LLVM can't eliminate the grow_one check. It looks like there's a test to ensure it's not needed in the easier case of vec.push(vec.pop()) [0]. With the iterator, the optimization is handled in the standard library using specialization and TrustedLen [1].
That's not accurate: it can be consulted while consuming an Iterator and, depending on the implementation, used to guide the consumer at runtime. The stdlib likely is not doing this, but the API very much allows advanced behavior. We, e.g., used this for part of a query engine in a university course to guide algorithm choice for operators.
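A toy version of that idea, with made-up names and thresholds: consult `size_hint()` up front and branch on it.

```rust
// Hypothetical consumer that picks a strategy based on the iterator's
// size_hint(), similar in spirit to the query-engine use case described above.
fn sum_with_strategy<I: Iterator<Item = i64>>(iter: I) -> (i64, &'static str) {
    match iter.size_hint() {
        // Known-small input: a plain loop is fine.
        (_, Some(upper)) if upper <= 1024 => (iter.sum(), "small"),
        // Large or unbounded: a real engine would switch to a chunked or
        // parallel path here; this sketch just labels the branch.
        _ => (iter.sum(), "large-or-unknown"),
    }
}
```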
`SpecFromIterNested` is a specialization trait behind `alloc::vec::Vec`'s `FromIterator` impl which handles both the `TrustedLen` and ordinary `Iterator` scenarios.
For an ordinary Iterator, it calls next() once to check whether the Iterator is already done; if it is, we can just hand back a Vec::new(), since that's exactly what was needed. Otherwise, it consults the hint's low estimate and pre-allocates that much capacity, unless it's lower than Vec's own guess of the minimum worthwhile initial capacity.
For Iterators which impl TrustedLen (i.e. they promise they know exactly how many items they yield), it instead checks the upper end of the hint to see if it's None; if it is, the iterator knows it's too big to store in memory, and we should panic. Otherwise we can just Vec::with_capacity the exact size.
```
let v: Vec<_> = (0..23456).collect();
```
... will just give you a Vec with the 23456 values from zero to 23455 inclusive, it won't waste time growing that Vec because it knows from the outset that there are going to be exactly 23456 items in the Vec.
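You can observe that pre-sizing directly (note `with_capacity` only guarantees "at least" the requested capacity, so the sketch asserts a lower bound):

```rust
// The Range iterator implements TrustedLen, so collect() makes a single
// with_capacity call and never reallocates while filling.
fn collect_exact() -> Vec<i32> {
    let v: Vec<i32> = (0..23456).collect();
    // Capacity was requested up front; no growth happened along the way.
    assert!(v.capacity() >= v.len());
    v
}
```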
If I recall correctly, there was actually some unstable specialisation in the std library that allowed reusing the backing storage if you do an `into_iter()` and then a `collect()`.
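You can sometimes observe that reuse by comparing data pointers, though it's an internal optimization, not a documented guarantee, so the sketch below only reports whether it happened:

```rust
// Sketch: when the element sizes match, collect() can reuse the source Vec's
// buffer (std's internal in-place collection specialization). The pointer
// comparison just observes whether reuse occurred.
fn map_observing_reuse(v: Vec<i32>) -> (Vec<i32>, bool) {
    let before = v.as_ptr() as usize;
    let mapped: Vec<i32> = v.into_iter().map(|x| x * 10).collect();
    let reused = mapped.as_ptr() as usize == before;
    (mapped, reused)
}
```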
Is there a reason why the loop and map don't result in the exact same code?
It looks like the code does exactly the same thing, and something the optimizer could catch. Is it because of potential side effects? If not, maybe there is a ticket to open somewhere, if it isn't done already.
Not only SIMD but also size hints, which avoid bounds checks and enable preallocation or even allocation reuse. These are often missing when using for loops.