> Why is map so much faster? I am not sure. I suspect with the map() option the Rust compiler figures out it can avoid allocations altogether by simply writing over the original vector, while with the loop it can't. Or maybe it's using SIMD? I tried to look in the compiler explorer but I'm not competent enough yet to figure it out. Maybe someone else can explain!
Yep, it's due to SIMD -- in the assembly for `using_map`, you can spot pcmpeqd, movdqu, and psubd, while `using_loop` doesn't have any of these.
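For reference, the pair being compared presumably looks something like this (my reconstruction, not the article's exact code; the `- 1` transform is a guess based on `psubd` being a packed subtract):

```rust
// Hypothetical reconstruction of the two functions under discussion.
fn using_loop(v: Vec<i32>) -> Vec<i32> {
    let mut result = Vec::new();
    for x in v {
        result.push(x - 1);
    }
    result
}

fn using_map(v: Vec<i32>) -> Vec<i32> {
    v.into_iter().map(|x| x - 1).collect()
}
```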
It's not only SIMD. In contrast to many other languages (though not all), the Rust compiler is working with the actual closure body here, not an arbitrary function pointer. In essence, JS and the like are operating with this:
```
let result: Vec<i32> = list.into_iter().map::<_, Box<dyn Fn...>>(Box::new(transform)).collect()
```
Rust is able to inline the transform code right into the loop, which then becomes available for SIMD etc.
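A minimal sketch of the difference (function names are mine):

```rust
// Generic version: the closure's concrete type is part of the signature, so
// the compiler monomorphizes this function per closure and can inline the
// closure body straight into the loop.
fn map_generic<F: Fn(i32) -> i32>(list: Vec<i32>, f: F) -> Vec<i32> {
    list.into_iter().map(f).collect()
}

// Erased version: the closure sits behind a fat pointer, so every element
// costs an indirect call -- roughly the boxed shape shown above.
fn map_erased(list: Vec<i32>, f: Box<dyn Fn(i32) -> i32>) -> Vec<i32> {
    list.into_iter().map(|x| f(x)).collect()
}
```

Only the first shape hands LLVM a loop body it can actually vectorize.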
Rust further brings really nice ergonomics and comprehensive type inference to the equation, which makes it feel like writing C#/JS/whatever. JITted languages could detect and elide the pattern of creating and then immediately calling a function pointer, but I don't think any of them do.
Edit: these languages may not always allocate (specifically if nothing is captured in a closure), but the core concept remains: they erase the type, which means that they also erase the function body.
JS, Java and C# have vastly different implementation details each.
Java uses type erasure for generics. .NET uses generic monomorphization for struct-typed generic arguments and method body sharing with virtual/interface dispatch for class-typed generic arguments (the types are never erased).
Moreover, non-capturing lambdas do not allocate, and also get speculatively inlined by the JIT behind a guard. It's a bit limited but works quite well in production applications. You can also write struct-based iterators in C#. The main limitation is the lack of full HM type inference, which means a less convenient API where you can't convince the compiler to infer the full type signature.
One of the current limitations of C# is that lambdas are of type Func<T1...Tn, TResult> - calls through them are virtual. So unless JIT emits a guarded devirt path - you cannot specialize over them like over Fns in Rust which are part of the monomorphized generic signature. Various performance-oriented libraries sidestep this by implementing "value delegate" pattern where you constrain an argument over an interface implementation of an invoke-like method. Basically doing the higher order functions via struct implementations.
Java here also deserves a mention because OpenJDK is capable of inlining shallow streams -- the Stream API is moderately to significantly slower than LINQ, but it's not terribly slow in absolute terms.
With all that, in recent versions LINQ has started closing in on the performance of Rust iterators, especially on large sequences, where access to faster allocations and heavy pooling of the underlying buffers when collecting to an array or a list allows for very efficient hot paths. LINQ also does quite a bit of "flattening" internally, so chaining various operators does not necessarily add an extra layer of dispatch.
Lastly, F# is capable of lambda inlining together with the function accepting it at IL level at build time and does so for various iterator expressions like Array.map, .iter and similar. You access this via `inline` bindings and `[<InlineIfLambda>]`-annotated parameters. It is also possible to implement your own zero-cost-ish iterators with computation expressions. If JIT/ILC improves at propagating exact types through struct fields in the upcoming release, it will be able to inline F# lambdas even if expansion does not happen at IL level: https://github.com/dotnet/runtime/issues/110290
NB: auto-vectorization is extremely fragile even with LLVM and kicks in only in simple scenarios; the moment you have a side effect the compiler cannot reason about, it stops working.
Java’s type erasure means that generic type information is not available at runtime.
C# lambdas: although non-capturing lambdas do not allocate, capturing lambdas do.
"calls through them are virtual" is due to the underlying implementation of delegates in .NET.
Well, yes, but "delegate" as a term is not often used in other languages, so I did not mention it for simplicity's sake.
For what it's worth - the real issue in C# is not even the virtual calls but the way Roslyn caches lazily allocated non-capturing lambda instances. It does so in a compiler-unfriendly way due to questionable design decisions inside Roslyn.
Luckily, this has a high chance of changing in .NET 10. Ideally, by the time it releases, the compiler will both understand Roslyn's caching pattern better and be able to stack-allocate non-escaping lambda closure instances.
Lambdas capturing 'this' inside instance methods of the object they refer to do not allocate either.
The SIMD part is true, but the original guess is also correct, and that effect is bigger!
using_map is faster because it's not allocating: it's re-using the input array. That is, it operates on the input `v` value in place.
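In effect, the optimizer turns it into something like this in-place version (a sketch; I'm assuming a simple `x - 1` transform, which may not match the article's exact code):

```rust
// Hypothetical hand-written equivalent of what using_map optimizes to:
// mutate the input buffer in place and hand the same allocation back,
// so no new allocation ever happens.
fn using_map_in_place(mut v: Vec<i32>) -> Vec<i32> {
    for x in v.iter_mut() {
        *x -= 1;
    }
    v
}
```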
Even when passing the array as a borrow instead of a clone[1], map still auto-vectorizes and performs the new allocation in one go, avoiding the bounds check and possible calls to grow_one
It seems `map` should have less restrictive semantics (specifically ordering) than `for`, does that allow more optimization? I don't know much about Rust internals.
Reading the Godbolt output, it looks like for the push loop LLVM is unable to remove the `grow_one` capacity check after every push. Because of this, the Vec could possibly reallocate after every push, meaning the loop can't auto-vectorize.
It's a little bit surprising to me that LLVM can't eliminate the grow_one check. It looks like there's a test to ensure it's not needed in the easier case of vec.push(vec.pop()) [0]. With the iterator, the optimization is handled in the standard library using specialization and TrustedLen [1].
That's not accurate: it can be consulted while consuming an Iterator and, depending on the implementation, used to guide the consumer at runtime. The stdlib likely is not doing this, but the API very much allows advanced behavior. We, e.g., used this for part of a query engine in a university course to guide algorithm choice for operators.
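A toy version of that idea, with made-up names and thresholds: consult `size_hint()` up front and branch on it.

```rust
// Hypothetical consumer that picks a strategy based on the iterator's
// size_hint(), similar in spirit to the query-engine use case described above.
fn sum_with_strategy<I: Iterator<Item = i64>>(iter: I) -> (i64, &'static str) {
    match iter.size_hint() {
        // Known-small input: a plain loop is fine.
        (_, Some(upper)) if upper <= 1024 => (iter.sum(), "small"),
        // Large or unbounded: a real engine would switch to a chunked or
        // parallel path here; this sketch just labels the branch.
        _ => (iter.sum(), "large-or-unknown"),
    }
}
```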
`SpecFromIterNested` is a specialization trait behind `alloc::vec::Vec`'s `FromIterator` impl which handles both the `TrustedLen` and ordinary `Iterator` scenarios.
For an ordinary Iterator, it calls next() once to check whether the Iterator is already done; if it is, we can just hand back a Vec::new(), since that's exactly what was needed. Otherwise, it consults the hint's low estimate and pre-allocates that much capacity, unless it's lower than Vec's own guess of the minimum worthwhile initial capacity.
For Iterators which impl TrustedLen (i.e. they promise they know exactly how many items they yield), it instead checks the upper end of the hint to see if it's None; if it is, the iterator knows it's too big to store in memory, and we should panic. Otherwise we can just Vec::with_capacity the exact size.
```
let v: Vec<_> = (0..23456).collect();
```
... will just give you a Vec with the 23456 values from zero to 23455 inclusive, it won't waste time growing that Vec because it knows from the outset that there are going to be exactly 23456 items in the Vec.
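You can observe that pre-sizing directly (note `with_capacity` only guarantees "at least" the requested capacity, so the sketch asserts a lower bound):

```rust
// The Range iterator implements TrustedLen, so collect() makes a single
// with_capacity call and never reallocates while filling.
fn collect_exact() -> Vec<i32> {
    let v: Vec<i32> = (0..23456).collect();
    // Capacity was requested up front; no growth happened along the way.
    assert!(v.capacity() >= v.len());
    v
}
```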
If I recall correctly, there was actually some unstable specialisation in the std library that allowed reusing the backing storage if you do an `into_iter()` and then a `collect()`.
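You can sometimes observe that reuse by comparing data pointers, though it's an internal optimization, not a documented guarantee, so the sketch below only reports whether it happened:

```rust
// Sketch: when the element sizes match, collect() can reuse the source Vec's
// buffer (std's internal in-place collection specialization). The pointer
// comparison just observes whether reuse occurred.
fn map_observing_reuse(v: Vec<i32>) -> (Vec<i32>, bool) {
    let before = v.as_ptr() as usize;
    let mapped: Vec<i32> = v.into_iter().map(|x| x * 10).collect();
    let reused = mapped.as_ptr() as usize == before;
    (mapped, reused)
}
```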
Is there a reason why the loop and map don't result in the exact same code?
It looks like the code does exactly the same thing, and something the optimizer could catch. Is it because of potential side effects? If not, maybe there is a ticket to open somewhere, if it isn't done already.
Not only SIMD but also size hints, which avoid bounds checks and enable preallocation or even allocation reuse. These are often missing when using for loops.