Should Small Rust Structs be Passed by-copy or by-borrow?

August 26, 2019

Like many good stories, this one started with a simple question. Should small Rust structs be passed by-copy or by-borrow? For example:

struct Vector3 {
    x: f32,
    y: f32,
    z: f32
}

fn dot_product_by_copy(a: Vector3, b: Vector3) -> float {
    a.x*b.x + a.y*b.y + a.z*b.z
}

fn dot_product_by_borrow(a: &Vector3, b: &Vector3) -> float {
    a.x*b.x + a.y*b.y + a.z*b.z
}

This simple question sent me on a benchmarking odyssey with some surprising twists and discoveries.

Why It Matters

The answer to this question matters for two reasons — performance and ergonomics.

Performance

Passing by-copy should mean we copy 12-bytes per Vector3. Passing by-borrow should pass an 8-byte pointer per Vector3 (on 64-bit). That's close enough to maybe not matter.

But if we change f32 to f64 now it's 8-bytes (by-borrow) versus 24-bytes (by-copy). For code that uses a Vector4 of f64 we're suddenly talking about 8-bytes versus 32-bytes.

Ergonomics

In C++ I know exactly how I'd write this.

struct Vector3 {
    float x;
    float y;
    float z;
};

float dot_product(Vector3 const& a, Vector3 const& b) {
    return a.x*b.x + a.y*b.y + a.z*b.z
}

Easy peasy. Pass by const-reference and call it day.

The problem with Rust is ergonomics. When passing by-copy you can combine mathematical operations cleanly and simply.

fn do_math(p1: Vector3, p2: Vector3, d1: Vector3, d2: Vector3, s: f32, t: f32) -> f32 {
    let a = p1 + s*d1;
    let b = p2 + s*d2;
    dot_product(b - a, b - a)
}

However when using borrow semantics it turns into this ugly mess:

fn do_math(p1: &Vector3, p2: &Vector3, d1: &Vector3, d2: &Vector3, s: f32, t: f32) -> f32 {
    let a = p1 + &(&d1*s);
    let b = p2 + &(&d2*t);
    let result = dot_product(&(&b - &a), &(&b - &a));
}

Blech! Having to explicitly borrow temporary values is super gross. 🤮

Building a Benchmark

So, should Rust pass small structs, like Vector3, by-copy or by-borrow?

None of Twitter, Reddit, or StackOverflow had a good answer. I checked popular crates like nalgebra (by-borrow) and cgmath (by-value) and found both ways are common.

I don't like the ergonomics of by-borrow. But what about performance? If by-copy is fast then none of this matters. So I did the only thing that seemed reasonable. I built a benchmark!

I wanted some to test something slightly more than raw operator performance. It's still a silly synthetic benchmark that is not representative of a real application. But it's a good starting point. Here's roughly what I came up with.

let num_shapes = 4000;
for cycle in 0...5 {
    let (spheres, capsules, segments, triangles) = generate_shapes(num_shapes);
    for run in 0..5 {
        for (a,b) in collision_pairs {
            test_by_copy(a,b)
        }
        for pair in collision_pairs {
            test_by_borrow(&a, &b)
        }
    }
}

I randomly generate 4000 spheres, capsules, segments, and triangles. Then I perform a simple overlap test for SphereSphere, SphereCapsule, CapsuleCapsule, and SegmentTriangle for all pairs. These tests are run by-copy and by-borrow. Only time spent inside test_by_copy and test_by_borrow is counted.

Each full benchmark performs 3.2 billion comparisons and finds ~220 million overlapping pairs. Here are some results running single-threaded on my beefy i7-8700k Windows desktop. All times are in milliseconds.

  Rust
    f32 by-copy:   7,109
    f32 by-borrow: 7,172 (0.88% slower)

    f64 by-copy:   9,642
    f64 by-borrow: 9,601 (0.42% faster)

Well this is mildly surprising. Passing by-copy or by-borrow barely makes a difference! These results are quite consistent. Although a difference of less than 1% is well within the margin of error.

Is this the answer to our question? Should we pass by-copy and call it a day? I'm not ready to say.

Down the C++ Rabbit Hole

After my initial Rust benchmarks I decided to port my test suite to C++. The code is similar, but not identical. Both Rust and C++ implementations are what I would consider idiomatic in their respective languages.

  C++
    f32 by-copy:   14,526
    f32 by-borrow: 13,880 (4.5% faster)

    f64 by-copy:   13,439
    f64 by-borrow: 13,942 (3.8% slower)

Wait, what?! At least two things are super weird here.

double by-value is faster than float by-value
C++ float is twice as slow as Rust f32

Inlining

Clearly something unexpected is going on. Using Visual Studio 2019 I grabbed a pair of quick CPU profiles.

C++ Profile

Rust Profile

Ah hah! Rust appears to be inlining almost everything. Let's copy Rust and throw a quick __forceinline infront of everything in our C++ impl.

  C++ w/ inlining
    f32 by-copy:   12,688
    f32 by-borrow: 12,108 (4.5% faster)

    f64 by-copy:   11,860
    f64 by-borrow: 11,967 (0.9% slower)

Inlining C++ provides a decent ~12% boost. But double is still faster than float. And C++ is still way slower than Rust.

Aliasing

I would consider my C++ and Rust implementations to both be idiomatic. However they are different! C++ takes out parameters by reference while Rust returns a tuple. This is because Rust tuples are delightful to use and C++ tuples are a monstrosity. But I digress.

// Rust
fn closest_pt_segment_segment(p1: Vector3, q1: Vector3, p2: Vector3, q2: Vector3) 
-> (T, T, T, Vector3, Vector3) 
{
    // Do math
}

// C++
float closest_pt_segment_segment(
    Vector3 p1, Vector3 q1, Vector3 p2, Vector3 q2,
    float& s, float& t, Vector3& c1, Vector3& c2)
{
    // Do math
}

This subtle difference could cause a huge impact on performance. The C++ version compiler can't be sure the out parameters aren't aliased. Which may limit its ability to optimize. Meanwhile Rust uses and returns local variables which are known to be non-aliased.

Interestingly, fixing the aliasing above doesn't make a difference! With inlining the compiler handles it already. Much to my surprise, what C++ does not handle well is the following:

void run_test(
    vector<TSphere> const& _spheres,
    vector<TCapsule> const& _capsules,
    vector<TSegment> const& _segments,
    vector<TTriangle> const& _triangles,
    int64_t& num_overlaps,
    int64_t& milliseconds)
{
    // perform overlaps
}

Changing run_test to return std::tuple<int64_t, int64_t> provides a small but noticeable improvement.

  C++ w/ inlining, tuples
    f32 by-copy:   12,863
    f32 by-borrow: 11,555 (10.17% faster)

    f64 by-copy:   11,832
    f64 by-borrow: 11,524 (2.60% faster)

Compile Flags

At this point both C++ and Rust are compiling with default options. Visual Studio exposes a ton of flags. I tried tweaking a bunch of flags to improve performance.

Favor fast code (/Ot)
Disable exceptions
Advanced Vector Extensions 2 (/arch:AVX2)
Floating Point Mode: Fast (/fp:fast)
Enable Floating Point Exceptions: No (/fp:except-)
Disable security check /GS-
Control flow guard: No

The only flags that made a real difference were "disable exceptions" and AVX2. Each about 10%. I decided to leave off AVX2 in an attempt to equal Rust.

  C++ w/ inlining, tuples, no C++ exceptions
    f32 by-copy:   11,651
    f32 by-borrow: 10,455 (10.27% faster)

    f64 by-copy:   10,866
    f64 by-borrow: 10,467 (3.67% faster)

We've made three C++ optimizations but our two mysteries remain. Why is double faster than float? And why is C++ still so much slower than Rust?

Going Deeper

I tried looking at the disassembly in Godbolt. There's obviously differences. But I'm not smart enough to quantify them.

Next I decided to crack open VTune. Here the difference was revealed plain as day!

Branch misprediction is awful. Not surprising. Take a close look at Vector Capacity Usage (FPU). Here's how that value is reported for different builds:

  Rust f32    42.3%
  Rust f64    32.7%
  C++ float 12.5%
  C++ double 25.0%

Yowza! For whatever reason, and I don't know exactly what, Rust is remarkably more efficient at utilizing my CPU's floating point vectors. The difference is enormous! Rust f32 is almost 3.5x more efficient than C++ float. And, for some reason, C++ double is twice as efficient as float! 🤷‍♂️

The obvious guess is that the Rust compiler simply does a better job of auto-vectorization. Is that the whole story or is there more? I honestly don't know. This is as deep as I'm going to dig for now. If any experts would like to chime in with more details I'm all ears.

By-Copy versus By-Borrow

Somehow my Rust article has spent more time talking about C++. But that's ok!

Our initial tests showed a less than one-percent different between by-copy and by-borrow. In my synthetic test the key reason appears to be because Rust auto-inlined the crap out of my code. Despite my not having a single #[inline] declaration! Furthermore, Rust automatically produced code twice as fast as a similar C++ implementation.

My answer to by-copy versus by-borrow for small Rust structs is by-copy. This answer comes with several caveats.

The definition of "small" is untested and unknown.
Less mathematical structs may not work as well due to less auto-vectorization.
Performant by-copy code likely requires inlining and trusting the compiler. 😱
Including small structs implemented in another crate is untested.
Synthetic benchmarks are not reflective of the real world. Always measure!

Conclusion

I started with a simple question and went on a fun and educational journey. At this point I'm going to move forward with passing small structs by-copy. I'm not 100% convinced it's the right choice in all cases. But I'm going to give it a whirl for a vector math library.

I suspect there will be a scenario where this is not as performant. I would advise cautious optimism. Trust the compiler to make good decisions... but verify your real-world scenario.

Thanks for reading.

Source Code

Full source code for my benchmark is available.

FullProjects.zip

Rust
main.rs
by_copy.rs
by_borrow.rs

C++
main.cpp
by_value.cpp
by_ref.cpp

Clang

All C++ benchmarks above were done using the MSVC build chain in Visual Studio 2019. I wondered if Clang would do a better job. It's a little faster for by-value. But it does a remarkably poor job of handling doubles by-ref.

  C++ (Clang) w/ inlining, tuples, no C++ exceptions
    f32 by-copy:   10416
    f32 by-borrow: 10072 (3.30% faster)

    f64 by-copy:   10520
    f64 by-borrow: 11335 (7.47% slower)

All of this code could be optimized to be better. It should be possible to get C++ and Rust to the exact same level. But that's beyond the scope of this article and answering my original question.