Ruby raytracer on M1

While doing something other than what I had planned, I accidentally came back to the raytracer I once wrote in Ruby, and decided that the important thing right now is to benchmark my MacBook Pro M1 against those old results. I guess this will actually be Ruby 3 on M1 vs Ruby 2 on i7, because time happened.

Single-threaded

Starting out with the single-threaded teapot rendering benchmark. This should give a pretty good measure of raw performance.

MacBook Pro M1 (2021)
[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true
 Setup time: 0.272459 s
Render time: 354.255657 s
 Total time: 354.528116 s

iMac i7 (2012)
[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true
 Setup time: 0.532384 s
Render time: 259.392946 s
 Total time: 259.92533 s

But wait? Setup time is twice as fast, but rendering is 30% slower? To quote Arueshalae, something is not right here.
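To put numbers on that, here is a quick sanity check using the timings printed above. The helper name `percent_change` is mine and not part of the raytracer.

```ruby
# Timings copied from the single-threaded runs above, in seconds.
M1_SINGLE = { setup: 0.272459, render: 354.255657 }.freeze
I7_SINGLE = { setup: 0.532384, render: 259.392946 }.freeze

# Relative change of the M1 time against the i7 time, in percent.
# Negative means the M1 took less time, positive means it took more.
def percent_change(m1_seconds, i7_seconds)
  ((m1_seconds / i7_seconds) - 1.0) * 100.0
end

M1_SINGLE.each do |phase, m1_seconds|
  change = percent_change(m1_seconds, I7_SINGLE.fetch(phase))
  puts format('%-6s %+.1f%%', phase, change)
end
```

Measured in wall-clock time, setup takes roughly 49% less time on the M1, while the render takes roughly 37% longer. Measured in throughput, that is about 27% fewer pixels per second.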

Multi-threaded

This is the one that made the i7 melt, without providing many more pixels per second than at half the threads. Let’s see how the M1 fares here. Both have about the same number of full-performance threads, and the same total level of parallelism: 4C + 4c vs 4C * 2T.

MacBook Pro M1 (2021)
[2] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8
 Setup time: 0.261933 s
Render time: 70.432901 s
 Total time: 70.694834 s

So, render time is about 1/5 of single-threaded. With the highly suboptimal implementation, I guess this is about as good as can be expected. I think this was the first time I could actually hear the CPU fan, in the year I’ve had this machine.

iMac i7 (2012)
[3] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8
 Setup time: 0.560456 s
Render time: 77.613168 s
 Total time: 78.173624 s

Now we’re suddenly doing better than the i7, but that might be due to thermal throttling? Could also be that the “efficiency” cores add more crunch to the mix than the hyperthreads did.

Still, the setup — building the Kd-tree — is twice as fast. This does not add up.

Face palm

Those earlier benchmarks were run after all the primary optimizations had been made, but before all the extra — performance draining — render quality features were added.

Like soft shadows.

All. Those. Shadow. Rays. The Shadow? The Shadow! Rays!

Easily fixed: just shrink the spherical light source to point size, and all those extra shadow rays go away.
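Why the size of the light matters so much can be sketched in a few lines. This is a simplified illustration and not the raytracer’s actual code: `light_visibility`, the 16 sample positions and the stand-in occlusion test are all hypothetical. The point is only that every sample position on the light costs one shadow ray per surface hit.

```ruby
# Hypothetical sketch: fraction of a light that is visible from one hit point.
# `light_samples` are positions on the light; `occluded` is any callable that
# casts one shadow ray towards a position and reports whether it is blocked.
def light_visibility(light_samples, occluded)
  lit = light_samples.count { |position| !occluded.call(position) }
  lit.fdiv(light_samples.size)
end

shadow_rays = 0
# Stand-in occlusion test: counts rays and pretends positions with x < 0 are blocked.
occluded = lambda do |position|
  shadow_rays += 1
  position[0].negative?
end

point_light = [[0.0, 5.0, 0.0]]                          # a single position
area_light  = Array.new(16) { |i| [i - 7.5, 5.0, 0.0] }  # 16 assumed sample positions

hard = light_visibility(point_light, occluded)
rays_for_point_light = shadow_rays

shadow_rays = 0
soft = light_visibility(area_light, occluded)
rays_for_area_light = shadow_rays

puts "point light: #{rays_for_point_light} shadow ray, visibility #{hard}"
puts "area light:  #{rays_for_area_light} shadow rays, visibility #{soft}"
```

With a point light the answer is all or nothing, which gives the hard shadow edge. With a sampled light the fractional visibility gives the soft edge, at the price of one extra ray per sample.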

Hard shadows

There we go; twice as fast single-threaded rendering.

[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true
 Setup time: 0.240659 s
Render time: 122.153057 s
 Total time: 122.393716 s

BOOM lightning fast — almost thrice as fast multi-threaded

[2] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8
 Setup time: 0.266847 s
Render time: 28.061283 s
 Total time: 28.32813 s

Mystery solved, but where’s the fun in stopping now?

Going All In — Soft Shadows and Anti-Aliasing

The final original benchmark was rendering soft shadows with velvet smooth anti-aliasing, so let’s do that too, while we’re here.

MacBook Pro M1 (2021)

This time we might have hit the thermal throttling of the M1 as well, because the CPU fan is wheezing louder and louder, and the hand rest is getting warm and toasty like you’d want when coming home on a cold winter night.

[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8, ssaa:3
 Setup time: 0.23379 s
Render time: 601.335931 s
 Total time: 601.569721 s

iMac i7 (2012)

This time we’re only 2 1/2 times as fast on the M1, and by the sound and feel of it, thermal throttling is very much in effect.

[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8, ssaa:3
 Setup time: 0.563097 s
Render time: 1501.46532 s
 Total time: 1502.028417 s

But wait, there’s more!

My other computer is also an ARM

So, I bought two new computers last year. The other was a Raspberry Pi 400.

This is a 4-core, passively cooled beast, yet it still runs overclocked from 1.8GHz to 2.2GHz without breaking a sweat. For normal “heavy” use, like running video streams from rC3 all day, that is.

Bonus: Raspberry Pi 400 OC 2.2GHz

Sitting steady at 66°C core temperature with 30% of the rendering done, and an estimate of 2400s to go. This estimate will go up though, because the first 30% is the easy part.

[6] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:4, ssaa:3
 Setup time: 1.374378639 s
Render time: 5541.122792698 s
 Total time: 5542.497171337 s

There we go. The final results are in. M1 beats everything by far, as expected, but the RPi400 is just 9 times slower, on passive cooling. And it is no warmer to the touch after 1 1/2 hours of hard work than the actively cooled M1 was after 10 minutes.

Follow-up

RasPi 400 bonus benchmarks

I didn’t quite finish the bonus benchmarks, so for the sake of completeness, here we go. The missing ones for the RasPi 400.

Hard shadows

The Raspi only got a chance to show off with the heaviest of the benchmarks, so let’s see how it does in the lower end of teapot image quality.

Raspberry Pi 400 OC 2.2GHz (single threaded)

Single-threaded performance comes in at about 1/7 of the M1 here (852 s vs 122 s), a little better than the 1/9 seen in the heaviest benchmark.

[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true
 Setup time: 1.375927773 s
Render time: 852.156157512 s
 Total time: 853.532085285 s

Raspberry Pi 400 OC 2.2GHz (multi-threaded)

This one would seem a bit unfair, because the RasPi 400 only gets 4 threads to do the work, instead of 8 threads for the other ones.

[2] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:4
 Setup time: 1.379623305 s
Render time: 292.012566677 s
 Total time: 293.392189982 s

The huge falloff in performance for those last 4 threads shows pretty clearly here. The RasPi 400 performance/thread scales pretty well at 3/4, given the huge overhead of the threaded renderer, but the M1 only did 4/8, and the i7 was even worse with 3/8.
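Those fractions can be recomputed from the render times listed in this post. A small sketch, where the helper name `scaling` is my own and the i7 figures are the original runs from before soft shadows were added:

```ruby
# Render times from the runs above: [single-threaded s, multi-threaded s, threads].
RUNS = {
  'RasPi 400' => [852.156157512, 292.012566677, 4],
  'M1'        => [122.153057, 28.061283, 8],
  'i7'        => [259.392946, 77.613168, 8]
}.freeze

# Speedup is single-threaded time over multi-threaded time;
# efficiency is that speedup divided by the number of threads used.
def scaling(single_seconds, multi_seconds, threads)
  speedup = single_seconds / multi_seconds
  [speedup, speedup / threads]
end

RUNS.each do |machine, (single, multi, threads)|
  speedup, efficiency = scaling(single, multi, threads)
  puts format('%-10s %.2fx on %d threads (%.0f%% per thread)',
              machine, speedup, threads, efficiency * 100)
end
```

That works out to roughly 2.9x on 4 threads for the RasPi 400, 4.4x on 8 threads for the M1, and 3.3x on 8 threads for the i7.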

Still, the roughly 4x multi-threaded speedup on the M1 is more than the 3x on the RasPi 400, so the total performance ratio, when using every single bit of crunch there is, comes out at less than 1/10 of the M1.

More interesting is that the RasPi 400 did the soft-shadows render at only 19x the time for hard-shadows, but the M1 took more than 21x the time. The passive thermal design of the RasPi 400 continues to amaze me, especially given the long history of overheated pies.

Ruby Raytracer on M2 Max

A few years (and iterations of the Apple M-series) have passed, and I recently found myself in possession of a Mac Studio M2, in its base configuration of 8P + 4E CPU cores, and even more ridiculous[1] amounts of cache.

But the burning question is, does it deliver on the promise of up to 18% CPU speed improvement?

Hard shadows

The baseline benchmark, as before, is the single-threaded render without any fancy rendering techniques active.

Mac Studio M2 (2023)
[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true
 Setup time: 0.178938 s
Render time: 104.306864 s
 Total time: 104.485802 s

That’s a 17% improvement in single-core performance over the M1, so right on the spot.

MacBook Pro M1 (2021)
[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true
 Setup time: 0.240659 s
Render time: 122.153057 s
 Total time: 122.393716 s

Then we get into the unfair territory, because the M1 did the multi-threaded benchmark using all its P- and E-cores, whereas the M2 Max can let its P-cores do all the work.

Mac Studio M2 (2023)
[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8
 Setup time: 0.173121 s
Render time: 15.689671 s
 Total time: 15.862792 s

BOOM even lightning faster — over six and a half times as fast multi-threaded, and almost twice (1.79x) as fast as the M1 was.

MacBook Pro M1 (2021)
[2] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8
 Setup time: 0.266847 s
Render time: 28.061283 s
 Total time: 28.32813 s

Going All In — Soft Shadows and Anti-Aliasing

This time, we’re throwing everything we have, letting the tiny E-cores do what they can to help along.

Mac Studio M2 (2023)
[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:12, ssaa:3
 Setup time: 0.17466 s
Render time: 304.752032 s
 Total time: 304.926692 s

Just shy of twice (1.97x) the crunch of an M1, in the extremely unfair final round.

MacBook Pro M1 (2021)
[1] pry(#<GlisteningRuby::DSL::Scene>)> render verbose:true, threads:8, ssaa:3
 Setup time: 0.23379 s
Render time: 601.335931 s
 Total time: 601.569721 s
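
For reference, the three M2 Max vs M1 ratios quoted above can be recomputed from the render times. This is only a convenience sketch, and the `speedup` helper and round labels are my own naming.

```ruby
# Render times in seconds, copied from the runs above: [M1, M2 Max].
ROUNDS = {
  'hard shadows, 1 thread'         => [122.153057, 104.306864],
  'hard shadows, 8 threads'        => [28.061283, 15.689671],
  'soft shadows + ssaa:3, all out' => [601.335931, 304.752032]
}.freeze

# How many times faster the M2 Max finished than the M1.
def speedup(m1_seconds, m2_seconds)
  m1_seconds / m2_seconds
end

ROUNDS.each do |round, (m1, m2)|
  puts format('%-32s %.2fx', round, speedup(m1, m2))
end
```

Note that the last round is not like for like: the M1 ran 8 threads and the M2 Max ran 12.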

Still, this is an extremely inefficient raytracer, and nothing we can throw at it will make renders reasonable. All the benchmarks yield a 320x256 image, which of course is the standard Amiga PAL low-res, and would take days to run with an optimized raytracer on an actual Amiga.
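
To get a feeling for how slow that is, the render times can be turned into pixels per second. A rough sketch using the hard-shadow, multi-threaded runs from this post; `pixels_per_second` is just an illustrative helper, and it counts finished output pixels only, ignoring how many rays each pixel needed.

```ruby
WIDTH  = 320
HEIGHT = 256
PIXELS = WIDTH * HEIGHT # 81_920 output pixels per render

# Output pixels finished per second of render time.
def pixels_per_second(render_seconds)
  PIXELS / render_seconds
end

# Render times in seconds from the hard-shadow, multi-threaded runs above.
{
  'Mac Studio M2 Max' => 15.689671,
  'MacBook Pro M1'    => 28.061283,
  'Raspberry Pi 400'  => 292.012566677
}.each do |machine, seconds|
  puts format('%-18s %7.0f px/s', machine, pixels_per_second(seconds))
end
```

That is roughly 5,200 pixels per second on the M2 Max, 2,900 on the M1, and 280 on the Raspberry Pi 400.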

Which raises the next question: is it more efficient to emulate an Amiga running an optimized raytracer than to run this nonsense one I made in Ruby?

  1. This monster has 87 MiB L1+L2+L3 cache in total. It can fit my entire first hard-drive (52 MB) in CPU cache. 
