Anthony's Blog

From 11M to 326M: How Multithreading Made astroz 30x Faster

Last week, I wrote about hitting 11 million satellite propagations per second with SIMD vectorization in my Zig library astroz. That post got way more attention than I expected. I ended it by saying multithreading was the next piece I was adding, and that I expected it to be a new milestone for the project.

The new number: 326 million propagations per second.

To put that in perspective: there are roughly 13,000 active satellites in low Earth orbit right now. At 326M props/sec, you can compute every single one of their positions about 25,000 times per second (326,000,000 / 13,000). That's real-time collision detection territory.

The Problem: SIMD Already Uses Parallelism

Here's the thing about SIMD: it's already a form of parallelism. My SGP4 implementation processes 8 satellites simultaneously using AVX512 vector instructions (or 4 with AVX2). Each CPU core is already doing 8x the work.

So when I went to add multithreading, I ran into the classic question: how do you distribute work across threads when your inner loop is already vectorized?

Naive approaches run into problems. The answer came down to how you slice up the work.

The Key Insight: Time Major Layout

When propagating a constellation, you have two dimensions: satellites and time points. If you're computing positions for 13,000 satellites across 1,440 time points (one day at minute intervals), that's ~19 million individual propagations.

There are two ways to divide this across threads:

Satellite Major: Thread 1 handles satellites 0-999 for all 1,440 time points. Thread 2 handles satellites 1000-1999 for all time points. And so on.

Time Major: Thread 1 handles all 13,000 satellites for time points 0-89. Thread 2 handles all satellites for time points 90-179. And so on.
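To make the two layouts concrete, here is a small Python sketch of how the work could be carved up. It is not astroz's Zig code: `split_range` is a hypothetical helper, and the 16-thread count is assumed from the benchmark machine, so the satellite ranges come out different from the round numbers used above.

```python
def split_range(n_items: int, n_parts: int) -> list[range]:
    """Split range(n_items) into n_parts contiguous chunks of near-equal size."""
    base, extra = divmod(n_items, n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        chunks.append(range(start, start + size))
        start += size
    return chunks


N_SATS, N_TIMES, N_THREADS = 13_000, 1_440, 16  # thread count is an assumption

# Satellite major: each thread owns a slice of satellites and every time point.
sat_major = [(sats, range(N_TIMES)) for sats in split_range(N_SATS, N_THREADS)]

# Time major: each thread owns a slice of time points and every satellite.
time_major = [(range(N_SATS), times) for times in split_range(N_TIMES, N_THREADS)]

for name, plan in (("satellite major", sat_major), ("time major", time_major)):
    sats, times = plan[0]
    print(f"{name}: thread 1 -> sats {sats.start}-{sats.stop - 1}, "
          f"times {times.start}-{times.stop - 1}")
```

Both plans cover the same ~19 million (satellite, time) pairs. The only difference is which axis each thread walks in full.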

Time major wins consistently when it comes to speed, and here's why:

1. Cache locality for orbital elements. Each satellite has a set of orbital parameters (mean motion, eccentricity, inclination, etc.) that SGP4 needs. In time major mode, when a thread processes a time point, it iterates through all satellites sequentially. When it moves to the next time point, it walks through those same orbital elements again in the same order, maximizing cache reuse.

2. GMST precomputation. GMST (Greenwich Mean Sidereal Time) is the Earth's rotation angle, needed for converting from the TEME reference frame to ECEF coordinates. Since GMST depends only on time (not on which satellite), we precompute sin(GMST) and cos(GMST) for all 1,440 time points before threading begins. That's 1,440 sin/cos pairs instead of 19 million.

3. Natural SIMD alignment. The SIMD vectorization operates over the satellite axis (8 satellites per batch on AVX512). Time major means each thread does complete sweeps over all batches, keeping the vector pipeline full.
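The GMST precomputation from point 2 can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the astroz implementation: it uses the common linear approximation for GMST, ignores polar motion, picks J2000.0 as an arbitrary start epoch, and the helper names `gmst_radians` and `teme_to_ecef` are made up for this example.

```python
import numpy as np


def gmst_radians(jd_ut1: np.ndarray) -> np.ndarray:
    """Approximate GMST in radians from a UT1 Julian date (linear-in-time form)."""
    deg = 280.46061837 + 360.98564736629 * (jd_ut1 - 2451545.0)
    return np.deg2rad(deg % 360.0)


def teme_to_ecef(r_teme: np.ndarray, sin_g: float, cos_g: float) -> np.ndarray:
    """Rotate (n_sats, 3) TEME positions about Z by GMST; polar motion ignored."""
    x, y, z = r_teme[:, 0], r_teme[:, 1], r_teme[:, 2]
    return np.column_stack((cos_g * x + sin_g * y, -sin_g * x + cos_g * y, z))


# One day at minute intervals from an arbitrary example epoch (J2000.0).
jd = 2451545.0 + np.arange(1440) / 1440.0
theta = gmst_radians(jd)

# Computed once up front: one sin/cos pair per time point, shared by every satellite.
sin_gmst, cos_gmst = np.sin(theta), np.cos(theta)

# Made-up TEME positions (km) for two satellites at time index 0.
r_teme = np.array([[7000.0, 0.0, 0.0], [0.0, 7000.0, 100.0]])
r_ecef = teme_to_ecef(r_teme, sin_gmst[0], cos_gmst[0])
```

Inside the threaded loop, each time index then only needs a table lookup for its sine and cosine, whichever satellite batch is being processed.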

The implementation in Zig is straightforward. Threads get assigned ranges of time indices, and each iterates through all satellite batches for its assigned times.
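The real code is Zig, but the loop structure is easy to mirror in Python. In the sketch below, `propagate_batch` is a hypothetical stand-in for a vectorized SGP4 step (it just multiplies mean motion by time), and all function names are illustrative. Because of CPython's GIL this will not reproduce the speedup; it only shows how time indices are handed to threads and how each thread sweeps the satellite batches.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

BATCH = 8  # satellites per SIMD batch on AVX512, per the text


def propagate_batch(mean_motion: np.ndarray, t_min: float) -> np.ndarray:
    """Hypothetical stand-in for one vectorized SGP4 step: an angle per satellite."""
    return (mean_motion * t_min) % (2.0 * np.pi)


def worker(elements, times, t_start, t_stop, out):
    """Time major: for each assigned time index, sweep every satellite batch."""
    for ti in range(t_start, t_stop):
        for b in range(0, elements.shape[0], BATCH):
            out[ti, b:b + BATCH] = propagate_batch(elements[b:b + BATCH], times[ti])


def propagate_time_major(elements, times, n_threads=16):
    n_times = times.shape[0]
    out = np.empty((n_times, elements.shape[0]))
    # Contiguous, near-equal ranges of time indices, one per thread.
    bounds = [i * n_times // n_threads for i in range(n_threads + 1)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(worker, elements, times, start, stop, out)
                   for start, stop in zip(bounds[:-1], bounds[1:])]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return out


elements = np.linspace(0.05, 0.07, 64)     # made-up mean motions, rad/min
times = np.arange(1440, dtype=np.float64)  # minutes since epoch
result = propagate_time_major(elements, times)
```

Each thread writes to its own rows of `out`, so no locking is needed.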

Results

Benchmarked on AMD Ryzen 7 7840U (8 cores, 16 threads), propagating the full active satellite catalog (13,478 satellites) for one day at minute intervals:

Implementation        Props/sec   vs python-sgp4
astroz (16 threads)   326M        121x
heyoka                155.6M      57x
Rust sgp4 (rayon)     47.9M       17x
astroz (1 thread)     37.7M       14x
python-sgp4           2.7M        1x

The 16 thread version is 8.6x faster than single-threaded; not quite linear scaling, but solid for a memory bound workload. More importantly, it's now the fastest open source CPU based SGP4 implementation I'm aware of (let me know if you find something faster!).
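For anyone who wants to check the scaling arithmetic, the ratios can be recomputed from the published figures. The numbers are copied from the table above. The split into 8 physical cores and 16 hardware threads comes from the benchmark CPU, and the small differences from the table's last column presumably come from rounding in the published throughputs.

```python
# Throughput figures (propagations per second) copied from the results table.
props_per_sec = {
    "astroz (16 threads)": 326e6,
    "heyoka": 155.6e6,
    "Rust sgp4 (rayon)": 47.9e6,
    "astroz (1 thread)": 37.7e6,
    "python-sgp4": 2.7e6,
}

baseline = props_per_sec["python-sgp4"]
for name, rate in props_per_sec.items():
    print(f"{name:<22}{rate / baseline:6.1f}x vs python-sgp4")

speedup = props_per_sec["astroz (16 threads)"] / props_per_sec["astroz (1 thread)"]
print(f"multithreaded speedup:      {speedup:.2f}x")
print(f"efficiency over 16 threads: {speedup / 16:.0%}")
print(f"speedup per physical core:  {speedup / 8:.2f}x")
```

Measured against 16 hardware threads the efficiency is a bit over half, but measured against the 8 physical cores the speedup is slightly better than one per core.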

At 326M props/sec, applications like the real-time collision detection mentioned earlier become practical.

Try It

pip install astroz

from astroz import Constellation
import numpy as np

# Load all Starlink satellites
constellation = Constellation("starlink")

# Propagate for 1 day at 1-minute intervals
positions = constellation.propagate(np.arange(1440), output="ecef")

Check out the live Cesium demo to see it in action, or browse the source on GitHub.


Who Am I?

I'm Anthony Templeton, a software engineer passionate about high-performance computing and aerospace applications. You can connect with me on LinkedIn or check out more of my work on GitHub.