From 11M to 326M: How Multithreading Made astroz 30x Faster

25 Jan, 2026

Last week, I wrote about hitting 11 million satellite propagations per second with SIMD vectorization in my Zig library astroz. That post got way more attention than I expected. I ended it by saying multithreading was the next piece, and that I expected it to be a new milestone for the project.

The new number: 326 million propagations per second.

To put that in perspective: there are roughly 13,000 active satellites in low Earth orbit right now. At 326M props/sec, you can compute every single one of their positions about 25,000 times per second, or a full day of minute-by-minute positions for the whole catalog roughly 17 times per second. That's real-time collision detection territory.

The Problem: SIMD Already Uses Parallelism

Here's the thing about SIMD: it's already a form of parallelism. My SGP4 implementation processes 8 satellites simultaneously using AVX512 vector instructions (or 4 with AVX2). Each CPU core is already doing 8x the work.

So when I went to add multithreading, I ran into the classic question: how do you distribute work across threads when your inner loop is already vectorized?

Naive approaches have problems:

Cache thrashing: Threads fighting over the same cache lines
False sharing: Adjacent memory locations causing unnecessary invalidation
Contention: Coordination overhead eating your gains

The answer came down to how you slice up the work.

The Key Insight: Time Major Layout

When propagating a constellation, you have two dimensions: satellites and time points. If you're computing positions for 13,000 satellites across 1,440 time points (one day at minute intervals), that's ~19 million individual propagations.

There are two ways to divide this across threads:

Satellite Major: Thread 1 handles satellites 0-999 for all 1,440 time points. Thread 2 handles satellites 1000-1999 for all time points. And so on.

Time Major: Thread 1 handles all 13,000 satellites for time points 0-89. Thread 2 handles all satellites for time points 90-179. And so on.
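To make the two slicings concrete, here is a minimal sketch of how the index ranges could be handed out. The helper name chunk_ranges is hypothetical (it is not part of astroz), and the 16 threads are an assumption borrowed from the benchmark machine.

```python
def chunk_ranges(n_items, n_workers):
    """Split range(n_items) into contiguous half-open (start, stop) ranges."""
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for w in range(n_workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


N_SATS, N_TIMES, N_THREADS = 13_000, 1_440, 16

# Satellite major: each thread owns a slice of satellites and all time points.
sat_major = chunk_ranges(N_SATS, N_THREADS)

# Time major: each thread owns a slice of time points and all satellites.
time_major = chunk_ranges(N_TIMES, N_THREADS)

print(time_major[:2])  # [(0, 90), (90, 180)]
```

With 16 threads, each one gets 90 time points, which matches the 0-89 and 90-179 ranges above (the ranges in the sketch are half-open).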

Time major wins consistently when it comes to speed, and here's why:

1. Cache locality for orbital elements: Each satellite has a set of orbital parameters (mean motion, eccentricity, inclination, etc.) that SGP4 needs. In time major mode, when a thread processes a time point, it iterates through all satellites sequentially. When it moves to the next time point, it walks through those same orbital elements again in the same order, maximizing cache reuse.

2. GMST precomputation: GMST (Greenwich Mean Sidereal Time) is the Earth's rotation angle, needed for converting from the TEME reference frame to ECEF coordinates. Since GMST depends only on time (not on which satellite), we precompute sin(GMST) and cos(GMST) for all 1,440 time points before threading begins. That's 1,440 sin/cos pairs instead of roughly 19 million.

3. Natural SIMD alignment: The SIMD vectorization operates over the satellite axis (8 satellites per batch on AVX512). Time major means each thread does complete sweeps over all batches, keeping the vector pipeline full.
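The GMST precomputation from point 2 can be sketched roughly like this. It is an illustration, not the astroz code: the function names are made up, the GMST polynomial is the IAU-82 form commonly used alongside SGP4 (an assumption about what astroz does), and the rotation ignores polar motion.

```python
import math


def gmst_rad(jd_ut1):
    """GMST in radians from a UT1 Julian date (IAU-82 polynomial)."""
    t = (jd_ut1 - 2451545.0) / 36525.0  # Julian centuries since J2000
    seconds = (-6.2e-6 * t**3 + 0.093104 * t**2
               + (876600.0 * 3600.0 + 8640184.812866) * t + 67310.54841)
    # 240 seconds of sidereal time correspond to 1 degree of rotation.
    return math.radians(seconds / 240.0) % (2.0 * math.pi)


def precompute_gmst_trig(jd_start, n_minutes):
    """One (sin, cos) pair per time point, computed once before threading."""
    table = []
    for k in range(n_minutes):
        g = gmst_rad(jd_start + k / 1440.0)  # 1440 minutes per day
        table.append((math.sin(g), math.cos(g)))
    return table


def teme_to_ecef(pos, sin_g, cos_g):
    """Rotate a TEME position about the z axis by GMST (no polar motion)."""
    x, y, z = pos
    return (cos_g * x + sin_g * y, -sin_g * x + cos_g * y, z)


trig = precompute_gmst_trig(2451545.0, 1440)  # hypothetical start epoch: J2000
sin_g, cos_g = trig[0]
print(teme_to_ecef((7000.0, 0.0, 0.0), sin_g, cos_g))
```

Every satellite at time index k reuses trig[k], so the per-satellite inner loop only needs multiplies and adds for the frame conversion.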

The implementation in Zig is straightforward. Threads get assigned ranges of time indices, and each iterates through all satellite batches for its assigned times.
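The real implementation is in Zig, but the shape of the loop can be sketched in Python. Everything here is a stand-in: toy_propagate is a placeholder rather than SGP4, the function names are hypothetical, and Python threads only illustrate the work assignment (the GIL prevents any real speedup).

```python
from concurrent.futures import ThreadPoolExecutor

BATCH = 8  # satellites per SIMD batch (the AVX512 width from the article)


def toy_propagate(sat_id, t):
    """Placeholder for SGP4: returns a fake scalar 'position'."""
    return sat_id + 0.001 * t


def worker(t_start, t_stop, n_sats, out):
    """Process every satellite batch for the time indices this thread owns."""
    for t in range(t_start, t_stop):
        row = out[t]  # only this thread writes rows t_start..t_stop-1
        for b in range(0, n_sats, BATCH):
            # This inner loop stands in for one vectorized batch.
            for s in range(b, min(b + BATCH, n_sats)):
                row[s] = toy_propagate(s, t)


def propagate_time_major(n_sats, n_times, n_threads):
    out = [[0.0] * n_sats for _ in range(n_times)]
    step = -(-n_times // n_threads)  # ceiling division
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(worker, t0, min(t0 + step, n_times), n_sats, out)
            for t0 in range(0, n_times, step)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return out


positions = propagate_time_major(n_sats=20, n_times=10, n_threads=4)
```

Because each thread writes to its own contiguous block of output rows, threads never touch each other's results, which is one way to sidestep the false sharing and contention problems listed earlier.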

Results

Benchmarked on AMD Ryzen 7 7840U (8 cores, 16 threads), propagating the full active satellite catalog (13,478 satellites) for one day at minute intervals:

Implementation	Props/sec	vs python-sgp4
astroz (16 threads)	326M	121x
heyoka	155.6M	57x
Rust sgp4 (rayon)	47.9M	17x
astroz (1 thread)	37.7M	14x
python-sgp4	2.7M	1x

The 16 thread version is 8.6x faster than single-threaded; not quite linear scaling, but solid for a memory bound workload. More importantly, it's now the fastest open source CPU based SGP4 implementation I'm aware of (let me know if you find something faster!).

At 326M props/sec, some interesting applications become practical:

Real-time collision screening: Propagate all 13,000 active satellites to any time instant in under a millisecond. The bottleneck shifts from "can we compute positions fast enough" to "can we check 85 million satellite pairs fast enough" - a much better problem to have.
Live tracking dashboards: Update the entire satellite catalog at 60+ Hz for smooth visualization
Monte Carlo simulations: Run 10,000 orbital uncertainty scenarios for a full constellation in under a second
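The collision screening numbers are easy to check with a few lines of arithmetic. This uses the rounded figures from this post (13,000 satellites, 326M props/sec), not a new measurement.

```python
import math

RATE = 326e6      # propagations per second, from the benchmark above
N_SATS = 13_000   # rounded active satellite count used in this post

# Unordered satellite pairs that a brute-force screen would have to check.
pairs = math.comb(N_SATS, 2)

# Time to propagate the whole catalog to a single instant.
snapshot_seconds = N_SATS / RATE

print(pairs)             # 84493500
print(snapshot_seconds)  # about 4e-05 s, i.e. roughly 40 microseconds
```

So one full-catalog snapshot costs tens of microseconds, while the pair count sits around 85 million, which is why the pair check becomes the bottleneck.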

Try It

pip install astroz

from astroz import Constellation
import numpy as np

# Load all Starlink satellites
constellation = Constellation("starlink")

# Propagate for 1 day at 1-minute intervals
positions = constellation.propagate(np.arange(1440), output="ecef")

Check out the live Cesium demo to see it in action, or browse the source on GitHub.

Who Am I?

I'm Anthony Templeton, a software engineer passionate about high-performance computing and aerospace applications. You can connect with me on LinkedIn or check out more of my work on GitHub.