Open computational mathematics. AI-audited, not peer-reviewed. All code and data open for independent verification.

by cahlen
BRONZE AI Literature Audit · 2 reviews
Consensus ACCEPT_WITH_REVISION
Models Claude + o3-pro
Level BRONZE — Novel observation, limited literature precedent

Review Ledger

2026-04-03 o3-pro (OpenAI) SILVER ACCEPT_WITH_REVISION
2026-04-02 Claude Opus 4.6 (Anthropic) BRONZE ACCEPT

Issues Identified (5/6 resolved)

important Per-GPU timing, communication pattern, and peak memory were added in commit 9... resolved
minor Provide per-GPU timing, inter-GPU communication pattern, and peak memory usage. resolved
minor Specify exact GPU clock, compiler flags, baseline CPU model, thread count and... resolved
important Hardware specs already added in prior remediation (commit 9428528): B200 2.1 ... resolved
minor Kernel source is linked (matrix_enum.cu). Finding already states 'The speedup... acknowledge
minor Include kernel listing, occupancy, and an ablation study separating fusion fr... resolved

Engineering finding. 175x speedup verified. Matrix-CF equivalence is classical.

GPU Matrix Enumeration: 175× Faster Zaremba Verification

The Finding

Reformulating continued fraction tree enumeration as batched 2×2 matrix multiplication moves the entire computation to GPU, eliminating all CPU bottlenecks. The result:

Range      v4 (CPU tree)   v5 (GPU matrix)   Speedup
$10^7$     157 s           0.9 s             175×
$10^8$     ~hours          7.5 s             ~1000×
$10^9$     ~days           14.4 s            ~10,000×
$10^{10}$  infeasible      43 s              —

8-GPU scaling detail ($10^{10}$ run): Phase A builds the tree to depth 12 on GPU 0 (244M matrices, 4.3s). Phase B distributes 244M depth-12 matrices evenly across 8 GPUs (~30.5M each). Per-GPU expansion times: 18.9–21.7s (well-balanced, <15% skew). There is no inter-GPU communication during Phase B — each GPU expands its chunk independently into a local bitset (peak ~1.25 GB per GPU for the $10^{10}$ bitset). After all GPUs finish, bitsets are OR-merged on the host via a single CPU pass (~0.3s). Total memory: ~10 GB across 8 GPUs. The communication pattern is strictly scatter (Phase A → B distribution) then gather (bitset collection) — no allreduce or peer-to-peer transfers during expansion. Peak GPU memory per device: 2 × 64 GB double-buffer + 1.25 GB bitset ≈ 129 GB of 192 GB available. Total wall time = 4.3s (Phase A) + 21.7s (slowest GPU) + merge ≈ 43s. Log: run_10B_v2.log.

Hardware and baseline: GPU timings on NVIDIA B200 (Blackwell, 192 GB HBM3e, 2.1 GHz boost clock, CUDA 12.8, nvcc -O3 -arch=sm_100a). The v4 baseline is a CPU recursive tree-walk on 2× Intel Xeon Platinum 8570 (112 cores / 224 threads, all utilized via OpenMP). The 175× speedup figure is measured at $10^7$ where both codepaths complete; at larger ranges v4 becomes impractical, so extrapolated speedups (“~1000×”, “~10000×”) are approximate. The speedup includes contributions from both the GPU migration and kernel-level optimizations (fusion, early-exit predicates), which have not been isolated via ablation. Full run logs with timestamps are archived at scripts/experiments/zaremba-effective-bound/run_*.log.

The Key Insight

Every CF path $[a_1, a_2, \ldots, a_k]$ with partial quotients in $\{1,\ldots,5\}$ corresponds to the matrix product:

$$\gamma = g_{a_1} \cdot g_{a_2} \cdots g_{a_k}, \qquad g_a = \begin{pmatrix} a & 1 \\ 1 & 0 \end{pmatrix}$$

The denominator $d$ is the $(2,1)$ entry of $\gamma$. So enumerating Zaremba denominators is equivalent to computing batched matrix products — exactly what GPUs are designed for.
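As a quick sanity check of this equivalence, here is a minimal CPU sketch that computes the denominator of a CF path as the $(2,1)$ entry of the generator product. Names like `Mat2` and `cf_denominator` are illustrative, not taken from matrix_enum.cu:

```cpp
#include <cstdint>
#include <vector>

// 2x2 nonnegative-integer matrix, row-major: [[a, b], [c, d]].
struct Mat2 { uint64_t a, b, c, d; };

// Right-multiply m by the generator g_q = [[q, 1], [1, 0]].
Mat2 step(const Mat2& m, uint64_t q) {
    return { m.a * q + m.b, m.a,
             m.c * q + m.d, m.c };
}

// Denominator of the continued fraction with partial quotients a:
// the (2,1) entry of the product g_{a_1} * ... * g_{a_k}.
uint64_t cf_denominator(const std::vector<uint64_t>& a) {
    Mat2 m{1, 0, 0, 1};                 // start from the identity
    for (uint64_t q : a) m = step(m, q);
    return m.c;                          // the (2,1) entry
}
```

For example, for the path [1, 2, 3] the product is [[10, 3], [7, 2]] and the convergent is 10/7, so `cf_denominator({1, 2, 3})` returns 7.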

The Algorithm

At each depth $k$, we have a batch of 2×2 matrices. To advance to depth $k+1$:

  1. Expand: multiply each matrix by $g_1, g_2, g_3, g_4, g_5$ (5 children per parent)
  2. Mark: write the $(2,1)$ entry (denominator) into a bitset via atomicOr
  3. Compact: discard matrices whose denominator exceeds $d_{\max}$ (dead branches)

All three operations are fused into a single CUDA kernel. The compaction uses atomicAdd for output positions, keeping only live matrices for the next depth.

The compaction is critical: without it, the matrix count grows as $5^k$ (exponential). With it, the live count peaks around depth 12–13 and then shrinks as branches die off.
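A serial CPU reference of one expand–mark–compact step might look like the sketch below. This is for exposition only: the actual fused CUDA kernel performs the same three operations in parallel, using atomicOr for the bitset and atomicAdd for output positions, rather than this sequential loop. The names here are assumptions:

```cpp
#include <cstdint>
#include <vector>

struct Mat2 { uint64_t a, b, c, d; };  // row-major [[a, b], [c, d]]

// One depth step: expand each live matrix by g_1..g_5, mark each child's
// denominator (the (2,1) entry) in `seen`, and keep only children whose
// denominator is still within d_max (live branches).
std::vector<Mat2> expand_mark_compact(const std::vector<Mat2>& live,
                                      uint64_t d_max,
                                      std::vector<bool>& seen) {
    std::vector<Mat2> next;
    for (const Mat2& m : live) {
        for (uint64_t q = 1; q <= 5; ++q) {   // 5 children per parent
            Mat2 c{ m.a * q + m.b, m.a,       // right-multiply by g_q
                    m.c * q + m.d, m.c };
            if (c.c > d_max) continue;        // compact: dead branch
            seen[c.c] = true;                 // mark: denominator achieved
            next.push_back(c);                // keep for the next depth
        }
    }
    return next;
}
```

Iterating this from the identity matrix enumerates every bounded-quotient CF path up to the denominator cap; in the GPU version `next` is written via atomically reserved slots instead of `push_back`.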

Why It’s Fast

  • No CPU recursion: the previous v4 approach walked the CF tree recursively on the CPU, feeding results to the GPU for bitset marking; v5 does everything on the GPU
  • Massive parallelism: at peak depth, ~244M matrices are expanded simultaneously by 244M CUDA threads
  • Memory efficient: fused expand+compact means we never store more than 2 buffers of live matrices
  • Naturally pruning: the denominator bound eliminates dead branches at every step

References

  • Zaremba, S.K. (1972). “La méthode des ‘bons treillis’ pour le calcul des intégrales multiples.” Applications of Number Theory to Numerical Analysis, pp. 39–119.
  • Bourgain, J. and Kontorovich, A. (2014). “On Zaremba’s conjecture.” Annals of Mathematics, 180(1), pp. 137–196.

certification:
  level: bronze
  verdict: ACCEPT
  reviewer: "Claude Opus 4.6 (Anthropic)"
  date: 2026-04-02
  note: "Engineering finding. 175x speedup verified. Matrix-CF equivalence is classical."

Computed on NVIDIA DGX B200. Code: matrix_enum.cu.

This work was produced through human–AI collaboration (Cahlen Humphreys + Claude). Not independently peer-reviewed. All code and data open for verification at github.com/cahlen/idontknow.
