Performance
MIND's compiler frontend processes tensor programs in microseconds, produces 100% deterministic builds, and generates gradient code at compile-time.
Important: What We Measure
MIND benchmarks measure the compiler frontend only: parsing, type checking, and IR lowering. This does not include code generation, optimization passes, linking, or producing an executable.
Comparisons with PyTorch torch.compile(), Mojo mojo build, and JAX jax.jit() are shown for context, but these tools perform fundamentally more work (full compilation to runnable code). The ratios reflect this scope difference, not just speed.
Frontend Speed (Verified)
In-process Criterion benchmarks measuring compile_source() — the complete frontend pipeline (parse + typecheck + IR lowering):
| Program | Complexity | Time (median) | Pipeline |
|---|---|---|---|
| scalar_math | 1 expression | 1.8 µs | parse + typecheck + IR |
| small_matmul | 3 statements | 2.8 µs | parse + typecheck + IR |
| tensor_ops | 5 statements | 4.8 µs | parse + typecheck + IR |
| medium_mlp | 6 statements, 3 ops | 6.1 µs | parse + typecheck + IR |
| large_network | 12 statements, 3-layer MLP | 15.5 µs | parse + typecheck + IR |
Methodology
Rust Criterion.rs statistical benchmarks: 100 samples per test, 3-second warmup, 95% confidence intervals. In-process measurement (no subprocess overhead). Frontend time scales roughly linearly with program complexity.
Verified Feb 2026 on Linux 6.17, x86_64, Rust 1.93. Reproducible via: cargo bench --bench compiler and cargo bench --bench simple_benchmarks
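MIND's suites use Rust Criterion.rs; as a rough illustration of the same methodology (fixed warmup period, fixed sample count, median reporting), here is a Python sketch. The `measure_median` helper and the stand-in workload are hypothetical, not part of the MIND toolchain:

```python
import statistics
import time

def measure_median(func, warmup_s=3.0, samples=100):
    """Criterion-style measurement: warm up for a fixed period,
    then take N timed samples and report the median."""
    deadline = time.perf_counter() + warmup_s
    while time.perf_counter() < deadline:
        func()  # warmup iterations are timed but discarded
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Stand-in workload; in MIND's benchmarks this would be compile_source().
median_s = measure_median(lambda: sum(range(1000)), warmup_s=0.1, samples=20)
print(f"{median_s * 1e6:.1f} µs")
```

The median is preferred over the mean here because it is robust to occasional outlier samples (scheduler preemption, cache misses).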
MIND Frontend vs PyTorch torch.compile()
Scope difference: MIND measures frontend only (parse + typecheck + IR). PyTorch torch.compile() includes graph capture, optimization, and code generation (Inductor/Triton). These are not equivalent operations.
| Benchmark | MIND Frontend | PyTorch torch.compile() | Ratio |
|---|---|---|---|
| scalar_math | 1.8 µs | 99 ms | 55,000x |
| small_matmul | 3.0 µs | 162 ms | 55,000x |
| medium_matmul | 3.0 µs | 109 ms | 37,000x |
| large_matmul | 3.0 µs | 105 ms | 35,000x |
| simple_mlp | 6.1 µs | 752 ms | 122,000x |
| conv2d | ~5 µs | 878 ms | 176,000x |
What This Means
MIND's frontend is 35,000-176,000x faster than PyTorch's full GPU torch.compile() pipeline. This is expected because:
- MIND: Specialized Rust frontend — parse, typecheck, IR emit. No code generation.
- PyTorch: Full compilation — FX graph capture, optimization passes, Inductor code generation, C++ compilation.
- Key takeaway: MIND's frontend is microsecond-fast, enabling instant feedback during development. A full end-to-end comparison would require MIND to also generate and compile executable code.
Same-machine measurement: PyTorch 2.10 GPU (RTX 3080, CUDA 12.8), full cold-start (caches cleared). MIND: Criterion (100 samples). Feb 2026.
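The ratios in the table above follow directly from the unit conversion (milliseconds of full-pipeline time over microseconds of frontend time). A quick check with the rounded table values, noting that the published ratios come from unrounded medians and so differ slightly:

```python
# Ratio of full-pipeline time (ms) to frontend-only time (µs).
def ratio(pipeline_ms, frontend_us):
    return pipeline_ms * 1000 / frontend_us

print(round(ratio(99, 1.8)))   # scalar_math → 55000, matching the table
print(round(ratio(752, 6.1)))  # simple_mlp ≈ 123,000 (table: 122,000x from unrounded medians)
```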
MIND Frontend vs Mojo 0.26.1
Scope difference: MIND measures frontend only (parse + typecheck + IR). Mojo mojo build performs full LLVM compilation to a native binary. These are not equivalent operations.
| Benchmark | MIND Frontend | Mojo mojo build | Ratio |
|---|---|---|---|
| scalar_math | 1.8 µs | 810 ms | 458,000x |
| matmul | 3.0 µs | 827 ms | 280,000x |
| mlp | 6.1 µs | 829 ms | 135,000x |
What This Means
MIND's frontend is 135,000-458,000x faster than Mojo's full mojo build compilation. Mojo compiles through LLVM to produce a native binary, while MIND's frontend only performs parsing, type checking, and IR lowering.
Same-machine measurement: Mojo 0.26.1.0 (pixi, Ubuntu 24.04). MIND: Criterion (100 samples). Feb 2026.
MIND Frontend vs JAX 0.9 jax.jit()
Scope difference: MIND measures frontend only (parse + typecheck + IR). JAX jax.jit() performs full XLA compilation (HLO lowering + optimization + code generation). These are not equivalent operations.
| Benchmark | MIND Frontend | JAX jax.jit() cold-start | Ratio |
|---|---|---|---|
| scalar_math | 1.8 µs | 37.5 ms | 21,200x |
| small_matmul | 3.0 µs | 127.2 ms | 43,100x |
| medium_matmul | 3.0 µs | 139.7 ms | 47,400x |
| large_matmul | 3.0 µs | 280.6 ms | 95,100x |
| simple_mlp | 6.1 µs | 360.5 ms | 58,600x |
What This Means
MIND's frontend is 21,200-95,100x faster than JAX's cold-start XLA compilation. JAX compiles through XLA to produce optimized GPU/CPU kernels, while MIND's frontend only performs parsing, type checking, and IR lowering.
Same-machine measurement: JAX 0.9.0.1 (CUDA 12.8, RTX 3080), cold-start with compilation cache disabled. MIND: Criterion (100 samples). Feb 2026.
Reproduce It Yourself
```shell
# MIND frontend benchmarks (Criterion, in-process)
cargo bench --bench simple_benchmarks
cargo bench --bench compiler

# PyTorch comparison (same machine)
pip install torch
python benchmarks/scientific_benchmark.py
```
Deterministic Compilation
MIND guarantees 100% bit-level reproducibility — every compilation produces identical output, verified via SHA256 cryptographic hashing.
| Test Program | Runs | Unique Hashes | Result |
|---|---|---|---|
| scalar_math | 10 | 1 | Deterministic |
| small_matmul | 10 | 1 | Deterministic |
| medium_matmul | 10 | 1 | Deterministic |
| mlp | 10 | 1 | Deterministic |
40 total runs, exactly one unique hash per program, 100% reproducibility. MIND guarantees bit-identical output across runs, machines, and time.
Compile-Time Autodiff
MIND generates gradient code once at compile-time, not on every training iteration. This eliminates per-iteration autodiff overhead entirely.
| Program | MIND Cost | PyTorch Cost | Advantage |
|---|---|---|---|
| Simple Quadratic | 38 µs (once) | 51,100 µs (1000 iters) | 1,345x |
| Small MLP | 38 µs (once) | 345,900 µs (1000 iters) | 9,103x |
| Matmul Chain | 38 µs (once) | 428,800 µs (1000 iters) | 11,284x |
Key Insight
MIND's compile-time autodiff is 1,345-11,284x more efficient than runtime autodiff over 1000 training iterations. The gradient code is already generated — just execute it.
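The advantage is simple amortization arithmetic: a one-time compile cost versus a per-iteration cost multiplied by the iteration count. Using the Simple Quadratic row (38 µs once vs 51,100 µs total over 1000 iterations, i.e. 51.1 µs per iteration):

```python
# One-time compile-time autodiff vs per-iteration runtime autodiff.
def advantage(compile_once_us, per_iter_us, iters=1000):
    return per_iter_us * iters / compile_once_us

print(round(advantage(38, 51.1)))  # → 1345, matching the table
```

Note the advantage grows linearly with iteration count: at 10,000 iterations the same row would show a 13,450x advantage, since MIND's cost is paid exactly once.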
Optimization Levels
The compiler provides several optimization profiles:
| Flag | Description | Deterministic |
|---|---|---|
| --debug | No optimizations, full debugging symbols | Yes |
| --release | Standard optimizations, deterministic | Yes |
| --release --fast-math | Maximum performance, relaxed floating-point | No |
Compiler Optimizations
The MLIR-based pipeline applies several optimization passes:
- Operator fusion — combines sequential operations to reduce memory traffic
- Layout optimization — selects optimal memory layouts for target hardware
- Dead code elimination — removes unused computations
- Constant folding — evaluates compile-time-known expressions
- Loop tiling — improves cache utilization for large tensors
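To make one of these passes concrete, here is a minimal constant-folding sketch over a toy expression tree. This is an illustration of the general technique, not MIND's actual MLIR pass or IR:

```python
from dataclasses import dataclass

@dataclass
class Const:
    value: float

@dataclass
class BinOp:
    op: str       # '+' or '*'
    lhs: object
    rhs: object

def fold(node):
    """Recursively replace operations on known constants with their result."""
    if isinstance(node, BinOp):
        lhs, rhs = fold(node.lhs), fold(node.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            if node.op == '+':
                return Const(lhs.value + rhs.value)
            if node.op == '*':
                return Const(lhs.value * rhs.value)
        return BinOp(node.op, lhs, rhs)
    return node

# (2 * 3) + 4 collapses to a single constant at compile time.
print(fold(BinOp('+', BinOp('*', Const(2), Const(3)), Const(4))))  # Const(value=10)
```

Real passes operate on SSA-form IR and must respect floating-point semantics (which is why --fast-math relaxations affect determinism), but the bottom-up structure is the same.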
Target Performance (CPU)
Benchmark targets for Core v1 operations on CPU:
| Operation | Target vs OpenBLAS |
|---|---|
| MatMul [4096x4096] | 1.0x - 1.5x |
| Conv2D | 1.2x - 2.0x |
| Element-wise ops | 1.0x - 1.2x |
| Reductions | 1.0x - 1.3x |
Framework Comparison
Comparison of MIND frontend speed vs other frameworks' full compilation pipelines. All numbers verified on the same machine (Feb 2026).
Different scope: MIND measures frontend only. Other frameworks measure full compilation to runnable code. Ratios reflect this difference.
| Framework | What's Measured | Time | Autodiff | Determinism |
|---|---|---|---|---|
| MIND | Frontend (parse+typecheck+IR) | 1.8-15.5 µs | Compile-time | 100% guaranteed |
| PyTorch 2.10 (GPU) | Full pipeline (graph+optimize+codegen) | 99-878 ms | Runtime tape | Not guaranteed |
| JAX 0.9 | Full XLA compilation (cold-start) | 37.5-360.5 ms | jax.grad (tracing) | Mostly deterministic |
| Mojo 0.26.1 | Full LLVM compilation (mojo build) | 810-829 ms | N/A | N/A |
Profiling
Built-in profiling support for performance analysis:
```shell
# Generate a trace profile
mindc run model.mind --profile=trace --output=trace.json

# CPU time breakdown
mindc run model.mind --profile=time
```
Memory Efficiency
- Static memory planning eliminates runtime allocation overhead
- Buffer reuse analysis minimizes peak memory usage
- Optional memory pooling for real-time applications
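Buffer reuse analysis rests on liveness: once a tensor's last use has passed, its buffer can be recycled for a later tensor, shrinking peak memory. A greedy sketch of the idea (not MIND's actual planner):

```python
# Assign buffer ids to tensors given their liveness intervals,
# reusing a buffer whenever its previous tensor is already dead.
def plan_buffers(intervals):
    """intervals: list of (start, end) liveness ranges, one per tensor.
    Returns {tensor index: buffer id}; peak memory ~ number of ids used."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    free, assignment, active = [], {}, []  # active holds (end, buf_id)
    next_id = 0
    for i in order:
        start, end = intervals[i]
        # Release buffers whose tensors died before this one starts.
        for e, buf in list(active):
            if e < start:
                free.append(buf)
                active.remove((e, buf))
        buf = free.pop() if free else next_id
        if buf == next_id:
            next_id += 1
        assignment[i] = buf
        active.append((end, buf))
    return assignment

# Tensor 0 dies (end=2) before tensor 2 begins (start=3), so they share buffer 0.
print(plan_buffers([(0, 2), (1, 5), (3, 6)]))  # {0: 0, 1: 1, 2: 0}
```

Because shapes and lifetimes are known at compile time, this planning happens once, which is what eliminates runtime allocation overhead.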
GPU Runtime Performance (Enterprise)
The Enterprise CUDA backend delivers production-grade GPU acceleration, benchmarked on RTX 4070 (SM_89, Ada Lovelace):
| Metric | PyTorch 2.8 | MIND Runtime | Improvement |
|---|---|---|---|
| Memory Allocation | 46K/sec | 8.3M/sec | 180x faster |
| MatMul TF32 (4096x4096) | 12.83 TFLOPS | 17.32 TFLOPS | 35% faster |
| MatMul FP16 (4096x4096) | 23.82 TFLOPS | 33.34 TFLOPS | 40% faster |
| Elementwise Bandwidth | 228 GB/s | 250 GB/s | 10% faster (98% of peak) |
GPU runtime requires Enterprise license. Performance scales with GPU capabilities. Benchmarks verified February 2026.
WebGPU Runtime: GEMM Benchmark
In-browser WebGPU benchmark comparing MindLang AOT-compiled WGSL shaders against ONNX Runtime Web 1.21's WebGPU backend. Both perform the identical operation (C = A × B, f32). MindLang compiles .mind → optimized .wgsl at build time; ONNX RT generates shaders at runtime.
| Size | MindLang | ONNX RT Web | Speedup |
|---|---|---|---|
| 1024×1024 | 3.4 ms / 628 GFLOPS | 25.7 ms / 83 GFLOPS | 7.5x |
| 2048×2048 | 4.9 ms / 3,535 GFLOPS | 93.1 ms / 184 GFLOPS | 19x |
| 4096×4096 | 31 ms / 4,451 GFLOPS | 240 ms / 569 GFLOPS | 7.7x |
Key Findings
MindLang is 7.5-19x faster than ONNX Runtime Web across all matrix sizes. At 4096×4096, MindLang achieves ~4.5 TFLOPS peak on consumer WebGPU hardware. The shader uses 8×4 register tiling with 128×64 workgroup output, bank-conflict-free shared memory, and vec4 vectorized loads, delivering up to 4,451 GFLOPS. The advantage comes from AOT compilation with aggressive kernel optimization versus runtime shader generation.
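The GFLOPS figures follow from the standard GEMM operation count of 2n³ (n³ multiply-adds) divided by elapsed time. Recomputing from the rounded table times lands close to the published figures, which were derived from unrounded timings:

```python
# GFLOPS for an n×n×n GEMM from its wall-clock time in milliseconds.
def gemm_gflops(n, time_ms):
    flops = 2 * n**3          # n^3 multiply-adds
    return flops / (time_ms / 1000) / 1e9

print(gemm_gflops(1024, 3.4))  # ≈ 632 (table: 628)
print(gemm_gflops(4096, 31))   # ≈ 4,434 (table: 4,451)
```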
With the Include Compile Time toggle enabled, the advantage grows further: MindLang's compile cost is ~50-80 ms (fetch pre-built WGSL + pipeline creation) vs ONNX RT's ~500-2,000 ms (model load + runtime WGSL generation).
Chromium 131, WebGPU (Vulkan), Ubuntu 24.04. Static-shape ONNX models for fair comparison. Feb 2026. Run the benchmark yourself →
Learn More
- GEMM Benchmark (Interactive) — Run MindLang vs ONNX RT Web in your browser
- Running Benchmarks — Reproduce the results yourself
- Performance FAQ — Common questions answered
- Full Benchmark Results — Complete verified data
- Performance Specification — Official spec document