WebGPU Benchmark

GEMM — Matrix Multiplication

Apples-to-apples comparison: MindLang AOT-compiled WGSL shader vs ONNX Runtime Web's WebGPU backend. Same operation (1024×1024 GEMM), same GPU, same browser — 2·1024³ ≈ 2.15 GFLOP of floating-point work per run.

MindLang

AOT-compiled WGSL via mindc --target webgpu. Fetches the pre-compiled gemm.wgsl with 8×4 register tiling, vec4 loads, and bank-conflict-free shared memory. Dispatches 128×64 output tiles via 16×16 workgroups.

Avg Time
GFLOPS
Min Dispatch
Shader Compile
ONNX RT Web

ONNX Runtime Web 1.21 with the WebGPU execution provider. Loads a static-shape MatMul model (matching the selected size), creates an InferenceSession, and runs inference.

Avg Time
GFLOPS
Min Dispatch
Session Init

Methodology

Both paths perform the identical mathematical operation: C = A × B where A, B, C are 1024×1024 f32 matrices. Each run includes 1 warmup dispatch (not counted) followed by 5 timed dispatches with queue.onSubmittedWorkDone() synchronization between each.
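The warmup-plus-timed-dispatches loop can be sketched as below. This is an illustrative harness, not the benchmark's actual source: the `device`, `pipeline`, and `bindGroup` arguments are assumed to have been created earlier through the standard WebGPU API.

```javascript
// Each run of a 1024×1024 GEMM performs 2·n³ floating-point operations.
function gflops(n, ms) {
  return (2 * n ** 3) / (ms / 1000) / 1e9;
}

// One warmup dispatch (not counted), then `runs` timed dispatches with
// queue.onSubmittedWorkDone() synchronization between each, per the
// methodology above. Names and structure are a sketch, not the real code.
async function timeDispatches(device, pipeline, bindGroup, runs = 5) {
  const dispatch = () => {
    const encoder = device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(1024 / 128, 1024 / 64); // 128×64 output tiles
    pass.end();
    device.queue.submit([encoder.finish()]);
  };

  dispatch(); // warmup, not counted
  await device.queue.onSubmittedWorkDone();

  const times = [];
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    dispatch();
    await device.queue.onSubmittedWorkDone(); // sync between timed runs
    times.push(performance.now() - t0);
  }
  return times;
}
```

At 2 ms per dispatch, gflops(1024, 2) reports roughly 1074 GFLOPS of effective throughput.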

MindLang uses a pre-compiled WGSL compute shader fetched from /bench/gemm/gemm.wgsl. The shader uses 8×4 register tiling (32 FMAs per inner-loop iteration), bank-conflict-free shared memory (stride-17 padding), and vec4 vectorized loads. Shader compile time is measured separately and not included in the dispatch average.
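The MindLang path (fetch the pre-built WGSL, create a pipeline, time that separately) might look like the sketch below. The entry-point name is an assumption; the fetch()/WebGPU calls shown are standard API.

```javascript
// Workgroup grid for an n×n output covered by tileX×tileY tiles per workgroup.
function dispatchGrid(n, tileX, tileY) {
  return [Math.ceil(n / tileX), Math.ceil(n / tileY)];
}

// Fetch the pre-compiled shader and build the compute pipeline, measuring
// compile time separately so it is excluded from the dispatch average.
async function buildPipeline(device) {
  const t0 = performance.now();
  const code = await (await fetch('/bench/gemm/gemm.wgsl')).text();
  const module = device.createShaderModule({ code });
  const pipeline = await device.createComputePipelineAsync({
    layout: 'auto',
    compute: { module, entryPoint: 'main' }, // entry point name assumed
  });
  return { pipeline, compileMs: performance.now() - t0 };
}
```

For the 1024×1024 case with 128×64 output tiles, dispatchGrid(1024, 128, 64) yields an 8×16 workgroup grid.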

ONNX Runtime Web v1.21 uses the WebGPU execution provider. A static-shape MatMul ONNX model matching the selected size is loaded, giving ONNX RT full opportunity to specialize its kernel. Session init time (including ONNX graph compilation to WGSL) is measured separately.
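A sketch of the ONNX Runtime Web path follows. The model URL and the input/output names ('A', 'B', 'C') are assumptions about the generated MatMul model; the ort.* calls are the standard onnxruntime-web API. A naive CPU matmul is included as a reference for spot-checking that both paths compute the same C = A × B.

```javascript
// import * as ort from 'onnxruntime-web/webgpu';

// Create the session (init measured separately, per the methodology) and run.
// Model path and tensor names are illustrative assumptions.
async function runOnnxGemm(ort, modelUrl, n = 1024) {
  const t0 = performance.now();
  const session = await ort.InferenceSession.create(modelUrl, {
    executionProviders: ['webgpu'],
  });
  const initMs = performance.now() - t0;
  const a = new ort.Tensor('float32', new Float32Array(n * n), [n, n]);
  const b = new ort.Tensor('float32', new Float32Array(n * n), [n, n]);
  const { C } = await session.run({ A: a, B: b });
  return { C, initMs };
}

// CPU reference: C = A × B for row-major n×n matrices.
function matmulRef(a, b, n) {
  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++)
    for (let k = 0; k < n; k++) {
      const aik = a[i * n + k];
      for (let j = 0; j < n; j++) c[i * n + j] += aik * b[k * n + j];
    }
  return c;
}
```

Because the model's shape is static and matches the selected size, ONNX RT can specialize its generated WGSL kernel at session-init time rather than at dispatch time.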

The “Include Compile” toggle amortizes each side's compile/init cost across all timed runs: effective = (compile + Σ dispatch) / N. MindLang's “compile” is fetching a pre-built WGSL file and creating a pipeline; ONNX RT's init includes runtime WGSL shader generation from the ONNX graph.
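The amortization formula above reduces to a one-line helper:

```javascript
// Effective per-run time once compile/init cost is spread over N timed runs:
// effective = (compile + Σ dispatch) / N
function effectiveMs(compileMs, dispatchTimesMs) {
  const total = dispatchTimesMs.reduce((sum, t) => sum + t, 0);
  return (compileMs + total) / dispatchTimesMs.length;
}
```

For example, 50 ms of init plus five 2 ms dispatches gives effectiveMs(50, [2, 2, 2, 2, 2]) = (50 + 10) / 5 = 12 ms per run.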

Results vary by GPU, driver version, browser, and system load. Requires Chrome 113+, Edge 113+, or another browser with WebGPU enabled.