WebGPU Benchmark

GEMM — Matrix Multiplication

Apples-to-apples comparison: MindLang AOT-compiled WGSL shader vs ONNX Runtime Web's WebGPU backend. Same operation (1024×1024 GEMM), same GPU, same browser — 2·1024³ ≈ 2.15 GFLOP of floating-point work per run.

MindLang

AOT-compiled WGSL via mindc --target webgpu. Fetches the pre-compiled gemm.wgsl with 8×4 register tiling, vec4 loads, and bank-conflict-free shared memory. Dispatches 128×64 output tiles via 16×16 workgroups.

Avg Time
GFLOPS
Min Dispatch
Shader Compile
ONNX RT Web

ONNX Runtime Web 1.21 with the WebGPU execution provider. Loads a static-shape MatMul model (matching the selected size), creates an InferenceSession, and runs inference.

Avg Time
GFLOPS
Min Dispatch
Session Init

Methodology

Both paths perform the identical mathematical operation: C = A × B where A, B, C are 1024×1024 f32 matrices. Each run includes 1 warmup dispatch (not counted) followed by 5 timed dispatches with queue.onSubmittedWorkDone() synchronization between each.
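The warmup-plus-timed-dispatches loop can be sketched as below. This is an illustrative harness, not the benchmark's actual source: the `device`, `pipeline`, and `bindGroup` arguments are assumed to have been created earlier through the standard WebGPU API.

```javascript
// Each run of a 1024×1024 GEMM performs 2·n³ floating-point operations.
function gflops(n, ms) {
  return (2 * n ** 3) / (ms / 1000) / 1e9;
}

// One warmup dispatch (not counted), then `runs` timed dispatches with
// queue.onSubmittedWorkDone() synchronization between each, per the
// methodology above. Names and structure are a sketch, not the real code.
async function timeDispatches(device, pipeline, bindGroup, runs = 5) {
  const dispatch = () => {
    const encoder = device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(1024 / 128, 1024 / 64); // 128×64 output tiles
    pass.end();
    device.queue.submit([encoder.finish()]);
  };

  dispatch(); // warmup, not counted
  await device.queue.onSubmittedWorkDone();

  const times = [];
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    dispatch();
    await device.queue.onSubmittedWorkDone(); // sync between timed runs
    times.push(performance.now() - t0);
  }
  return times;
}
```

At 2 ms per dispatch, gflops(1024, 2) reports roughly 1074 GFLOPS of effective throughput.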

MindLang uses a pre-compiled WGSL compute shader fetched from /bench/gemm/gemm.wgsl. The shader uses 8×4 register tiling (32 FMAs per inner-loop iteration), bank-conflict-free shared memory (stride-17 padding), and vec4 vectorized loads. Shader compile time is measured separately and not included in the dispatch average.
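The MindLang path (fetch the pre-built WGSL, create a pipeline, time that separately) might look like the sketch below. The entry-point name is an assumption; the fetch()/WebGPU calls shown are standard API.

```javascript
// Workgroup grid for an n×n output covered by tileX×tileY tiles per workgroup.
function dispatchGrid(n, tileX, tileY) {
  return [Math.ceil(n / tileX), Math.ceil(n / tileY)];
}

// Fetch the pre-compiled shader and build the compute pipeline, measuring
// compile time separately so it is excluded from the dispatch average.
async function buildPipeline(device) {
  const t0 = performance.now();
  const code = await (await fetch('/bench/gemm/gemm.wgsl')).text();
  const module = device.createShaderModule({ code });
  const pipeline = await device.createComputePipelineAsync({
    layout: 'auto',
    compute: { module, entryPoint: 'main' }, // entry point name assumed
  });
  return { pipeline, compileMs: performance.now() - t0 };
}
```

For the 1024×1024 case with 128×64 output tiles, dispatchGrid(1024, 128, 64) yields an 8×16 workgroup grid.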

ONNX Runtime Web v1.21 uses the WebGPU execution provider. A static-shape MatMul ONNX model matching the selected size is loaded, giving ONNX RT full opportunity to specialize its kernel. Session init time (including ONNX graph compilation to WGSL) is measured separately.
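A sketch of the ONNX Runtime Web path follows. The model URL and the input/output names ('A', 'B', 'C') are assumptions about the generated MatMul model; the ort.* calls are the standard onnxruntime-web API. A naive CPU matmul is included as a reference for spot-checking that both paths compute the same C = A × B.

```javascript
// import * as ort from 'onnxruntime-web/webgpu';

// Create the session (init measured separately, per the methodology) and run.
// Model path and tensor names are illustrative assumptions.
async function runOnnxGemm(ort, modelUrl, n = 1024) {
  const t0 = performance.now();
  const session = await ort.InferenceSession.create(modelUrl, {
    executionProviders: ['webgpu'],
  });
  const initMs = performance.now() - t0;
  const a = new ort.Tensor('float32', new Float32Array(n * n), [n, n]);
  const b = new ort.Tensor('float32', new Float32Array(n * n), [n, n]);
  const { C } = await session.run({ A: a, B: b });
  return { C, initMs };
}

// CPU reference: C = A × B for row-major n×n matrices.
function matmulRef(a, b, n) {
  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++)
    for (let k = 0; k < n; k++) {
      const aik = a[i * n + k];
      for (let j = 0; j < n; j++) c[i * n + j] += aik * b[k * n + j];
    }
  return c;
}
```

Because the model's shape is static and matches the selected size, ONNX RT can specialize its generated WGSL kernel at session-init time rather than at dispatch time.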

The “Include Compile” toggle amortizes each side's compile/init cost across all timed runs: effective = (compile + Σ dispatch) / N. MindLang's “compile” is fetching a pre-built WGSL file and creating a pipeline; ONNX RT's init includes runtime WGSL shader generation from the ONNX graph.
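The amortization formula above reduces to a one-line helper:

```javascript
// Effective per-run time once compile/init cost is spread over N timed runs:
// effective = (compile + Σ dispatch) / N
function effectiveMs(compileMs, dispatchTimesMs) {
  const total = dispatchTimesMs.reduce((sum, t) => sum + t, 0);
  return (compileMs + total) / dispatchTimesMs.length;
}
```

For example, 50 ms of init plus five 2 ms dispatches gives effectiveMs(50, [2, 2, 2, 2, 2]) = (50 + 10) / 5 = 12 ms per run.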

Results vary by GPU, driver version, browser, and system load. Requires Chrome 113+, Edge 113+, or another browser with WebGPU enabled.