[ExecuTorch][WebGPU] Steel q4gsw prefill GEMM — test coverage#20731
JCNTH wants to merge 4 commits into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20731
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Cancelled Job
As of commit 5c35cb5 with merge base 4af91c3:
NEW FAILURE - The following job has failed:
CANCELLED JOB - The following job was cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
@claude thoroughly review; check for overengineering, correctness, and areas for modularization
Claude finished @JCNTH's task in 1m 14s

Review: steel q4gsw config

Verdict: LGTM. Correct, minimal, and consistent with the existing pattern. No changes needed. Details below.

Correctness
The

Tolerances

Python ↔ C++ mirroring
Both tables agree field-for-field:

Placed first in the

Overengineering / modularization
Nothing to flag. It's 4 lines into an existing dataclass/struct sweep: no new abstraction, no premature helper. The Python↔C++ duplication is a pre-existing intentional design ("single source of truth, mirrored by the C++ table"); this PR correctly follows it rather than trying to refactor it. Comment verbosity matches the surrounding file style.

Minor observations (non-blocking)
Pull Request resolved: #20731

**Add steel-GEMM coverage to the `et_vk.linear_q4gsw` golden sweep (stacked on the steel op diff).**

The op diff routes M>1 q4gsw prefill to the new steel GEMM on a >=256-invocation device (`K % 16 == 0`), falling back to shmem/register-tiled otherwise. The existing M>1 CONFIGS (`q_proj_4k`, `gate_proj_pf`, `down_proj_pf`, `shmem_edge`) already exercise steel on such a device via the shape-discovering native sweep; this adds one small config that isolates the steel branch specifically and documents the routing.

**Changes:**
- `test_quantized_linear.py` / `test_webgpu_native.cpp`: add the `steel` config (M=96, K=2048, N=256). It is below the shmem thresholds (K<4096, N<2048), so pre-steel it was register-tiled, which uniquely pins the steel branch; M=96 exercises the partial 64-row tile (edge masking).
- Document that M>1 `K % 16 == 0` shapes prefer steel on a >=256-invocation device (lvp) and fall back on a <256 device (SwiftShader); the same fp64 golden validates whichever kernel runs.

Co-authored-with: Claude Code.

ghstack-source-id: 399938970
@exported-using-ghexport

Differential Revision: [D110660966](https://our.internmc.facebook.com/intern/diff/D110660966/)
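The routing described in the commit message can be read as a small decision function. The following is only a sketch of that reading, not the op's actual dispatch code: the function name `pick_prefill_kernel`, the constant names, and the kernel labels are hypothetical, and the rule that shmem requires both K >= 4096 and N >= 2048 is an assumption (the message only states that the `steel` config sits below both thresholds).

```python
# Hypothetical sketch of the M>1 q4gsw prefill routing described above.
# Names and the exact shmem rule are assumptions, not the real dispatch code.

STEEL_MIN_INVOCATIONS = 256  # ">=256-invocation device" from the commit message
STEEL_K_MULTIPLE = 16        # "K % 16 == 0" from the commit message
SHMEM_MIN_K = 4096           # shmem threshold on K from the commit message
SHMEM_MIN_N = 2048           # shmem threshold on N from the commit message


def pick_prefill_kernel(m: int, k: int, n: int, max_invocations: int) -> str:
    """Return a label for the kernel this sketch would select for an M>1 shape."""
    if m <= 1:
        raise ValueError("this sketch only covers prefill shapes with M > 1")
    # Steel is preferred when the device is large enough and K is aligned.
    if max_invocations >= STEEL_MIN_INVOCATIONS and k % STEEL_K_MULTIPLE == 0:
        return "steel"
    # ASSUMPTION: shmem needs both K and N at or above their thresholds.
    if k >= SHMEM_MIN_K and n >= SHMEM_MIN_N:
        return "shmem"
    return "register_tiled"


# The new `steel` config (M=96, K=2048, N=256) on both device classes.
on_large_device = pick_prefill_kernel(96, 2048, 256, max_invocations=256)
on_small_device = pick_prefill_kernel(96, 2048, 256, max_invocations=128)
```

Under these assumptions the `steel` shape selects steel on a >=256-invocation device and register-tiled on a smaller one, which is why it isolates the steel branch where the larger configs would also be eligible for shmem.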
Stack from ghstack (oldest at bottom):
**Add steel-GEMM coverage to the `et_vk.linear_q4gsw` golden sweep (stacked on the steel op diff).**

The op diff routes M>1 q4gsw prefill to the new steel GEMM on a >=256-invocation device (`K % 16 == 0`), falling back to shmem/register-tiled otherwise. The existing M>1 CONFIGS (`q_proj_4k`, `gate_proj_pf`, `down_proj_pf`, `shmem_edge`) already exercise steel on such a device via the shape-discovering native sweep; this adds one small config that isolates the steel branch specifically and documents the routing.

**Changes:**
- `test_quantized_linear.py` / `test_webgpu_native.cpp`: add the `steel` config (M=96, K=2048, N=256). It is below the shmem thresholds (K<4096, N<2048), so pre-steel it was register-tiled, which uniquely pins the steel branch; M=96 exercises the partial 64-row tile (edge masking).
- Document that M>1 `K % 16 == 0` shapes prefer steel on a >=256-invocation device (lvp) and fall back on a <256 device (SwiftShader); the same fp64 golden validates whichever kernel runs.

Co-authored-with: Claude Code.
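To see why M=96 exercises edge masking, it helps to work out how 96 rows split into 64-row tiles. The sketch below is illustrative arithmetic only; the helper `row_tiles` is hypothetical and does not appear in the tests or the kernel.

```python
# Hypothetical helper: how many valid rows each 64-row tile covers for a given M.

def row_tiles(m: int, tile_rows: int = 64) -> list[int]:
    """Return the number of valid rows in each tile that covers M rows."""
    if m <= 0 or tile_rows <= 0:
        raise ValueError("m and tile_rows must be positive")
    return [min(tile_rows, m - start) for start in range(0, m, tile_rows)]


tiles = row_tiles(96)             # one full tile plus one partial tile
masked_rows = 64 - tiles[-1]      # rows of the last tile that fall outside M
```

With M=96 the second tile holds only 32 valid rows, so the remaining 32 rows of that tile must be masked, whereas an M that is a multiple of 64 would never reach that path.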
@exported-using-ghexport
Differential Revision: D110660966