[ExecuTorch][WebGPU] Steel q4gsw prefill GEMM — test coverage#20731

Open
JCNTH wants to merge 4 commits into gh/JCNTH/5/base from gh/JCNTH/5/head

Conversation

@JCNTH

JCNTH commented Jul 5, 2026

Stack from ghstack (oldest at bottom):

Add steel-GEMM coverage to the et_vk.linear_q4gsw golden sweep (stacked on the steel op diff).

The op diff routes M>1 q4gsw prefill to the new steel GEMM on a >=256-invocation device (K % 16 == 0), falling back to shmem/register-tiled otherwise. The existing M>1 CONFIGS (q_proj_4k, gate_proj_pf, down_proj_pf, shmem_edge) already exercise steel on such a device via the shape-discovering native sweep; this adds one small config that isolates the steel branch specifically and documents the routing.

Changes:

  • test_quantized_linear.py / test_webgpu_native.cpp: add the steel config (M=96, K=2048, N=256) — below the shmem thresholds (K<4096, N<2048) so pre-steel it was register-tiled, which uniquely pins the steel branch; M=96 exercises the partial 64-row tile (edge masking).
  • Document that M>1 K % 16 == 0 shapes prefer steel on a >=256-invocation device (lvp) and fall back on a <256 device (SwiftShader) — the same fp64 golden validates whichever kernel runs.
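
The routing described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `pick_prefill_kernel` and the constant names are hypothetical, and the way the real op combines the two shmem thresholds is not visible in this PR (the sketch treats a shape as register-tiled only when it is below both).

```python
# Thresholds quoted in the PR description. The constant names are hypothetical.
STEEL_MIN_INVOCATIONS = 256
STEEL_K_MULTIPLE = 16
SHMEM_MIN_K = 4096
SHMEM_MIN_N = 2048


def pick_prefill_kernel(m: int, k: int, n: int, max_invocations: int) -> str:
    """Hypothetical routing helper: returns the kernel family for a q4gsw shape."""
    if m <= 1:
        # M == 1 is decode, not prefill; it is outside the scope of this sketch.
        return "decode"
    if max_invocations >= STEEL_MIN_INVOCATIONS and k % STEEL_K_MULTIPLE == 0:
        return "steel"
    # Assumption: "below the shmem thresholds" means below both of them.
    if k < SHMEM_MIN_K and n < SHMEM_MIN_N:
        return "register_tiled"
    return "shmem"


# The new config on a >=256-invocation device versus a 128-invocation one.
print(pick_prefill_kernel(96, 2048, 256, max_invocations=256))
print(pick_prefill_kernel(96, 2048, 256, max_invocations=128))
```

Under these assumptions the same shape lands on steel on the larger device and on the register-tiled fallback on the smaller one, which is why a single golden can validate either kernel.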

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D110660966

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jul 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20731

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit 5c35cb5 with merge base 4af91c3 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla Bot commented Jul 5, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

@github-actions

github-actions Bot commented Jul 5, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 5, 2026
@JCNTH

JCNTH commented Jul 5, 2026

@claude thoroughly review; check for overengineering, correctness, and areas for modularization

@claude

claude Bot commented Jul 5, 2026

Claude finished @JCNTH's task in 1m 14s.


Review: steel q4gsw config

  • Read the diff and both mirrored config tables in context
  • Verify config values / shape constraints
  • Check Python↔C++ mirroring
  • Assess overengineering & modularization

Verdict: LGTM. Correct, minimal, and consistent with the existing pattern. No changes needed. Details below.

Correctness

The steel shape (M=96, K=2048, N=256) satisfies every q4gsw constraint and lands exactly where the comment claims:

| Constraint | Check |
| --- | --- |
| N % 8 == 0 (torchao scale layout) | 256 % 8 == 0 |
| K % group_size == 0 (group_size = 32) | 2048 % 32 == 0 |
| K % 16 == 0 (steel routing) | 2048 % 16 == 0 |
| below shmem thresholds (K<4096, N<2048) | 2048<4096, 256<2048 ✅; pre-steel this was register-tiled, so it uniquely pins the steel branch |
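
The four checks in the table are plain integer arithmetic and can be reproduced directly. A minimal sketch follows; `check_steel_shape` is a hypothetical helper name, not something from the test files.

```python
def check_steel_shape(m: int, k: int, n: int, group_size: int = 32) -> dict:
    """Evaluate the constraints listed in the review table for one shape."""
    return {
        "n_multiple_of_8": n % 8 == 0,               # torchao scale layout
        "k_multiple_of_group_size": k % group_size == 0,
        "k_multiple_of_16": k % 16 == 0,             # steel routing
        "below_shmem_thresholds": k < 4096 and n < 2048,
    }


checks = check_steel_shape(m=96, k=2048, n=256)
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

M does not enter any of these constraints; it only matters for the partial-tile behaviour.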

M=96 gives a partial trailing tile (96 = 64 + 32) — assuming the steel kernel tiles at 64 rows per the op diff (not visible in this stacked diff), this correctly exercises the edge-mask path. Worth a quick confirm against the kernel's tile height in #20730.
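
To make the partial-tile claim concrete, the sketch below splits M rows into tiles and reports how many rows of each tile are valid. The 64-row tile height is the assumption taken from the PR description; `row_tiles` is a hypothetical helper.

```python
def row_tiles(m: int, tile_rows: int = 64) -> list:
    """Return (start_row, valid_rows) for each row tile covering M rows."""
    return [(start, min(tile_rows, m - start)) for start in range(0, m, tile_rows)]


# M=96 needs two tiles; the second has only 32 valid rows, so the remaining
# 32 rows of that tile would have to be masked by the kernel.
print(row_tiles(96))  # [(0, 64), (64, 32)]
```

If the kernel's actual tile height differs from 64, the split changes, which is why the review suggests confirming it against #20730.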

Tolerances (1e-4, 1e-3) match the other K=2048 configs, which is the right basis — the file notes tol scales with K (accum depth), not M. required=true, heavy=false is consistent with the other small non-heavy shapes; golden is only 96*256 = 24576 floats (~96KB), so no fixture-size concern.
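
The fixture-size figure can be checked with a few lines. The 4 bytes per element is an assumption implied by "~96KB" for 24576 values. The `within_tol` predicate shows one common way to combine an absolute and a relative tolerance (the `allclose`-style form); whether the test harness combines them this way is also an assumption, and both helper names are hypothetical.

```python
def golden_size_kib(m: int, n: int, bytes_per_elem: int = 4) -> float:
    """Size of an M x N golden output in KiB, assuming 4-byte elements."""
    return m * n * bytes_per_elem / 1024


def within_tol(got: float, ref: float, tol_abs: float = 1e-4, tol_rel: float = 1e-3) -> bool:
    """allclose-style check; assumed, not confirmed, to match the harness."""
    return abs(got - ref) <= tol_abs + tol_rel * abs(ref)


print(96 * 256)                  # 24576 elements
print(golden_size_kib(96, 256))  # 96.0
```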

Python ↔ C++ mirroring

Both tables agree field-for-field:

  • Python: Q4gswConfig("steel", 96, 2048, 256) → name/m/k/n + defaults group_size=32, heavy=False
  • C++: {"steel", 96, 2048, 256, 1e-4f, 1e-3f, true, false} → name/m/k/n/tol_abs/tol_rel/required/heavy

Placed first in the M>1 group in both files, with parallel comments. The single-source-of-truth invariant is preserved.
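
The field-for-field comparison can be sketched as follows. The dataclass here is a stand-in that contains only the fields named in the review; the real `Q4gswConfig` in `test_quantized_linear.py` may carry more. The C++ initializer is represented as a plain tuple, and `mirrors` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Q4gswConfig:
    """Stand-in with only the fields the review mentions (assumed layout)."""
    name: str
    m: int
    k: int
    n: int
    group_size: int = 32
    heavy: bool = False


# C++ row from the review: name, m, k, n, tol_abs, tol_rel, required, heavy
CPP_ROW = ("steel", 96, 2048, 256, 1e-4, 1e-3, True, False)


def mirrors(py: Q4gswConfig, cpp_row: tuple) -> bool:
    """Compare the fields the two tables share: name, m, k, n, heavy."""
    name, m, k, n, _tol_abs, _tol_rel, _required, heavy = cpp_row
    return (py.name, py.m, py.k, py.n, py.heavy) == (name, m, k, n, heavy)


print(mirrors(Q4gswConfig("steel", 96, 2048, 256), CPP_ROW))
```

A check of this kind only covers the shared fields; tolerances and the `required` flag live on the C++ side in this sketch.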

Overengineering / modularization

Nothing to flag. It's 4 lines into an existing dataclass/struct sweep — no new abstraction, no premature helper. The Python↔C++ duplication is a pre-existing intentional design ("single source of truth, mirrored by the C++ table"); this PR correctly follows it rather than trying to refactor it. Comment verbosity matches the surrounding file style.

Minor observations (non-blocking)

  1. This is also the only lightweight M>1 register-tiled config on SwiftShader. kv_proj_4k (M=4096, K=2048, N=512) already hits register-tiled on CI, but it's a full-tile 4k-token shape; steel is the one that pins the partial-tile register-tiled path there too. So the config earns coverage on both device tiers — steel-branch isolation on lvp, partial-tile register-tiled on SwiftShader. The comment captures this but could make the CI-side value slightly more explicit if you want.
  2. The "SwiftShader caps at 128" parenthetical is a device-specific claim I can't verify from this diff — just make sure it stays accurate if the min-invocation threshold in the op diff changes (both are now coupled by prose only).
    · branch gh/JCNTH/5/head

[ghstack-poisoned]
JCNTH added a commit that referenced this pull request Jul 5, 2026
Pull Request resolved: #20731

**Add steel-GEMM coverage to the `et_vk.linear_q4gsw` golden sweep (stacked on the steel op diff).**

The op diff routes M>1 q4gsw prefill to the new steel GEMM on a >=256-invocation device (`K % 16 == 0`), falling back to shmem/register-tiled otherwise. The existing M>1 CONFIGS (`q_proj_4k`, `gate_proj_pf`, `down_proj_pf`, `shmem_edge`) already exercise steel on such a device via the shape-discovering native sweep; this adds one small config that isolates the steel branch specifically and documents the routing.

**Changes:**
- `test_quantized_linear.py` / `test_webgpu_native.cpp`: add the `steel` config (M=96, K=2048, N=256) — below the shmem thresholds (K<4096, N<2048) so pre-steel it was register-tiled, which uniquely pins the steel branch; M=96 exercises the partial 64-row tile (edge masking).
- Document that M>1 `K % 16 == 0` shapes prefer steel on a >=256-invocation device (lvp) and fall back on a <256 device (SwiftShader) — the same fp64 golden validates whichever kernel runs.

Co-authored-with: Claude Code.
ghstack-source-id: 399938970
@exported-using-ghexport

Differential Revision: [D110660966](https://our.internmc.facebook.com/intern/diff/D110660966/)

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported
