[ExecuTorch][WebGPU] Steel q4gsw prefill GEMM — test coverage by JCNTH · Pull Request #20731 · pytorch/executorch

JCNTH · 2026-07-05T01:20:05Z

Stack from ghstack (oldest at bottom):

Add steel-GEMM coverage to the et_vk.linear_q4gsw golden sweep (stacked on the steel op diff).

The op diff routes M>1 q4gsw prefill with K % 16 == 0 to the new steel GEMM on a >=256-invocation device, falling back to shmem/register-tiled otherwise. The existing M>1 CONFIGS (q_proj_4k, gate_proj_pf, down_proj_pf, shmem_edge) already exercise steel on such a device via the shape-discovering native sweep; this adds one small config that isolates the steel branch specifically and documents the routing.
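The routing condition above can be sketched as a small predicate. This is only an illustration of the prose, not the op diff's code: the function name `routes_to_steel` and the `max_invocations` parameter are hypothetical, and the fallback choice between shmem and register-tiled is deliberately left out because this PR does not show it.

```python
# Constants taken from the PR text; names are hypothetical.
STEEL_MIN_INVOCATIONS = 256  # ">=256-invocation device"
STEEL_K_ALIGNMENT = 16       # "K % 16 == 0"


def routes_to_steel(m: int, k: int, max_invocations: int) -> bool:
    """Return True when an (M, K) q4gsw shape would prefer the steel GEMM.

    Assumption: these three conditions are the whole predicate. The real
    dispatch in the op diff may check more than this.
    """
    is_prefill = m > 1
    k_aligned = k % STEEL_K_ALIGNMENT == 0
    device_ok = max_invocations >= STEEL_MIN_INVOCATIONS
    return is_prefill and k_aligned and device_ok


# The new steel config (M=96, K=2048) on two hypothetical device limits.
on_large_device = routes_to_steel(96, 2048, max_invocations=256)
on_small_device = routes_to_steel(96, 2048, max_invocations=128)
print(on_large_device, on_small_device)
```

Under these assumptions the same shape takes steel on the larger device and falls back on the smaller one, which is why one golden can validate either kernel.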

Changes:

- `test_quantized_linear.py` / `test_webgpu_native.cpp`: add the `steel` config (M=96, K=2048, N=256). It sits below the shmem thresholds (K<4096, N<2048), so pre-steel it was register-tiled, which uniquely pins the steel branch; M=96 exercises the partial 64-row tile (edge masking).
- Document that M>1 `K % 16 == 0` shapes prefer steel on a >=256-invocation device (lvp) and fall back on a <256 device (SwiftShader); the same fp64 golden validates whichever kernel runs.

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D110660966

[ghstack-poisoned]

pytorch-bot · 2026-07-05T01:20:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20731

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit 5c35cb5 with merge base 4af91c3:

NEW FAILURE - The following job has failed:

pull / test-binary-size-linux-gcc / linux-job (gh)
RuntimeError: Command docker exec -t ae95e977762a919408531de21a139861942438dac4c55b321e7d1a003bc69584 /exec failed with exit code 1

CANCELLED JOB - The following job was cancelled. Please retry:

pull / unittest-nxp-neutron / linux-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-07-05T01:20:10Z

The committers listed above are authorized under a signed CLA.

✅ login: JCNTH / name: Julian Ng-Thow-Hing (5dbd084, c87361d)

github-actions · 2026-07-05T01:20:46Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JCNTH · 2026-07-05T01:22:42Z

@claude thoroughly review; check for overengineering, correctness, and areas for modularization

claude · 2026-07-05T01:23:32Z

Claude finished @JCNTH's task in 1m 14s

Review: steel q4gsw config

- Read the diff and both mirrored config tables in context
- Verify config values / shape constraints
- Check Python↔C++ mirroring
- Assess overengineering & modularization

Verdict: LGTM. Correct, minimal, and consistent with the existing pattern. No changes needed. Details below.

Correctness

The steel shape (M=96, K=2048, N=256) satisfies every q4gsw constraint and lands exactly where the comment claims:

| Constraint | Check |
| --- | --- |
| `N % 8 == 0` (torchao scale layout) | `256 % 8 == 0` ✅ |
| `K % group_size == 0` (32) | `2048 % 32 == 0` ✅ |
| `K % 16 == 0` (steel routing) | `2048 % 16 == 0` ✅ |
| below shmem thresholds (`K<4096`, `N<2048`) | `2048<4096`, `256<2048` ✅ (pre-steel this was register-tiled, so it uniquely pins the steel branch) |

M=96 gives a partial trailing tile (96 = 64 + 32) — assuming the steel kernel tiles at 64 rows per the op diff (not visible in this stacked diff), this correctly exercises the edge-mask path. Worth a quick confirm against the kernel's tile height in #20730.
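The constraint checks and the tile arithmetic above can be reproduced with a few lines. This is a sketch under stated assumptions: `steel_shape_report` is a hypothetical helper, and the 64-row tile height is taken from the PR description rather than confirmed against the kernel.

```python
import math

GROUP_SIZE = 32  # K % group_size constraint quoted in the table above
TILE_M = 64      # assumed steel tile height (from the PR text, not the kernel)


def steel_shape_report(m: int, k: int, n: int) -> dict:
    """Evaluate each shape constraint and the row-tile split for (M, K, N)."""
    return {
        "n_mult_8": n % 8 == 0,
        "k_mult_group": k % GROUP_SIZE == 0,
        "k_mult_16": k % 16 == 0,
        "below_shmem": k < 4096 and n < 2048,
        # Number of row tiles needed to cover M rows.
        "row_tiles": math.ceil(m / TILE_M),
        # Rows in the trailing tile; a full tile when M divides evenly.
        "last_tile_rows": m % TILE_M or TILE_M,
    }


report = steel_shape_report(96, 2048, 256)
for key, value in report.items():
    print(f"{key}: {value}")
```

For M=96 this gives two row tiles with a 32-row trailing tile, so the last tile is half full and the edge mask has real work to do if the assumed tile height holds.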

Tolerances (1e-4, 1e-3) match the other K=2048 configs, which is the right basis: the file notes that tolerance scales with K (accumulation depth), not M. required=true, heavy=false is consistent with the other small non-heavy shapes; the golden is only 96*256 = 24576 floats (~96KB at 4 bytes each), so no fixture-size concern.
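For reference, the tolerance pair and the fixture-size estimate can be worked through as below. The combined absolute-plus-relative check is an assumption about how `tol_abs` and `tol_rel` are applied (it is the common convention, but the test harness is not shown here), and both helper names are hypothetical. The size estimate assumes 4-byte elements.

```python
def within_tol(actual: float, golden: float, tol_abs: float, tol_rel: float) -> bool:
    """Assumed check: |actual - golden| <= tol_abs + tol_rel * |golden|."""
    return abs(actual - golden) <= tol_abs + tol_rel * abs(golden)


def golden_bytes(m: int, n: int, bytes_per_elem: int = 4) -> int:
    """Size of an M x N golden output, assuming 4-byte elements by default."""
    return m * n * bytes_per_elem


TOL_ABS, TOL_REL = 1e-4, 1e-3  # values quoted for the K=2048 configs

close_enough = within_tol(1.0005, 1.0, TOL_ABS, TOL_REL)
too_far = within_tol(1.002, 1.0, TOL_ABS, TOL_REL)
size_kib = golden_bytes(96, 256) / 1024
print(close_enough, too_far, size_kib)
```

If the golden were stored at 8 bytes per element instead, the same shape would come to twice that size, which is still small for a fixture.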

Python ↔ C++ mirroring

Both tables agree field-for-field:

- Python: `Q4gswConfig("steel", 96, 2048, 256)` → name/m/k/n + defaults `group_size=32`, `heavy=False`
- C++: `{"steel", 96, 2048, 256, 1e-4f, 1e-3f, true, false}` → name/m/k/n/tol_abs/tol_rel/required/heavy

Placed first in the M>1 group in both files, with parallel comments. The single-source-of-truth invariant is preserved.
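A field-for-field comparison like the one above could be automated along these lines. This is a hypothetical sketch: the `Q4gswConfig` dataclass here is reconstructed from the field order quoted in this review and may not match the real definition, the C++ row is transcribed by hand as a Python tuple, and `mirrors` is an illustrative helper that does not exist in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Q4gswConfig:
    # Reconstructed from the review text; the real class may differ.
    name: str
    m: int
    k: int
    n: int
    group_size: int = 32
    heavy: bool = False


# C++ row transcribed by hand:
# name, m, k, n, tol_abs, tol_rel, required, heavy
CPP_STEEL_ROW = ("steel", 96, 2048, 256, 1e-4, 1e-3, True, False)


def mirrors(py_cfg: Q4gswConfig, cpp_row: tuple) -> bool:
    """Compare the fields both tables share: name, m, k, n, heavy."""
    name, m, k, n, _tol_abs, _tol_rel, _required, heavy = cpp_row
    shared_py = (py_cfg.name, py_cfg.m, py_cfg.k, py_cfg.n, py_cfg.heavy)
    return shared_py == (name, m, k, n, heavy)


steel = Q4gswConfig("steel", 96, 2048, 256)
print(mirrors(steel, CPP_STEEL_ROW))
```

Only the shared fields are compared, since the tolerances and the `required` flag appear in the C++ row alone.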

Overengineering / modularization

Nothing to flag. It's 4 lines into an existing dataclass/struct sweep — no new abstraction, no premature helper. The Python↔C++ duplication is a pre-existing intentional design ("single source of truth, mirrored by the C++ table"); this PR correctly follows it rather than trying to refactor it. Comment verbosity matches the surrounding file style.

Minor observations (non-blocking)

- This is also the only lightweight M>1 register-tiled config on SwiftShader. kv_proj_4k (M=4096, K=2048, N=512) already hits register-tiled on CI, but it's a full-tile 4k-token shape; `steel` is the one that pins the partial-tile register-tiled path there too. So the config earns coverage on both device tiers: steel-branch isolation on lvp, partial-tile register-tiled on SwiftShader. The comment captures this but could make the CI-side value slightly more explicit if you want.
- The "SwiftShader caps at 128" parenthetical is a device-specific claim I can't verify from this diff. Just make sure it stays accurate if the min-invocation threshold in the op diff changes (both are now coupled by prose only).

[ghstack-poisoned]

Pull Request resolved: #20731

**Add steel-GEMM coverage to the `et_vk.linear_q4gsw` golden sweep (stacked on the steel op diff).**

The op diff routes M>1 q4gsw prefill to the new steel GEMM on a >=256-invocation device (`K % 16 == 0`), falling back to shmem/register-tiled otherwise. The existing M>1 CONFIGS (`q_proj_4k`, `gate_proj_pf`, `down_proj_pf`, `shmem_edge`) already exercise steel on such a device via the shape-discovering native sweep; this adds one small config that isolates the steel branch specifically and documents the routing.

**Changes:**

- `test_quantized_linear.py` / `test_webgpu_native.cpp`: add the `steel` config (M=96, K=2048, N=256). It sits below the shmem thresholds (K<4096, N<2048), so pre-steel it was register-tiled, which uniquely pins the steel branch; M=96 exercises the partial 64-row tile (edge masking).
- Document that M>1 `K % 16 == 0` shapes prefer steel on a >=256-invocation device (lvp) and fall back on a <256 device (SwiftShader); the same fp64 golden validates whichever kernel runs.

Co-authored-with: Claude Code.

ghstack-source-id: 399938970
@exported-using-ghexport

Differential Revision: [D110660966](https://our.internmc.facebook.com/intern/diff/D110660966/)


Update

5dbd084

[ghstack-poisoned]

JCNTH temporarily deployed to cadence July 5, 2026 01:20 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed label Jul 5, 2026

JCNTH temporarily deployed to cadence July 5, 2026 01:49 — with GitHub Actions Inactive

Update

c87361d

[ghstack-poisoned]

JCNTH temporarily deployed to cadence July 5, 2026 02:27 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jul 5, 2026

JCNTH temporarily deployed to cadence July 5, 2026 02:52 — with GitHub Actions Inactive

Update

73b4901

[ghstack-poisoned]

JCNTH temporarily deployed to cadence July 5, 2026 04:31 — with GitHub Actions Inactive

JCNTH temporarily deployed to cadence July 5, 2026 04:59 — with GitHub Actions Inactive

Update

5c35cb5

[ghstack-poisoned]

JCNTH temporarily deployed to cadence July 5, 2026 06:18 — with GitHub Actions Inactive

JCNTH temporarily deployed to cadence July 5, 2026 06:46 — with GitHub Actions Inactive

JCNTH wants to merge 4 commits into gh/JCNTH/5/base from gh/JCNTH/5/head
