Highlight fidelity for structured-comment dialects: lineComment regions, continuation brackets, contextualScopes, comment markup#58
Open
theoephraim wants to merge 6 commits into
Conversation
…dialects
A grammar whose comment token matches only the INTRODUCER (a bare `#`) while
the comment content is parser-tokenized (env-spec decorator comments) renders
badly from flat token rules: the prose after `#` keeps the ordinary text-token
scopes (string.unquoted, constant.numeric, ...), so themes color comments like
code.
New highlight-only token metadata — mirroring `interpolation` — declares the
shape instead:
token('#', { scope: 'comment.line', lineComment: { richStarters: [DEC_NAME] } })
gen-tm then emits to-end-of-line regions in place of the flat rule:
- a RICH region gated on a lookahead that a declared rich-starter opens the
body (`# @decorator(...)`) — full token highlighting applies inside via
$self, on top of the comment base scope, so structured comments keep their
colors while stray text dims;
- a PLAIN region for every other comment — no inner patterns, so the body
falls to the comment scope and dims like any comment.
The introducer captures as punctuation.definition.comment. Lexer/parser and
all other generators are untouched; grammars without the metadata are
byte-identical (npm run gen produces no diffs for the six built-in grammars).
Also registers test/env-spec-regressions.ts as a check gate (it was previously
manual-only) and extends it with the new contracts.
…ch comments
A bracket left OPEN in a rich comment continues the construct across
consecutive introducer-prefixed lines (env-spec multi-line decorator calls and
literals):
# @import(
# ./.env.shared,
# pick=[
# KEY1, # note
# ],
# )
Declared via `lineComment.continuationBrackets` (bracket pairs). Each pair
emits a begin/end region nested inside the line-scoped rich region — a
TextMate child region suspends its parent's $ end, so the construct spans
lines while single-line comments still close at EOL. Inside a construct:
a line-start introducer is a CONTINUATION MARKER (comment punctuation, not a
new comment); any other introducer opens an embedded comment that dims to
end-of-line; brackets nest recursively; everything else highlights via $self.
An unclosed bracket runs the region to the next closer — the standard
hand-written-grammar hazard; the parser stays the authority on validity.
Opt-in as before: `npm run gen` produces no diffs for the built-in grammars;
41/41 gates pass (env-spec regression contracts extended to 18).
Two more generic highlight-fidelity features hand-written grammars have that
generated ones lacked:
1. `contextualScopes` — grammar-level, highlight-only: token T carries scope S
when it appears within rule R (T's immediate enclosing rule):
contextualScopes: [
{ token: ASSIGN_KEY, within: [FunctionArgKeyValue], scope: 'entity.other.attribute-name' },
]
Each generator consumes the SAME declaration at its own fidelity:
- tree-sitter: exact `(rule (token) @capture)` queries, emitted last
(highlight resolution is last-wins) so they override the token's flat
capture inside the declared rules only;
- gen-tm: the flat grammar approximates rule context with derived CONSTRUCT
regions — a call-argument region (callee + `(` … `)`, nested bracket
regions inside, only emitted when contextual scopes are declared) and the
lineComment continuation-bracket interiors — where the override rules are
tried before the token's flat rule; top-level occurrences keep the
declared scope. The callee capture reuses the grammar's own callee-token
scope when one is declared.
Motivating case: env-spec option keys (`fn(retry=3)`, `@import(pick=…)`)
styling as attribute names, distinct from top-level env var keys.
2. `lineComment.markup` — declared doc-markup patterns (token-pattern IR, e.g.
`**bold**` / `__italic__`) highlighted inside PLAIN comment bodies.
Opt-in as before: npm run gen produces no diffs for the six built-in grammars;
41/41 gates pass (env-spec regression contracts extended to 25).
…token
The generic identifier fallback can resolve to a placeholder (never-matching)
token in indentation grammars, leaving ctx-call-args dead. Prefer the token
whose pattern is gated on a following '(' (the grammar's own callee), and skip
the region entirely when no usable callee pattern exists.
The previous contracts asserted the generated grammar's internal shape (repository keys, include ordering) — they break on refactor without saying what behavior to preserve. Rewritten against RENDERED OUTPUT instead: a real document is tokenized with vscode-textmate and the assertions read as the spec, line by line — - the same key token paints three ways: env key at top level, attribute name inside call args (contextualScopes), attribute name inside decorator comment constructs - plain comment prose dims as comment (never as a value string); declared markup (**bold**) highlights inside it - decorator comments keep rich token scopes - an open bracket continues the construct across #-prefixed lines: content keeps token scopes, the line-start # is a continuation marker, an embedded '# aside' dims to end-of-line, and after the closer a plain comment dims again - the parser CST is byte-identical with and without the highlight metadata (highlight-only, proven not asserted) - without the metadata, generation is unchanged (opt-in) - tree-sitter emits the same declaration as exact last-wins queries The implementation can be thrown away; these tests say what any replacement must do.
… their closer
A callee-anchored call-args region (reached via $self inside a rich comment)
starts matching at the CALLEE — an earlier position than the continuation
bracket's bare '(' — so it hijacked '# @dec=someFn(' constructs with an
interior that treats a line-start '#' as an embedded comment: the '# )' closer
was eaten, the region ran away past the comment block, and the FOLLOWING
config item painted as comment/attribute content.
Emit a continuation-aware call variant inside rich comments and construct
interiors (same marker-aware interior as the paren construct, begin consumes
the callee), tried before $self so it wins the position tie.
Spec test added: a call construct in a decorator comment keeps continuation
content rich, ends at its '# )' closer, and the following config item carries
no comment scope.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The shape of this may not be quite right, but I tried to add some tests clearly showing what I am trying to accomplish. Happy to chat further if helpful.
Problem
A grammar whose comment token matches only the introducer (a bare
#) while the comment content is parser-tokenized renders badly from flat token rules. env-spec is the motivating case: its comments carry a decorator DSL (# @required @type=enum(a, b)), so the parser must tokenize comment bodies — but then plain comment prose keeps the ordinary text-token scopes (string.unquoted,constant.numeric, …) in the generated TextMate grammar, and themes color comments like code:Approach
New highlight-only token metadata, mirroring the existing
interpolationhint — the lexer/parser and all other generators are untouched:gen-tmthen emits to-end-of-line regions in place of the flat rule:# @decorator(...)) — full token highlighting applies inside via$self, on top of the comment base scope, so structured comments keep their colors while stray text dims;The introducer captures as
punctuation.definition.comment. Rich is registered before plain (the gated form wins); both sit at the token's position in the top-level pattern list, so declaration-order precedence is preserved.The rich/plain split is declared rather than derived because it is genuinely not derivable: whether a
#opens prose or structure is decided by rule context the lexer never sees (even the parser only distinguishes them via which comment rule matches).Multi-line constructs (
continuationBrackets)Structured-comment dialects have multi-line forms where every continuation line is introducer-prefixed:
Declaring
lineComment: { richStarters: [...], continuationBrackets: [['(', ')'], ['[', ']'], ['{', '}']] }makes each bracket pair a begin/end region nested inside the line-scoped rich region — a TextMate child region suspends its parent's$end, so the construct spans lines while single-line comments still close at EOL. Inside a construct: a line-start introducer is a continuation marker (comment punctuation, not a new comment); any other introducer opens an embedded comment that dims to end-of-line; brackets nest recursively; everything else highlights via$self. A CALL opening a construct (# @dec=someFn() gets the same marker-aware interior via a callee-consuming variant tried before$self— otherwise a callee-anchored region (e.g. the contextualScopes call-args region) would win the position tie with an interior that treats# )as an embedded comment and leak past the closer. An unclosed bracket runs the region to the next closer — the standard hazard of every hand-written grammar with multi-line constructs; the parser stays the authority on validity.Contextual token scopes (
contextualScopes) + comment markupTwo more fidelity features hand-written grammars have that flat generation lacked:
contextualScopes(grammar-level, highlight-only) — token T carries scope S within rule R (T's immediate enclosing rule):The declaration names rules, so each generator consumes it at its own fidelity: tree-sitter emits exact
(rule (token) @capture)queries (last, since highlight resolution is last-wins); gen-tm approximates rule context with derived construct regions — a call-argument region (callee +(…), nested bracket regions inside, emitted only when contextual scopes are declared) plus the continuation-bracket interiors — where the overrides are tried before the token's flat rule. Top-level occurrences keep the declared scope. Motivating case: env-spec option keys (fn(retry=3),@import(pick=…)) styling as attribute names, distinct from top-level env var keys.lineComment.markup— declared doc-markup patterns (token-pattern IR, e.g.**bold**/__italic__) highlighted inside plain comment bodies.Blast radius
npm run genproduces no diffs for the six built-in grammars.npm run check: 41/41 gates pass. This PR also registerstest/env-spec-regressions.tsas a check gate (it was previously manual-only).#-prefixed continuation lines with embedded comments, state reverting after the closer, and a call-opened construct ending at its# )closer without swallowing the following config item. Highlight-only-ness is proven (the parser CST is byte-identical with and without the metadata), and the opt-in property is asserted directly. The implementation is disposable; the tests say what any replacement must do.Downstream validation: dmno-dev/varlock#744 generates its shipped VSCode grammar from this branch — plain comments dim, decorator comments and multi-line constructs stay rich, option keys style as attribute names (also carried into the tree-sitter highlights), and
**bold**/__italic__markup renders in comment text — all verified with vscode-textmate snapshot tests against a fixture corpus.