
Highlight fidelity for structured-comment dialects: lineComment regions, continuation brackets, contextualScopes, comment markup #58

Open
theoephraim wants to merge 6 commits into johnsoncodehk:master from dmno-dev:feat/line-comment-regions


theoephraim commented Jul 4, 2026

Contributor

The shape of this may not be quite right, but I tried to add some tests clearly showing what I am trying to accomplish. Happy to chat further if helpful.


Problem

A grammar whose comment token matches only the introducer (a bare #) while the comment content is parser-tokenized renders badly from flat token rules. env-spec is the motivating case: its comments carry a decorator DSL (# @required @type=enum(a, b)), so the parser must tokenize comment bodies — but then plain comment prose keeps the ordinary text-token scopes (string.unquoted, constant.numeric, …) in the generated TextMate grammar, and themes color comments like code:

# this is a comment          ← "this is a comment" renders in string color
VAL=foo  # contact bob@x.io  ← same
# @required @sensitive       ← decorators (correctly) colored

Approach

New highlight-only token metadata, mirroring the existing interpolation hint — the lexer/parser and all other generators are untouched:

const HASH = token(seq(notPrecededBy(noneOf(' ', '\t', '\n', '\r')), '#'), {
  scope: 'comment.line',
  lineComment: { richStarters: [DEC_NAME] },
});

gen-tm then emits to-end-of-line regions in place of the flat rule:

  • a rich region gated on a lookahead that a declared rich-starter opens the body (# @decorator(...)) — full token highlighting applies inside via $self, on top of the comment base scope, so structured comments keep their colors while stray text dims;
  • a plain region for every other comment — no inner patterns, so the body falls to the comment scope and dims like any comment.

The introducer captures as punctuation.definition.comment. Rich is registered before plain (the gated form wins); both sit at the token's position in the top-level pattern list, so declaration-order precedence is preserved.
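As a rough illustration of the rich/plain split, the sketch below builds two TextMate-style begin/end rules for one introducer and picks between them by trying the gated rule first. The `RegionRule` interface, the `lineCommentRegions` helper, and the `@name` rich-starter regex are all hypothetical names chosen for this example; the rules gen-tm actually emits may be shaped differently, and the `notPrecededBy` guard on the real HASH token is left out.

```typescript
// Hypothetical subset of a TextMate begin/end rule; not gen-tm's real output type.
interface RegionRule {
  name: string;
  begin: string; // regex source
  end: string;
  beginCaptures: Record<string, { name: string }>;
  patterns: Array<{ include: string }>;
}

// Escape a literal so it can be embedded in a regex source.
const escapeRegex = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build the two regions for one introducer. `richStarter` is a regex source
// for whatever may open a structured body (assumed here: an @-prefixed name).
function lineCommentRegions(introducer: string, richStarter: string): [RegionRule, RegionRule] {
  const intro = escapeRegex(introducer);
  const captures = { '1': { name: 'punctuation.definition.comment' } };
  const rich: RegionRule = {
    name: 'comment.line',
    begin: `(${intro})(?=[ \\t]*(?:${richStarter}))`, // gated on a lookahead
    end: '$',
    beginCaptures: captures,
    patterns: [{ include: '$self' }], // full token highlighting inside
  };
  const plain: RegionRule = {
    name: 'comment.line',
    begin: `(${intro})`,
    end: '$',
    beginCaptures: captures,
    patterns: [], // no inner patterns: the body falls to the comment scope
  };
  return [rich, plain]; // rich first, so the gated form wins
}

const [rich, plain] = lineCommentRegions('#', '@[A-Za-z_]\\w*');

// Try the rich rule before the plain one, mirroring the registration order.
const pick = (line: string): 'rich' | 'plain' | 'none' =>
  new RegExp(rich.begin).test(line) ? 'rich'
  : new RegExp(plain.begin).test(line) ? 'plain'
  : 'none';
```

With these assumptions, `pick('# @required @sensitive')` selects the rich rule, while `pick('# this is a comment')` and `pick('VAL=foo  # contact bob@x.io')` fall through to the plain one, because the `@` in the email address does not directly follow the introducer.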

The rich/plain split is declared rather than derived because it is genuinely not derivable: whether a # opens prose or structure is decided by rule context the lexer never sees (even the parser only distinguishes them via which comment rule matches).

Multi-line constructs (continuationBrackets)

Structured-comment dialects have multi-line forms where every continuation line is introducer-prefixed:

# @import(
#   ./.env.shared,
#   pick=[
#     KEY1, # note
#   ],
# )

Declaring lineComment: { richStarters: [...], continuationBrackets: [['(', ')'], ['[', ']'], ['{', '}']] } makes each bracket pair a begin/end region nested inside the line-scoped rich region. A TextMate child region suspends its parent's $ end, so the construct spans lines while single-line comments still close at EOL.

Inside a construct:

  • a line-start introducer is a continuation marker (comment punctuation, not a new comment);
  • any other introducer opens an embedded comment that dims to end-of-line;
  • brackets nest recursively;
  • everything else highlights via $self.

A call that opens a construct, such as # @dec=someFn( with its paren left open, gets the same marker-aware interior via a callee-consuming variant tried before $self. Otherwise a callee-anchored region (e.g. the contextualScopes call-args region) would claim the construct, and because its interior treats the line-start # of a # ) closer as an embedded comment, the region would leak past the closer.

An unclosed bracket runs the region to the next closer, which is the standard hazard of every hand-written grammar with multi-line constructs; the parser stays the authority on validity.
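The continuation behavior can be approximated outside TextMate with a small line classifier that carries a bracket depth from one line to the next. This is only a sketch of the intended outcome, not of the generated grammar: `classify` and `LineKind` are hypothetical names, rich starters are assumed to begin with `@`, and bracket pairs are counted without checking that closers match their openers.

```typescript
type LineKind = 'rich' | 'plain' | 'continuation' | 'code';

// Classify each line and track how many continuation brackets are still open.
// 'plain' means the line carries a comment with no structured body.
function classify(lines: string[]): LineKind[] {
  let depth = 0;
  return lines.map((line): LineKind => {
    const hash = line.indexOf('#');
    let kind: LineKind;
    if (depth > 0 && /^\s*#/.test(line)) {
      kind = 'continuation'; // line-start introducer inside an open construct
    } else if (hash >= 0 && /^#[ \t]*@/.test(line.slice(hash))) {
      kind = 'rich'; // introducer followed by an assumed @-style rich starter
    } else {
      return hash >= 0 ? 'plain' : 'code';
    }
    // Scan the body after the introducer for brackets.
    for (let i = hash + 1; i < line.length; i++) {
      const ch = line.charAt(i);
      if (ch === '#') break; // embedded comment: the rest of the line dims
      if ('([{'.indexOf(ch) >= 0) depth++;
      else if (')]}'.indexOf(ch) >= 0 && depth > 0) depth--;
    }
    return kind;
  });
}

const kinds = classify([
  '# @import(',
  '#   ./.env.shared,',
  '#   pick=[',
  '#     KEY1, # note',
  '#   ],',
  '# )',
  '# plain again',
  'KEY=1',
]);
```

For the sample above, the first line opens the construct, the next five are continuation lines (the `# note` aside never touches the depth), and once the closing `)` brings the depth back to zero the following comment is classified as plain again.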

Contextual token scopes (contextualScopes) + comment markup

Two more fidelity features hand-written grammars have that flat generation lacked:

contextualScopes (grammar-level, highlight-only) — token T carries scope S within rule R (T's immediate enclosing rule):

contextualScopes: [
  { token: ASSIGN_KEY, within: [FunctionArgKeyValue], scope: 'entity.other.attribute-name' },
]

The declaration names rules, so each generator consumes it at its own fidelity: tree-sitter emits exact (rule (token) @capture) queries (last, since highlight resolution is last-wins); gen-tm approximates rule context with derived construct regions — a call-argument region (callee + (), nested bracket regions inside, emitted only when contextual scopes are declared) plus the continuation-bracket interiors — where the overrides are tried before the token's flat rule. Top-level occurrences keep the declared scope. Motivating case: env-spec option keys (fn(retry=3), @import(pick=…)) styling as attribute names, distinct from top-level env var keys.
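The semantics of a contextualScopes entry can be stated as a lookup: given a token occurrence and its immediate enclosing rule, return the override scope if a declaration covers that rule, otherwise the token's own scope. The sketch below uses plain strings for token and rule names, and the flat scope `variable.other.env-key` is made up for the example; neither is taken from the actual grammar API.

```typescript
// Hypothetical string-keyed mirror of a contextualScopes entry.
interface ContextualScope {
  token: string;    // token name
  within: string[]; // names of immediately enclosing rules
  scope: string;    // scope to apply inside those rules
}

// Resolve the scope for one token occurrence. `enclosingRule` is null at top level.
function resolveScope(
  token: string,
  enclosingRule: string | null,
  flatScopes: Record<string, string>,
  contextual: ContextualScope[],
): string | undefined {
  if (enclosingRule !== null) {
    for (const c of contextual) {
      if (c.token === token && c.within.indexOf(enclosingRule) >= 0) return c.scope;
    }
  }
  return flatScopes[token]; // no override applies: keep the declared scope
}

// Assumed flat scope for the key token; the real grammar may use another one.
const flatScopes: Record<string, string> = { ASSIGN_KEY: 'variable.other.env-key' };
const contextual: ContextualScope[] = [
  { token: 'ASSIGN_KEY', within: ['FunctionArgKeyValue'], scope: 'entity.other.attribute-name' },
];

const insideCall = resolveScope('ASSIGN_KEY', 'FunctionArgKeyValue', flatScopes, contextual);
const atTopLevel = resolveScope('ASSIGN_KEY', null, flatScopes, contextual);
```

Here `insideCall` resolves to `entity.other.attribute-name` and `atTopLevel` keeps the flat scope. A tree-sitter query can express this lookup exactly; the TextMate generator can only approximate the enclosing rule through regions, which is why the two generators differ in fidelity.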

lineComment.markup — declared doc-markup patterns (token-pattern IR, e.g. **bold**/__italic__) highlighted inside plain comment bodies.
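To make the markup behavior concrete, here is a minimal sketch that splits a plain comment body into spans. The two patterns follow the `**bold**` and `__italic__` examples above; the `markup.bold` and `markup.italic` scope names are the conventional TextMate ones and are assumed here, since the scopes the grammar actually assigns are not stated. In a real grammar these scopes would stack on top of the comment scope rather than replace it.

```typescript
interface Span { text: string; scope: string }

// Split a plain comment body into prose spans and markup spans.
function markupSpans(body: string): Span[] {
  // Group 1: **bold**, group 2: __italic__ (delimiters taken from the examples).
  const re = /(\*\*[^*\n]+\*\*)|(__[^_\n]+__)/g;
  const spans: Span[] = [];
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    if (m.index > last) {
      spans.push({ text: body.slice(last, m.index), scope: 'comment.line' });
    }
    spans.push({ text: m[0], scope: m[1] !== undefined ? 'markup.bold' : 'markup.italic' });
    last = m.index + m[0].length;
  }
  if (last < body.length) {
    spans.push({ text: body.slice(last), scope: 'comment.line' });
  }
  return spans;
}

const spans = markupSpans(' see **docs** and __notes__ first');
```

The sample body yields five spans: three prose spans under the comment scope, with `**docs**` and `__notes__` picked out between them.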

Blast radius

  • Grammars without the metadata are byte-identical: npm run gen produces no diffs for the six built-in grammars.
  • npm run check: 41/41 gates pass. This PR also registers test/env-spec-regressions.ts as a check gate (it was previously manual-only).
  • The new tests are a behavioral spec, not implementation contracts: a real document is tokenized with vscode-textmate and the assertions pin the rendered outcome — the same key token painting three ways (env key / call-arg attribute / decorator-construct attribute), comment prose dimming while markup and decorators stay highlighted, multi-line constructs surviving #-prefixed continuation lines with embedded comments, state reverting after the closer, and a call-opened construct ending at its # ) closer without swallowing the following config item. Highlight-only-ness is proven (the parser CST is byte-identical with and without the metadata), and the opt-in property is asserted directly. The implementation is disposable; the tests say what any replacement must do.
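An assertion on rendered output, as opposed to grammar shape, might look like the sketch below. It works on tokens shaped like the ones vscode-textmate returns per line (`startIndex`, `endIndex`, `scopes`); the helpers `scopesAt` and `hasScope`, the `source.env-spec` root scope, and the hand-written sample tokens are assumptions for illustration and are not taken from test/env-spec-regressions.ts.

```typescript
// Same field names as a vscode-textmate line token.
interface Token { startIndex: number; endIndex: number; scopes: string[] }

// Return the scope stack covering a given column, or [] if none does.
function scopesAt(tokens: Token[], column: number): string[] {
  for (const t of tokens) {
    if (column >= t.startIndex && column < t.endIndex) return t.scopes;
  }
  return [];
}

// True if any scope at the column equals `prefix` or is nested under it.
const hasScope = (tokens: Token[], column: number, prefix: string): boolean =>
  scopesAt(tokens, column).some((s) => s === prefix || s.indexOf(prefix + '.') === 0);

// Hand-written tokens standing in for the line "# this is a comment".
const sample: Token[] = [
  { startIndex: 0, endIndex: 1, scopes: ['source.env-spec', 'comment.line', 'punctuation.definition.comment'] },
  { startIndex: 1, endIndex: 19, scopes: ['source.env-spec', 'comment.line'] },
];

// The spec reads as: prose dims as comment and never paints as a string.
const proseIsComment = hasScope(sample, 5, 'comment');
const proseIsString = hasScope(sample, 5, 'string');
```

Assertions of this kind survive a refactor of repository keys or include ordering, because they only look at which scopes end up on which columns.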

Downstream validation: dmno-dev/varlock#744 generates its shipped VSCode grammar from this branch — plain comments dim, decorator comments and multi-line constructs stay rich, option keys style as attribute names (also carried into the tree-sitter highlights), and **bold**/__italic__ markup renders in comment text — all verified with vscode-textmate snapshot tests against a fixture corpus.

…dialects

A grammar whose comment token matches only the INTRODUCER (a bare `#`) while
the comment content is parser-tokenized (env-spec decorator comments) renders
badly from flat token rules: the prose after `#` keeps the ordinary text-token
scopes (string.unquoted, constant.numeric, ...), so themes color comments like
code.

New highlight-only token metadata — mirroring `interpolation` — declares the
shape instead:

    token('#', { scope: 'comment.line', lineComment: { richStarters: [DEC_NAME] } })

gen-tm then emits to-end-of-line regions in place of the flat rule:
- a RICH region gated on a lookahead that a declared rich-starter opens the
  body (`# @decorator(...)`) — full token highlighting applies inside via
  $self, on top of the comment base scope, so structured comments keep their
  colors while stray text dims;
- a PLAIN region for every other comment — no inner patterns, so the body
  falls to the comment scope and dims like any comment.

The introducer captures as punctuation.definition.comment. Lexer/parser and
all other generators are untouched; grammars without the metadata are
byte-identical (npm run gen produces no diffs for the six built-in grammars).

Also registers test/env-spec-regressions.ts as a check gate (it was previously
manual-only) and extends it with the new contracts.

theoephraim marked this pull request as draft July 4, 2026 06:11

…ch comments

A bracket left OPEN in a rich comment continues the construct across
consecutive introducer-prefixed lines (env-spec multi-line decorator calls and
literals):

    # @import(
    #   ./.env.shared,
    #   pick=[
    #     KEY1, # note
    #   ],
    # )

Declared via `lineComment.continuationBrackets` (bracket pairs). Each pair
emits a begin/end region nested inside the line-scoped rich region — a
TextMate child region suspends its parent's $ end, so the construct spans
lines while single-line comments still close at EOL. Inside a construct:
a line-start introducer is a CONTINUATION MARKER (comment punctuation, not a
new comment); any other introducer opens an embedded comment that dims to
end-of-line; brackets nest recursively; everything else highlights via $self.

An unclosed bracket runs the region to the next closer — the standard
hand-written-grammar hazard; the parser stays the authority on validity.

Opt-in as before: `npm run gen` produces no diffs for the built-in grammars;
41/41 gates pass (env-spec regression contracts extended to 18).

Two more generic highlight-fidelity features hand-written grammars have that
generated ones lacked:

1. `contextualScopes` — grammar-level, highlight-only: token T carries scope S
   when it appears within rule R (T's immediate enclosing rule):

       contextualScopes: [
         { token: ASSIGN_KEY, within: [FunctionArgKeyValue], scope: 'entity.other.attribute-name' },
       ]

   Each generator consumes the SAME declaration at its own fidelity:
   - tree-sitter: exact `(rule (token) @capture)` queries, emitted last
     (highlight resolution is last-wins) so they override the token's flat
     capture inside the declared rules only;
   - gen-tm: the flat grammar approximates rule context with derived CONSTRUCT
     regions — a call-argument region (callee + `(` … `)`, nested bracket
     regions inside, only emitted when contextual scopes are declared) and the
     lineComment continuation-bracket interiors — where the override rules are
     tried before the token's flat rule; top-level occurrences keep the
     declared scope. The callee capture reuses the grammar's own callee-token
     scope when one is declared.

   Motivating case: env-spec option keys (`fn(retry=3)`, `@import(pick=…)`)
   styling as attribute names, distinct from top-level env var keys.

2. `lineComment.markup` — declared doc-markup patterns (token-pattern IR, e.g.
   `**bold**` / `__italic__`) highlighted inside PLAIN comment bodies.

Opt-in as before: npm run gen produces no diffs for the six built-in grammars;
41/41 gates pass (env-spec regression contracts extended to 25).
…token

The generic identifier fallback can resolve to a placeholder (never-matching)
token in indentation grammars, leaving ctx-call-args dead. Prefer the token
whose pattern is gated on a following '(' (the grammar's own callee), and skip
the region entirely when no usable callee pattern exists.

The previous contracts asserted the generated grammar's internal shape
(repository keys, include ordering) — they break on refactor without saying
what behavior to preserve. Rewritten against RENDERED OUTPUT instead: a real
document is tokenized with vscode-textmate and the assertions read as the
spec, line by line —

- the same key token paints three ways: env key at top level, attribute name
  inside call args (contextualScopes), attribute name inside decorator
  comment constructs
- plain comment prose dims as comment (never as a value string); declared
  markup (**bold**) highlights inside it
- decorator comments keep rich token scopes
- an open bracket continues the construct across #-prefixed lines: content
  keeps token scopes, the line-start # is a continuation marker, an embedded
  '# aside' dims to end-of-line, and after the closer a plain comment dims
  again
- the parser CST is byte-identical with and without the highlight metadata
  (highlight-only, proven not asserted)
- without the metadata, generation is unchanged (opt-in)
- tree-sitter emits the same declaration as exact last-wins queries

The implementation can be thrown away; these tests say what any replacement
must do.

theoephraim changed the title from "gen-tm: line-comment introducer regions for structured-comment dialects" to "Highlight fidelity for structured-comment dialects: lineComment regions, continuation brackets, contextualScopes, comment markup" Jul 4, 2026

… their closer

A callee-anchored call-args region (reached via $self inside a rich comment)
starts matching at the CALLEE — an earlier position than the continuation
bracket's bare '(' — so it hijacked '# @dec=someFn(' constructs with an
interior that treats a line-start '#' as an embedded comment: the '# )' closer
was eaten, the region ran away past the comment block, and the FOLLOWING
config item painted as comment/attribute content.

Emit a continuation-aware call variant inside rich comments and construct
interiors (same marker-aware interior as the paren construct, begin consumes
the callee), tried before $self so it wins the position tie.
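The hijack and the fix both follow from how a TextMate-style matcher chooses among candidate regions: the earliest start position wins, and a tie goes to the pattern listed first. The sketch below models only that selection rule. `pickRegion` and the `continuation-paren` and `continuation-call` names are hypothetical, and the begin regexes are simplified stand-ins for the generated ones.

```typescript
interface Candidate { name: string; begin: RegExp }

// Pick the region whose begin matches earliest; on a tie, the earlier-listed
// candidate wins because a later one must start strictly earlier to replace it.
function pickRegion(text: string, from: number, candidates: Candidate[]): string | null {
  let best: { name: string; index: number } | null = null;
  for (const c of candidates) {
    const m = c.begin.exec(text.slice(from));
    if (m !== null && (best === null || m.index < best.index)) {
      best = { name: c.name, index: m.index };
    }
  }
  return best === null ? null : best.name;
}

// Bare-paren continuation region: begins at '('.
const paren: Candidate = { name: 'continuation-paren', begin: /\(/ };
// Callee-anchored call-args region reached via $self: begins at the callee.
const ctxCallArgs: Candidate = { name: 'ctx-call-args', begin: /\w+\(/ };
// Continuation-aware call variant: same begin position, listed first.
const continuationCall: Candidate = { name: 'continuation-call', begin: /\w+\(/ };

const line = '# @dec=someFn(';
const before = pickRegion(line, 1, [paren, ctxCallArgs]);
const after = pickRegion(line, 1, [continuationCall, paren, ctxCallArgs]);
```

Under these assumptions `before` is `ctx-call-args`, since the callee starts ahead of the bare paren, and `after` is `continuation-call`, since it ties on position and is tried first.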

Spec test added: a call construct in a decorator comment keeps continuation
content rich, ends at its '# )' closer, and the following config item carries
no comment scope.

theoephraim marked this pull request as ready for review July 5, 2026 00:11