Add coauthor-corroborated topic split to predict by atalyaalon · Pull Request #74 · allenai/S2AND

atalyaalon · 2026-06-29T07:43:34Z

Splits over-merged same-name clusters using embedding topics gated by coauthor disjointness. See #75 for a many-way alternative that also handles clusters with >2 distinct identities (e.g. n_harel).

Motivation

Same-name authors who work in different fields, e.g. Erick Matsen IV (computational biology) and his father Frederick Matsen III (orthopedic surgery), get over-merged. Their papers' SPECTER embeddings sit close in absolute cosine terms (most pairs of academic titles score roughly 0.7–0.9 in cosine similarity), so the pairwise model treats them as compatible and average-linkage fuses them. Enabling the SPECTER2 title-only embedding fallback makes this worse: on f_matsen it dropped best-cluster purity from 72% to 49%.

The separating signal is relative — each person's papers form a tight embedding sub-cloud with a distinct centroid — but topic alone is unsafe: real single authors also span topics, and splitting on topic bimodality alone over-fragments them (s2and-mini mean B3 F1 −0.030).

Change

Post-clustering pass (s2and/topic_split.py): a recursive 2-means split on a cluster's embeddings, accepted only when (a) the split is bimodal (silhouette ≥ 0.15) and (b) the two halves have near-disjoint coauthors. Genuine authors carry recurring coauthors across topics; distinct same-name people do not. Affiliation is not used — shared geography tokens cause false overlap. Only ever splits, never merges, so it cannot lower recall on already-separated clusters.
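
A minimal sketch of the gated split described above, not the code in s2and/topic_split.py. Assumptions not taken from this PR: Euclidean distance, a deterministic 2-means start, Jaccard overlap as the "near-disjoint" measure, the 0.1 overlap threshold, the min_size guard, and every function and parameter name. Only the 0.15 silhouette threshold comes from the description.

```python
import math
from statistics import fmean


def two_means(points, iters=20):
    """Plain Lloyd's 2-means with a deterministic start (illustrative only)."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))  # farthest point from c0
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if math.dist(p, c0) <= math.dist(p, c1) else 1 for p in points]
        sides = [[p for p, lab in zip(points, labels) if lab == k] for k in (0, 1)]
        if not sides[0] or not sides[1]:
            break
        c0, c1 = ([fmean(dim) for dim in zip(*side)] for side in sides)
    return labels


def silhouette(points, labels):
    """Mean silhouette; points with no same-side or other-side peer score 0."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        if not same or not other:
            scores.append(0.0)
            continue
        a, b = fmean(same), fmean(other)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return fmean(scores)


def coauthor_overlap(side_a, side_b):
    """Jaccard overlap (assumed measure). Missing coauthor evidence on either
    side is treated as full overlap so it can never justify a split."""
    if not side_a or not side_b:
        return 1.0
    return len(side_a & side_b) / len(side_a | side_b)


def topic_split(papers, min_silhouette=0.15, max_overlap=0.1, min_size=2):
    """papers: list of (paper_id, embedding, coauthor_set). Returns id lists."""
    whole = [[pid for pid, _, _ in papers]]
    if len(papers) < 2 * min_size:
        return whole
    points = [emb for _, emb, _ in papers]
    labels = two_means(points)
    halves = [[p for p, lab in zip(papers, labels) if lab == k] for k in (0, 1)]
    if min(len(h) for h in halves) < min_size:
        return whole
    if silhouette(points, labels) < min_silhouette:  # gate (a): bimodality
        return whole
    names = [set().union(*(co for _, _, co in h)) for h in halves]
    if coauthor_overlap(*names) > max_overlap:       # gate (b): coauthors
        return whole
    # Recurse on each accepted half; this pass only ever splits.
    return [c for h in halves
            for c in topic_split(h, min_silhouette, max_overlap, min_size)]


cloud_a = [("a1", [0.0, 0.0], {"x"}), ("a2", [0.1, 0.0], {"x", "y"}),
           ("a3", [0.0, 0.1], {"y"})]
cloud_b = [("b1", [1.0, 1.0], {"p"}), ("b2", [1.1, 1.0], {"p", "q"}),
           ("b3", [1.0, 1.1], {"q"})]
split = topic_split(cloud_a + cloud_b)

# Same embeddings, but one paper on the second side shares coauthor "x".
shared = [("b1", [1.0, 1.0], {"p", "x"})] + cloud_b[1:]
kept = topic_split(cloud_a + shared)
```

In this toy data both calls see the same two well-separated embedding clouds, so the silhouette gate passes in both. Only the first call splits, because in the second the shared coauthor pushes the overlap above the assumed threshold.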

Wired into the full-block predict() path, default on, configurable via Clusterer.topic_split_* attributes, read through getattr so older pickled bundles fall back to defaults.
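
The getattr fallback matters because unpickling restores an object's saved attributes without calling __init__, so a bundle pickled before this change has no topic_split_* attributes at all. A small sketch of the pattern, with a toy class and hypothetical attribute names (topic_split_enabled and topic_split_min_silhouette are illustrative, not confirmed names from this PR):

```python
# Hypothetical defaults; the real attribute names and values live in the PR.
TOPIC_SPLIT_DEFAULTS = {
    "topic_split_enabled": True,
    "topic_split_min_silhouette": 0.15,
}


class ToyClusterer:
    """Stand-in for a clusterer class; not the S2AND Clusterer."""

    def __init__(self):
        self.topic_split_enabled = True
        self.topic_split_min_silhouette = 0.15


def topic_split_config(clusterer):
    """Read each setting through getattr so missing attributes use defaults."""
    return {name: getattr(clusterer, name, default)
            for name, default in TOPIC_SPLIT_DEFAULTS.items()}


# Mimic an object restored from an older pickle: __init__ is bypassed and
# only the attributes that existed at pickling time are present.
legacy = ToyClusterer.__new__(ToyClusterer)
legacy.__dict__.update({"some_old_attribute": 1})

tuned = ToyClusterer()
tuned.topic_split_enabled = False

legacy_cfg = topic_split_config(legacy)
tuned_cfg = topic_split_config(tuned)
```

Here legacy_cfg falls back to the defaults, while tuned_cfg reflects the attribute that was explicitly overridden.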

Results (s2and-mini, clean integrated predict, OFF vs ON)

dataset	F1 OFF	F1 ON	Δ
arnetminer	0.894	0.889	−0.005
inspire	0.979	0.973	−0.006
kisti	0.958	0.955	−0.003
pubmed	0.940	0.940	0
qian	0.955	0.955	0
zbmath	0.966	0.966	0
mean	0.9487	0.9463	−0.0023
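
The mean row can be re-derived from the six per-dataset rows above. This check uses only the table's own numbers and assumes the mean is an unweighted average over datasets:

```python
f1_off = {"arnetminer": 0.894, "inspire": 0.979, "kisti": 0.958,
          "pubmed": 0.940, "qian": 0.955, "zbmath": 0.966}
f1_on = {"arnetminer": 0.889, "inspire": 0.973, "kisti": 0.955,
         "pubmed": 0.940, "qian": 0.955, "zbmath": 0.966}

mean_off = sum(f1_off.values()) / len(f1_off)
mean_on = sum(f1_on.values()) / len(f1_on)
delta = mean_on - mean_off  # computed from the unrounded means

print(f"{mean_off:.4f} {mean_on:.4f} {delta:+.4f}")
```

The delta is taken between the unrounded means, which is why it shows as −0.0023 even though subtracting the two rounded means would give −0.0024.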

Pathological blocks (best-cluster purity, recall held at 0.99):

block	OFF	ON
f_matsen	0.49	0.70
n_harel	0.03	0.03 (unchanged — see below)

vs #75 (many-way)

block / metric	this (2-way)	#75 (many-way)
f_matsen	0.70	0.71
n_harel	0.03 (unchanged)	0.58
s2and-mini mean B3 F1 Δ	−0.0023	−0.0028

This 2-means + silhouette gate never fires on the n_harel 268-mention blob (10+ identities → top-level silhouette 0.10 < 0.15), so it leaves many-identity over-merges alone. #75 handles them at marginally higher benchmark cost. Pick one.

Scope / notes

Full-block predict() only — not predict_incremental.
tests/test_topic_split.py added.
Unrelated pre-existing failures (rust-extension version, stale v1.2 eps fixture) are not touched here.

Same-name authors in different fields get over-merged because their SPECTER embeddings sit close in absolute cosine, so average-linkage fuses them. This post-clustering pass splits a cluster only when a 2-means embedding split is corroborated by disjoint coauthors between the two sides. It only ever splits, never merges, so recall on already-separated clusters is unaffected. Default on, configurable via Clusterer.topic_split_* attributes. f_matsen best-cluster purity 0.49 -> 0.70 (recall held at 0.99); s2and-mini mean B3 F1 0.9487 -> 0.9463.

atalyaalon requested review from sergeyf and removed request for sergeyf June 29, 2026 07:54

atalyaalon mentioned this pull request Jun 29, 2026

Add many-way coauthor-corroborated topic split to predict #75

Draft

Base automatically changed from v0.50.1 to main June 29, 2026 17:44

atalyaalon wants to merge 1 commit into main from coauthor-topic-split