Training-Free Contextual Steering of Diffusion Language Models via Dual-Pass Embedding Manifold Divergence
Exobody Systems Inc. · Morrison, Colorado
We propose a method for maintaining contextual grounding in diffusion-based text generation without additional fine-tuning. The mechanism exploits a mathematical property of well-trained transformer embedding spaces: the Jacobian of the input-to-representation mapping is sufficiently smooth that representational structure learned during autoregressive pretraining remains valid under non-autoregressive generation topologies. A dual-pass embedding comparison, computing a candidate response's representation both in isolation and conditioned on conversational context, yields a scalar divergence signal that steers diffusion denoising toward contextually relevant outputs. We formalize why this works without retraining: the neuron-to-manifold type coercion inherent in transformer architectures preserves semantic continuity when Jacobians are smooth, allowing the pretrained model to serve as its own zero-shot relevance critic during parallel generation.
Diffusion-based text generation offers parallel token production, global structural coherence, and iterative refinement. Removing sequential token prediction, however, eliminates the implicit planning mechanism autoregressive models exploit: each token chosen in light of all preceding tokens. The result is semantic drift. Diffusion-generated responses gravitate toward generic, context-insensitive outputs because the generation process lacks an intrinsic tether to the conversational context that prompted them.
Existing remedies (fine-tuning diffusion models on dialogue data, or adapting classifier-free guidance from image diffusion) require additional training. We argue this is unnecessary.
The representational quality of a pretrained transformer, specifically its embedding manifold geometry, is a property of its weights, not its generation procedure. Autoregressive training optimizes for next-token prediction, but the side effect is a densely structured embedding space where semantic similarity, topical relevance, and contextual dependency are encoded as geometric relationships: distances, angles, curvatures.
This structure survives changes in generation topology because the Jacobian of the mapping from token sequences to manifold positions is smooth. Let $\mathbf{f}_\theta : \mathbf{T}^* \to \mathbb{R}^d$ map a token sequence to its embedding representation under frozen parameters $\theta$. If the Jacobian
$$\mathbf{J}_{\mathbf{f}_\theta}(\mathbf{x}) = \frac{\partial \mathbf{f}_\theta(\mathbf{x})}{\partial \mathbf{x}}$$
has bounded spectral norm and condition number across the relevant input domain, then small perturbations in the input produce proportionally small perturbations in the representation. The manifold geometry, and any signal derived from it, remains reliable even when the input was produced by a diffusion process rather than by autoregressive sampling.
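The boundedness condition can be probed numerically. A minimal sketch, using a toy smooth map as a stand-in for $\mathbf{f}_\theta$ (an assumption; the real map would be a frozen transformer's sequence-to-vector encoder) and central differences to estimate the local Jacobian's extreme singular values:

```python
import numpy as np

def finite_diff_jacobian(f, x, eps=1e-5):
    """Estimate the Jacobian J_f(x) column by column via central differences."""
    d_in, d_out = x.shape[0], f(x).shape[0]
    J = np.zeros((d_out, d_in))
    for i in range(d_in):
        dx = np.zeros(d_in)
        dx[i] = eps
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# Toy smooth map standing in for the embedding map f_theta.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 12)) / np.sqrt(12)
f = lambda x: np.tanh(W @ x)

x = rng.standard_normal(12)
J = finite_diff_jacobian(f, x)
sigmas = np.linalg.svd(J, compute_uv=False)
spectral_norm = sigmas[0]                   # bounded: no exploding directions
condition_number = sigmas[0] / sigmas[-1]   # finite: no collapsed directions
```

Sampling these two quantities at many points of the input domain is one way to check the "relevant input domain" qualifier empirically.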
At each diffusion denoising step $t$, given candidate response tokens $\mathbf{r}_t$ and conversational context $\mathbf{c}$: (i) compute the isolated representation $\mathbf{f}_\theta(\mathbf{r}_t)$; (ii) compute the context-conditioned representation $\mathbf{f}_\theta(\mathbf{c} \oplus \mathbf{r}_t)$; (iii) take a scalar divergence between the two, such as cosine distance; (iv) if the divergence exceeds a threshold $\tau$, nudge the denoising update toward the context-conditioned representation.
Threshold $\tau$ and nudge magnitude are hyperparameters. We propose adaptive scheduling tied to diffusion timestep: looser early in denoising when the response is still mostly noise, tightening as the output crystallizes and semantic identity stabilizes.
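The per-step check and the adaptive schedule can be sketched end to end. A toy implementation, assuming a cosine-distance divergence, a linear threshold schedule, and a simple interpolation nudge (all assumptions; `embed` is a random-projection stand-in for the frozen $\mathbf{f}_\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random-projection stand-in for the frozen embedding map f_theta
# (a real system would pool the pretrained transformer's hidden states).
D_TOKEN, D_EMBED = 32, 16
W = rng.standard_normal((D_TOKEN, D_EMBED)) / np.sqrt(D_TOKEN)

def embed(token_vecs):
    """Map a (seq_len, D_TOKEN) array to a point on the manifold M."""
    return np.tanh(token_vecs.mean(axis=0) @ W)

def divergence(response, context):
    """Dual pass: embed the response alone, then appended to the context;
    return the cosine distance between the two manifold positions."""
    e_iso = embed(response)
    e_ctx = embed(np.concatenate([context, response], axis=0))
    cos = e_iso @ e_ctx / (np.linalg.norm(e_iso) * np.linalg.norm(e_ctx))
    return 1.0 - cos

def tau_schedule(t, T, tau_min=0.05, tau_max=0.5):
    """Adaptive threshold: loose early in denoising (t near T, mostly
    noise), tightening as t -> 0 and the output crystallizes."""
    return tau_min + (tau_max - tau_min) * (t / T)

def steer(response, context, t, T, eta=0.1):
    """If the divergence exceeds the scheduled threshold, nudge the
    candidate tokens toward the context (illustrative nudge rule)."""
    d = divergence(response, context)
    if d > tau_schedule(t, T):
        response = (1 - eta) * response + eta * context.mean(axis=0)
    return response, d
```

The nudge rule and the linear schedule are placeholders; in a real denoiser the nudge would act on the diffusion update itself rather than on raw token vectors.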
Individual neuron activations in a transformer are scalars, elements of $\mathbb{R}$. A hidden state is a vector in $\mathbb{R}^d$. Semantically meaningful representations do not fill $\mathbb{R}^d$ uniformly; they concentrate on a lower-dimensional manifold $\mathcal{M} \subset \mathbb{R}^d$ whose shape was sculpted by training data statistics. The mapping from token sequences to points on $\mathcal{M}$ is a type coercion: discrete symbolic objects (tokens) are cast to continuous manifold coordinates (embeddings).
For this coercion to preserve semantic content, for nearby meanings to map to nearby points on $\mathcal{M}$, the Jacobian of the mapping must be smooth: bounded, non-degenerate, slowly varying. Autoregressive training with gradient descent inherently produces this. Exploding or collapsing Jacobians would surface as training instability (gradient explosion, vanishing gradients), so stable convergence is indirect evidence of Jacobian regularity.
The embedding manifold geometry and its Jacobian smoothness are properties of the frozen weights $\theta$, not of how tokens were sampled. They remain valid when generation topology changes from autoregressive to diffusion. The dual-pass divergence signal is a geometric probe of $\mathcal{M}$. It asks: "does the conversation shift where this response sits on $\mathcal{M}$?" That question is well-posed regardless of how the response was produced.
Our contributions: a training-free method for contextual coherence in diffusion text generation; a formal grounding of the method in the Jacobian smoothness of pretrained embedding manifolds; empirical evidence that autoregressive representational quality transfers zero-shot to non-autoregressive generation steering; and a reframe that may be the most consequential piece: representational quality is a property of the weights, not the generation procedure, so a pretrained model can serve as its own zero-shot relevance critic under any generation topology.
The Jacobian smoothness assumption may not hold uniformly across all manifold regions. Adversarial or far-out-of-distribution inputs could probe zones where the Jacobian degenerates. Characterizing the boundary of the safe region, the manifold subspace where the smoothness guarantee holds with confidence, is necessary follow-up work.
The dual forward pass doubles compute per steering check. Whether sparse checking at every $k$-th step preserves steering quality is an open empirical question.
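The compute arithmetic behind that question is simple. A sketch of the overhead accounting (the step count `T=64` and stride `check_every=4` are illustrative, not values from the paper):

```python
def steering_passes(T: int, check_every: int) -> int:
    """Extra forward passes the dual-pass check adds over T denoising
    steps when the check runs only at every check_every-th step."""
    return 2 * (T // check_every)

dense = steering_passes(T=64, check_every=1)   # check at every step
sparse = steering_passes(T=64, check_every=4)  # sparse checking: 4x cheaper
```

Whether the 4x saving costs steering quality is exactly the open empirical question above.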
The method inherits whatever biases exist in the pretrained model's embedding space. It creates no new representational capacity; it repurposes existing geometry. If the pretrained model has blind spots, splat-diffusion will share them.
Long-range coherence across multi-paragraph or multi-turn outputs may require more than a single scalar signal. Extensions to vector-valued or trajectory-valued steering, tracking not just "how different" but "different in which direction on $\mathcal{M}$," are a natural next step.
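One possible shape for that extension: keep the full difference vector between the two embeddings rather than collapsing it to a scalar. A minimal sketch, again using a random-projection stand-in for the frozen $\mathbf{f}_\theta$ (an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((32, 16)) / np.sqrt(32)

def embed(token_vecs):
    # Stand-in for the frozen embedding map f_theta.
    return np.tanh(token_vecs.mean(axis=0) @ W)

def steering_signal(response, context):
    """Vector-valued divergence: the direction on M in which the context
    pulls the response, plus the scalar magnitude the base method uses."""
    e_iso = embed(response)
    e_ctx = embed(np.concatenate([context, response], axis=0))
    direction = e_ctx - e_iso
    return direction, np.linalg.norm(direction)
```

The scalar method keeps only the magnitude; the direction is the extra information a trajectory-valued steering rule could exploit.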