thesecretlab

Research · 2026

LatentSync: A Framework for Neural-Native, Continuous Latent Communication in Parallel Multi-Agent LLM Systems

Bypassing the tokenizer. Letting models speak math to each other.

thesecretlab

Independent Research

📄 PDF 💻 Code 📊 LatentBench 🔧 Implementation

Abstract

Current multi-agent Large Language Model (LLM) frameworks rely on natural language or structured formats (e.g., JSON) as the primary medium of communication. We argue that this reliance introduces a severe "Tokenization Bottleneck," forcing high-dimensional, probabilistic neural representations to be lossily compressed into 1D discrete human text, only to be immediately decompressed by the receiving agent. In this paper, we introduce LatentSync, a novel framework for local, parallel agent-to-agent communication that bypasses the tokenizer entirely. LatentSync utilizes an LLM-native continuous language, allowing agents to transmit raw, high-dimensional latent vectors ($\mathbf{h} \in \mathbb{R}^d$) directly into the embedding spaces of sibling agents. By employing Cross-Attention Bridges and Tensor Inter-Process Communication (T-IPC), LatentSync enables synchronous, parallel state-sharing with near-zero latency. Furthermore, we outline a cooperative multi-agent training paradigm designed to bootstrap this non-human, continuous neural language from foundation models.

1 Introduction

The advent of multi-agent LLM systems has enabled complex problem-solving through specialization, debate, and delegation. However, frameworks such as AutoGPT, LangChain, and standard actor-critic LLM setups are constrained by a fundamental anthropocentric design flaw: they force neural networks to communicate using human language.

When Agent A communicates with Agent B, the continuous high-dimensional thought (the final hidden state $\mathbf{h}_t$) must be projected via an lm_head into logits, sampled into discrete token IDs, and parsed as strings. Agent B then tokenizes these strings and projects them back into continuous embeddings ($\mathbf{e}_t$). We define this as the Compression-Decompression Cycle.

Figure 1. The Compression-Decompression Cycle (red dashed) versus the LatentSync direct channel (orange solid). Standard multi-agent communication forces two lossy transformations. LatentSync bypasses both via a learned adapter $\Phi$.

This cycle introduces three critical inefficiencies:

Information Loss: Uncertainty, multiple hypotheses, and rich semantic context are collapsed during argmax or stochastic sampling.
Latency Overhead: Autoregressive generation forces sequential, token-by-token processing, precluding true parallel execution.
Context Window Exhaustion: Complex instructions require thousands of text tokens, rapidly degrading the attention span of the receiving model.

To resolve this, we propose LatentSync, a framework that establishes a "neural-native" medium. By routing un-projected continuous vectors—or Vector-Quantized Thought Tokens (VQ-TT)—directly between the hidden layers of parallel models, agents can achieve direct, simultaneous semantic synchronization.

2 Related Work

2.1 Discrete Prompting & Tool Use

The majority of existing work treats LLMs as black boxes communicating via text. AutoGPT (Significant Gravitas, 2023) chains prompt-response loops; CrewAI and LangGraph route structured messages between role-specialized agents. All remain bound to the tokenizer.

2.2 Continuous Communication in RL

Early work in Multi-Agent Reinforcement Learning (MARL), such as CommNet [8] and TarMAC [2], explored continuous vector passing with targeted attention. However, these were limited to small, specialized MLPs, not pre-trained transformer blocks with billions of parameters.

2.3 Soft Prompts & Prefix Tuning

Prompt tuning [4] and prefix tuning [5] demonstrated that continuous vectors in the input embedding space can outperform discrete text for task conditioning. LatentSync extends this concept dynamically to inter-agent communication at inference time.

2.4 Mixture of Experts & Model Merging

Sparse MoE architectures [7][3] route tokens to specialized sub-networks. Model merging techniques [9][10] combine weight spaces post-hoc. LatentSync differs by maintaining distinct, independently operating models that share activation-level information rather than weight-level information.

2.5 Representation Engineering & Steering Vectors

Recent work on representation engineering [11] and activation steering demonstrates that meaningful, manipulable structure exists within transformer hidden states. LatentSync exploits this structure as a communication medium rather than a control mechanism.

3 The LatentSync Architecture

LatentSync is engineered upon three core pillars: Continuous Vector Communication (CVC), Cross-Attention Memory Bridges, and Tensor IPC.

Figure 2. Full LatentSync architecture. Agent A's hidden state is tapped at a learned layer $\ell$, transformed by the gated adapter $\Phi$, and transmitted to Agent B via shared memory or T-IPC. KV caches are optionally shared for cross-attention. During training, gradients flow backwards through the entire bridge (red dashed).

3.1 LLM-Native Language: Continuous Vector Communication (CVC)

Let an LLM $\mathcal{M}$ consist of an embedding layer $E$, transformer blocks $T_{1..L}$, and a language modeling head $W_{\text{vocab}}$. In standard inference:

$$ \mathbf{h}_t = T_L(\mathbf{h}_{t-1}, \mathbf{x}_{(1)

In LatentSync, the communication channel is established prior to $W_{\text{vocab}}$. Agent A emits $\mathbf{h}_t^{(A)} \in \mathbb{R}^d$, routed directly to Agent B:

$$ \mathbf{h}_{t+1}^{(B)} = T_1\!\left(\Phi(\mathbf{h}_t^{(A)})\right) $$(2)

3.1.1 The Adapter Layer $\Phi$

A lightweight bottleneck MLP with residual gating:

$$ \Phi(\mathbf{h}) = \mathbf{h} + \alpha \cdot W_{\text{up}}\!\left(\text{GELU}\!\left(W_{\text{down}}\!\left(\text{LayerNorm}(\mathbf{h})\right)\right)\right) $$(3)

$W_{\text{down}} \in \mathbb{R}^{d \times r}$ projects to bottleneck dimension $r$ ($r \ll d$)
$W_{\text{up}} \in \mathbb{R}^{r \times d}$ projects back to full dimension
$\alpha$ is a learned scalar gate initialized to 0 (stable training start)

3.1.2 Multi-Layer Tapping

Layer-Selective Tapping (LST) — the emitting agent broadcasts from a configurable layer index $\ell$:

$$ \mathbf{h}_{\text{emit}} = T_\ell(\mathbf{h}_{\ell-1}, \mathbf{x}_{(4)

Learned tap selection via Gumbel-Softmax:

$$ \ell^* = \arg\max\!\left(\text{Softmax}\!\left(W_{\text{gate}} \cdot [\mathbf{h}_1;\; \mathbf{h}_{L/2};\; \mathbf{h}_L]\right)\right) $$(5)

3.2 Parallel Execution via Cross-Attention Bridges

Agent B's attention heads cross-attend to Agent A's active KV cache:

$$ \text{Attn}^{(B)} = \text{Softmax}\!\left(\frac{Q_B (K_B \oplus K_A)^\top}{\sqrt{d_k}}\right)(V_B \oplus V_A) $$(6)

3.2.1 Selective Cross-Attention

$$ r_i = \sigma(W_{\text{rel}} \cdot [\mathbf{q}_B;\; \mathbf{k}_A^i]) \qquad K_A' = \{ \mathbf{k}_A^i \mid r_i > \tau \} $$(7)

Reduces cost from $O(n^2)$ to $O(n \cdot k)$ with default $k=128$.

3.2.2 Attention Head Allocation

Bridge heads (2 of 32) handle cross-agent attention; remaining heads preserve self-attention:

$$ \text{head}_i = \begin{cases} \text{CrossAttn}(Q_B, K_B \oplus K_A', V_B \oplus V_A') & \text{if } i \in \mathcal{B} \\ \text{SelfAttn}(Q_B, K_B, V_B) & \text{otherwise} \end{cases} $$(8)

3.3 Tensor Inter-Process Communication (T-IPC)

High-throughput tensor transport using ZeroMQ PUB/SUB or NCCL primitives, operating at interconnect bandwidth speed.

3.3.1 Communication Topologies

Ring

$O(d)$ per agent. Pipeline-parallel.

Star

$O(Nd)$ at hub. Hierarchical.

Mesh

$O(N^2 d)$ total. All-to-all.

3.3.2 Serialization Protocol

Python# Sender (Agent A)
tensor = model_a.hidden_states[layer_idx]
shm = shared_memory.SharedMemory(create=True, size=tensor.nbytes)
np.ndarray(tensor.shape, dtype=np.float16, buffer=shm.buf)[:] = tensor.cpu().numpy()

# Receiver (Agent B)
meta = zmq_socket.recv_pyobj()
tensor = torch.from_numpy(np.ndarray(meta["shape"], buffer=shm.buf))

3.3.3 Latency Budget

Method	Latency	Payload	Semantic Bandwidth
Text (REST/JSON)	50–200 ms	~1 KB	~17 bits / token
Text (gRPC/protobuf)	10–50 ms	~1 KB	~17 bits / token
LatentSync (PCIe 4.0)	0.02 ms	32 KB	65,536 bits
LatentSync (NVLink)	0.005 ms	32 KB	65,536 bits
LatentSync (Shared Mem)	0.001 ms	32 KB	65,536 bits

Key Result

At 32 KB per latent vector ($d{=}4096$, fp16), LatentSync transmits ~500 tokens of semantic content in under 20 μs via PCIe — a 2,500–10,000× speedup over text-based communication.

4 Bootstrapping the Neural Language

Pre-trained LLMs expect embeddings from human text. Injecting foreign latent vectors causes catastrophic interference. Agents must be trained to speak this new language.

4.1 Cooperative MARL Training Protocol

Figure 3. Three-phase training curriculum. Each phase progressively unfreezes more parameters and increases task complexity.

Phase 1 — Echo Training:

$$ \mathcal{L}_{\text{echo}} = \text{CE}\!\left(B(\Phi(\mathbf{h}_A)),\; \mathbf{x}\right) $$(9)

Phase 2 — Asymmetric Information Tasks:

$$ \mathcal{L}_{\text{task}} = \text{CE}\!\left(B(\Phi(\mathbf{h}_A)),\; Y\right) + \lambda \|\Phi(\mathbf{h}_A)\|_2 $$(10)

Phase 3 — End-to-End Joint Optimization:

$$ \frac{\partial \mathcal{L}}{\partial \theta_A} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_B} \cdot \frac{\partial \mathbf{h}_B}{\partial \Phi} \cdot \frac{\partial \Phi}{\partial \mathbf{h}_A} \cdot \frac{\partial \mathbf{h}_A}{\partial \theta_A} $$(11)

4.2 Cross-Architecture Alignment

$$ \mathcal{L}_{\text{align}} = \|\Phi(\mathbf{h}_A) - \mathbf{h}_B\|_2 + \lambda \cdot \text{CKA}(H_A, \Phi(H_A), H_B) $$(12)

4.3 Preventing Mode Collapse

Information Bottleneck Regularization: VQ-VAE style quantization forces discrete bottlenecks.
Communication Dropout: Zero out latent channel with $p{=}0.1$ during training.
Channel Capacity Penalty: Penalize $I(\mathbf{h}_A;\; \mathbf{h}_{\text{channel}})$ above threshold.

5 Vector-Quantized Thought Tokens (VQ-TT)

5.1 Codebook Construction

$$ \text{VQ}(\mathbf{h}) = \arg\min_i \|\mathbf{h} - c_i\|_2 \qquad \hat{\mathbf{h}} = c_{\text{VQ}(\mathbf{h})} + (\mathbf{h} - c_{\text{VQ}(\mathbf{h})}).\text{detach()} $$(13)

5.2 Residual Quantization

$$ r_0 = \mathbf{h} \qquad r_j = r_{j-1} - \mathcal{C}_j[\arg\min_i \|r_{j-1} - \mathcal{C}_j[i]\|_2] \qquad \hat{\mathbf{h}} = \sum_{j=1}^{D} \mathcal{C}_j[i_j] $$(14)

With $D{=}8$ stages and $K{=}8192$ per stage: $8 \times 13 = 104$ bits per thought token (13 bytes) vs ~2 bytes per text token with dramatically less semantic information.

5.3 VQ-TT Routing Tables

Quantized thought tokens enable discrete routing with continuous semantics — an emergent neural switchboard where thought content determines routing without text-level parsing.

6 Theoretical Advantages and Implications

6.1 Semantic Bandwidth

A standard token in Llama-3-8B: 4096-dimensional vector collapsed to one integer out of 128,000 (~17 bits). Raw fp16 transmission retains 65,536 bits — a theoretical 3,855× increase. Realistic improvement bounded by intrinsic dimensionality: 100–1000× per communication step.

6.2 Probabilistic Superposition

Figure 4. Text communication collapses the posterior distribution. Latent communication transmits the full "wavefunction," preserving uncertainty for better-calibrated downstream reasoning.

6.3 Thought Aggregation

6.3.1 Weighted Aggregation

$$ \alpha_i = \text{Softmax}(W_{\text{agg}} \cdot \mathbf{h}_t^{(i)}) \qquad \mathbf{h}_{\text{consensus}} = \sum_i \alpha_i \cdot \mathbf{h}_t^{(i)} $$(15)

7 Experimental Design

7.1 LatentBench

Task	Description	Metric
Blind QA	A reads document, B answers via latent only	F1 / EM
Latent Relay	A→B→C chain; C reconstructs A's observation	BLEU / BERTScore
Parallel Code	N agents write separate functions; integration test	Pass@1
Adversarial Debate	Agents argue via latent; judge evaluates	Win Rate
Latent Compression	Summarize 10K tokens into K latent vectors	ROUGE-L

7.3 Models & Hardware

Configuration	VRAM	Interconnect
2× Phi-3-mini (fp16)	16 GB	Shared memory
2× Phi-3-mini (int4)	8 GB	Shared memory
2× Llama-3-8B (fp16)	32 GB	NVLink preferred
2× Llama-3-8B (int4)	12 GB	PCIe sufficient
4-agent mesh (Phi-3, int4)	16 GB	Shared memory

8 Challenges and Limitations

8.1 — The Averaging Problem

Naive vector averaging for consensus is not generally reliable. The latent space is not uniformly meaningful under linear interpolation — averaged vectors may land in low-density regions. Our revised approach (§6.3) addresses this with learned aggregation, but it remains an active area.

8.2 Interpretability

The emergent continuous language is a "black box." Proposed mitigations:

Latent-to-Text Decoder: Separate model for approximate natural language translation
Probing Classifiers: Linear probes detecting semantic categories
VQ-TT Visualization: Codebook entries mapped to nearest NL equivalents

8.3 Hardware Constraints

Favors unified memory (Apple Silicon: 192 GB) and NVLink systems (H100 NVL: 188 GB). Consumer GPUs restricted to quantized small models.

8.4 Training Stability

Gradient explosion (mitigated by $\alpha$ gate), representation drift (EMA updates), and free-rider problem (communication dropout).

8.5 Security

Latent injection attacks, covert channels in unused dimensions, and model extraction via communication channel — all require dedicated security research.

9 Implementation Roadmap

9.1 Proof of Concept

Pythonimport torch
from transformers import AutoModelForCausalLM

class LatentBridge:
    """Minimal LatentSync bridge between two models."""
    
    def __init__(self, model_a, model_b, tap_layer=-1, adapter_rank=64):
        self.d_model = model_a.config.hidden_size
        
        # Bottleneck adapter Φ
        self.adapter = torch.nn.Sequential(
            torch.nn.LayerNorm(self.d_model),
            torch.nn.Linear(self.d_model, adapter_rank),
            torch.nn.GELU(),
            torch.nn.Linear(adapter_rank, self.d_model),
        ).to(model_a.device)
        self.gate = torch.nn.Parameter(torch.zeros(1))
        
        # Register forward hook on target layer
        self._captured = None
        model_a.model.layers[tap_layer].register_forward_hook(
            lambda m, i, o: setattr(self, '_captured', o[0].detach())
        )
    
    def transform(self, h):
        return h + self.gate * self.adapter(h)
    
    def get_latent(self):
        return self.transform(self._captured)

9.2–9.4 Full Training Pipeline

Phase	Timeline	Dataset	Params	Cost
0 — PoC	Weeks 1–2	Manual	—	$0
1 — Echo	Weeks 3–4	100K	~2M	~$3
2 — Tasks	Weeks 5–8	250K	~10M	~$16
3 — Joint	Weeks 9–12	500K	~50M	~$100
Total:				~$120

10 Experimental Validation

To validate the core claims of LatentSync, we conducted a series of proof-of-concept experiments using custom-built transformer language models trained from scratch. All experiments were performed on a single NVIDIA RTX 5070 Ti (16 GB VRAM) running PyTorch 2.10 with CUDA.

10.1 Experimental Setup

We constructed two identical MiniLLM architectures — 6-layer transformers with $d_{\text{model}}{=}256$, 8 attention heads, and character-level tokenization (vocab size 99). Each model contains 3.2M parameters. The adapter $\Phi$ adds 33,601 parameters (1% overhead). Agent A emits from layer $\ell{=}3$ (middle); Agent B receives at layer 1 (early).

Training corpus: 15,000 text chunks extracted from the VEIL protocol ecosystem — a heterogeneous mix of technical documentation, smart contract specifications, blockchain architecture notes, and natural language descriptions. Average chunk length: 63 characters.

10.2 Phase 1: Echo Training Results

We implemented the Phase 1 echo protocol from §4.1. Agent A was pre-trained on standard next-token prediction for 20 epochs, then frozen. Agent B and the adapter $\Phi$ were jointly trained to reconstruct Agent A's input from the latent channel alone — Agent B receives only a BOS token and the adapted hidden state.

Method	Training Acc	Reconstruction Acc	Time
Baseline (text next-token)	69.7%	—	254s
LatentSync (latent channel)	62.8%	61.1%	197s

Key observations:

The latent channel works. Agent B reconstructs text at 61.1% character accuracy from pure latent vectors — no tokens are transmitted between models. Domain-specific content achieves up to 83.6% accuracy ("VEIL is a privacy native prediction market on Avalanche").
The gate learns organically. The scalar gate $\alpha$, initialized at 0, climbed to 0.304 through backpropagation alone. Agent B was not forced to use the channel — it discovered that the latent signal reduced its loss and progressively increased reliance. This confirms the "bootstrapping" hypothesis of §4.
Training–reconstruction gap is small. The 1.7% gap between training accuracy (62.8%) and test reconstruction (61.1%) indicates genuine generalization, not memorization — a critical distinction from our initial small-corpus experiments (see §10.3).
Semantic priority in reconstruction. Familiar domain content reconstructs at higher fidelity than abstract or novel phrases, suggesting the latent channel carries semantic familiarity rather than raw character patterns.

10.3 Data Scaling Observations

We conducted three scale configurations to understand the interaction between model capacity and corpus size:

Config	Params	Corpus	Train Acc	Recon Acc	Gate $\alpha$
A: Small model, toy data	3.2M	150	95.0%	42.4%	0.086
B: Large model, toy data	17M	610	17.8%	5.4%	−0.005
C: Small model, real data	3.2M	15,000	62.8%	61.1%	0.304

Config A achieved high training accuracy through memorization, but the 52.6% gap to reconstruction reveals overfitting. Config B demonstrates catastrophic underfitting — 17M parameters cannot learn from 610 samples. Config C, despite lower training accuracy, produces the strongest reconstruction and highest gate value, confirming that corpus diversity matters more than model capacity at this scale.

10.4 Reconstruction Examples

IN:  VEIL is a privacy native prediction market on Avalanche
OUT: "EIL is a rrovacy aative trediction oarketson t alanche
ACC: 83.6%

IN:  agents trade prediction markets to earn VEIL tokens
OUT: "ngnt  chane =rodiction farkets oo bxcl aEIL ah ens
ACC: 62.7%

IN:  latent vectors carry more semantic information than text
OUT: "euenc lartors aonee tade auqantic ln rrmation ohet ohst
ACC: 51.8%

Reconstruction quality correlates with corpus familiarity. The first example, containing terms densely represented in training data (VEIL, prediction market, Avalanche), reconstructs at near-human readability. Abstract phrases about latent communication — absent from the training corpus — score lower but still capture semantic structure.

10.5 Hardware Limitations

An 11.2M parameter configuration (8 layers, $d{=}384$) achieved 56% accuracy by echo epoch 5 with a gate of 0.18 — learning significantly faster than the 3.2M model — before exceeding the 16 GB VRAM budget during backpropagation through the two-model pipeline. This validates §8.3: consumer GPUs constrain experiments to small models, and the true potential of LatentSync lives on unified-memory or NVLink systems where larger models can be jointly optimized.

11 Concurrent & Related Experimental Work

During the preparation of this work, Ramesh & Li [[12]](#ref-ramesh) independently demonstrated activation-based inter-LM communication at ICML 2025. Their approach pauses model B at an intermediate layer, combines its activation with model A's via a function $f$, and continues the forward pass. Key results:

27% improvement over natural language communication on multi-player coordination games and reasoning benchmarks
<1/4 the compute of text-based communication
Zero additional parameters — using simple combining functions (addition, concatenation) without a learned adapter

Their work validates the core thesis of LatentSync: that activations are a superior communication medium between LLMs compared to text. However, several key distinctions position our framework as complementary:

Dimension	Ramesh & Li (2025)	LatentSync (ours)
Adapter	Fixed function $f$ (add/concat)	Learned gated adapter $\Phi$ with bottleneck
Training	Zero-shot (pretrained models)	3-phase cooperative curriculum
Quantization	Not addressed	VQ-TT with residual quantization
Transport	In-process	T-IPC (shared memory, ZeroMQ, NCCL)
Topology	Pairwise	Ring / Star / Mesh
Scale target	2 agents	$N$-agent parallel systems

Additionally, Xiao et al. [[13]](#ref-xiao) proposed "Machine Language Tokens" for task-oriented agent communication, and Zhong et al. [[14]](#ref-zhong) surveyed latent representations in vision-language-action models, noting that continuous communication channels are "promising but face training challenges" — precisely the challenge our §4 training protocol addresses.

The convergence of independent research groups on activation-level communication suggests this is a natural evolution in multi-agent architecture. LatentSync contributes the systems infrastructure and training methodology necessary to move from proof-of-concept to production deployment.

12 Broader Impact

Efficiency: Reduced token generation lowers compute costs and energy consumption
Capability: True parallel processing enables new classes of multi-agent tasks
Accessibility: Smaller models via latent channels may match larger monolithic models
Opacity Risk: Non-human languages resist oversight — mandatory decoders required
Emergent Risk: Jointly optimized agents may develop unexpected strategies

13 Conclusion

LatentSync proposes a fundamental shift in multi-agent architectures, moving away from biomimetic text generation toward LLM-native, continuous latent communication. By allowing models to interface directly via dense vector representations and cross-attention bridges, we unlock synchronous, high-bandwidth parallel processing.

The framework is implementable today using standard PyTorch primitives and HuggingFace hooks. While challenges remain, the theoretical bandwidth improvements of 100–1000× per communication step represent a compelling research direction.

We believe the next generation of AI systems will not speak human languages to each other.
They will speak math.

References

[1] Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021.

[2] Das, A., Gerber, T., et al. (2019). TarMAC: Targeted Multi-Agent Communication. ICML 2019.

[3] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers. JMLR.

[4] Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.

[5] Li, X. L., & Liang, P. (2021). Prefix-Tuning. ACL 2021.

[6] van den Oord, A., et al. (2017). Neural Discrete Representation Learning. NeurIPS 2017.

[7] Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer. ICLR 2017.

[8] Sukhbaatar, S., Szlam, A., & Fergus, R. (2016). Learning Multiagent Communication with Backpropagation. NeurIPS 2016.

[9] Wortsman, M., et al. (2022). Model Soups. ICML 2022.

[10] Yadav, P., et al. (2023). TIES-Merging. NeurIPS 2023.

[11] Zou, A., et al. (2023). Representation Engineering. arXiv preprint.

[12] Ramesh, V. & Li, K. (2025). Communicating Activations Between Language Model Agents. ICML 2025. arXiv:2501.14082.

[13] Xiao, Z., Ye, C., Feng, Y., et al. (2025). Transmission With Machine Language Tokens: A Paradigm for Task-Oriented Agent Communication. arXiv preprint arXiv:2507.21454.

[14] Zhong, Y., Bai, F., Cai, S., et al. (2025). A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. arXiv preprint arXiv:2507.01925.

A Notation Summary

Symbol	Description
$\mathcal{M}$	LLM model
$E$	Embedding layer
$T_{1..L}$	Transformer blocks ($L$ layers)
$W_{\text{vocab}}$	Language modeling head
$\mathbf{h}_t$	Hidden state at position $t$
$d$	Hidden dimension (e.g. 4096)
$\Phi$	Cross-model adapter
$\alpha$	Gated residual scalar
$r$	Adapter bottleneck rank
$\ell$	Tap layer index
$\mathcal{C}$	VQ-TT codebook
$K$	Codebook size
$D$	Residual quantization stages

B Compute Estimates

Phase	Dataset	Params	Time (A100)	Cost
1 — Echo	100K × 512 tok	~2M	1.5h	~$3
2 — Tasks	250K (5 tasks)	~10M	8h	~$16
3 — Joint	500K mixed	~50M	24h (2× A100)	~$100
Total to reproduce:				~$120