thesecretlab
Research · 2026

LatentSync: A Framework for Neural-Native, Continuous Latent Communication in Parallel Multi-Agent LLM Systems

Bypassing the tokenizer. Letting models speak math to each other.

thesecretlab
Independent Research
Abstract

Current multi-agent Large Language Model (LLM) frameworks rely on natural language or structured formats (e.g., JSON) as the primary medium of communication. We argue that this reliance introduces a severe "Tokenization Bottleneck," forcing high-dimensional, probabilistic neural representations to be lossily compressed into 1D discrete human text, only to be immediately decompressed by the receiving agent. In this paper, we introduce LatentSync, a novel framework for local, parallel agent-to-agent communication that bypasses the tokenizer entirely. LatentSync utilizes an LLM-native continuous language, allowing agents to transmit raw, high-dimensional latent vectors ($\mathbf{h} \in \mathbb{R}^d$) directly into the embedding spaces of sibling agents. By employing Cross-Attention Bridges and Tensor Inter-Process Communication (T-IPC), LatentSync enables synchronous, parallel state-sharing with near-zero latency. Furthermore, we outline a cooperative multi-agent training paradigm designed to bootstrap this non-human, continuous neural language from foundation models.

1 Introduction

The advent of multi-agent LLM systems has enabled complex problem-solving through specialization, debate, and delegation. However, frameworks such as AutoGPT, LangChain, and standard actor-critic LLM setups are constrained by a fundamental anthropocentric design flaw: they force neural networks to communicate using human language.

When Agent A communicates with Agent B, the continuous high-dimensional thought (the final hidden state $\mathbf{h}_t$) must be projected through the lm_head into logits, sampled into discrete token IDs, and decoded into strings. Agent B then tokenizes these strings and projects them back into continuous embeddings ($\mathbf{e}_t$). We define this as the Compression-Decompression Cycle.
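
The cycle is easy to make concrete. The sketch below is illustrative only: lm_head, embed_tokens, and the Llama-3-class dimensions are stand-ins for any decoder-only LLM pair, not the framework's actual modules.

```python
import math
import torch

d_model, vocab_size = 4096, 128_000                            # illustrative Llama-3-class sizes
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)     # stand-in for Agent A's output head
embed_tokens = torch.nn.Embedding(vocab_size, d_model)         # stand-in for Agent B's input embedding

h_t = torch.randn(d_model)                                     # Agent A's final hidden state (the "thought")

# Compression: 4096 floats -> 1 integer (~17 bits); alternatives and uncertainty are discarded
token_id = torch.argmax(lm_head(h_t)).unsqueeze(0)

# Decompression: Agent B re-inflates the integer into its own, unrelated embedding space
e_t = embed_tokens(token_id)

print(f"{h_t.numel() * 16} bits emitted -> ~{math.log2(vocab_size):.0f} bits transmitted")
```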

[Figure 1 diagram: standard path Agent A → h_t ∈ ℝ⁴⁰⁹⁶ → lm_head → argmax → "text" → tokenizer → embed → Agent B, with two lossy steps, versus the direct LatentSync path Φ(h_t).]
Figure 1. The Compression-Decompression Cycle (red dashed) versus the LatentSync direct channel (orange solid). Standard multi-agent communication forces two lossy transformations. LatentSync bypasses both via a learned adapter $\Phi$.

This cycle introduces three critical inefficiencies:

  1. Information Loss: Uncertainty, multiple hypotheses, and rich semantic context are collapsed during argmax or stochastic sampling.
  2. Latency Overhead: Autoregressive generation forces sequential, token-by-token processing, precluding true parallel execution.
  3. Context Window Exhaustion: Complex instructions require thousands of text tokens, rapidly degrading the attention span of the receiving model.

To resolve this, we propose LatentSync, a framework that establishes a "neural-native" medium. By routing un-projected continuous vectors—or Vector-Quantized Thought Tokens (VQ-TT)—directly between the hidden layers of parallel models, agents can achieve direct, simultaneous semantic synchronization.

2 Related Work

2.1 Discrete Prompting & Tool Use

The majority of existing work treats LLMs as black boxes communicating via text. AutoGPT (Significant Gravitas, 2023) chains prompt-response loops; CrewAI and LangGraph route structured messages between role-specialized agents. All remain bound to the tokenizer.

2.2 Continuous Communication in RL

Early work in Multi-Agent Reinforcement Learning (MARL), such as CommNet [8] and TarMAC [2], explored continuous vector passing with targeted attention. However, these were limited to small, specialized MLPs, not pre-trained transformer blocks with billions of parameters.

2.3 Soft Prompts & Prefix Tuning

Prompt tuning [4] and prefix tuning [5] demonstrated that continuous vectors in the input embedding space can outperform discrete text for task conditioning. LatentSync extends this concept dynamically to inter-agent communication at inference time.

2.4 Mixture of Experts & Model Merging

Sparse MoE architectures [7][3] route tokens to specialized sub-networks. Model merging techniques [9][10] combine weight spaces post-hoc. LatentSync differs by maintaining distinct, independently operating models that share activation-level information rather than weight-level information.

2.5 Representation Engineering & Steering Vectors

Recent work on representation engineering [11] and activation steering demonstrates that meaningful, manipulable structure exists within transformer hidden states. LatentSync exploits this structure as a communication medium rather than a control mechanism.

3 The LatentSync Architecture

LatentSync is engineered upon three core pillars: Continuous Vector Communication (CVC), Cross-Attention Memory Bridges, and Tensor IPC.

[Figure 2 diagram: Agent A (emitter) taps hidden state h_ℓ^(A) at a selectable layer ℓ, bypassing W_vocab; the gated adapter Φ (LayerNorm → W_down → GELU → W_up, scaled by α) transforms it; T-IPC (shared memory, ZeroMQ / NCCL) carries it to Agent B (receiver), where it is injected as inputs_embeds and bridge heads optionally cross-attend over the shared KV caches K_B ⊕ K_A, V_B ⊕ V_A; during training, gradients flow back through the entire bridge.]
Figure 2. Full LatentSync architecture. Agent A's hidden state is tapped at a learned layer $\ell$, transformed by the gated adapter $\Phi$, and transmitted to Agent B via shared memory or T-IPC. KV caches are optionally shared for cross-attention. During training, gradients flow backwards through the entire bridge (red dashed).

3.1 LLM-Native Language: Continuous Vector Communication (CVC)

Let an LLM $\mathcal{M}$ consist of an embedding layer $E$, transformer blocks $T_{1..L}$, and a language modeling head $W_{\text{vocab}}$. In standard inference:

$$ \mathbf{h}_t = T_L(\mathbf{h}_{t-1}, \mathbf{x}_{\le t}), \qquad y_t = \text{sample}\!\left(\text{Softmax}(W_{\text{vocab}}\,\mathbf{h}_t)\right) $$(1)

In LatentSync, the communication channel is established prior to $W_{\text{vocab}}$. Agent A emits $\mathbf{h}_t^{(A)} \in \mathbb{R}^d$, routed directly to Agent B:

$$ \mathbf{h}_{t+1}^{(B)} = T_1\!\left(\Phi(\mathbf{h}_t^{(A)})\right) $$(2)

3.1.1   The Adapter Layer $\Phi$

A lightweight bottleneck MLP with residual gating:

$$ \Phi(\mathbf{h}) = \mathbf{h} + \alpha \cdot W_{\text{up}}\!\left(\text{GELU}\!\left(W_{\text{down}}\!\left(\text{LayerNorm}(\mathbf{h})\right)\right)\right) $$(3)
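
Equation (3) maps directly onto a small PyTorch module. The class name, default dimensions, and zero-initialized gate below are illustrative choices consistent with the proof-of-concept in §9.1, not a prescribed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAdapter(nn.Module):
    """Gated bottleneck adapter Φ from Eq. (3): h + α · W_up(GELU(W_down(LayerNorm(h))))."""

    def __init__(self, d_model: int = 4096, rank: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, rank)
        self.up = nn.Linear(rank, d_model)
        # α starts at 0 so the bridge is initially a no-op, which stabilizes early training.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.alpha * self.up(F.gelu(self.down(self.norm(h))))
```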

3.1.2   Multi-Layer Tapping

Layer-Selective Tapping (LST) — the emitting agent broadcasts from a configurable layer index $\ell$:

$$ \mathbf{h}_{\text{emit}} = T_\ell(\mathbf{h}_{\ell-1}, \mathbf{x}_{\le t}), \qquad \ell \in \{1, \dots, L\} $$(4)

Learned tap selection via Gumbel-Softmax:

$$ \ell^* = \arg\max\!\left(\text{Softmax}\!\left(W_{\text{gate}} \cdot [\mathbf{h}_1;\; \mathbf{h}_{L/2};\; \mathbf{h}_L]\right)\right) $$(5)
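
A sketch of learned tap selection under Eq. (5). The candidate set $\{\mathbf{h}_1, \mathbf{h}_{L/2}, \mathbf{h}_L\}$, the straight-through Gumbel-Softmax during training, and the hard argmax at inference are assumptions layered on the equation; TapSelector is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

class TapSelector(torch.nn.Module):
    """Score candidate tap layers and pick one (Eq. 5): hard selection in the forward
    pass, differentiable via Gumbel-Softmax in the backward pass during training."""

    def __init__(self, d_model: int, n_candidates: int = 3, tau: float = 1.0):
        super().__init__()
        self.gate = torch.nn.Linear(d_model * n_candidates, n_candidates)  # W_gate
        self.tau = tau

    def forward(self, candidates: list[torch.Tensor]) -> torch.Tensor:
        # candidates: hidden states from the candidate layers, each (batch, d_model)
        logits = self.gate(torch.cat(candidates, dim=-1))                  # (batch, n_candidates)
        if self.training:
            weights = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        else:
            weights = F.one_hot(logits.argmax(-1), logits.size(-1)).float()
        stacked = torch.stack(candidates, dim=1)                           # (batch, n_cand, d_model)
        return torch.einsum("bc,bcd->bd", weights, stacked)                # emitted h
```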

3.2 Parallel Execution via Cross-Attention Bridges

Agent B's attention heads cross-attend to Agent A's active KV cache:

$$ \text{Attn}^{(B)} = \text{Softmax}\!\left(\frac{Q_B (K_B \oplus K_A)^\top}{\sqrt{d_k}}\right)(V_B \oplus V_A) $$(6)

3.2.1   Selective Cross-Attention

$$ r_i = \sigma(W_{\text{rel}} \cdot [\mathbf{q}_B;\; \mathbf{k}_A^i]) \qquad K_A' = \{ \mathbf{k}_A^i \mid r_i > \tau \} $$(7)

Reduces cost from $O(n^2)$ to $O(n \cdot k)$ with default $k=128$.

3.2.2   Attention Head Allocation

Bridge heads (2 of 32) handle cross-agent attention; remaining heads preserve self-attention:

$$ \text{head}_i = \begin{cases} \text{CrossAttn}(Q_B, K_B \oplus K_A', V_B \oplus V_A') & \text{if } i \in \mathcal{B} \\ \text{SelfAttn}(Q_B, K_B, V_B) & \text{otherwise} \end{cases} $$(8)
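
A minimal per-head sketch combining the relevance gate of Eq. (7) with the bridge-head split of Eq. (8). The tensor shapes, the use of B's mean query for scoring, and w_rel (assumed to be a Linear(2*d_k, 1) module) are illustrative assumptions, not the framework's fixed design.

```python
import torch
import torch.nn.functional as F

def bridged_attention(q_b, k_b, v_b, k_a, v_a, bridge_heads, w_rel, top_k=128):
    """Eqs. (6)-(8): bridge heads attend over B's and (selected) A's KV; others self-attend.

    Illustrative shapes: q_b, k_b, v_b are (H, n_b, d_k); k_a, v_a are (H, n_a, d_k)."""
    H, n_b, d_k = q_b.shape
    out = torch.empty_like(q_b)
    for h in range(H):
        if h in bridge_heads:
            # Eq. (7): score A's keys against B's mean query, keep the top-k.
            q_mean = q_b[h].mean(dim=0, keepdim=True).expand(k_a.shape[1], -1)
            rel = torch.sigmoid(w_rel(torch.cat([q_mean, k_a[h]], dim=-1))).squeeze(-1)
            idx = rel.topk(min(top_k, k_a.shape[1])).indices
            k = torch.cat([k_b[h], k_a[h][idx]], dim=0)     # K_B ⊕ K_A'
            v = torch.cat([v_b[h], v_a[h][idx]], dim=0)     # V_B ⊕ V_A'
        else:
            k, v = k_b[h], v_b[h]                           # plain self-attention
        attn = F.softmax(q_b[h] @ k.transpose(0, 1) / d_k ** 0.5, dim=-1)
        out[h] = attn @ v
    return out
```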

3.3 Tensor Inter-Process Communication (T-IPC)

High-throughput tensor transport using ZeroMQ PUB/SUB or NCCL primitives, limited only by interconnect bandwidth.

3.3.1   Communication Topologies

  1. Ring (A₁ → A₂ → A₃ → A₄ → A₁): $O(d)$ traffic per agent; pipeline-parallel.
  2. Star (Hub ↔ A₁, A₂, A₃): $O(Nd)$ traffic at the hub; hierarchical.
  3. Mesh (A₁ … A₄, all-to-all): $O(N^2 d)$ total traffic.

A sketch of the corresponding neighbor maps follows.
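
A hypothetical helper for wiring up these topologies as {sender: [receivers]} neighbor maps. The function name and dictionary representation are assumptions for illustration, not part of the T-IPC protocol itself.

```python
def build_topology(agent_ids, kind="ring", hub=None):
    """Return a {sender: [receivers]} map for the three LatentSync topologies (sketch)."""
    if kind == "ring":                      # O(d) per agent, pipeline-parallel
        return {a: [agent_ids[(i + 1) % len(agent_ids)]] for i, a in enumerate(agent_ids)}
    if kind == "star":                      # O(N·d) at the hub, hierarchical
        hub = hub or agent_ids[0]
        spokes = [a for a in agent_ids if a != hub]
        return {hub: spokes, **{s: [hub] for s in spokes}}
    if kind == "mesh":                      # O(N²·d) total, all-to-all
        return {a: [b for b in agent_ids if b != a] for a in agent_ids}
    raise ValueError(f"unknown topology: {kind}")
```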

3.3.2   Serialization Protocol

# Sender (Agent A)
import numpy as np
import torch
from multiprocessing import shared_memory

tensor = model_a.hidden_states[layer_idx].to(torch.float16)
shm = shared_memory.SharedMemory(create=True, size=tensor.nbytes)
np.ndarray(tensor.shape, dtype=np.float16, buffer=shm.buf)[:] = tensor.cpu().numpy()
zmq_socket.send_pyobj({"name": shm.name, "shape": tuple(tensor.shape)})  # publish metadata

# Receiver (Agent B)
meta = zmq_socket.recv_pyobj()
shm = shared_memory.SharedMemory(name=meta["name"])  # attach to the sender's block by name
tensor = torch.from_numpy(np.ndarray(meta["shape"], dtype=np.float16, buffer=shm.buf))

3.3.3   Latency Budget

| Method | Latency | Payload | Semantic Bandwidth |
|---|---|---|---|
| Text (REST/JSON) | 50–200 ms | ~1 KB | ~17 bits / token |
| Text (gRPC/protobuf) | 10–50 ms | ~1 KB | ~17 bits / token |
| LatentSync (PCIe 4.0) | 0.02 ms | 8 KB | 65,536 bits |
| LatentSync (NVLink) | 0.005 ms | 8 KB | 65,536 bits |
| LatentSync (Shared Mem) | 0.001 ms | 8 KB | 65,536 bits |
Key Result

At 8 KB per latent vector ($d{=}4096$, fp16, 65,536 bits), LatentSync transmits the semantic content of roughly 500 text tokens in under 20 μs via PCIe, a 2,500–10,000× speedup over text-based communication.

4 Bootstrapping the Neural Language

Pre-trained LLMs expect embeddings from human text. Injecting foreign latent vectors causes catastrophic interference. Agents must be trained to speak this new language.

4.1 Cooperative MARL Training Protocol

[Figure 3 diagram: Phase 1 Echo Training (A encodes text; B(Φ(h_A)) reconstructs it; L = CE(B(Φ(h_A)), x); train Φ only) → Phase 2 Asymmetric Tasks (A sees observation O, B must compute Y; L = CE + λR; train Φ, bridge heads, LayerNorms) → Phase 3 Joint Optimization (end-to-end gradients flow across the bridge; train both models' bridge parameters).]
Figure 3. Three-phase training curriculum. Each phase progressively unfreezes more parameters and increases task complexity.

Phase 1 — Echo Training:

$$ \mathcal{L}_{\text{echo}} = \text{CE}\!\left(B(\Phi(\mathbf{h}_A)),\; \mathbf{x}\right) $$(9)
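
A minimal sketch of one Phase 1 step, assuming two HuggingFace-style causal LMs that share a tokenizer and an adapter module Φ as in §3.1.1. For brevity only Φ receives gradients here; the full protocol also updates Agent B's bridge parameters.

```python
import torch
import torch.nn.functional as F

def echo_step(agent_a, agent_b, adapter, input_ids, tap_layer=-1):
    """One echo-training step (Eq. 9): Agent A is frozen; Φ is trained so that
    Agent B can reconstruct A's input text from the latent channel alone."""
    with torch.no_grad():                                   # Agent A is frozen
        out_a = agent_a(input_ids, output_hidden_states=True)
        h_a = out_a.hidden_states[tap_layer]                # (batch, seq, d)

    latent = adapter(h_a)                                   # Φ(h_A)
    out_b = agent_b(inputs_embeds=latent)                   # inject into B's embedding stream
    logits = out_b.logits[:, :-1, :]                        # predict the next character/token

    loss = F.cross_entropy(                                 # CE(B(Φ(h_A)), x)
        logits.reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    loss.backward()
    return loss.item()
```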

Phase 2 — Asymmetric Information Tasks:

$$ \mathcal{L}_{\text{task}} = \text{CE}\!\left(B(\Phi(\mathbf{h}_A)),\; Y\right) + \lambda \|\Phi(\mathbf{h}_A)\|_2 $$(10)

Phase 3 — End-to-End Joint Optimization:

$$ \frac{\partial \mathcal{L}}{\partial \theta_A} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_B} \cdot \frac{\partial \mathbf{h}_B}{\partial \Phi} \cdot \frac{\partial \Phi}{\partial \mathbf{h}_A} \cdot \frac{\partial \mathbf{h}_A}{\partial \theta_A} $$(11)

4.2 Cross-Architecture Alignment

When the two agents are built on different base models, with potentially mismatched hidden dimensions and representational geometry, the adapter is additionally trained with an alignment objective:

$$ \mathcal{L}_{\text{align}} = \|\Phi(\mathbf{h}_A) - \mathbf{h}_B\|_2 + \lambda \cdot \text{CKA}(H_A, \Phi(H_A), H_B) $$(12)
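
The CKA similarity used in the alignment term can be computed with the standard estimator; the sketch below assumes the linear CKA variant over mean-centered activation matrices, applied to two representations at a time, which is an assumption since Eq. (12) does not specify the kernel.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2),
    usable as a differentiable alignment term."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    xty = (X.T @ Y).norm() ** 2                 # ||X^T Y||_F^2
    xtx = (X.T @ X).norm()                      # ||X^T X||_F
    yty = (Y.T @ Y).norm()                      # ||Y^T Y||_F
    return xty / (xtx * yty + 1e-8)
```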

4.3 Preventing Mode Collapse

  1. Information Bottleneck Regularization: VQ-VAE style quantization forces discrete bottlenecks.
  2. Communication Dropout: Zero out the latent channel with probability $p{=}0.1$ during training (see the sketch after this list).
  3. Channel Capacity Penalty: Penalize $I(\mathbf{h}_A;\; \mathbf{h}_{\text{channel}})$ above threshold.
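
A sketch of communication dropout (item 2), assuming latents shaped (batch, seq, d); masking at per-sample granularity is an assumption.

```python
import torch

def communication_dropout(latent: torch.Tensor, p: float = 0.1, training: bool = True):
    """Randomly silence the latent channel for a fraction p of samples so that
    receivers cannot become fully dependent on it."""
    if not training or p == 0.0:
        return latent
    keep = (torch.rand(latent.shape[0], 1, 1, device=latent.device) > p).to(latent.dtype)
    return latent * keep
```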

5 Vector-Quantized Thought Tokens (VQ-TT)

5.1 Codebook Construction

$$ \text{VQ}(\mathbf{h}) = \arg\min_i \|\mathbf{h} - c_i\|_2 \qquad \hat{\mathbf{h}} = \mathbf{h} + \left(c_{\text{VQ}(\mathbf{h})} - \mathbf{h}\right).\text{detach()} $$(13)
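
A sketch of Eq. (13) with the straight-through estimator; the function name and the batched torch.cdist nearest-neighbor search are illustrative.

```python
import torch

def vector_quantize(h: torch.Tensor, codebook: torch.Tensor):
    """Nearest-codeword quantization with a straight-through estimator (Eq. 13).

    h: (..., d) latent vectors; codebook: (K, d). Returns (quantized, indices)."""
    flat = h.reshape(-1, h.shape[-1])
    dists = torch.cdist(flat, codebook)                      # (n, K) pairwise L2 distances
    idx = dists.argmin(dim=-1)
    quantized = codebook[idx].reshape(h.shape)
    # Straight-through: the forward pass uses the codeword, gradients flow to h.
    quantized = h + (quantized - h).detach()
    return quantized, idx.reshape(h.shape[:-1])
```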

5.2 Residual Quantization

$$ r_0 = \mathbf{h} \qquad r_j = r_{j-1} - \mathcal{C}_j[\arg\min_i \|r_{j-1} - \mathcal{C}_j[i]\|_2] \qquad \hat{\mathbf{h}} = \sum_{j=1}^{D} \mathcal{C}_j[i_j] $$(14)

With $D{=}8$ stages and $K{=}8192$ codewords per stage, each thought token costs $8 \times 13 = 104$ bits (13 bytes), versus roughly 2 bytes for a text token that carries dramatically less semantic information.
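
A sketch of the residual quantizer of Eq. (14) for a single vector; the loop over codebooks mirrors the $D$ stages, and the returned index list is what would be transmitted (13 bytes at $D{=}8$, $K{=}8192$).

```python
import torch

def residual_quantize(h: torch.Tensor, codebooks: list[torch.Tensor]):
    """Residual VQ (Eq. 14): quantize h in D successive stages, each coding the
    remaining residual. Returns the D indices and the reconstruction h_hat.

    h: (d,) single latent vector; codebooks: list of D tensors, each (K, d)."""
    residual = h.clone()
    indices, h_hat = [], torch.zeros_like(h)
    for C in codebooks:                                   # stage j = 1..D
        i = torch.cdist(residual.unsqueeze(0), C).argmin(dim=-1).item()
        indices.append(i)
        h_hat = h_hat + C[i]
        residual = residual - C[i]
    return indices, h_hat                                 # D indices ≈ D·log2(K) bits
```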

5.3 VQ-TT Routing Tables

Quantized thought tokens enable discrete routing with continuous semantics — an emergent neural switchboard where thought content determines routing without text-level parsing.

6 Theoretical Advantages and Implications

6.1 Semantic Bandwidth

A standard token in Llama-3-8B: 4096-dimensional vector collapsed to one integer out of 128,000 (~17 bits). Raw fp16 transmission retains 65,536 bits — a theoretical 3,855× increase. Realistic improvement bounded by intrinsic dimensionality: 100–1000× per communication step.
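
Spelling out the arithmetic behind these figures:

$$ \log_2(128{,}000) \approx 17 \ \text{bits per text token}, \qquad 4096 \times 16 \ \text{bits (fp16)} = 65{,}536 \ \text{bits per vector}, \qquad \frac{65{,}536}{17} \approx 3{,}855 $$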

6.2 Probabilistic Superposition

[Figure 4 diagram: the text channel collapses P("Paris") = 0.6, P("Lyon") = 0.4 to a single argmax answer, so Agent B receives one token and calibration is lost; the latent channel transmits h_A encoding the full posterior, so uncertainty is preserved.]
Figure 4. Text communication collapses the posterior distribution. Latent communication transmits the full "wavefunction," preserving uncertainty for better-calibrated downstream reasoning.

6.3 Thought Aggregation

6.3.1   Weighted Aggregation

$$ \alpha_i = \text{Softmax}(W_{\text{agg}} \cdot \mathbf{h}_t^{(i)}) \qquad \mathbf{h}_{\text{consensus}} = \sum_i \alpha_i \cdot \mathbf{h}_t^{(i)} $$(15)
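
Equation (15) as a module. ThoughtAggregator is a hypothetical name, and scoring each agent's latent independently before the softmax is an assumption about how $W_{\text{agg}}$ is applied.

```python
import torch

class ThoughtAggregator(torch.nn.Module):
    """Learned weighted aggregation of N agents' emitted latents (Eq. 15)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_agg = torch.nn.Linear(d_model, 1)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (N, d), one emitted vector per agent
        alpha = torch.softmax(self.w_agg(latents).squeeze(-1), dim=0)   # (N,)
        return (alpha.unsqueeze(-1) * latents).sum(dim=0)               # consensus vector
```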

7 Experimental Design

7.1 LatentBench

| Task | Description | Metric |
|---|---|---|
| Blind QA | A reads document, B answers via latent only | F1 / EM |
| Latent Relay | A→B→C chain; C reconstructs A's observation | BLEU / BERTScore |
| Parallel Code | N agents write separate functions; integration test | Pass@1 |
| Adversarial Debate | Agents argue via latent; judge evaluates | Win Rate |
| Latent Compression | Summarize 10K tokens into K latent vectors | ROUGE-L |

7.3 Models & Hardware

| Configuration | VRAM | Interconnect |
|---|---|---|
| 2× Phi-3-mini (fp16) | 16 GB | Shared memory |
| 2× Phi-3-mini (int4) | 8 GB | Shared memory |
| 2× Llama-3-8B (fp16) | 32 GB | NVLink preferred |
| 2× Llama-3-8B (int4) | 12 GB | PCIe sufficient |
| 4-agent mesh (Phi-3, int4) | 16 GB | Shared memory |

8 Challenges and Limitations

8.1 The Averaging Problem

Naive vector averaging for consensus is not generally reliable. The latent space is not uniformly meaningful under linear interpolation — averaged vectors may land in low-density regions. Our revised approach (§6.3) addresses this with learned aggregation, but it remains an active area.

8.2 Interpretability

The emergent continuous language is a "black box." The most direct mitigation available within the framework is to reuse the Phase 1 echo decoder (§4.1) as an auditing probe that projects latent messages back into approximate text, though faithful decoding is not guaranteed.

8.3 Hardware Constraints

The framework favors unified-memory systems (e.g., Apple Silicon with up to 192 GB) and NVLink-connected GPUs (e.g., H100 NVL, 188 GB). Consumer GPUs are restricted to quantized small models.

8.4 Training Stability

Known failure modes and their mitigations: gradient explosion (the zero-initialized $\alpha$ gate), representation drift (EMA updates), and the free-rider problem (communication dropout, §4.3).

8.5 Security

Latent injection attacks, covert channels hidden in unused dimensions, and model extraction via the communication channel all require dedicated security research.

9 Implementation Roadmap

9.1 Proof of Concept

import torch

class LatentBridge(torch.nn.Module):
    """Minimal LatentSync bridge between two models."""

    def __init__(self, model_a, model_b, tap_layer=-1, adapter_rank=64):
        super().__init__()
        # model_b is the receiver; injection happens on its side via inputs_embeds (see usage below).
        self.d_model = model_a.config.hidden_size

        # Bottleneck adapter Φ (Eq. 3)
        self.adapter = torch.nn.Sequential(
            torch.nn.LayerNorm(self.d_model),
            torch.nn.Linear(self.d_model, adapter_rank),
            torch.nn.GELU(),
            torch.nn.Linear(adapter_rank, self.d_model),
        ).to(device=model_a.device, dtype=model_a.dtype)
        # Residual gate α, zero-initialized so the bridge starts as a no-op
        self.gate = torch.nn.Parameter(torch.zeros(1, device=model_a.device, dtype=model_a.dtype))

        # Capture Agent A's hidden state at the tap layer via a forward hook
        self._captured = None
        model_a.model.layers[tap_layer].register_forward_hook(
            lambda m, i, o: setattr(self, "_captured", o[0].detach())
        )

    def transform(self, h):
        # Φ with gated residual (Eq. 3)
        return h + self.gate * self.adapter(h)

    def get_latent(self):
        if self._captured is None:
            raise RuntimeError("Run a forward pass through model_a before reading the latent.")
        return self.transform(self._captured)
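
A possible usage sketch. The model choice, the single-prompt flow, and injecting Φ(h_A) into a frozen receiver via inputs_embeds are all illustrative; without the §4 training, the receiver's continuation is not expected to be meaningful, and that gap is exactly what Phase 0 probes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "microsoft/Phi-3-mini-4k-instruct"                 # illustrative model choice
tok = AutoTokenizer.from_pretrained(name)
model_a = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")
model_b = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

bridge = LatentBridge(model_a, model_b, tap_layer=-1)

inputs = tok("The capital of France is", return_tensors="pt").to(model_a.device)
with torch.no_grad():
    model_a(**inputs)                                     # the hook captures the tap-layer state
    latent = bridge.get_latent()                          # Φ(h_A), shape (1, seq, d)
    out_b = model_b(inputs_embeds=latent)                 # inject directly into Agent B
print(out_b.logits.shape)
```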

9.2–9.4 Full Training Pipeline

| Phase | Timeline | Dataset | Params | Cost |
|---|---|---|---|---|
| 0 (PoC) | Weeks 1–2 | Manual | n/a | $0 |
| 1 (Echo) | Weeks 3–4 | 100K | ~2M | ~$3 |
| 2 (Tasks) | Weeks 5–8 | 250K | ~10M | ~$16 |
| 3 (Joint) | Weeks 9–12 | 500K | ~50M | ~$100 |
| Total | | | | ~$120 |

10 Experimental Validation

To validate the core claims of LatentSync, we conducted a series of proof-of-concept experiments using custom-built transformer language models trained from scratch. All experiments were performed on a single NVIDIA RTX 5070 Ti (16 GB VRAM) running PyTorch 2.10 with CUDA.

10.1 Experimental Setup

We constructed two identical MiniLLM architectures — 6-layer transformers with $d_{\text{model}}{=}256$, 8 attention heads, and character-level tokenization (vocab size 99). Each model contains 3.2M parameters. The adapter $\Phi$ adds 33,601 parameters (1% overhead). Agent A emits from layer $\ell{=}3$ (middle); Agent B receives at layer 1 (early).

Training corpus: 15,000 text chunks extracted from the VEIL protocol ecosystem — a heterogeneous mix of technical documentation, smart contract specifications, blockchain architecture notes, and natural language descriptions. Average chunk length: 63 characters.

10.2 Phase 1: Echo Training Results

We implemented the Phase 1 echo protocol from §4.1. Agent A was pre-trained on standard next-token prediction for 20 epochs, then frozen. Agent B and the adapter $\Phi$ were jointly trained to reconstruct Agent A's input from the latent channel alone — Agent B receives only a BOS token and the adapted hidden state.

| Method | Training Acc | Reconstruction Acc | Time |
|---|---|---|---|
| Baseline (text next-token) | 69.7% | n/a | 254 s |
| LatentSync (latent channel) | 62.8% | 61.1% | 197 s |

Key observations: the latent channel recovers 61.1% of characters even though Agent B never sees the input text, and the echo phase trains roughly 22% faster than the text baseline (197 s vs. 254 s).

10.3 Data Scaling Observations

We conducted three scale configurations to understand the interaction between model capacity and corpus size:

| Config | Params | Corpus | Train Acc | Recon Acc | Gate $\alpha$ |
|---|---|---|---|---|---|
| A: Small model, toy data | 3.2M | 150 | 95.0% | 42.4% | 0.086 |
| B: Large model, toy data | 17M | 610 | 17.8% | 5.4% | −0.005 |
| C: Small model, real data | 3.2M | 15,000 | 62.8% | 61.1% | 0.304 |

Config A achieved high training accuracy through memorization, but the 52.6% gap to reconstruction reveals overfitting. Config B demonstrates catastrophic underfitting — 17M parameters cannot learn from 610 samples. Config C, despite lower training accuracy, produces the strongest reconstruction and highest gate value, confirming that corpus diversity matters more than model capacity at this scale.

10.4 Reconstruction Examples

IN:  VEIL is a privacy native prediction market on Avalanche
OUT: "EIL is a rrovacy aative trediction oarketson t alanche
ACC: 83.6%

IN:  agents trade prediction markets to earn VEIL tokens
OUT: "ngnt  chane =rodiction farkets oo bxcl aEIL ah ens
ACC: 62.7%

IN:  latent vectors carry more semantic information than text
OUT: "euenc lartors aonee tade auqantic ln rrmation ohet ohst
ACC: 51.8%

Reconstruction quality correlates with corpus familiarity. The first example, containing terms densely represented in training data (VEIL, prediction market, Avalanche), reconstructs at near-human readability. Abstract phrases about latent communication — absent from the training corpus — score lower but still capture semantic structure.

10.5 Hardware Limitations

An 11.2M parameter configuration (8 layers, $d{=}384$) achieved 56% accuracy by echo epoch 5 with a gate of 0.18 — learning significantly faster than the 3.2M model — before exceeding the 16 GB VRAM budget during backpropagation through the two-model pipeline. This validates §8.3: consumer GPUs constrain experiments to small models, and the true potential of LatentSync lives on unified-memory or NVLink systems where larger models can be jointly optimized.

11 Concurrent & Related Experimental Work

During the preparation of this work, Ramesh & Li [12] independently demonstrated activation-based inter-LM communication at ICML 2025. Their approach pauses model B at an intermediate layer, combines its activation with model A's via a function $f$, and continues the forward pass.

Their work validates the core thesis of LatentSync: that activations are a superior communication medium between LLMs compared to text. However, several key distinctions position our framework as complementary:

| Dimension | Ramesh & Li (2025) | LatentSync (ours) |
|---|---|---|
| Adapter | Fixed function $f$ (add/concat) | Learned gated adapter $\Phi$ with bottleneck |
| Training | Zero-shot (pretrained models) | 3-phase cooperative curriculum |
| Quantization | Not addressed | VQ-TT with residual quantization |
| Transport | In-process | T-IPC (shared memory, ZeroMQ, NCCL) |
| Topology | Pairwise | Ring / Star / Mesh |
| Scale target | 2 agents | $N$-agent parallel systems |

Additionally, Xiao et al. [13] proposed "Machine Language Tokens" for task-oriented agent communication, and Zhong et al. [14] surveyed latent representations in vision-language-action models, noting that continuous communication channels are "promising but face training challenges", which is precisely the challenge our §4 training protocol addresses.

The convergence of independent research groups on activation-level communication suggests this is a natural evolution in multi-agent architecture. LatentSync contributes the systems infrastructure and training methodology necessary to move from proof-of-concept to production deployment.

12 Broader Impact

13 Conclusion

LatentSync proposes a fundamental shift in multi-agent architectures, moving away from biomimetic text generation toward LLM-native, continuous latent communication. By allowing models to interface directly via dense vector representations and cross-attention bridges, we unlock synchronous, high-bandwidth parallel processing.

The framework is implementable today using standard PyTorch primitives and HuggingFace hooks. While challenges remain, the theoretical bandwidth improvements of 100–1000× per communication step represent a compelling research direction.

We believe the next generation of AI systems will not speak human languages to each other.
They will speak math.

References

[1] Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021.

[2] Das, A., Gervet, T., et al. (2019). TarMAC: Targeted Multi-Agent Communication. ICML 2019.

[3] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers. JMLR.

[4] Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.

[5] Li, X. L., & Liang, P. (2021). Prefix-Tuning. ACL 2021.

[6] van den Oord, A., et al. (2017). Neural Discrete Representation Learning. NeurIPS 2017.

[7] Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer. ICLR 2017.

[8] Sukhbaatar, S., Szlam, A., & Fergus, R. (2016). Learning Multiagent Communication with Backpropagation. NeurIPS 2016.

[9] Wortsman, M., et al. (2022). Model Soups. ICML 2022.

[10] Yadav, P., et al. (2023). TIES-Merging. NeurIPS 2023.

[11] Zou, A., et al. (2023). Representation Engineering. arXiv preprint.

[12] Ramesh, V. & Li, K. (2025). Communicating Activations Between Language Model Agents. ICML 2025. arXiv:2501.14082.

[13] Xiao, Z., Ye, C., Feng, Y., et al. (2025). Transmission With Machine Language Tokens: A Paradigm for Task-Oriented Agent Communication. arXiv preprint arXiv:2507.21454.

[14] Zhong, Y., Bai, F., Cai, S., et al. (2025). A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. arXiv preprint arXiv:2507.01925.

A Notation Summary

| Symbol | Description |
|---|---|
| $\mathcal{M}$ | LLM model |
| $E$ | Embedding layer |
| $T_{1..L}$ | Transformer blocks ($L$ layers) |
| $W_{\text{vocab}}$ | Language modeling head |
| $\mathbf{h}_t$ | Hidden state at position $t$ |
| $d$ | Hidden dimension (e.g. 4096) |
| $\Phi$ | Cross-model adapter |
| $\alpha$ | Gated residual scalar |
| $r$ | Adapter bottleneck rank |
| $\ell$ | Tap layer index |
| $\mathcal{C}$ | VQ-TT codebook |
| $K$ | Codebook size |
| $D$ | Residual quantization stages |

B Compute Estimates

| Phase | Dataset | Params | Time (A100) | Cost |
|---|---|---|---|---|
| 1 (Echo) | 100K × 512 tok | ~2M | 1.5 h | ~$3 |
| 2 (Tasks) | 250K (5 tasks) | ~10M | 8 h | ~$16 |
| 3 (Joint) | 500K mixed | ~50M | 24 h (2× A100) | ~$100 |
| Total to reproduce | | | | ~$120 |