thesecretlab
Bypassing the tokenizer. Letting models speak math to each other.
Current multi-agent Large Language Model (LLM) frameworks rely on natural language or structured formats (e.g., JSON) as the primary medium of communication. We argue that this reliance introduces a severe "Tokenization Bottleneck," forcing high-dimensional, probabilistic neural representations to be lossily compressed into 1D discrete human text, only to be immediately decompressed by the receiving agent. In this paper, we introduce LatentSync, a novel framework for local, parallel agent-to-agent communication that bypasses the tokenizer entirely. LatentSync utilizes an LLM-native continuous language, allowing agents to transmit raw, high-dimensional latent vectors ($\mathbf{h} \in \mathbb{R}^d$) directly into the embedding spaces of sibling agents. By employing Cross-Attention Bridges and Tensor Inter-Process Communication (T-IPC), LatentSync enables synchronous, parallel state-sharing with near-zero latency. Furthermore, we outline a cooperative multi-agent training paradigm designed to bootstrap this non-human, continuous neural language from foundation models.
The advent of multi-agent LLM systems has enabled complex problem-solving through specialization, debate, and delegation. However, frameworks such as AutoGPT, LangChain, and standard actor-critic LLM setups are constrained by a fundamental anthropocentric design flaw: they force neural networks to communicate using human language.
When Agent A communicates with Agent B, the continuous high-dimensional thought (the final hidden state $\mathbf{h}_t$) must be projected through the `lm_head` into logits, sampled into discrete token IDs, and detokenized into strings. Agent B then tokenizes these strings and projects them back into continuous embeddings ($\mathbf{e}_t$). We define this as the Compression-Decompression Cycle.
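The cycle can be made concrete with a toy sketch of the projection–sampling–re-embedding round trip (dimensions here are illustrative, not Llama-scale):

```python
import torch

# Toy dimensions standing in for a real LLM (illustrative: d=8, vocab=32).
d, vocab = 8, 32
lm_head = torch.nn.Linear(d, vocab, bias=False)   # W_vocab
embed = torch.nn.Embedding(vocab, d)              # E

h_A = torch.randn(1, d)                  # Agent A's final hidden state h_t

# Compression: project to logits, collapse to a discrete token ID.
logits = lm_head(h_A)                    # R^d -> R^|V|
token_id = torch.argmax(logits, dim=-1)  # one of |V| symbols survives

# Decompression: Agent B re-embeds the token.
e_B = embed(token_id)                    # token ID -> R^d

# e_B generally != h_A: everything not captured by the argmax is discarded.
```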
This cycle introduces three critical inefficiencies:

1. **Information loss** — a 4096-dimensional hidden state is collapsed to one of ~128K discrete symbols (~17 bits), discarding the vast majority of its semantic content.
2. **Latency** — serialization, transport, and parsing add tens to hundreds of milliseconds per exchange.
3. **Redundant compute** — every message must be detokenized by the sender and re-tokenized and re-embedded by the receiver, work that conveys no new information.
To resolve this, we propose LatentSync, a framework that establishes a "neural-native" medium. By routing un-projected continuous vectors—or Vector-Quantized Thought Tokens (VQ-TT)—directly between the hidden layers of parallel models, agents can achieve direct, simultaneous semantic synchronization.
The majority of existing work treats LLMs as black boxes communicating via text. AutoGPT (Significant Gravitas, 2023) chains prompt-response loops; CrewAI and LangGraph route structured messages between role-specialized agents. All remain bound to the tokenizer.
Early work in Multi-Agent Reinforcement Learning (MARL), such as CommNet [8] and TarMAC [2], explored continuous vector passing with targeted attention. However, these were limited to small, specialized MLPs, not pre-trained transformer blocks with billions of parameters.
Prompt tuning [4] and prefix tuning [5] demonstrated that continuous vectors in the input embedding space can outperform discrete text for task conditioning. LatentSync extends this concept dynamically to inter-agent communication at inference time.
Sparse MoE architectures [7][3] route tokens to specialized sub-networks. Model merging techniques [9][10] combine weight spaces post-hoc. LatentSync differs by maintaining distinct, independently operating models that share activation-level information rather than weight-level information.
Recent work on representation engineering [11] and activation steering demonstrates that meaningful, manipulable structure exists within transformer hidden states. LatentSync exploits this structure as a communication medium rather than a control mechanism.
LatentSync is engineered upon three core pillars: Continuous Vector Communication (CVC), Cross-Attention Memory Bridges, and Tensor IPC.
Let an LLM $\mathcal{M}$ consist of an embedding layer $E$, transformer blocks $T_{1..L}$, and a language modeling head $W_{\text{vocab}}$. In standard inference:

$$\mathbf{h}_t = (T_L \circ \cdots \circ T_1 \circ E)(x_{1..t}), \qquad p(x_{t+1} \mid x_{1..t}) = \mathrm{softmax}(W_{\text{vocab}}\,\mathbf{h}_t)$$

In LatentSync, the communication channel is established prior to $W_{\text{vocab}}$. Agent A emits $\mathbf{h}_t^{(A)} \in \mathbb{R}^d$, which is routed directly into Agent B's residual stream through the adapter $\Phi$:

$$\tilde{\mathbf{e}}^{(B)} = \Phi\big(\mathbf{h}_t^{(A)}\big) \in \mathbb{R}^d$$
$\Phi$ is a lightweight bottleneck MLP with residual gating:

$$\Phi(\mathbf{h}) = \mathbf{h} + \alpha \cdot W_2\,\mathrm{GELU}\big(W_1\,\mathrm{LN}(\mathbf{h})\big), \qquad W_1 \in \mathbb{R}^{r \times d},\; W_2 \in \mathbb{R}^{d \times r}$$

where $\alpha$ is initialized to zero so the bridge starts as an identity map.
Layer-Selective Tapping (LST) — the emitting agent broadcasts from a configurable layer index $\ell$, reading the residual stream after block $T_\ell$:

$$\mathbf{h}_\ell^{(A)} = (T_\ell \circ \cdots \circ T_1 \circ E)\big(x^{(A)}\big)$$

The tap index can also be learned: logits $\theta \in \mathbb{R}^L$ over candidate layers are relaxed via Gumbel-Softmax, so the broadcast becomes the differentiable mixture $\sum_\ell w_\ell\,\mathbf{h}_\ell^{(A)}$ with $w = \mathrm{gumbel\text{-}softmax}(\theta)$.
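A minimal sketch of learned tap selection (layer count, temperature, and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

# Soft selection over L candidate layer taps (toy sizes).
L_layers, d = 32, 4096
layer_states = torch.randn(L_layers, d)          # h_l for each layer l
tap_logits = torch.nn.Parameter(torch.zeros(L_layers))

# hard=True yields a discrete one-hot tap in the forward pass while
# gradients flow through the soft Gumbel-Softmax relaxation.
w = F.gumbel_softmax(tap_logits, tau=1.0, hard=True)
h_broadcast = w @ layer_states                   # (approximately) one h_l
```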
Agent B's attention heads cross-attend to Agent A's active KV cache, concatenating A's keys and values into B's context:

$$\mathrm{Attn}\big(Q^{(B)},\ [K^{(B)};\,K^{(A)}_{\text{top-}k}],\ [V^{(B)};\,V^{(A)}_{\text{top-}k}]\big)$$

Restricting B to the top-$k$ most relevant entries of A's cache reduces the cross-attention cost from $O(n^2)$ to $O(n \cdot k)$, with a default of $k = 128$.
Bridge heads (2 of 32) handle cross-agent attention, while the remaining heads preserve ordinary self-attention, so the host model's behavior is only minimally perturbed.
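A toy sketch of this head partitioning (all sizes and names are illustrative, not the production kernel):

```python
import torch
import torch.nn.functional as F

n_heads, d_head, n_b, n_a = 32, 128, 16, 16
bridge_heads = {0, 1}                          # 2 of 32 heads bridge to Agent A

q = torch.randn(n_heads, n_b, d_head)          # Agent B queries
k_self = torch.randn(n_heads, n_b, d_head)     # B's own keys/values
v_self = torch.randn(n_heads, n_b, d_head)
k_a = torch.randn(n_a, d_head)                 # Agent A's (top-k) KV cache
v_a = torch.randn(n_a, d_head)

out = []
for h in range(n_heads):
    if h in bridge_heads:                      # bridge heads see B's and A's KV
        k = torch.cat([k_self[h], k_a])
        v = torch.cat([v_self[h], v_a])
    else:                                      # remaining heads: pure self-attn
        k, v = k_self[h], v_self[h]
    attn = F.softmax(q[h] @ k.T / d_head**0.5, dim=-1)
    out.append(attn @ v)
out = torch.stack(out)                         # (n_heads, n_b, d_head)
```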
High-throughput tensor transport using ZeroMQ PUB/SUB or NCCL primitives, operating at full interconnect bandwidth.
```python
# Sender (Agent A)
import numpy as np
from multiprocessing import shared_memory

tensor = model_a.hidden_states[layer_idx]
shm = shared_memory.SharedMemory(create=True, size=tensor.nbytes)
np.ndarray(tensor.shape, dtype=np.float16, buffer=shm.buf)[:] = tensor.cpu().numpy()
zmq_socket.send_pyobj({"shm_name": shm.name, "shape": tuple(tensor.shape)})

# Receiver (Agent B)
meta = zmq_socket.recv_pyobj()
shm = shared_memory.SharedMemory(name=meta["shm_name"])
tensor = torch.from_numpy(np.ndarray(meta["shape"], dtype=np.float16, buffer=shm.buf))
```
| Method | Latency | Payload | Semantic Bandwidth |
|---|---|---|---|
| Text (REST/JSON) | 50–200 ms | ~1 KB | ~17 bits / token |
| Text (gRPC/protobuf) | 10–50 ms | ~1 KB | ~17 bits / token |
| LatentSync (PCIe 4.0) | 0.02 ms | 8 KB | 65,536 bits |
| LatentSync (NVLink) | 0.005 ms | 8 KB | 65,536 bits |
| LatentSync (Shared Mem) | 0.001 ms | 8 KB | 65,536 bits |

At 8 KB per latent vector ($d{=}4096$, fp16), LatentSync transmits roughly 500 tokens' worth of semantic content in under 20 μs via PCIe — a 2,500–10,000× latency speedup over text-based communication.
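The arithmetic behind the fp16 figures, as a quick check:

```python
d = 4096                       # hidden dimension (Llama-3-8B scale)
bits_per_vector = d * 16       # fp16: 16 bits per dimension -> 65,536 bits
bytes_per_vector = bits_per_vector // 8          # 8,192 B = 8 KB payload
bits_per_token = 17            # ~log2(128,000), Llama-3's vocabulary
ratio = bits_per_vector // bits_per_token        # the ~3,855x theoretical gain
print(bits_per_vector, bytes_per_vector, ratio)  # 65536 8192 3855
```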
Pre-trained LLMs expect embeddings from human text. Injecting foreign latent vectors causes catastrophic interference. Agents must be trained to speak this new language.
Phase 1 — Echo Training: the receiving agent learns to reconstruct the sender's original text from the transmitted latent alone, grounding the continuous channel in shared semantics.
Phase 2 — Asymmetric Information Tasks: each agent holds information the other lacks (e.g., Blind QA), so task success is only possible through genuine latent communication.
Phase 3 — End-to-End Joint Optimization: the bridge adapters, gates, and tap parameters of both agents are optimized jointly against downstream task metrics.
With $D{=}8$ residual stages and $K{=}8192$ codewords per stage, each thought token costs $8 \times \log_2 8192 = 8 \times 13 = 104$ bits (13 bytes), versus ~2 bytes for a typical text token that carries dramatically less semantic information.
Quantized thought tokens enable discrete routing with continuous semantics — an emergent neural switchboard where thought content determines routing without text-level parsing.
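A minimal residual vector quantization sketch in the spirit of VQ-TT (toy sizes; random codebooks and the `encode`/`decode` helpers are illustrative — in LatentSync the codebooks $\mathcal{C}$ would be learned):

```python
import torch

d, K, D = 64, 16, 4                        # toy; real config: d=4096, K=8192, D=8
codebooks = [torch.randn(K, d) for _ in range(D)]

def encode(h):
    """Quantize h into D codeword indices, one per residual stage."""
    codes, residual = [], h.clone()
    for C in codebooks:
        idx = torch.cdist(residual.unsqueeze(0), C).argmin().item()
        codes.append(idx)
        residual = residual - C[idx]       # next stage quantizes what's left
    return codes

def decode(codes):
    return sum(C[i] for C, i in zip(codebooks, codes))

h = torch.randn(d)
codes = encode(h)                          # D integers: D * log2(K) bits total
h_hat = decode(codes)                      # approximate reconstruction of h
```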
A standard token in Llama-3-8B collapses a 4096-dimensional hidden state to a single integer drawn from a 128,000-entry vocabulary (~17 bits). Raw fp16 transmission retains all 65,536 bits — a theoretical 3,855× increase. The realistic improvement is bounded by the representation's intrinsic dimensionality [1]: 100–1000× per communication step.
| Task | Description | Metric |
|---|---|---|
| Blind QA | A reads document, B answers via latent only | F1 / EM |
| Latent Relay | A→B→C chain; C reconstructs A's observation | BLEU / BERTScore |
| Parallel Code | N agents write separate functions; integration test | Pass@1 |
| Adversarial Debate | Agents argue via latent; judge evaluates | Win Rate |
| Latent Compression | Summarize 10K tokens into K latent vectors | ROUGE-L |
| Configuration | VRAM | Interconnect |
|---|---|---|
| 2× Phi-3-mini (fp16) | 16 GB | Shared memory |
| 2× Phi-3-mini (int4) | 8 GB | Shared memory |
| 2× Llama-3-8B (fp16) | 32 GB | NVLink preferred |
| 2× Llama-3-8B (int4) | 12 GB | PCIe sufficient |
| 4-agent mesh (Phi-3, int4) | 16 GB | Shared memory |
Naive vector averaging for consensus is not generally reliable. The latent space is not uniformly meaningful under linear interpolation — averaged vectors may land in low-density regions. Our revised approach (§6.3) addresses this with learned aggregation, but it remains an active area.
The emergent continuous language is a "black box": exchanged vectors leave no human-readable transcript. One candidate mitigation is to log each transmitted latent alongside its nearest decoding through $W_{\text{vocab}}$, yielding an approximate text trace for auditing.
Favors unified memory (Apple Silicon: 192 GB) and NVLink systems (H100 NVL: 188 GB). Consumer GPUs restricted to quantized small models.
Gradient explosion (mitigated by $\alpha$ gate), representation drift (EMA updates), and free-rider problem (communication dropout).
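Communication dropout — randomly masking the bridged latent during training so neither agent can free-ride on the other's computation — can be sketched as (the drop rate `p` is an assumed hyperparameter):

```python
import torch

def comm_dropout(latent, p=0.2, training=True):
    """With probability p, zero out the bridged latent during training,
    forcing each agent to remain independently capable."""
    if training and torch.rand(()) < p:
        return torch.zeros_like(latent)
    return latent
```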
Latent injection attacks, covert channels in unused dimensions, and model extraction via communication channel — all require dedicated security research.
```python
import torch

class LatentBridge:
    """Minimal LatentSync bridge between two models."""

    def __init__(self, model_a, model_b, tap_layer=-1, adapter_rank=64):
        self.d_model = model_a.config.hidden_size
        # Bottleneck adapter Φ: LN -> down-project to rank r -> GELU -> up-project
        self.adapter = torch.nn.Sequential(
            torch.nn.LayerNorm(self.d_model),
            torch.nn.Linear(self.d_model, adapter_rank),
            torch.nn.GELU(),
            torch.nn.Linear(adapter_rank, self.d_model),
        ).to(model_a.device)
        # Gated residual scalar α, zero-initialized so the bridge starts inert
        self.gate = torch.nn.Parameter(torch.zeros(1, device=model_a.device))
        # Forward hook captures Agent A's hidden state at the tap layer
        self._captured = None
        model_a.model.layers[tap_layer].register_forward_hook(
            lambda m, i, o: setattr(self, "_captured", o[0].detach())
        )

    def transform(self, h):
        # h' = h + α · Φ(h)
        return h + self.gate * self.adapter(h)

    def get_latent(self):
        # Latent ready for injection into Agent B's stream
        return self.transform(self._captured)
```
| Phase | Timeline | Dataset | Params | Cost |
|---|---|---|---|---|
| 0 — PoC | Weeks 1–2 | Manual | — | $0 |
| 1 — Echo | Weeks 3–4 | 100K | ~2M | ~$3 |
| 2 — Tasks | Weeks 5–8 | 250K | ~10M | ~$16 |
| 3 — Joint | Weeks 9–12 | 500K | ~50M | ~$100 |
| Total | — | — | — | ~$120 |
LatentSync proposes a fundamental shift in multi-agent architectures, moving away from biomimetic text generation toward LLM-native, continuous latent communication. By allowing models to interface directly via dense vector representations and cross-attention bridges, we unlock synchronous, high-bandwidth parallel processing.
The framework is implementable today using standard PyTorch primitives and HuggingFace hooks. While challenges remain, the theoretical bandwidth improvements of 100–1000× per communication step represent a compelling research direction.
We believe the next generation of AI systems will not speak human languages to each other.
They will speak math.
[1] Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021.
[2] Das, A., Gervet, T., et al. (2019). TarMAC: Targeted Multi-Agent Communication. ICML 2019.
[3] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers. JMLR.
[4] Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.
[5] Li, X. L., & Liang, P. (2021). Prefix-Tuning. ACL 2021.
[6] van den Oord, A., et al. (2017). Neural Discrete Representation Learning. NeurIPS 2017.
[7] Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer. ICLR 2017.
[8] Sukhbaatar, S., Szlam, A., & Fergus, R. (2016). Learning Multiagent Communication with Backpropagation. NeurIPS 2016.
[9] Wortsman, M., et al. (2022). Model Soups. ICML 2022.
[10] Yadav, P., et al. (2023). TIES-Merging. NeurIPS 2023.
[11] Zou, A., et al. (2023). Representation Engineering. arXiv preprint.
| Symbol | Description |
|---|---|
| $\mathcal{M}$ | LLM model |
| $E$ | Embedding layer |
| $T_{1..L}$ | Transformer blocks ($L$ layers) |
| $W_{\text{vocab}}$ | Language modeling head |
| $\mathbf{h}_t$ | Hidden state at position $t$ |
| $d$ | Hidden dimension (e.g. 4096) |
| $\Phi$ | Cross-model adapter |
| $\alpha$ | Gated residual scalar |
| $r$ | Adapter bottleneck rank |
| $\ell$ | Tap layer index |
| $\mathcal{C}$ | VQ-TT codebook |
| $K$ | Codebook size |
| $D$ | Residual quantization stages |
| Phase | Dataset | Params | Time (A100) | Cost |
|---|---|---|---|---|
| 1 — Echo | 100K × 512 tok | ~2M | 1.5h | ~$3 |
| 2 — Tasks | 250K (5 tasks) | ~10M | 8h | ~$16 |
| 3 — Joint | 500K mixed | ~50M | 24h (2× A100) | ~$100 |
| Total to reproduce | — | — | — | ~$120 |