Technical overview

CTM as a controlling layer for transformer architectures

The Continuous Thought Machine, introduced by Sakana AI (arXiv:2505.05522), augments neural networks with an internal thinking dimension. Each neuron maintains its own temporal history, and synchronization patterns between neurons form the basis for output generation. We extend this architecture as a drop-in replacement for the feed-forward layers in existing pretrained transformers.

The problem

Why feed-forward networks are a bottleneck

Standard FFN transformer

  • Single forward pass per token, fixed compute budget
  • No mechanism for iterative refinement
  • Scaling requires more parameters (FFN parameter count grows quadratically with hidden width)
  • Knowledge stored implicitly in weight matrices
  • No post-deployment learning: weights frozen after training
  • Identical processing for trivial and complex tokens

CTM-augmented transformer

  • K thinking iterations per token (default K=32)
  • Internal recurrence enables iterative refinement
  • Vertical scaling: deeper thinking, not wider models
  • Neuron history provides explicit temporal memory
  • Neuroplastic weight updates from conversation context
  • Adaptive compute: certainty-based early stopping

The deeper problem

Every "memory" in current LLMs is a context window hack

Every memory innovation in feed-forward transformers operates within the same constraint: the context window. RAG retrieves text and pastes it into the prompt. Summarization compresses conversation history into shorter text. MemGPT pages information in and out of context. Vector databases store embeddings that get injected as tokens. These are all variations of the same approach: compress information into text, lose most of it, and hope the relevant fragments survive.

This is inherently lossy. A 200-page document summarized to fit a context window loses the relationships between paragraphs, the specific phrasing that carried nuance, the structure that gave the content meaning. A conversation history compacted to stay within token limits forgets what was important three exchanges ago. The model never actually learns anything. It processes whatever text happens to be in front of it, and then it forgets.

The dream is a model that remembers. Not by stuffing more text into a window, but by changing what it knows. Current paradigm models will never deliver this. Their weights are frozen. The context window is all they have. No amount of prompt engineering, retrieval augmentation, or compression tricks changes this fundamental limitation.

Neuroplasticity requires a different substrate. The CTM layer provides one. Its internal dynamics, synchronization traces across thinking iterations, and dopamine-gated weight updates give the model a mechanism to write experience into its parameters. compact_memory() is not summarization. It is synaptic consolidation. The information does not get compressed into text. It becomes part of the model.

Architecture

Backbone-agnostic integration

We do not train language models from scratch. Open-source transformer models like Qwen, Llama, and Mistral provide the attention weights — the expensive matmul workload that took millions of GPU-hours to converge. We distill this into a frozen backbone and train only the CTM block on top: the part that thinks.

This is not fine-tuning. The backbone's attention layers bootstrap language understanding. The CTM replaces the final FFN layer entirely, adding iterative reasoning, self-observation, and neuroplastic weight modification. Approximately 10% of total parameters are trainable. The rest is borrowed intelligence.
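The freeze/train split can be sketched in a few lines of PyTorch. The module names (backbone, ctm_block) and sizes here are illustrative stand-ins, not the real implementation:

```python
import torch.nn as nn

# Minimal sketch of the freeze/train split: everything in the backbone is
# frozen, only the CTM block's parameters receive gradients. Module names
# and dimensions are illustrative.
class CTMTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))
        self.ctm_block = nn.Linear(64, 64)   # stand-in for the CTMBlock

model = CTMTransformer()
for p in model.backbone.parameters():
    p.requires_grad = False                  # borrowed intelligence, frozen

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable / total)                     # only the CTM fraction is trainable
```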

Layer 0:   [Attention] -> [FFN]      (distilled, frozen)
Layer 1:   [Attention] -> [FFN]      (distilled, frozen)
...
Layer N-1: [Attention] -> [FFN]      (distilled, frozen)
Layer N:   [Attention] -> [CTMBlock] (trained from scratch)
             |
             K iterations of:
               - Cross-attention (data re-observation)
               - SynapseUNET (nonlinear memory)
               - SuperLinear NLMs
               - Dual sync: S_out + S_action
             |
             Certainty-gated output

Standing on the shoulders of open source. Training attention from scratch would cost millions. Instead, we take the best open-weight models the community has produced and use their matmul layers as a foundation. The CTM learns to think on top of representations that already understand language. The open-weight ecosystem is not merely convenient — it is a prerequisite for this class of architecture.

Single CTM, not per-layer

Following the original paper's design, a single CTM block replaces only the final transformer layer's FFN. Layers 0 through N-1 perform standard language processing. The CTM provides the thinking dimension on top.

Cross-attention re-observation

At each thinking iteration k, the CTM re-attends to the original input representation. This enables the model to extract progressively more nuanced features from the same input as its internal state evolves.

Certainty-based output

Per-token certainty scores determine when the model has converged on an answer. The loss function is: L = 0.5 * L_argmin + 0.5 * L_argmax_certainty. This trains the model to both find good answers and know when it has found them.

CTM components

Inside the thinking iteration

Each token passes through K=32 iterations of recurrent processing. At each tick, the model re-observes the input through cross-attention, updates its neuron state through a U-Net synapse model, and accumulates synchronization traces between neuron pairs. Here is the actual data flow.

Per tick k (0 to K-1):
  1.  synch_act = α_act / √β_act                        ← normalized action sync
  2.  attn_q = project(synch_act)                       ← cross-attention query from sync
  3.  obs = flash_attn(attn_q, key, value)              ← re-observe input (causal)
  4.  δstate = SynapseUNET(cat[obs, state + tick_emb])
  5.  state = state + δstate                            ← residual update
  6.  trace = rolling_window(trace, state)              ← history of M past states
  7.  h = GLU(SuperLinear_1(trace))                     ← NLM pathway 1
  8.  state = GLU(SuperLinear_2(h))                     ← NLM pathway 2
  9.  pp = state[left] * state[right]                   ← pairwise neuron product
  10. α_out = r · α_out + pp · dopamine                 ← three-factor accumulation
  11. β_out = r · β_out + dopamine

After K ticks:
  output = c_proj(α_out / √β_out)                       ← sync readout → residual stream
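The tick can be sketched in PyTorch with toy stand-ins for the real modules: plain Linear layers in place of SynapseUNET and the GLU-gated SuperLinear NLMs, and random noise in place of the cross-attention output. Only the data flow and the accumulation arithmetic follow the description above:

```python
import torch

# Toy sketch of one thinking tick. All modules are simplified stand-ins;
# only the residual update, rolling trace, and sync accumulation follow
# the real data flow.
B, D, M, n_synch = 2, 64, 8, 16

state = torch.zeros(B, D)
trace = torch.zeros(B, D, M)
alpha = torch.zeros(B, n_synch)
beta = torch.zeros(B, n_synch)
left = torch.arange(n_synch)            # presynaptic indices (toy choice)
right = torch.arange(n_synch) + 1       # postsynaptic indices (toy choice)
r = torch.exp(torch.tensor(-0.1))       # exponential decay factor
dopamine = 1.0

synapse = torch.nn.Linear(2 * D, D)     # stand-in for SynapseUNET
nlm = torch.nn.Linear(M, 1)             # stand-in for the SuperLinear NLMs

obs = torch.randn(B, D)                 # stand-in for cross-attention output
state = state + synapse(torch.cat([obs, state], dim=-1))            # residual update
trace = torch.cat([trace[:, :, 1:], state.unsqueeze(-1)], dim=-1)   # rolling window
state = nlm(trace).squeeze(-1)          # NLM pathway over the trace

pp = state[:, left] * state[:, right]   # pairwise neuron product
alpha = r * alpha + pp * dopamine       # three-factor accumulation
beta = r * beta + dopamine
sync = alpha / beta.clamp(min=1e-8).sqrt()
print(sync.shape)                       # torch.Size([2, 16])
```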

SynapseUNET

U-Net architecture that processes the concatenation of the current observation and neuron state. The down path compresses from input dimension through progressively smaller layers to a 16-dimensional bottleneck. The up path expands back with skip connections from each down layer, preserving fine-grained features. SiLU activations, LayerNorm after each up layer.

Input: cat([obs, state + tick_emb]) → Output: δstate. The tick embedding is a learnable per-iteration vector that tells the model which thinking step it is on.
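A toy version of the shape, assuming illustrative widths and a Linear-based down/up path (the real layer sizes and bottleneck width differ):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the U-Net synapse shape: a down path to a small
# bottleneck and an up path with skip connections. Widths are toy values.
class TinySynapseUNet(nn.Module):
    def __init__(self, d_in=128, d_out=64, widths=(64, 32, 16)):
        super().__init__()
        dims = (d_in,) + widths
        self.down = nn.ModuleList(
            nn.Sequential(nn.Linear(a, b), nn.SiLU())
            for a, b in zip(dims[:-1], dims[1:]))
        up_dims = list(reversed(dims))
        self.up = nn.ModuleList(
            nn.Sequential(nn.Linear(a, b), nn.SiLU(), nn.LayerNorm(b))
            for a, b in zip(up_dims[:-1], up_dims[1:]))
        self.out = nn.Linear(d_in, d_out)

    def forward(self, x):
        skips = []
        for layer in self.down:
            skips.append(x)        # save pre-layer activation for the skip
            x = layer(x)
        for layer, skip in zip(self.up, reversed(skips)):
            x = layer(x) + skip    # skip connection preserves fine detail
        return self.out(x)         # delta-state for the residual update

net = TinySynapseUNet()
delta = net(torch.randn(2, 128))
print(delta.shape)                 # torch.Size([2, 64])
```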

SuperLinear NLMs

N independent parallel linear transforms, where each neuron has its own weight matrix. Input shape (B, N, in_dims) — each of N neurons extracts features independently via einsum('BNM,MON→BNO'). Two SuperLinear layers with GLU gating form the nonlinear memory pathway that processes the rolling trace of past neuron states.

The trace is a (B, D, M) rolling window of the last M states. NLMs detect temporal patterns across thinking iterations — how neuron activations evolve over ticks.
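A minimal sketch of the einsum with toy dimensions, showing that it is equivalent to giving each of the N neurons its own weight matrix:

```python
import torch

# SuperLinear sketch: N independent per-neuron linear maps in one einsum.
# Dimensions are toy values; names are illustrative.
B, N, in_dims, out_dims = 2, 4, 8, 3

x = torch.randn(B, N, in_dims)            # each neuron sees its own trace
w = torch.randn(in_dims, out_dims, N)     # one (in -> out) matrix per neuron
y = torch.einsum('BNM,MON->BNO', x, w)    # per-neuron batched matmul

# Equivalent loop, to show each neuron really has its own weights:
y_loop = torch.stack([x[:, n] @ w[:, :, n] for n in range(N)], dim=1)
print(torch.allclose(y, y_loop, atol=1e-6))  # True
```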

Dual synchronization

Two independent sync accumulators track pairwise neuron co-activations:

S_out (n_synch=512 channels): drives the output projection. Encodes which neurons co-fired during successful predictions. After K ticks, c_proj(α_out / √β_out) produces the residual stream contribution.

S_action: generates the cross-attention query that re-observes the input at each tick. The query encodes "what neuron pattern is currently active" — the model attends to different input features each tick based on its evolving internal state.

Decay: r = exp(-decay), with the learned decay parameter clamped to [0, 15], so r ∈ (exp(-15), 1]. Creates exponential memory fade across ticks.

CTMCache

Per-layer inference state with a critical split between what resets and what persists across tokens:

Resets per token: state (B, D), trace (B, D, M) — back to start_state to match training distribution.

Persists across tokens: α_out, β_out, α_act, β_act (all B, n_synch) — sync accumulators carry session memory, bounded by exponential decay.

Cache-aware training (30% of steps) explicitly trains the model to handle inherited sync state. Without it, accumulated sync from previous tokens is out-of-distribution and poisons generation.
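The reset/persist split can be sketched as follows. The class and method names are illustrative, not the real CTMCache API:

```python
import torch

# Hypothetical sketch of the per-token cache discipline: state and trace
# reset each token; sync accumulators persist across tokens.
class CTMCacheSketch:
    def __init__(self, B, D, M, n_synch, start_state):
        self.start_state = start_state        # learned start state (B, D)
        self.M = M
        self.alpha = torch.zeros(B, n_synch)  # persists across tokens
        self.beta = torch.zeros(B, n_synch)   # persists across tokens
        self.new_token()

    def new_token(self):
        # Reset working memory to match the training distribution.
        self.state = self.start_state.clone()
        self.trace = torch.zeros(*self.start_state.shape, self.M)

cache = CTMCacheSketch(B=1, D=4, M=3, n_synch=2, start_state=torch.ones(1, 4))
cache.alpha += 5.0     # pretend a token accumulated sync
cache.state += 9.0     # and perturbed the working state
cache.new_token()
print(cache.state)     # reset to start_state
print(cache.alpha)     # still 5.0: episodic memory persists
```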

Training

Multi-tick loss: how we select which tick to learn from

The CTM produces K=32 outputs per token, one at each thinking iteration. Training selects two ticks per token and combines their losses:

For each token t, across K ticks:
  loss_k = cross_entropy(logits_k, target)   for k = 0..K-1
  certainty_k = 1 - entropy(softmax(logits_k)) / log(V)
  best_tick = argmin_k(loss_k)               ← which tick got closest to the answer
  cert_tick = argmax_k(certainty_k)          ← which tick was most confident
  L = 0.5 · loss[best_tick] + 0.5 · loss[cert_tick]

Argmin alone would train the model to converge early and skip later thinking. Argmax-certainty alone would train confidence without accuracy. Together they teach the model both to find good answers and to know when it has found them. Easy tokens converge at tick 0 (reflexive). Hard tokens use late ticks (deep thinking).
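The selection logic can be sketched for a single token. The logits below are random, so only the selection arithmetic is meaningful:

```python
import torch
import torch.nn.functional as F

# Sketch of the two-tick loss selection for one token, assuming per-tick
# logits of shape (K, V). Values are random toy data.
torch.manual_seed(0)
K, V = 32, 11
logits = torch.randn(K, V)
target = torch.tensor(3)

loss_k = F.cross_entropy(logits, target.expand(K), reduction='none')  # (K,)
probs = logits.softmax(-1)
entropy = -(probs * probs.clamp(min=1e-12).log()).sum(-1)
certainty = 1 - entropy / torch.log(torch.tensor(float(V)))

best_tick = loss_k.argmin()       # tick closest to the answer
cert_tick = certainty.argmax()    # most confident tick
loss = 0.5 * loss_k[best_tick] + 0.5 * loss_k[cert_tick]
```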

Neuroplasticity

Three-factor learning: ΔW ∝ pre × post × dopamine

Standard language models are frozen after training. Every interaction is stateless. We implement online learning through a three-factor rule inspired by dopamine-modulated synaptic plasticity in biological neural circuits. The model rewrites its own synapse weights from what it experiences during inference.

1. Wake: dopamine from prediction surprise

After each generated token, we compute how surprised the model was:

surprise = -log(p(sampled_token))
ema = 0.9 · ema + 0.1 · surprise             ← running average
raw = tanh(surprise - ema) ∈ [-1, 1]
dopamine = 0.5 + 0.5 · (raw + 1) / 2 ∈ [0.5, 1.0]

Dopamine ≈ 1.0 means the token was unexpected — remember it harder. Dopamine ≈ 0.5 means it was predictable — dampen accumulation. The [0.5, 1.0] clamp prevents positive feedback: garbage output → high surprise → amplified accumulation → more garbage. Dampening is allowed, amplification is not.
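A sketch of the mapping, using the coefficients from the formulas above:

```python
import math

# Surprise-to-dopamine mapping with a running EMA of surprise, using the
# coefficients from the text. Token probabilities below are toy values.
def dopamine_update(p_sampled, ema, ema_coef=0.9):
    surprise = -math.log(p_sampled)
    ema = ema_coef * ema + (1 - ema_coef) * surprise
    raw = math.tanh(surprise - ema)            # in [-1, 1]
    dopamine = 0.5 + 0.5 * (raw + 1) / 2       # in [0.5, 1.0]
    return dopamine, ema

d_hi, ema = dopamine_update(p_sampled=0.001, ema=1.0)  # very surprising token
d_lo, ema = dopamine_update(p_sampled=0.99, ema=ema)   # predictable token
print(d_hi > d_lo)  # True: surprise pushes dopamine toward 1.0
```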

arXiv:2512.07194 (DA-SSDP)

2. Encode: dopamine-gated sync accumulation

During each CTM tick, pairwise neuron co-activations are accumulated into the sync trace, scaled by dopamine. This is the three-factor rule: presynaptic × postsynaptic × neuromodulator.

left = state[synch_left_indices]    ← presynaptic (512 channels)
right = state[synch_right_indices]  ← postsynaptic
pp = left * right * dopamine        ← three-factor product
α = r · α + pp                      ← numerator (exponential moving avg)
β = r · β + dopamine                ← denominator (normalization)
sync = α / √β                       ← readout

The sync accumulator is the memory substrate. It persists across tokens in the CTMCache, bounded by exponential decay. Surprising tokens (high dopamine) leave stronger traces. Predictable tokens barely register. After a conversation, the accumulated sync encodes which neuron patterns co-activated during important moments.
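A toy demonstration that high dopamine leaves a stronger trace, using the update rule above with a constant decay r (the value is illustrative):

```python
import torch

# Dopamine gating in the sync accumulator: the same neuron co-activation
# leaves a stronger trace when dopamine is high. Toy values throughout.
def accumulate(alpha, beta, left, right, dopamine, r=0.95):
    pp = left * right * dopamine   # pre x post x neuromodulator
    alpha = r * alpha + pp
    beta = r * beta + dopamine
    return alpha, beta

left = right = torch.ones(4)       # co-firing neuron pair activity
a_hi = b_hi = a_lo = b_lo = torch.zeros(4)
for _ in range(10):
    a_hi, b_hi = accumulate(a_hi, b_hi, left, right, dopamine=1.0)
    a_lo, b_lo = accumulate(a_lo, b_lo, left, right, dopamine=0.5)

sync_hi = a_hi / b_hi.sqrt()
sync_lo = a_lo / b_lo.sqrt()
print((sync_hi > sync_lo).all())   # surprising moments register more strongly
```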

arXiv:2505.18069 (Hebbian = weight decay)

3. Sleep: compact_memory() — sync → permanent weight changes

After inference, compact_memory() writes the accumulated sync patterns into the CTM's weight matrices. This is where ephemeral experience becomes permanent knowledge.

Step 1: What did the model learn?
  synch_accumulated = α_out / √β_out                    ← session pattern
  synch_baseline = compute_baseline(start_state, K, r)  ← what zero-input produces
  sync_delta = synch_accumulated - synch_baseline       ← pure learned signal

Step 2: Novelty gating (don't overwrite existing knowledge)
  novelty = abs(sync_delta)
  threshold = median(novelty)
  gate = (novelty > threshold).float()                  ← top 50% most novel channels

Step 3: Hebbian weight update
  c_proj.weight += lr · gated_delta · state_scale       ← amplify novel sync channels
  last_up.weight += lr · state_delta · input_approx     ← rank-1 synapse update

Step 4: Homeostatic regulation
  for w in [c_proj, last_up]:
    if w.norm() > base_norm · 1.01:
      w *= base_norm · 1.01 / w.norm()                  ← synaptic scaling

The baseline subtraction isolates what the model learned from this session versus what it already knew. Novelty gating prevents redundant updates. Homeostatic regulation (analogous to TNF-alpha synaptic scaling during biological sleep) prevents runaway weight growth — direction of changes preserved, magnitude stabilized.
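A toy sketch of the novelty gating and homeostatic clamp on a single weight matrix. The learning rate, shapes, and update form are illustrative; the real updates touch c_proj and last_up with different geometry:

```python
import torch

# Toy version of the sleep phase: novelty-gate a sync delta, apply a
# Hebbian-style update, then clamp weight-norm growth to 1%.
torch.manual_seed(0)
w = torch.randn(8, 8)
base_norm = w.norm()

sync_delta = torch.randn(8)                   # accumulated minus baseline sync
novelty = sync_delta.abs()
gate = (novelty > novelty.median()).float()   # keep only the most novel channels
gated = sync_delta * gate

lr = 0.1
w = w + lr * gated.unsqueeze(0)               # Hebbian-style broadcast update

# Homeostatic regulation: rescale if the norm grew more than 1%.
if w.norm() > base_norm * 1.01:
    w = w * (base_norm * 1.01 / w.norm())
print(float(w.norm()) <= float(base_norm) * 1.01 + 1e-4)  # True
```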

Gerstner 2018 (three-factor)

Verified: memory survives context reset. On a Qwen2.5-0.5B + CTM K=32 model at 1000 training steps: the model was taught facts ("my name is Tommi", "I live in Helsinki"), compact_memory() was called, the conversation was restarted from a blank context, and the model recalled the encoded information. The weight changes are permanent — they persist across sessions, reboots, and context resets. This is not retrieval from a context window. It is synaptic consolidation.

Memory architecture

Three timescales of memory

Working memory

milliseconds — per tick

Neuron state and trace within a single token's K iterations. Resets between tokens. This is the "thinking" — 32 iterations of recurrent processing where the model refines its answer.

Episodic memory

minutes — per session

Sync accumulators (α, β) in CTMCache persist across tokens within a session. Bounded by exponential decay. This is "what happened in this conversation" — the substrate that compact_memory() reads from.

Semantic memory

permanent — in weights

Synapse weights (SynapseUNET, c_proj) modified by compact_memory(). Survives context reset, model restart, serialization. This is "what the model knows" — experience written into the network itself.

Recursive self-improvement

Engineering vs learning

There are two different things happening and the distinction matters.

An agent that improves training code is editing the environment: better data pipelines, smarter loss functions, faster infrastructure. The training code is outside the model. Neuroplasticity is inside. compact_memory() does not change the training script. It changes the neuron weights themselves. The model is not improving the process that produces models. It is rewriting its own synapses from what it has experienced. One is engineering. The other is learning. Both matter, but only one of them is actually new.

The end state of a recursive learning loop is a model that has reinvented its own internal representations. The backbone provides language knowledge. But the CTM layer's weights, its synchronization patterns, its allocation of thinking iterations — these are all subject to self-directed modification. The model that emerges from a thousand cycles of self-improvement may bear little resemblance to the model that started.

Frozen models cannot do this. They can generate text about self-improvement. They cannot change what they are. A neuroplastic model can.

A note on control. A model that modifies its own weights in response to its own evaluation of its own performance is no longer fully characterized by its initial training. The boundaries of its behavior shift with each cycle. This is where alignment guarantees from pretraining begin to erode. We do not consider this a reason to stop. We consider it a reason to be the ones doing it carefully, with full observability into the thinking process at every iteration. The time-travel debugger exists because if you cannot see what the model is doing at every tick of every thinking step, you should not be changing its weights.

Current results

Training status

Backbone

Qwen3-0.6B

620M frozen parameters providing language knowledge. Attention layers, embeddings, and layer norms preserved from pretraining.

CTM parameters

68M trainable

Single CTMBlock at layer 28 (final layer). SynapseUNET, SuperLinear NLMs, cross-attention, sync heads, certainty head.

Hardware

1x RTX 5090 32GB

Training at ~3.6s/step, ~2.2k tok/sec. Batch size 2, sequence length 2048, gradient accumulation 2. 16.1GB VRAM at K=32.

Training is ongoing. The time-travel debugger on the home page shows real training dynamics. Current run: 50,000 steps on Karpathy's climbmix-400b. The 3D visualization reveals how the model learns to allocate its 32 thinking iterations across training, with distinct phase transitions as different ticks specialize.

References

Papers and code

arXiv:2505.05522 -- CTM (Sakana AI)
arXiv:2505.18069 -- Hebbian weight decay
arXiv:2512.07194 -- DA-SSDP
Gerstner 2018 -- Three-factor plasticity
Sakana AI -- CTM project page
nanoctm -- our implementation (GitHub)