The Continuous Thought Machine, introduced by Sakana AI (arXiv:2505.05522), augments neural networks with an internal thinking dimension. Each neuron maintains its own temporal history, and synchronization patterns between neurons form the basis for output generation. We extend this architecture as a drop-in replacement for the feed-forward layers in existing pretrained transformers.
Every memory innovation in feed-forward transformers operates within the same constraint: the context window. RAG retrieves text and pastes it into the prompt. Summarization compresses conversation history into shorter text. MemGPT pages information in and out of context. Vector databases store embeddings that get injected as tokens. These are all variations of the same approach: compress information into text, lose most of it, and hope the relevant fragments survive.
This is inherently lossy. A 200-page document summarized to fit a context window loses the relationships between paragraphs, the specific phrasing that carried nuance, the structure that gave the content meaning. A conversation history compacted to stay within token limits forgets what was important three exchanges ago. The model never actually learns anything. It processes whatever text happens to be in front of it, and then it forgets.
The dream is a model that remembers. Not by stuffing more text into a window, but by changing what it knows. Current paradigm models will never deliver this. Their weights are frozen. The context window is all they have. No amount of prompt engineering, retrieval augmentation, or compression tricks changes this fundamental limitation.
Neuroplasticity requires a different substrate. The CTM layer provides one. Its internal dynamics, synchronization traces across thinking iterations, and dopamine-gated weight updates give the model a mechanism to write experience into its parameters. compact_memory() is not summarization. It is synaptic consolidation. The information does not get compressed into text. It becomes part of the model.
We do not train our models from scratch. Open-source transformer models like Qwen, Llama, and Mistral provide the attention weights: the expensive matmul workload that takes millions of GPU-hours to converge. We reuse these as a frozen backbone and train only the CTM block on top: the part that thinks.
This is not fine-tuning. The backbone's attention layers bootstrap language understanding. The CTM replaces the final FFN layer entirely, adding iterative reasoning, self-observation, and neuroplastic weight modification. Approximately 10% of total parameters are trainable. The rest is borrowed intelligence.
Following the original paper's design, a single CTM block replaces only the final transformer layer's FFN. All earlier layers perform standard language processing; the CTM provides the thinking dimension on top.
At each thinking iteration k, the CTM re-attends to the original input representation. This enables the model to extract progressively more nuanced features from the same input as its internal state evolves.
Per-token certainty scores determine when the model has converged on an answer. The loss function is: L = 0.5 * L_argmin + 0.5 * L_argmax_certainty. This trains the model to both find good answers and know when it has found them.
Each token passes through K=32 iterations of recurrent processing. At each tick, the model re-observes the input through cross-attention, updates its neuron state through a U-Net synapse model, and accumulates synchronization traces between neuron pairs. Here is the actual data flow.
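A shape-level sketch of that per-token tick loop, in NumPy. All weights are random stand-ins for the learned modules described below (synapse U-Net, neuron-level models, cross-attention), the "cross-attention" is stubbed as an additive query, and the pairwise sync is simplified to per-channel self-products; only the control flow and shapes are meant to match the description:

```python
import numpy as np

B, D, M, K, D_in = 2, 64, 8, 32, 32   # batch, neurons, trace len, ticks, input dim
rng = np.random.default_rng(0)

# Random stand-ins for the learned modules.
W_q   = rng.standard_normal((D, D_in)) * 0.05      # sync readout -> attention query
W_syn = rng.standard_normal((D_in + D, D)) * 0.05  # "synapse" update
W_nlm = rng.standard_normal((M, 1)) * 0.5          # per-neuron read of the trace

def tick_loop(x_input, decay=0.1):
    state = np.zeros((B, D))           # neuron state, reset per token
    trace = np.zeros((B, D, M))        # rolling window of past states
    alpha = np.zeros((B, D))           # sync accumulators (simplified per-channel)
    beta  = np.zeros((B, D))
    r = np.exp(-decay)                 # exponential fade of old sync
    for k in range(K):
        q   = np.tanh(alpha / np.sqrt(beta + 1e-6)) @ W_q  # S_action -> query
        obs = x_input + q              # stub for cross-attending to the input
        state = state + np.tanh(np.concatenate([obs, state], axis=1) @ W_syn)
        trace = np.concatenate([trace[:, :, 1:], state[:, :, None]], axis=2)
        state = np.tanh((trace @ W_nlm)[:, :, 0])          # NLMs read the trace
        alpha = r * alpha + state * state                  # accumulate co-activation
        beta  = r * beta + 1.0
    return state, alpha, beta

state, alpha, beta = tick_loop(rng.standard_normal((B, D_in)))
print(state.shape, alpha.shape)   # (2, 64) (2, 64)
```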
The synapse model is a U-Net that processes the concatenation of the current observation and the neuron state. The down path compresses from the input dimension through progressively smaller layers to a 16-dimensional bottleneck; the up path expands back with skip connections from each down layer, preserving fine-grained features. SiLU activations, with LayerNorm after each up layer.
Input: cat([obs, state + tick_emb]) → Output: δstate. The tick embedding is a learnable per-iteration vector that tells the model which thinking step it is on.
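A minimal NumPy sketch of that shape flow. The 16-dimensional bottleneck and the skip-connection pattern follow the description above; the layer widths and the random weights are placeholders, not the trained module:

```python
import numpy as np

rng = np.random.default_rng(0)
silu = lambda x: x / (1 + np.exp(-x))

def layernorm(x):                       # normalize over the last axis
    mu, sd = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sd + 1e-6)

D_obs, D, K = 32, 64, 32
widths = [D_obs + D, 48, 16]            # down path: input -> ... -> 16-d bottleneck
down_W = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(widths, widths[1:])]
up_W   = [rng.standard_normal((b, a)) * 0.1 for a, b in zip(widths, widths[1:])][::-1]
out_W  = rng.standard_normal((widths[0], D)) * 0.1
tick_emb = rng.standard_normal((K, D)) * 0.1    # learnable per-tick vector

def synapse(obs, state, k):
    """cat([obs, state + tick_emb[k]]) -> delta-state, U-Net with skips."""
    h = np.concatenate([obs, state + tick_emb[k]], axis=-1)
    skips = []
    for W in down_W:                    # compress toward the bottleneck
        skips.append(h)
        h = silu(h @ W)
    for W, s in zip(up_W, reversed(skips)):  # expand back with skip connections
        h = layernorm(silu(h @ W) + s)
    return h @ out_W                    # delta applied to the neuron state

delta = synapse(np.zeros((2, D_obs)), np.zeros((2, D)), k=0)
print(delta.shape)   # (2, 64)
```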
SuperLinear layers implement N independent parallel linear transforms, where each neuron has its own weight matrix. Input shape (B, N, in_dims): each of the N neurons extracts features independently via einsum('BNM,MON->BNO'). Two SuperLinear layers with GLU gating form the nonlinear memory pathway that processes the rolling trace of past neuron states.
The trace is a (B, D, M) rolling window of the last M states. NLMs detect temporal patterns across thinking iterations — how neuron activations evolve over ticks.
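A NumPy sketch of the per-neuron einsum and the two-layer GLU pathway. The class is an illustrative stand-in for the torch module (random weights, single output dim); the einsum subscripts are the ones given above:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, M, O = 2, 64, 8, 1    # batch, neurons, trace length (memory), out dims

class SuperLinear:
    """N independent linear transforms: each neuron owns its own (M, O) matrix."""
    def __init__(self, in_dims, out_dims, n_neurons):
        self.W = rng.standard_normal((in_dims, out_dims, n_neurons)) * 0.1
        self.b = np.zeros((n_neurons, out_dims))
    def __call__(self, x):              # x: (B, N, M)
        return np.einsum('BNM,MON->BNO', x, self.W) + self.b

# GLU gating: two SuperLinears, one producing values, one a sigmoid gate.
value, gate = SuperLinear(M, O, N), SuperLinear(M, O, N)

def nlm(trace):                         # trace: (B, N, M) -> new state (B, N)
    sig = lambda x: 1 / (1 + np.exp(-x))
    return (value(trace) * sig(gate(trace)))[:, :, 0]

new_state = nlm(rng.standard_normal((B, N, M)))
print(new_state.shape)   # (2, 64)
```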
Two independent sync accumulators track pairwise neuron co-activations:

- S_out (n_synch = 512 channels) drives the output projection. It encodes which neurons co-fired during successful predictions. After K ticks, c_proj(α_out / √β_out) produces the residual stream contribution.
- S_action generates the cross-attention query that re-observes the input at each tick. The query encodes "what neuron pattern is currently active", so the model attends to different input features each tick based on its evolving internal state.
- Decay: r = exp(-decay), with the learnable decay parameter clamped to [0, 15]. This creates an exponential memory fade across ticks.
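One accumulator step can be sketched as follows. This assumes the clamp applies to the decay parameter before exponentiation, and that tracked neuron pairs are indexed explicitly; both are illustrative choices, not the verbatim implementation:

```python
import numpy as np

def update_sync(alpha, beta, state, pairs, decay):
    """One tick of a sync accumulator over chosen neuron pairs.

    alpha, beta: (B, n_synch) running numerator / normalizer
    state:       (B, D) current neuron activations
    pairs:       (n_synch, 2) indices of tracked neuron pairs
    """
    r = np.exp(-np.clip(decay, 0.0, 15.0))     # decay clamped, then exponentiated
    coact = state[:, pairs[:, 0]] * state[:, pairs[:, 1]]  # pairwise co-activation
    alpha = r * alpha + coact
    beta  = r * beta + 1.0
    return alpha, beta

def sync_readout(alpha, beta):
    """Normalized synchronization, fed to c_proj (S_out) or the query (S_action)."""
    return alpha / np.sqrt(beta + 1e-8)

rng = np.random.default_rng(0)
B, D, n_synch = 2, 64, 512
pairs = rng.integers(0, D, size=(n_synch, 2))
alpha, beta = np.zeros((B, n_synch)), np.zeros((B, n_synch))
for _ in range(32):                             # K ticks
    alpha, beta = update_sync(alpha, beta, rng.standard_normal((B, D)),
                              pairs, decay=np.full(n_synch, 0.2))
print(sync_readout(alpha, beta).shape)   # (2, 512)
```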
Per-layer inference state splits critically between what resets and what persists across tokens:

- Resets per token: state (B, D) and trace (B, D, M) return to start_state, matching the training distribution.
- Persists across tokens: α_out, β_out, α_act, β_act (all B, n_synch). The sync accumulators carry session memory, bounded by exponential decay.

Cache-aware training (30% of steps) explicitly trains the model to handle inherited sync state. Without it, accumulated sync from previous tokens is out-of-distribution and poisons generation.
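The reset/persist split can be sketched as a small cache class. Field names follow the text above; the class shape itself is illustrative, not the implementation's API:

```python
import numpy as np

class CTMCache:
    """Per-layer inference state: what resets per token vs. persists per session."""
    def __init__(self, B, D, M, n_synch):
        self.start_state = np.zeros((B, D))
        self.M = M
        # Persist across tokens: session sync memory, bounded by decay.
        self.alpha_out = np.zeros((B, n_synch))
        self.beta_out  = np.zeros((B, n_synch))
        self.alpha_act = np.zeros((B, n_synch))
        self.beta_act  = np.zeros((B, n_synch))
        self.reset_token_state()

    def reset_token_state(self):
        """Called at each new token: back to the training distribution."""
        self.state = self.start_state.copy()                       # (B, D)
        self.trace = np.zeros(self.start_state.shape + (self.M,))  # (B, D, M)

cache = CTMCache(B=1, D=64, M=8, n_synch=512)
cache.alpha_out += 1.0        # pretend a token left sync traces
cache.reset_token_state()     # state/trace reset; sync survives
print(cache.alpha_out.max(), cache.state.max())   # 1.0 0.0
```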
The CTM produces K=32 outputs per token, one at each thinking iteration. Training selects two ticks per token and combines their losses:
Argmin alone would train the model to converge early and skip later thinking. Argmax-certainty alone would train confidence without accuracy. Together they teach the model both to find good answers and to know when it has found them. Easy tokens converge at tick 0 (reflexive). Hard tokens use late ticks (deep thinking).
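The two-tick selection from the loss formula above, on a toy per-token trajectory (the cross-entropy values and certainties here are made up for illustration):

```python
import numpy as np

def ctm_loss(losses, certainties):
    """losses, certainties: (K,) per-tick CE loss and certainty for one token.

    L = 0.5 * loss[argmin(loss)] + 0.5 * loss[argmax(certainty)]
    """
    t1 = int(np.argmin(losses))        # the tick with the best answer
    t2 = int(np.argmax(certainties))   # the tick the model is most sure about
    return 0.5 * losses[t1] + 0.5 * losses[t2]

losses      = np.array([2.0, 1.2, 0.7, 0.9])   # toy K=4 trajectory
certainties = np.array([0.1, 0.4, 0.6, 0.8])
print(ctm_loss(losses, certainties))   # 0.5*0.7 + 0.5*0.9 = 0.8
```

When the most certain tick is also the most accurate one, the two terms coincide and the loss simply rewards converged, confident answers; they diverge exactly when confidence and accuracy disagree.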
Standard language models are frozen after training. Every interaction is stateless. We implement online learning through a three-factor rule inspired by dopamine-modulated synaptic plasticity in biological neural circuits. The model rewrites its own synapse weights from what it experiences during inference.
After each generated token, we compute how surprised the model was:
Dopamine ≈ 1.0 means the token was unexpected — remember it harder. Dopamine ≈ 0.5 means it was predictable — dampen accumulation. The [0.5, 1.0] clamp prevents positive feedback: garbage output → high surprise → amplified accumulation → more garbage. Dampening is allowed, amplification is not.
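One plausible mapping from token surprisal to this signal. The exact normalization is not specified in the text, so the `scale` constant below is hypothetical; the sketch only preserves the stated properties (surprising tokens approach 1.0, predictable ones approach 0.5, clamped to [0.5, 1.0]):

```python
import numpy as np

def dopamine(logits, token_id, scale=5.0):
    """Map the surprisal of the emitted token into [0.5, 1.0].

    `scale` is a hypothetical saturation constant (nats of surprisal at
    which the signal pins to 1.0), not taken from the implementation.
    """
    logp = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    surprise = -logp[token_id]                      # nats of surprisal
    return float(np.clip(0.5 + 0.5 * surprise / scale, 0.5, 1.0))

logits = np.array([4.0, 0.0, 0.0])
print(dopamine(logits, 0))   # expected token  -> near 0.5
print(dopamine(logits, 1))   # surprising token -> closer to 1.0
```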
During each CTM tick, pairwise neuron co-activations are accumulated into the sync trace, scaled by dopamine. This is the three-factor rule: presynaptic × postsynaptic × neuromodulator.
The sync accumulator is the memory substrate. It persists across tokens in the CTMCache, bounded by exponential decay. Surprising tokens (high dopamine) leave stronger traces. Predictable tokens barely register. After a conversation, the accumulated sync encodes which neuron patterns co-activated during important moments.
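The three-factor accumulation can be sketched as the sync update with the dopamine scalar multiplied in. Whether the normalizer β is also dopamine-scaled is an assumption here; the pre × post × neuromodulator structure is the point:

```python
import numpy as np

def three_factor_update(alpha, beta, state, pairs, decay, da):
    """Hebbian co-activation gated by the dopamine scalar `da` in [0.5, 1.0]."""
    r = np.exp(-np.clip(decay, 0.0, 15.0))
    pre  = state[:, pairs[:, 0]]          # presynaptic activity
    post = state[:, pairs[:, 1]]          # postsynaptic activity
    alpha = r * alpha + da * pre * post   # pre x post x neuromodulator
    beta  = r * beta + da                 # assumed: normalizer scaled too
    return alpha, beta

rng = np.random.default_rng(0)
B, D, n = 1, 16, 32
pairs = rng.integers(0, D, size=(n, 2))
s = rng.standard_normal((B, D))
a1, _ = three_factor_update(np.zeros((B, n)), np.zeros((B, n)), s, pairs, 0.2, da=1.0)
a2, _ = three_factor_update(np.zeros((B, n)), np.zeros((B, n)), s, pairs, 0.2, da=0.5)
print(np.abs(a1).sum() > np.abs(a2).sum())   # True: surprise leaves a stronger trace
```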
After inference, compact_memory() writes the accumulated sync patterns into the CTM's weight matrices. This is where ephemeral experience becomes permanent knowledge.
The baseline subtraction isolates what the model learned from this session versus what it already knew. Novelty gating prevents redundant updates. Homeostatic regulation (analogous to TNF-alpha synaptic scaling during biological sleep) prevents runaway weight growth — direction of changes preserved, magnitude stabilized.
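A sketch of those three mechanisms (baseline subtraction, novelty gating, homeostatic scaling) acting on one weight matrix. The outer-product write and the threshold value are placeholders; only the gating logic and the direction-preserving renormalization follow the description:

```python
import numpy as np

def compact_memory(W, sync, baseline, lr=1e-3, novelty_thresh=0.1):
    """Write session-end sync into a weight matrix W (illustrative shapes).

    sync, baseline: (n,) sync readouts at session end and session start.
    """
    novelty = sync - baseline                 # what this session added
    if np.linalg.norm(novelty) < novelty_thresh:
        return W                              # novelty gate: nothing new to write
    delta = lr * np.outer(novelty, novelty)   # placeholder mapping sync -> weights
    W_new = W + delta[: W.shape[0], : W.shape[1]]
    # Homeostatic scaling: keep the update's direction, renormalize magnitude
    # so the total weight norm stays at its pre-update level (sleep-like scaling).
    W_new *= np.linalg.norm(W) / (np.linalg.norm(W_new) + 1e-8)
    return W_new

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W2 = compact_memory(W, sync=rng.standard_normal(8), baseline=np.zeros(8))
print(np.allclose(np.linalg.norm(W2), np.linalg.norm(W)))   # True: norm preserved
```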
Three memory timescales result:

- Milliseconds (per tick): neuron state and trace within a single token's K iterations, reset between tokens. This is the "thinking": 32 iterations of recurrent processing where the model refines its answer.
- Minutes (per session): sync accumulators (α, β) in the CTMCache persist across tokens within a session, bounded by exponential decay. This is "what happened in this conversation": the substrate that compact_memory() reads from.
- Permanent (in weights): synapse weights (SynapseUNET, c_proj) modified by compact_memory(). These survive context reset, model restart, and serialization. This is "what the model knows": experience written into the network itself.
There are two different things happening and the distinction matters.
An agent that improves training code is editing the environment: better data pipelines, smarter loss functions, faster infrastructure. The training code is outside the model. Neuroplasticity is inside. compact_memory() does not change the training script. It changes the neuron weights themselves. The model is not improving the process that produces models. It is rewriting its own synapses from what it has experienced. One is engineering. The other is learning. Both matter, but only one of them is actually new.
The end state of a recursive learning loop is a model that has reinvented its own internal representations. The backbone provides language knowledge. But the CTM layer's weights, its synchronization patterns, its allocation of thinking iterations — these are all subject to self-directed modification. The model that emerges from a thousand cycles of self-improvement may bear little resemblance to the model that started.
Frozen models cannot do this. They can generate text about self-improvement. They cannot change what they are. A neuroplastic model can.
- Backbone: Qwen3-0.6B. 620M frozen parameters providing language knowledge; attention layers, embeddings, and layer norms preserved from pretraining.
- CTM: 68M trainable parameters. A single CTMBlock at layer 28 (the final layer): SynapseUNET, SuperLinear NLMs, cross-attention, sync heads, certainty head.
- Hardware: 1x RTX 5090 32GB. Training runs at ~3.6 s/step (~2.2k tok/sec) with batch size 2, sequence length 2048, gradient accumulation 2, and 16.1 GB VRAM at K=32.