A Friendly Guide to the Brain-like Dragon Hatchling (BDH)
Modern neural networks can recognize faces, write stories, and even pass programming interviews — but they all share the same limitation: they stop learning once deployed.
A few weeks ago, a group of engineers and researchers — Adrian Kosowski, Przemysław Uznański, Jan Chorowski, Zuzanna Stamirowska, and Michał Bartoszkiewicz — published a fascinating paper introducing a new idea in the field of machine learning and neural architectures. In simple terms, they proposed a new type of artificial neural network.
https://arxiv.org/abs/2509.26507
The paper itself is quite dense — filled with math, formulas, and graphs — but full of bold ideas. I wanted to unpack it in a way that’s easier to digest: to make a popular-science overview, with a few metaphors and simplifications of my own.
Imagine a young dragon hatchling that has just broken out of its shell. It already knows how to fly and breathe fire — but it doesn’t yet know how to react to the world around it. It doesn’t learn from books, but from experience — right in the middle of flight — memorizing which actions helped and which didn’t.
That’s the essence of BDH — the Brain-like Dragon Hatchling: a new neural architecture that combines classic pretraining (like in standard networks) with instant, self-directed learning during inference.
A neural network is a system of neurons connected by “weights” that adjust through gradient descent, gradually reducing error — much like a student improving after each test by reviewing mistakes. However, once the test is over, the student no longer learns — the learning happened earlier, before the test.
That’s how today’s models like GPT work: they learn inside the egg — and then stop.
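Gradient descent itself fits in a few lines of code. Here is a toy sketch (plain Rust, my own example, not from the paper) of a single weight being nudged downhill after each “test”:

```rust
// Toy gradient descent: fit y = w * x to one example (x = 2.0, target = 6.0).
fn main() {
    let (x, target) = (2.0_f64, 6.0_f64);
    let mut w = 0.0_f64; // the single "weight" we are learning
    let lr = 0.1;        // learning rate
    for step in 0..20 {
        let y = w * x;            // prediction
        let err = y - target;     // signed error
        let grad = 2.0 * err * x; // d(err^2)/dw
        w -= lr * grad;           // move against the gradient
        if step % 5 == 0 {
            println!("step {step}: w = {w:.4}, loss = {:.6}", err * err);
        }
    }
    // w converges toward 3.0, since 3.0 * 2.0 = 6.0
}
```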
What makes the Dragon Hatchling different?
The BDH is designed a bit smarter. It has two kinds of memory:
- Permanent memory, like any normal neural network — this is what it learned before hatching.
- Temporary memory, resembling instincts or short-term connections between thoughts.
When BDH processes information, it creates new connections on the fly. If two neurons activate together — the connection between them strengthens.
This is known as the Hebbian learning rule:
“Neurons that fire together, wire together.”
These connections are stored in a separate matrix σ, which acts as a temporary map of what has recently happened. If a similar situation occurs later, BDH recalls: “Ah, I’ve seen this before — and here’s what worked.”
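In code, the Hebbian rule is almost embarrassingly simple. Here is a toy sketch (plain Rust, my own illustration; names like `hebbian_update` are mine, not the paper’s):

```rust
// Hebbian sketch: strengthen sigma[i][j] when neuron i (post) and
// neuron j (pre) fire at the same time.
fn hebbian_update(sigma: &mut [[f32; 4]; 4], pre: &[f32; 4], post: &[f32; 4], lr: f32) {
    for i in 0..4 {
        for j in 0..4 {
            // the product is large only when both neurons are active
            sigma[i][j] += lr * post[i] * pre[j];
        }
    }
}

fn main() {
    let mut sigma = [[0.0f32; 4]; 4];
    let pre = [1.0, 0.0, 1.0, 0.0];  // neurons 0 and 2 are firing
    let post = [0.0, 1.0, 0.0, 0.0]; // neuron 1 fires in response
    hebbian_update(&mut sigma, &pre, &post, 0.1);
    println!("sigma[1][0] = {}", sigma[1][0]); // 0.1: fired together, wired together
}
```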
What changes with BDH?
BDH transforms the learning process itself. It learns while it works, even without running backpropagation. It can adapt to new information on the go, without retraining or heavy GPU computations.
In other words — BDH is a network that learns to live, not just to repeat.
Learning to Stand, Fly, and Breathe Fire
Every living creature has its own learning stages. A dragon hatchling first learns to stand, then to flap its wings, and eventually to breathe fire. The BDH model follows a similar path — each stage of its “life” brings a different kind of learning.
Stage 1: Standing (Classic Pretraining)
This is where BDH learns, like any traditional neural network. It’s trained on data, adjusts weights via gradient descent, and minimizes loss — the familiar supervised learning phase. Think of it as the dragon strengthening its legs before taking the first flight.
At this stage, the model is trained offline on a large dataset — text corpora, translations, and other examples. It uses standard backpropagation, an optimizer like AdamW, and a loss function that predicts the next token.
During this process, BDH develops its permanent weights, referred to as “G” in the paper (the fixed ruleset). These correspond to what, in a transformer, would be parameters like Wq, Wk, Wv, W1, W2, and so on.
Stage 2: Flying (Online Adaptation)
Once training ends, most networks stop changing. But BDH keeps learning in real time. It has a Hebbian memory — a fast-acting connection map that updates itself during inference. If certain neurons activate together, their connection grows stronger; if not, it weakens. This is how BDH adapts to new situations mid-flight, without retraining.
During inference — when BDH reads or generates text — it updates its temporary internal states, denoted as σ(i, j), or “synaptic weights.”
This process isn’t gradient descent. Instead, it follows a local learning rule:
If neuron i and neuron j fire together → strengthen their connection σ(i, j).
This simple rule implements Hebbian learning — often summarized as “neurons that fire together, wire together.”
These updates are short-lived: they exist only while a dialogue or reasoning session is active. Once σ is reset, the model returns to its original “hatched” knowledge — the way it was trained before flight.
Stage 3: Breathing Fire (Self-regulation)
BDH doesn’t just strengthen all connections — it keeps them balanced. The model uses sparsity thresholds and normalization to prevent runaway feedback loops. It learns to “breathe fire” carefully — powerful, but controlled. Too much activation would lead to instability; too little would make it unresponsive. The balance between those extremes is what gives BDH its “life”.
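Mechanically, this “controlled fire-breathing” is just a few guard rails applied to σ after every update. A toy sketch of the idea (my own simplification; the full tch-rs version appears later in this article):

```rust
// Keep the fast memory bounded, sparse, and slowly forgetting.
fn regulate(sigma: &mut [f32], forget: f32, smax: f32, thresh: f32) {
    for s in sigma.iter_mut() {
        *s *= 1.0 - forget;        // exponential forgetting: old links fade
        *s = s.clamp(-smax, smax); // cap strength to prevent runaway feedback
        if s.abs() < thresh {
            *s = 0.0;              // prune tiny links: σ stays sparse
        }
    }
}

fn main() {
    let mut sigma = vec![2.5, -0.004, 0.8];
    regulate(&mut sigma, 0.2, 1.0, 5e-3);
    println!("{:?}", sigma); // ≈ [1.0, 0.0, 0.64]
}
```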
The paper briefly mentions an intriguing idea: if the Hebbian updates (σ) are preserved and averaged over time, BDH could develop something resembling long-term memory — a mechanism akin to slowly updating its core weights. However, the authors haven’t yet formalized the exact algorithm for this process.
They suggest that:
- Fast memory (σ) operates on short timescales — minutes or a few hundred tokens.
- Slow memory (G) evolves over much longer periods — days or across model updates.
This opens the door to lifelong learning — systems that can continuously acquire new knowledge without erasing what they already know. Unlike classic transformers, which suffer from catastrophic forgetting, BDH hints at a future where models can remember their past while growing into the future.
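The paper doesn’t specify this consolidation algorithm, so take the following as a purely speculative sketch of one shape it could take (nothing here comes from the paper):

```rust
// Speculative: slowly fold a running average of the fast memory σ
// into the slow weights G, so experience becomes knowledge.
fn consolidate(g: &mut [f32], sigma_avg: &[f32], eps: f32) {
    for (slow, fast) in g.iter_mut().zip(sigma_avg.iter()) {
        *slow += eps * fast; // tiny, slow transfer from σ into G
    }
}

fn main() {
    let mut g = vec![0.5, -0.2];
    consolidate(&mut g, &[0.1, 0.3], 0.01);
    println!("{:?}", g); // ≈ [0.501, -0.197]
}
```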
Why I Believe BDH Is an Evolution, Not Just Another Model
The paper “The Brain-like Dragon Hatchling (BDH)” isn’t just theoretical — it points toward a new direction in AI architecture that offers real, measurable advantages.
Transparent and Interpretable AI
One of the biggest pain points in modern LLMs is opacity — we rarely know why a model made a particular decision. BDH changes that: its “synapses” directly correspond to conceptual relationships. You can literally see which connections strengthen as the model “thinks” about a given idea. Its activations are sparse and positive (just like in the brain), making it possible to debug and even audit reasoning processes.
➡️ This opens the door for explainable AI in critical domains — medicine, finance, law — where understanding why a model reached its conclusion is as important as the conclusion itself.
On-the-Fly Learning (Inference-Time Learning)
BDH applies Hebbian learning even during inference — meaning the connections between neurons can evolve without retraining. It adapts to the user or context in real time, developing a form of short-term memory that “remembers” ideas across tokens and paragraphs.
➡️ This pushes LLMs closer to lifelong learning — models that keep improving mid-conversation, the way humans do, without any extra fine-tuning.
Stable and Scalable Reasoning Over Time
Transformers struggle with long-range reasoning — once you go beyond their trained context window, coherence collapses. BDH, however, is designed as a scale-free system — its behavior remains stable as reasoning depth and neuron count grow.
➡️ That means we can build agentic systems that run for days or even weeks — planning, researching, or simulating — without losing logical consistency.
Merging Models Without Catastrophic Forgetting
BDH exhibits a striking property — model merging: two models can be “fused” simply by connecting their graphs. Unlike with transformers, this doesn’t degrade performance or require retraining.
➡️ You can combine models from different domains (say, medical and legal) without fine-tuning. ➡️ This paves the way for modular AI, where reusable “neural plugins” can be connected like software components.
Performance and Efficiency
BDH-GPU works as a state-space system, meaning it can be trained efficiently using PyTorch and GPUs. Its parameter count and compute cost grow linearly with network size, avoiding the quadratic blow-ups that plague large transformer stacks.
➡️ This enables building powerful models in the 10M–1B parameter range, making BDH accessible to independent researchers and startups alike.
Connection to Neuromorphic Computing
Because BDH is naturally defined in terms of neurons and synapses, it’s a perfect fit for neuromorphic hardware — chips like Loihi or TrueNorth that emulate biological networks directly in silicon.
➡️ This opens possibilities for running large-scale reasoning models on energy-efficient edge devices, robotics platforms, or bio-inspired systems.
A Step Toward “Axiomatic AI”
The authors introduce the idea of Axiomatic AI — systems whose behavior can not only be observed but formally predicted over time. It’s like discovering the “thermodynamics of intelligence”: predictable scaling laws and stable reasoning dynamics.
➡️ This points toward certifiable and safe AI architectures, suitable for use in autonomous, high-stakes environments — from finance and healthcare to transportation.
Building a Simple Neural Network
To really understand how BDH works, I decided to build a tiny proof-of-concept — a minimal “tiny-BDH” in Rust, trained on the classic XOR problem. It uses autograd via `tch-rs` (a Rust wrapper around libtorch, the C++ core of PyTorch). This little project was inspired by the famous “A Neural Network in 11 Lines of Python”, but my goal wasn’t brevity — it was clarity. I wanted to deeply understand how BDH’s mechanisms could work in practice.
The full source code is available in my GitHub repo ZhukMax/tiny_bdh_xor, prepared specifically for this article. Below, I’ll walk through the implementation step by step. It may look verbose, but that’s intentional — the goal here is maximum transparency and accessibility for anyone curious about BDH internals.
Cargo.toml
Since this example is written in Rust, we start with a `Cargo.toml` file — the manifest that defines the project and its dependencies.
The key dependency here is `tch`, a safe Rust wrapper around the libtorch C++ library, which powers PyTorch. It gives us access to tensors, autograd, and other core features of deep learning directly from Rust.
Because BDH uses familiar concepts like neurons and synapses, it makes sense to reuse these existing abstractions rather than re-implement them from scratch. Our goal isn’t to recreate PyTorch — it’s to explore the learning logic behind BDH in the simplest possible form.
Here’s the relevant snippet from `Cargo.toml`:
[package]
name = "tiny_bdh_xor"
version = "0.1.0"
edition = "2021"
[dependencies]
anyhow = "1.0.100"
tch = { version = "0.22", features = ["download-libtorch"] }
💡 The `download-libtorch` feature tells Cargo to automatically fetch and link the correct libtorch binaries for your OS and architecture. Without it, you’d need to manually install PyTorch and set the `LIBTORCH` environment variable. With it, everything “just works” — Cargo downloads and links the library during build.
(Note: the exact version of `tch` may differ depending on your setup.)
`src/main.rs` — The Core of Our Tiny BDH
In Rust projects, all source files live inside the `src` directory. Since this is a minimal example, we’ll keep everything in a single file — `main.rs`. Let’s import the necessary dependencies and set up the entry point:
use anyhow::Result;
use tch::{nn, Device, Kind, Reduction, Tensor};
use tch::nn::{Init, OptimizerConfig};
fn main() -> Result<()> {
let dev = if tch::Cuda::is_available() { Device::Cuda(0) } else { Device::Cpu };
Ok(())
}
Choosing the Device (CPU or GPU)
Right at the start of `main`, we decide where to run the computations — on the GPU or CPU:
- `tch::Cuda::is_available()` checks whether CUDA is installed and detects any NVIDIA GPUs.
- If CUDA is available, the code selects the first GPU: `Device::Cuda(0)`.
- If CUDA isn’t available (for example, on a Mac or a CPU-only server), it defaults to `Device::Cpu`.
The variable `dev` is then passed into other components such as `VarStore::new(dev)` so that all tensors are created and computed on the same device.
Creating the Training Data
Next, we define the input and output tensors for our tiny XOR neural network — its training set:
let x = Tensor::from_slice(&[
0f32,0.,1., 0.,1.,1., 1.,0.,1., 1.,1.,1.
]).reshape([4,3]).to_device(dev);
let y = Tensor::from_slice(&[0f32,1.,1.,0.]).reshape([4,1]).to_device(dev);
We start with a flat array of 12 numbers (4 × 3), describing four XOR samples. Each triplet of numbers is one example:
[0, 0, 1]
[0, 1, 1]
[1, 0, 1]
[1, 1, 1]
The first two values are binary inputs (X₁ and X₂), and the third is a constant bias input (always 1), helping the model separate data linearly.
Then `.reshape([4,3])` converts this flat array into a 4×3 matrix — four samples, each with three input features. Finally, `.to_device(dev)` moves the tensor to the selected device (GPU or CPU), ensuring all computations happen in one place.
The second tensor, `y`, contains the expected outputs for each input:
[0], [1], [1], [0]
These correspond to the XOR truth table:
| X₁ | X₂ | Y |
|----|----|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Network Hyperparameters
let n: i64 = 64;
let d: i64 = 16;
let u: f64 = 0.20;
let hebb_lr: f64 = 0.01;
let smax: f64 = 1.0;
let sparsity_thresh: f64 = 5e-3;
let lr: f64 = 5e-3;
let steps = 3000;
- `n = 64` — the size of the neural field (the number of neurons in the layer).
- `d = 16` — the low-rank dimension for the matrices `E` and `D`, defining how much the data is compressed and expanded.
- `u = 0.20` — the forgetting rate for the fast memory σ; higher values make it “forget” faster.
- `hebb_lr = 0.01` — the learning rate for Hebbian updates; it controls how strongly new activations modify σ.
Hebbian Memory: In BDH, memory is represented by a special connection matrix σ (sigma) — a temporary synaptic memory. It doesn’t store the model’s learned weights (those are handled by gradient descent). Instead, it remembers which neurons were active together, forming short-term associations — a kind of “working memory” active during inference.
Continuing:
- `smax = 1.0` — limits the maximum connection strength in σ, preventing runaway values.
- `sparsity_thresh = 5e-3` — zeroes out very small σ elements, keeping the memory sparse and stable.
- `lr = 5e-3` — the learning rate for the Adam optimizer that updates the regular model parameters (`E`, `D`, `R_in`, `W_read`).
- `steps = 3000` — the number of training iterations (how many times the model sees the data).
Initializing Parameters and the “Neural Field”
After defining our hyperparameters, we create a parameter store — a container that holds all trainable weights and biases of the network. Then we add the model’s learnable parameters — its “weights,” which will be updated during training:
let vs = nn::VarStore::new(dev);
let root = &vs.root();
let e = root.var("E", &[n,d], Init::Randn { mean: 0.0, stdev: 0.05 });
let dx = root.var("Dx", &[n,d], Init::Randn { mean: 0.0, stdev: 0.05 });
let dy = root.var("Dy", &[n,d], Init::Randn { mean: 0.0, stdev: 0.05 });
let r_in = root.var("R_in", &[3,n], Init::Randn { mean: 0.0, stdev: 0.20 });
let w_read = root.var("W_read", &[n,1], Init::Randn { mean: 0.0, stdev: 0.20 });
Each variable defines part of the BDH model:
- `r_in` — the input projection into the neural field.
- `E`, `Dx`, `Dy` — the internal transformations, analogous to the weights of a hidden layer. But remember: BDH doesn’t have layers in the usual sense — it’s more like a single self-connected field of neurons.
- `w_read` — the output projection, used to read the network’s final activations.
The Optimizer and Fast Memory
Next, we initialize the Adam optimizer, a popular variant of gradient descent that automatically tunes learning rates per parameter. We also create a tensor σ — a square `[n × n]` matrix filled with zeros. This represents BDH’s fast Hebbian memory, which stores temporary connections between neurons and is updated at every training step.
let mut opt = nn::Adam::default().build(&vs, lr)?;
let mut sigma = Tensor::zeros(&[n, n], (Kind::Float, dev));
for step in 0..steps {
...
}
Inside this training loop, we’ll add the code that teaches our “Dragon Hatchling” while it’s still in its egg — that is, during offline pretraining.
Forward Pass — The Dragon’s First Flight
The next code block performs the forward pass, the main computation step where inputs are transformed into outputs (`logits`):
let x_neu = x.matmul(&r_in);
let y1 = relu_lowrank_forward(&x_neu, &e, &dx);
let a = x_neu.matmul(&sigma.transpose(-1, -2));
let y2 = y1 + a;
let z = relu_lowrank_forward(&y2, &e, &dy);
let logits = z.matmul(&w_read);
Here’s what happens step by step:
- `x_neu = x.matmul(&r_in)` — the input data enters the neural field.
- `y1 = relu_lowrank_forward(...)` — the data is compressed, expanded, and passed through a ReLU activation. (We’ll define this helper function next.)
- `a = x_neu.matmul(&sigma.T)` — retrieves the additional signal from Hebbian memory σ, based on temporary neuron associations.
- `y2 = y1 + a` — merges the “current” signal with short-term memory — this is the core idea of BDH.
- `z` and `logits` — the final processing and output projection, combining both short-term and long-term knowledge of the model.
The output `logits` aren’t yet passed through a `sigmoid`; they represent the raw predictions before activation — the dragon’s unrefined thoughts before taking shape.
Low-Rank + ReLU Helper
As promised, here’s the ReLU helper we use in the forward pass:
/// y = ReLU( (x E) D^T )
fn relu_lowrank_forward(x: &Tensor, e: &Tensor, d: &Tensor) -> Tensor {
let h = x.matmul(e); // [B,n]·[n,d] = [B,d]
h.matmul(&d.transpose(-1, -2)).relu() // [B,d]·[d,n] = [B,n]
}
This is a low-rank linear layer with ReLU. Instead of a big dense matrix W ∈ R^{n×n}, we factor it as W ≈ E · Dᵀ with E ∈ R^{n×d}, D ∈ R^{n×d}, d ≪ n.
The idea is straightforward: you don’t need all possible synapses. Project into a compact latent space of size d
, then project back. For tiny demos like XOR this is mostly illustrative; for GPT-scale models the memory savings can be massive (terabytes at scale).
- The first matmul (`x.matmul(e)`) compresses the high-dimensional “neural field” (n features) into a latent space of size d.
- The second matmul expands it back to n as a linear combination of decoder patterns from D.
Together this acts like a single multiplication by W ≈ E · Dᵀ, but uses 2·n·d parameters instead of n².
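A quick sanity check with our toy sizes: for n = 64 and d = 16, a dense W would hold 64² = 4096 weights, while E and D together hold 2 · 64 · 16 = 2048. That is already half, and the gap widens as n grows while d stays small.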
Loss, Backprop, Step
Now let’s add the standard training step — compute the loss, run backprop, update weights:
let loss = logits
.binary_cross_entropy_with_logits::<Tensor>(&y, None, None, Reduction::Mean);
opt.zero_grad();
loss.backward();
opt.step();
These four lines are the heart of the training loop: measure error, compute how to fix the model, and apply the update. After each iteration, the network moves a little closer to the correct solution.
Hebbian Fast Memory Update (σ)
The last part — and really the core BDH twist — is the Hebbian fast-memory update. It runs outside autograd and keeps values stable:
tch::no_grad(|| {
let bsz = x.size()[0] as f64;
// 1) Build co-activation map: outer = y2ᵀ @ x_neu
let outer = y2
.detach() // detach from autograd
.transpose(-1, -2) // [B,n]ᵀ → [n,B]
.matmul(&x_neu.detach()) // [n,B] @ [B,n] → [n,n]
.to_kind(Kind::Float)
* (hebb_lr / bsz); // scale by batch size and Hebb LR
// 2) Work on a shallow copy to avoid move/borrow issues
let zeros = Tensor::zeros_like(&sigma);
let mut s = sigma.shallow_clone();
// 3) Exponential forgetting + add fresh co-activations
s *= 1.0 - u; // older σ fades out
s += &outer; // Hebbian boost for co-firing neurons
// 4) Safety rails: clamp to prevent blow-ups
// (I originally skipped this and hit runtime errors during training)
s = s.clamp(-smax, smax);
// 5) Sparsify: zero-out tiny values (efficiency + stability)
let keep = s.abs().ge(sparsity_thresh);
s = s.where_self(&keep, &zeros);
// 6) Row-wise normalization: stabilize the energy of σ @ x
let row_norm = s.square().sum_dim_intlist([1].as_ref(), true, Kind::Float).sqrt();
s = &s / &row_norm.clamp_min(1.0);
// 7) Write back into σ without changing ownership
sigma.copy_(&s);
});
Think of this as BDH’s working memory: it quickly adapts to the current context (Hebbian), gradually forgets old patterns (`u`), stays compact (sparsity), and remains numerically stable (clamp + normalization).
What We’ve Built
We’ve implemented a network with the two learning modes described in the paper:
- Slow learning — classic backprop that shapes the permanent weights (`E`, `D`, `R_in`, `W_read`).
- Fast learning — Hebbian updates of the σ matrix during inference/training.
We intentionally leave out the third piece — transferring fast memory into long-term weights — because, as the authors note, it’s not fully specified yet. Designing that mechanism is nontrivial and beyond the scope of this overview; even the research paper only sketches this direction at a high level.
How to Run It
# 1) Create the project and add the files
cargo new tiny_bdh_xor && cd tiny_bdh_xor
# (replace Cargo.toml and src/main.rs with the code above)
# 2) Build & run
cargo run --release
As expected, after a couple thousand steps the network converges (loss ↓, acc → 1.0) and predicts XOR correctly.
Logging to the Console
To make the training dynamics and results easy to inspect, let’s add some lightweight logging.
1) Progress every 300 steps
Print loss and accuracy during training:
if step % 300 == 0 {
let y_hat = logits.sigmoid();
let acc = y_hat.gt(0.5)
.eq_tensor(&y.gt(0.5))
.to_kind(Kind::Float)
.mean(Kind::Float)
.double_value(&[]);
println!("step {:4} loss {:.4} acc {:.2}", step, loss.double_value(&[]), acc);
}
2) Final predictions
After training, dump the model’s predictions:
let x_neu = x.matmul(&r_in);
let y1 = relu_lowrank_forward(&x_neu, &e, &dx);
let a = x_neu.matmul(&sigma.transpose(-1, -2));
let y2 = y1 + a;
let z = relu_lowrank_forward(&y2, &e, &dy);
let preds = z.matmul(&w_read).sigmoid().gt(0.5).to_kind(Kind::Int64);
println!("nPred:n{:?}", preds);
3) With vs. without fast memory (σ)
Compare predictions when the Hebbian memory is on vs off:
// σ = on
let probs = z.matmul(&w_read).sigmoid();
println!("nProbs (σ=on):");
probs.print();
println!("Preds (σ=on):");
preds.print();
// σ = off
let y1_nos = relu_lowrank_forward(&x_neu, &e, &dx);
let y2_nos = y1_nos; // no 'a' term from σ
let z_nos = relu_lowrank_forward(&y2_nos, &e, &dy);
let preds_nos = z_nos.matmul(&w_read).sigmoid().gt(0.5).to_kind(Kind::Int64);
println!("nPreds (σ=off):");
preds_nos.print();
:::tip
For the full working code, see the repository: https://github.com/ZhukMax/tiny_bdh_xor
:::
Build, Training, and Prediction Results
The model converges quickly, and you can see that:
- Probs (σ = on) are almost perfect: `[~0, 1, 1, ~0]`.
- Preds (σ = off) match — which is expected for XOR: it’s a static task solvable by the “slow” weights without fast memory.
Running `target/debug/tiny_bdh_xor`
step 0 loss 0.6931 acc 0.50
step 300 loss 0.0000 acc 1.00
step 600 loss 0.0000 acc 1.00
step 900 loss 0.0000 acc 1.00
step 1200 loss 0.0000 acc 1.00
step 1500 loss 0.0000 acc 1.00
step 1800 loss 0.0000 acc 1.00
step 2100 loss 0.0000 acc 1.00
step 2400 loss 0.0000 acc 1.00
step 2700 loss 0.0000 acc 1.00
Pred:
Tensor[[4, 1], Int64]
Probs (σ=on):
7.4008e-09
1.0000e+00
1.0000e+00
6.6654e-17
[ CPUFloatType{4,1} ]
Preds (σ=on):
0
1
1
0
[ CPULongType{4,1} ]
Preds (σ=off):
0
1
1
0
[ CPULongType{4,1} ]
Why σ Isn’t “Needed” for XOR
XOR is a simple Boolean function that the network can learn with its slow parameters (`E`/`Dx`/`Dy`/`R_in`/`W_read`). The Hebbian layer σ shines when there’s context over time — sequences, associations, “what happened earlier” — not when each sample is independent.
What to Try Next to See σ Pay Off
- Sequences (context memory): Predict the final symbol of a pair that appeared earlier in the same sequence (copy / associative recall).
- Long-range dependencies: Balanced-parentheses tasks — check pairing correctness across 20–100 steps.
- On-the-fly adaptation: During inference, “inject a new rule” (a token pair) and verify the model uses it without gradient updates (see the sketch after this list).
- σ ablations: Compare convergence speed/quality with σ on/off on harder prediction tasks. Log `nnz(σ)` and watch how connections strengthen/decay over time.
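To make the on-the-fly adaptation experiment concrete, here is a hedged sketch of what “injecting a rule” could look like in our toy code: writing directly into σ at inference time, with no gradient updates. The neuron indices are hypothetical, chosen purely for illustration; `sigma` is the tensor from the code above:

```rust
// Speculative sketch (not from the paper): manually strengthen one synapse
// σ[i][j] so the network treats neurons i and j as associated from now on.
tch::no_grad(|| {
    let (i, j): (i64, i64) = (3, 7);             // hypothetical neuron indices
    let mut cell = sigma.get(i).narrow(0, j, 1); // zero-copy view of σ[i][j]
    let _ = cell.fill_(0.5);                     // write the association by hand
});
// Every forward pass after this sees the new link via x_neu.matmul(&sigma.T),
// so the "rule" takes effect immediately, with no backprop involved.
```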
The AI Incubator Is Near (Conclusions)
BDH isn’t just “another alternative to transformers.” It’s a glimpse into the next era of neural architectures — ones that learn not on schedule, but in the moment of action. Instead of waiting for retraining or requiring terabytes of data, BDH adjusts itself during reasoning, in real time.
If transformers are like “students” who completed a course and earned their diploma, then BDH is a dragon hatchling — freshly born, exploring the world, making mistakes, adapting, and remembering everything new it encounters.
This direction brings AI back to its original spirit: not just to compute probabilities, but to think within context and experience.