Stories by Jay Prajapati on Medium

DeepSeek mHC Paper Breakdown: Architecture, Math, and Kernel Fusion

Jay Prajapati — Mon, 19 Jan 2026 13:07:19 GMT

Scaling the information capacity of Large Language Models (LLMs) has traditionally demanded a heavy price: if you want a smarter model, you must pay for more parameters and higher latency. Standard residual architectures enforce a rigid one-lane flow of information, meaning any increase in width directly spikes computational cost.

The mHC advantage: Turning the standard Residual bottleneck into a stable 4-lane superhighway. Generated By Gemini Nano Banana

mHC (Manifold-Constrained Hyper-Connections) is a novel architecture that breaks this correlation. By introducing Manifold Constraints to Hyper-Connections, DeepSeek has created a system that allows parallel information streams to coexist stably within a single layer. Unlike unconstrained Hyper-Connections, which can suffer from signal and gradient explosion at depth, mHC mathematically guarantees signal stability, allowing us to quadruple the model’s bandwidth while keeping the inference cost virtually identical to standard models.

In this post we will explore the technical details behind mHC, covering the following topics:

Architecture of mHC:
The Manifold Constraint:
Using the Birkhoff Polytope to enforce the Identity Property.
Applying the Sinkhorn-Knopp Algorithm to normalise mixing matrices.
Implementation & Speed: How it circumvents the Memory Wall using Kernel Fusion and TileLang to fuse operations directly in SRAM.
The Free Lunch Paradigm: Achieving 4x information capacity without increasing the compute budget.

Residual vs. Hyper-Connections vs. Manifold-Constrained HC. Adapted from the DeepSeek Paper.

Intuition Behind mHC: How it Expands Intelligence Without the Cost

Step 1: Treating the Residual Stream as a Multi-Lane Highway

Before diving deep into the mHC architecture, let’s walk through the intuition behind how it works.

Imagine a standard Model (e.g. Llama, GPT, ResNet) as a single highway. Every type of information — grammar, facts, logic, tone etc. — must travel in this one narrow lane (the residual vector x).

mHC changes this by expanding that single lane into a 4-lane superhighway. Instead of forcing all information to compete for space in a single vector, we split the input into multiple independent branches (subspaces).

Think of it as allocating specific lanes for specific traffic: Lane 1 for Syntax, Lane 2 for Logic, Lane 3 for Facts, and so on. They travel in parallel rather than jamming into one queue.

Why 4 Lanes and Not More?

You might wonder, if 4 lanes make the model smarter, why not build 8, 16, or 100 lanes? The answer lies in the Memory Wall. While wider highways theoretically offer more capacity, the cost of moving that data (I/O) scales linearly. DeepSeek found that 𝑛 = 4 is the precise engineering limit where the performance gain justifies the cost. At 4 lanes, the custom Fused Kernels can still fit the entire operation inside the GPU’s ultra-fast L1 Cache (SRAM), keeping the training overhead to a manageable 6.7%. Pushing beyond this Sweet Spot creates a tipping point: the data volume would exceed the SRAM’s capacity, forcing the system to spill data into slow global memory (HBM). This would drastically spike the computational overhead (likely >13%), yielding a scenario where the cost of maintaining the highway consumes the resources needed to run the traffic.

Comparison of Memory Access Costs Per Token. Adapted from the DeepSeek Paper.

Step 2: Creating Hyper-Connections (Subspaces)

Once the input is split, mHC creates Hyper-Connections between layers. These are parallel paths that allow information to evolve independently.

Each lane (subspace) captures a specific portion of the data’s features. This division helps the model maintain distinct trains of thought without them overwriting each other, addressing the ambiguity problem found in standard models.

How is this different from just making the model wider? At first glance, splitting a vector into branches sounds like just making the model wider (increasing the hidden dimension d). However, there is a massive difference in cost and stability.

The Cost Trap of Standard Width: In a standard Transformer, if you want 4x the width (capacity), the cost of every dense layer increases by 16x (quadratic scaling, O(d²)), because every neuron has to connect to every other neuron. It is computationally expensive and inefficient.

The mHC Advantage: mHC achieves that same 4x effective width but keeps the compute cost almost identical (roughly 1.06x).

Sparse Interaction: Instead of a dense everyone-talks-to-everyone matrix, mHC uses sparse, diagonal-heavy mixing.
Dedicated Subspaces: By mathematically isolating the streams, gradients don’t cancel each other out during training. A gradient update for Logic doesn’t mess up the weights for Syntax.
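
To make the scaling argument concrete, here is a back-of-the-envelope sketch in Python. The width `d = 4096` and the two cost functions are illustrative assumptions, not figures from the paper. They count only multiply-adds for one dense projection and for one n x n stream mix, so they will not reproduce the measured 1.06x figure. They only show why mixing a handful of streams is cheap next to widening a dense layer.

```python
# Rough per-token multiply-add counts. All numbers are illustrative assumptions.

def dense_cost(d: int) -> int:
    """Multiply-adds for one dense d x d projection applied to one token."""
    return d * d

def stream_mixing_cost(n: int, d: int) -> int:
    """Multiply-adds to mix n streams of width d with an n x n matrix."""
    return n * n * d

d = 4096   # hypothetical hidden width
n = 4      # number of streams (lanes)

widen_ratio = dense_cost(4 * d) / dense_cost(d)
mix_overhead = stream_mixing_cost(n, d) / dense_cost(d)

print(f"dense layer at 4x width costs {widen_ratio:.0f}x the original")
print(f"mixing {n} streams adds {mix_overhead:.2%} of one dense projection")
```

The first ratio is the quadratic trap described above. The second stays small because the mixing matrix is only n x n, whatever the width of each stream.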

Step 3: The Manifold Constraint (The Traffic Controller)

The danger of having multiple lanes is Signal Explosion. If Lane 1 shouts at Lane 2 and Lane 2 shouts back, the noise amplifies layer by layer until training diverges (Gradient Explosion).

mHC solves this by enforcing a Manifold Constraint using a Doubly Stochastic Matrix.

Rule 1 (Row Sum = 1): Total information entering a lane can’t exceed 100%.
Rule 2 (Column Sum = 1): Total information leaving a lane can’t exceed 100%.

This acts like a Conservation of Energy law for the AI. It ensures that the signal stays perfectly stable, preserving the Identity Property even if the network is 1000 layers deep.

Step 4: The Free Lunch — Kernel Fusion

Finally, solving the math for step 3 (Sinkhorn Normalisation) is usually too slow for real-time AI. mHC uses Kernel Fusion via TileLang to solve this.

Instead of reading and writing to memory for every calculation, mHC fuses the normalisation and mixing steps into a single operation inside the GPU’s ultra-fast cache (SRAM).

The Result: The model gets the intelligence of a massive network and the stability of ResNet, all with the speed of a standard, smaller model.

Let’s Dive Deeper Now…

mHC Architecture: Detailed Exploration

mHC’s architecture is a fundamental rethinking of how information flows through a Transformer. Instead of a single Residual Highway, mHC splits the flow into parallel, mathematically constrained streams. This allows the model to expand its capacity (4x) without the signal explosion that typically kills wide networks.

The Hyper-Connection Layer: Subspaces and Parallel Processing. Standard Transformers use a single vector 𝘹 to carry all information (syntax, semantics, logic). mHC splits this vector into multiple Subspaces (branches). Each branch is treated as an independent stream of thought that evolves separately but interacts through a Mixer.

Mathematical Intuition of Subspaces:

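The output of the layer is determined by mixing these streams using a weight matrix W. If the n input streams are x₁ … xₙ and the output streams are y₁ … yₙ, each output is a weighted sum of the inputs:

yᵢ = Σⱼ Wᵢⱼ xⱼ   (in matrix form, Y = W X)

Here W is the Router that decides if Stream 1 (Syntax) should share information with Stream 2 (Logic).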
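
The mixing step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the streams are stacked as the rows of a matrix `X` and mixed as `Y = W @ X`. The stream width of 8 and the particular diagonal-heavy `W` are hypothetical values chosen for readability, not weights from the paper.

```python
import numpy as np

n, d = 4, 8                      # 4 streams, hypothetical width 8 per stream
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))      # row i holds stream i

# Hypothetical diagonal-heavy mixing matrix: each stream keeps 85% of its own
# content and takes 5% from each of the other three streams.
W = np.full((n, n), 0.05) + 0.80 * np.eye(n)

Y = W @ X                        # y_i = sum_j W[i, j] * x_j

print("row sums   :", W.sum(axis=1))
print("column sums:", W.sum(axis=0))
print("total signal preserved:", np.allclose(Y.sum(axis=0), X.sum(axis=0)))
```

Because every column of this `W` sums to 1, the sum of all streams is the same before and after mixing. Information is redistributed between lanes, not created.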

The Manifold Constraint: Enforcing the Traffic Rules
The danger of mixing streams is Signal Explosion. If W is unconstrained (as in standard Hyper-Connections), values can grow exponentially with depth: a gain of just 1.1 per layer compounds to 1.1¹⁰⁰ ≈ 13,780 after 100 layers. To solve this, mHC forces the mixing matrix W to live on the Birkhoff Polytope. This is a fancy way of saying W must be a Doubly Stochastic Matrix.

Why Doubly Stochastic? It acts like a Conservation Law for the AI.

Row Sum = 1: The total information entering a stream cannot exceed 100%.
Col Sum = 1: The total information leaving a stream cannot exceed 100%.

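Mathematical Intuition of Stability:
For the signal to be stable, the Eigenvalues (λ) of the mixing matrix must not exceed 1 in magnitude:

|λ(W)| ≤ 1

If we enforce the constraint that every entry is non-negative and every row and column sums to 1:

Wᵢⱼ ≥ 0,   Σⱼ Wᵢⱼ = 1 for every row i,   Σᵢ Wᵢⱼ = 1 for every column j

Then, by the Birkhoff-von Neumann Theorem, W is a weighted average of permutation matrices, so it is non-expansive: ‖W x‖ ≤ ‖x‖ for any input x. The output Y is a blend of reshuffled copies of the input streams, so it can never explode in length. The product of two doubly stochastic matrices is again doubly stochastic, so the guarantee carries through a whole stack of layers. This preserves the Identity Property even at 1000 layers deep.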
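
A small numerical experiment makes the stability claim tangible. The sketch below is an illustration under stated assumptions, not the paper's experiment. It builds each layer's mixing matrix as a random weighted average of permutation matrices, which is doubly stochastic by construction, and compares it with a hypothetical "leaky" variant whose rows and columns sum to 1.1 instead of 1. The helper name `random_doubly_stochastic` and the choice of one scalar per stream are made up for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 1000

def random_doubly_stochastic(n, rng, k=6):
    """Weighted average of k random permutation matrices."""
    weights = rng.dirichlet(np.ones(k))          # non-negative, sums to 1
    W = np.zeros((n, n))
    for w in weights:
        W += w * np.eye(n)[rng.permutation(n)]   # add one permutation matrix
    return W

x0 = rng.normal(size=n)       # one scalar per stream, for simplicity
x_constrained = x0.copy()
x_leaky = x0.copy()

for _ in range(depth):
    W = random_doubly_stochastic(n, rng)
    x_constrained = W @ x_constrained        # rows and columns sum to 1
    x_leaky = (1.1 * W) @ x_leaky            # rows and columns sum to 1.1

print("input norm      :", np.linalg.norm(x0))
print("constrained norm:", np.linalg.norm(x_constrained))
print("leaky norm      :", np.linalg.norm(x_leaky))
```

After 1000 layers the constrained signal is no larger than the input, while the 10% leak has compounded into an astronomically large value.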

Let’s understand in detail the intuition behind the Free Lunch …

Kernel Fusion & The Sinkhorn Algorithm:
One of the most fascinating challenges in mHC is that forcing a matrix to be Doubly Stochastic is computationally expensive. You can’t just set the weights; you have to normalise them iteratively during every forward pass.

The Problem: The Memory Wall
To turn an unconstrained matrix R into a stable matrix W, we use the Sinkhorn-Knopp Algorithm, which repeatedly divides each row by its sum and then each column by its sum until both sets of sums settle at 1.

Standard Implementation: Read Matrix→Sum Rows→Divide→Write Back→Read Again…
The Cost: This repeated reading/writing to the GPU’s main memory (HBM) is incredibly slow. It makes the model 10x slower.
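
For reference, the unfused version of the algorithm is only a few lines of NumPy. This is a minimal sketch, not DeepSeek's kernel. The exponential used to make the entries positive and the default of 20 iterations are assumptions chosen for the example.

```python
import numpy as np

def sinkhorn_knopp(R, n_iters=20):
    """Push an unconstrained matrix towards a doubly stochastic one."""
    W = np.exp(R - R.max())                      # make every entry strictly positive
    for _ in range(n_iters):
        W = W / W.sum(axis=1, keepdims=True)     # each row now sums to 1
        W = W / W.sum(axis=0, keepdims=True)     # each column now sums to 1
    return W

rng = np.random.default_rng(0)
R = rng.normal(size=(4, 4))                      # unconstrained 4 x 4 parameters
W = sinkhorn_knopp(R)

print("max row-sum error   :", np.abs(W.sum(axis=1) - 1).max())
print("max column-sum error:", np.abs(W.sum(axis=0) - 1).max())
```

Each pass through the loop touches the whole matrix twice. Run naively on a GPU, every one of those passes is a separate round trip to main memory, which is the cost described above.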

The Solution: Kernel Fusion (TileLang)
Kernel Fusion is a technique where we combine multiple mathematical operations into a single GPU command. DeepSeek used TileLang to fuse the entire Sinkhorn loop into one kernel.
Intuition Behind the Speed Up:
Instead of moving data back and forth to slow memory (HBM), we load the matrix Tile into the GPU’s ultra-fast L1 Cache (SRAM).

Load: Load a chunk of the matrix into SRAM.
Loop: Run the Sinkhorn normalization (Row/Col division) entirely inside the fast cache.
Compute: Multiply the stable matrix by the input vector.
Write: Send only the final result back to main memory.
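
The four steps above can be turned into a toy traffic model. The sketch below is a deliberately simplified assumption, not a measurement. It counts how many matrix or vector elements cross the boundary to main memory (HBM) for a single n x n mixing matrix, first when every step is its own kernel and then when the whole loop stays on-chip. The function names and the accounting are hypothetical.

```python
def unfused_traffic(n: int, iters: int) -> int:
    """Elements moved to/from HBM when every step is a separate kernel."""
    matrix = n * n
    # Each iteration has a row pass and a column pass, and each pass
    # reads the matrix from HBM and writes it back.
    per_iter = 2 * (matrix + matrix)
    # Final mixing step: read W, read the n inputs, write the n outputs.
    return iters * per_iter + matrix + 2 * n

def fused_traffic(n: int) -> int:
    """Elements moved when the Sinkhorn loop and the mixing stay in SRAM."""
    # Read R and the n inputs once, write the n outputs once.
    return n * n + 2 * n

n, iters = 4, 20
print("unfused:", unfused_traffic(n, iters), "elements")
print("fused  :", fused_traffic(n), "elements")
```

The arithmetic is identical in both cases. Only the number of trips to slow memory changes, which is the whole point of fusing the kernel.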

Bonus: Why did DeepSeek use TileLang over CUDA? Manually writing these kernels in CUDA is a nightmare of memory management. TileLang allows engineers to simply define the Tiling Strategy (how to chop up the matrix for processing) in high-level code. The compiler then generates the optimised machine code automatically. This innovation transforms the memory access cost from a prohibitive bottleneck into a negligible 6.7% overhead.

Results: Punching Above Its Weight Class

Scaling properties of mHC compared with the baseline. Adapted from the DeepSeek Paper.

In the paper, the DeepSeek team compared a 27B mHC model against a standard 27B baseline to measure architectural efficiency. The results show that mHC is scalable and stable: it scores 51.0 on the BBH benchmark, an improvement of 14.12% over the Standard Baseline score.

This innovation allows a medium-sized AI (27B) to punch far above its weight class. It solves complex logic puzzles significantly better than standard models of the same size, effectively making it smarter without making it larger or slower.

Future Directions: Beyond the Birkhoff Polytope

The mHC architecture is likely just the Hello World of a new era in Deep Learning design. While the paper establishes a new baseline, the authors and the community see three major avenues for evolution.

1. DeepSeek’s Theoretical Roadmap: Future Directions from the Paper
The authors frame mHC not as a final product, but as a proof of concept for Manifold-Aware Architecture. They explicitly highlight three paths:

Exploring Diverse Manifolds: Currently, mHC uses the Birkhoff Polytope (Doubly Stochastic Matrices) because it is mathematically symmetric. However, the authors suggest this is just one option. Future work will likely explore distinct geometric constraints tailored to specific tasks.
The Vision: Imagine a Creative Manifold for Storytelling models and a Logic Manifold for Math models, each using different geometric constraints to optimise the flow of information.
Optimising the Plasticity vs. Stability Trade-off: There is always a tension in AI: Plasticity (learning new things quickly) vs. Stability (not crashing). By restricting the weights to a manifold, we gain stability, but do we lose some plasticity?
The Goal: Finding the Sweet Spot manifold that allows the model to be even more flexible than the current design while maintaining the rigorous safety of the Identity Property.
Reviving Macro-Architecture Design: For the last few years, AI research has obsessed over Micro-Design (tweaking the insides of an Attention Head or an FFN block). DeepSeek hopes mHC will restore community interest in macro-architecture.
The Shift: Moving focus away from the neuron and towards the Global Topology, how layers connect over long distances.

2. Our Perspective: The Uncharted Frontiers Ahead

Solving the Depth Limit in Vision (ViTs): Vision Transformers (ViTs) struggle to go deeper than ~30 layers because visual signals tend to over-smooth (become identical).
The Prediction: The Identity Property preserved by mHC is exactly what ViTs need to scale to 100+ layers, potentially unlocking a new level of high-fidelity image understanding and generation.
The Manifold-MoE Hybrid: Current Mixture-of-Experts (MoE) models route tokens to experts. mHC creates distinct subspaces (streams) of information.
The Prediction: Combining these could allow for Trillion-Parameter models where specific Streams are routed to specific Experts. (e.g., The Syntax Stream routes to a Grammar Expert, while the Logic Stream routes to a Math Expert).

Conclusion: Building mHC — From Chaos to Constrained Intelligence

The iterative construction of mHC highlights how combining high-bandwidth architecture (Hyper-Connections), rigorous linear algebra (Manifold Constraints), and hardware optimisation (Kernel Fusion) led to a new paradigm in Deep Learning. Each step solved a critical bottleneck that previously made scaling impossible:

Hyper-Connections provided the structural potential, creating a 4-Lane Superhighway for parallel information processing.
The Manifold Constraint (Birkhoff Polytope) acted as the traffic controller, strictly enforcing the Identity Property to prevent the signal explosion that kills standard wide networks.
Kernel Fusion (via TileLang) broke the Memory Wall, allowing us to run complex normalisation algorithms in real-time without slowing down the GPU.

By moving from a single residual lane to a mathematically constrained manifold, DeepSeek has proven that we can quadruple the information capacity of an AI without quadrupling the cost.

Thank You for Reading!

If you made it this far, you are part of the small percentage of people who truly care about the mechanics of Intelligence, not just the hype.

Stay tuned for Part 2, where we will leave the analogies behind and walk through the rigorous mathematical proofs that make this architecture robust.

Until then, keep optimising.

Originally published at https://jay9122.substack.com.

DINO by Meta: How AI Learns Without Labels (Self-Supervised Learning)

Jay Prajapati — Fri, 17 Oct 2025 11:27:06 GMT

Evolution Of Dino Model Family. Created with Gemini Nano Banana.

The DINO (self-DIstillation with NO labels) family pioneered self-supervised learning for image models. Meta AI released DINO in April 2021, and it gained traction at ICCV 2021 because it trains on images without any labels. DINO is like a BERT for images: it can find pictures very similar to your query just by comparing their feature vectors.

During training, DINO’s features spontaneously developed the ability to clearly distinguish the boundaries of objects within an image. To scale this into a universal feature set, from identifying an image’s category to mapping out pixel-level details, DINOv2 was introduced in April 2023. It pre-trained a 1B-parameter ViT and became a visual foundation model, much as GPT is a foundational LLM. Its core innovation is Frozen Features, which can serve various tasks without task-specific fine-tuning, such as Depth Estimation, Semantic Segmentation, Dense Matching, and Sparse Matching.

DINOv2 suffers from detail degradation when these huge models are trained for long periods. To overcome this issue, DINOv3 was introduced (Aug 2025) with a stabilization technique called Gram anchoring, and trained a 7B model. The result is a model that yields exceptionally high-quality dense features. While DINOv2 provided a good, smooth map of the scene’s geometry, DINOv3 delivers a precise, high-resolution topographical map of the scene, enabling superior performance for applications like robotics or 3D reconstruction that rely on exact geometric data.

DINO Algorithm Explanation

DINO Algorithm Represented in Official GitHub Repository

There are 3 breakthrough innovations implemented by Meta AI in the DINO algorithm.

1. The Student and the Teacher:

The core of DINO involves two neural networks: a student network and a teacher network. Both have the same structure (e.g., a ViT), but they have different internal parameters.
The student network is the one actively learning and getting smarter.
The teacher network acts like a wiser, more stable mentor. Its job is to provide good examples for the student to learn from.

2. Multi-Crop Training:

For any single input image, the algorithm doesn’t just look at it once. Instead, it creates multiple different crops of that image.
These views include two larger, global views (e.g., covering more than 50% of the original image) and several smaller, local views (e.g., covering less than 50%).
The student network processes all of these views, while the teacher network processes only the two global views.
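
The view routing can be sketched with plain NumPy crops. This is a minimal illustration under assumptions. The area ranges follow the "more than 50%" and "less than 50%" description above, the count of six local views is a hypothetical choice, and the resizing and colour augmentations that a real pipeline would apply are left out.

```python
import numpy as np

def random_crop(image, area_range, rng):
    """Cut a random crop covering roughly a given fraction of the image area."""
    h, w = image.shape[:2]
    side = np.sqrt(rng.uniform(*area_range))      # side ratio for the sampled area
    ch, cw = max(1, int(h * side)), max(1, int(w * side))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw]

def multi_crop(image, rng, n_local=6):
    """Two large global views plus several small local views."""
    global_views = [random_crop(image, (0.5, 1.0), rng) for _ in range(2)]
    local_views = [random_crop(image, (0.05, 0.5), rng) for _ in range(n_local)]
    return global_views, local_views

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                 # stand-in for a real photo

global_views, local_views = multi_crop(image, rng)
student_inputs = global_views + local_views       # the student sees every view
teacher_inputs = global_views                     # the teacher sees global views only

print(len(student_inputs), "student views,", len(teacher_inputs), "teacher views")
```

Because the student must match the teacher's output from a small patch while the teacher sees most of the image, the student is pushed to infer the whole from a part.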

3. Output Generation:

Both the student and teacher networks process their respective views and produce an output, a kind of summary or representation of what they see in that view.
The student’s goal is to match the teacher’s output, measured with a cross-entropy loss.
Based on the matching result, the student updates its parameters. The teacher network doesn’t learn directly through this matching process. Instead, its parameters are updated slowly as an exponential moving average (EMA) of the student’s past parameters, which is why the teacher is known as a momentum encoder.
DINO prevents collapse (e.g. always outputting the same thing for every image) by applying Centring & Sharpening.

Centring stops the network from focusing too much on just one type of feature or always producing the same boring output.

Sharpening makes the teacher’s judgements more distinct and confident, which helps guide the student towards learning specific features.
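
Put together, the loss and the two slow updates fit in a short NumPy sketch. It is a simplified illustration, not Meta's implementation. The temperatures (0.1 for the student, 0.04 for the teacher) and the momentum values are commonly quoted settings used here as assumptions, and the teacher "weights" are a toy three-element array.

```python
import numpy as np

def softmax(z, temp):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    # Centring: subtract a running mean of teacher outputs.
    # Sharpening: a low teacher temperature makes the targets peaky.
    targets = softmax(teacher_logits - center, t_teacher)
    log_probs = np.log(softmax(student_logits, t_student))
    return -(targets * log_probs).sum(axis=-1).mean()   # cross-entropy

def ema_update(old, new, momentum):
    """Exponential moving average: keep most of the old value, blend in a little new."""
    return momentum * old + (1.0 - momentum) * new

rng = np.random.default_rng(0)
batch, k = 8, 16                                  # hypothetical batch size and output dim
teacher_logits = rng.normal(size=(batch, k))
student_logits = rng.normal(size=(batch, k))
center = np.zeros(k)

loss = dino_loss(student_logits, teacher_logits, center)
center = ema_update(center, teacher_logits.mean(axis=0), momentum=0.9)
teacher_w = ema_update(np.zeros(3), np.ones(3), momentum=0.996)   # toy teacher weights

print("loss:", loss)
print("teacher weights after one EMA step:", teacher_w)
```

Centring and sharpening pull in opposite directions: the first discourages one output dimension from dominating, the second discourages a flat, uniform output. Using both is what keeps the student and teacher from collapsing.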

What Improved in DinoV2 from DinoV1

https://medium.com/media/c7191243d13c851482e3707e3e80216f/href

DINOv1 proved that self-supervision works well with ViTs. To scale SSL in both data quantity and model size, the DINOv2 authors built the curated LVD-142M dataset and combined the DINO self-distillation objective with ideas from iBOT.
Together, these changes scaled the model size from 85M to 1B parameters.
These innovations let DINOv2 produce general-purpose visual features that work extremely well across various tasks, both image-level (like classification) and pixel-level (like segmentation or depth estimation).

What Improved in DinoV3 from DinoV2

https://medium.com/media/8e5141043b9686981fc76f72cf35df1e/href

While DINOv2 was excellent, scaling SSL models (especially with very long training schedules) introduced a problem: the dense feature maps degraded or collapsed over time, losing their fine-grained spatial accuracy.

To overcome this, DINOv3 used an even larger curated dataset (LVD-1689M, over 1.6 billion images) and trained a giant ViT model with 7B parameters.
To mitigate the collapse of dense feature maps during long, large-scale training, they introduced Gram Anchoring.
In short, Gram Anchoring is a technique that makes sure the details stay sharp and organized, even while the student learns bigger, fancier ideas about all the other pictures.
Due to Gram Anchoring, DINOv3 delivers a precise, high-resolution topographical map of the scene, enabling superior performance for applications like robotics or 3D reconstruction that rely on exact geometric data.
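
One way to picture the idea is as a penalty on changes in how patches relate to each other. The sketch below is a rough illustration under assumptions, not the DINOv3 code. It assumes the loss compares the Gram matrix (pairwise cosine similarities between patch features) of the student with that of a reference "anchor" model, and the patch count, feature size and mean-squared scaling are hypothetical.

```python
import numpy as np

def gram_matrix(patch_features):
    """Pairwise cosine similarities between patches: (P, D) -> (P, P)."""
    norms = np.linalg.norm(patch_features, axis=1, keepdims=True)
    normed = patch_features / norms
    return normed @ normed.T

def gram_anchoring_loss(student_patches, anchor_patches):
    """Mean squared difference between the two similarity structures."""
    diff = gram_matrix(student_patches) - gram_matrix(anchor_patches)
    return np.mean(diff ** 2)

rng = np.random.default_rng(0)
P, D = 16, 32                                     # hypothetical patch count and feature size
anchor = rng.normal(size=(P, D))

Q, _ = np.linalg.qr(rng.normal(size=(D, D)))      # random orthogonal transform
rotated = anchor @ Q                              # features change, similarities do not
scrambled = rng.normal(size=(P, D))               # unrelated features

print("rotated features  :", gram_anchoring_loss(rotated, anchor))
print("scrambled features:", gram_anchoring_loss(scrambled, anchor))
```

The rotated features score close to zero because the loss only looks at relationships between patches. The student stays free to change the features themselves as long as the spatial similarity structure is kept.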

Future Directions

DINOv3 has a significant performance gap on OCR-heavy classification tasks because it does not leverage paired image-text data during training. The authors acknowledge that better handling of text recognition purely through SSL remains an open challenge.

Conclusion

DINO started as a self-distillation method, leveraging the momentum encoder and multi-crop for ViTs. DINOv2 scaled SSL using curated data and combining DINO/iBOT losses, achieving 1B parameters. DINOv3 introduced Gram anchoring, enabling training up to 7B parameters for versatile, high-quality dense features.