<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Jay Prajapati on Medium]]></title>
        <description><![CDATA[Stories by Jay Prajapati on Medium]]></description>
        <link>https://medium.com/@jay9122?source=rss-8288fa2632b5------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*OCLm57mL-uCZ3-vCiNUPDw.png</url>
            <title>Stories by Jay Prajapati on Medium</title>
            <link>https://medium.com/@jay9122?source=rss-8288fa2632b5------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 19 Jun 2026 09:18:21 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jay9122/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[DeepSeek mHC Paper Breakdown: Architecture, Math, and Kernel Fusion]]></title>
            <link>https://medium.com/@jay9122/deepseek-mhc-paper-breakdown-architecture-math-and-kernel-fusion-3975d5042bef?source=rss-8288fa2632b5------2</link>
            <guid isPermaLink="false">https://medium.com/p/3975d5042bef</guid>
            <category><![CDATA[deepseek]]></category>
            <category><![CDATA[ai-model-architecture]]></category>
            <category><![CDATA[research-paper-breakdown]]></category>
            <category><![CDATA[hyper-connection]]></category>
            <dc:creator><![CDATA[Jay Prajapati]]></dc:creator>
            <pubDate>Mon, 19 Jan 2026 13:07:19 GMT</pubDate>
            <atom:updated>2026-01-19T13:07:19.585Z</atom:updated>
            <content:encoded><![CDATA[<p>Scaling the information capacity of LLMs (Large Language Models) has traditionally demanded a heavy price: if you want a smarter model, you must pay for more parameters and higher latency. Standard Residual architectures enforce a rigid <em>one-lane</em> flow of information, meaning any increase in width directly spikes computational cost.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iGRxVvMFL6YSFXxF.png" /><figcaption>The mHC advantage: Turning the standard Residual bottleneck into a stable 4-lane superhighway. Generated By Gemini Nano Banana</figcaption></figure><p>Enter <strong>mHC (Manifold-Constrained Hyper-Connections)</strong>, a novel architecture that breaks this correlation. By introducing <strong>Manifold Constraints</strong> to Hyper-Connections, DeepSeek has created a system that allows parallel information streams to coexist stably within a single layer. Unlike unconstrained Hyper-Connections, which can suffer from signal and gradient explosion at depth, mHC mathematically guarantees signal stability, allowing us to quadruple the model’s bandwidth while keeping the inference cost virtually identical to standard models.</p><p>In this blog we will explore the technicalities behind mHC, covering the following topics:</p><ul><li><strong>Architecture of mHC:</strong> How Hyper-Connections split the residual stream into parallel subspaces.</li><li><strong>The Manifold Constraint:</strong> Using the <strong><em>Birkhoff Polytope</em></strong> to enforce the Identity Property, and applying the <strong><em>Sinkhorn-Knopp Algorithm</em></strong> to normalise mixing matrices.</li><li><strong>Implementation &amp; Speed:</strong> How it circumvents the Memory Wall using <strong>Kernel Fusion</strong> and <strong>TileLang</strong> to fuse operations directly in SRAM.</li><li><strong>The Free Lunch Paradigm:</strong> Achieving <strong>4x information capacity</strong> without increasing the compute budget.</li></ul><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/0*F4Wqd1clEpLAPn3O.png" /><figcaption><strong>Residual vs. Hyper-Connections vs. Manifold-Constrained HC. Adapted from the </strong><a href="https://arxiv.org/pdf/2512.24880"><strong>DeepSeek Paper</strong></a><strong>.</strong></figcaption></figure><h3>Intuition Behind mHC: How it Expands Intelligence Without the Cost</h3><h4>Step 1: Treating the Residual Stream as a Multi-Lane Highway</h4><p>Before diving deep into the mHC architecture, let’s walk through the intuition behind how it works.</p><p>Imagine a standard model (e.g. Llama, GPT, ResNet) as a single highway. Every type of information (grammar, facts, logic, tone, etc.) must travel in this one narrow lane (the residual vector x).</p><p>mHC relieves this bottleneck by expanding that single lane into a 4-lane superhighway. Instead of forcing all information to compete for space in a single vector, we split the input into multiple independent branches (subspaces).</p><p>Think of it as allocating specific lanes for specific traffic: Lane 1 for Syntax, Lane 2 for Logic, Lane 3 for Facts, etc. They travel in parallel rather than jamming into one queue.</p><blockquote><strong><em>Why 4 Lanes and Not More?</em></strong></blockquote><blockquote><em>You might wonder, if 4 lanes make the model smarter, why not build 8, 16, or 100 lanes? The answer lies in the </em><strong>Memory Wall<em>.</em></strong><em> While wider highways theoretically offer more capacity, the cost of moving that data (I/O) scales linearly with the number of lanes. DeepSeek found that 𝑛 = 4 is the practical engineering limit where the performance gain justifies the cost. At 4 lanes, the custom </em><strong>Fused Kernels</strong><em> can still fit the entire operation inside the GPU’s ultra-fast L1 Cache (SRAM), keeping the training overhead to a manageable </em><strong><em>6.7%</em></strong><em>. 
Pushing beyond this </em><strong>Sweet Spot</strong><em> creates a tipping point: the data volume would exceed the SRAM’s capacity, forcing the system to </em><strong>spill</strong><em> data into slow global memory (HBM). This would drastically spike the computational overhead (likely &gt;13%), yielding a scenario where the cost of maintaining the </em><strong>highway</strong><em> consumes the resources needed to run the traffic.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/909/0*03GN1quaKi79J52T.png" /><figcaption>Comparison of Memory Access Costs Per Token. Adapted from the <a href="https://arxiv.org/pdf/2512.24880">DeepSeek Paper</a>.</figcaption></figure><h4>Step 2: Creating Hyper-Connections (Subspaces)</h4><p>Once the input is split, <strong>mHC</strong> creates <strong>Hyper-Connections</strong> between layers. These are parallel paths that allow information to evolve independently.</p><p>Each lane (subspace) captures a specific portion of the data’s features. This division helps the model maintain distinct <em>trains of thought</em> without them overwriting each other, addressing the <em>ambiguity</em> problem found in standard models.</p><blockquote><strong><em>How is this different from just making the model wider?</em></strong><em> </em>At first glance, splitting a vector into branches sounds like just making a model <strong>Wider</strong> (increasing the hidden dimension d). However, there is a massive difference in cost and stability. <strong>The Cost Trap of Standard Width: </strong>In a standard Transformer, if you want <strong>4x</strong> the width (capacity), the computational cost increases by <strong>16x</strong> (Quadratic Scaling, <strong>O(d²)</strong>). Every neuron has to connect to every other neuron. 
It is computationally expensive and inefficient.</blockquote><blockquote><strong>The mHC Advantage: </strong>mHC achieves that same <strong>4x</strong> effective width but keeps the compute cost almost identical (<strong>1.06x</strong>).</blockquote><ul><li><strong><em>Sparse Interaction:</em></strong><em> Instead of a dense </em><strong><em>everyone-talks-to-everyone</em></strong><em> matrix, mHC uses sparse, </em><strong><em>diagonal-heavy mixing</em></strong><em>.</em></li><li><strong><em>Dedicated Subspaces:</em></strong><em> Because the streams are mathematically isolated, gradients don’t </em><strong><em>cancel each other out</em></strong><em> during training. A gradient update for </em><strong><em>Logic</em></strong><em> doesn’t mess up the weights for </em><strong><em>Syntax</em></strong><em>.</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YU82ZNKS7KG6k3xA.png" /></figure><h4>Step 3: The Manifold Constraint (The Traffic Controller)</h4><p>The danger of having multiple lanes is <strong>Signal Explosion</strong>: if Lane 1 shouts at Lane 2 and Lane 2 shouts back, the noise amplifies layer by layer until the model crashes (Gradient Explosion).</p><p>mHC solves this by enforcing a <strong>Manifold Constraint</strong> using a <strong>Doubly Stochastic Matrix</strong>, a non-negative matrix whose rows and columns each sum to 1.</p><ul><li><strong>Rule 1 (Row Sum = 1): </strong>Total information entering a lane can’t exceed 100%.</li><li><strong>Rule 2 (Column Sum = 1): </strong>Total information leaving a lane can’t exceed 100%.</li></ul><p>This acts like a <strong>Conservation of Energy</strong> law for the AI. It ensures that the signal stays stable, preserving the <strong>Identity Property</strong> even if the network is 1000 layers deep.</p><h4>Step 4: The Free Lunch (Kernel Fusion)</h4><p>Finally, solving the math for Step 3 (Sinkhorn Normalisation) is usually too slow for real-time AI. 
mHC uses <strong>Kernel Fusion </strong>via TileLang to solve this.</p><p>Instead of reading and writing to memory for every calculation, mHC fuses the normalisation and mixing steps into a single operation inside the GPU’s ultra-fast cache (SRAM).</p><p><strong>The Result: </strong>The model gets the intelligence of a massive network and the stability of ResNet, all at nearly the speed of a standard, smaller model.</p><p>Let’s Dive Deeper Now…</p><h3>mHC Architecture: Detailed Exploration</h3><p>mHC’s architecture is a fundamental rethinking of how information flows through a Transformer. Instead of a single <em>Residual Highway</em>, mHC splits the flow into parallel, mathematically constrained streams. This allows the model to expand its capacity (4x) without the signal explosion that typically kills wide networks.</p><ul><li><strong>The Hyper-Connection Layer</strong>: Subspaces and Parallel Processing. Standard Transformers use a single vector 𝘹 to carry all information (syntax, semantics, logic). mHC splits this vector into multiple <strong>Subspaces</strong> (branches). 
Each branch is treated as an independent stream of thought that evolves separately but interacts through a <strong><em>Mixer</em></strong>.</li></ul><p><strong><em>Mathematical Intuition of Subspaces:</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/188/1*A5-X_PEDnfAlcxiLckbncw.png" /></figure><p><em>The output of the layer is determined by mixing these streams using a weight matrix </em><strong><em>W</em></strong><em>:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/188/1*TuYHPSzg4JV1z6FLbixJQQ.png" /></figure><p><em>Where </em><strong><em>W</em></strong><em> is the </em><strong><em>Router</em></strong><em> that decides if Stream 1 (Syntax) should share information with Stream 2 (Logic).</em></p><ul><li><strong>The Manifold Constraint</strong>: Enforcing the <em>Traffic Rules<br></em>The danger of mixing streams is <strong><em>Signal Explosion</em></strong>. If <strong>W</strong> is unconstrained (as in standard Hyper-Connections), values can grow exponentially with depth: a gain of just 1.1 per layer compounds to <strong>1.1¹⁰⁰ ≈ 13,780</strong> after 100 layers. To solve this, mHC forces the mixing matrix <strong>W</strong> to live on the <strong><em>Birkhoff Polytope</em></strong>. 
This is a fancy way of saying <strong>W</strong> must be a <strong><em>Doubly Stochastic Matrix</em></strong>.</li></ul><p><strong>Why Doubly Stochastic?</strong> It acts like a <em>Conservation Law</em> for the AI.</p><ul><li><strong><em>Row Sum = 1:</em></strong><em> The total information entering a stream cannot exceed 100%.</em></li><li><strong><em>Col Sum = 1:</em></strong><em> The total information leaving a stream cannot exceed 100%.</em></li></ul><p><strong><em>Mathematical Intuition of Stability:<br></em></strong>For the signal to be stable, the magnitude of the eigenvalues (λ) of the mixing matrix must not exceed 1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/188/1*iSjgre26S3u2nrXTJIDjMw.png" /></figure><p>If we enforce the constraint that every row and column sums to 1:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/236/1*eOvzizxpKwZkhxHSR3OkxQ.png" /></figure><p>Then, by the <strong><em>Birkhoff-von Neumann Theorem</em></strong>, the matrix is a convex combination of permutation matrices and is therefore non-expansive. The vector Y is essentially just <em>shuffled and averaged</em> inside the vector space, never exploding in length. This preserves the <strong><em>Identity Property</em></strong> even at 1000 layers deep.</p><p><strong>Let’s understand in detail the intuition behind the <em>Free Lunch </em>…</strong></p><ul><li><strong>Kernel Fusion &amp; The Sinkhorn Algorithm:<br></strong>One of the most fascinating challenges in mHC is that <strong>forcing</strong> a matrix to be <em>Doubly Stochastic</em> is computationally expensive. 
You can’t just set the weights; you have to <em>normalise</em> them iteratively during every forward pass.</li></ul><p><strong>The Problem: The Memory Wall<br></strong>To turn an arbitrary positive matrix R into a stable matrix W, we use the <strong>Sinkhorn-Knopp Algorithm</strong>, which alternately divides each row and each column by its sum until both converge to 1.</p><ul><li><strong>Standard Implementation:</strong> Read Matrix→Sum Rows→Divide→Write Back→Read Again…</li><li><strong>The Cost:</strong> This repeated reading/writing to the GPU’s main memory (HBM) is incredibly slow. It can make the model roughly 10x slower.</li></ul><p><strong>The Solution: Kernel Fusion (TileLang)<br>Kernel Fusion</strong> is a technique where we combine multiple mathematical operations into a <strong><em>single GPU command</em></strong>. DeepSeek used <strong><em>TileLang</em></strong> to fuse the entire Sinkhorn loop into one kernel.<br><strong><em>Intuition Behind the Speed Up:<br></em></strong>Instead of moving data back and forth to slow memory (HBM), we load the matrix <strong>Tile</strong> into the GPU’s ultra-fast <strong><em>L1 Cache (SRAM)</em></strong>.</p><ul><li><strong>Load:</strong> Load a chunk of the matrix into SRAM.</li><li><strong>Loop:</strong> Run the Sinkhorn normalisation (Row/Col division) entirely inside the fast cache.</li><li><strong>Compute:</strong> Multiply the stable matrix by the input vector.</li><li><strong>Write:</strong> Send only the final result back to main memory.</li></ul><blockquote><strong>Bonus: </strong>Why did DeepSeek use TileLang over CUDA? Manually writing these kernels in CUDA is a nightmare of memory management. TileLang allows engineers to simply define the Tiling Strategy (how to chop up the matrix for processing) in high-level code. The compiler then generates the optimised machine code automatically. 
This innovation transforms the <strong>memory access cost</strong> from a <strong>prohibitive bottleneck</strong> into a modest <strong>6.7%</strong> overhead.</blockquote><h3>Results: Punching Above Its Weight Class</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7YbQACk4nEsAx2yJ.png" /><figcaption>Scaling properties of mHC compared with the baseline. Adapted from the <a href="https://arxiv.org/pdf/2512.24880">DeepSeek Paper</a>.</figcaption></figure><p>In the paper, the DeepSeek team compares mHC 27B with a standard 27B baseline to measure architectural efficiency. The results show that mHC is scalable and stable, scoring <strong><em>51.0</em></strong> on the <a href="https://arxiv.org/abs/2210.09261"><strong><em>BBH Benchmark</em></strong></a>, an improvement of <strong><em>14.12%</em></strong> over the Standard Baseline.</p><p>This innovation allows a <strong><em>medium-sized AI (27B)</em></strong> to <em>punch far above its weight class</em>. It solves complex logic puzzles significantly better than standard models of the same size, effectively making it smarter without making it larger or slower.</p><h3>Future Directions: Beyond the Birkhoff Polytope</h3><p>The mHC architecture is likely just the <em>Hello World</em> of a new era in Deep Learning design. While the paper establishes a new baseline, the authors and the community see three major avenues for evolution.</p><ol><li><strong>DeepSeek’s Theoretical Roadmap: Future Directions from the Paper<br></strong>The authors frame mHC not as a final product, but as a proof of concept for <strong><em>Manifold-Aware Architecture</em>.</strong> They explicitly highlight three paths:</li></ol><ul><li><strong>Exploring Diverse Manifolds:</strong> Currently, mHC uses the <strong><em>Birkhoff Polytope</em></strong> (Doubly Stochastic Matrices) because it is mathematically symmetric. However, the authors suggest this is just one option. 
Future work will likely explore <strong><em>distinct geometric constraints</em></strong> tailored to specific tasks.<br><strong><em>The Vision</em></strong><em>:</em> Imagine a <em>Creative Manifold</em> for Storytelling models and a <em>Logic Manifold</em> for Math models, each using different geometric constraints to optimise the flow of information.</li><li><strong>Optimising the Plasticity vs. Stability Trade-off:</strong> There is always a tension in AI: <em>Plasticity</em> (learning new things quickly) vs. <em>Stability</em> (not crashing). By restricting the weights to a manifold, we gain stability, but do we lose some plasticity?<br><strong><em>The Goal</em></strong><em>:</em> Finding the <em>Sweet Spot</em> manifold that allows the model to be even more flexible than the current design while maintaining the rigorous safety of the Identity Property.</li><li><strong>Reviving Macro-Architecture Design:</strong> For the last few years, AI research has obsessed over <em>Micro-Design</em> (tweaking the insides of an Attention Head or an FFN block). DeepSeek hopes mHC will <strong><em>restore community interest in macro-architecture</em>.<br><em>The Shift</em></strong><em>:</em> Moving focus away from the neuron and towards the <strong>Global Topology</strong>: how layers connect over long distances.</li></ul><p>2. <strong>Our Perspective: The Uncharted Frontiers Ahead</strong></p><ul><li><strong>Solving the Depth Limit in Vision (ViTs):</strong> Vision Transformers (ViTs) struggle to go deeper than ~30 layers because visual signals tend to <em>over-smooth</em> (become identical).<br><em>The Prediction:</em> The <strong><em>Identity Property</em></strong> preserved by mHC may be exactly what ViTs need to scale to 100+ layers, potentially unlocking a new level of high-fidelity image understanding and generation.</li><li><strong>The Manifold-MoE Hybrid:</strong> Current Mixture-of-Experts (MoE) models route tokens to experts. 
mHC creates distinct subspaces (streams) of information.<br><em>The Prediction:</em> Combining these could allow for <strong>Trillion-Parameter models</strong> where specific <em>Streams</em> are routed to specific <em>Experts</em> (e.g., the <em>Syntax Stream</em> routes to a Grammar Expert, while the <em>Logic Stream</em> routes to a Math Expert).</li></ul><h3>Conclusion: Building mHC, From Chaos to Constrained Intelligence</h3><p>The iterative construction of mHC highlights how combining high-bandwidth architecture (<strong>Hyper-Connections</strong>), rigorous linear algebra (<strong>Manifold Constraints</strong>), and hardware optimisation (<strong>Kernel Fusion</strong>) led to a new paradigm in Deep Learning. Each step solved a critical bottleneck that previously made scaling impossible:</p><ul><li><strong>Hyper-Connections</strong> provided the structural potential, creating a <em>4-Lane Superhighway</em> for parallel information processing.</li><li><strong>The Manifold Constraint</strong> (Birkhoff Polytope) acted as the traffic controller, strictly enforcing the <strong><em>Identity Property</em></strong> to prevent the signal explosion that kills standard wide networks.</li><li><strong>Kernel Fusion</strong> (via TileLang) broke the <em>Memory Wall</em>, allowing us to run complex normalisation algorithms in real-time without slowing down the GPU.</li></ul><p>By moving from a single residual lane to a mathematically constrained manifold, DeepSeek has shown that we can quadruple the information capacity of an AI without quadrupling the cost.</p><h3>Thank You for Reading!</h3><p>If you made it this far, you are part of the small percentage of people who truly care about the <em>mechanics</em> of Intelligence, not just the hype.</p><p><strong>Stay tuned for Part 2</strong>, where we will leave the analogies behind and walk through the rigorous mathematical proofs that make this architecture robust.</p><p><strong>Until then, keep 
optimising.</strong></p><p><em>Originally published at </em><a href="https://jay9122.substack.com/p/deepseek-mhc-paper-breakdown-architecture?r=1zzcg7"><em>https://jay9122.substack.com</em></a><em>.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[DINO by Meta: How AI Learns Without Labels (Self-Supervised Learning)]]></title>
            <link>https://medium.com/@jay9122/dino-by-meta-how-ai-learns-without-labels-self-supervised-learning-096dc4b21887?source=rss-8288fa2632b5------2</link>
            <guid isPermaLink="false">https://medium.com/p/096dc4b21887</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[self-supervised-learning]]></category>
            <category><![CDATA[vision-transformer]]></category>
            <dc:creator><![CDATA[Jay Prajapati]]></dc:creator>
            <pubDate>Fri, 17 Oct 2025 11:27:06 GMT</pubDate>
            <atom:updated>2025-10-17T11:27:06.287Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VvBj18tt9yv_US5_a_dIbw.png" /><figcaption>Evolution of the DINO Model Family. Created with Gemini Nano Banana.</figcaption></figure><p>The DINO (self-<strong>DI</strong>stillation with <strong>NO</strong> labels) family is a pioneer of self-supervised learning for image models. Meta AI released DINO in April 2021. It gained traction at ICCV 2021 thanks to its label-free training on images. DINO is like a BERT for images: it can find pictures very similar to your query just by comparing their feature vectors.</p><p>During training, DINO’s features spontaneously developed the ability to clearly distinguish the boundaries of objects within an image. To scale this into a universal feature set, from identifying an image’s category to mapping out pixel-level details, DINOv2 was introduced in April 2023 with a pre-trained 1B-parameter ViT, and became a <strong>Visual Foundation Model</strong>, the vision counterpart of GPT (a <strong>Foundational LLM</strong>). Its core innovation is frozen features, which can handle various tasks without <strong>task-specific fine-tuning</strong>, such as Depth Estimation, Semantic Segmentation, Dense Matching, Sparse Matching, etc.</p><p>DINOv2 suffered from detail degradation, which occurs when training these huge models for long periods. To overcome this issue, DINOv3 was introduced (Aug 2025) with a stabilization technique called <strong>Gram anchoring</strong>, and a 7B model was trained. The result is a model that yields <strong>exceptionally high-quality dense features.</strong>
While DINOv2 provided a good, smooth map of the scene’s geometry, DINOv3 delivers a precise,<strong> high-resolution topographical map of the scene</strong>, enabling superior performance for applications like robotics or 3D reconstruction that rely on exact geometric data.</p><h3>DINO Algorithm Explanation</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JIDf7fWbnRAzSkwB" /><figcaption><em>DINO Algorithm Represented in Official </em><a href="https://github.com/facebookresearch/dino"><em>GitHub</em></a><em> Repository</em></figcaption></figure><p>There are 3 breakthrough innovations implemented by Meta AI in the DINO algorithm.</p><ol><li><strong>The Student and the Teacher:</strong></li></ol><ul><li>The core of DINO involves <strong>two neural networks with identical architecture</strong>: a <em>student</em> network and a <em>teacher</em> network. Both have the same structure (e.g., a ViT), but they have different internal parameters.</li><li>The <strong>student network</strong> is the one actively learning and getting smarter.</li><li>The <strong>teacher network</strong> acts like a wiser, more stable mentor. Its job is to provide good examples for the student to learn from.</li></ul><p>2. <strong>Multi-Crop Training:</strong></p><ul><li>For any single input image, the algorithm doesn’t just look at it once. Instead, it creates <strong>multiple different crops</strong> of that image.</li><li>These views include <strong>two larger, global views</strong> (e.g., covering more than 50% of the original image) and <strong>several smaller, local views</strong> (e.g., covering less than 50%).</li><li>The student network processes all these views, while the teacher network processes only the two global views.</li></ul><p>3. 
<strong>Output Generation:</strong></p><ul><li>Both the student and teacher networks process their respective views and produce an output, a kind of summary or representation of what they see in that view.</li><li>The student’s goal is to match the teacher’s output using a <strong>cross-entropy loss.</strong></li><li>Based on the matching result, the student updates its parameters. The teacher network doesn’t learn directly through this matching process. Instead, its parameters are <strong>updated slowly</strong> by taking an average of the student’s past parameters. This is known as an <em>exponential moving average</em> or <em>momentum encoder.</em></li><li>DINO prevents <strong>collapse</strong> (e.g. always outputting the same thing for every image) by applying <strong>Centring </strong>&amp; <strong>Sharpening.</strong></li></ul><blockquote><strong>Centring </strong>stops the network from focusing too much on just one type of feature or always producing the same boring output.</blockquote><blockquote><strong>Sharpening</strong> makes the teacher’s <strong>judgements</strong> more distinct and confident, which helps guide the student towards learning specific features.</blockquote><h3>What Improved in DINOv2 from DINOv1</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fplayer.vimeo.com%2Fvideo%2F1128130963%3Fapp_id%3D122963&amp;dntp=1&amp;display_name=Vimeo&amp;url=https%3A%2F%2Fvimeo.com%2F1128130963&amp;image=https%3A%2F%2Fi.vimeocdn.com%2Fvideo%2F2071105130-285c62f86c6bd41257fc90924770435354c6052d42064a324be7f2267f7d3def-d_1280%3Fregion%3Dus&amp;type=text%2Fhtml&amp;schema=vimeo" width="1920" height="1080" frameborder="0" scrolling="no"><a href="https://medium.com/media/c7191243d13c851482e3707e3e80216f/href">https://medium.com/media/c7191243d13c851482e3707e3e80216f/href</a></iframe><ul><li>DINOv1 proved that self-supervision works well with ViTs. 
To scale SSL in both data quantity and model size, the authors curated the LVD-142M <a href="https://github.com/facebookresearch/dinov2/issues/24#issuecomment-1515477785">dataset</a> and combined self-distillation objectives (e.g. iBOT).</li><li>Together, these changes scaled the model size from 85M to 1B parameters.</li><li>All of this innovation led DINOv2 to generate <em>general-purpose visual features</em> that worked extremely well across various tasks, both image-level (like classification) and pixel-level (like segmentation or depth estimation).</li></ul><h3>What Improved in DINOv3 from DINOv2</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fplayer.vimeo.com%2Fvideo%2F1128170491%3Fapp_id%3D122963&amp;dntp=1&amp;display_name=Vimeo&amp;url=https%3A%2F%2Fvimeo.com%2F1128170491&amp;image=https%3A%2F%2Fi.vimeocdn.com%2Fvideo%2F2071153521-61d286d919bca8821c5dc40479a522996e719bac0c779f16c7af39582e9b52c1-d_960%3Fregion%3Dus&amp;type=text%2Fhtml&amp;schema=vimeo" width="1256" height="720" frameborder="0" scrolling="no"><a href="https://medium.com/media/8e5141043b9686981fc76f72cf35df1e/href">https://medium.com/media/8e5141043b9686981fc76f72cf35df1e/href</a></iframe><p>While DINOv2 was excellent, scaling SSL models (especially with very long training schedules) introduced a problem: the <strong>dense feature maps degraded</strong> or collapsed over time, losing their fine-grained spatial accuracy.</p><ul><li>DINOv3 scaled further, using an even larger curated dataset (LVD-1689M, over 1.6 billion images) and training a giant ViT model with <strong>7B parameters.</strong></li><li>To mitigate the collapse of dense feature maps during long, large-scale training, the authors introduced <strong>Gram Anchoring.</strong></li><li>In short, Gram Anchoring is a technique that makes sure the details stay <strong>sharp and organized</strong>, even while the student learns bigger, fancier ideas about all the other pictures.</li><li>Due to Gram Anchoring, <strong>DINOv3 delivers a precise, 
high-resolution topographical map of the scene,</strong> enabling superior performance for applications like robotics or 3D reconstruction that rely on exact geometric data.</li></ul><h3>Future Directions</h3><ul><li>DINOv3 has a significant performance gap on OCR-heavy classification tasks because it does not <strong>leverage paired image-text data</strong> during training. The authors acknowledge that handling text recognition purely through SSL remains a recognized challenge.</li></ul><h3>Conclusion</h3><ul><li>DINO started as a self-distillation method, leveraging the <strong>momentum encoder</strong> and multi-crop training for ViTs. DINOv2 scaled SSL by using curated data and combining DINO/iBOT losses, reaching 1B parameters. DINOv3 introduced <strong>Gram anchoring</strong>, enabling training up to 7B parameters for versatile, high-quality dense features.</li></ul>]]></content:encoded>
        </item>
    </channel>
</rss>