Stories by Harsh Maheshwari on Medium

Flow Matching vs Diffusion

Harsh Maheshwari — Tue, 18 Mar 2025 15:17:12 GMT

A brief look at the mathematical foundations, with a focus on building intuition

Code: https://github.com/harshm121/Diffusion-v-FlowMatching

Introduction

Generative models have revolutionised artificial intelligence, enabling machines to create remarkably realistic images, audio, and text. Among these techniques, two approaches have gained significant traction: diffusion models and flow matching. While both methods transform noise into structured data (or vice versa), they operate on fundamentally different principles (or do they?). This blog post provides a comprehensive comparison of these two powerful techniques, exploring their mathematical foundations, practical implementations, and intuitive interpretations.

The Basic Intuition

Before diving into the mathematics, let’s develop an intuitive understanding of both approaches:

Diffusion Models gradually add noise to data until it becomes pure noise, then learn to reverse this process. Think of it as slowly dissolving a photograph in acid until it becomes a random blur, then learning how to reconstruct the photograph from the blur.

Flow Matching creates a continuous path (or flow) between noise and data distributions. Think of it as defining a smooth transportation plan that morphs noise into structured data, similar to watching a time-lapse of clay being sculpted from a random blob into a detailed statue.

Mathematical Foundations of Diffusion Models

Forward Process

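Diffusion models define a forward process that gradually adds Gaussian noise to data $x_0$ over $T$ timesteps:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

where

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\right)$$

and $\alpha_t \in (0, 1)$ is a noise schedule parameter (equivalently, $\beta_t = 1 - \alpha_t$ is the variance of the noise added at step $t$).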

This process can also be expressed as a stochastic differential equation (SDE):

$$dx_t = f(x_t, t)\,dt + g(t)\,dW_t$$

where $W_t$ is a standard Wiener process (Brownian motion), $f(x_t, t)$ is the drift coefficient, and $g(t)$ is the diffusion coefficient.

After sufficient steps, $x_T$ approaches a standard Gaussian distribution $\mathcal{N}(0, I)$, essentially destroying all structure in the original data.
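To see the forward process destroy structure, here is a minimal sketch in plain Python. It assumes the common DDPM-style update $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon$ with a constant $\alpha_t$; the function name `forward_noise` and the schedule values are illustrative and not taken from the accompanying repository.

```python
import math
import random


def forward_noise(x0, alphas, rng):
    """Apply one noising step per alpha and return the final value x_T."""
    x = x0
    for alpha in alphas:
        eps = rng.gauss(0.0, 1.0)  # fresh Gaussian noise at every step
        x = math.sqrt(alpha) * x + math.sqrt(1.0 - alpha) * eps
    return x


rng = random.Random(0)
alphas = [0.95] * 200  # assumed constant schedule, chosen only for illustration

# Start every chain from the same far-from-zero point x_0 = 5.
samples = [forward_noise(5.0, alphas, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, var={var:.3f}")  # expect values near 0 and 1
```

Even though every chain starts at the same point, the empirical mean and variance of $x_T$ end up close to those of $\mathcal{N}(0, 1)$, which is the sense in which the starting structure is forgotten.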

Reverse Process

The key insight in diffusion models is that if we can learn to reverse this noise addition process, we can generate new data by starting with random noise and iteratively denoising it.

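The reverse process is defined as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$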

In practice, models like DDPM (Denoising Diffusion Probabilistic Models) predict the noise component $\epsilon_\theta(x_t, t)$ at each step, from which we can derive $\mu_\theta(x_t, t)$.

Training Objective

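The training objective for diffusion models is typically a variational bound on the negative log-likelihood:

$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$

In practice, this is often simplified to a reweighted mean-squared error objective:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$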

Mathematical Foundations of Flow Matching

Continuous Normalizing Flows and Velocity Fields

Flow matching builds on the concept of continuous normalizing flows (CNFs), which define a differential equation that transforms one probability distribution into another:

$$\frac{dx(t)}{dt} = v_\theta(x(t), t)$$

where $v_\theta$ is a learned velocity field.

Understanding Velocity Fields

A velocity field is a function that assigns a velocity vector to each point in space and time. Think of it as a wind map that shows the direction and speed that a particle should move at any given location. In the context of generative models:

Mathematical Definition: A velocity field $v: \mathbb{R}^d \times [0, 1] \rightarrow \mathbb{R}^d$ assigns a vector $v(x, t)$ to each point $x$ at time $t$.

Physical Intuition: Imagine placing a tiny boat at any point in a river. The velocity field tells you the direction and speed the current will move the boat at that exact position.
Transformation Properties: If we follow the velocity field from time $t=0$ to $t=1$, we trace out paths that transform samples from one distribution (typically noise) to another (our target data distribution).

Velocity fields have several important properties that make them powerful tools for generative modeling:

Deterministic Paths: Following a velocity field from a starting point produces a deterministic trajectory.
Invertibility: We can run the process backward by reversing the velocity vectors.
Conservation of Probability Mass: The continuity equation ensures that probability mass is neither created nor destroyed during the transformation.

To visualize a velocity field, imagine a grid where at each point there’s an arrow showing direction and magnitude:

↑     ↗     →     ↗     ↑
↖     ↑     ↑     →     ↗
←     ↙     ↑     ↑     →
↙     ↓     ↓     ↗     ↑
↓     ↙     ←     ←     ↑

In flow matching, we learn a neural network to predict these arrows (velocity vectors) given a position and time. The network is trained to match a reference vector field that defines the desired flow between distributions.
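The properties listed above (deterministic paths and invertibility) can be checked numerically. The following is a minimal sketch that follows a hand-picked 1D velocity field with Euler steps; the field `velocity(x, t) = t - x` is an arbitrary stand-in for a learned network, and `integrate` is an illustrative helper name.

```python
def integrate(velocity, x, t_start, t_end, n_steps=1000):
    """Euler-integrate dx/dt = velocity(x, t) from t_start to t_end."""
    h = (t_end - t_start) / n_steps  # negative when integrating backward in time
    t = t_start
    for _ in range(n_steps):
        x = x + h * velocity(x, t)
        t = t + h
    return x


def velocity(x, t):
    """A hand-picked 1D field standing in for a learned network."""
    return t - x


x_start = 0.3
x_end = integrate(velocity, x_start, 0.0, 1.0)   # follow the arrows forward
x_back = integrate(velocity, x_end, 1.0, 0.0)    # retrace them backward

print(x_end, x_back)
```

Running the forward pass twice gives the same endpoint, and integrating backward from that endpoint returns to the starting point up to the discretisation error of the Euler steps.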

Probability Flow ODE

The evolution of probability density under this flow follows the continuity equation:

$$\frac{\partial p_t(x)}{\partial t} + \nabla \cdot \left(p_t(x)\, v_\theta(x, t)\right) = 0$$

This ensures mass conservation during the transformation.

Flow Matching Principle

The key innovation in flow matching is to directly supervise the velocity field $v_\theta$ using a predefined path between distributions. Instead of deriving the velocity field from a complex probability flow equation, flow matching directly constrains it to match a reference vector field $u(x,t)$ that defines how samples should move:

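$$L_{\text{FM}} = \mathbb{E}_{t,\, x(t) \sim p_t}\left[\left\| v_\theta(x(t), t) - u(x(t), t) \right\|^2\right]$$

where $x(t)$ is sampled from the path distribution $p_t(x)$ that interpolates between the noise distribution $p_0(x)$ and the data distribution $p_1(x)$.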

Conditional Flow Matching

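A powerful extension is conditional flow matching (CFM), which constructs paths between individual data points and noise samples:

$$L_{\text{CFM}} = \mathbb{E}_{t,\, x_0,\, z}\left[\left\| v_\theta(x(t), t) - u(x(t), t \mid x_0, z) \right\|^2\right]$$

where

$$x(t) = (1 - t)\, z + t\, x_0 + \sigma_t\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

defines a path between noise $z$ and data $x_0$, with

$$\sigma_t = \sigma \sqrt{t (1 - t)}$$

controlling path noise.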

Key Differences

1. Path Definition

Diffusion Models define a fixed, stochastic path through the addition of Gaussian noise. The forward process is predetermined by the noise schedule, and the model learns to reverse this specific process.

Flow Matching allows for flexible path design between distributions. The paths can be straight lines, curved trajectories, or even learned dynamically, providing greater flexibility.

2. Training Dynamics

Diffusion Models typically require estimating complex probability densities or their surrogates, often leading to challenging training dynamics and the need for careful noise scheduling.

Flow Matching directly supervises the velocity field, resulting in a simpler mean-squared error objective that tends to be more stable during training.

3. Sampling Efficiency

Diffusion Models traditionally require many steps during sampling (often 1000+), though techniques like DDIM sampling have reduced this requirement.

Flow Matching can often achieve high-quality samples with fewer steps (10–100) by leveraging higher-order ODE solvers.
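The effect of solver order can be illustrated on a toy problem. The sketch below compares first-order Euler with second-order Heun on the velocity field $dx/dt = -x$, chosen only because its exact solution is known; it is not a trained model, and the helper names are illustrative.

```python
import math


def euler(f, x, n_steps):
    """First-order solver: one velocity evaluation per step."""
    h = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x = x + h * f(x, t)
        t += h
    return x


def heun(f, x, n_steps):
    """Second-order solver: two velocity evaluations per step."""
    h = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        k1 = f(x, t)
        k2 = f(x + h * k1, t + h)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x


def f(x, t):
    return -x  # toy velocity field with known solution x(1) = x(0) * exp(-1)


exact = math.exp(-1.0)
err_euler = abs(euler(f, 1.0, 10) - exact)
err_heun = abs(heun(f, 1.0, 10) - exact)
print(err_euler, err_heun)
```

With ten steps each, Heun's error is more than an order of magnitude smaller than Euler's on this problem, and it stays smaller even when Euler is given twenty steps to match the number of velocity evaluations.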

4. Theoretical Guarantees

Diffusion Models have strong connections to score-based generative models and provide clear likelihood bounds.

Flow Matching offers guarantees about exact density matching under certain conditions and provides a more direct route to optimizing the probability flow ODE.

Numerical Example: Transforming a Simple Distribution

Let’s consider a concrete example: transforming a standard normal distribution into a mixture of two Gaussians.

Diffusion Approach

In the diffusion approach, we would:

Start with our target distribution (mixture of Gaussians)
Gradually add noise according to a schedule $\beta_t$
Train a model to predict the noise at each step
Sample by starting with pure noise and iteratively denoising

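For a simple 1D example with $T=10$ steps and a constant noise schedule $\beta_t = 0.1$, the forward process would be:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{0.9}\, x_{t-1},\ 0.1\right)$$

The reverse process would estimate:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$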
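As a quick sanity check on this schedule, assume the standard DDPM parameterization in which each step scales the signal by $\sqrt{1 - \beta_t}$ and adds noise of variance $\beta_t$. The ten steps then compose into a single Gaussian jump from $x_0$ to $x_{10}$:

```latex
\bar{\alpha}_{10} = \prod_{t=1}^{10} (1 - \beta_t) = 0.9^{10} \approx 0.349
\qquad\Rightarrow\qquad
x_{10} = \sqrt{\bar{\alpha}_{10}}\, x_0 + \sqrt{1 - \bar{\alpha}_{10}}\, \epsilon
\approx 0.59\, x_0 + 0.81\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)
```

So with only ten steps at $\beta_t = 0.1$, roughly 59% of the original signal amplitude survives. A practical schedule would use more steps or larger $\beta_t$ so that $x_T$ is close to pure noise.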

Flow Matching Approach

In the flow matching approach, we would:

1. Define a path between distributions, e.g., linear interpolation: $x(t) = (1 - t)\, z + t\, x_0$

2. Derive or define a velocity field $u(x, t)$ that induces this path

3. Train a model to predict this velocity field

4. Sample by solving the ODE: $dx/dt = v_\theta(x, t)$ from noise to data

For our simple example, assuming straight-line paths, the velocity field might be:

$$u(x, t) = x_0 - z$$

where $z$ is a noise sample and $x_0$ is a data sample.
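Step 4 can be sketched end to end when the velocity field is known in closed form. The sketch below makes a simplifying assumption that is not part of the example above: the target is a single Gaussian $\mathcal{N}(m, s^2)$ rather than a mixture, with noise and data sampled independently. Under that assumption the marginal velocity of the straight-line path can be written down exactly, and it stands in for the trained network $v_\theta$. The names `marginal_velocity` and `sample` are illustrative.

```python
import random


def marginal_velocity(x, t, m, s):
    """Closed-form E[x_0 - z | x(t) = x] for z ~ N(0, 1), x_0 ~ N(m, s^2),
    independent, with x(t) = (1 - t) z + t x_0."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2   # variance of x(t)
    cov = t * s ** 2 - (1.0 - t)            # Cov(x_0 - z, x(t))
    return m + cov / var_t * (x - t * m)


def sample(z, m, s, n_steps=1000):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    h = 1.0 / n_steps
    x, t = z, 0.0
    for _ in range(n_steps):
        x = x + h * marginal_velocity(x, t, m, s)
        t += h
    return x


rng = random.Random(0)
m, s = 2.0, 0.5
draws = [sample(rng.gauss(0.0, 1.0), m, s, n_steps=200) for _ in range(2000)]
mean = sum(draws) / len(draws)
std = (sum((d - mean) ** 2 for d in draws) / len(draws)) ** 0.5
print(f"mean={mean:.3f}, std={std:.3f}")  # should land near m and s
```

Each noise sample is pushed deterministically to a data-like sample, and the pushed samples have approximately the target mean and standard deviation. With a learned $v_\theta$ the loop is the same; only the velocity function changes.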

Practical Implementations and Applications

Diffusion Models in Practice

The most influential diffusion models include:

DDPM (Denoising Diffusion Probabilistic Models): The foundational model that popularized the approach
DDIM (Denoising Diffusion Implicit Models): Accelerated sampling through deterministic processes
Stable Diffusion: Applied to latent spaces for efficient image generation
Imagen and DALL-E 2: Text-to-image models built on diffusion principles

Implementation considerations:

# Simplified DDPM training loop (a method of the diffusion wrapper class)
import torch
import torch.nn.functional as F

def train_step(self, x_0, optimizer):
    """Single training step for diffusion model"""
    batch_size = x_0.shape[0]
    
    # Sample random timesteps
    t = torch.randint(0, self.n_timesteps, (batch_size,), device=self.device, dtype=torch.long)
    
    # Add noise to data
    x_t, noise = self.q_sample(x_0, t)
    
    # Predict noise
    predicted_noise = self.model(x_t, t / self.n_timesteps)
    
    # Compute loss
    loss = F.mse_loss(noise, predicted_noise)
    
    # Update parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    return loss.item()
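The `q_sample` call above is not shown in the snippet. A minimal sketch of what such a helper typically computes is given below, assuming the standard DDPM closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$. It works on plain Python scalars rather than tensors, the repository's actual implementation may differ, and `make_alpha_bars` is a hypothetical helper name.

```python
import math
import random


def make_alpha_bars(betas):
    """Cumulative products of (1 - beta_s), one entry per timestep."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x_0, t, alpha_bars, rng):
    """Jump straight from x_0 to x_t and return (x_t, noise)."""
    noise = rng.gauss(0.0, 1.0)
    a_bar = alpha_bars[t]
    x_t = math.sqrt(a_bar) * x_0 + math.sqrt(1.0 - a_bar) * noise
    return x_t, noise


betas = [0.1] * 10  # the beta_t = 0.1, T = 10 schedule from the numerical example
alpha_bars = make_alpha_bars(betas)
x_t, noise = q_sample(1.5, t=9, alpha_bars=alpha_bars, rng=random.Random(0))
print(alpha_bars[-1], x_t)
```

Returning the noise alongside $x_t$ is what lets the training step use it directly as the regression target.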

Flow Matching in Practice

Notable flow matching implementations include:

Conditional Flow Matching (CFM): Efficiently trains neural ODEs for generative modeling
Consistency Models: Combine aspects of flow matching with diffusion principles
SiT (Scalable Interpolant Transformers): Reformulates diffusion models as interpolant matching

Implementation considerations:

def sample_path_point(self, x_0, z, t):
    """Sample point along the path from noise z to data x_0 at time t"""
    # Linear interpolation path with small noise
    noise = torch.randn_like(x_0) * self.sigma
    x_t = (1 - t) * z + t * x_0 + noise * torch.sqrt(t * (1 - t))
    
    # Target velocity for the path
    # For straight-line path (excluding the noise term for simplicity)
    target_v = x_0 - z
    
    return x_t, target_v

# Simplified Flow Matching training loop
def train_step(self, x_0, optimizer):
    """Single training step for flow matching"""
    batch_size = x_0.shape[0]
    
    # Sample random timesteps
    t = torch.rand(batch_size, device=self.device)
    
    # Sample noise points
    z = torch.randn_like(x_0)
    
    # Get path points and target velocities
    x_t, target_v = self.sample_path_point(x_0, z, t.unsqueeze(-1))
    
    # Predict velocity vectors
    predicted_v = self.model(x_t, t)
    
    # Compute loss
    loss = F.mse_loss(predicted_v, target_v)
    
    # Update parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    return loss.item()

Intuitive Analogies

Diffusion as Erosion and Reconstruction

Imagine a sandcastle on a beach. The diffusion forward process is like watching the tide gradually wash away the castle until only a flat bed of sand remains. The reverse process is learning how to reconstruct the castle from the flat sand by understanding the patterns in which sand grains were moved.

Flow Matching as Navigation

Flow matching is like having a GPS system that gives you direction vectors at any point in space to navigate from your current location to a destination. Rather than following a fixed route, you can start anywhere, and the learned vector field will guide you toward the target distribution.

Convergence of Approaches

Recent research has revealed interesting connections between diffusion models and flow matching:

Probability Flow ODEs: Diffusion processes can be converted to deterministic ODEs similar to those in flow matching
Score-Based Methods: Both approaches can be viewed as learning different parameterizations of the score function (gradient of log density)
Interpolant Matching: Unifies various generative approaches under a common framework
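The first two connections can be made concrete with two standard identities from the score-based literature, quoted here as background rather than derived in this post. They use the drift $f$ and diffusion coefficient $g$ of the forward SDE, and $\bar{\alpha}_t$ denotes the cumulative product of the per-step signal-retention factors. The noise predicted by a DDPM-style network is a rescaled score, and a diffusion SDE has a deterministic ODE with the same marginals $p_t$:

```latex
\nabla_{x_t} \log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}},
\qquad
\frac{dx_t}{dt} = f(x_t, t) - \frac{1}{2}\, g(t)^2\, \nabla_{x_t} \log p_t(x_t)
```

The right-hand side of the ODE is itself a velocity field, which is why a trained diffusion model can be sampled with the same kind of ODE solver used for flow matching.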

Computational Efficiency

Diffusion Models generally require more function evaluations during sampling, leading to higher computational costs. However, they can be trained with simple architectures and loss functions.

Flow Matching typically achieves better sampling efficiency with sophisticated ODE solvers but may require more complex architectures to accurately model the velocity field.

Sample Quality

Both approaches can achieve state-of-the-art sample quality, with the choice often depending on specific application requirements:

Diffusion Models excel at highly structured data like images and audio
Flow Matching can be more efficient for simpler distributions or when sampling speed is critical

Conclusion

Diffusion models and flow matching represent two powerful paradigms in generative modeling, each with distinct mathematical foundations and practical considerations. Diffusion models follow a fixed stochastic process and learn to reverse it, while flow matching directly learns a velocity field that can transform distributions along flexible paths. It is as if flow matching kept all the good parts of the diffusion process intact but simplified it by removing the unnecessary restrictions of the complex forward noising process.

Understanding the fundamental differences and similarities between these approaches provides deeper insight into the broader landscape of generative modelling and points toward exciting future directions in this rapidly evolving field.

Source: https://youtube.com/clip/Ugkx6Gm16nxY_jo1ydOslPrAVUdU1qiUwswr?si=h97F79DjZwrZQat9

Code

https://github.com/harshm121/Diffusion-v-FlowMatching

References

Ho, J., Jain, A., & Abbeel, P. (2020). “Denoising diffusion probabilistic models.”
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). “Score-based generative modeling through stochastic differential equations.”
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2022). “Flow matching for generative modeling.”
Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., & Bengio, Y. (2023). “Conditional flow matching: Simulation-free dynamic optimal transport.”
Song, Y., Durkan, C., Murray, I., & Ermon, S. (2021). “Maximum likelihood training of score-based diffusion models.”

Moving back to India just after MS from the US

Harsh Maheshwari — Mon, 19 Jun 2023 07:18:58 GMT

There is already a plethora of content where individuals have shared their experiences and reasons on the decision to return to India from Western countries, predominantly the US. Some relocate during their retirement, others after the birth of their first child, some following the expiration of their Optional Practical Training (OPT), while some are compelled to move, and yet others choose to do so willingly. In my case, I made the choice to return to India immediately after completing my MS in the US, which is quite uncommon.

As I sit here at the Delhi airport, it’s 3:15 am and I find myself contemplating whether I made a foolish decision. Writing this article, I am awaiting the break of dawn, ensuring it’s safe to book a cab. Admittedly, given that I resided in Atlanta, where a mass shooting incident occurred just a kilometer away from my home, I don’t have many grievances about the crime situation in India.

It’s intriguing how even when I order the standard coffee from Costa Coffee, it doesn’t taste the same. It’s neither the familiar flavor of Indian coffee from my life before the US nor does it resemble the taste I remember from sipping on an iced caramel latte in my lab just a month ago. It serves as a perfect metaphor for the changes that have taken place here. It might sound overly dramatic, considering I was away for only 22 months. Yet, perhaps it’s not India that has changed the most, but rather, it’s me. I realize that 22 months may seem like a short duration, but when it comes to measuring the impact of time, experiences hold far greater significance. After all, time is truly relative!

I find myself reluctant to pen yet another article delving into the reasons behind my decision to return, mainly because I don’t have any groundbreaking insights to offer. Countless well-crafted articles, books, and videos have already covered the topic thoroughly, leaving little for me to contribute. However, what I would like to share is how I mentally prepared myself to embrace this decision. I believe I possess some valuable insights that can assist others in similar situations. Moreover, if properly abstracted, many of these reflections can be applied to a wide range of life choices. So, without further ado, the remainder of this blog comprises a collection of random thoughts that have been circulating in my mind and are relevant to this discussion. There is no specific order to them, but some may prove more beneficial than others depending on one’s individual circumstances. Personally, I found the final point to be the most valuable, which is why I have provided a more detailed elaboration on it.

Novelty vs Utility

Recall the last time you witnessed a sky painted entirely in shades of orange. Undoubtedly, you found it breathtakingly beautiful. Yet, I would argue that there is nothing inherently more captivating about an orange sky compared to a blue one. Its beauty stemmed from the rarity of the event itself.

Relocating to the United States offers a multitude of new experiences. Having spent a significant amount of time in India, immersing oneself in an entirely different culture, in a country that differs vastly in almost every aspect, inevitably exposes one to numerous disparities. This also leads to a lot of novel experiences! From small things like checking out at grocery stores to big things like commuting, things are different. Novelty possesses a magnetic allure — it injects a burst of excitement into our lives, invigorates our growth, and bestows us with new perspectives.

However, it is crucial to recognize that novelty is transient and challenging to sustain. When making comparisons between two things, we must exercise caution in discerning whether our attraction stems solely from their novelty or if they genuinely contribute utility to our lives. While moving to the US offers an abundance of novel experiences that may captivate us, it is essential not to equate novelty with superiority, as its allure is short-lived. Instead, we should evaluate things based on the value they bring to our lives, rather than being swayed solely by their temporary attractiveness. This principle holds true across various aspects of life, be it relationships, where seeking novelty in others can lead to cheating, or in consumerist societies driven by capitalism, where we relentlessly pursue novelty in gadgets, clothes, and possessions for fleeting adventures.

Appreciating the distinction between novelty and utility is important. Let us not be seduced by the transient appeal of novelty, but rather prioritize the enduring worth and impact that something can offer.

Think about long term

Unfortunately, the pursuit of long-term and short-term objectives often leads to conflicting decisions. Prioritizing short-term gains can hinder progress toward long-term goals. If your ultimate aim is to return to India, it is crucial to always keep that in mind. There will come a time when you have to make compromises, sacrificing immediate benefits for the sake of your long-term vision. Such compromises can be disheartening, but it’s essential to be steadfast in your long-term goals, even if it means relinquishing visible and attainable short-sighted objectives. It may be acceptable to adjust your milestones or timelines but ensure that you are not creating unnecessary challenges and difficulties for your future self.

One of the reasons I chose to return to India as soon as possible was the realization that the task would become increasingly arduous with each passing year in the US. For me, the incremental difficulties didn’t outweigh the advantages of staying a few more years. Completing my MS served as a natural checkpoint to evaluate this decision, as finding such convenient milestones in the future would prove challenging. There will always be another enticing project or opportunity on the horizon, requiring more intricate planning and alignment for a successful transition. The longer you reside in the US, the more attachments and commitments you will form, necessitating greater efforts to uproot yourself and move back.

The human mind possesses a remarkable ability to deceive itself. Being truly honest with oneself is a daunting task, and it is vital to cultivate the skill of accepting harsh truths. (Perhaps this blog post is my own way of self-deception regarding this decision?) In our daily lives, we often encounter choices where we must confront the harsh truth or convince ourselves that everything is fine. While the latter can provide temporary relief, excessive reliance on self-deception can impede personal growth and development.

Think about the past

We don’t hear this often and are usually encouraged to look forward and move on from the past. When you are thinking about your life in the US vs your life in India, you are also inherently comparing your current life and your past life. This makes the comparison very difficult because what you want is a comparison between locations, but the data you have varies in both location and time. This leads to two problems: a) Life in India will have changed, and more so for you. b) You are probably relying on memories of your previous life, and those could be very biased.

On the first problem, visiting India will give you some data points, but keep in mind that transitions are not reflective of the long term (more on this later). Keeping in touch with counterfactual-people (people who have a life that is close to what yours would be if you moved back) will probably give you a better idea. The second problem can be more difficult to resolve because our minds are weird: we don’t usually keep a very good account of the past in our memories. They are biased, and I feel they are usually biased towards the good memories (except in some traumatic experiences). So make sure you have a good account of how you felt when you were in India, what bothered you, and what gave you happiness. Write it down as soon as possible because the memories will get more selective (and thus even more biased) with time.

We feel the most during transitions

Let’s look at this visualization, where the y-axis represents the comfort and the x-axis is the time.

Let’s say you are at comfort = s_0 at time = 0 and hypothetically you know you will reach comfort = s_T at a future time = T.

Given this, there are multiple paths you can choose to spend your time, the most obvious being this:

which maximizes your total comfort in life. However, the other options would be:

where the orange line represents a gradual increase in comfort and the green represents random bumps. In my opinion, life in green is more exciting than life in blue, as we very easily get used to whatever our situation is, and then the absolute comfort level doesn’t really matter much.

This also means it will be okay to downgrade as well.

We will adapt to that easily too; it is only during the transition that we might feel bad. But once we are settled in, it would be okay, and instead of the “comfort”, other things would matter more for our happiness.

In concluding my account of why I made the decision to move back to India from the US, I want to emphasize that the notion of what is “better” is highly subjective and personal. I share my experiences not with the intention to hurt or judge anyone who has chosen a different path, but rather to offer a glimpse into my own journey and the factors that influenced my decision. Each individual’s circumstances, aspirations, and values differ, and what may be fulfilling for one person may not hold true for another. It is important to approach such decisions with an open mind, understanding that there is no universal “right” or “wrong” path. Ultimately, it is the pursuit of personal happiness and fulfillment that should guide our choices, respecting the unique narratives and perspectives that shape each individual’s story.

I would like to reiterate that these are not the reasons why I moved to India, just some thoughts that helped me accept the decision. My reasons were like everyone else’s and nothing extraordinary. But I feel I should mention that I was in a privileged situation financially to come back just after the MS.

P.S. I would like to acknowledge that I sought assistance from ChatGPT in crafting this article. Prompting ChatGPT with “write it better” helped me refine the language and structure while retaining the core content and ideas as my own. It’s remarkable to have access to such powerful language models that can enhance our writing process. I am grateful for the collaboration and support provided by AI technology in creating this piece.

P.P.S. Even the above P.S. was written by ChatGPT, and it is praising itself without my prompting it to say that.

Efficacy of Oxford–AstraZeneca COVID-19 (CoviShield) vaccine

Harsh Maheshwari — Mon, 17 May 2021 10:18:19 GMT

There has been a lot of confusion regarding the efficacy of the CoviShield vaccine and its relation to the gap between the two doses. The Indian government has changed the recommended gap between the doses 3 times so far (4 weeks, 6–8 weeks and recently 12–16 weeks), citing “real-life” evidence, especially from the UK. Interestingly, just a day after this decision, the UK decreased the gap from 12 weeks to 8 weeks based on concerns that the ‘Indian’ variant is more infectious and the need to protect their population quickly.

The Indian government’s decision has left many people wondering (especially those who have already received the vaccine with a gap of less than 12 weeks) what the gap between the two doses should ideally be.

This article aims to bring the available information on the efficacy and the gaps together in one place. I am not an expert on the topic and thus will try to stick to the research without providing any opinions.

What is the Oxford-AstraZeneca vaccine?

AstraZeneca COVID-19 Vaccine (Code: AZD1222) is a viral vector vaccine, which means the genetic code of the spike protein (DNA, in this case) is added to another virus called an adenovirus. Adenoviruses are common viruses that typically cause colds or flu-like symptoms. The Oxford-AstraZeneca team used a modified version of a chimpanzee adenovirus, known as ChAdOx1. It can enter cells, but it can’t replicate inside them [1].

Study 1: Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK

The first trial was conducted by AstraZeneca and Oxford, and they published their findings on 8th December 2020 [2].

These are the results:

LD/SD: The priming dose (first dose) was a “low dose” whereas the booster dose (second dose) was a “standard dose”.
SD/SD: Both the doses were “standard dose”.

Gap between the two doses in the study

Interestingly, in this trial the researchers were not able to control the gap between the two doses. The gap varied across the groups, and the above findings include people who received the vaccine with a gap varying from 4 to 12 weeks. The authors mention in the paper: “The timing of priming and booster vaccine administration varied between studies. As protocol amendments to add a booster dose took place when the trials were underway, and owing to the time taken to manufacture and release a new batch of vaccine, doses could not be administered at a 4-week interval.”

Study 2: Single-dose administration and the influence of the timing of the booster dose on immunogenicity and efficacy of ChAdOx1 nCoV-19 (AZD1222) vaccine: a pooled analysis of four randomised trials

The (almost) same group of people published another study [3] on 19th Feb 2021 on the influence of the gap on the immunogenicity of the vaccine.

Findings: “In the participants who received two standard doses, after the second dose, efficacy was higher in those with a longer prime-boost interval (vaccine efficacy 81·3% [95% CI 60·3–91·2] at ≥12 weeks) than in those with a short interval (vaccine efficacy 55·1% [33·0–69·9] at <6 weeks). These observations are supported by immunogenicity data that showed binding antibody responses more than two-fold higher after an interval of 12 or more weeks compared with an interval of less than 6 weeks in those who were aged 18–55 years.”

Results:

Vaccine Efficacy vs Interval for SD/SD group:

x axis: Time interval (in days) between the two doses.

Vaccine Efficacy vs Interval for LD/SD group:

x axis: Time interval (in days) between the two doses.

“Each datapoint shows one estimate of vaccine efficacy calculated in a subset of participants who received two doses of vaccine with a prime-boost interval falling within a 20-day interval. The x-axis shows the midpoint of the interval such that the first datapoint”. So in the above plots, the efficacy reported at an interval of 36 is calculated from the subgroup who received the vaccine with a gap of 26–46 days.

What now?

So, it does seem that increasing the gap increases the efficacy of the Oxford-AstraZeneca vaccine. Given this result, it is natural to wonder why the government started vaccinating the high-risk group with a short gap of 28 days. It is also unclear why the government has changed the recommended gap only now, when the evidence for increased efficacy has been available since February, if not before. According to [4], the Indian government had decided to go ahead with the shorter gap, citing that the scale of India’s vaccination drive necessitated it.

Regarding the decision to increase the gap, the government says it is based on scientific evidence [5]. “When you are in a very difficult situation, the way you are in India, you have to try and figure out ways to get as many people vaccinated as quickly as you can, so I believe that it is a reasonable approach to do,” Anthony Fauci said of the government’s decision to increase the gap.

The decision cannot be based solely on efficacy; other factors influence it as well. The lack of transparency in the government’s decision making and the frequent changes in the guidelines thus leave people confused and doubtful.

At an individual level too, the decision is not straightforward. One should consider the trade-off between a quick but apparently weaker immune response and a better but delayed one. Given the spread, every day is very risky if a person is not vaccinated. [6–7] offer some general advice which might be helpful in making these decisions.

[1] https://www.nytimes.com/interactive/2020/health/oxford-astrazeneca-covid-19-vaccine.html
[2] https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(21)00432-3/fulltext
[3] https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(21)00432-3/fulltext
[4] https://www.theweek.in/health/more/2021/04/23/the-reason-behind-the-longer-gap-between-vaccine-jabs.html
[5] https://www.hindustantimes.com/india-news/saddening-vk-paul-on-claims-that-covishield-dose-gap-widened-owing-to-crunch-101621073799487.html
[6] https://www.hindustantimes.com/india-news/when-to-take-covid-19-vaccine-2nd-dose-when-to-take-1st-dose-if-once-infected-here-s-all-you-need-to-know-101620893941629.html
[7] https://www.youtube.com/watch?v=K3odScka55A

Paper explained: “UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION” — ICLR’17

Harsh Maheshwari — Fri, 06 Nov 2020 08:48:32 GMT

Original Paper can be found here. It was one of the three papers which got Best Paper Award at ICLR 2017.

What to expect from this blog: Summary of the paper and my understanding of the paper mixed with my personal opinions.

1. Crux of the paper

Let us begin by trying to summarise the claims of the paper.

This is the main claim of the paper:

The paper shows how traditional approaches fail to explain why large neural networks generalize well in practice.

To elaborate further on the above statement, let us look at the following points:

‘Traditional approaches’: The generalisation performance of large neural networks is usually attributed either to the model family (and the inductive bias associated with it, e.g. CNNs for images) or to regularisation techniques, both explicit (L2 norm of the weights, Dropout, BatchNorm) and implicit (properties of the optimization algorithm).
Generalization: Generalization refers to the model’s ability to perform as well on unseen data as on the training data, and hence is usually quantified by the generalisation error, i.e., the difference between test error and train error.

The paper brings to light that it is not trivial to answer why neural networks have such good generalisation performance. It does this by showing that neural networks easily fit (memorise) random labels and random data with no significant change in the training behaviour. This is highlighted neatly in the paper by the following two centred, italic notes:

Deep Neural Networks easily fit random labels.

Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.

Other claims of the paper:

A simple 2-layer ReLU network with p = 2n+d parameters can express any labelling of a sample of size n in d dimensions, i.e., the hypothesis class represented by a 2-layer ReLU network with p parameters can shatter datasets of size n in d dimensions.
The properties of the training process of standard neural network architectures don’t change substantially when fitting random labels, which leads to the claim that whatever justification there was for the small generalization error of these networks is not enough.
The statistical learning idea behind (explicit) regularisation, namely confining the hypothesis class to a smaller subset with manageable complexity, is not enough to explain the generalization abilities of deep networks (since the same networks also fit random data).
Implicit regularization: What properties of global minima explain their generalisation? Do all global minima generalise equally?
They call out the fact that understanding generalization is difficult even for a simple linear model. For linear models, they investigate two properties of the minima and check whether these properties indicate the generalization performance of a model.
a) Curvature of the minima: In their construction of the linear case, all the minima have the same curvature, so curvature cannot distinguish between them.
b) Norm of the minima: For the linear model under consideration, assuming zero initial weights, they find that the solution SGD converges to is the minimum-l2-norm solution. Unfortunately, this also doesn’t guarantee better generalization performance.
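
The expressivity claim above (a 2-layer ReLU network with 2n+d parameters fitting any labelling of n points) can be checked numerically. The following is a minimal NumPy sketch of one such construction, not code from the paper: project the inputs onto a random direction a, place n thresholds b between the sorted projections, and solve a triangular system for the output weights w. The function names fit_two_layer_relu and predict are hypothetical, and the sketch assumes the n projections are distinct (true with probability 1 for a random direction and distinct inputs).

```python
import numpy as np

def fit_two_layer_relu(X, y, seed=0):
    """Fit c(x) = sum_j w[j] * max(a.x - b[j], 0) exactly to (X, y)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(d)            # d parameters: projection direction
    z = X @ a                             # one scalar per sample
    order = np.argsort(z)
    z_sorted = z[order]
    # n thresholds interleaved with the sorted projections:
    # b[0] < z_(1) < b[1] < z_(2) < ... < b[n-1] < z_(n)
    b = np.empty(n)
    b[0] = z_sorted[0] - 1.0
    b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0
    # A[i, j] = max(z_(i) - b[j], 0) is lower triangular with a positive diagonal,
    # so the linear system A w = y has a unique solution.
    A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[order])      # n parameters: output weights
    return a, b, w

def predict(X, a, b, w):
    """Evaluate the 2-layer ReLU network on the rows of X."""
    return np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w

rng = np.random.default_rng(1)
n, d = 30, 5
X = rng.standard_normal((n, d))
y = rng.integers(0, 10, size=n).astype(float)   # arbitrary (random) labels
a, b, w = fit_two_layer_relu(X, y)
print("max abs error:", np.abs(predict(X, a, b, w) - y).max())
print("parameters:", a.size + b.size + w.size, "vs 2n + d =", 2 * n + d)
```

The labels here are random on purpose: the construction never looks at any structure in y, which is exactly why such a network can represent any labelling of the sample.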

The paper asks an important question: what makes deep neural networks generalize well? It brings out the fact that none of the trivial answers to this question is correct.
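
The minimum-norm observation for linear models (point b above) can also be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses full-batch gradient descent in place of SGD, synthetic Gaussian data, and a hypothetical helper named gradient_descent_from_zero. Starting from zero weights, every update is a combination of the rows of X, so the iterate stays in the row space of X and ends up at the minimum-l2-norm interpolating solution, which NumPy exposes through the pseudoinverse.

```python
import numpy as np

def gradient_descent_from_zero(X, y, steps=2000):
    """Minimise 0.5 * ||Xw - y||^2 by full-batch gradient descent, starting at w = 0."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / (largest singular value)^2
    for _ in range(steps):
        w -= step * (X.T @ (X @ w - y))
    return w

rng = np.random.default_rng(0)
n, d = 20, 100                                # fewer samples than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_gd = gradient_descent_from_zero(X, y)
w_min_norm = np.linalg.pinv(X) @ y            # minimum-l2-norm solution of Xw = y

# A different exact solution: add a vector from the null space of X.
v = rng.standard_normal(d)
v -= np.linalg.pinv(X) @ (X @ v)              # strip the row-space component of v
w_other = w_min_norm + v

print("distance between GD and min-norm solution:", np.linalg.norm(w_gd - w_min_norm))
print("norm of GD solution:   ", np.linalg.norm(w_gd))
print("norm of other solution:", np.linalg.norm(w_other))
```

Both w_gd and w_other fit the training data exactly, so training error alone cannot tell them apart; the norm only describes which of the many global minima the optimizer picked.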

2. Significance

The question put forward for the readers is why neural networks generalise well. We have all taken advantage of neural networks’ performance in one way or another; they have become immensely popular and are almost everywhere. But we still do not understand what makes them generalise well.

An answer to this question would enable better design of architectures, optimization algorithms and regularisers. Not knowing why something is working well makes it harder to improve and interpret. A satisfactory answer to this question thus would have profound implications in understanding Deep Learning, making it more reliable and robust.

3. Experimental Setup

The experimental results of the paper do not shock me, and there are some similar reactions on the OpenReview forum. It is not shocking to note that the standard regularizers only contribute so much to the generalization performance. It is also not shocking that even with standard regularizations in place, Deep Networks can fit random labels.

The following set of experiments are performed:
1. True Labels: Original dataset
2. Partially Corrupted labels: label of each image is independently corrupted with a probability p
3. Random Labels: p=1
4. Shuffled Pixels: one random permutation of the pixels is applied to all images in train and test set.
5. Random Pixels: Random permutation is applied to each image independently
6. Gaussian: random pixels are generated from a Gaussian distribution with mean and variance matching original image.
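
The dataset variants listed above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the paper's code: the function names are hypothetical, images are assumed to be an array of shape (num_images, ...), a corrupted label is assumed to be redrawn uniformly over the classes, and the Gaussian variant assumes the mean and standard deviation are taken over the whole image array.

```python
import numpy as np

def corrupt_labels(labels, p, num_classes, rng):
    """With probability p, independently replace each label by a uniformly random class."""
    labels = labels.copy()
    mask = rng.random(labels.shape[0]) < p
    labels[mask] = rng.integers(0, num_classes, size=mask.sum())
    return labels

def shuffle_pixels(images, perm):
    """Apply the same pixel permutation to every image (reuse perm for train and test)."""
    flat = images.reshape(images.shape[0], -1)
    return flat[:, perm].reshape(images.shape)

def random_pixels(images, rng):
    """Apply an independent random pixel permutation to each image."""
    flat = images.reshape(images.shape[0], -1)
    idx = np.argsort(rng.random(flat.shape), axis=1)   # one permutation per row
    return np.take_along_axis(flat, idx, axis=1).reshape(images.shape)

def gaussian_images(images, rng):
    """Replace images by Gaussian noise with matching mean and standard deviation."""
    return rng.normal(images.mean(), images.std(), size=images.shape)

rng = np.random.default_rng(0)
images = rng.random((8, 4, 4, 3))            # stand-in for a batch of small RGB images
labels = rng.integers(0, 10, size=8)

partially_corrupted = corrupt_labels(labels, p=0.5, num_classes=10, rng=rng)
random_labels = corrupt_labels(labels, p=1.0, num_classes=10, rng=rng)
perm = rng.permutation(images[0].size)       # drawn once, shared by all images
shuffled = shuffle_pixels(images, perm)
scrambled = random_pixels(images, rng)
noise = gaussian_images(images, rng)
```

Passing the permutation into shuffle_pixels, instead of drawing it inside, makes it easy to apply the identical permutation to both the train and the test set, as experiment 4 requires.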

Standard architectures are trained on the CIFAR10 and ImageNet benchmarks with the same set of hyperparameters, and the training and test accuracies are noted.

The first 3 experiments test whether deep networks can ‘shatter’ datasets of practical sizes. (Strictly, to prove shattering one should show that every possible labelling can be explained by the hypothesis class, but it is reasonable to assume that if a random labelling can be explained, any labelling can be.) The close-to-0 training error in the second and third experiments brings out the fact that even though there is seemingly no humanly explainable relation between the images and the labels, deep networks still learn some function which satisfies the random labelling. This means that the hypothesis class is rich; why, then, does the network learn the ‘correct’ function (correctness based on generalization error) when given human-assigned labels?

The latter 3 experiments show that even if the images are not natural, CNNs are still able to learn functions which give close to 0 training error. In other words, even with the inductive bias of CNN architectures, the hypothesis class is rich enough to learn functions on top of random pixels. CNN architectures thus seem not to help much specifically with natural images, as they fit non-natural images similarly well.

Two attributes differ between random data and true data: 1) Training error: On ImageNet, the training error for random labels is not 0% as it is with the true labels, but it is still very low (~5%) and much better than random chance. 2) Learning characteristics: The paper claims that the learning characteristics (training curves, epochs required) are similar for all the above variants. However, some people feel otherwise (OpenReview), and I would also not want to read too much into it. Although learning characteristics can give a lot of information about the optimization process, the paper does not provide enough insights or experiments to claim something confidently.

Experiment with regularizations
Three regularizers are considered which are very commonly used:
1. Data Augmentation: Domain-specific transformations are performed, like random cropping, hue perturbation etc.
2. Weight Decay: l2 regularization on weights
3. Dropout

Without changing the hyperparameters, the experiments are performed with various regularizers turned on and off. On CIFAR10, the generalization error is very low with or without regularizers. On ImageNet with an Inception network, however, turning off the regularizers resulted in an 18% drop in test Top-1 accuracy.

The authors also observe that data augmentation using known symmetries and changes to the model architecture seem to be more impactful than just using weight decay or preventing low training error. Changing the model architecture changes the hypothesis class and is a way to model the inductive bias, i.e., our knowledge about the problem and domain. It thus helps in reducing the complexity of the model and hence reduces the variance, and if the modelled inductive bias brings the hypothesis class closer to the underlying true function, it also reduces the bias error. When both the bias and variance errors reduce, the generalisation error reduces.

The conclusion from these experiments is that regularizers, when tuned properly, help to improve generalization performance, but they are not the reason why deep neural networks generalize well, because the networks continue to perform well even when they are turned off.
To me this is not very surprising; nobody would have claimed that it is the regularizers which bring the generalization error from 90% to 10%.

4. What can be learnt from the paper

We do not know why deep networks generalize well, and the obvious answers are not the correct answers.
Hypothesis classes are rich enough to contain functions which explain arbitrary labellings and optimization algorithms do settle on such functions.

5. Other insights, gaps and some follow up references

Why are rich hypothesis spaces undesirable?
According to learning theory, any hypothesis class which has infinite VC dimension is not PAC learnable. This means that if a hypothesis class can explain all possible labellings of arbitrarily large datasets, then it is impossible to guarantee a Probably Approximately Correct function. Philosophically, if a hypothesis class is able to explain any fact, then it is useless. [2] noted that a tight bound on the VC dimension of feedforward networks with ReLU activations is VC-dim = O(k ∗ dim(w)); hence there exist datasets of size n up to VC-dim that these networks can shatter, and practical datasets usually fall in this range as their size is far lower than the number of parameters.
The assumption in thinking about generalization is that if the network performs similarly on unseen data, we believe it is generalizing well. But we forget that the unseen data is usually very similar or close to the training data. As noted by [1], if the test images are slightly changed, the test error rises significantly, which hints at a need for better quantification and evaluation of generalisation performance itself.
Existing work on neural network pruning (e.g., [4]) demonstrates that the function learned by a neural network can often be represented with fewer parameters. [5] shows the importance of weight initialisation and demonstrates empirically that there are subnetworks which, when initialised properly (and hence directed towards better minima), can reach the same performance as the larger network (with its more complex hypothesis class), whereas the same subnetwork architecture is unable to reach the best performance when initialised randomly.
A comparison of the performance of different global minima, as done in [3], seems like a step in the right direction.
Side note: It is interesting to me that if a Machine Learning algorithm learns a function which explains human-comprehensible labels, we call that ‘learning’; otherwise we call it ‘memorization’. We do want our algorithms to learn what we learn, we want to give them the knowledge that we possess, and thus it makes sense to tag any other kind of learning as just plain memorization. But technically it is not memorization of labels just because it is not comprehensible to humans; it is learning some other function which we do not comprehend and which does not match the way we perceive our world.

[1] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019.

[2] N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.

[3] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In NIPS, 2017.

[4] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.

[5] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.

Social Media — An Echo Chamber for politics

Harsh Maheshwari — Mon, 23 Dec 2019 14:40:53 GMT

The impact of social media in politics has been discussed a lot.

Users posting aggressive and extreme opinions receive more attention and evoke stronger sentiments.
To make something go viral, you have to hit the extreme end of sentiments, which gives dangerous opinions an opportunity to reach millions of people. To be sensational and viral, you can’t rely on logic or data; instead, you need to be full of sentiment and provoke the same sentiment in your readers. The more extreme an opinion is, the more attention it gets and the more people it reaches. Slowly, all these websites fill up with content that only provokes sentiments, and data takes a back seat.
Sentiments quite easily dominate logic and data in such situations.

The other issue is of social media providing an echo chamber of opinions to its users.

Social media of all sorts provides a feed or recommendations of content similar to what you have watched or liked before. Well, that’s the work of a recommender system, which is used aggressively on all social media platforms.

The problem, though, is that you are stuck in an echo chamber where you are bombarded with content you already agree with.
You constantly consume content with similar opinions and similarly aggressive tones.
This reinforces your beliefs and opinions and gives you the (over)confidence that whatever you believe and say is 100% true, and that others, who have a different set of opinions, are either stupid or don’t have a complete understanding.

This is leading to a strong polarization in opinions and people reaching extremes.
This is damaging!

Instead, these platforms should provide a window through which you can listen to the other side of the debate: some sort of leakage in the current echo chambers.

I feel the impact of such improved systems can be huge and can help a lot in giving people the opportunity to better judge and adjudicate the debate.

Hence I request everyone reading this to please question everything they hear or read.
Irrespective of whatever you believe in, always think of yourself in a debate with the things you read and try to question it.
Use your own mind and logical analysis before sharing anything and everything which just supports your cause.

Udaipur — Where to travel?

Harsh Maheshwari — Tue, 06 Aug 2019 15:47:41 GMT

Udaipur — Where to travel? Where to eat?

I will describe to the best possible extent all the places worth visiting so that you can decide where you would like to go. I will also mention good places to eat at the end of this blog.

1. Fatehsagar Lake:

It is a man-made lake and, I can safely say, the favourite place of most Udaipurites. In the evening people come here just to sit and hang out. The place is mostly filled with students, but for me it is still always peaceful to come here.

The best time to visit is in the evening or early morning. There is a roadside sandwich place whose sandwiches are famous, and the coffee at the ‘Sai Sagar’ coffee shop is famous too.

2. Lake Pichola

It is another artificial lake. The hotel “Lake Palace” and “Jag Mandir” are both situated inside this lake, and the only way to reach them is by boat. If you plan to go here, I suggest you go for a boat ride. It is a 20-minute ride but a very pleasant one. There are two kinds of boat rides:

a. With Entry into Jag Mandir: For this, you will have to purchase tickets from City Palace. In this boat ride, you will also get to see Jag Mandir. I suggest you go for this only if you have time. Jag Mandir is good but nothing extraordinary.

b. Without Entry to Jag Mandir: For this, you can get the tickets easily at Pichola Pal, opposite Dudh Talai Lake.

Both boat rides are the same apart from the Jag Mandir entry.

(The white building is Lake Palace, and behind it, the yellow one is City Palace)

3. Neemach Mata Temple

It is a temple located at the top of a hill. It takes around 30 minutes to reach the top. It is one of the best places to witness sunrise or sunset and get a view of the entire city.

(The lake in the first image is Fateh Sagar Lake)

4. Sajjangarh Fort

Also known as the Monsoon Palace, it is situated at the top of another hill, higher than that of Neemach Mata Temple. Although the best time to visit is during the monsoon, it is still a good place to get a view of the Aravali Hills at other times. It is said that Maharana Sajjan Singh built it at the top of the hill to get a view of his ancestral home, Chittaurgarh. You can get auto-rickshaws or cabs to take you to the top, and the same vehicle will have to bring you back: there are no cab services at the top, so you will have to make the auto-rickshaw or cab wait at the Monsoon Palace.

5. Ambrai Ghat

It is a lesser-known place that is gradually getting famous among students, and thus on tourist guide websites as well, and it is one of my favourite places in Udaipur. It is just a small “ghat” (a set of steps leading down to a river or lake, in this case Lake Pichola). It is generally quiet and peaceful. The best time to visit is after sunset, at night. It is right in front of City Palace and thus the best place to witness the lighting and beauty of the City Palace at night.

6. City Palace Museum

You can understand the history of Udaipur and Maharana Pratap here. It is also the home to Arvind Singh Ji Mewar who is the son of Bhagwat Singh Ji Mewar who was the ruler of Mewar Province. My old school is also situated inside the city palace complex.

Other places to visit:

Karni Mata Temple (Another trek like Neemach Mata Temple, ropeway also available), Badi Lake (Little outside the city), Jagdish Temple (near City Palace), Gangaur Ghat (near City Palace).

Famous Places to eat:

1. Krishna Daal Baati:

It is famous for daal baati, a traditional Rajasthani dish. It is situated at Suraj pole, which is one of the prime locations and easy to reach.

2. Ambrai Restaurant:

It is near Ambrai Ghat (5th point above) and is by the lakeside, so the view is spectacular. If you are going here, you can skip Ambrai Ghat, as both essentially have the same view.

3. Jheel’s Ginger Coffee Bar & Bakery:

It is a lakeside cafe famous for the view, though the food isn’t the best. It is near City Palace.

4. Pandit Paav Bhaji:

It is my favourite paav bhaji. The ambience is not very good, and neither is the place as such, but the taste is amazing. There are a lot of duplicate paav bhaji stalls and restaurants by the same name, so I suggest you go to this particular place only.