Stories by Arian Amani on Medium

Why Perturbation Prediction Needs Much Better Metrics

Arian Amani — Thu, 19 Mar 2026 07:31:00 GMT

What if I told you that some of the most celebrated deep learning models for predicting genetic perturbations are — statistically speaking — barely better than just… averaging everything and calling it a day?

That’s not a provocative exaggeration. That’s what you get when you look carefully at the numbers. And honestly? It’s one of the most important findings to come out of computational biology in recent years, because the implications aren’t confined to academic papers. If pharma companies and biotech startups are investing millions or billions of dollars in computational perturbation models to eventually find their next drug target, they need to know whether those models are actually predicting biology, or just predicting the average. There’s a very big difference. And right now, our usual metrics often can’t tell the two apart.

Cover made with Gemini

Perturbation Prediction, Briefly Explained

Let me back up for anyone who isn’t deep in single-cell genomics already (I’ll be quick about this).

A perturbation — in this context — is any intervention that changes what a gene does in a cell. The most common examples are genetic knockouts (using CRISPR to silence a gene entirely), gene activations (CRISPRa), and gene inhibitions (CRISPRi). Think of it as deliberately adjusting one control in a highly complex machine to see how the whole system responds.

That machine is a living cell. And cells are extraordinarily complex systems. Each cell expresses thousands of genes simultaneously, and those genes all talk to each other through regulatory networks. When you perturb one gene — say, you knock it out — you don’t just silence that gene. You potentially ripple changes through hundreds of other genes. Some go up. Some go down. Most don’t change at all.

Gene expression — which basically means how actively each gene is being “read” and turned into protein — is what we measure to understand this ripple effect. Using technologies like Perturb-seq, we can now measure the gene expression of thousands of individual cells after perturbations, at scale.

So what is perturbation prediction, exactly? It’s the task of training a model to answer the question: given a cell at baseline, what happens to its gene expression profile after a specific perturbation? If your model can answer that reliably for perturbations it has never seen before — out-of-distribution generalization — you’ve essentially built an in silico screen. You can ask “what if we knock out gene X?” without actually running the experiment.

The practical stakes are enormous. Bringing a single compound to market costs north of $2.5 billion and takes 12–16 years on average. A working perturbation prediction model could dramatically accelerate target identification by flagging which genes are worth pursuing before a single pipette is lifted. That’s why this field has attracted so much talent and investment. (Yeah, it’s going a bit too crazy now; everywhere you look there’s a huge VIRTUAL CELL hitting you in the face!)

The Problem — How Do We Know If a Model Is Actually Good?

Here’s the thing. Building a model is one challenge. Knowing whether it actually works is a different challenge entirely — and in perturbation prediction, we’ve been doing this part poorly.

Every machine learning model needs an evaluation metric — a score that tells you how close your predictions are to reality. The standard choices in perturbation prediction are things like:

Pearson correlation — measures how well the pattern of predicted gene expression changes matches the true pattern. Think of it as: does the model predict that the right genes go up and the right genes go down?
R² — measures how much of the variance in the true outcomes your predictions explain.
MSE (mean squared error) — the raw average prediction error, gene by gene.
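
To make these three metrics concrete, here is a minimal NumPy sketch. The function names and the five-gene toy vectors are made up for illustration and do not come from either paper. Pearson is computed on expression changes relative to control, which is a common convention, but the exact setup varies between papers.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mse(y_true, y_pred):
    """Mean squared error, averaged over genes."""
    return float(np.mean((y_true - y_pred) ** 2))

# Toy example with five genes (made-up numbers).
control = np.array([5.0, 3.0, 0.5, 2.0, 1.0])    # mean expression in control cells
true_pert = np.array([5.1, 3.0, 2.5, 2.0, 0.2])  # observed after the perturbation
pred_pert = np.array([5.0, 3.1, 2.0, 2.1, 0.4])  # model prediction

# Pearson on the change relative to control ("delta").
delta_pearson = pearson(pred_pert - control, true_pert - control)

print("Pearson(delta):", delta_pearson)
print("R2:", r2(true_pert, pred_pert))
print("MSE:", mse(true_pert, pred_pert))
```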

These all sound reasonable. And they are reasonable… until you notice a critical structural problem with perturbation data.

Most genes don’t respond to any given perturbation. When you knock out a gene, maybe 5–20 genes actually change in a meaningful way — the rest of the transcriptome just sits there, unaffected. But your model is being evaluated on all genes simultaneously, thousands of them. The vast majority of those genes are noise relative to the perturbation.

So what happens when your metric is measuring mostly noise? You get very impressive-looking numbers… that mean almost nothing.

The Baseline Problem — When “Mean Prediction” Wins

This is where it gets interesting.

A non-informative baseline — sometimes called the mean-predictor baseline — is the simplest possible “model” you can imagine. It doesn’t learn anything about specific perturbations. It doesn’t use gene networks, or pre-trained embeddings, or attention mechanisms. It just takes the average gene expression across all perturbed cells in the training set, and predicts that for every single perturbation.

No parameters. No learning. Just the plain average of all perturbed cells in your dataset.

And here’s what multiple rigorous benchmarking studies have now found: this baseline is surprisingly competitive with state-of-the-art deep learning models.

Two complementary papers tackle this from different angles. The Systema paper (Viñas Torné et al., 2025, Nature Biotechnology) found that, using standard metrics like Pearson correlation computed on all genes, the simple perturbed mean baseline “outperformed other methods across all datasets” when predicting unseen gene perturbations. Miller et al. (2025) independently confirmed that common metrics like MSE and control-referenced Pearson correlation are “often poorly calibrated” — meaning they frequently rank null predictors above actually informative models.

Systema Figure 1c (Viñas Torné et al., 2025, Nature Biotechnology)

Why does this happen? Let me give you an analogy that I think makes it click immediately.

Imagine a weather forecaster whose only strategy is to predict “no rain” every single day. If you live in the Sahara, that forecaster is going to look really good on accuracy. “No rain” is correct maybe 90%+ of the time. The metric (raw accuracy) doesn’t care that the forecaster has no idea what weather actually does. It just sees correct predictions. This is a classic case of a poorly chosen ML metric.

Perturbation prediction has exactly this structure. If 95% of genes don’t change after a perturbation, and your model correctly predicts “no change” for all of them (which the mean baseline does, approximately), you get a high Pearson correlation score — even if you completely fail on the 5% of genes that actually matter.

There’s another way to look at this. What if a perturbation prediction model fails to learn the effect of each perturbation separately, and just collapses into learning one general “perturbed” distribution, or even one general “cell” distribution? Then every perturbed cell it generates will look like the average of all perturbed cells, or even the average of all cells in the dataset. The model has only learned what a “cell” or a “perturbed cell” generally looks like, and it keeps sampling from that learned distribution. In that case, a plain average of all perturbed cells will give a very similar or even better Pearson! You can test this yourself very quickly and easily, and it gives you a good idea of the problem. I tried it on the Replogle K562 dataset:

You get a Pearson of 0.99 just by predicting the average of all perturbed cells for ANY perturbation!
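
If you don’t have the dataset at hand, the same effect is easy to reproduce on synthetic data. The sketch below is not the Replogle K562 experiment; it simulates pseudo-bulk profiles where every perturbation shares the same baseline expression and only about 20 genes out of 2,000 change, then scores the mean of the training perturbations against held-out perturbations. All sizes and noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_perts, n_de = 2000, 60, 20

# Baseline expression level of each gene, shared by every perturbation.
base = rng.gamma(shape=2.0, scale=1.0, size=n_genes)

# One pseudo-bulk profile per perturbation: baseline plus a sparse specific effect.
profiles = np.tile(base, (n_perts, 1))
for p in range(n_perts):
    de_genes = rng.choice(n_genes, size=n_de, replace=False)
    profiles[p, de_genes] += rng.normal(0.0, 1.5, size=n_de)
profiles += rng.normal(0.0, 0.05, size=profiles.shape)  # measurement noise

train, test = profiles[:40], profiles[40:]

# The "model": one average profile, predicted for every held-out perturbation.
mean_baseline = train.mean(axis=0)

scores = [np.corrcoef(mean_baseline, truth)[0, 1] for truth in test]
mean_pearson = float(np.mean(scores))
print("Mean Pearson of the mean baseline on held-out perturbations:", mean_pearson)
```

The score is high because almost all of the variance across genes comes from the shared baseline, which the average captures perfectly well without knowing anything about individual perturbations.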

The genes that actually matter are called differentially expressed genes (DE genes) — genes whose expression genuinely shifts in response to a specific perturbation. This is where the biology lives. This is what matters for understanding mechanisms and finding drug targets.

The Systema paper quantifies something they call systematic variation — the consistent transcriptional differences between perturbed and control cells that arise not from specific perturbation effects but from confounders like cell-cycle differences, stress responses, or selection biases in which genes were chosen for the perturbation panel. They measure this using cosine similarity between perturbation-specific shifts and the average perturbation effect across all perturbations. And they find that it’s present across many contexts. In the widely-used Adamson and Norman datasets, systematic variation is high — meaning the perturbations share a lot of “background” signal that has nothing to do with the specific biology of each perturbation. The correlation between systematic variation and model performance scores? 0.91–0.95 in some cases.
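
Here is a rough sketch of that cosine-similarity idea as described above. It is one reading of the description, not the Systema implementation, and the function names and simulated data are assumptions. Each perturbation’s shift from control is compared to the average shift across all perturbations.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def systematic_variation(control_mean, pert_means):
    """Cosine similarity of each perturbation's shift to the average shift.

    control_mean: (n_genes,) mean expression of control cells
    pert_means:   (n_perts, n_genes) mean expression per perturbation
    """
    shifts = pert_means - control_mean   # perturbation-specific shifts
    avg_shift = shifts.mean(axis=0)      # average perturbation effect
    return np.array([cosine(s, avg_shift) for s in shifts])

rng = np.random.default_rng(0)
n_perts, n_genes = 30, 500
control_mean = rng.gamma(2.0, 1.0, size=n_genes)
specific = rng.normal(0.0, 0.3, size=(n_perts, n_genes))  # per-perturbation effects
shared = rng.normal(0.0, 1.0, size=n_genes)               # shift common to all perturbations

high_sv = systematic_variation(control_mean, control_mean + shared + specific)
low_sv = systematic_variation(control_mean, control_mean + specific)
print("with shared shift:   ", high_sv.mean())
print("without shared shift:", low_sv.mean())
```

When all perturbations share a strong common shift, every perturbation points in nearly the same direction as the average, so the similarities sit close to 1.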

Systema Figure 3a (Viñas Torné et al., 2025, Nature Biotechnology)

Systema Figure 3c (Viñas Torné et al., 2025, Nature Biotechnology)

The authors put it bluntly: high predictive performance in standard metrics “may reflect dataset-specific confounding factors rather than accurate modeling of perturbation biology.”

To say it plainly, without any jargon: this figure shows that models tend to perform well on Adamson (when reporting plain Pearson) because the dataset has a lot of systematic variation (SV). The models are learning and predicting those general effects very well (you could even call them batch effects, treating “perturbed” as a batch because of the SV) instead of actually learning per-perturbation biological effects.

That’s a striking finding. And it’s not a criticism of any single lab or model — it’s a community-wide issue with how evaluation has been set up.

What Systema Proposes — A Better Evaluation Framework

So what do we actually do about this?

The Systema framework (Viñas Torné, Wiatrak, Piran, Fan, Jiang, Teichmann, Nitzan & Brbić, Nature Biotechnology 2025) is a serious attempt to answer that question. Rather than patching an existing metric, it rethinks the reference point used for evaluation.

Standard metrics compute how well a model predicts the shift from control cells — meaning, compared to unperturbed cells, how does gene expression change? The problem is that this comparison inflates scores when all perturbations share a common background shift (systematic variation). Even a model that predicts the same generic “perturbed” profile for every perturbation looks good because it’s capturing that background.

Systema’s key innovation is to change the reference. Instead of comparing predicted profiles against control cells, it compares them against the perturbed centroid — the average of all perturbation-specific centroids. This refocuses evaluation on what makes each perturbation distinct from other perturbations, rather than what makes it distinct from baseline.
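
A small simulation shows what changing the reference does. This is a sketch under assumed, synthetic data, not the Systema code: a “collapsed” model predicts roughly the same generic perturbed profile for every held-out perturbation, and the same Pearson-on-delta score is computed once against control and once against the perturbed centroid.

```python
import numpy as np

def pearson_delta(pred, truth, reference):
    """Pearson correlation of predicted vs. true shift away from a reference profile."""
    return float(np.corrcoef(pred - reference, truth - reference)[0, 1])

rng = np.random.default_rng(0)
n_genes, n_perts, n_de = 1000, 40, 20

control = rng.gamma(2.0, 1.0, size=n_genes)
shared_shift = rng.normal(0.0, 1.0, size=n_genes)  # systematic variation

# True profiles: control + shared shift + a sparse perturbation-specific effect.
truth = np.tile(control + shared_shift, (n_perts, 1))
for p in range(n_perts):
    de_genes = rng.choice(n_genes, size=n_de, replace=False)
    truth[p, de_genes] += rng.normal(0.0, 2.0, size=n_de)

train, test = truth[:30], truth[30:]
perturbed_centroid = train.mean(axis=0)  # average of the training perturbation centroids

# Collapsed model: the same generic "perturbed" profile (plus a little noise) every time.
preds = perturbed_centroid + rng.normal(0.0, 0.05, size=test.shape)

vs_control = float(np.mean([pearson_delta(p, t, control) for p, t in zip(preds, test)]))
vs_centroid = float(np.mean([pearson_delta(p, t, perturbed_centroid) for p, t in zip(preds, test)]))
print("reference = control:           ", vs_control)
print("reference = perturbed centroid:", vs_centroid)
```

Against control, the collapsed model gets credit for the shared shift. Against the perturbed centroid, that credit disappears and only perturbation-specific signal can raise the score.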

Systema Figure 4a (Viñas Torné et al., 2025, Nature Biotechnology). Compare this to Figure 3a above to see the huge difference between the vectors: now they actually point in different directions, rather than in one direction with slight changes. This is the vector we want to evaluate.

When you apply this corrected reference, something dramatic happens: all the previously impressive scores crash. The mean baseline, which was matching state-of-the-art models under standard metrics, collapses to near-zero correlation. And suddenly you can actually see differentiation between models: fine-tuned scGPT, for instance, achieves marginally higher scores than the other methods in this benchmark.

Systema Figure 4c (Viñas Torné et al., 2025, Nature Biotechnology)

Systema also introduces a metric called centroid accuracy, which measures whether a model’s predicted post-perturbation profile is closer to the correct ground-truth centroid than to the centroids of other perturbations. Think of it as a nearest-neighbor accuracy in expression space. Metrics like this help us separate high-quality generations from generations that just give us an estimate of what any cell would look like on average.
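
Taking the nearest-neighbor description literally, a minimal sketch could look like the following. This is a simplified top-1 version with Euclidean distance; the exact definition in the Systema paper may differ (for example, it may score pairwise comparisons rather than a single nearest neighbor), so treat the function as an illustration only.

```python
import numpy as np

def centroid_accuracy(pred, true_centroids):
    """Fraction of perturbations whose prediction is nearest to its own true centroid.

    pred, true_centroids: (n_perts, n_genes); row i of both refers to perturbation i.
    """
    # Pairwise Euclidean distances: dists[i, j] = ||pred_i - centroid_j||.
    dists = np.linalg.norm(pred[:, None, :] - true_centroids[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return float(np.mean(nearest == np.arange(len(pred))))

rng = np.random.default_rng(0)
n_perts, n_genes = 20, 200
centroids = rng.normal(0.0, 1.0, size=(n_perts, n_genes))

informative = centroids + rng.normal(0.0, 0.1, size=centroids.shape)  # close to the truth
mean_predictor = np.tile(centroids.mean(axis=0), (n_perts, 1))        # same profile for all

acc_informative = centroid_accuracy(informative, centroids)
acc_mean = centroid_accuracy(mean_predictor, centroids)
print("informative model:", acc_informative)
print("mean predictor:   ", acc_mean)
```

A predictor that outputs the same profile for every perturbation can be nearest to at most one centroid, so it cannot score above 1/n_perts here, no matter how good its Pearson looks.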

Systema Figure 5a (Viñas Torné et al., 2025, Nature Biotechnology)

For researchers reading this: Systema is open-source at github.com/mlbio-epfl/systema. All in all, it is a good benchmark, but I wouldn’t choose my models based on these results, as in my opinion there are many more models out there that perform much better. But the metrics? BAM! These are THE WAY to go on evaluation.

The takeaway: a model that looks state-of-the-art on one metric can look mediocre on another. The community needs agreed-upon standards — and Systema is a serious attempt at building those.

A Complementary Lens — Miller et al. on Metric Calibration

Systema diagnoses one problem: standard metrics are inflated by systematic variation, making the mean baseline look like a serious competitor. But there’s a second, related question that Systema doesn’t fully resolve: even if we correct for systematic variation, how do we know whether any metric is sensitive enough to detect genuine biological learning in the first place? That’s the question Miller et al. (2025) take on, and their answer adds an important layer to the picture. Where Systema asks “are we measuring the right thing?”, Miller et al. ask “is our measurement instrument sensitive enough to register a difference at all?”

Their argument: the problem isn’t that deep learning models fail to learn biology. The problem is that the metrics we’ve been using can’t detect whether a model is learning biology. They call this metric calibration — the degree to which a metric actually improves in response to meaningful biological signal.

Figure 1b,c (Miller et al., 2025, bioRxiv)

To measure calibration, they introduce a clever tool called the Dynamic Range Fraction (DRF). The idea: you need both a negative control (the mean baseline — a model that predicts literally nothing informative) and a positive control (something you know should score well — they use an “interpolated duplicate” baseline that blends the mean with actual held-out data weighted by differential expression strength). The DRF measures how much of the gap between these two extremes is captured by the metric for a given model. A well-calibrated metric spreads models out across that range. A poorly calibrated one bunches everything near the negative control.
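
One plausible way to write that down is as a simple ratio: where does the model’s score fall between the negative and the positive control? The formula below is a reading of the description above, not a formula copied from Miller et al., and the function name and example scores are invented.

```python
def dynamic_range_fraction(model_score, neg_score, pos_score):
    """Position of a model's score between negative and positive control (assumed formula).

    0.0 means "no better than the negative control" and 1.0 means "as good as the
    positive control". Because the same subtraction appears in numerator and
    denominator, this works for higher-is-better and lower-is-better metrics alike.
    """
    span = pos_score - neg_score
    if span == 0:
        raise ValueError("Positive and negative control score the same; the metric has no range.")
    return (model_score - neg_score) / span

# Hypothetical scores for a higher-is-better metric (e.g. a correlation).
drf_corr = dynamic_range_fraction(model_score=0.5, neg_score=0.0, pos_score=1.0)

# Hypothetical scores for a lower-is-better metric (e.g. an error).
drf_err = dynamic_range_fraction(model_score=0.6, neg_score=1.0, pos_score=0.2)

print(drf_corr, drf_err)
```

If the span between the two controls is tiny for some metric, that alone tells you the metric has very little room to separate good models from bad ones on that dataset.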

Figure 2c (Miller et al., 2025, bioRxiv)

To put it very simply, it shows whether a metric will register an actual difference between being given just an average of all cells and being given the target perturbed cells themselves! This is quite important, no? :)

Figure 2a,b (Miller et al., 2025, bioRxiv)

Just to explain interpolated duplicates a bit better for those who are curious: for each perturbation X, it is a weighted average of two vectors. Vector M is the average of all cells/perturbed cells, and vector P is the average of perturbation X cells (well, actually a subset of them, but that doesn’t really matter now). As you guessed, M is a really good representation of any cell population in general for most non-DE genes (housekeeping, etc.), and P is a good representation of our perturbation X, whose main difference from M is going to be in its differentially expressed genes. So they take a per-gene weighted average of M and P as (1 - p) * P + p * M, where p is that gene’s p-value from the DE analysis.

Un-mathing this: for each gene, if it looks significantly different in perturbation X, use the value from P, and if it doesn’t look significantly different, use the value from M.
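
In code, the interpolation is a one-liner. The sketch below follows the per-gene formula described above, (1 - p) * P + p * M, with made-up numbers for three genes; the function and argument names are illustrative, not taken from the paper’s code.

```python
import numpy as np

def interpolated_duplicate(pert_mean, global_mean, p_values):
    """Per-gene blend of the perturbation mean (P) and the global mean (M).

    Genes with a small DE p-value keep the perturbation-specific value,
    genes with a large p-value fall back to the global average.
    """
    p = np.clip(np.asarray(p_values, dtype=float), 0.0, 1.0)
    return (1.0 - p) * np.asarray(pert_mean) + p * np.asarray(global_mean)

P = np.array([1.0, 2.0, 3.0])   # mean of perturbation X cells
M = np.array([0.0, 0.0, 0.0])   # mean of all (perturbed) cells
p = np.array([0.0, 1.0, 0.5])   # DE p-values: clearly DE, clearly not DE, in between

blend = interpolated_duplicate(P, M, p)
print(blend)  # first gene taken from P, second from M, third halfway between
```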

Their finding: across 14 Perturb-seq datasets and 13 metrics, standard metrics like MSE and Pearson(Δ control) are indeed often poorly calibrated — particularly for perturbations with few differentially expressed genes. But metrics like weighted MSE (WMSE), weighted R², and Normalized Inverse Rank (NIR) — which give more weight to the genes that actually change — are consistently well-calibrated.
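
To see why weighting helps, here is a toy comparison of plain MSE and a weighted MSE. The weighting used here (absolute true effect plus a small constant) is a stand-in chosen for illustration; Miller et al. derive their weights from differential expression strength, and their exact scheme is not reproduced here. The null predictor says “nothing changes”, while the informative predictor gets the 10 real effects roughly right but is noisy everywhere.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def weighted_mse(y_true, y_pred, weights):
    """Squared error averaged with per-gene weights (normalized to sum to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(np.sum(w * (y_true - y_pred) ** 2))

rng = np.random.default_rng(0)
n_genes, n_de = 1000, 10

true_delta = np.zeros(n_genes)
true_delta[:n_de] = 2.0                      # only 10 genes really respond

null_pred = np.zeros(n_genes)                # "no change" for every gene
informative_pred = 0.8 * true_delta + rng.normal(0.0, 0.3, size=n_genes)

weights = np.abs(true_delta) + 1e-3          # assumed weighting: emphasize responding genes

print("MSE   null:", mse(true_delta, null_pred),
      " informative:", mse(true_delta, informative_pred))
print("WMSE  null:", weighted_mse(true_delta, null_pred, weights),
      " informative:", weighted_mse(true_delta, informative_pred, weights))
```

In this setup plain MSE prefers the null predictor, because noise on the 990 unaffected genes dominates the average, while the weighted version prefers the model that actually captured the responding genes.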

Figure 2d,e (Miller et al., 2025, bioRxiv)

And when you use those metrics? Deep learning models — PRESAGE, scLambda, fMLP-GenePT — significantly and consistently outperform the mean baseline. The models are learning something real. We just weren’t measuring it correctly before.

As someone working on perturbation prediction myself, I find this genuinely reassuring. But it doesn’t let us off the hook: the field needs to standardize around well-calibrated metrics, urgently.

One last thing to point out here: I still believe simple linear models do very well, even in the study above. I suspect that using a better representation in the linear model could bring it to results as good as the foundation models, if not better. I’m generally not a huge fan of transformer-based foundation models in the scRNA world. To be tested :)

Why This Matters Beyond Academia — The Pharma Angle

Let me zoom out for a moment, because I don’t want the non-technical audience reading this to have tuned out during the centroid accuracy discussion :)

Drug discovery is brutally expensive. We’re talking about processes that routinely exceed $2.5 billion and 12–16 years per approved drug, with failure rates above 90% in clinical trials. A major reason targets fail in the clinic is that the preclinical models used to validate them didn’t accurately predict human biology.

Perturbation prediction models are being positioned as a way to change this — to allow in silico screens at scale, to prioritize targets before animal studies, to guide CRISPR experiments, to build what some researchers are calling “virtual cells.” Companies are investing serious capital on the premise that these models work.

If those models are being evaluated with metrics that can’t distinguish actual biological signal from statistical noise, then the confidence we’re placing in their predictions is misplaced. A model that scores 0.8 under Pearson(Δ control) might be capturing mostly systematic variation, not perturbation-specific biology at all. Deploy that model in a real drug pipeline, and you’re making multi-million-dollar decisions based on little more than a sophisticated mean prediction.

That’s not just a research paper problem. That’s a resource misallocation problem that costs years and hundreds of millions of dollars.

Bad metrics don’t just produce bad papers — they produce bad drugs.

What Should Actually Change

As someone who works on perturbation prediction every day, here’s my honest take on what the community needs to do differently.

1. Always report performance on DE genes, not just all genes. If a paper reports Pearson correlation on all 5,000 genes without also reporting performance on the top 20–50 differentially expressed genes per perturbation, it’s hiding the most important information. Make DE-gene-focused metrics the default, not an afterthought. Both papers agree on this. It needs to become a community norm.

2. Always include a mean-predictor baseline — and report it prominently. If your model can’t beat the mean, the paper needs to say so clearly and deal with what that means. Not bury it in a supplementary figure. The mean baseline is embarrassingly easy to compute and should be Table 1, Column 1, in every perturbation prediction paper going forward.

3. Adopt well-calibrated metrics. The Miller et al. framework gives us a principled way to check whether our metrics can actually distinguish good models from null models in a given dataset. WMSE, weighted R², and NIR consistently pass this test. Standard MSE and Pearson(Δ control) often don’t — especially in large-scale datasets with many weak perturbations. Use the former group.

4. Use standardized benchmark splits. The Systema benchmark spans ten datasets, with standardized train/test splits evaluated across three independent runs for each. Cross-paper comparisons are meaningless when everyone uses different splits. Adopt shared benchmarks. Leaderboards are only useful if everyone is playing on the same field.

5. Be honest about what these models can and can’t do. Perturbation prediction is genuinely hard. Generalizing to completely unseen perturbations, predicting perturbation-specific effects rather than systematic variation, handling weak perturbations with few DE genes — these are unsolved problems. Pretending otherwise doesn’t help anyone. The field moves faster when we’re honest about where the frontiers actually are.

Wrapping Up

Here’s what I want you to take away from all of this.

We’re making progress in perturbation prediction. The deep learning models are learning something real — Miller et al. show this convincingly when you use the right lens. But the apparent progress reported in many papers is overstated, because the metrics we’ve been using reward models for capturing systematic background noise rather than genuine perturbation biology. Fixing that starts with metrics.

The good news: the fix isn’t some distant research problem. We have the tools. Systema gives us a standardized evaluation framework with better reference points. Miller et al. give us a principled framework for checking whether our metrics are even calibrated. The community just needs to actually use them.

As someone whose daily work literally involves building perturbation prediction models, this isn’t abstract to me. I care deeply about whether the models I work on are actually capturing biology — not just gaming a leaderboard. Papers like these two are exactly the kind of uncomfortable self-examination that moves science forward, and I’m grateful they exist.

Have thoughts? Spotted something I got wrong? Drop a comment below — I’d love to hear from people working in similar areas. The more the computational biology, ML, and pharma communities talk to each other about what “good enough” actually means here, the faster we’ll get there.

(And if you’ve made it this far — thank you. Truly. This stuff matters.)

About Me

Arian Amani is a Machine Learning Scientist at AI VIVO (Cambridge, UK) and a Machine Learning Researcher at the Lotfollahi Lab, Wellcome Sanger Institute. He works on generative modeling, drug discovery, and computational biology, with a focus on building AI systems that bridge single-cell perturbation biology and drug discovery — which is precisely why the metrics question in this article is not abstract to him :)

Website: ArianAmani.com
LinkedIn: linkedin.com/in/arianamani
Twitter: twitter.com/ArianAmaani

Let’s get in touch :)

References

Viñas Torné, R., Wiatrak, M., Piran, Z. et al. Systema: a framework for evaluating genetic perturbation response prediction beyond systematic variation. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02777-8
Miller, H. E., Mejia, G. M., Leblanc, F. J. A. et al. Deep learning-based genetic perturbation models do outperform uninformative baselines on well-calibrated metrics. bioRxiv 2025.10.20.683304 (2025). https://doi.org/10.1101/2025.10.20.683304
Cui, H., Wang, C., Maan, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 21, 1470–1480 (2024). https://doi.org/10.1038/s41592-024-02201-0

Leveraging Machine Learning to Predict Cellular Behavior in Drug Treatments and Discovery

Arian Amani — Tue, 27 Aug 2024 11:55:02 GMT

Leveraging Machine Learning to Predict Cellular Behavior in Drug Treatments

Rambling about me not writing and me now writing, please skip :)

Ok, it’s been a looooooong time since the last time I wrote something (A Deep Learning Road Map And Where To Start), but I always wanted to keep it going and write more, I was just too lazy. So now I’m planning to make up for that! (I’m gonna be honest, writing makes your LinkedIn look better too lol)

I wanted to start with a more basic introductory topic, like how ML is used in Genomics, what Single-Cell Sequencing is, more general areas of research currently happening in BioML, etc., but I thought since it’s been a while since I decided to start writing and never did, I’m going to start with something that is more interesting to me personally and my field of research. To give myself a bit more motivation. :)

Ok, let’s get to it! I’m going to keep this less technical than an academic paper, but technical enough to be useful for getting started.

Why is ML important in Drug Discovery?

Drug discovery is notoriously expensive (often exceeding USD 2.5 billion), time-consuming (taking 12–16 years on average), and fraught with high failure rates. For context, the drug-like chemical space (the number of possible drug-like molecules that could be made) is estimated at around 10⁶⁰ molecules. This is HUGE.

This is obviously not a scale that can be tested in practice, even in the best-resourced wet labs and pharmaceutical companies.

So, how do we tackle this challenge? Enter Machine Learning. Simply put, ML leverages large datasets of drug properties and effects (e.g., historical data on drug interactions with various cells or diseases) to make predictions about new compounds. With powerful computational resources, ML can analyze these complex datasets to predict outcomes for novel drugs. This is exactly machine learning by definition.

Let’s get more technical and on point.

Illustration by Michele Marconi

Some areas of drug discovery, currently being addressed by machine learning (brief intro)

Target Identification and Validation

Identification: Finding potential targets (such as genes or proteins) that are hypothesized to be associated with the disease being studied, such that altering them would change the state of the disease.
Validation: Confirming that a biological target (such as a protein, gene, or pathway) is truly involved in a disease and is a viable candidate for therapeutic intervention.

Hit Identification

The process of identifying the drugs that are likely to interact with the biological target we have.
For example, ML models are trained on data from known drug-target interactions and can predict the likelihood of new or untested compounds binding to a target.

Lead Optimization

After finding hits that are aligned with our interests, these drugs undergo limited optimization to identify promising lead compounds.
These lead compounds then undergo more extensive optimization to improve various aspects of the drug, such as:

Reducing off-target activities
Increasing/optimizing drug-target interaction
Predicting and reducing drug toxicity
Optimizing drug-likeness properties: improving the lead compound’s drug-like properties such as solubility, permeability, and stability through iterative modifications.
And more…

Drug Repurposing

Identifying new uses for existing drugs.

Predicting Drug-Drug Interactions

Predicting potential interactions between drugs, helping to avoid adverse effects in patients taking multiple medications.

De Novo Drug Design

Using generative models to generate new unseen drugs based on our needs.
Optimization of existing drugs into new compounds for the purpose of our disease target

These are just some important areas being tackled by ML/DL methods every day in research and industry. Now, I want to dive into the important task of “predicting a cell’s reaction to specific drugs being applied to it.”

Intro to Drug Perturbation Prediction

Let’s break down the task! What is Drug Perturbation Prediction in Cells?

For the purpose of this blog, when I refer to a “cell”, I am talking about the single-cell RNA-sequencing (scRNA-seq) representation of a cell. Data-wise, this is a vector of integers (aka counts), where each number represents the expression of its corresponding gene in that cell. The vector usually has a length somewhere between ~2000 and ~7000, which is the number of important genes selected during preprocessing (i.e. feature selection).

To put the main task in simple words: given a cell “A” in control condition (i.e. no drugs have been applied to it, e.g. an untreated cell from a patient with the disease in our study), we want to be able to predict what cell “A” would look like if drug “B” were applied to it.

scGen, Fig 1

In terms of input/output at inference time, the model for this task takes as input a cell (a vector of gene expressions), maybe some information about the cell’s covariates (e.g. cell type, age, tissue, …), and the drug that we want to apply to the cell. As output, it gives you how the cell is probably going to look after applying the drug (again a vector of gene expressions).

Sub-tasks usually tackled in this area:

Unseen drug effect prediction (completely new drugs)
Prediction of an unseen combination of seen drugs
Drug effect prediction in unseen cell types, organs, age groups, disease progression time points, etc.

Now, “Drug Perturbation Prediction” is actually a niche sub-problem of the more general task, “Perturbation Prediction”. The methods are mostly the same for the two problems; the major difference is the variety of beautiful methods for representing a drug!

Perturbation Prediction Architectures

I am going to focus on VAE models, as they are currently the core and state of the art for Perturbation Prediction tasks.
Although VAEs have been used widely, many other architectures, such as GANs, Transformers, and diffusion models, are also showing powerful capabilities in this field.

Variational AutoEncoders

VAEs have been the most widely used architecture for the task of predicting perturbation effects in single-cell data in the past few years. They are extremely good at reconstructing gene expression, and thanks to their input → representation → output architecture, they are very flexible when it comes to manipulating the latent space.

If you don’t know about VAEs, I highly recommend reading this post: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

I will explain by going into CPA’s architecture as a case study, one of the pioneering and important models in Single-Cell Perturbation Prediction. (Lotfollahi et al.)

CPA: Compositional Perturbation Autoencoder

They introduce CPA with the following sentence: “Here, we present the compositional perturbation autoencoder (CPA), which combines the interpretability of linear models with the flexibility of deep-learning approaches for single-cell response modeling.”

CPA leverages its autoencoder architecture to take a cell’s gene expression, its associated covariates (batch, cell type, age, …), and the perturbation applied to the cell as input, remove all of their effects in the bottleneck latent representation (Basal Latent Z), add new ones (covariates and perturbations) to the Basal Latent Z, and decode the result into the new cell. What does all of this mean!? Let’s see:

Lotfollahi et al. 2023, Figure 1

The encoder learns to remove the Perturbation and Covariate effects from the cell and create the “Basal State Z”
The removal of these effects is learned by leveraging an adversarial classifier, trying to figure out the covariates and perturbations from “Basal Latent Z”, and the encoder trying to maximize the loss of the adversarial classifier, i.e. remove any information about Perturbations and Covariates from the Basal Latent Z.
From the publication itself: “Crucially, the latent representation that the CPA encoder learns about a cell’s basal state is disentangled from (does not contain information about) the embeddings corresponding to the perturbation and the covariates. This disentangling is achieved by training a discriminator classifier (Lample et al, 2017; Zhao et al, 2018) in a competition against the encoder network of the CPA.”
This now allows the model to use an interpretable linear combination to add any given Covariate and/or Perturbation to the Basal Latent Z (shown in the middle of the figure), and predict the expected outcome of the cell with the new characteristics.
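
The additive step in the latent space can be sketched in a few lines. This is a toy illustration and not CPA’s actual code: the encoder and decoder are random linear maps standing in for trained networks, all names and sizes are hypothetical, and the adversarial training and everything else CPA does beyond the additive composition is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_latent, n_perts, n_covs = 50, 8, 4, 3

# Stand-ins for trained networks (random linear maps, for illustration only).
W_enc = rng.normal(size=(n_genes, n_latent)) * 0.1
W_dec = rng.normal(size=(n_latent, n_genes)) * 0.1

# Learned dictionaries: one latent vector per perturbation and per covariate value.
pert_emb = rng.normal(size=(n_perts, n_latent))
cov_emb = rng.normal(size=(n_covs, n_latent))

def encode_basal(x):
    """Map a cell to its basal latent state (ideally free of perturbation/covariate info)."""
    return x @ W_enc

def compose(z_basal, pert_id, cov_id):
    """Add the chosen perturbation and covariate embeddings to the basal state."""
    return z_basal + pert_emb[pert_id] + cov_emb[cov_id]

def decode(z):
    """Map a latent vector back to gene expression space."""
    return z @ W_dec

x = rng.normal(size=n_genes)       # one observed cell
z_basal = encode_basal(x)

z_a = compose(z_basal, pert_id=1, cov_id=0)   # the cell under perturbation 1
z_b = compose(z_basal, pert_id=2, cov_id=0)   # counterfactual: same cell, perturbation 2
x_counterfactual = decode(z_b)
print(x_counterfactual.shape)
```

Because the composition is a plain sum, swapping one perturbation for another only changes the latent vector by the difference of the two perturbation embeddings, which is what makes this part of the model easy to interpret.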

Many models today use similar architectures and approaches to predict perturbation effects on cells. New methods are also being developed using a wide variety of approaches. Some interesting methods for further reading:

GEARS: Predicting transcriptional outcomes of novel multigene perturbations with GEARS (Knowledge Graph Based)

GEARS, Fig 1, Panel B

sVAE+: Learning Causal Representations of Single Cells via Sparse Mechanism Shift Modeling

sVAE+. Fig 1

scGPT: toward building a foundation model for single-cell multi-omics using generative AI (Transformer architecture, learning gene tokens, with the ability to do perturbation prediction)

scGPT, Fig 1

What is special about drugs in all of this!?

If you look closely at the CPA architecture in the picture above, the perturbations (drugs) are represented as a “perturbation dictionary”, which is basically a PyTorch embedding layer (torch.nn.Embedding) with dimensions “number of unique perturbations in the dataset × perturbation latent dimension”:

self.pert_embedding = nn.Embedding(n_perts, n_latent, padding_idx=CPA_REGISTRY_KEYS.PADDING_IDX)

With drugs, we are of course interested in predicting the effect of known drugs and their combinations on new cells. But, in my opinion, the more interesting goal is predicting the effect of “Unseen, Non-Existent, De Novo Drugs” on cells, to help drug discovery research create new drugs more efficiently. That requires machine learning models that generalize beyond the drugs they were trained on.

What is the solution? We need methods that encode and represent any drug in a general manner, instead of only learning embeddings, via gradient descent, for the drugs we have seen.

Instead of a learned embedding table, we need models that return a pre-trained, ready-to-use embedding of a drug, given the structure of that drug/molecule as input.
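The difference between the two approaches can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the weights are random rather than trained, the “structural features” are a made-up bit vector, and `lookup_embedding` / `structure_embedding` are hypothetical names, not part of CPA or any library.

```python
import random

random.seed(0)   # reproducible toy weights
N_LATENT = 4     # perturbation latent dimension (toy value)
N_BITS = 8       # length of the toy structural feature vector

def random_vector(n):
    """Stand-in for parameters that would normally be learned."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

# 1) Lookup table: one free vector per drug seen during training.
pert_dictionary = {name: random_vector(N_LATENT) for name in ("drug_a", "drug_b")}

def lookup_embedding(drug_name):
    # Raises KeyError for any drug that was not in the training set.
    return pert_dictionary[drug_name]

# 2) Structure-based encoder: the weights are shared across all drugs,
#    so any molecule with a feature vector can be embedded.
encoder_weights = [random_vector(N_BITS) for _ in range(N_LATENT)]

def structure_embedding(features):
    """Linear map from a structural feature vector to the latent space."""
    return [sum(w * f for w, f in zip(row, features)) for row in encoder_weights]

# Hypothetical feature bits describing a molecule never seen in training.
new_drug_features = [1, 0, 0, 1, 1, 0, 0, 0]

try:
    lookup_embedding("new_drug")
    lookup_ok = True
except KeyError:
    lookup_ok = False  # the lookup table has no row for an unseen drug

new_drug_embedding = structure_embedding(new_drug_features)
print(lookup_ok, len(new_drug_embedding))
```

The lookup table simply has no entry for the unseen drug, while the structure-based function still produces a vector of the right size. Whether that vector is any good depends entirely on how the encoder was trained.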

Fortunately, there are many ways to represent a molecule!

Focusing on representation methods for “Small Molecules” (most pharmaceuticals are small molecules), we have a variety of ways to represent them:

Molecules are undirected graphs, and luckily, there is a long history of studying, researching, and understanding graphs in academia and mathematics.

String-Based Representations:

SMILES: Simplified Molecular Input Line Entry System:

A widely used representation: a linear textual description of a molecule, where atoms are represented by their chemical symbols and bonds are either implied or explicitly denoted.

Some examples of molecules represented in SMILES (from Wikipedia):

SMILES Wikipedia

SELFIES: Self-Referencing Embedded Strings: A 100% robust molecular string representation

A newer string representation for molecules, designed with machine learning use cases in mind, was published in 2020. It aims to address some of the challenges the deep learning drug discovery community has been facing.

Quoting from their paper:

The major challenge here is that the standard strings molecular representation SMILES shows substantial weaknesses in that task (computational drug generation) because large fractions of strings do not correspond to valid molecules.
Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule

Blog post on SELFIES

These string-based representations can be quite powerful thanks to today’s progress and investment in language models: classically RNN models (LSTMs and GRUs), and more recently large Transformer models.

Molecular graphs

Molecules can also be represented as graphs, the same way we learned to draw them in school, with atoms as nodes and the bonds between those atoms as edges.
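As a small illustration, here is acetic acid (SMILES `CC(=O)O`) written out by hand as a graph in plain Python. The atom indexing and the `adjacency_matrix` helper are my own choices for this sketch; hydrogens are left implicit, as they usually are in SMILES.

```python
# Acetic acid, SMILES "CC(=O)O", heavy atoms only (hydrogens implicit).
atoms = ["C", "C", "O", "O"]               # node labels
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]  # (atom_i, atom_j, bond order)

def adjacency_matrix(n_atoms, bond_list):
    """Symmetric matrix whose entries hold the bond order (0 = no bond)."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bond_list:
        adj[i][j] = order
        adj[j][i] = order  # undirected graph: mirror every edge
    return adj

adj = adjacency_matrix(len(atoms), bonds)

# Number of heavy-atom neighbours of each atom.
degrees = [sum(1 for entry in row if entry) for row in adj]

for row in adj:
    print(row)
print(degrees)  # [1, 3, 1, 1]
```

Node labels plus an adjacency structure like this one are, roughly, the kind of input a graph neural network works with.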

With all the research and interest in graphs, the whole area of “Graph Neural Networks” is available to put the graph representation of molecules to work. These models handle tasks such as:

Graph Generation
Graph Embedding (To feed into models like CPA)
Graph Classification (Molecule properties, etc.)

Molecular fingerprints

Fingerprints are built on the idea of applying a kernel to a molecule to generate a vector (usually a vector of bits, 0 or 1, or of non-negative integer counts) representing that molecule, based on its chemical features.

They are usually intended to represent the existence/non-existence of chemical substructures inside a molecule. This is possible due to the repetitive nature of similar substructures in similar molecules.

Source

Fingerprints such as the “Morgan Fingerprint”, available in the well-known Python package “RDKit: An open-source toolkit for cheminformatics”, are widely used for similarity search over molecules. They are relatively interpretable: the entries of the vector indicate the presence of explicit substructures, which matter a great deal when comparing molecules.

Note: Molecular fingerprints are not just used for similarity search but also as inputs to machine learning models for tasks like classification, regression, and clustering.

The most common similarity measure used for this search is the Tanimoto similarity:

From RDKit
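For two bit-vector fingerprints A and B, the Tanimoto similarity is the number of bits set in both divided by the number of bits set in either: |A ∩ B| / |A ∪ B|. Below is a minimal sketch in plain Python, with each fingerprint stored as the set of its “on” bit positions. The bit positions are made up for illustration, and returning 0.0 for two empty fingerprints is a convention I chose here, not something taken from RDKit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Made-up fingerprints: positions of the bits that are set to 1.
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8}

print(tanimoto(fp_1, fp_2))  # 2 shared bits / 5 bits in the union = 0.4
print(tanimoto(fp_1, fp_1))  # identical fingerprints give 1.0
```

In practice you would compute the fingerprints with a cheminformatics toolkit and use its built-in similarity functions. The set version above is only meant to show what the number means.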

How are these representations used in deep learning?

There are numerous models and papers out there, leveraging these molecule representations to perform drug discovery tasks.
Tasks such as:

1. Representation Learning for molecules, to be used in:

Perturbation models such as CPA (Check out ChemCPA: NeurIPS 2022)

If we have pre-trained representations for our drugs, we can give the perturbation prediction model the representation of a completely new drug and expect it to predict that drug’s effect on gene expression.

ChemCPA, Figure 1

Molecule property prediction (classification/regression tasks)

2. Molecule Generation:

Molecule generation in deep learning follows the same trends and methods being used in the general Generative Deep Learning community.

VAEs
Diffusion Models
Flow-Based Models
GANs
Transformers
Probably many more. Research on deep generative models is crazy active at the moment, so you can’t list them all…

VAE-based models such as Chemical VAE and Grammar VAE were some of the early deep learning models, using SMILES representations (in other words, a 1D representation of molecules).

The problem with this group of models was that they fell short in representing molecules. For example, two very similar molecules can have very different SMILES representations, which makes it very difficult for the models to learn the similarity structure.

An important model named Junction Tree VAE (JT-VAE) addressed this issue by using the 2D representation of the molecules, i.e. their graph representation, which resulted in considerable improvement in performance.

JT-VAE, Fig 3

And most recently, people have been using 3D representations of molecules with Flow-Nets, diffusion models, etc. to capture the full scale of information, which makes sense, since molecules are 3D objects.

Wrapping It All Up

So, to sum it up, machine learning is seriously shaking things up in drug discovery. It’s helping researchers solve problems that used to seem impossible. Whether it’s predicting how new drugs will work, tweaking existing ones, or finding new uses for old meds, ML is making a huge difference. With models like VAEs, transformers, and GNNs, we’re getting closer to speeding up and improving the whole drug development process.

As we look to the future, things are only going to get more exciting. Newer and even more advanced models are on the horizon, which means we could see even faster and cheaper ways to bring new treatments to patients. And who knows, with AI teaming up with other cutting-edge tech like quantum computing, we might be stepping into an era of truly personalized medicine.

If you’re as curious as I am about where all this is headed, there’s plenty to explore. Whether you’re deep into research, just starting out, or simply interested in the future of healthcare, the crossroads of machine learning and drug discovery is an exciting place to be. I can’t wait to see what’s next!

Thanks for sticking with me through this post! I’d love to hear what you think, so drop your thoughts, questions, or anything else in the comments below. Let’s keep the conversation going!

Quick note

This came out more messy and unorganized than I expected, and that’s because I kept getting all excited about every single task and area in the field, instead of focusing deeply on one.
I’m personally fine with it because it was my first time writing a semi-technical blog post and I wasn’t expecting greatness.

But it helped me get all the “I want to show it all” excitement out of my system, and the next one will be more focused, clean, and targeted on the topic I choose. :)))

About me, Arian

Arian is a Research Assistant at the Lotfollahi Group, Wellcome Sanger Institute, and a Computer Science student at the Sapienza University of Rome, with a Computer Science background and experience in Machine Learning and Data Science. His research interests include Representation Learning, Graph Neural Networks, and Robust ML. He is currently mainly working on developing deep-learning methods for Drug Discovery and Perturbation Prediction in single-cell genomics.

LinkedIn: linkedin.com/in/arianamani
Twitter: twitter.com/ArianAmaani
GitHub: github.com/ArianAmani

References

I’m updating this part in a bit after gathering all the resources I used!

The Deep Learning Road Map That I Took

Arian Amani — Fri, 02 Sep 2022 14:53:04 GMT

A Deep Learning Road Map And Where To Start

This is available on my GitHub as well:

https://github.com/ArianAmani/DL-RoadMap

Before getting started:

Before getting started, I have to say that this is the path I took/would have taken, and it worked for me pretty well. But it sure doesn’t mean that it’ll work for everyone.

So if you don’t feel like this is the path for you, there are hundreds of different road maps on the web.

There are two main points of view on learning ML and DL:

Start with the code, you’ll have time to learn theoretical stuff and the math behind it later.
Start with the theory and the math behind it, then write the code.

Well, I’m a huge fan of the second opinion and I really don’t like the first one. But don’t get scared away: all good paths and courses come with a balanced combination of theory and code, and those are the ones I’ll introduce.

The reason I prefer the second path is that if you just learn to write code and don’t know what exactly is happening behind that code, you can’t actually solve problems on your own. Say you get a task with a few problems to solve, and you need to choose a Machine Learning algorithm to tackle these problems. All these problems are numerical and therefore need mathematical computation and solutions. You’ll face new problems that you haven’t seen before and without knowing the math and the logic behind these algorithms, you can’t choose the right solution to your problem (and of course, you can’t try them all to find which one works best).

So knowing these, let’s start talking about the path itself.
In most of the parts, I’ll introduce different sources based on how deep you want to get or how much time you’ve got.

I recommend reading through to the end before getting started on any of the courses.

Mathematics

As a Computer Science student myself, I was familiar with most of the math I needed to start, but if you’re starting with not much background in mathematics, this is where I’d suggest you start (If you’re familiar with these, feel free to skip):

Linear Algebra:

Fast and efficient way: Coursera Mathematics for Machine Learning: Linear Algebra
There’s also this legendary course on Linear Algebra, taught by Prof. Gilbert Strang at MIT, and it’s publicly accessible. Well, I’d really recommend watching this course if you’re really into math and want to learn a whole lot more about linear algebra, and you’ve got the time too. It’s definitely more than enough for starting ML, but if you feel like learning more, go for it: MIT OCW Linear Algebra 18.06 YouTube Playlist

Calculus:

Fast and efficient way: Coursera Mathematics for Machine Learning: Multivariate Calculus

Probability and Statistics:

Fast and efficient way:
You can probably learn everything you need at Khan Academy:
https://www.khanacademy.org/math/statistics-probability
More deep and academic way:
If you would like to dive deeper into the world of probability and statistics, I’d suggest the book “Probability and Statistics for Engineers and Scientists” by Walpole, Myers, Myers, and Ye.

These cover the math you need for ML and DL and you won’t need to worry about that part anymore.

There’s also a book called “Mathematics for Machine Learning”, and it’s free online. I’d suggest reading it instead of all the above if you’re a book person and want something well structured, all in one place. But if you think you might drop it halfway through, just stick to the courses above.

Let’s go to the next part.

Python

You’re going to need some programming skills (preferably in Python) before getting started. There are hundreds of Python tutorials on the web, so I’m just going to list two of them for your ease. If you don’t like either, just search “learn python” and you’ll find everything you need.

Machine Learning

Now we’re getting to the fun parts.

Again here, I’m going to recommend two courses, one easier to follow, and one more deep and academic.

Easier to follow (Probably more popular):

Machine Learning Specialization by Andrew Ng

This is probably the most popular ML course on the internet, and A LOT of people have started their path into ML with it. I’d guess it’s also one of the highest-rated courses on Coursera (4.9/5).

This Specialization is made of 3 courses covering the main parts of Machine Learning, and by the end of it you’ll have a good understanding of ML algorithms and how to implement and use them in Python.

You can audit the course for free, or enroll to get the certificate. Also, Coursera has this option called Financial Aid for those who can’t afford the course, you can just click on the Financial Aid button and explain why you should get this course for free and in 15 days, you’ll receive an email saying Congrats, you got it :)))

More deep, academic course:

Stanford CS229 Machine Learning

This is the Machine Learning course taught at Stanford University, recorded in the class, and uploaded on YouTube.

Well, as I said before, I’m a fan of more deep academic courses and this is THE COURSE to go with if you’re like me. It involves a lot more math and details on ML concepts and algorithms and of course is more difficult to follow, but if you think you’ll be ok with the huge math and stuff and you won’t run away halfway through the course, don’t even hesitate to start with this one.

The videos are uploaded online on YouTube and the course material is accessible from the course website. Two versions are available online, one from the Autumn 2018 (Andrew Ng) semester and one from Summer 2019 (Anand Avati).

The first one is taught by Andrew Ng, the same instructor as the Coursera ML course introduced above, and the latter one is taught by Anand Avati, Andrew’s Ph.D. student.

Choosing between the two is more a personal preference, I myself love Andrew’s way of teaching and I’m more comfortable with it.

That said, Anand Avati’s course is newer and covers more subjects. It even reviews the math required for the course in the first three lectures.

I’d suggest you watch one lecture from each, and choose the one you’re more comfortable with, and stick with it.

Deep Learning

Ah, finally the great field of Neural Networks.

Deep Learning is a part of ML, and today, the most famous and most useful part of it. It isn’t actually a new idea: its roots go back to 1943, when Warren McCulloch and Walter Pitts proposed the first mathematical model of a neuron. But back then, we didn’t have the computational resources to carry out the calculations in a Neural Network model, so it wasn’t used much.

With the growth of computers and more powerful GPUs, Neural Networks began to lead the AI world.

Again, I’m going to recommend two courses, one easier to follow and one deeper and more academic.

Easier to follow (Probably more popular):

Deep Learning Specialization offered by DeepLearning.AI taught by Andrew Ng

This is a 5-course specialization, covering almost everything you need to understand Deep Learning and its ways.

As with the ML Coursera course, you can audit it for free or apply for financial aid here too. I definitely recommend watching this to get started on Deep Learning if you like Coursera-style courses.

More deep, academic course:

Stanford CS231n: Deep Learning for Computer Vision

I actually started Deep Learning with this course, and I’ve got to say, it’s THE BEST COURSE to start with if you’re ok to get a little deeper into the field like me.

It is more focused on Deep Learning applications in Computer Vision, but it also covers ALL the basic and necessary aspects of Deep Learning too. So you should not worry about it being for Computer Vision at all.

As a matter of fact, I watched the whole DL Specialization mentioned above too, after finishing this course, and I already knew all the stuff taught in the Specialization (and more) from this course. It even involves some Neural Network architectures mostly used in NLP.

The only part of the Coursera Specialization that teaches more than this is the 5th course (Sequence Models) which is more focused on NLP.

Its only drawback is that the available lecture videos are from the 2017 class, so it doesn’t cover some newer topics like transformers. But if you’re interested enough, you’ll learn that new stuff on your own. (Course notes from more recent CS231n semesters are also available, and you can read those to learn the newer methods.)

After CS231n, I’d recommend CS224n if you’re interested in Natural Language Processing and want to get deep in that field.

Choosing between Andrew Ng’s Specialization vs. CS231n is based on your own personal preference. As for me, I prefer CS231n WAY MORE.

Reading Resources

Reading is always a key to getting deeper. If you’re into that, there are some books I like personally that can help you with the process:

The legendary Deep Learning book

This book is a legend among Deep Learning books. It covers the concepts and the math behind DL algorithms perfectly. I don’t recommend starting with this book since it’s really hard. But if you want to strengthen your knowledge after completing the courses above, this book is perfect.

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, published by O’Reilly

This book is an awesome resource for learning ML and DL and also learning to code and implement the algorithms. I’d recommend starting with this if you’re more comfortable with books than courses.

Dive into Deep Learning

This is an awesome reference for getting into both the math and the code for Deep Learning. It contains code examples and implementations in all popular DL frameworks (PyTorch, TensorFlow, and MXNet).

It’s available online for free and constantly updated and involves all the newest material on Deep Learning.

If you’ve got the time, I definitely suggest reading this. I’m actually starting to read it for upgrading my coding knowledge.

Extras and where to go

By this stage, you probably know what you’re doing, and you’ll be able to find your own path from here. Some things I would recommend:

Read Papers: This will give you a deeper knowledge of algorithms and the things happening behind the code. I’d offer following these awesome lists for starters since there are numerous papers out there:

Something that will definitely help you upgrade your coding skills is doing projects for yourself. Just search for ML or DL project ideas and start coding, you could upload them on your GitHub too to enrich your GitHub.
Starting with some challenges on Kaggle will give you an awesome head start in coding experience. There are a huge number of competitions and datasets you can use to improve your coding skills.
DataCamp offers some excellent tutorials for improving your coding skills in working with datasets. I’d really suggest you check out some of its courses, especially for learning to work with tabular data. (It’s not free, but if you’re a student with a verifiable school-issued email, you can sign up for the GitHub Student Developer Pack and get a free 2-month DataCamp account, which is what I did.)

I’ll try to keep this Road Map updated, and I really hope it was useful to you.
Thank you for reading through.

About me

I’m Arian, a Computer Science student interested in Deep Learning research, and its applications in real-life problems.

I’d be happy to answer any questions if I’m able to. Feel free to contact me on Twitter or LinkedIn.

Twitter: twitter.com/ArianAmaani
GitHub: github.com/ArianAmani
LinkedIn: linkedin.com/in/arianamani