Stories by Netflix Technology Blog on Medium

A Human-Augmenting Agentic Workflow for Causal Inference

Netflix Technology Blog — Mon, 08 Jun 2026 15:01:03 GMT

By Winston Chou, Adrien Alexandre, Lars Olds, Yi Zhang, Garrett Hagemann, and Nathan Kallus

Introduction

Imagine asking a data agent to analyze the causal relationship between two variables, such as the effect of watching a popular Netflix show on long-term member retention. It queries your data, runs a regression, and confidently returns an answer. How much should you trust it? Can you be confident that the agent accounted for subtle biases — or does it treat passionate fans as if they were the average viewer? Without deep understanding and expertise, would you even be able to tell if it got the answer wrong?

Data analysis is increasingly being delegated to software agents. While this reduces human effort and toil, oversight is still needed to ensure the validity of results. This is especially true for specialized tasks like Observational Causal Inference (OCI), which require substantial judgment and domain expertise.

In this blog post, we share an agentic workflow for performing OCI under unconfoundedness. Our workflow is designed for software agents to adhere to rigorous, exhaustive templates for causal inference tasks. Yet, it also seeks to be “human-augmenting,” and to enable and empower human inspection and evaluation.

We designed this workflow with OCI practitioners in mind. Although OCI requires context and care to do well, aspects of it — e.g., checking and rechecking covariate balance, conducting sensitivity analyses, and keeping track of multiple iterations — can be repetitive and prone to error. Our workflow seeks to eliminate this toil so that humans can focus on more nuanced tasks, such as framing questions, scrutinizing assumptions, and evaluating results.

To this end, we are open-sourcing a standalone version of our oci-agent so that OCI practitioners can model workflows on and suggest improvements to it. We also share evaluations of our agent on the 2016 Atlantic Causal Inference Conference (ACIC) competition datasets, and show that our agent systematically beats one-shot iterations under numerous data-generating processes — while achieving competitive results against hand-tuned benchmarks.

This post describes the principles behind our workflow and gives a case study of its deployment at Netflix.

Philosophy

Our workflow is built on top of Netflix’s pre-existing OCI toolkit. We built this toolkit — largely in a pre-AI world — to answer “point-in-time” causal questions, such as “what is the effect of playing a Netflix game on member retention?” or “what is the effect of watching a highly popular show on subsequent engagement?” Questions of this kind inform business strategy, guide metric development, and contribute to a rich understanding of what drives member satisfaction.

Our toolkit is guided by a “target trial emulation” philosophy. For any point-in-time OCI question, we ask “what is the ideal A/B test for addressing this question?” This A/B test may be expensive, slow, or even infeasible to run. However, the thought exercise helps to pin down the assumptions needed for a credible answer, such as unconfoundedness of the treatment.

To make the target trial analogy actionable, our toolkit embeds a series of design diagnostics. These diagnostics assess whether we are drawing fair comparisons between treated and untreated units — or if there are hidden differences that could imperil our conclusions:

Covariate balance. After weighting, the standardized mean difference of pre-treatment covariates between treatment and control groups should be less than 0.2.
Overlap. The probability of receiving treatment (aka propensity score) should be bounded between 0.1 and 0.9.
Placebo outcome. The “treatment effect” on variables measured prior to the treatment should not be significantly different from zero.
Sensitivity to hidden confounders. Findings of treatment effects are contextualized by sensitivity to hypothetical omitted variables that explain both treatment and outcome.
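As a rough illustration of the covariate balance diagnostic above, the following sketch computes a weighted standardized mean difference for a single covariate and compares it to the 0.2 threshold. The function names and the choice of pooled (unweighted) standard deviation in the denominator are assumptions made for this example, not a description of the toolkit's actual implementation.

```python
import numpy as np

def weighted_smd(x, treated, weights):
    """Weighted standardized mean difference of one pre-treatment covariate."""
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    mean_t = np.average(x[treated], weights=weights[treated])
    mean_c = np.average(x[~treated], weights=weights[~treated])
    # Assumption: pool the unweighted group variances for the denominator,
    # which is one common convention among several.
    pooled_sd = np.sqrt((x[treated].var(ddof=1) + x[~treated].var(ddof=1)) / 2.0)
    return (mean_t - mean_c) / pooled_sd

def balance_ok(x, treated, weights, threshold=0.2):
    """True when the absolute weighted SMD is below the threshold."""
    return abs(weighted_smd(x, treated, weights)) < threshold
```

In practice a check like this would be run once per covariate, before and after weighting, so that any remaining imbalance is visible in the published artifacts.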

As we uplevel our OCI toolkit with agents, such evaluation remains paramount. The standard approach to evaluating agents is to programmatically compare their outputs to ground truth. Yet, outside of artificially simulated data, there is no ground truth in observational causal inference.

Without discounting the need for evals (which our workflow also supports), one of our key principles is to augment human evaluation by making each analytic step as transparent as possible. For example, in our workflow, agents publish artifacts in the form of plans, specifications, plots, and notebooks that humans can inspect and re-execute if they wish. In the absence of ground truth, we rely on these “process audits” — coupled with human oversight — to build good agents.

Principles

Our workflow has three key personas:

Principal — the human user (e.g., data scientist) whose goal is to provide a thorough and correct analysis
Actor — the software persona that performs the analysis, including diagnostics
Critic — the software persona that synthesizes results, identifies gaps, and offers suggestions to improve the analysis

Our agent orchestrates the latter two personas in an actor-critic loop: specifying and triggering the analysis as the actor, then interpreting results and diagnosing flaws as the critic.

Each persona has responsibilities:

Principals

Provide an initial analysis plan containing its context and goals.
Provide context on the main threats to valid inference and the confounders that must be controlled.
Specify the tools that can be used for the analysis.
Specify the data model and dataset.

Actors

Refine the principal’s plan into a data analysis spec.
Use only the tools provided by the principal.
Create human- and machine-checkable artifacts.
Perform the four design diagnostics in addition to the core analysis.
Report any remediations taken in case of diagnostic failures.

Critics

Check for blind spots, such as unmentioned confounders, in the principal’s plan.
Check for alignment between the plan, spec, and executed analysis.
Specify a credibility level in the results after seeing the diagnostics.
Specify if and how the estimand differs from the Average Treatment Effect (ATE), for example due to propensity score trimming.
Contrast the executed analysis with the ideal target Randomized Controlled Trial (RCT).
Suggest at least one alternative measurement strategy (e.g., encouragement RCTs).

Although our workflow is designed for OCI under unconfoundedness, the principles listed in this section are meant to be extensible to other approaches to OCI, such as panel methods with very different assumptions (e.g., parallel trends).

Empowering Human Evaluation

To empower human oversight of each analytic step, we provide principals with a templated notebook that uses our vetted (non-agentic) OCI toolkit, which employs doubly robust learning for causal effect estimation.
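To make "doubly robust" concrete, here is a minimal sketch of one standard doubly robust estimator, augmented inverse propensity weighting (AIPW). It is not claimed to be the estimator used in the toolkit. It assumes that the propensity scores `e` and the outcome predictions `mu1` and `mu0` have already been fitted elsewhere, and the function name is hypothetical.

```python
import numpy as np

def aipw_ate(y, t, e, mu1, mu0):
    """AIPW estimate of the average treatment effect and its standard error.

    y: observed outcomes; t: 0/1 treatment indicator;
    e: estimated propensity scores; mu1, mu0: predicted outcomes
    under treatment and control (all assumed precomputed).
    """
    y, t, e, mu1, mu0 = (np.asarray(a, dtype=float) for a in (y, t, e, mu1, mu0))
    # Outcome-model contrast plus inverse-propensity-weighted residual corrections.
    scores = mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)
    ate = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return ate, se
```

The estimate remains consistent if either the propensity model or the outcome model is correctly specified, which is the sense in which it is doubly robust.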

The principal’s remaining responsibilities are to write the initial analysis plan and to evaluate the analysis artifacts (the executed notebook and the critic’s report). To enable thorough evaluation, agents version-control their reports and upload executed notebooks to a file store, where they can be downloaded and re-executed by principals (if they wish).

We diagram this workflow below:

Case Study — Estimating the Impact of New Entertainment Types

In recent years, we have added a wide variety of entertainment types beyond streaming video to Netflix. A natural question is how these new entertainment types affect members’ satisfaction and their likelihood of continuing to subscribe to Netflix.

To analyze the impact of one of these new entertainment types, which we will call Type X, we wrote a simple analysis plan specifying our

Treatment: Days engaging with Type X (or “Type X days” for short)
Outcome: Two-month retention
Potential confounders, including pre-treatment Type X days

To establish a baseline, we fed this analysis plan without additional scaffolding to Claude Sonnet 4.6, a powerful yet accessible general-purpose model. The model chose and executed a defensible analysis strategy: linearly regressing retention on Type X days along with controls.

While the result was polished and impressive, when we ran the same analysis through our paved path tooling and agentic workflow, also using Sonnet 4.6, our agent produced an updated estimate that was just 25% of the baseline! What explains the difference between the baseline and the paved-path estimates?

A core challenge when analyzing new entertainment types is early adopter bias. The first users of any new offering are likely to be systematically different from the general population. For example, they may be heavier users of Netflix generally, or they may be extremely strong fans of the underlying titles. Early adopter bias manifested in our analysis as poor “overlap”: the vast majority of observations had a small estimated probability of engaging with Type X, reflecting its early maturity.

This imbalance was caught by our critic agent in its writeup of the analysis. The critic also flagged the failure of the placebo test: early Type X adopters differed significantly from non-adopters in terms of important confounders measured before experiencing the treatment, a warning sign of potential bias.

Addressing Failed Diagnostics

To address these diagnostic failures, our workflow provides agents with a playbook. For example, to overcome poor overlap, we instruct the agent to use Crump-style trimming. That is, before estimating causal effects, the actor trims units with estimated propensity scores outside the range [0.1, 0.9]. This scopes the treatment effect being estimated to the ATE in the population that is not very likely or unlikely to engage in the new entertainment type — an important caveat we instruct the critic to flag in its report.
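The trimming step described above can be sketched as follows. The function name is hypothetical, and the fixed [0.1, 0.9] bounds mirror the rule of thumb mentioned in the text rather than a data-driven threshold.

```python
import numpy as np

def trim_by_propensity(propensity, lower=0.1, upper=0.9):
    """Return a keep-mask for units inside [lower, upper] and the share kept."""
    p = np.asarray(propensity, dtype=float)
    keep = (p >= lower) & (p <= upper)
    return keep, float(keep.mean())

# Usage: only units with keep == True enter the effect estimation step.
keep, share_kept = trim_by_propensity([0.02, 0.1, 0.5, 0.9, 0.97])
```

Reporting the share of units kept alongside the estimate makes it clear how far the trimmed estimand departs from the ATE in the full population.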

Trimming yields an estimate that is much smaller than the baseline estimate, and which only applies to the “overlapping” population (for whom engagement with the new entertainment type is non-deterministic). However, the trimmed estimate is substantially more credible, as it focuses on the members for whom the treatment could plausibly be randomly assigned, as in a target trial.

In contrast, the baseline effect relies heavily on assumptions to extrapolate outcomes for all members, even those with a very low probability of treatment. The danger here is that extrapolation produces a number that is not backed by robust data and is likely confounded by early adopter bias.

Orchestrating Followup Analyses

There are two natural followups to this analysis:

First, we need to analyze the sensitivity of estimates to the choice of trimming threshold. Practically, this requires redoing the analysis with multiple trimming thresholds.
Second, we also care about how these causal effects evolve over time. Yet, comparing causal effects across time raises subtle challenges. For example, we need to coordinate the population across all analyses: if a set of users is trimmed to make one analysis more credible, it should be trimmed in the other analyses as well.

Both of these followups require conducting multiple versions of the same analysis, tweaking some parameters while keeping others the same. Managing this complexity and ensuring consistent execution is another area where agents add value.
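One simple way to coordinate the population across analyses is to keep only the units that survive trimming in every analysis. The sketch below assumes each analysis reports the IDs it kept; the function name and input shape are hypothetical.

```python
def common_population(kept_ids_by_analysis):
    """Intersect the unit IDs that each analysis kept after trimming.

    kept_ids_by_analysis maps an analysis label (e.g., a date partition)
    to an iterable of unit IDs that were not trimmed in that analysis.
    """
    id_sets = [set(ids) for ids in kept_ids_by_analysis.values()]
    if not id_sets:
        return set()
    # A unit trimmed in any one analysis is dropped from all of them.
    return set.intersection(*id_sets)
```

Each analysis would then be re-run on the shared population so that differences between estimates are not driven by differences in who was analyzed.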

To illustrate this, below we show a sensitivity analysis for our case study in which we asked the agent to vary the trimming bounds from [0, 1] (no trimming) to [0.15, 0.85]. As the plot shows, the estimated ATE on the overlapping population is robust to the choice of trimming threshold within bounds of [0.005, 0.995]. Although principals could (and should) execute this and other robustness analyses, delegating them to agents helps to reduce toil.

Another example is generating a time series by repeating the same analysis across multiple date partitions. For example, below we plot the results of using our agent to refit a different analysis on ten distinct date partitions. The plot shows evidence of seasonality: the treatment has a stronger effect on the winter dates compared to the summer dates.

Public Repo and Evals

To help OCI practitioners build on and contribute to our workflow, we are open-sourcing a standalone version of oci-agent. This repo implements two evaluations on public datasets from the 2016 Atlantic Causal Inference Conference (ACIC) data analysis competition. It also includes a lightweight version of our internal causal machine learning notebook that only uses open-source software (EconML).

Our first evaluation runs this notebook for three randomly sampled datasets generated by each of the 77 data-generating processes (DGPs) in the ACIC data. Next, it uses the critic to grade the resulting 231 estimates as either satisfactory or unsatisfactory based on the diagnostics.

Below, we plot the average RMSE and coverage of 95% confidence intervals of our ATT estimates against the 44 competitor methods in the ACIC competition. As the scatterplot shows, our statistical methodology is competitive against these benchmarks: it achieves reasonably low RMSE and well-calibrated confidence intervals that cover the truth in ~95% of DGPs.
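For readers less familiar with these two metrics, here is a minimal sketch of how RMSE and confidence interval coverage can be computed when ground truth is available, as it is in simulated data. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def rmse_and_coverage(estimates, ci_lower, ci_upper, truth):
    """RMSE of point estimates and share of intervals that contain the truth."""
    est, lo, hi, tr = (
        np.asarray(a, dtype=float) for a in (estimates, ci_lower, ci_upper, truth)
    )
    rmse = float(np.sqrt(np.mean((est - tr) ** 2)))
    coverage = float(np.mean((lo <= tr) & (tr <= hi)))
    return rmse, coverage
```

Well-calibrated 95% intervals should yield coverage close to 0.95 across many simulated datasets.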

More to the point, our diagnostics and agentic workflow help to separate more reliable estimates from less reliable estimates. To illustrate this, the following chart plots our ATE estimates in terms of RMSE and coverage. Note that we separate out the RMSE and coverage of:

All 231 estimates (purple dot)
The 192 satisfactory estimates (blue star)
The 39 unsatisfactory estimates (red dot)

As the plot shows, when aided by our diagnostic suite, the critic agent is able to separate good estimates from bad estimates: the satisfactory estimates have much lower RMSE and better calibrated confidence intervals than do the unsatisfactory estimates.

Our second evaluation compares the performance of an LLM on the same analysis plan with our scaffolding and without it (i.e., one-shot prompting). Unsurprisingly, we find that our scaffolding is critical to helping the LLM return useful estimates. This can be seen in the following random sample of ten ACIC datasets. Using our scaffolding, the LLM recovers the ground truth in nine out of ten datasets. Furthermore, estimates are highly correlated with ground truth.

In contrast, giving the same analysis plan to Sonnet 4.6 without any scaffolding (i.e., just prompting it) results in consistently wrong answers that are not at all correlated with ground truth.

A key limitation of our public repo is that, due to the synthetic nature of the underlying datasets, it doesn’t pressure-test our agent’s semantic understanding or performance on real-world OCI tasks. Nonetheless, the repo demonstrates the core principles underlying our workflow. These include (1) giving agents extensive scaffolding so that they follow best practices by design, and (2) requiring inspectable artifacts so that humans can audit agents’ processes, not just their outcomes.

Conclusion

We provide a workflow for doing observational causal inference with the help of software agents. Leveraging elements of our pre-AI OCI toolkit, such as templated notebooks, our workflow is designed to ensure that agents conduct rigorous and exhaustive analyses. This helps to reduce the human toil of OCI, which can be a highly iterative and exacting process.

At the same time, motivated by the complexity and ambiguity of observational causal inference, our workflow seeks to be human-augmenting and enables human practitioners to evaluate each analytic step.

Using agents for causal inference poses a challenge: how do we evaluate agents’ performance on tasks without ground truth? To meet this challenge, our workflow combines process audits with human oversight. To enable others to learn from and critique our workflow, we have open-sourced a lightweight, standalone version. We hope this work stimulates more research and development on agentic evaluation in the absence of ground truth.

For valuable feedback on this post and “dogfooding,” we thank Adith Swaminathan, Ayal Chen-Zion, Colin Gray, Juliet Hougland, and Simon Ejdemyr.

Thinking Fast & Slow for a Personalized Notification System

Netflix Technology Blog — Fri, 05 Jun 2026 16:31:01 GMT

by Matthew Wood, Ishan Gupta, Kevin Mercurio, Devon Bryant, and Claire Dorman

In his seminal book “Thinking, Fast and Slow,” Daniel Kahneman describes two systems that drive human cognition: System 1, which operates automatically and quickly with little effort, and System 2, which allocates attention to more challenging mental activities requiring deliberate focus. This dual-process theory has profound implications not just for understanding human behavior, but for designing intelligent systems that must balance immediate responsiveness with strategic foresight. Similar “plan vs. act” decompositions show up in other domains too — for example, robotics and autonomous driving often separate a slower planning layer (setting goals and constraints over longer horizons) from faster control and execution loops, and modern LLM agents frequently pair deliberate planning with rapid, step-by-step tool use and reaction.

At Netflix, our messaging platform faces a similar challenge every day. We send hundreds of millions of personalized notifications — push messages, emails, and in-app alerts — to help members discover content they’ll love. This creates a central tension: optimizing each notification for near-term engagement can conflict with what is best for the member over the long term. Higher message frequency can increase fatigue and opt-out risk, while lower frequency can reduce awareness of relevant titles and features the member would value.

This blog post introduces our framework for personalized notifications — a hierarchical system where a “slow” policy makes strategic, personalized decisions about a member’s weekly messaging plan (e.g., the intended frequency per channel and the resulting pacing over the week), while a “fast” policy handles the tactical, real-time decisions about which specific message to send when a send opportunity occurs. Together, they balance near-term engagement with longer-term member experience.

The Problem

Before introducing our new framework, it is helpful to ground the discussion in a representative baseline for a personalized notification system. In our previous production system, we used a causal model to make send decisions by predicting the causal effect of a single message over a short time horizon. While this approach is effective as a baseline, it suffers from two fundamental limitations:

Short-Term Reward Horizons

The single-message outcome model is trained to optimize short-horizon metrics, such as immediate user actions occurring shortly after a notification is sent. While this is excellent for driving near-term engagement, it misses the cumulative, long-term effects of a messaging strategy. A message that drives an interaction today might also contribute to notification fatigue, reducing responsiveness in the weeks to follow. Because critical indicators of member satisfaction — like sustained viewing habits or gradual opt-out risk — only surface over extended timeframes, a short-term model will always miss the bigger picture.

Coupled Ranking and Pacing Decisions

When a single system evaluates daily incrementality to decide both whether to send something and, if so, which item to send, an individual member’s weekly message frequency becomes a by-product of those daily decisions rather than an explicit control variable. In our previous single-policy system, frequency was controlled implicitly through a relevance threshold on the model score calibrated to achieve a target aggregate send rate. While effective for managing overall frequency, this mechanism limited the system’s ability to personalize frequency based on individual engagement patterns. Moreover, because send eligibility and message selection were coupled in the same decision rule, adjusting the threshold to control frequency also changed the distribution and quality of selected messages, and vice versa.

To solve these challenges, we needed a system that could separate longer-term strategy from shorter-term decisions. What if we could determine an optimal, personalized message plan for each member, and then focus on selecting the most relevant content within those bounds? In the following sections, we detail how we realized this vision by decoupling our notification engine into a hierarchical ‘System 1’ and ‘System 2’ framework.

The Proposed Method: A Hierarchical Slow-Fast Architecture

The Slow policy’s primary role is to define a personalized pacing of messages over a defined time horizon. Its decisions are consumed by the Fast policy, whose role is to maximize immediate relevance and select the optimal message for the member at any given moment.

To illustrate the Slow policy in practice: if optimized at a weekly cadence, the policy evaluates a member’s long-term engagement patterns to select a “Pacing Plan Action.” To keep the action space manageable yet expressive, we discretize the decision space into a set of actions that independently specify push and email frequencies. This provides on the order of 100 distinct combinations of cross-channel pacing strategies.

The Utility Function

The Slow policy selects actions by maximizing a personalized utility function. This function explicitly trades off positive engagement signals against the long-term “cost” of messaging.

U(member, action) = Σₖ wₖ · Rewardₖ(member, action) - Cost(action)

To capture a holistic view of member health, this utility is composed of:

Positive Signals: Capturing the likelihood that a member will find value in and engage with the platform.
Negative Signals: Capturing the likelihood of member fatigue or a propensity to opt out of a messaging channel.

Ideally, negative signals alone would naturally penalize over-messaging. In practice, however, explicit negative feedback is extremely sparse. Without an additional constraint, the predicted ‘cost’ of an incremental message appears negligible, causing the model to gravitate toward maximum frequency.

To address this, we introduce a universal message cost that is added to the personalized negative‑feedback prediction for every send. This additional cost term keeps the reward function concave and well‑behaved, preventing degenerate “always send” policies. The message cost parameter is empirically tuned using a combination of online experiments and offline evaluation metrics.
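The effect of the universal message cost can be sketched as follows. This is a toy example under stated assumptions: actions are encoded as (pushes per week, emails per week) tuples, Cost(action) is modeled as the predicted negative feedback plus a per-message cost times the number of messages, and all names and numbers are hypothetical.

```python
from typing import Dict, Mapping, Tuple

Action = Tuple[int, int]  # (pushes per week, emails per week), hypothetical encoding

def utility(rewards: Mapping[str, float], weights: Mapping[str, float],
            negative_feedback: float, n_messages: int,
            cost_per_message: float) -> float:
    """Weighted positive signals minus the modeled cost of the action."""
    engagement = sum(weights[k] * rewards[k] for k in weights)
    # Assumed cost form: sparse negative-feedback prediction plus a universal per-send cost.
    cost = negative_feedback + cost_per_message * n_messages
    return engagement - cost

def best_action(predictions: Dict[Action, dict], weights: Mapping[str, float],
                cost_per_message: float) -> Action:
    """Pick the pacing action with the highest utility for one member."""
    return max(
        predictions,
        key=lambda a: utility(predictions[a]["rewards"], weights,
                              predictions[a]["negative"], sum(a), cost_per_message),
    )

# Toy predictions for one member: engagement rises with frequency, with diminishing returns.
preds = {
    (1, 0): {"rewards": {"play": 0.30}, "negative": 0.00},
    (3, 1): {"rewards": {"play": 0.40}, "negative": 0.01},
    (7, 3): {"rewards": {"play": 0.45}, "negative": 0.02},
}
```

With `cost_per_message=0.0` the toy policy picks the highest-frequency action, because the sparse negative signal barely registers; with a small positive cost such as `0.02` it settles on the moderate plan.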

Pacing Strategy

The two-stage design naturally allows for optimizing both the average frequency as well as pacing of messages over time. The simplest pacing strategy is uniform random: we translate the frequency target into a per-opportunity send probability and, at each eligible opportunity, effectively flip a weighted coin to decide whether to send. This produces an organically randomized pattern whose expected send rate matches the target.
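A minimal sketch of uniform random pacing, assuming the plan specifies a weekly target number of sends and that the number of eligible opportunities per week is known. Both function names are hypothetical.

```python
import random

def send_probability(target_sends: float, opportunities: int) -> float:
    """Per-opportunity probability whose expected send count matches the target."""
    if opportunities <= 0:
        return 0.0
    return min(1.0, target_sends / opportunities)

def should_send(target_sends: float, opportunities: int, rng: random.Random) -> bool:
    """Flip a weighted coin at one eligible opportunity."""
    return rng.random() < send_probability(target_sends, opportunities)
```

For example, a target of 3 sends across 7 daily opportunities gives a per-opportunity probability of 3/7, so the expected weekly send count is 3 even though the specific send days vary from week to week.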

While uniform pacing provides a clean and robust baseline, the framework readily extends to richer, non-uniform pacing profiles (for example, day-of-week patterns, conditioning on user activity, or launch-aligned bursts) whenever product or user-experience considerations call for more structured temporal distributions.

Policy-to-Policy Communication

The true power of this hierarchy lies in decoupling. By splitting into “Slow” and “Fast” policies, we allow each to focus on what it does best.

To bridge these two worlds asynchronously, we treat decisions as events and manage state through a low-latency feature store:

The Planner (Slow): The Slow policy calculates a member’s ideal pacing plan. It writes this strategic intent to a feature store.
The Executor (Fast): Every day, when a notification opportunity arises, the Fast Policy simply pulls that stored “plan” as a feature. It then executes the tactical send decision within those strategic guardrails.
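The handoff between the two policies can be sketched with an in-memory stand-in for the feature store. Everything here is hypothetical: the store interface, the feature name `weekly_push_target`, and the simple rule that the Fast policy stops sending once the weekly target is reached.

```python
from typing import Dict, List, Optional, Tuple

class InMemoryFeatureStore:
    """Stand-in for a low-latency feature store (hypothetical interface)."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, str], int] = {}

    def put(self, member_id: str, name: str, value: int) -> None:
        self._features[(member_id, name)] = value

    def get(self, member_id: str, name: str, default: int = 0) -> int:
        return self._features.get((member_id, name), default)

def slow_policy_publish(store: InMemoryFeatureStore, member_id: str,
                        weekly_push_target: int) -> None:
    """Planner: write the member's pacing plan as a feature."""
    store.put(member_id, "weekly_push_target", weekly_push_target)

def fast_policy_decide(store: InMemoryFeatureStore, member_id: str,
                       scored_candidates: List[Tuple[str, float]],
                       sent_this_week: int) -> Optional[str]:
    """Executor: read the stored plan, then pick the top-scored message if allowed."""
    target = store.get(member_id, "weekly_push_target")
    if sent_this_week >= target or not scored_candidates:
        return None
    return max(scored_candidates, key=lambda c: c[1])[0]
```

Because the Fast policy only reads a stored feature, the Slow policy can be retrained or replaced without changing the ranking code, which is the decoupling described above.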

This architecture provides two critical advantages:

“Stickiness”: It ensures a member receives a consistent experience. The Slow policy will be executed once at a defined cadence; the plan is stored and honored.
Independent Evolution: We can retrain, optimize, or A/B test our weekly pacing strategies (the “Slow” layer) without ever touching the real-time ranking logic (the “Fast” layer).

Figure 1: Schematic of the two-layer message personalization system composed of a slow planning policy (top) and a fast execution policy (bottom). A feature store serves as the communication bridge between the two policies.

Key Results & Takeaways

The transition to a hierarchical architecture resulted in one of our largest production metric lifts to date. We observed several key breakthroughs:

Empowering the “Casual Viewer”: Gains were most significant among members who watch less frequently — a critical cohort where timely, high-relevance awareness of new content is vital.
The Power of Decoupling: Separating frequency planning from message selection was as transformative as the modeling itself. This new architecture unlocks incredible flexibility, allowing us to iterate on content ranking models and pacing strategies as two independent, clean variables.
Respecting the Horizon: The impact of messaging is rarely an isolated event; its effects build up cumulatively based on ongoing interactions between our system and the member. By isolating pacing into a dedicated strategic layer, we now have the mechanism to explicitly manage long-term fatigue and opt-out risk.

Acknowledgments

We could not have delivered this project without the help of our outstanding colleagues, and we sincerely thank them for their contributions.

Feature Store Team: Aaron Lewis, Tom Switzer, Abby Whittier, Ray Zhang
Product: Fiona Li
AI for Member Systems (supporting contributor): Sergi Perez

Dynamic Repartitioning for Time Series Workloads

Netflix Technology Blog — Wed, 03 Jun 2026 01:40:22 GMT

By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk and Kartik Sathyanarayanan

Introduction

Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal event data with millisecond latency. We use Apache Cassandra 4.x as the underlying storage for these main reasons:

Throughput, latency, and cost: Cassandra can handle millions of low‑latency reads and writes in a cost-effective manner.
Operational maturity: Our data platform team has deep operational expertise running large Cassandra clusters in production.

However, using Cassandra at this scale introduces trade‑offs for TimeSeries workloads. A key challenge is wide partitions, as TimeSeries dataset partitions can grow quite large with events accumulating over time.

This problem is further compounded by the fact that TimeSeries servers routinely deal with a very high read throughput:

Reads/second for different datasets

This post walks through our journey to reduce the impact of wide partitions in our TimeSeries datasets, the solutions we built, and the lessons we learned.

Note: Although this post walks through re-partitioning in Cassandra, the same techniques can be applied more broadly to other data stores.

Impact of Wide Partitions

For most of our datasets, we observe an average read latency in the order of single-digit milliseconds:

Ideal Latency for Reads (ms)

However, in some datasets, as partitions grow too wide, we observe high read latencies in the order of seconds, especially towards the tail end:

High Tail Latency for Reads (seconds)

This can result in timeouts:

Read timeouts / second

In extreme cases, if most of the reads target wide partitions, we can see Garbage Collection pauses, high CPU utilization and thread queueing.

High CPU utilization and thread-queueing in Cassandra clusters

Scaling up the underlying Cassandra cluster is always an option, but we need smarter alternatives than just throwing more money at the problem.

TimeSeries Partitioning Strategy

The TimeSeries Abstraction was designed to solve the problem of wide partitions by dividing the data into discrete time chunks. For more in-depth information, refer to our previous blog.

To summarize, here is an illustration of how TimeSeries partitioning strategy helps us break up wide partitions into manageable chunks.

Time Series partitioning breaking up a dataset into Time slices, time buckets and event buckets

This strategy further allows us to efficiently query and drop data based on time, without having to deal with tombstones.

Picking the Partitioning Strategy

When a namespace (a.k.a. dataset) is created, users must specify their anticipated workload characteristics. This specification is then fed into our provisioning pipeline. The pipeline processes these inputs, runs Monte Carlo simulations, and produces an optimal infrastructure and partition configuration.

Provisioning picks optimal infra and configuration based on user inputs

You can learn more about our methodology of capacity planning in this insightful AWS re:Invent talk given by one of our stunning colleagues.

The Problem with the Current Approach

Although this method of provisioning is effective in many situations, it proves insufficient for TimeSeries workloads under these conditions:

Workload is unknown or inaccurately estimated: Early on in a project, users can lack a reliable picture of production traffic or simply misestimate key parameters.
Workload evolves over time: Traffic patterns, client behavior, and product requirements change. A “good” partitioning strategy on day one can become inefficient months later.
Data outliers exist: Not all TimeSeries IDs behave the same. A small percentage of IDs can receive a vastly higher volume of events than the rest.

Fortunately, our design with discrete Time Slices gives us a natural escape hatch for the first two scenarios; each new Time Slice can use a different partitioning strategy.

Each Time Slice can have a unique partition strategy

However, manually adjusting these configurations in a fleet that has thousands of TimeSeries datasets is not sustainable. We need automation.

Solution 1: Time Slice Re-Partitioning

Cassandra exposes useful introspection APIs for understanding data usage and access patterns. For example, nodetool tablehistograms provides percentile distributions for partition sizes in a table. Using these tools, we can detect cases of both over- and under-partitioning.

Below is an example of over‑partitioning, where the TimeSeries provisioning pipeline selected very small time_bucket intervals based on user provided inputs:

Provisioning selected 60s time buckets based on user inputs

causing partitions to have less than 10 KB of data, leading to high read amplification and thread queueing:

Histogram of the given Cassandra table showing partition size percentiles

To tune partition strategies efficiently, we added a background worker that monitors the partition histograms of Time Slices attached to a given application and exposes them via a Cassandra virtual table:

Histograms exposed through a Cassandra Virtual table

It then computes an adjustment factor when it detects partition sizes not meeting a configured density. This configured density is often set between 2 MiB and 10 MiB depending on the workload.

DynamicTimeSliceConfigWorker: 
namespace: my_dataset_1
Observed: TimeSlices have p99 partitions below configured target of 10MB. 
Proposed: time_bucket interval: 60s -> 604800s
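One plausible way to turn an observed partition size into a proposed interval is sketched below. It assumes partition size scales roughly linearly with the time_bucket interval and that intervals are chosen from a fixed menu; the menu values, function name, and rounding rule are all hypothetical and are not taken from the worker's actual logic.

```python
# Hypothetical menu of allowed time_bucket intervals, in seconds.
ALLOWED_INTERVALS_S = (60, 300, 3600, 21600, 86400, 604800)

def propose_time_bucket(current_interval_s: int, observed_p99_bytes: float,
                        target_bytes: float) -> int:
    """Scale the interval by target/observed and snap down to an allowed value."""
    if observed_p99_bytes <= 0:
        return current_interval_s  # no data to reason about; keep the current setting
    # Assumption: partition size grows linearly with the bucket interval.
    factor = target_bytes / observed_p99_bytes
    ideal = current_interval_s * factor
    candidates = [i for i in ALLOWED_INTERVALS_S if i <= ideal]
    return max(candidates) if candidates else min(ALLOWED_INTERVALS_S)
```

Snapping down rather than up keeps the proposed partitions at or below the target density, which errs on the side of avoiding wide partitions.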

The worker can then update future Time Slices with the new partition strategy:

Partitioning adjusted for future Time Slice(s)

This strategy has yielded real results in reducing our read latencies, as well as reducing the number of timeouts caused by thread queueing.

Reduction in tail latency and thread queueing for

However, this strategy only works when most of the data in a table is skewed enough to warrant re-partitioning the entire table. It does not help when only a small percentage of IDs within the table are wide.

We have a couple of options here:

Do Nothing: This is sometimes the right approach if there is no observed impact to the application’s top-level metrics.
Partial Returns: We implemented a ‘Partial Return’ feature, which aborts an inflight request if it has breached a configured latency SLO, while returning whatever data it has collected up until that point. This is a great option for clients who care more about latency than fetching all the data.

Tail latency drops around the SLO cutoff as Partial Returns are enabled

Block IDs: This is an extreme step but worth mentioning, because bad data does occasionally seep into the system, e.g., test or spam IDs that can make the system unstable.

dgwts.config..block.Ids: ", , "
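
The Partial Returns option listed above can be pictured as a deadline-bounded read loop. The sketch below is an assumption-laden illustration, not the service's implementation: the `ReadResult` type, the function name, and the injectable `clock` parameter are all hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class ReadResult:
    records: list
    partial: bool  # True when the read was cut off at the latency SLO


def read_with_partial_return(pages, slo_seconds, clock=time.monotonic):
    """Collect pages until they run out or the latency budget is spent.

    `pages` is any iterable that yields lists of records, one per fetch.
    """
    deadline = clock() + slo_seconds
    records = []
    for page in pages:
        records.extend(page)
        if clock() >= deadline:
            # Budget breached: stop fetching and return what we have so far.
            return ReadResult(records, partial=True)
    return ReadResult(records, partial=False)
```

Returning an explicit `partial` flag lets a latency-sensitive caller decide whether the truncated result is good enough or whether to retry.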

Ultimately, we encounter scenarios where valid and important TimeSeries IDs accumulate a high enough volume of events, with callers needing to process all the related data. Simply tolerating elevated latencies or timeouts when querying these IDs is not a desirable outcome.

This is where dynamic partitioning comes into play.

Solution 2: Dynamic Partitioning per ID

Dynamic partitioning is an asynchronous pipeline that auto-detects and splits wide partitions on a TimeSeries ID level rather than at the table level.

It has three main stages:

Detection: Detects wide partitions for a given TimeSeries ID during the read path.
Planning & Splitting: Plans and executes splits of those partitions into optimal sizes asynchronously.
Serving Reads: Re-routes the read queries transparently to read data from the split partitions when ready.

This is how it works at a high level; we will dive into details after:

Dynamic Wide Partition Split Async Pipeline

Here are the different stages of the pipeline:

Detection

Every TimeSeries read operation tracks how many bytes are read for a given partition. If the bytes read exceed a configured threshold, the server emits a detection event to Kafka:

{
  "time_slice": "data_20260328", // the Cassandra table this event was detected in
  "time_series_id": "profileId:123", // the ID detected as wide
  "time_bucket": 7, // the existing time_bucket partition
  "event_bucket": 2, // the existing event_bucket partition
  "immutable": true, // TimeSeries servers can compute if this partition is no longer receiving writes
  "version": "0" // reserved for future use e.g. invalidate if partition is no longer immutable
}
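
As a rough sketch of the detection check, the server could build such an event only when the bytes read cross the threshold. The threshold value and the function name below are hypothetical, and publishing to Kafka is deliberately left out.

```python
# Hypothetical threshold; the real value is a per-deployment configuration.
WIDE_PARTITION_THRESHOLD_BYTES = 64 * 1024 * 1024


def build_detection_event(time_slice, time_series_id, time_bucket,
                          event_bucket, bytes_read, immutable,
                          threshold_bytes=WIDE_PARTITION_THRESHOLD_BYTES):
    """Return a detection event dict, or None if the partition is not wide."""
    if bytes_read <= threshold_bytes:
        return None
    return {
        "time_slice": time_slice,
        "time_series_id": time_series_id,
        "time_bucket": time_bucket,
        "event_bucket": event_bucket,
        "immutable": immutable,
        "version": "0",
    }
```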

Our decision to detect wide partitions on reads, as opposed to writes, is based on our observation that the majority of the data in the wild doesn’t need this treatment. The slight downside is that some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up.

Immutability

Although splitting mutable partitions is possible, it is inherently more complex. As a first step towards solving this problem, we chose to reduce the surface area of this change by focusing on immutable partitions, while still meaningfully reducing caller timeouts.

Planning

Detection may occur based on a partial read, so the planner must still read the entire partition once to compute an accurate split plan. Checkpointing becomes crucial here: when a planning read fails before processing the entire partition, the process can resume from the last saved checkpoint.

Checkpointing

The wide_row metadata table serves as the backbone for state transitions and checkpointing of partition splits. It also stores information that is used later by TimeSeries servers to properly route Read queries.

wide_row metadata for storing split states and checkpoints
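
A checkpointed planning scan might look like the sketch below. It is illustrative only: a plain dict stands in for the wide_row metadata table, `read_page` is a hypothetical paging reader, and CRC32 is used simply because its running value can be saved and resumed (the actual checksum algorithm is not stated in this post).

```python
import zlib


def plan_partition(read_page, checkpoints, partition_key, page_size=2):
    """Scan a whole partition, resuming from the last saved checkpoint.

    read_page(offset, limit) returns a list of byte-string records and may
    raise (for example on a timeout). `checkpoints` is a dict standing in
    for the metadata table.
    """
    state = checkpoints.get(partition_key,
                            {"offset": 0, "total_bytes": 0, "crc": 0})
    while True:
        page = read_page(state["offset"], page_size)
        if not page:
            break  # reached the end of the partition
        crc, total = state["crc"], state["total_bytes"]
        for record in page:
            crc = zlib.crc32(record, crc)  # running checksum
            total += len(record)
        state = {"offset": state["offset"] + len(page),
                 "total_bytes": total, "crc": crc}
        checkpoints[partition_key] = state  # persist progress per page
    return state


records = [b"a", b"bb", b"ccc", b"dddd", b"e"]
calls = {"n": 0}


def flaky_read(offset, limit):
    calls["n"] += 1
    if calls["n"] == 2:
        raise TimeoutError("simulated read timeout")
    return records[offset:offset + limit]


checkpoints = {}
try:
    plan_partition(flaky_read, checkpoints, "profileId:123")
except TimeoutError:
    pass  # the first attempt fails after one page was checkpointed
result = plan_partition(flaky_read, checkpoints, "profileId:123")
```

The second call picks up at the saved offset rather than rescanning the first page.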

Splitting

The Planner delegates the splitting of data to an appropriate split-strategy. For example, if EventBucketPartitionSplitStrategy is selected, we split the partition by assigning more event buckets to the same time bucket. If the partition is ultra-wide, we cap the number of event buckets we split into, in order to control the resultant read amplification. Spreading into multiple partitions in such cases is still beneficial in order to spread the read workload to multiple Cassandra replicas.

Split by assigning more event buckets for a given time bucket

Further, since the Splitter has the full view of the partition, it can ensure total sort order across all the split buckets.
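
One way to picture an event-bucket split with a cap is shown below. This is a sketch under assumptions rather than the EventBucketPartitionSplitStrategy itself: the function name, the parameters, and the choice of contiguous ranges (which preserves total order across buckets) are all illustrative.

```python
import math


def split_into_event_buckets(events, partition_bytes, target_bytes,
                             max_buckets, start_event_bucket):
    """Split one wide partition into ordered, contiguous event buckets.

    `events` is a list of (event_time, payload) tuples.
    """
    wanted = math.ceil(partition_bytes / target_bytes)
    # Cap the fan-out to bound read amplification for ultra-wide partitions.
    num_buckets = max(1, min(wanted, max_buckets))
    ordered = sorted(events, key=lambda e: e[0])
    per_bucket = math.ceil(len(ordered) / num_buckets)
    buckets = {}
    for i in range(num_buckets):
        chunk = ordered[i * per_bucket:(i + 1) * per_bucket]
        buckets[start_event_bucket + i] = chunk
    return buckets


events = [(t, f"e{t}") for t in [5, 1, 4, 2, 3, 0]]
buckets = split_into_event_buckets(events, 500 * 1024 * 1024,
                                   10 * 1024 * 1024, max_buckets=4,
                                   start_event_bucket=32)
```

Because each bucket holds a contiguous time range, reading the buckets in order yields the events in sorted order without a global re-sort.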

Validating Splits

The Planner stores a pre-split checksum of a given partition during the planning phase, while the Splitter computes and stores the post-split checksum. The split status is marked as completed only if the two checksums match.

Ensure checksums match pre- and post-split before marking a split as COMPLETED
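
The checksum gate can be sketched as follows. SHA-256 over length-prefixed, sorted records is an assumption made for illustration, and the FAILED status name is hypothetical; only COMPLETED appears in this post.

```python
import hashlib


def partition_checksum(records):
    """Order-independent checksum over a collection of byte-string records."""
    digest = hashlib.sha256()
    for record in sorted(records):
        # Length-prefix each record so record boundaries affect the digest.
        digest.update(len(record).to_bytes(8, "big"))
        digest.update(record)
    return digest.hexdigest()


def finalize_split(pre_split_checksum, split_buckets):
    """Mark the split COMPLETED only if the post-split data matches."""
    post_split = partition_checksum(
        [record for bucket in split_buckets for record in bucket])
    return "COMPLETED" if post_split == pre_split_checksum else "FAILED"
```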

Tracking Splits

The pre- and post-split partition sizes across different datasets are tracked to see how effectively the partition splits are being planned and executed:

Track pre- and post-split partition sizes to ensure we are splitting optimally

Serving Reads

The TimeSeries servers load the partition-keys of completed splits periodically into in-memory Bloom filters. Every read operation checks the Bloom filter to see whether a query can be diverted to the split partitions.

Here is what the Read path looks like:

Read path for diverting reads to existing or split partitions
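
To make the Bloom filter check concrete, here is a small self-contained sketch. The filter implementation, its sizing, and the key format are assumptions for illustration; the actual servers presumably use a library implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def partition_key(time_slice, time_series_id, time_bucket, event_bucket):
    # Hypothetical key format for illustration.
    return f"{time_slice}|{time_series_id}|{time_bucket}|{event_bucket}"


def choose_read_path(bloom, key):
    """A hit means 'possibly split'; a miss means 'definitely not split'."""
    return "split" if bloom.might_contain(key) else "original"
```

Since a Bloom filter can return false positives but never false negatives, a hit still needs to be confirmed against the metadata, while a miss can safely go straight to the original partition.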

The size of the Bloom filters is monitored to ensure we have enough memory per server. Because partition keys are compact and only a small ratio of partitions in a given dataset are wide, the filters fit comfortably in each server instance.

Bloom filter approximate element count per namespace and time slice

The Bloom filter latency to check whether a given partition key is wide for every read request is typically in single-digit microseconds or better, making this diversion practically invisible to the callers.

Latency for checking Bloom filters is too small for callers to notice the diversion

For the cases that do end up with a Bloom filter hit, the TimeSeries servers look up the wide_row metadata to see how to read a specific wide partition:

{
  "pre_split_data": {
    "time_slice": "data_20260328",
    "time_series_id": "6313825", // what to read
    "time_bucket": 0,
    "event_bucket": 2
    …
  },
  "post_split_data": {
    "time_slice": "wide_data_20260328_0", // where to read it from
    "event_bucket_partition_strategy": { // strategy to delegate to for reading
      "target_event_buckets": 2,
      "start_event_bucket": 32 // how the strategy should read it
    }
  }
  …
}

This metadata read is backed by a read-through cache, making it quite performant:

Metadata fetch latency is too low to affect read operations
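
A read-through cache of this kind can be sketched in a few lines. The class name, the TTL-based expiry, and the injectable clock are assumptions for illustration; the post does not describe the cache's eviction policy.

```python
import time


class ReadThroughCache:
    """On a miss, load from the backing store and remember the result."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit
        value = self._loader(key)  # miss or expired: read through
        self._entries[key] = (value, now + self._ttl)
        return value
```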

Finally, the reads for the split partitions are delegated to our existing PartitionReader, which reads N smaller partitions in parallel, rather than 1 large partition, improving overall performance and stability!

Read much smaller partitions in parallel and merge results
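
The parallel read-and-merge step can be illustrated as below. This is not the PartitionReader itself: `read_partition` is a hypothetical callable that returns one split partition's records already sorted by event time.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def read_split_partitions(read_partition, event_buckets, max_workers=8):
    """Read N small partitions in parallel, then merge them in sort order.

    read_partition(event_bucket) must return a sorted list of
    (event_time, payload) tuples.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = list(pool.map(read_partition, event_buckets))
    # heapq.merge streams a k-way merge over already-sorted inputs.
    return list(heapq.merge(*chunks))


fake_store = {
    32: [(1, "a"), (4, "d")],
    33: [(2, "b"), (3, "c")],
}
merged = read_split_partitions(fake_store.__getitem__, [32, 33])
```

A k-way merge avoids re-sorting the combined result, which matters when the merged partition is hundreds of megabytes.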

Fallbacks

The existing wide partition from the original time slice is never deleted. This helps us in creating safe fallbacks in many different scenarios of partial failures and eventual consistency. The slightly larger storage space we use as a result is worth the operational safety we gain.

Building Additional Confidence

Serving incorrect reads would be disastrous. To establish trust beyond checksums, we leveraged additional mechanisms such as:

Using our existing Data Bridge pipelines to verify splits offline:

Spark job to ensure that the split data is an exact match to the original data

Implementing a phased rollout strategy to safely advance through stages as our confidence in the system grew:

Advance through Read modes once previous mode passes checks

A critical part of this phased rollout was the Comparison phase, which compared bytes served by old read path and the new read path while in shadow mode:

A chart of bytes match vs bytes differ in a given shadow period
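
A shadow-mode comparison along these lines could be sketched as follows. The function and counter names are hypothetical; the key property is that callers are always served from the old path while the new path is only measured.

```python
def shadow_read(read_old, read_new, key, stats):
    """Serve from the old path; compare the new path's bytes on the side."""
    served = read_old(key)
    try:
        candidate = read_new(key)
    except Exception:
        # A failing new path must never affect what the caller receives.
        stats["new_path_errors"] = stats.get("new_path_errors", 0) + 1
        return served
    bucket = "bytes_match" if candidate == served else "bytes_differ"
    stats[bucket] = stats.get(bucket, 0) + len(served)
    return served
```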

Results

As a result of these dynamic splits, we see a huge improvement in the average read latency of most wide partitions, bringing it down from seconds:

Existing average latency for reading wide partitions

to low double-digit milliseconds!

Average latency for reading dynamically split partitions

Tail latencies of reading wide partitions dropped from several seconds:

Existing tail latency for reading wide partitions

to around 200 ms or better:

Tail latency for reading dynamically split partitions

resulting in a drop in read timeouts:

Overall, this has resulted in a more stable Cassandra cluster with lower CPU utilization and little to no thread queuing:

Low CPU utilization and no thread-queueing

Further, for extremely wide rows, where a dataset would previously face constant timeouts and unavailability blips, the service was able to paginate and query 500MB+ partitions while remaining available:

grpc … com.netflix.dgw.ts.TimeSeriesService/SearchEventRecords -d
'{"namespace": "...",
    "search_query": {...},
    "time_interval": {
      "start": "2026-05-11T23:42:51.484398Z",
      "end": "2026-05-12T00:13:50.694205Z"
    },
    "pageSize" : 1000
  }'
# Response:
{
  "next_page_token" : ….,
  "records": [
    {
      …
    }
  ],
  "response_context": [{
    "namespace": "...",
    …
    # Trades elevated latency for being available
    "time_taken": "41.072410142s"
    }
  ]
}

Conclusion

There is more work planned around this feature, like splitting mutable wide partitions, or re-processing previously failed splits, but this has been a successful start in improving service performance and reducing our support burden.

Further, we would like to highlight some key lessons that we learned at different points in this journey.

Reducing Surface Area: As a first step, explore simpler solutions that can still deliver meaningful impact. Also, reducing the surface area of a complex change and deploying incrementally pays off operationally.
Building Confidence: Invest time and resources to build confidence in new features, especially when justified by the feature complexity, deployment blast radius, and/or potential impact.

Acknowledgements: Special thanks to our stunning colleagues who further contributed to this feature’s success: Tom DeVoe, Chris Lohfink, Sumanth Pasupuleti and Joey Lynch.

Dynamic Repartitioning for Time Series Workloads was originally published in Netflix TechBlog on Medium.

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map

Netflix Technology Blog — Fri, 29 May 2026 14:01:01 GMT

By Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez & Nathan Fisher
How we built a living map of our distributed infrastructure to help engineers understand dependencies, troubleshoot faster, and keep Netflix running smoothly for our members around the world.

The Puzzle with a Thousand Pieces

Picture this: It’s 3am, and an engineer gets paged. One of our critical services is showing elevated error rates. Members trying to watch their favorite films and series are seeing degraded experiences. The clock is ticking.

A single service at the center of a web of dependencies — services, data stores, and call chains branching in every direction. Without a unified map, engineers have to reason about this structure from memory and scattered signals.

In a system with thousands of microservices supporting our entertainment experience for members worldwide, answering these questions quickly can mean the difference between a minor blip and a major incident.

We kept hearing variations of this story from engineers across Netflix. The tooling gap was clear: we had plenty of signals, but no unified way to understand how everything connected.

The Three Questions Every Engineer Asks

When troubleshooting distributed systems, engineers fundamentally need to understand relationships:

Which services depend on each other? Not just theoretical dependencies from configuration files or architecture diagrams, but actual runtime connections based on real traffic.

What’s the blast radius? When something breaks or needs to go down for maintenance, what else will be affected? Which teams need to be notified?

Where’s the source? Is my problem caused by an upstream issue, or am I the root cause that’s cascading to others?

Traditional observability tools show fragments of this picture. Metrics show symptoms and performance characteristics. Logs show individual service behavior. Traces show single request flows through the system. But none of them show the complete map of how everything connects — the steady-state topology of dependencies that forms the backbone of our distributed architecture.

For an engineer at 3am, having to mentally stitch together information from multiple tools is slow, error-prone, and stressful. We needed something better: a unified view of service dependencies — a map showing how everything connects — with easy navigation to the detailed signals when you need to dig deeper.

Why This Matters More Than Ever

Netflix runs on thousands of microservices working together to deliver entertainment to our members. When you press play on your favorite series, that single action triggers a cascade of service-to-service calls — authentication, recommendations tailored to your tastes, video encoding selection, playback optimization, and more.

This architecture gives us tremendous flexibility and allows hundreds of engineering teams to innovate independently. But it also creates fundamental observability challenges.

And these challenges were growing. New initiatives like our Live programming and Ads-supported plans require even more sophisticated monitoring and faster troubleshooting. Live events can’t wait for lengthy incident investigations. The scale and real-time nature of these systems demanded better tooling.

We analyzed thousands of support requests from our engineers over a four-year period. The patterns were consistent:

“What are my upstream and downstream dependencies?”
“Is this failure in my service, or is something I depend on broken?”
“Which services will be impacted if I take this down for maintenance?”
“Why is this service showing as ‘Unknown’ in my metrics?”
“What changed in my call path recently that could explain this behavior?”

Engineers were asking dependency questions constantly. We needed to provide answers — quickly, accurately, and in real-time.

Building on What We Learned

We didn’t start from scratch. Over the years, we explored various approaches to solving this problem — from evaluating external graph databases and vendor platforms to building internal prototypes with different storage technologies and data models.

Each iteration taught us something valuable:

Real-time matters: Dependency maps that are hours old are useless in dynamic environments where services deploy multiple times per day. We needed near real-time updates.

Scale changes everything: Solutions that work at modest scale hit fundamental walls at Netflix scale. Storage systems that handle thousands of nodes struggle with our service count and traffic volume.

Integration is key: Any solution needs seamless integration with our existing observability ecosystem. Engineers shouldn’t have to learn entirely new tools or leave their existing workflows.

Data quality is critical: Incomplete or incorrect dependency information is worse than no information — it leads to wrong conclusions during incidents.

Multiple perspectives needed: We learned that no single source of dependency information tells the complete story. Network connectivity data lacks application context. Application metrics only cover instrumented services. We needed to combine multiple sources.

These lessons shaped every decision we made in building Service Topology.

What We Needed: A Living Map

We set out to build something specific: a living map of our infrastructure — one that updates in real-time as services deploy, as traffic patterns shift, as new dependencies form and old ones disappear.

The requirements were clear:

Real-time updates, not stale snapshots: In an environment where services deploy continuously, yesterday’s topology map is archaeology, not observability.

Fast queries at scale: When an engineer is troubleshooting at 3am, they can’t wait minutes for a query to return. We needed sub-second response times for traversing the call graph.

Multiple layers: Network-level connectivity doesn’t tell the whole story. We needed to see both the network layer (what’s actually talking to what) and the application layer (which APIs and endpoints are being called).

Rich context, not just connections: Knowing Service A talks to Service B isn’t enough. We needed to overlay health status, availability tiers, business domains, ownership information, and other metadata to make the information actionable.

Visual and programmatic access: Engineers needed a UI for exploration and troubleshooting. But automated systems — resilience frameworks, blast radius calculators, incident response automation — needed programmatic API access.

Our Approach: Three Sources of Truth

Three data sources produce three independent topology graphs — network, application, and request — each stored separately and queryable on their own or merged into a single unified view.

Here’s the key insight we arrived at: no single source tells the complete story.

We built Service Topology by using three complementary sources to build separate dependency graphs — one from each perspective — that can be combined into a unified view or explored independently:

Each source creates its own graph that is physically separate — the network layer in one graph database partition, the IPC layer in another partition, and the tracing layer using columnar storage optimized for analytical queries. This physical separation allows each layer to evolve independently and be queried in parallel. When users request a unified view, we execute traversal queries across all layers simultaneously and merge results, achieving sub-second response times even when combining all three layers.

Each source creates its own graph of service relationships:

1. eBPF Network Flows (Network Layer)

We capture network flow records at the kernel level using eBPF technology — information about which services are connecting to which other services over the network. This gives us ground truth about actual network-level communication.

The value: Comprehensive coverage. Every service shows up here because we’re capturing actual network traffic, regardless of whether applications are instrumented. This layer provides topology at both cluster-level (which deployment clusters are communicating) and app-level (which applications are communicating).

The limitation: Network-level information lacks application context. We know Service A connected to Service B’s IP address using a specific protocol, but not which specific API endpoint or path was called (e.g., /api/v1/users vs /api/v1/orders).

2. IPC Metrics (Application Layer)

We collect Inter-Process Communication metrics from our instrumented services. These are the metrics applications emit when they make calls to other services via gRPC, GraphQL, REST, or other protocols.

The value: Rich application context. We can see which specific endpoints were called, error rates, latency distributions, protocol details, and request/response characteristics. This layer provides app-level topology — since IPC metrics are emitted by applications, the natural granularity is application-to-application connections with endpoint details.

The limitation: Only works for instrumented services. If a service doesn’t emit IPC metrics, we won’t see its application-level calls this way.

3. End-to-End Tracing (Request Layer)

We integrate distributed tracing information that follows individual requests as they flow through our system. We aggregate traces to build a unified topology graph, but also allow engineers to overlay individual traces on the topology to see specific request flows.

The value: Shows actual request paths. Not just “Service A can call Service B,” but “Service A did call Service B as part of serving this specific member request.” This captures runtime behavior, including conditional logic and feature flags. Engineers can both see the aggregated pattern and drill into individual traces. We aggregate traces to build topology at both cluster-level and app-level, allowing engineers to view request patterns at the granularity most useful for their investigation.

The limitation: Sampling. We can’t trace every request without impacting performance, so we sample. This is excellent for understanding common flows, but may miss rarely-used code paths in the aggregated view.

Bringing It Together: Multi-Layer Architecture

Here’s what makes this powerful: we build three separate graphs — one from each source — that create different perspectives on service relationships:

Network graph from eBPF flows: Every connection, regardless of instrumentation
Application graph from IPC metrics: Rich endpoint and protocol details
Request graph from tracing: Actual runtime behavior and call paths

Engineers can:

View each graph independently to focus on a specific perspective (pure network connectivity, application-level calls, or traced request flows)
Combine them into a unified graph by querying multiple partitions in parallel and merging results — our system returns the union of nodes and edges from all requested layers while preserving each layer’s distinct properties
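
The union-with-preserved-properties merge described above might be sketched like this. The in-memory graph shape and all names are assumptions for illustration; the real system queries separate storage partitions.

```python
def merge_layers(layers):
    """Union nodes and edges across layers, keeping per-layer properties.

    `layers` maps a layer name to {"nodes": set, "edges": {(src, dst): props}}.
    """
    nodes = set()
    edges = {}
    for layer_name, graph in layers.items():
        nodes |= graph["nodes"]
        for edge, props in graph["edges"].items():
            # Keep each layer's properties under its own key, unmerged.
            edges.setdefault(edge, {})[layer_name] = props
    return {"nodes": nodes, "edges": edges}


unified = merge_layers({
    "network": {"nodes": {"a", "b", "c"},
                "edges": {("a", "b"): {"protocol": "tcp"},
                          ("a", "c"): {"protocol": "tcp"}}},
    "ipc": {"nodes": {"a", "b"},
            "edges": {("a", "b"): {"endpoint": "/api/v1/users"}}},
})
```

Here the edge from a to c is visible only in the network layer, while the edge from a to b carries both network and IPC details.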

The unified view is especially powerful because:

Network flows ensure completeness — we don’t miss anything
IPC metrics provide application details — we understand the “how” and “what”
Tracing shows actual behavior — we see real request patterns

Each source compensates for the limitations of the others. The result is a comprehensive, accurate, and contextualized view of service dependencies that can be explored from multiple angles.

From Flows to Graph: How We Built It

Here’s the high-level architecture (we’ll dive deeper into engineering challenges in our next post):

Flow logs travel from multi-region Kafka through three aggregation stages — initial batching, intermediary resolution, and final enrichment — before being persisted to the graph database and served via API.

Multi-Region Ingestion: We consume flow logs from Kafka across multiple AWS regions where Netflix operates. This runs continuously, processing millions of flow records as they arrive.

Distributed Processing: We use Apache Pekko Streams (a fork of Akka) to process these flows in a distributed, fault-tolerant pipeline. The system automatically partitions work across our Auto Scaling Groups to handle the volume and provides natural backpressure handling.

Three-Stage Distributed Aggregation: We aggregate network flows through a three-stage pipeline that solves a fundamental challenge: network flow logs only show individual network hops through intermediaries (App A → Load Balancer → App B, or App A → NAT Gateway → App B), not the true application-level connections we need (App A → App B).

Stage 2 resolves network intermediaries: raw flow logs show two separate hops (App A → Load Balancer → App B), but the resolved graph stores the direct application-to-application relationship (App A → App B).

Stage 1 performs initial aggregation from Kafka. Stage 2 applies resolution logic — identifying network intermediaries (load balancers, NAT gateways, API gateways, proxies) and combining their incoming and outgoing flows to reconstruct direct application-to-application paths. Stage 3 performs final aggregation with health status integration before graph persistence. This graduated approach also prevents hot spots by distributing load across multiple points even when specific applications or network intermediaries see 100x more traffic than others.
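
As a simplified illustration of the Stage 2 idea, the sketch below joins each intermediary's inbound and outbound flows into direct edges. It is an assumption-heavy toy: it pairs every caller of an intermediary with every backend behind it (an over-approximation unless flows can be correlated more precisely) and does not handle chains of intermediaries.

```python
from collections import defaultdict


def resolve_intermediaries(flows, intermediaries):
    """Collapse App -> intermediary -> App hops into direct App -> App edges.

    `flows` is an iterable of (src, dst) pairs.
    """
    inbound = defaultdict(set)   # intermediary -> callers
    outbound = defaultdict(set)  # intermediary -> backends
    direct = set()
    for src, dst in flows:
        if dst in intermediaries:
            inbound[dst].add(src)
        elif src in intermediaries:
            outbound[src].add(dst)
        else:
            direct.add((src, dst))
    for hop in intermediaries:
        for src in inbound[hop]:
            for dst in outbound[hop]:
                direct.add((src, dst))
    return direct


flows = [("app_a", "lb"), ("lb", "app_b"), ("app_c", "app_d")]
resolved = resolve_intermediaries(flows, {"lb"})
```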

Graph Storage: We persist the topology in Netflix’s graph database, an abstraction layer built on top of our distributed key-value storage infrastructure. This graph database is specifically designed for high-throughput graph operations at our scale, with fast multi-hop traversal capabilities. Each of our three data sources (network flows, IPC metrics, tracing) creates a separate graph that can be queried independently or merged.

gRPC API: We expose the topology through a gRPC service that supports multi-hop traversal, filtering by availability tier and business domain, pagination for large result sets, and sub-second query response times.

The technical details of building this at Netflix scale — handling Kafka lag, managing memory and garbage collection, optimizing distributed processing, debugging reactive streams — deserve their own discussion. We learned a lot, and we’ll share those lessons in our next post.

What Engineers Can Do Now

Today, the service topology map is helping engineers across Netflix:

Visualize Dependencies: See upstream and downstream dependencies for any service, with the ability to filter by availability tier (Tier 0, Tier 1, etc.) and business domain. Choose between the unified view (combining all sources) or individual graph views (network-only, IPC-only, or trace-only) depending on what you’re investigating.

Jump to Detailed Signals: From any service in the topology, quickly navigate to logs, traces, and detailed metrics in their respective tools. No more hunting for the right service name or time window — the topology provides the context and the starting point.

Understand Blast Radius: Before taking a service down for maintenance or making significant changes, see exactly what will be impacted. Identify which teams to notify and what to monitor.

Overlay Health Status: See not just the topology, but which services in the call path are experiencing issues. This is integrated with health status tracking, so you can quickly identify if a problem you’re seeing is actually originating somewhere else.

Query Programmatically: Use our gRPC API to integrate topology information into automated systems. For example, our Platform Modernization Engineering team uses this to verify that critical Live services have proper availability tier classifications throughout their dependency chains.

Investigate Faster: During incidents, quickly identify if a failure is local or if it’s propagating from somewhere else in the call graph. Follow the failure pattern to find the root cause.

Plan Changes Confidently: Understand the impact of proposed architectural changes or service migrations before implementing them.

Time Travel Through Topology: Query what the topology looked like at specific points in the past. Understand what changed in dependencies around the time an issue started, or see how your service’s dependency footprint has evolved over time. This time-travel capability is powered by time-window aggregation — instead of storing every time slice separately, we use layer-specific aggregators that accumulate topology data across windows, allowing us to reconstruct historical views efficiently without exploding storage costs.
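
The blast-radius question in the list above is, at its core, a multi-hop traversal over reversed call edges. The following is a minimal in-memory sketch, not the gRPC API; the function name and the `max_hops` parameter are hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(edges, service, max_hops=None):
    """Return {dependent_service: hops} for everything that calls `service`.

    `edges` is an iterable of (caller, callee) pairs.
    """
    callers_of = defaultdict(set)
    for caller, callee in edges:
        callers_of[callee].add(caller)
    hops = {service: 0}
    queue = deque([service])
    while queue:  # breadth-first search over reversed edges
        current = queue.popleft()
        if max_hops is not None and hops[current] >= max_hops:
            continue
        for caller in callers_of[current]:
            if caller not in hops:
                hops[caller] = hops[current] + 1
                queue.append(caller)
    del hops[service]
    return hops


edges = [("web", "api"), ("mobile", "api"), ("api", "auth"),
         ("auth", "db"), ("batch", "db")]
impacted = blast_radius(edges, "db")
```

Reporting the hop count alongside each dependent makes it easier to separate direct callers from services that are only transitively affected.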

The Living Map: Always Current

What makes this truly useful is that it’s a living map. It’s not a static diagram drawn in a design document that goes out of date the moment it’s published. It’s continuously updated based on actual traffic:

When a new service starts calling an API, it appears in the topology with near real-time freshness
When a service stops making calls to a dependency, that edge fades from the graph
When services deploy and their behavior changes, the topology reflects it
When incidents impact service health, the status overlay updates in real-time

This means engineers can trust what they see. The map reflects reality, not someone’s idea of what the architecture should be.

The Journey Continues

We’re not done. We continue to evolve the system with new capabilities:

Change Event Overlay: We’re working to surface deployment events, configuration changes, and other mutations alongside the topology graph. Correlation becomes easier when you can see both the dependencies and what changed when.

Richer Context: As we expand coverage and integrate more signals, we continue to enrich the topology with additional endpoint-level details, protocol information, and network path context.

And looking further ahead, we’re excited about something bigger: Automated root cause analysis. Imagine an intelligent agent that continuously crawls the topology graph, correlates failures across dependencies, understands historical patterns, and surfaces likely root causes automatically. Service topology provides the knowledge graph foundation that makes this kind of intelligent automation possible.

Why This Matters for Our Members

This might seem like infrastructure — plumbing that our members never see directly. But it matters immensely to their experience.

When engineers can quickly understand dependencies and identify issues, incidents get resolved faster. When we can model blast radius before making changes, we avoid disruptions. When automated systems can query dependency information programmatically, we can build smarter, more resilient systems.

All of this translates to what matters most: our members getting to watch their favorite films and series, seamlessly, whenever they want. Whether it’s a weekend binge of a beloved show, a live sports event, or discovering something new through our recommendations tailored to their tastes — we want it to just work.

What’s Next in This Series

This is the first in a series of posts about building Service Topology at Netflix.

In our next post, we’ll pull back the curtain on the engineering challenges we faced at scale: How do you handle Kafka consumer lag when ingesting millions of flow logs per second? What happens when distributed processing meets garbage collection pauses? How do you debug reactive streams that stall under load? How do you manage hot nodes in a distributed system? We’ll share the real problems we hit in production and the solutions we developed.

In future posts, we’ll explore the lessons we learned that apply to any distributed system at scale, and where we’re heading next with time travel capabilities and automated root cause analysis.

Acknowledgements

This post was written by Parth Jain.

Service Topology was built by Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez-Silva, and Nathan Fisher.

Special thanks to the many engineers across Netflix who made this possible — the Observability team who built the broader system, the graph database platform team who provided the storage foundation, and the Platform Modernization Engineering, Live, and Ads teams who provided invaluable feedback and use cases throughout development.

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map was originally published in Netflix TechBlog on Medium.

The Evolution of Cassandra Data Movement at Netflix

Netflix Technology Blog — Mon, 18 May 2026 20:45:38 GMT

By Guil Pires, Jennifer Prince, Jose Camacho, Ken Kurzweil, Phanindra Chunduru

Background

In a previous post, we introduced Data Bridge, a unified management plane for batch Data Movement at Netflix. Historically, several bespoke Data Movement connectors were developed across different engineering organizations to fulfill their specific requirements. Over the last few years, the Data Movement team has started centralizing these offerings through an abstraction that provides a catalog of connectors, along with simple UI and APIs to initiate Data Movement jobs.

One such case is the Cassandra to Iceberg connector. Apache Cassandra powers mission-critical applications at Netflix, including Member, Billing, Recommendations, Subscriptions and many more. These use cases heavily leverage Data Movement to Apache Iceberg for many analytics and operational tasks, and central to this movement was an in-house Cassandra to Iceberg connector named Casspactor. As many Cassandra-based Data Abstractions emerged, such as Key Value, Time Series and Graph, the need for larger and more complex Data Movement with transformations became more critical to the business.

Data movements are fundamentally fulfilled by leveraging the existing Cassandra backup infrastructure. Regularly scheduled backups are performed directly on the Apache Cassandra nodes, via a sidecar process managing the upload of all necessary SSTables and associated Metadata files directly into Amazon S3. When a Data Movement job is initiated, the job constructs the specific backup structure it needs by referencing the S3 based metadata, allowing it to precisely locate the SSTable files. The engine then downloads these files, performs the required mutation compaction and processing, and finally writes the fully transformed, compacted data directly into the target Apache Iceberg tables.

Image 1: Cassandra Cluster Backups to S3

Casspactor: The Engine We Outgrew

Casspactor processed roughly 1,200 data movements per day, transferring approximately 3 PB of data from Apache Cassandra into Apache Iceberg tables. It served some of the most critical workloads at Netflix. For years, it worked. Then, two compounding challenges made it clear we needed a fundamentally different architecture.

Fragile Metadata Dependencies

Before Casspactor could move a single record, it needed to answer a deceptively simple question: which backup exists, is it complete, and what does it contain?

Casspactor assembled this answer from multiple independent systems:

Image 2: Casspactor’s Composite View of a Backup

Each system had its own failure modes, update cadences, and accuracy guarantees. Casspactor’s view of the world was a composite, and composites diverge from reality.

Metadata fell out of sync with actual backups, causing Casspactor to read stale or incorrect data silently. Routine maintenance on the Cassandra Clusters triggered uncoordinated snapshots, and because Casspactor required all nodes in a region to snapshot at the same clock second, a single node replacement could break data movement for an entire region.

The fix was hiding in plain sight. The answer to “which backup exists and is it complete?” already lived in the backup storage layer (Amazon S3) itself. By reading metadata directly from the backup files, we could replace the entire dependency chain with a single source of truth.

Every Connector Inherited Casspactor’s Limitations

Cassandra at Netflix does not just store raw tables. It backs higher level data abstractions, such as Key Value, Time Series, and others, each with its own data model, access patterns, and semantics. When any of these abstractions needed to move data to Iceberg, they all funneled through Casspactor.

Every abstraction inherited Casspactor’s constraints:

Skewed partition failures: Casspactor could not handle tables with large partitions, a common pattern in Key Value and Time Series workloads. Jobs crashed with out-of-memory errors on some of Netflix’s largest datasets.
No data model awareness: Casspactor moved raw Cassandra tables as is. Connectors for Key Value and other abstractions had to bolt on post processing to reconstruct their data models from the raw output — extra cost, extra complexity, and an extra surface for failures.
Intermediate table bloat: Casspactor wrote to an intermediate Iceberg table before producing the final output. The Key Value connector added another intermediate table and a snapshots table. Connectors for abstractions on top of Key Value added even more. This compounded into significant storage cost overhead.
Inability to Time Travel: by relying on multiple services to compose a backup unit, Casspactor was unable to restore prior backups in the event of cluster Topology or Keyspace schema changes.
Monolithic design: Casspactor was built as a single connector, not as an engine. There was no way to build a family of purpose built connectors on a shared foundation.

We needed something fundamentally different: an engine that reads directly from backups in S3, produces standard Spark DataFrames, and lets each data abstraction build its own connector with full awareness of its data model. One foundation, many connectors.

The New Stack: A Layered Architecture

The new architecture, built upon the foundation of Apache Cassandra Analytics and the in-house Move Data framework, represents a fundamental shift toward a layered, purpose-built stack designed for reuse and maintainability. This new engine was conceived with clear separation of concerns, moving away from Casspactor’s monolithic design. The architecture is intentionally layered with the foundation being a core S3 reading capability: the Cassandra Analytics Wrapper, which is built on top of the Open Source Cassandra Analytics with Netflix’s internal backup representation and an S3 Client.

This layer handles the raw data retrieval from backups, translating it into standard Spark DataFrames. Sitting atop this foundation is a “Connector Factory” model, exposed through both Java UDFs and transforms, which allows individual data abstractions (Key Value, Time Series, others) to build highly optimized, data model aware connectors that process the generic Spark DataFrames, avoiding the need for complex, expensive, and failure-prone post-processing steps. This layered approach ensures that improvements to the core reading engine benefit all connectors, while the connectors themselves are focused solely on data transformation.

Image 3: The new Connector layered stack

Handles Skewed Partitions: By moving the mutation compaction and processing to the Executor level within Spark, the new engine can efficiently handle tables with highly skewed or wide partitions, a major pain point for Casspactor. Crucially, this processing occurs without excessive data shuffling, preventing out-of-memory errors and enabling reliable movement of Netflix’s largest datasets.
Operates at Spark DataFrames (No Intermediary Tables): The new architecture directly generates standard Spark DataFrames from the Cassandra backups. This eliminates the need for Casspactor’s costly, multi-stage intermediate Iceberg tables, which led to storage bloat and operational complexity. This native DataFrame operation enables the “Connector Factory” by providing a universal, easily consumable interface for building diverse, model specific connectors.
Jobs Auto Size: The engine integrates intelligent auto-sizing capabilities, allowing jobs to dynamically adjust resource consumption based on the source table’s characteristics. This removes the burden of manual tuning from engineering teams, ensuring optimal performance and cost efficiency without sacrificing reliability.
Reduced Dependencies: By reading metadata directly from the backup files stored in S3, the new stack removes the fragile, multi-service dependency chain that plagued Casspactor. S3 becomes the single, authoritative source of truth for backup existence and completeness, vastly improving data movement reliability and consistency.
Time Travel: A critical feature of the new stack is the ability to process the schema, cluster topology, and data as a cohesive unit at a specific point in time. This capability provides robust time travel functionality, essential for auditing, debugging, disaster recovery and reproducing past data states.
Performance: Collectively, these architectural improvements, including native DataFrame processing, optimized partition handling, and streamlined metadata retrieval, have resulted in notable performance gains, reducing overall data movement execution runtime and cost compared to the legacy Casspactor system.
Cost: By eliminating intermediary Iceberg tables and compacting SSTables efficiently on Executors, the new stack needs a significantly smaller storage and compute footprint, leading to cost savings on the order of millions of USD.

The Journey Towards a Safe Migration

The successful validation of the new stack was the critical first step, but it only marked the beginning of the most challenging phase: the migration. Large scale data migrations are inherently complex, high-risk undertakings that can be time consuming and often result in customer frustration and service disruption. To navigate the high stakes of decommissioning a mission-critical system like Casspactor and seamlessly replacing it, we needed a strategy that prioritized reliability and transparency above all else.

The migration was fundamentally enabled by a Like-for-Like strategy, which served as the cornerstone of our Platform Engineering philosophy of abstracting complexity. The core tenet was to maintain absolute consistency across the user-facing interface, the output contract, and the final data artifact. This meant ensuring that the data movement parameters defined via the Data Bridge abstraction remained unchanged, and, critically, that the schema, metadata, and data within the destination Iceberg tables were identical to the legacy output. By preserving these external contracts, we eliminated the need for complex, time-consuming coordination with dozens of internal teams who relied on these data pipelines. This approach transformed the migration from a distributed, high-risk, multi-team effort into an internal platform implementation detail, allowing us to achieve a transparent, zero-impact transition and accelerate the retirement of the legacy system without requiring any code changes or validation from downstream users.

To navigate this migration, we developed a strategy anchored by three core pillars that serve as a blueprint for successful, large-scale data migrations:

Validation: Establishing and maintaining absolute confidence in data consistency through rigorous, ongoing validation.
Visibility: Instrumenting every part of the system to provide a clear, real-time understanding of migration progress and system health.
Safety: Ensuring user impact is minimized or eliminated, despite the inevitable system failures, by leveraging abstractions and robust fallbacks.

The following sections explore each of these pillars in detail.

Pillar 1: Validation

Trust is earned, and in data migration, it is earned one row at a time. The first pillar is the most critical: providing a measurable guarantee to users and partners that the data produced by the new system is an exact, row-by-row replica of the data produced by the old one.

Our foundational tactic was deploying the new Move Data connector in a “shadow” testing mode that ran in parallel with the production Casspactor jobs. This allowed us to validate the new system with real-world, production workloads without any customer impact.

Image 4: Shadow job structure leveraged for data validation

Let C be the set of rows in the legacy Casspactor output (Iceberg table).
Let M be the set of rows in the new Move Data output (Iceberg table).

The test for trust: prove that C = M. This required continuously checking for two conditions:

Rows in C but not in M (C-M): The new system missed data.
Rows in M but not in C (M-C): The new system introduced phantom or erroneous data.

Any result where the cardinality of these difference sets (the number of differing rows) was greater than zero triggered an immediate, high-priority investigation. The target was 100% similarity.
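The two checks above can be sketched as a pair of set differences. This is a minimal in-memory illustration, not the audit framework itself: the function name compare_outputs and the sample rows are hypothetical, rows are assumed to be hashable tuples, and at production scale such a comparison would run as a distributed job rather than in Python sets.

```python
def compare_outputs(legacy_rows, new_rows):
    """Compare two outputs as sets of rows (hypothetical helper).

    legacy_rows plays the role of C (Casspactor output) and
    new_rows plays the role of M (Move Data output).
    """
    c = set(legacy_rows)
    m = set(new_rows)
    missing = c - m   # C - M: rows the new system missed
    phantom = m - c   # M - C: rows the new system introduced
    return {
        "missing": missing,
        "phantom": phantom,
        # C = M holds exactly when both difference sets are empty.
        "match": not missing and not phantom,
    }


# Illustrative rows only; real rows would come from the two Iceberg tables.
legacy = [("user-1", "plan", "premium"), ("user-2", "plan", "basic")]
shadow = [("user-1", "plan", "premium"), ("user-3", "plan", "basic")]

report = compare_outputs(legacy, shadow)
needs_investigation = len(report["missing"]) + len(report["phantom"]) > 0
```

Treating the outputs as sets mirrors the definition of C and M above. A pipeline that must also detect duplicated rows would need a multiset comparison instead.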

Uncovering and Resolving Disparities

The shadow mode quickly became a powerful forensic tool, exposing “unknown unknowns”: subtle discrepancies that were not bugs in the new system but rather differences in behavior between the new and old systems. Resolving these was the core work of building trust. For each problem, we opened an entry in an investigation log where we captured the details, logs, and queries that allowed us to diagnose it. Based on that assessment, we categorized the issues so that a fix for one category could later resolve similar differences in other datasets, often across many shadow pipelines at once.

Maintaining an investigation log was critical for organizing the outstanding issues and for communicating progress on the new connector to stakeholders, so that we could gauge when our confidence was high enough to initiate the migration.

We observed differences in how connectors leverage reference timestamps for Time-to-Live, Consistency Levels, backup selection, and various internal business logic. This continuous, data-driven cycle of discovery and resolution was the mechanism by which we built confidence in the new architecture.

Pillar 2: Visibility

Trust is built in the background, but an active migration requires real-time insight: Visibility. The second pillar involves instrumenting the system to provide an unambiguous, clear understanding of operational health and migration progress.

We extended our instrumentation to the overall migration workflow and its dependencies:

Dashboards: We created centralized dashboards to track migration status, visualizing the total number of data movements migrated versus those remaining. The dashboards tracked execution status, average runtime, and cost comparisons between the two connectors.
Dependency Tracking: Since the new system relied on a new set of APIs to fetch backup metadata, we implemented detailed failure metrics to keep track of which APIs or dependencies failed.
Alerting: Proactive alerts were set up for job failures (Move Data or Casspactor), for failures on Move Data that triggered a fallback to Casspactor, and for any detected data discrepancy.

This comprehensive instrumentation allowed the team to be proactive, fix issues as they emerged during the migration, and gain the necessary confidence to accelerate the migration timeline.

Pillar 3: Safety

Even with perfect data correctness and enhanced visibility, the third pillar, Safety, is required for a zero-impact migration. The challenge is ensuring that when a system inevitably fails, the user experience is uninterrupted. Our strategy centered on decoupling the user’s workflow from the underlying connector implementation.

Leveraging Abstraction: The Decider Pattern

To achieve a transparent swap, we leveraged the Maestro workflow orchestration platform to implement the Decider pattern:

Data Movement Abstraction: From a user’s perspective, their Data Movement job definition remained the same.
The Decider Step: Internally, the workflow responsible for executing the job was modified to include a Decider step. This step took the data movement parameters (source cluster, table name, destination) and invoked a control plane: the Connector Controller.
Connector Controller as the Registry: The control plane served as the dynamic registry. Based on the migration cohort and the data movement attributes, it determined and reported the appropriate connector to use: either Casspactor (legacy) or Move Data (new).

This abstraction gave our team complete control. We could upgrade or roll back any connector for any data movement instantly by simply updating a configuration in the controller, with zero modification required to the thousands of downstream customer workflows. Crucially, this abstraction provided the critical safety net: a conditional step in the Maestro workflow logic ensured that if the Move Data step failed, the workflow would immediately execute the Casspactor step.

This pattern increased the chances that the user’s data movement completed successfully, even if the new connector encountered a bug or transient failure during the initial rollout phases. User impact was completely eliminated; users might see a slightly longer runtime in the event of a failure and fallback, but they would never see a migration failure or suffer from stale data.
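The control flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Maestro implementation: the ConnectorController class, its set_connector and decide methods, the run_movement function, and the movement identifier are all hypothetical names, and the real Decider works from the source cluster, table name, destination, and migration cohort rather than a single string key.

```python
LEGACY = "casspactor"
NEW = "move_data"


class ConnectorController:
    """Hypothetical registry: maps a data movement to the connector to use."""

    def __init__(self, default=LEGACY):
        self._default = default
        self._overrides = {}

    def set_connector(self, movement_id, connector):
        # Upgrading or rolling back is a configuration change only.
        self._overrides[movement_id] = connector

    def decide(self, movement_id):
        return self._overrides.get(movement_id, self._default)


def run_movement(movement_id, controller, connectors):
    """Decider step followed by a conditional fallback to the legacy connector."""
    chosen = controller.decide(movement_id)
    if chosen == NEW:
        try:
            return NEW, connectors[NEW](movement_id)
        except Exception:
            # The new connector failed: fall back so the user's job still completes.
            return LEGACY, connectors[LEGACY](movement_id)
    return LEGACY, connectors[LEGACY](movement_id)


# Stand-in connectors for the example; the new one simulates a failure.
def failing_move_data(movement_id):
    raise RuntimeError("simulated Move Data failure")


def casspactor(movement_id):
    return f"{movement_id}: written by legacy connector"


controller = ConnectorController()
controller.set_connector("billing.invoices", NEW)
connectors = {NEW: failing_move_data, LEGACY: casspactor}

used, result = run_movement("billing.invoices", controller, connectors)
```

In this sketch the caller of run_movement never changes. Only the controller's configuration decides which connector runs, which is the property that lets a rollout or rollback happen without touching user workflows.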

Image 5: The Decider Pattern Implementation via Maestro

Beyond the workflow, the new system architecture itself was inherently more resilient. By building the new data movement connector on Cassandra Analytics and reading backups directly from S3, we removed fragile dependencies on deprecated internal services.

Conclusion

The migration from Casspactor to the new, layered architecture built on Cassandra Analytics and the Move Data connector was more than a typical “tech debt” project; it was a fundamental shift in our approach to data movement reliability and scalability at Netflix.

The legacy system, while serving us well for years, was ultimately constrained by monolithic design, fragile metadata dependencies, and an inability to handle the complexity of modern data abstractions. The new stack resolves these issues by delivering a robust, cost-efficient, and inherently more resilient solution that reads directly from S3, handles wide partitions gracefully, and eliminates costly intermediate tables.

Our blueprint for the migration, anchored by the three pillars of Validation, Visibility, and Safety, ensured a transparent and high-confidence transition. Through rigorous shadow testing and a data-driven audit framework, we achieved the desired data consistency. Enhanced dashboards and alerting provided the real-time operational insight necessary to manage risk. Most critically, the implementation of the Decider pattern within our workflow abstraction minimized the impact for all downstream users.

This successful migration validates a core philosophy: by abstracting complexity at the platform level, we can perform large system migrations without burdening our product engineering partners. The new foundation is now ready to support the next generation of Netflix’s data abstractions.

Looking ahead

This foundational work on the Cassandra Data Movement stack has done more than just replace a legacy system: it has become an accelerator for innovation across the entire Data Movement organization. By providing a reliable, performant engine that standardizes data retrieval into Spark DataFrames, we’ve enabled the rapid development of new, highly optimized connectors. This new “Connector Factory” approach has already delivered dedicated Key-Value to Iceberg and Time Series connectors, both of which are fully aware of their respective data models, eliminating costly post-processing. This architecture is also paving the way for ambitious new initiatives, including a solution for bulk loading data into Cassandra itself, effectively completing the data movement cycle, and safer fleetwide connector rollouts with canaries inspired by the Decider Pattern.

We are incredibly grateful for the extensive collaboration among the Data Movement, Data Bridge, Online Data Stores, Membership, Billing, Subscriber and Ads platform teams at Netflix; this work simply couldn’t have been accomplished without their partnership!

Data Projects: Managing Data Assets at Netflix Scale

Netflix Technology Blog — Mon, 11 May 2026 23:35:11 GMT

By Amer Hesson, Marcelo Mayworm, James Mulcahy, and Brittany Truong

The Problem: Managing Assets at Netflix Scale

Netflix’s Data Platform is vast. We have millions of tables in our data warehouse and tens of thousands of scheduled workloads running across our orchestration systems. Behind each of these assets sits an engineer, a team, or an initiative — and behind each of those sits a set of decisions about who can access what, and how those workloads execute day after day.

For years, the tools we used to manage access and identity for these assets operated at the granularity of the individual asset. Every table had its own Access Control List (ACL). Every workflow ran under the identity of the engineer who authored it. In a workforce that is fluid, where people change teams, change roles, and occasionally leave the company, this fine-grained model broke down in two persistent, painful ways.

Problem 1: Permissions that can’t keep up with organizational changes

Imagine you’re on a team that owns a few hundred tables. Your org restructures, a neighboring team merges into yours, and you inherit another few hundred. Now you have to find every ACL on every table, figure out who should still have access, and update them one by one. Multiply that by every reorg across every team across the company. The result? Two failure modes:

The support team gets flooded. A significant and outsized share of support threads were requests to update table permissions en masse in response to org changes. While self-service tooling and best practices are in place to manage this, adherence is inconsistent. Data Projects addresses this by promoting the solution from optional tooling to a foundational part of the data platform.
Access gets granted far too broadly. Rather than maintain fine-grained ACLs, teams would often open up table access to the whole company. This defeated the purpose of having ACLs in the first place.

Problem 2: Workloads tied to human identities

Scheduled and asynchronous workloads — Maestro workflows, data movement jobs, Spark pipelines — need an identity to run as. Historically, that was a human: whoever authored the workflow.

Human identities are not durable. People change teams, get new responsibilities, and leave the company. When they do, their permissions change, and the workflows running under their identity start to fail. The only fix was to swap in a colleague’s identity, which inevitably had different permissions, kicking off a “permissions whack-a-mole” as each fix surfaced the next missing grant. And then, eventually, that colleague would also move on, and the cycle would repeat.

Enter Data Projects

We introduced Data Projects to tackle both problems head-on. At its core, a Data Project is two things:

A container to manage and view a set of related assets in aggregate: tables, workflows, and other data assets grouped under a single logical umbrella.
A synthetic, durable, and assumable identity: one that asynchronous and scheduled workloads can execute under, independent of any human’s lifecycle.

You can think of it as hoisting the granularity of management up from the individual asset to a meaningful container: the project. Instead of managing permissions on 500 tables, you manage them on one project that contains those 500 tables.

While the initial focus has been access and identity, the abstraction has applications well beyond those concerns. That broader potential is part of what makes it worth investing in.

Figure 1a. Individual assets, each managed in isolation, with per-asset access controls and per-person ownership.

Figure 1b. These assets are logically grouped into projects for easier management.

Grants and Roles

Each Data Project has a set of grants managed by the owning team. Different identity types can be added as grants: users, groups, applications, and continuous integration (CI) jobs. Each grant has a role that determines what the grantee can do within the project. For example, a Contributor has read/write access to the project’s assets, while a Viewer has read-only access. These roles roll up neatly — instead of rewriting hundreds of ACLs when someone joins or leaves a team, you update a single project grant.
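A small sketch can make the roll-up concrete. Everything here is an assumption for illustration: the DataProject class, the Role enum, the action names, and the identity strings are hypothetical and are not the Data Projects API. Only the Viewer and Contributor semantics come from the description above.

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    CONTRIBUTOR = "contributor"


# Assumed mapping from role to allowed actions, per the description above.
ROLE_ACTIONS = {
    Role.VIEWER: {"read"},
    Role.CONTRIBUTOR: {"read", "write"},
}


class DataProject:
    """Hypothetical container: one grant list covers every contained asset."""

    def __init__(self, name):
        self.name = name
        self.assets = set()
        self.grants = {}  # identity -> Role

    def grant(self, identity, role):
        self.grants[identity] = role

    def revoke(self, identity):
        self.grants.pop(identity, None)

    def can(self, identity, action, asset):
        role = self.grants.get(identity)
        return (
            asset in self.assets
            and role is not None
            and action in ROLE_ACTIONS[role]
        )


project = DataProject("member-analytics")
project.assets.update(f"table_{i}" for i in range(500))

# One grant per identity, instead of 500 per-table ACL edits.
project.grant("group:analytics-eng", Role.CONTRIBUTOR)
project.grant("user:new-hire", Role.VIEWER)

can_read = project.can("user:new-hire", "read", "table_42")
can_write = project.can("user:new-hire", "write", "table_42")

# When someone leaves the team, a single revoke removes access everywhere.
project.revoke("user:new-hire")
can_read_after = project.can("user:new-hire", "read", "table_42")
```

The design point is that the access check consults the project's grant list rather than a per-asset ACL, so the cost of an organizational change scales with the number of grants, not the number of assets.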

The Identity Umbrella: Netflix and IAM

Every Data Project is provisioned with a Netflix application identity, and optionally an AWS IAM role. This is the “identity umbrella” that makes workloads durable:

The project’s Netflix identity is what executes the project’s async workloads (e.g. Maestro workflows). It belongs to the project, not to any person.
The project’s IAM role supports specialized use cases in AWS like Spark jobs on Amazon EMR. Crucially, the IAM role can be exchanged for the project’s Netflix identity in a cryptographically secure way.

Members with privileged roles can also assume the project’s Netflix identity. This is enormously useful for testing and troubleshooting from a development context like a laptop or a notebook — you get to run commands as the project, exactly as the scheduled workload would.

Gravity

One of the more elegant properties of Data Projects is what we call gravity. When a workload running under a project’s identity creates a new asset — say a Maestro workflow creates three tables — those assets are automatically added to the project as contained assets. The project becomes the center of mass for everything produced under its identity. You get organization for free as a side effect of how the platform already works, eliminating future challenges of discovering relevant assets and gaining access to them.
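Gravity can be pictured as a side effect of asset creation. The sketch below is a toy model, assuming a hypothetical Catalog class whose create_table method records the creating identity. The table names and the project identity string are made up for the example.

```python
class Catalog:
    """Hypothetical asset catalog that tracks which project contains each asset."""

    def __init__(self):
        self.project_assets = {}  # project identity -> set of asset names

    def create_table(self, table_name, running_as):
        # Gravity: the new asset joins the project whose identity created it.
        self.project_assets.setdefault(running_as, set()).add(table_name)
        return table_name


def nightly_workflow(catalog, project_identity):
    """Stand-in for a scheduled workflow that creates three tables."""
    for name in ("daily_agg", "weekly_agg", "monthly_agg"):
        catalog.create_table(name, running_as=project_identity)


catalog = Catalog()
nightly_workflow(catalog, "project:member-analytics")

contained = catalog.project_assets["project:member-analytics"]
```

No explicit registration step appears in the workflow itself. Membership follows from the identity the workflow runs under.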

Securing Data Workflows with Data Projects

Maestro is Netflix’s primary workflow orchestrator for batch analytics, covering scheduled ETL pipelines, data movement jobs, ML training, and much more. Because workflows can run on schedules without the original user present, Maestro is designated a Trusted Workload Manager (TWM), formally authorized to mint fresh identity tokens on behalf of the workloads it manages.

That identity matters everywhere. A single workflow execution may be checked against table ACLs in the Secure Data Warehouse, authorization policies for Netflix resources, and IAM policies for AWS — all in a single run. If the identity is fragile, the whole workflow is fragile.

The Problem with User-Tied Identity

The standard pattern was to run workflows under an On-Behalf-Of (OBO) credential — for example, maestro OBO alice@netflix.com. This gave the workflow the union of Maestro’s and the human’s permissions, but in doing so it also bound the workflow’s permissions to that person’s. When they changed teams or left Netflix, the workflow broke. A colleague might take over ownership, but they rarely had the same access as the previous owner, so the workflow would stay broken for days while permissions were sorted out. At Netflix’s scale, with tens of thousands of scheduled workloads, many of them business-critical, this was unsustainable.
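The fragility is easy to see when the effective permission set is written as a union. The sketch below is illustrative only: the permission strings, the identities, and the obo_permissions helper are hypothetical, and real authorization checks span several systems rather than a single set.

```python
def obo_permissions(manager_perms, user_perms):
    """Effective permissions of an On-Behalf-Of credential: a plain union."""
    return manager_perms | user_perms


# What the workflow actually needs in order to run (made-up permission names).
required = {"read:playback_events", "write:qoe_daily", "read:qoe_secret"}

maestro = {"read:qoe_secret"}
alice = {"read:playback_events", "write:qoe_daily"}   # original author
bob = {"read:playback_events"}                        # colleague who takes over

# While Alice owns the workflow, nothing is missing.
missing_with_alice = required - obo_permissions(maestro, alice)

# After the ownership swap, the workflow fails until each gap is granted.
missing_with_bob = required - obo_permissions(maestro, bob)
```

Each missing entry only surfaces when the workflow reaches the step that needs it, which is why the fix tended to proceed one failed run at a time.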

Data Projects: Durable Identity

Data Projects solves this by replacing user-tied identity with a durable, team-owned Netflix application identity: one that doesn’t change teams, go on vacation, or leave the company. Each project groups related workflows, tables, secrets, and other assets under a single consistent identity, and Maestro validates the caller’s access to the project before executing any workflow under it.

The downstream improvements are as follows:

Tables created during execution are automatically associated with the project’s identity through gravity, inheriting its access controls without additional configuration.
Secrets are scoped to project policies, so ownership transfers no longer strand credentials.
Access is managed once at the project level, replacing fragmented per-user grants across every asset the workflow touches.

The result is a workflow identity model that is stable, auditable, and built to survive the organizational changes inevitable at any company operating at this scale.

Success Stories

Many Data Projects have already grown to contain tens of thousands of assets in production. A couple of examples are highlighted below:

Streaming Quality of Experience: A core observability pipeline tracking quality of experience (QoE) metrics whose continuity used to depend on whichever engineer happened to own the underlying workflows. Now it runs under the project’s identity, stable regardless of team membership changes.
Member Analytics: Analytical models and ETL workflows for member data products. A concentrated set of business-critical analytics whose access is managed at the project level rather than across hundreds of individual tables and workflows.

More broadly, we’ve seen Data Projects adopted as the organizing principle for entire analytics domains. Where teams previously maintained their own access policies, ad-hoc grant lists, and tribal knowledge about “who should have access to what,” the project is now the single answer.

Using Data Projects

Onboarding workflows onto Data Projects is a matter of:

Creating a project for the logical grouping of assets (or using an existing suitable one).
Granting the right people and groups the appropriate roles.
Configuring the workflow to run with the project’s identity.

Thanks to gravity, new assets produced by project workflows land in the project automatically. Migrating existing workflows can be a challenge, as it requires setting up the Data Project with the appropriate permissions before changing the workflow’s execution identity. We are actively working on infrastructure to track the access patterns of existing workflows so that we can recommend precise permission updates for the destination project. Our goal is to make the Data Project the de facto option for executing any kind of asynchronous workload.

What’s Next

Data Projects started as an Analytics Platform initiative, a response to specific pains in the data warehouse, but the underlying ideas are not unique to data. We see a potential future where Projects (not just Data Projects) are a first-class platform concept spanning data assets, software assets (GitHub repositories, Spinnaker applications, Docker images), and even studio assets (production content, pipelines, and transformations).

We’re also investing in:

Rightsizing: we’re integrating a layer on top of our authorization policies that automatically rightsizes permissions based on actual usage patterns, proactively eliminating unnecessary access and preventing “permission creep”.
Hoisting beyond access and identity: the project is a natural unit for surfacing other concerns at the aggregate level — cost attribution, health indicators, and more.
Ad-hoc use case integrations: extending project identities beyond scheduled workloads to cover interactive, on-demand actions like running a query through the Data Portal.
Activity logs and audits: a unified timeline of grant changes, asset changes, and workflow versions at the project level.

Conclusion

Data Projects is an answer to a simple observation: at Netflix’s scale, the unit of identity and access management can’t be the individual asset or the individual human. It has to be something larger, something durable, something that matches the way teams actually think about the work they own.

A project is that unit. And as we continue to generalize the concept beyond the data warehouse, we expect it to become one of the foundational primitives of how engineering at Netflix is organized, not just how data is organized.

Acknowledgments

We would like to express our gratitude to the following individuals for their contributions to this effort: Ryan Bordo, Doug Clark, Luke Fernandez, Sarrah Figueroa, Ankit Gupta, Brian Hoying, Ye Ji, Abhishek Kapatkar, Anmol Khurana, Matheus Leão, Hechao Li, Raymond Liu, Alice Naghshineh, David Noor, Anjali Norwood, Javier Garcia Palacios, Kunaal Parekh, Brandon Quan, Andrew Seier, Jason Seo, and Ethan Zhang.

If you are interested in helping us solve these types of problems and helping entertain the world, please take a look at some of our open positions on the Netflix jobs page.

Scaling ArchUnit with Nebula ArchRules

Netflix Technology Blog — Fri, 08 May 2026 15:01:00 GMT

By John Burns and Emily Yuan

Introduction

At Netflix, we operate using a polyrepo strategy with tens of thousands of Java repositories. This means that we need to have ways of sharing common build logic across these repositories. On the JVM Ecosystem team within Java Platform, we build tooling such as the Nebula suite of Gradle plugins to provide standard ways to build projects, keep dependencies up-to-date, and publish artifacts reliably across the Java ecosystem. Our mission also entails providing build-time feedback to the developer when they deviate from the paved road, or when their code base contains technical debt.

Case Study

After a Netflix incident relating to a library releasing a backwards-incompatible change, our team was asked to provide some tooling and practices to improve the Java library lifecycle management. This was not a simple case of a library making a reckless breaking change. The code removed had been deprecated for years. Library authors often struggle to know when it is safe to remove deprecated code, or refactor code that is not meant to be used by downstream applications. Fleet-wide migrations, such as upgrading major Spring Boot versions, also involve deprecated code removal. To help with this, we established a suite of API lifecycle annotations:

@Deprecated: from the Java standard library
@Public: a custom annotation to use on APIs meant to be used downstream
@Experimental: a custom annotation for new APIs which may not yet be stable
All other APIs are assumed to be “internal”

Library authors can annotate their APIs with these annotations. But how will they know which downstream projects are using their APIs in ways that conflict with these annotations?

As we sought to improve the paved road for JVM-based libraries at Netflix, we needed a good way of identifying this kind of technical debt, not only for the benefit of the Java Platform-provided libraries, but any team delivering shared libraries to the organization. For this, we looked at ArchUnit.

ArchUnit is a popular OSS library (3.5k stars, 84 contributors) used to enforce “architectural” code rules as part of a JUnit suite. It is used internally by Gradle and Spring, and is provided as part of the Spring Modulith platform. The rules engine, which is built directly on top of ASM, can be used for a wide variety of use cases. It is powerful enough to be a general-purpose static analysis tool with the following distinctive features:

1. Works cross-language (JVM), because it uses ASM/bytecode, not AST parsing.

2. Exposes a builder API pattern that makes it easy to write rules.

3. Also has a lower level API ideal for writing more complex custom rules.

The limitation of ArchUnit is that it is designed to be used as part of a JUnit suite in a single repository. The Nebula ArchRules plugins give organizations the ability to share and apply rules across any number of repositories. Rules can be sourced from OSS libraries or private internal libraries. This makes the plugin generally useful for any JVM+Gradle engineering organization.

Why ArchUnit?

Before we go into how ArchRules works, it is good to understand why we would want to use ArchUnit in this way instead of other static analysis tools.

AST vs Bytecode

Some tools, such as PMD, process rules against an AST (abstract syntax tree). An AST is a structured representation of source code. This kind of tool will have rules that are syntax dependent. Rules that need to support multiple JVM languages, such as Kotlin or Scala, often need to be rewritten for each language. Code that a rule should flag can also slip through when it is written with syntactic sugar the rule author did not anticipate. ArchUnit uses ASM to analyze actual compiled bytecode, which means it doesn’t matter how that code was produced. What is analyzed is the actual code that will be run.

Rule Authorship

Tools like PMD and SpotBugs are not optimized for custom rule authorship. Most users of these tools run the built-in rules or add pre-made third-party plugins. Take a look at what a custom rule for PMD might look like:

 //AllocationExpression/ClassOrInterfaceType[
   @Image='DateTime' and (
       (count(..//Name[@Image='DateTimeZone.UTC'])<=0)
       and
       (count(..//Name[@Image='DateTimeZone.forID'])<=0)
    ) or (
       (
           (count(..//Name[@Image='DateTimeZone.UTC'])>0)
             or
           (count(..//Name[@Image='DateTimeZone.forID'])>0)
       ) and (../Arguments/ArgumentList and count(../Arguments/ArgumentList/Expression) = 1)
   )
 ]

This rule ensures that DateTimes are not instantiated without an explicit zone. This is a raw string meant to be used within PMD’s XPath parser. There is no IDE guidance on crafting it. To test it, a whole separate PMD process needs to be wired up to interpret the rule and evaluate it against a source file. Let’s see how a similar rule would look with ArchUnit:

ArchRuleDefinition.priority(Priority.MEDIUM)
    .noClasses()
    .should()
    .callConstructorWhere(
        // constructor does not have a zone argument
        target(doesNot(have(rawParameterTypes(DateTimeZone.class))))
            // constructor is for DateTime
            .and(targetOwner(assignableTo(DateTime.class)))
    )

This is type-safe Java code with a fluent API. It is also simple to unit test, as ArchUnit has a method to pass a rule object and class references to evaluate the rule against those classes.

Class Relations

Because ArchUnit processes the entire classpath with ASM, it retains a graph of the class data, allowing rules to easily traverse class relationships and call sites. This gives rules much more context about the code they are evaluating.

Rules Libraries

The first step was to build the ability to write ArchUnit rules which can be shared and published. In order to do this, we have the ArchRules Library Plugin. This plugin adds an additional source set to your Gradle project called archRules. In this source set, you can create a class which implements the ArchRulesService interface. This interface has a single abstract method which returns a Map<String, ArchRule>. The keys of this map are the names of your rules, and each ArchRule value is the rule you would like to define using the standard ArchUnit API. Here is an example:

public class GuavaRules implements ArchRulesService {
    // Flags any class that depends on Guava's Optional.
    static final ArchRule OPTIONAL = ArchRuleDefinition.priority(Priority.MEDIUM)
        .noClasses()
        .should()
        .dependOnClassesThat()
        .haveFullyQualifiedName("com.google.common.base.Optional")
        .because("Java Optional is preferred over Guava Optional");

    @Override
    public Map<String, ArchRule> getRules() {
        // Keys are rule names; values are the rules themselves.
        Map<String, ArchRule> rules = new HashMap<>();
        rules.put("guava optional", OPTIONAL);
        return rules;
    }
}

This code and its dependencies will not be bundled with your main code. It is bundled into a separate JAR with the arch-rules classifier. When publishing, your library will publish this JAR as a separate variant with the usage attribute set to arch-rules. This means that in order for downstream projects to use these rules, they must use Gradle Module Metadata for dependency resolution. There are two flavors of rule libraries: standalone rule libraries and bundled rule libraries.

Standalone Rule Libraries

A standalone rule library contains no main code: only archRules. These are useful for defining rules for code you don’t own, such as core Java APIs or OSS libraries. They are also useful for generic rules that can apply to any code, such as “don’t use code marked as @Deprecated”. We maintain a collection of OSS standalone rule libraries which anyone is free to use and which serve as examples of the types of rules you may want to write yourself. However, the real power of ArchRules is in “bundled rule libraries”.

Bundled Rule Libraries

A bundled rule library is a library with both main and archRules sources. The main source set will contain useful library code, whatever it may be. The archRules source set will contain rules specific to the usage of that library, for example rules scoped to that library’s package or referencing that library’s specific API. Whenever possible, we recommend writing rules in this bundled way, because the ArchRules Runner Plugin will be able to automatically detect these rules and run them in only the source sets that use this library as a dependency. An example of this can be seen in our Nebula Test library.

In any case, the library plugin will automatically generate a service loader registration entry for your ArchRulesService so that the runner can discover your rules.

Running Rules

The ArchRules Runner Plugin allows rules to be evaluated against your code. Standalone rule libraries can be evaluated against all source sets by adding them to the archRules configuration in your build. For example:

dependencies {
    archRules("your:rules:1.0.0")
}

As mentioned before, bundled rules will be evaluated automatically. To do this, the runner plugin creates a separate configuration for each of your source sets. In each of these configurations, the archRules classpath is combined with the runtimeClasspath with the arch-rules variant selected. This configuration is the classpath used when the ServiceLoader discovers implementations of ArchRulesService. In the following example, we have a project which uses a test helper library as a testImplementation dependency, and also adds a standalone rules library to the archRules configuration. The test runtime classpath will only contain the implementation JAR for the helper library, but the arch rules runtime classpath will contain the arch-rules JARs for both the bundled rules and the standalone rules. This all happens automatically.

Gradle configurations used by ArchRules

Once the rules classpath is determined, the runner plugin will create a Gradle work action to evaluate rules against that specific source set. This action runs with classpath isolation using the *archRuleRuntime configuration. Within this action, a ServiceLoader is used to discover rule definitions. The action ends by writing a binary serialization of rule violations to a file for reporting.

In a project running rules, you also have the ability to customize rule configurations using the archRules extension. For example, you can override a rule’s priority level:

archRules {
    ruleClass("com.netflix.nebula.archrules.deprecation") {
        priority("HIGH")
    }
}

Other customizations include disabling rules on certain source sets and configuring the failure threshold (e.g., so that high-priority failures cause the build to fail).

Reporting

The ArchRules runner plugin has two built-in reports: JSON and console. The JSON report collects the output from all source sets within a project and creates a single JSON file with all of the data. The console report also collects the output from all source sets within a project, but prints an easy-to-read report to the console, for example:

Console Report output

Note that failure details feature a detailed plain English description, along with a pointer to the exact line of code in violation.

For custom reporting, you can either use the JSON file, or create your own task that reads the binary files. Take a look at the source code for the ArchRules runner plugin’s report tasks for an example of how to do this.

Case Study Solution

Going back to our original problem, using ArchRules, we were able to deliver a platform for library authors to track the usage of their APIs. They write ArchRules to detect usage of the annotations, scoped to their library’s package, such as:

ArchRuleDefinition.priority(Priority.MEDIUM)
    .noClasses().that(resideOutsideOfPackage(packageName + ".."))
    .should()
    .dependOnClassesThat(resideInAPackage(packageName + "..").and(are(deprecated())))
    .orShould().accessTargetWhere(targetOwner(resideInAPackage(packageName + ".."))
        .and(target(is(deprecated())).or(targetOwner(is(deprecated())))))
    .allowEmptyShould(true)
    .because("Deprecated APIs are subject to removal");

NB: the deprecated() predicate comes from nebula-archrules.

Our internal Nebula standard Gradle wrapper and plugin suite automatically enable the ArchRules runner on every project, and provide a custom reporter which sends the report data to our Internal Developer Portal on every main-branch CI build. This way, library authors can easily see a report of all downstream consumers using their experimental, deprecated, or non-public APIs, giving them confidence to make “breaking” changes, knowing that those changes will not actually break downstream consumers. If their changes are currently blocked by downstream usage, they can easily see exactly which projects are reporting those usages.

OSS Rule Libraries

While the most powerful way to use ArchRules is for you to write your own rules, we have built some OSS rule libraries that anyone is free to use, or reference as examples.

Nullability

These rules enforce proper nullability annotations in Java, for example, that every public class is marked with JSpecify’s @NullMarked. They are smart enough to exclude Kotlin code, as Kotlin has built-in nullability.

Gradle Plugin Best Practices

Writing Gradle plugins can be hard, especially since there are many APIs and patterns that should not be used anymore. These rules help enforce current best practices when writing Gradle plugins.

Joda / Guava Rules

These rule libraries discourage the use of Joda Time and Guava classes (respectively) as these have been superseded by java.time and standard library enhancements.

Security Rules

These rules help mitigate CVEs by detecting usage of known vulnerable APIs. Ideally, we keep dependencies up to date to mitigate CVEs. But sometimes that is not immediately feasible, and in those cases, a compile time check to ensure the specific vulnerable API is not used is often good enough.

Conclusion

We are now running 358 (and counting) rules across over 5,000 repositories, detecting nearly 1 million issues. About 1,000 of these issues are for “High” priority rules. Being able to run these rules at this scale allows us to quickly gain insight into our large fleet of microservices, and identify the areas carrying the most critical technical debt. This makes it easier to focus and prioritize our efforts.

Going forward, we will be exploring how to tie auto-remediation solutions into the ArchRules findings. ArchUnit currently provides very specific and detailed information about failures in reports, which makes for a very strong input signal to an auto-remediation tool. We will explore deterministic solutions such as OpenRewrite and non-deterministic solutions such as LLMs. Pairing the easy rule authorship and deterministic results of ArchUnit with an auto-remediation tool that can correctly interpret the results to solve the issue at hand will be a very powerful combination.

We also will investigate how to get ArchRule failure information surfaced in the IDE as inspections.

If you have questions or feedback about Nebula ArchRules, reach out to us by posting in the #nebula channel on the Gradle Community Slack.

Scaling ArchUnit with Nebula ArchRules was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Netflix Technology Blog — Mon, 04 May 2026 16:01:02 GMT

Saish Sali, Nipun Kumar, Sura Elamurugu

Introduction

As Netflix has grown, machine learning continues to support our ability to deliver value to members and drive excellence across multiple areas of our business. When Netflix began investing in machine learning over a decade ago, it was primarily focused on a single domain: personalization. Scala was the industry standard, our ML teams were relatively small, and optimizing member engagement was our primary use case. Fast forward to today, and machine learning has become the backbone of Netflix’s business transformation. We now apply ML across various business domains, including:

Personalization: Optimizing engagement and helping members discover content they’ll love
Studio: Pre and post-production workflows
Payments: Fraud detection, payment routing, and recurring billing optimization
Ads: Our newest domain, requiring real-time decisioning and targeting

… and a growing number of additional use cases across the company

Each domain operates with a different tech stack, different business metrics, and a distinct organizational structure. While this diversity is a testament to how machine learning has evolved to drive value across many verticals at Netflix, this growth introduces a new challenge: enabling cross-pollination of models and data across domains.

The Challenge: A Fragmented ML Landscape

As our ML investments scaled across these domains, a critical problem emerged: the models produced largely became black boxes. Without any discovery infrastructure, ML practitioners couldn’t easily collaborate or share work across business verticals.

Consider a concrete example: content embeddings. Our Studio teams create sophisticated embeddings that identify scene boundaries, detect visual transitions, and understand content structure. These embeddings were originally built for production workflows.

But those same embeddings could be incredibly valuable elsewhere. Ads could hypothetically use content embeddings for context matching (ensuring advertisements align with the tone and content of what’s currently playing). Personalization could leverage them for episodic merchandising and recommendations (matching the topic or mood of an episode with a user’s viewing preferences). Yet making this cross-pollination happen is extraordinarily difficult.

Why? Our ML tools exist in silos, each with its own backend services and user interface. The model registry is unaware of which A/B tests are using its models, and the pipeline orchestrator is unaware of downstream model dependencies. ML practitioners have to traverse multiple systems to answer basic questions about their work. Finding a model requires opening the model registry, understanding its lineage means switching to the pipeline orchestrator, and tracking which A/B tests use that model requires navigating to the experimentation platform. This fragmentation prevents practitioners from answering critical questions:

Discovery: What features exist? What data sources are available for generating features for a model?
Lineage: Which pipeline is generating data for a specific model? What data sources feed those features?
Impact: Which A/B tests are running this model? Which models will break if I change this feature? Who owns each piece of this chain?

The Hard Problem: Connecting Everything

The real challenge wasn’t just building a consolidated UI. We needed to connect the different pieces of infrastructure our ML practitioners were using to perform different parts of the ML lifecycle.

Our ML ecosystem generates metadata from dozens of sources:

Pipeline orchestration systems emit execution details, stage dependencies, and data transformations
Deployed model registry tracks model versions, artifacts, staleness, and deployment history
Experimentation platform manages A/B tests and their configurations
Feature store catalogs feature definitions and usage
AI Dataset platform tracks the creation, management, discovery, and loading of datasets
Identity platform maintains user, team, and organization metadata

Each system employs different formats, identifiers, and mental models. The hard technical problem we had to solve was: How do we collect this heterogeneous metadata, transform it into a unified entity model, and build a connected graph that enables true exploration and collaboration across business domains?

The Solution: Metadata Service and the Model Lifecycle Graph

Our answer was the Metadata Service (MDS), which builds a Model Lifecycle Graph that indexes and connects ML-related entities across Netflix. MDS is optimized for real-time ingestion of ML metadata (e.g., models, features, pipelines, experiments, datasets) and for answering cross-domain questions such as “Which experiments are running this model?” or “Which models share these features?” It is the foundation that enables discovery: it ingests events from diverse sources, enriches them with context, and materializes relationships across entities.

Our vision: to make every ML asset at Netflix discoverable, understandable, and reusable by every ML practitioner, regardless of their team or domain.

Core Abstractions: The Vocabulary of the System

Before diving into the technical implementation, it’s helpful to understand the conceptual model that underpins MDS. This vocabulary enables consistent communication across teams and systems:

Component: Any object that is uniquely addressable using an AI Platform (AIP) Uniform Resource Identifier (URI). An AIP URI consists of the aip:// scheme followed by three slash-separated segments, ensuring global uniqueness. For example:

Models: aip://model/registry/ranking-v5
Users: aip://user/identity/alice
Pipelines: aip://pipeline/orchestrator/weekly-training

Entity: A component within the ML ecosystem, characterized by additional properties such as name, description, creation date, and owners. Entities represent ML-specific assets, such as models, features, and pipelines.

Entity Type: A group of entities that share the same data shape. A data shape is a set of property constraints that specify the attributes and relationships an entity must have.

Domain: A functional grouping of related entity types that defines the abstract interface for a category of ML assets. For example, the Models domain defines what a Model and Model Instance look like, while the Pipelines domain defines Schedules, Requests, and Executions.

Provider: A concrete implementation of a domain, backed by a specific source system. For example, the Models domain is currently backed by our internal model registry. This separation allows MDS to support multiple providers for the same domain. If a new model registry were introduced, it could be added as an additional provider without changing the domain interface.

We can summarize these concepts with a concrete example:

This URI-based addressing scheme is crucial as it allows any service to reference any ML asset with a single string, and MDS can resolve that reference back to rich, connected metadata.
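
To make the addressing scheme concrete, here is a minimal Python sketch of parsing and formatting such URIs. It is an illustration rather than the MDS implementation: the names AipUri and parse_aip_uri, and the segment names kind, provider, and name, are assumptions inferred from the three example URIs above.

```python
from dataclasses import dataclass

SCHEME = "aip://"


@dataclass(frozen=True)
class AipUri:
    # Segment names are illustrative; the examples only show three segments.
    kind: str       # e.g. "model", "user", "pipeline"
    provider: str   # e.g. "registry", "identity", "orchestrator"
    name: str       # e.g. "ranking-v5"

    def __str__(self) -> str:
        return f"{SCHEME}{self.kind}/{self.provider}/{self.name}"


def parse_aip_uri(uri: str) -> AipUri:
    """Split an AIP URI into its three segments, rejecting malformed input."""
    if not uri.startswith(SCHEME):
        raise ValueError(f"not an AIP URI: {uri!r}")
    segments = uri[len(SCHEME):].split("/")
    if len(segments) != 3 or not all(segments):
        raise ValueError(f"expected three non-empty segments: {uri!r}")
    return AipUri(*segments)


model = parse_aip_uri("aip://model/registry/ranking-v5")
print(model.provider)  # registry
print(str(model))      # aip://model/registry/ranking-v5
```

Because the parsed value is immutable and hashable, it can serve directly as a dictionary key or graph node identifier.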

From Events to Entities to Graph

The journey from raw system events to a queryable graph happens in stages. Let’s walk through each with a concrete example: connecting a model to its A/B tests through relationship inference.

1 Event Ingestion

MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real time. Source systems emit thin events that include an identifier and an event type.

Example event:

{
  "event_type": "model_instance_created",
  "instance_id": "ranking-model-v5-20XX0101",
  ...
}

This design keeps producers simple. Source systems only need to announce that a change occurred, without building complete payloads or understanding downstream requirements.

Each source system has dedicated event handlers in MDS:

Pipeline Orchestration: Ingests pipeline execution events, including node definitions, schedules, requests, and job attempts
Model Registry: Captures model deployments, configurations, and version updates
Feature Store: Tracks feature definitions and their versions
Experimentation Platform: Monitors A/B test configurations and allocations
Datasets: Tracks ML datasets and their versions
Identity Platform: Maintains ownership and team membership information

2 Entity Enrichment

MDS implements a hydration contract for each event type. When an event arrives, MDS:

Validates the event schema
Calls the source system’s API to fetch the complete, current state
Transforms the response into a normalized entity

This design has a crucial property: the order of events doesn’t matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes.

This notification of change pattern has a few important tradeoffs. On the plus side, it keeps producers simple, makes us robust to out-of-order or dropped events, and ensures that MDS can always reconcile to the latest state by reading from the source of truth. The tradeoff is that we place additional read load on source systems during hydration and need to be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don’t overload them.
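
The order-independence property can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not MDS code: the in-memory REGISTRY stands in for a source system's API, and the model_instance_updated event type and the status field are hypothetical.

```python
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for the source of truth: instance id -> current state.
REGISTRY: Dict[str, dict] = {
    "ranking-model-v5-20XX0101": {
        "id": "ranking-model-v5-20XX0101",
        "status": "released",  # hypothetical field
    },
}


def fetch_current_state(instance_id: str) -> dict:
    """Pretend API call that returns the latest state from the source system."""
    return dict(REGISTRY[instance_id])


def hydrate(events: Iterable[dict], fetch: Callable[[str], dict]) -> Dict[str, dict]:
    """Process thin events by re-reading the current state for each identifier."""
    entities: Dict[str, dict] = {}
    for event in events:
        # Validate the thin event: only a type and an identifier are required.
        if "event_type" not in event or "instance_id" not in event:
            raise ValueError(f"malformed event: {event!r}")
        # The event is only a notification; the fetched state is what we store.
        entities[event["instance_id"]] = fetch(event["instance_id"])
    return entities


events = [
    {"event_type": "model_instance_created", "instance_id": "ranking-model-v5-20XX0101"},
    {"event_type": "model_instance_updated", "instance_id": "ranking-model-v5-20XX0101"},
]
in_order = hydrate(events, fetch_current_state)
out_of_order = hydrate(reversed(events), fetch_current_state)
print(in_order == out_of_order)  # True
```

Every event triggers a read against the source system, which is the extra load the tradeoff above refers to; a real worker would wrap the fetch in rate limiting and backoff.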

For our ranking model example, when the model_instance_created event arrives, MDS calls the Model Registry API: GET /api/v1/instances/ranking-model-v5-20XX0101

The registry responds with a full descriptor. Example response (key fields only):

{
  "id": "ranking-model-v5-20XX0101",
  "pipeline_run_id": "train-weekly-ranking-20XX0101",
  "owner_emails": ["alice@netflix.com"],
  "labels": [{"key": "team", "value": "personalization"}],
  ...
}

3 Data Transformation and Normalization

Raw events are heterogeneous and each source system has its own schema and semantics. MDS workers transform these events into a unified entity model with standardized fields.

Without normalization, downstream consumers would need to understand every source system’s schema. Normalization creates a consistent interface, allowing queries and relationships to work across all entity types. Here is an example.

Normalized MDS entity:

{
  "id": "aip://model/registry/ranking-model-v5-20XX0101",
  "pipeline_run": "aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101",
  "entity_type": "ModelInstance",
  "owners": ["aip://user/identity/alice"],
  "tags": [{"tag": "team", "value": "personalization"}],
  ...
}

The normalization process standardizes field names and formats. For example, platform-specific IDs become global AIP URIs, owner_emails becomes owners with resolved user URIs, and labels become tags. Foreign keys like pipeline_run_id are transformed into entity references. However, there’s still no reference to which A/B tests are using this model. The Model Registry doesn’t track experiments, and the Experimentation Platform doesn’t track which pipeline produced a given model. This is where knowledge enrichment becomes critical.
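
As a rough sketch of this mapping, the Python function below turns the example registry response into the example normalized entity. The function names are hypothetical, and deriving a user URI from the local part of an email address is a simplifying assumption; the post only says that owners are resolved to user URIs.

```python
def user_uri(email: str) -> str:
    """Assumed resolution rule: use the local part of the email as the user id."""
    local_part = email.split("@", 1)[0]
    return f"aip://user/identity/{local_part}"


def normalize_model_instance(raw: dict) -> dict:
    """Map a raw registry descriptor onto standardized entity fields."""
    return {
        # Platform-specific ids become global AIP URIs.
        "id": f"aip://model/registry/{raw['id']}",
        # Foreign keys become references to other entities.
        "pipeline_run": f"aip://pipeline-run/orchestrator/{raw['pipeline_run_id']}",
        "entity_type": "ModelInstance",
        "owners": [user_uri(email) for email in raw.get("owner_emails", [])],
        # Registry labels are renamed to tags.
        "tags": [
            {"tag": label["key"], "value": label["value"]}
            for label in raw.get("labels", [])
        ],
    }


raw_response = {
    "id": "ranking-model-v5-20XX0101",
    "pipeline_run_id": "train-weekly-ranking-20XX0101",
    "owner_emails": ["alice@netflix.com"],
    "labels": [{"key": "team", "value": "personalization"}],
}
entity = normalize_model_instance(raw_response)
print(entity["owners"])  # ['aip://user/identity/alice']
```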

4 Storage and Indexing

Once normalized, entities are persisted to Datomic and immediately indexed in Elasticsearch. This happens synchronously within the event processing flow.

Datomic for Caching and Relationships
Normalized entities are first written to Datomic, which serves as both a local cache and a graph database.

Why Datomic? Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. Its immutable fact model means we can continuously add relationships without losing the original entity state.

What we store:

All entity attributes as facts
Entity references (foreign keys that may point to entities not yet fully resolved)
All relationships as reified edges (added by enrichment processes)
Entity lifecycle state (tracking which entities are fully enriched vs awaiting hydration)

This enables:

Complex graph traversals: Navigate from a model to its features to their data sources in a single query
Entity relationships: Join across multiple domains without N+1 query problems
Flexible schema evolution: Easy to add new entity types and attributes as the catalog grows
Progressive enrichment: Background jobs efficiently identify and process entities requiring additional hydration, enabling gradual graph completion without reprocessing fully enriched entities

In practice, we use Datomic for relationship-heavy, navigational queries such as:

Starting from this model instance, show me all upstream datasets and downstream experiments.
Given this feature, list all consuming models and their owning teams.

These queries often span multiple hops in the graph and benefit from Datomic’s immutable fact model and efficient joins across entity relationships.
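
To illustrate the shape of such a multi-hop query without assuming anything about Datomic's API, the sketch below runs a breadth-first traversal over an in-memory list of edges. The relation names (feeds, consumed_by, tested_in) and the example URIs are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)

# Hypothetical edges: dataset -> feature -> model -> experiment.
EDGES = [
    ("aip://dataset/example/plays", "feeds", "aip://feature/example/watch-time"),
    ("aip://feature/example/watch-time", "consumed_by", "aip://model/registry/ranking-v5"),
    ("aip://model/registry/ranking-v5", "tested_in", "aip://experiment/example/12345"),
]


def reachable(start: str, edges: Iterable[Edge], downstream: bool = True) -> Set[str]:
    """Return every node reachable from start, following edges forwards or backwards."""
    adjacency: Dict[str, Set[str]] = {}
    for source, _relation, target in edges:
        a, b = (source, target) if downstream else (target, source)
        adjacency.setdefault(a, set()).add(b)
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen and neighbor != start:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


model = "aip://model/registry/ranking-v5"
upstream = reachable(model, EDGES, downstream=False)
upstream_datasets = {n for n in upstream if n.startswith("aip://dataset/")}
downstream_experiments = reachable(model, EDGES, downstream=True)
print(upstream_datasets)       # {'aip://dataset/example/plays'}
print(downstream_experiments)  # {'aip://experiment/example/12345'}
```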

Elasticsearch for Discovery
Immediately after writing to Datomic, entities are indexed in Elasticsearch to power fast, full-text search across the catalog.

What we index:

Primary fields: Entity name, description, entity type, owner names
Relationship metadata: Names of related entities (e.g., a model’s features, pipelines, A/B tests) stored in the related field
Tags: Domain-specific metadata stored as key-value pairs (e.g., team::personalization, env::production, model.state::released)

Index structure:

Single entities index: All entity types (models, features, pipelines, etc.) are indexed in one unified index, differentiated by the entityType field
Separate owners index: Dedicated index for users and groups to enable cross-entity owner searches
Relevance boosting: Exact name matches score higher than other relevant matches

This enables:

Multi-field text search across entity names, descriptions, tags, and related metadata
Relevance ranking with boosting (exact name matches score significantly higher)
Complex filtering by entity type, ownership, tags, and domain-specific attributes (stored as tags)
Fuzzy matching to handle typos and partial queries

Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page. Indexing happens in near real-time as part of the ingestion and enrichment workflows, so changes are usually visible in the Portal with a short delay that is acceptable for interactive use.
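
A search of this kind could be expressed with the Elasticsearch query DSL roughly as follows. This Python sketch only builds the request body; the boost values and the name.keyword sub-field are assumptions about the index mapping, and the remaining field names are taken loosely from the lists above.

```python
from typing import Optional


def build_search_body(text: str, entity_type: Optional[str] = None) -> dict:
    """Build an Elasticsearch request body for a free-text catalog search."""
    bool_query: dict = {
        "must": [
            {
                "multi_match": {
                    "query": text,
                    # Assumed field names; "name^3" weights name matches 3x.
                    "fields": ["name^3", "description", "tags", "related"],
                    "fuzziness": "AUTO",  # tolerate small typos
                }
            }
        ],
        # Optional clause: an exact match on the name adds to the score.
        "should": [{"term": {"name.keyword": {"value": text, "boost": 10}}}],
    }
    if entity_type is not None:
        # Filters narrow the results without affecting relevance scores.
        bool_query["filter"] = [{"term": {"entityType": entity_type}}]
    return {"query": {"bool": bool_query}}


body = build_search_body("ranking-v5", entity_type="Model")
print(body["query"]["bool"]["filter"])  # [{'term': {'entityType': 'Model'}}]
```

Keeping the entity type in a filter clause rather than in the scored part of the query means that narrowing by type never changes the relative ranking of the remaining hits.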

5 Knowledge Enrichment and Graph Formation

Once entity metadata is persisted in Datomic, scheduled background processes take over to discover and materialize relationships. These enrichment jobs run periodically, scanning for uncached or partially resolved entities (entities that exist only as references without full metadata).

The enrichment workflow:

Identify candidates: Find entities marked as uncached or with unresolved references
Hydrate relationships: Query source-of-truth systems to fetch related entity details
Materialize edges: Write discovered relationships back to Datomic
Re-index: Trigger Elasticsearch indexing for updated entities
Mark as enriched: Update entity status to prevent redundant processing

This asynchronous approach allows MDS to handle the computational cost of graph formation without blocking real-time event ingestion. It also enables retry logic and gradual enrichment as new entities become available.
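
The five steps above can be sketched as a single pass of a background job. This is a minimal Python illustration, not the actual MDS job: the store is a plain dictionary, and the status values, the fetch_related callback, and the reindex callback are all hypothetical.

```python
from typing import Callable, Dict, List


def enrich_once(
    store: Dict[str, dict],
    fetch_related: Callable[[str], List[str]],
    reindex: Callable[[str], None],
) -> List[str]:
    """Run one enrichment pass and return the ids that were enriched."""
    enriched = []
    for entity_id, entity in store.items():
        if entity["status"] != "uncached":       # 1. identify candidates
            continue
        try:
            related = fetch_related(entity_id)    # 2. hydrate relationships
        except LookupError:
            continue                              # not available yet; retry next pass
        # 3. materialize edges (merge with any edges already known)
        entity["edges"] = sorted(set(entity["edges"]) | set(related))
        reindex(entity_id)                        # 4. re-index
        entity["status"] = "enriched"             # 5. mark as enriched
        enriched.append(entity_id)
    return enriched


store = {
    "model-a": {"status": "uncached", "edges": []},
    "model-b": {"status": "uncached", "edges": []},
}
related = {"model-a": ["test-1"]}  # model-b is not resolvable yet
reindexed: List[str] = []

first = enrich_once(store, related.__getitem__, reindexed.append)
related["model-b"] = ["test-2"]    # the missing data shows up later
second = enrich_once(store, related.__getitem__, reindexed.append)
third = enrich_once(store, related.__getitem__, reindexed.append)
print(first, second, third)  # ['model-a'] ['model-b'] []
```

Entities that cannot be hydrated stay marked as uncached, so a later pass picks them up again, while already enriched entities are skipped.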

Because enrichment is asynchronous, newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it’s safe to rely on a particular relationship for debugging or impact analysis.

Why enrich? Source systems are purpose-built and don’t know about entities in other domains. Enrichment discovers and materializes cross-system relationships that enable powerful lineage and impact queries.

Example: Connecting Models to A/B Tests

When MDS processes a new model instance, background enrichment jobs discover relationships through multi-hop inference:

Step 1: Direct link to pipeline

The model references a pipeline_run_id. An enrichment job hydrates the pipeline and discovers its A/B test associations: GET /api/v1/pipeline-runs/train-weekly-ranking-20XX0101

Response:

{
  "run_id": "train-weekly-ranking-20XX0101",
  "pipeline": "weekly-ranking-trainer",
  "ab_test_cells": [
    {"test_id": "12345", "cell_number": 2, "cell_name": "treatment_ranking_v5"}
  ],
  ...
}

Step 2: Discover A/B test context
The enrichment job discovers the pipeline ran for A/B test cell #2 and queries the Experimentation Platform for test details: GET /api/v1/tests/12345

{
  "test_id": "12345",
  "name": "Ranking Model v5 vs v4",
  "status": "ACTIVE",
  "cells": [{"cell_number": 1, "name": "control_ranking_v4"}],
  ...
}

Step 3: Infer transitive relationships
The enrichment job now has the complete chain:

Model Instance was produced by Pipeline Run
Pipeline Run was executed for A/B Test Cell #2
The A/B Test Cell #2 belongs to A/B Test “Ranking Model v5 vs v4”
Model Instance now gets associated with this A/B Test

The job writes the inferred relationship back to Datomic, materializing these edges in the graph, and then triggers re-indexing. MDS doesn’t just store what it’s told; it derives new knowledge by walking the graph in the background.
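
The three steps can be condensed into a short Python sketch. It reuses the example responses shown above (with elided fields left out) as in-memory lookups in place of real API calls; the function names and the associated_with_ab_test relation name are hypothetical.

```python
from typing import Callable, List, Tuple

# Example responses from the steps above, keyed by id (elided fields omitted).
PIPELINE_RUNS = {
    "train-weekly-ranking-20XX0101": {
        "run_id": "train-weekly-ranking-20XX0101",
        "pipeline": "weekly-ranking-trainer",
        "ab_test_cells": [
            {"test_id": "12345", "cell_number": 2, "cell_name": "treatment_ranking_v5"}
        ],
    }
}
AB_TESTS = {
    "12345": {"test_id": "12345", "name": "Ranking Model v5 vs v4", "status": "ACTIVE"}
}

Edge = Tuple[str, str, str]  # (model instance id, relation, test id)


def infer_ab_test_edges(
    model: dict,
    get_run: Callable[[str], dict],
    get_test: Callable[[str], dict],
) -> List[Edge]:
    """Derive model-to-test edges by hopping model -> pipeline run -> test."""
    run = get_run(model["pipeline_run_id"])        # hop 1: model to pipeline run
    edges = []
    for cell in run.get("ab_test_cells", []):
        test = get_test(cell["test_id"])           # hop 2: test cell to test
        edges.append((model["id"], "associated_with_ab_test", test["test_id"]))
    return edges


def models_in_test(test_id: str, edges: List[Edge]) -> List[str]:
    """Reverse lookup: which model instances are associated with a test?"""
    return sorted(
        source for source, relation, target in edges
        if relation == "associated_with_ab_test" and target == test_id
    )


model = {
    "id": "ranking-model-v5-20XX0101",
    "pipeline_run_id": "train-weekly-ranking-20XX0101",
}
edges = infer_ab_test_edges(model, PIPELINE_RUNS.__getitem__, AB_TESTS.__getitem__)
print(edges)
print(models_in_test("12345", edges))  # ['ranking-model-v5-20XX0101']
```

Once the edge is stored, the lookup works in both directions without calling any source system again.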

Why this matters: Without MDS, answering “Which A/B tests are using this model?” requires:

Looking up the model in the Model Registry
Finding which pipeline produced it
Checking the Pipeline Orchestrator for A/B test tags
Querying the Experimentation Platform for test details

With the model lifecycle graph, it’s a single query:

query {
  model(id: "aip://model/registry/ranking-model-v5-20XX0101") {
    name
    owners { name }
    currentInstance {
      version
      pipeline {
        name
        owners { name }
      }
      features {
        edges {
          node {
            name
            data { edges { node { name } } }
          }
        }
      }
      associatedAbTests {
        name
        cells { number name }
      }
    }
  }
}

The reverse query also works: “What models are being tested in experiment 12345?”

Enabling Exploration, Not Just Search

With the Model Lifecycle Graph in place, we shift from entity search to entity exploration. Discovery isn’t just about finding a model; it’s about traversing relationships:

Start with a model, explore its features
From features, navigate to the core data driving them
From the data, trace back to the pipelines generating it
From pipelines, see which teams own and depend on them
From experiments, understand which models are being tested

For example, imagine an engineer investigating a degraded engagement metric for a personalization model. They might:

Start with the model instance powering the affected recommendations in the AIP Portal.
Inspect the model’s features and follow a suspicious feature to its upstream dataset.
From the dataset page, see that its pipeline recently had failed runs and identify the owning team.
Confirm which A/B tests are currently running this model instance to understand which members and surfaces are impacted.

Before MDS and the Model Lifecycle Graph, this required manual checks across multiple tools (model registry, pipeline orchestrator, experiment platform). Now it’s a contiguous journey in a single interface.

This graph-based exploration answers questions that were previously impractical or impossible to answer:

Lineage queries: What is the complete lineage of this model, from training data to production experiments?
Impact analysis: Which models will be affected if I change this feature?
Usage discovery: Which A/B tests are using this model?
Dependency mapping: What data sources does my pipeline transitively depend on?
Deprecation planning: Which entities are no longer being used and can be retired?
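As a rough illustration of the impact analysis question, the traversal can be thought of as a breadth-first walk over downstream edges. The graph below is a made-up toy, and `impactedBy` and the `type:name` identifiers are assumptions for this sketch; it does not reflect how MDS stores or queries the graph.

```javascript
// Hypothetical downstream edges: "X feeds or produces Y".
const downstream = {
  "dataset:watch-history": ["feature:recent-plays"],
  "feature:recent-plays": ["model:ranking-v5", "model:search-v2"],
  "model:ranking-v5": ["abtest:12345"],
  "model:search-v2": [],
  "abtest:12345": [],
};

// Breadth-first traversal returning every entity reachable from `start`.
function impactedBy(start, graph) {
  const seen = new Set([start]);
  const queue = [start];
  const impacted = [];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const next of graph[current] ?? []) {
      if (seen.has(next)) continue; // guard against cycles and repeats
      seen.add(next);
      impacted.push(next);
      queue.push(next);
    }
  }
  return impacted;
}

// "Which models will be affected if I change this feature?"
const impactedModels = impactedBy("feature:recent-plays", downstream)
  .filter((id) => id.startsWith("model:"));
console.log(impactedModels); // [ 'model:ranking-v5', 'model:search-v2' ]
```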

Every entity has deep context: its creation time, ownership, update history, and most importantly, its relationships to other entities.

The Model Lifecycle Graph is surfaced to practitioners through the AIP Portal, a unified interface that provides full-text search across all entity types, detailed entity pages with navigable relationships, and personalized views for teams and individuals.

A typical interaction in the AIP Portal looks like:

Search: Type a model, feature, dataset, or team name into the single search box backed by Elasticsearch.
Inspect: Land on an entity page that shows key metadata (description, owners, domains, tags) alongside a relationships panel.
Explore: Click through to related entities (upstream datasets, downstream experiments, and sibling model versions) to navigate the Model Lifecycle Graph without leaving the portal.

When new entity types are introduced into MDS, the portal automatically provides baseline search, entity pages, and relationship navigation, and we can then layer on domain-specific visualizations (such as model deployment history or dataset version timelines) over time.

The Road Ahead: Open Challenges

Building the Model Lifecycle Graph is an ongoing journey. Significant challenges remain, and they represent future opportunities for us:

Tool Proliferation: As new ML tools emerge, we need robust integration patterns that scale. How do we design plugin architectures that make adding new sources seamless? If we don’t keep up with new tools, practitioners will be forced back into fragmented views, and the Model Lifecycle Graph will lose coverage and trust.
Domain-Specific Visualizations: Different entity types require distinct visualization experiences. Model pages should display deployment history, A/B test associations, and performance metrics. Feature pages should highlight data lineage and consuming models. Pipeline pages must show execution history, dependencies, and schedules. Dataset pages require versioning timelines and downstream consumers. How do we design a flexible UI framework that allows each entity type to have its own tailored experience while maintaining consistent navigation and interaction patterns across the portal? Without rich, domain-specific experiences, the portal risks becoming a generic catalog rather than a tool that ML practitioners rely on in their daily workflows.
Metadata Quality: Today, MDS ensures data consistency through source-of-truth hydration and schema validation at ingestion. Background enrichment jobs continuously infer relationships and materialize entities from source systems. However, challenges remain in ensuring completeness and timeliness at scale. When source systems fail to emit events, when ownership information becomes stale, or when entities lack descriptions and contextual metadata, the graph’s utility degrades. How do we build automated validation and enrichment systems to detect metadata anomalies, suggest missing relationships, and maintain quality benchmarks across millions of entities? Poor or stale metadata erodes practitioner trust: if the graph is incomplete or incorrect, teams will revert to ad hoc knowledge and one-off integrations rather than using MDS as their source of truth.
Advanced Relationship Inference: Beyond explicit relationships declared in source systems, how do we infer implicit connections? Can we detect that two models serve similar purposes based on shared features? Can we recommend features based on usage patterns from similar pipelines? We are in the early stages of exploring these ideas. Done well, they would turn MDS from a passive catalog into an active recommendation engine for ML assets, accelerating reuse and reducing duplicate work across domains.

Acknowledgments

This work represents the collective effort of stunning colleagues across the AI Platform organization: Emma Carney, Megan Ren, Nadeem Ahmad, Pat Oleniuk, Prateek Agarwal, Tigran Hakobyan, Yinglao Liu

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

State of Routing in Model Serving

Netflix Technology Blog — Fri, 01 May 2026 21:03:13 GMT

By Nipun Kumar, Rajat Shah, Peter Chng

Introduction

This is the first blog post in a multi-part series that shares technical insights into how our ML model serving infrastructure powers several personalized experiences at scale across various domains (e.g., title recommendations, commerce). In this introductory blog post, we will dive into the domain-independent API abstraction, and its traffic routing capabilities, that the central ML model serving platform exposes to several domain-specific microservices for model inference. This single API, or entry point, into the ML model serving platform has significantly increased the speed of innovation for iterating on newer versions of existing ML experiences, as well as enabling completely new product experiences with ML.

Machine Learning use cases powering member experiences on Netflix require rapid iteration and evolution in response to new learnings. The success of our ML model serving infrastructure largely depends on enabling researchers to rapidly experiment with new hypotheses and safely release their models into production at scale. Equally important is enabling multiple microservices at Netflix to seamlessly get model inference without being exposed to its complexities. To achieve this in a uniform and scalable manner, we created a centralized ML serving platform. As of 2025, the platform serves hundreds of model types and versions, handling a combined 1 million requests per second. In this post, we’ll zoom in on a core challenge of any large-scale ML serving system: how to route traffic to the right model instance, on the right cluster shard, for the right user and use case, while preserving a simple abstraction for both client services and model researchers.

Background

Models at Netflix

To properly frame our discussion, let’s first clarify the distinction between model serving and model inference. At Netflix, the definition of an ML model has historically been somewhat unique. While model inference typically focuses only on an infer(features) -> score capability, models at Netflix act as self-contained workflows that transform inputs to outputs. A “model” encapsulates pre- and post-processing, feature computation logic, and an optional ML-trained component, all packaged in a standard format suitable for use across multiple contexts. We refer to the end-to-end execution of this workflow as model serving. This distinction matters because our routing and API abstractions operate at the level of workflows, not just individual scoring functions.

A few simplified examples of model serving use cases:

Use case: Personalized Continue Watching row on Netflix Homepage

Input: UserId, Country, Device ID
Output: Ranked List of movies and shows (aka titles): [titleId1, titleId2, titleId3,…]

Use case: Payment Fraud Detection

Input: UserId, Country, Payment Transaction details
Output: Probability of the transaction being fraudulent

A typical flow of this serving workflow is depicted below:

To achieve this higher level of abstraction, the model definition contains a list of facts (raw, unprocessed data or observations built as states in different business workflows) that it needs to compute features, and it relies on the model serving platform to supply these facts at serving time by calling several other microservices. Likewise, during offline training, Netflix’s ML fact store provides snapshots for bulk access to facilitate feature computation.

The important takeaway from this model definition is that the calling services only need to provide standard request context (such as userId, country, device), and the relevant domain context (such as titles to rank, or payment transaction for fraud detection), and the model can itself compute features and perform inference as part of the execution flow. This common set of request contexts across domains enables them to share a standard API abstraction and standardizes how various client microservices can uniformly integrate with the serving app. Furthermore, clients are shielded from the model selection and execution, allowing the model architecture and data inputs to evolve with minimal client coordination.

This post focuses on showcasing the technical details to support this design paradigm. We’ll first describe how we implemented this abstraction with Switchboard, a centralized routing service, and then discuss the operational challenges we encountered at scale and how they led us to the Lightbulb architecture.

ML Model Serving Platform Principles

We envisioned a central model serving platform for all of Netflix’s member-facing ML Model serving needs. This ambitious effort required principled thinking to provide the right level of abstraction for both the researchers and client applications. The following ideas, which are relevant to the topic of this blog post, ensured that the platform acts as an enabler of rapid ML innovation and limits the exposure of ML model iterations to the client apps:

Model innovation independent of client apps: There should be only a one-time integration effort by the calling app with the ML serving platform for a new use case. After that, almost all model iterations, including intermediate model A/B experiments, should be mostly opaque to the calling apps. This implies that the platform should handle tasks such as model selection based on a user’s A/B allocation, fetching additional data needed by experimental models, logging for further training or observability, and more. This also benefits the ML researcher, as they only need to coordinate with one platform for model innovation.
Decouple clients from model sharding: Models are distributed across multiple serving compute cluster shards, each with its own Virtual IP (VIP) Address. Various factors, such as traffic patterns, SLAs, model architecture, and CPU/Memory availability, affect model-to-cluster mapping, and changes to this mapping result in changes to the VIP address at which a model is reachable. The serving platform should make clients agnostic to such frequent VIP address changes while ensuring high availability.
Flexible traffic routing rules: Support flexible mechanisms to introduce new traffic routing rules. This includes supporting traffic routing based on A/B experiments, providing a knob to slowly shift traffic to new models and VIP addresses, and allowing client overrides.

Introducing Switchboard

Standard out-of-the-box API Gateway solutions (such as AWS API Gateway or a standalone service mesh proxy) did not meet all our requirements. In particular, we needed first-class integration with Netflix’s experimentation platform, the ability to expose gRPC endpoints to clients, and the ability to use rich domain-specific context for routing customizations, which generic proxies were not designed to handle. Furthermore, the platform required customizations to model-specific lifecycle stages (shadow mode, canaries, rollbacks) to enable safe rollouts and migrations.

Hence, we embarked on building a custom service that serves as a flexible proxy layer for all traffic, handling over 1 million requests per second while maintaining high availability and reliability. We named it Switchboard.

Switchboard serves as the central entry point for the system, acting as a mandatory interface for all clients to access the appropriate model based on their context. Its role is to perform context-aware routing and to apply any configured context enrichment to the model inputs.

Here is a visual representation of the request flow from different clients to different serving clusters:

Objective Abstraction

To support this system design, we introduce the concept of an “Objective”. It’s an Enumeration defined by the serving platform that every request into the system must provide. It has three key purposes:

In short, an Objective is the serving platform’s name for a specific business use case (e.g., ContinueWatchingRanking), which decouples clients from concrete models and guides the platform’s routing and model selection decisions.
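A minimal sketch of what this contract could look like from a client’s point of view follows. The `ContinueWatchingRanking` name comes from the text; `PaymentFraudDetection`, `buildRequest`, and the request field names are assumptions made for illustration and are not the platform’s actual API.

```javascript
// Hypothetical Objective enumeration owned by the serving platform.
const Objectives = Object.freeze({
  ContinueWatchingRanking: "ContinueWatchingRanking",
  PaymentFraudDetection: "PaymentFraudDetection", // assumed name
});

// Build a request that names a business use case, never a concrete model.
function buildRequest(objective, requestContext, domainContext) {
  if (!Object.values(Objectives).includes(objective)) {
    throw new Error(`Unknown objective: ${objective}`);
  }
  return { objective, requestContext, domainContext };
}

const request = buildRequest(
  Objectives.ContinueWatchingRanking,
  { userId: "user-1", country: "US", deviceId: "device-9" },
  { titleIds: ["titleId1", "titleId2", "titleId3"] }
);
console.log(request.objective); // ContinueWatchingRanking
```

Because the request carries only the Objective and context, the model behind it can change without the client changing its call.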

Key Capabilities of Switchboard

To summarize, these are the key capabilities of Switchboard:

Common Client Abstraction: Switchboard provides a single point of contact for all our clients’ model needs. When clients wish to consume additional models for new ML applications addressing the same business need, there is no new service dependency to introduce or new clients to manage to make requests to the models. From an ML Ops perspective, this also gives us knobs to control client rate limits across model versions and manage central concurrency limits to deal with bad clients.
Context-Aware Routing: Switchboard can route a request based on a rich set of contextual features, such as the user’s current device, locale, ranking surface type (e.g., home page vs. search results), or the current A/B test a user is in.
Dynamic Traffic Splitting: It enables real-time traffic splitting for canary deployments and experimentation. This allows engineers to safely roll out a new model version to a small, controlled percentage of users before a full launch.
Model Versioning and Lifecycle Management: Switchboard inherently manages concurrent request traffic to multiple versions of the same model. This is crucial for:

Shadow Mode Testing: Routing production traffic to a new model version without affecting the user experience, enabling performance comparisons.
Instant Rollback: Immediate switching of traffic away from a problematic new model version back to a stable one.
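One common way to implement percentage-based splitting is to hash a stable identifier into a bucket, so a given user consistently sees the same version. The sketch below uses that technique as an assumption; the text does not say how Switchboard assigns traffic, and `pickVersion`, the `salt` field, and the model names are hypothetical.

```javascript
// FNV-1a 32-bit hash, applied here to UTF-16 code units of the string.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Map a user to a stable bucket in [0, 100) and pick a model version.
function pickVersion(userId, split) {
  const bucket = fnv1a(`${split.salt}:${userId}`) % 100;
  return bucket < split.canaryPercent ? split.canary : split.stable;
}

const split = {
  salt: "continue-watching-canary", // hypothetical per-rollout salt
  stable: "model-v4",
  canary: "model-v5",
  canaryPercent: 5,
};

console.log(pickVersion("user-1", split));

// Rollback in this sketch: a canaryPercent of 0 sends everyone to stable.
console.log(pickVersion("user-1", { ...split, canaryPercent: 0 })); // model-v4
```

Salting the hash per rollout is a design choice that avoids the same users always landing in every canary.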

But is this the whole story? Not quite. Introducing this routing layer adds complexity to our model deployment cycles. In addition, we need a mechanism to collect the context-based routing information from the researchers when they choose to deploy model variants.

The Glue — Switchboard Rules

Given that Objectives serve as the contract between clients and the serving platform, we needed a way for researchers to attach model variants, experiments, and traffic splits to those Objectives without changing client code. This is where Switchboard Rules comes in.

Model researchers define the models associated with an Objective primarily through a JavaScript configuration, which we call Switchboard Rules. It produces a set of rules (typically a JSON file) that tell the serving platform:

The default model to use for a given Objective
A/B experiments to configure for a set of Objectives and the corresponding models to load for those experiments
Customizations to gradually shift traffic to a new model

Here is an example of an A/B test rule in the context of the Continue Watching row:

/**
 * Configuration rule written by a Model Researcher to add an A/B experiment
 * in the Model Serving system.
 * Cell 1: Uses the default, currently productized model
 * Cell 2 and Cell 3: Use different experimental (candidate) models
 *
 * Objectives and constants are not defined in this snippet; they are
 * assumed to be supplied by the rules environment.
 */
function defineAB12345Rule() {
    const abTestId = 12345;

    // A single Objective; it is wrapped in a list below.
    const objective = Objectives.ContinueWatchingRanking;
    const abTestCellToModel = {
        1: {name: "netflix-continue-watching-model-default"},
        2: {name: "netflix-continue-watching-model-cell-2"},
        3: {name: "netflix-continue-watching-model-cell-3"}
    };

    return {
        cellToModel: abTestCellToModel,
        abTestId: abTestId,
        targetObjectives: [objective],
        modelInputType: constants.TITLE_INPUT_TYPE,
        modelType: 'SCORER'
    };
}

These rules are consumed by both Switchboard and the Model Serving clusters. Given these rules, the serving platform components can take various actions, some of which are detailed below:

Control Plane Flow:

Assignment: Produce model-to-cluster shard assignment.
Validation: Load all specified models into the Serving Cluster Shard and validate model dependencies to ensure successful execution.
Mapping: Provide the model-to-shard VIP address mapping to Switchboard.

Data Plane Flow:

Allocation: If the request is for Objective=ContinueWatchingRanking, query the Experimentation Platform for the userId’s cell allocation.
Model Selection: Use the allocation and A/B test rule to select the appropriate model.
Request Routing: Route the request to the serving cluster shard with the selected model and context.
Model Execution (on the serving host): Run the model workflow steps and return the response.
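The data plane steps above can be sketched roughly as follows. The rule shape mirrors the A/B example earlier in this section; the shard VIP names, `resolveRoute`, the `getCell` callback standing in for the Experimentation Platform, and the fallback to cell 1 for unallocated users are assumptions of this sketch.

```javascript
// Rule shape follows the A/B example above; shard VIPs are hypothetical.
const rule = {
  abTestId: 12345,
  targetObjectives: ["ContinueWatchingRanking"],
  cellToModel: {
    1: { name: "netflix-continue-watching-model-default" },
    2: { name: "netflix-continue-watching-model-cell-2" },
    3: { name: "netflix-continue-watching-model-cell-3" },
  },
};

const modelToVip = {
  "netflix-continue-watching-model-default": "shard-a.vip.example",
  "netflix-continue-watching-model-cell-2": "shard-b.vip.example",
  "netflix-continue-watching-model-cell-3": "shard-b.vip.example",
};

// Allocation -> model selection -> routing target for one request.
// `getCell(abTestId, userId)` stands in for the Experimentation Platform call.
function resolveRoute(request, getCell) {
  if (!rule.targetObjectives.includes(request.objective)) {
    throw new Error(`No rule for objective ${request.objective}`);
  }
  const cell = getCell(rule.abTestId, request.userId);
  // Assumption: unallocated users fall back to cell 1, the default model.
  const model = (rule.cellToModel[cell] ?? rule.cellToModel[1]).name;
  return { model, vip: modelToVip[model] };
}

// Fake allocation: only user-2 is allocated, and to cell 2.
const fakeAllocation = (abTestId, userId) => (userId === "user-2" ? 2 : undefined);

console.log(resolveRoute({ objective: "ContinueWatchingRanking", userId: "user-2" }, fakeAllocation));
console.log(resolveRoute({ objective: "ContinueWatchingRanking", userId: "user-7" }, fakeAllocation));
```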

A key highlight of this setup is the decoupling of the experimentation config from the serving platform code. This includes having an independent release cycle for the rules, separate from the code deployments. Netflix’s Gutenberg system provides an excellent ecosystem that enables a flexible pub-sub architecture, facilitating proper versioning, dynamic loading, easy rollbacks, and more. Both Switchboard and the Serving Cluster Host subscribe to the same Switchboard Rules configuration.

To prevent race conditions and ensure proper sync of the dynamic Switchboard Rules configuration, the following flow is used:

Evolving Challenges

Switchboard solved the primary problem of improving model iteration and innovation velocity, and provided an excellent ML serving abstraction to over 30 service clients. However, as the system scale increased, a few problems with this design became apparent:

Single point of failure: Because Switchboard sits in the critical request path, an extreme event, such as an unintentional bug or a noisy neighbor sending excessive traffic, could cut off access to all serving hosts.
Why this matters: Switchboard became a shared dependency whose failure would degrade or disable multiple ML-powered experiences at Netflix.
Added latency due to additional network hop: Switchboard in the request path adds 10–20 ms of latency, depending on payload size, due to serialization and deserialization. It also exposes requests to tail latency amplification.
Why this matters: The added latency is unacceptable for some latency-sensitive clients, resulting in end-user impact due to service timeouts.
Reduced Client flexibility: Switchboard obscures visibility into client request origins from the serving clusters. Consequently, distinguishing data logged for real vs artificial traffic, which is essential for model training, is difficult and requires ongoing customization and increased MLOps overhead.
Why this matters: It makes it harder to do tenant separation and test traffic isolation.
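To make the tail latency point in the second challenge concrete, here is a back-of-the-envelope sketch. It assumes hop latencies are independent and asks how likely a request is to hit at least one hop slower than that hop’s own p99. The independence assumption and the `chanceOfSlowHop` helper are illustrative and are not measurements from the platform.

```javascript
// Probability that at least one of `hops` sequential hops is slower than
// its own percentile threshold, assuming hop latencies are independent.
function chanceOfSlowHop(hops, percentile = 0.99) {
  return 1 - Math.pow(percentile, hops);
}

console.log(chanceOfSlowHop(1).toFixed(4)); // 0.0100
console.log(chanceOfSlowHop(2).toFixed(4)); // 0.0199
```

Under these assumptions, adding one proxy hop roughly doubles the share of requests that experience a tail-latency event somewhere along the path.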

What Next? — Lightbulb

The aforementioned challenges of operating Switchboard at scale forced us to rethink the core implementation while retaining its key features. Our goal was not to throw away Switchboard’s design, but to refactor where and how its responsibilities were executed, keeping the benefits while reducing risk and latency. In particular, we wanted to retain:

Common Client Abstraction
Decouple clients from model sharding
Flexible traffic routing rules
Lightweight system client
Single place to define model and experimentation config
Fast experimentation config propagation
Fallback and client-side caching in case of failures

However, we did want to revisit some of the previous design choices:

Remove the routing service from the direct request path: Having a single service in the active request path introduces another failure mode and limits fallback flexibility. While routing rules change infrequently, maintaining consistency comes at the cost of increased availability risks.
Separate model inputs from the request metadata: In certain cases, the request payload could be quite large. Needing to deserialize and then re-serialize the payload as it flowed through Switchboard to make a routing decision was a significant contributor to latency and increased serving costs.
Provide better isolation for the routing layer: Consolidating multiple use cases (tenants) into a single routing cluster posed two main challenges. First, errors could propagate: a surge of problematic requests from one tenant could cascade errors back to Switchboard, potentially impacting other users. Second, the cluster had to accommodate diverse latency requirements because requests from different use cases varied significantly in complexity.

This required some changes to our setup flow. While it remained largely unchanged, we created separate components for Routing and Model Selection (Lightbulb):

We now take the rules for an Objective and break them into distinct sets of configuration:

Model Serving Configuration: This allows us to determine which model should be used at request time, along with the required metadata.
Routing Rules: Given a model we want to serve at request time, this tells us which VIP the request should be routed to.
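As a rough sketch of this separation, a single combined rule can be split into a "what to serve" part and a "where to send it" part. The combined rule shape, `splitRule`, the VIP names, and the choice to use the model name as the routing key are all assumptions for illustration; the text does not specify the real formats.

```javascript
// Hypothetical combined rule for one Objective, before splitting.
const objectiveRule = {
  objective: "ContinueWatchingRanking",
  defaultModel: "netflix-continue-watching-model-default",
  abTestId: 12345,
  cellToModel: { 2: "netflix-continue-watching-model-cell-2" },
  modelToVip: {
    "netflix-continue-watching-model-default": "shard-a.vip.example",
    "netflix-continue-watching-model-cell-2": "shard-b.vip.example",
  },
};

// Split into what to serve (model selection) and where to send it (routing).
function splitRule(rule) {
  const { modelToVip, ...modelServingConfig } = rule;
  const routingRules = Object.entries(modelToVip).map(([model, vip]) => ({
    routingKey: model, // assumption: the routing key is simply the model name
    vip,
  }));
  return { modelServingConfig, routingRules };
}

const { modelServingConfig, routingRules } = splitRule(objectiveRule);
console.log(modelServingConfig);
console.log(routingRules);
```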

The Data Plane changes also reflect this separation, as we now rely on Envoy to take care of the routing details:

Envoy is already used for all egress communication between apps at Netflix, and it can route requests to different clusters (VIPs) based on the configurable Routing Rules published from our control plane. However, it lacks the information needed to make routing decisions and the ability to enrich the request body with additional serving parameters required for A/B testing model variants. We introduced Lightbulb to cover this gap:

Lightbulb consumes the minimal request context, which contains use-case information, and provides the metadata mapping required for routing at the Envoy layer.
Lightbulb resolves the request context to determine a routingKey configuration along with the ObjectiveConfig — this is where we place the model id along with other request-specific configurations required for model execution. This is done to separate the config resolution associated with the request from the placement and routing information needed to reach it on the inference cluster.
While the routingKey is added to the headers for Envoy proxy to consume, the client adds the ObjectiveConfig parameters to the request itself. This is done to avoid bloating the request headers while passing additional parameters for the model to process the request appropriately.
The routing of the actual request is performed by the Envoy proxy, which has the metadata to map the routingKey to the actual cluster VIP running the model. Because the routingKey is in a header, this determination can be made with minimal overhead.
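The flow above can be sketched end to end as follows. This is an illustration under stated assumptions rather than Lightbulb’s or Envoy’s actual interfaces: the `x-routing-key` header name, the `resolve`, `buildOutboundRequest`, and `pickCluster` functions, the routing key values, and the VIP names are all hypothetical.

```javascript
// Hypothetical stand-ins: A/B allocations and the proxy's routing metadata.
const allocations = { "user-2": 2 };
const routingTable = {
  "cw-default": "shard-a.vip.example",
  "cw-cell-2": "shard-b.vip.example",
};

// Lightbulb-like resolution: minimal request context in, a routingKey and
// an ObjectiveConfig out. No model inputs are needed for this step.
function resolve(context) {
  const inCell2 = allocations[context.userId] === 2;
  return {
    routingKey: inCell2 ? "cw-cell-2" : "cw-default",
    objectiveConfig: {
      modelId: inCell2
        ? "netflix-continue-watching-model-cell-2"
        : "netflix-continue-watching-model-default",
    },
  };
}

// Client side: the routingKey goes into a header, the ObjectiveConfig into
// the request body next to the (possibly large) model inputs.
function buildOutboundRequest(context, modelInputs) {
  const { routingKey, objectiveConfig } = resolve(context);
  return {
    headers: { "x-routing-key": routingKey }, // illustrative header name
    body: { modelInputs, objectiveConfig },
  };
}

// Proxy side: choose the cluster VIP from the header alone, without
// deserializing the body.
function pickCluster(request) {
  const vip = routingTable[request.headers["x-routing-key"]];
  if (vip === undefined) throw new Error("No route for routing key");
  return vip;
}

const outbound = buildOutboundRequest(
  { objective: "ContinueWatchingRanking", userId: "user-2" },
  { titleIds: ["titleId1", "titleId2"] }
);
console.log(pickCluster(outbound)); // shard-b.vip.example
```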

These changes retain the advantages of Switchboard, such as a single integration point, abstraction of model id from use case, and context-aware routing, while addressing the challenges we observed over time.

Conclusion

The evolution from Switchboard to Lightbulb marks a significant architectural refinement in our ML model serving infrastructure. While Switchboard provided the initial abstraction layer critical for rapid innovation, its latency and single-point-of-failure risk posed scaling hurdles. The subsequent adoption of Lightbulb, a decoupled service focused solely on routing metadata, and its integration with Envoy successfully resolved these challenges. This sophisticated new architecture preserves the key benefits — seamless client integration and flexible experimentation — while ensuring reliable, efficient, and scalable delivery of personalized member experiences, positioning us well for future ML growth.

In future posts in this series, we’ll dive deeper into other aspects of our ML serving platform, including inference and feature fetching, and how they interact with the routing architecture described here.

Special thanks to Sura Elamurugu, Sri Krishna Vempati, Ed Maddox, and Sreepathi Prasanna for their invaluable feedback and partnership in iterating on this idea and bringing this blog post to life.

State of Routing in Model Serving was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Scaling Camera File Processing at Netflix

Netflix Technology Blog — Fri, 24 Apr 2026 15:06:01 GMT

Orchestrating Media Workflows Through Strategic Collaboration

Authors: Eric Reinecke, Bhanu Srikanth

Introduction to Content Hub’s Media Production Suite

At Netflix, we want to provide filmmakers with the tools they need to produce content at a global scale, with quick turnaround and choice from an extraordinary variety of cameras, formats, workflows, and collaborators. Every series or film arrives with its own creative ambitions and technical requirements. To reduce friction and keep productions moving smoothly, we built Netflix’s Media Production Suite (MPS) with the goal of automating repeatable tasks, standardizing key workflows, and giving productions more time to focus on creative collaboration and craftsmanship.

A critical part of this effort is how we handle image processing and camera metadata across the hundreds of hours and terabytes of camera footage that Netflix productions ingest on a daily basis. Rather than build every component from scratch, we chose to partner where it made sense, especially in areas where the industry already had trusted, battle-tested solutions.

This article explores how Netflix’s Media Production Suite integrates with FilmLight’s API (FLAPI) as the core studio media processing engine in Netflix’s cloud compute infrastructure, and how that collaboration helps us deliver smarter, more reliable workflows at scale.

Why We Built MPS

As Netflix’s production slate grew, so did the complexity of file-based workflows. We saw recurring challenges across productions:

File wrangling sapping time from creative decision-making
Inconsistent media handling across shows, regions, or vendors
Difficult-to-audit manual processes that are prone to human error
Duplication of effort as teams reinvented similar workflows for each production

Content Hub Media Production Suite was created to address these pain points. MPS is designed to:

Bring efficiency, consistency, and quality control to global productions
Streamline media management and movement from production through post-production
Reduce time spent on non-creative file management
Minimize human error while maximizing creative time

To achieve this, MPS needed a robust, flexible, and trusted way to handle camera-original media and metadata at scale.

The Right Tool for the Job

From the start, we knew that building a world-class image processing engine in-house is a significant, long-term commitment: one that would require deep, continuous collaboration with camera manufacturers and the wider industry.

When designing the system, we set out some core requirements:

Inspect, trim, and transcode original camera files and metadata for any Netflix production with trusted color science
Support a wide variety of cameras and recording formats used worldwide while staying current as new ones are released
Run well in our paved-path encoding infrastructure, enabling us to take advantage of proven compute and storage scalability with robust observability

FilmLight develops Baselight and Daylight, which are commonly used in the industry for color grading, dailies, and transcoding. Their FilmLight API (FLAPI) allows us to use that same media processing engine as a backend API.

Rather than duplicating that work, we chose to integrate. FilmLight became a trusted technology partner, and FLAPI is now a foundational part of how MPS processes media.

The Media Processing Engine

MPS is not a single application; it’s an ecosystem of tools and services that support Netflix productions globally. Within that ecosystem, the FilmLight API plays the following key roles.

1. Parsing camera metadata on ingest

Productions upload media to Netflix’s Content Hub with ASC MHL (Media Hash List) files to ensure completeness and integrity of initial ingest, but soon after, it’s important to understand the technical characteristics of each piece of media. We call this workflow phase “inspection.”

Footage ingested with MPS is inspected using FLAPI and all metadata is indexed and stored

At this stage, we:

Use FLAPI to gather camera metadata from the original camera files
Conform the workflow-critical fields to Netflix’s normalized schema
Make the metadata searchable and reusable for downstream processes

This metadata is integral to:

Matching footage based on timing and reel name for automated retrieval
Debugging (e.g., why a shot looks a certain way after processing)
Validations and checks across the pipeline
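The conform step described above might look roughly like the sketch below. Every field name here is invented for illustration: the aliases are not FLAPI’s actual output keys and the normalized names are not Netflix’s schema. The sketch only shows the general idea of mapping camera-specific keys onto one canonical name per workflow-critical field.

```javascript
// Hypothetical mapping from camera-specific keys to normalized field names.
const fieldAliases = {
  reelName: ["Reel Name", "reel", "ReelID"],
  startTimecode: ["Start TC", "start_timecode"],
  frameRate: ["FPS", "Frame Rate"],
};

// Conform raw camera metadata: keep only workflow-critical fields,
// each under a single canonical name.
function conformMetadata(raw) {
  const normalized = {};
  for (const [field, aliases] of Object.entries(fieldAliases)) {
    const key = aliases.find((alias) => raw[alias] !== undefined);
    if (key !== undefined) normalized[field] = raw[key];
  }
  return normalized;
}

const conformed = conformMetadata({
  "Reel Name": "A001",
  "Start TC": "01:00:00:00",
  FPS: 24,
  ISO: 800, // not workflow-critical in this sketch, so it is dropped
});
console.log(conformed);
```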

FLAPI provides consistent, camera-aware insight into footage that may have originated anywhere in the world. Additionally, since we’re able to package FLAPI in a Docker image, we can deploy almost identical code to both cloud and our production compute and storage centers around the world, ensuring a consistent assessment of footage wherever it may exist.

2. Generating VFX plates and other deliverables

Visual effects workflows constantly push image processing pipelines to their absolute limits. For MPS to succeed, it must generate images with accurate framing, consistent color management, and correct debayering/decoding parameters — all while maintaining rapid turnaround times.

To achieve this, we leverage Netflix’s Cosmos compute and storage platform and use open standards to provide predictable and consistent creative control.

At this phase, we use the FilmLight API to:

Debayer original camera files with the correct format-specific decoding parameters
Crop and de-squeeze images using Framing Decision Lists (ASC FDL) to ensure spatial creative decisions are preserved
Apply ACES Metadata Files (AMF), providing repeatable color pipelines from dailies through finishing
Generate an array of media deliverables in varied formats

These processes are automated, repeatable, and auditable. We deliver AMFs alongside the OpenEXRs to ensure recipients know exactly what color transforms are already applied, and which need to be applied to match dailies.
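
The crop and de-squeeze step comes down to simple geometry. The sketch below assumes a simplified framing decision made of an anchor point, a size in source pixels, and an anamorphic squeeze factor. The `FramingDecision` type and its fields are illustrative stand-ins, not the ASC FDL schema or any FLAPI call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FramingDecision:
    # Simplified, hypothetical stand-in for a framing decision.
    anchor_x: int   # left edge of the framed region, in source pixels
    anchor_y: int   # top edge of the framed region, in source pixels
    width: int      # framed width, in source (squeezed) pixels
    height: int     # framed height, in source pixels
    squeeze: float  # anamorphic squeeze factor, 1.0 for spherical lenses


def crop_box(fd, src_width, src_height):
    """Return (left, top, right, bottom), checking it fits the source."""
    right = fd.anchor_x + fd.width
    bottom = fd.anchor_y + fd.height
    if fd.anchor_x < 0 or fd.anchor_y < 0 or right > src_width or bottom > src_height:
        raise ValueError("framing decision falls outside the source image")
    return (fd.anchor_x, fd.anchor_y, right, bottom)


def output_size(fd):
    """De-squeeze horizontally so the delivered pixels are square."""
    return (round(fd.width * fd.squeeze), fd.height)
```

For a 2x anamorphic source, a framed region 3584 pixels wide de-squeezes to 7168 pixels wide while the height is unchanged. Rejecting out-of-bounds decisions up front keeps a bad framing file from silently producing mis-framed plates.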

Because we use FilmLight’s tools on the backend, our workflow specialists can use Baselight on their workstations to manually validate pipeline decisions for productions before the first day of principal photography.

The Media Processing Factory in the Cloud

Finding an engine that competently processes media in line with open standards is an important part of the equation. To maximize impact, we want to make these tools available to all of the filmmakers we work with. Luckily, we’re no strangers to scaled processing at Netflix, and our Cosmos compute platform was ready for the job!

Cloud-first integration

The traditional model for this kind of processing in filmmaking has been to invest in beefy computers with large GPUs and high-performance storage arrays to rip through debayering and encoding at breakneck speed. However, constraints in the cloud environment are different.

For tools to work in our runtime environment, it is essential that they:

Are packageable as Serverless Functions in Linux Docker images that can be quickly invoked to run a single unit of work and shut down on completion
Can run on CPU-only instances to allow us to take advantage of a wide array of available compute
Support headless invocation via Java, Python, or CLI
Operate statelessly, so when things do go wrong, we can simply terminate and re-launch the worker

Operating within these constraints lets us focus on increasing throughput via parallel encoding rather than focusing on single-instance processing power. We can then target the sweet spot of the cost/performance efficiency curve while still hitting our target turnaround times.
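
A minimal sketch of that trade-off, assuming a clip can be split into independent frame segments: each segment is processed by a stateless unit of work, and a failed attempt is simply thrown away and started again. The function names are hypothetical, and the thread pool here only stands in for separate function invocations.

```python
from concurrent.futures import ThreadPoolExecutor


def split_frames(first, last, chunk):
    """Split an inclusive frame range into inclusive (start, end) segments."""
    if chunk < 1 or last < first:
        raise ValueError("invalid frame range or chunk size")
    return [(s, min(s + chunk - 1, last)) for s in range(first, last + 1, chunk)]


def run_with_relaunch(work, segment, max_attempts=3):
    """Run one segment; on failure, discard the attempt and start over."""
    for attempt in range(1, max_attempts + 1):
        try:
            return work(segment)
        except Exception:
            # No state survives a failed attempt, so a retry is a clean start.
            if attempt == max_attempts:
                raise


def process_clip(work, first, last, chunk=100, workers=8):
    """Process every segment in parallel and return results in frame order."""
    segments = split_frames(first, last, chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seg: run_with_relaunch(work, seg), segments))
```

Because no segment depends on another, throughput grows with the number of workers rather than with the power of any single machine.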

When tools are API-driven, easily packaged in Linux containers, and don’t require a lot of external state management, Netflix can quickly integrate and deploy them with operational reliability. FilmLight API fit the bill for us. At Netflix, we leverage:

Java and Python as the primary integration languages
Ubuntu-based Docker images with Java and Python code to expose functionality to our workflows
CPU instances in the cloud and local compute centers for running inspection, rendering, and trimming jobs

While FLAPI also supports GPU rendering, CPU instances give us access to a much wider segment of Netflix’s vast encoding compute pool and free up GPU instances for other workloads.

To use FilmLight API, we bundle it in a package that can be easily installed via a Dockerfile. We then build Cosmos Stratum Functions that accept an input clip, an output location, and varying parameters such as frame ranges and AMF or FDL files when debayering footage. These functions can be quickly invoked to process a single clip or sub-segment of a clip and then shut down again to free up resources.
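
To make the shape of such a function call concrete, here is a hedged sketch of a request payload carrying the inputs named above. The `DebayerRequest` class and every field name are assumptions for illustration; they do not describe the actual Cosmos Stratum or FLAPI interfaces.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class DebayerRequest:
    input_clip: str                    # location of the original camera file
    output_location: str               # where rendered frames should be written
    first_frame: Optional[int] = None  # None means "from the start of the clip"
    last_frame: Optional[int] = None   # None means "to the end of the clip"
    amf_path: Optional[str] = None     # optional ACES Metadata File
    fdl_path: Optional[str] = None     # optional Framing Decision List

    def __post_init__(self):
        if not self.input_clip or not self.output_location:
            raise ValueError("input_clip and output_location are required")
        if (self.first_frame is not None and self.last_frame is not None
                and self.last_frame < self.first_frame):
            raise ValueError("last_frame must not precede first_frame")

    def to_json(self) -> str:
        """Serialize, omitting optional fields that were not provided."""
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload, sort_keys=True)
```

Keeping everything a worker needs inside one self-describing payload is what allows a function to start, process a single clip or sub-segment, and exit without consulting external state.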

Elastic scaling for production workloads

Production workloads are inherently spiky:

A quiet day on set may mean minimal new footage to inspect.
A full VFX turnover or pulling trimmed OCF for finishing might require thousands of parallel renders in a short time window.

By deploying FLAPI in the cloud as functions, MPS can:

Allocate compute on demand and release it when our work queue dies down
Avoid tying capacity to a fixed pool of local hardware
Smooth demand across many types of encoding workload in a shared resource pool

This elasticity lets us swarm pull requests (requests to pull footage, such as a VFX turnover) to get them through quickly, then immediately yield resources back to lower-priority workloads. Because capacity is not tied to a fixed allocation, we avoid the pain of manually managing render queues and prioritization, even in peak production periods. All this means lightning-fast turnaround times and less anxiety around deadlines for our filmmakers.
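
The sizing logic behind this kind of elasticity can be approximated with back-of-the-envelope arithmetic. The sketch below is a simplified estimate under assumed inputs (uniform job duration, a known turnaround target, a cap on the shared pool); it is not how Cosmos actually schedules work.

```python
import math


def workers_needed(queued_jobs, minutes_per_job, target_minutes, max_workers):
    """Estimate how many workers to request to drain the queue on time."""
    if queued_jobs == 0:
        return 0  # release everything when the queue is empty
    if minutes_per_job <= 0 or target_minutes <= 0:
        raise ValueError("durations must be positive")
    # Total work divided by the time allowed gives the parallelism required.
    needed = math.ceil(queued_jobs * minutes_per_job / target_minutes)
    # Never ask for more workers than there are jobs or than the pool allows.
    return max(1, min(needed, max_workers, queued_jobs))
```

With these illustrative numbers, 3,000 queued renders at 2 minutes each and a 30-minute target call for 200 workers, while an empty queue calls for none. A fixed pool would have to be provisioned for the first case and would sit idle in the second.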

Designed for Seasoned Pros and Emerging Filmmakers

Netflix productions range from highly experienced teams with very specific workflows to newer teams who may be less familiar with potential pitfalls in complex file-based pipelines.

MPS is designed to support both:

Industry veterans who need to configure precise, bespoke workflows and trust that underlying image processing will respect those decisions.
Productions without a color scientist on staff — those who benefit from guardrails and sane defaults that help them avoid common workflow issues (e.g., mismatched color transforms, inconsistent debayering, or incomplete metadata handling).

The partnership with FilmLight lets Netflix focus on workflow design, orchestration, and production support, while FilmLight focuses on providing competent handling of a wide variety of camera formats with world-class image science!

Collaboration and Co-Evolution

Netflix aimed to integrate MPS into a wider tool ecosystem by developing a comprehensive solution based on emerging open standards, rather than making MPS a self-contained system. Integrating FLAPI into our system requires more than an API reference; it requires ongoing partnership. FilmLight worked closely with Netflix teams to:

Align on feature roadmaps, particularly around new camera formats and open standards
Validate the accuracy and performance of key operations
Debug edge cases discovered in large-scale, real-world workloads
Evolve the API in ways that serve both Netflix and the wider industry
Create a positive feedback cycle with open standards like ACES and ASC FDL to solve for gaps when the rubber hits the road

One example of this has been with the implementation of ACES 2. FilmLight’s developers quickly provided a roadmap for support. As our engineering teams collaborated on integration, we also provided feedback to the ACES technical leadership to quickly address integration challenges and test drive updates in our pipeline.

This collaborative relationship, built on open communication, joint validation, and feedback to the greater industry, is how we routinely work with FilmLight to ensure we’re not just building something that works for our shows, but also driving a healthy tooling and standards ecosystem.

Impact

While much of this work takes place behind the scenes, its impact is felt directly by our productions. Our goal in building MPS is for producers, post supervisors, and vendors to experience:

Fewer delays caused by missing, incomplete, or incorrect media
Faster turnaround on VFX plates and other technical deliverables
More predictable, consistent handoffs between editorial, color, and VFX
Less time spent troubleshooting technical issues, and more time focused on creative review

In practice, this often shows up as the absence of crisis: the time a VFX vendor doesn’t have to request a re-delivery, or the time editorial doesn’t have to wait for corrected plates, or the time the color facility doesn’t have to reinvent a tone-mapping path because the AMF and ACES pipeline are already in place.

Looking Ahead

As camera technology, codecs, open standards, and production workflows continue to evolve, so will MPS. The guiding principles remain:

Automate what’s repeatable
Centralize what benefits from standardization
Partner where deep domain expertise already exists

The integration with FilmLight API is one example of this philosophy in action. By treating image processing as a specialized discipline and collaborating with a trusted industry partner, Netflix is delivering smarter, more reliable workflows to productions worldwide.

At its core, this partnership supports a simple goal: reduce manual workflow and tool management, giving filmmakers more time to tell stories.

Acknowledgements

This project is the result of collaboration and iteration over many years. In addition to the authors, the following people have contributed to this work:

Matthew Donato
Prabh Nallani
Andy Schuler
Jesse Korosi

Scaling Camera File Processing at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.