DEV Community

What happens when companies become too AI-pilled?

Judy — Sat, 27 Jun 2026 01:00:26 +0000

This article is a deep-dive from JudyAI Lab — an AI engineering playbook series with 100+ published guides, 5,000+ weekly readers across 60+ countries, focused on the practical side of running AI agents, trading systems, and content pipelines in production.

📰 TL;DR

Box founder Aaron Levie recently called out what he calls "AI psychosis" — the phenomenon where executives who greenlight "AI can replace this job" are often the ones who know the least about what that job actually entails. He's warning that this decision-making blind spot is spreading across tech — decision-makers, overconfident in AI's potential, are rushing to implement mass layoffs without truly understanding workflows, role nuances, or the human judgment required.

In a concrete case, collaboration platform ClickUp just announced cutting 22% of its workforce, explicitly stating it will use AI Agents to take over those functions. This wave of layoffs has pushed 2026's total tech layoffs to nearly match all of 2025 — before even reaching the mid-year point — showing AI-driven workforce reduction is accelerating.

Levie's core argument isn't against AI adoption, but rather a warning: companies rushing to replace human labor with AI lack deep understanding of the work itself. When decision-makers aren't familiar with the actual complexity of the roles being replaced, they tend to overestimate AI's real coverage capability, ultimately hurting organizational efficiency. This "over-AI'd" thinking is becoming a new management risk in Silicon Valley. Full interview available at the source link.

💬 JudyAI Lab Take

The "AI psychosis" Levie pointed out is a red flag worth every AI implementer paying attention to: the execs loudest about "AI can replace this job" are often the ones most unfamiliar with that role — that's the real decision blind spot.

ClickUp's 22% headcount reduction replacing roles with AI Agents has already pushed 2026 tech layoffs near matching all of 2025 — before midyear. There's a wake-up call for the AI builder community here: the环节 where automation design fails most isn't tech selection, but insufficient understanding of the "work being automated." When designing Agent flows or workflows, skipping deep interviews with actual executors easily leads to overestimating AI's coverage — automating visible steps while missing a lot of implicit human judgment and exception handling. Levie's critique is essentially a requirements analysis problem: not understanding the true complexity of a job means more AI investment can lead to bigger organizational efficiency losses.

Before planning any AI replacement plan, talk to the people actually doing that job first. Ask them "what do you know that nobody else knows?" — that's often exactly where AI falls shortest.

References

Originally published at Judy AI Lab. Visit for more articles on AI engineering and development.

Why Enterprise AI Needs Structured Dissent, Not Just More Agents

Amit Kumar Singh — Sat, 27 Jun 2026 00:53:39 +0000

Many AI projects today are presented as multi-agent systems.

One agent investigates. Another agent analyzes risk. A third agent checks compliance. A fourth agent gives a recommendation.

It sounds advanced.

But in a bank, adding more agents does not automatically make a workflow safe.

A bank cannot freeze a customer account, block a payment, file a regulatory report, or label a transaction as fraud simply because an AI system produced a confident answer.

The real question is not:

How many AI agents are involved?

The real question is:

Can the system show evidence, challenge its own conclusion, apply deterministic rules, and stop for human approval when the decision is high impact?

That is the difference between an interesting multi-agent demo and an enterprise-ready AI workflow.

A banking example: suspicious wire transfer

Imagine a bank detects a wire transfer for $250,000.

The payment is unusual because:

The customer has never sent a transfer of this size.
The destination account is in a new country.
The transaction happens outside the customer’s normal business hours.
The beneficiary was added only a few minutes before the transfer.
The customer recently changed their phone number and email address.

A simple AI chatbot might say:

“This transaction looks suspicious. Consider blocking it.”

That is not enough.

A bank needs to know:

Which transaction patterns triggered the concern?
Is the customer actually violating a known risk threshold?
Is there a sanctions or AML issue?
Could this be a legitimate business payment?
What policy applies?
Should the payment be blocked, held, or released?
Who is allowed to make that decision?
Can the bank explain the decision later to auditors, compliance teams, and the customer?

This is where structured multi-agent design matters.

A better design: a banking fraud decision room

Instead of letting one model make a decision, the bank can create a controlled workflow with specialized agents.

Transaction Alert
      ↓
Fraud Detection Agent
      ↓
Customer Behavior Agent
      ↓
AML / Sanctions Agent
      ↓
Policy and Risk Agent
      ↓
Decision Reviewer
      ↓
Human Compliance Officer

Each agent has a limited responsibility.

1. Fraud Detection Agent

This agent analyzes transaction behavior.

It may identify:

Unusual payment amount
New beneficiary
New country
Unusual transaction time
Sudden profile changes
Prior fraud indicators

Its job is not to freeze the transaction.

Its job is to create a structured fraud signal.

{
  "event_type": "FRAUD_SIGNAL",
  "transaction_id": "TXN-784921",
  "customer_id": "CUST-10048",
  "risk_indicators": [
    "new_beneficiary",
    "amount_12x_customer_average",
    "unusual_country",
    "recent_contact_change"
  ],
  "risk_score": 82,
  "confidence": 0.88
}

This gives the next stage a reviewable artifact instead of a paragraph generated by an LLM.

2. Customer Behavior Agent

A transaction may look suspicious but still be legitimate.

For example, a corporate customer may be making a valid acquisition payment or paying a new overseas vendor.

The Customer Behavior Agent looks at:

Historical payment behavior
Customer segment
Typical payment ranges
Known business relationships
Recent support interactions
Whether the customer informed the bank about a major payment

This agent can produce a counterpoint:

{
  "event_type": "CUSTOMER_CONTEXT",
  "transaction_id": "TXN-784921",
  "historical_pattern": "Outside normal range",
  "known_business_event": "No supporting event found",
  "customer_contacted_bank": false,
  "assessment": "Transaction behavior remains inconsistent",
  "confidence": 0.76
}

This is important because the system should not treat every unusual payment as fraud.

Structured dissent is necessary

Now imagine the fraud agent recommends blocking the payment.

A good enterprise workflow should not simply accept that recommendation.

It should require another role to challenge it.

For example:

The Fraud Agent says: “High fraud risk.”
The Customer Context Agent says: “No evidence of a legitimate business event.”
The AML Agent says: “Beneficiary has elevated geographic risk.”
The Policy Agent says: “The bank’s hold threshold is met.”
The Decision Reviewer says: “Human approval required before blocking.”

That is structured dissent.

It is not about making agents argue for entertainment.

It is about making assumptions visible before the bank takes action.

In high-stakes workflows, disagreement is not a weakness. Hidden disagreement is the real risk.

The LLM should not make the final decision alone

LLMs are useful for many parts of the workflow:

Summarizing transaction history
Explaining why a transaction appears unusual
Reading customer notes
Interpreting investigation findings
Drafting a case narrative
Generating a compliance-review summary

But an LLM should not control deterministic rules.

For example, these should come from governed systems and rules engines:

Daily transaction thresholds
Sanctions screening results
AML policy conditions
Regulatory filing timelines
Customer account restrictions
Approval authority limits
Payment-hold policies
Risk score calculations

A safe architecture looks like this:

AI Layer
- Investigates
- Summarizes
- Explains
- Recommends

Rules Layer
- Calculates thresholds
- Applies risk policies
- Checks sanctions lists
- Enforces approval limits
- Determines required escalation

Human Layer
- Approves
- Rejects
- Overrides
- Requests further investigation

This distinction matters.

The AI can explain why a payment looks suspicious.

The rules engine can determine whether the bank’s fraud-hold threshold has been crossed.

The compliance officer can decide whether the payment should actually be blocked.

An evidence panel is more important than a chatbot answer

The final decision should not be a black-box score.

A compliance officer should see an evidence panel like this:

Transaction:
TXN-784921

Customer:
Corporate customer — existing account for 4 years

Amount:
$250,000

Risk indicators:
- New beneficiary
- New destination country
- Payment amount is 12x normal average
- Contact information changed within past 24 hours
- No matching historical vendor relationship

Policy checks:
- Enhanced review threshold: Triggered
- Manual compliance approval: Required
- Sanctions screening: Clear
- AML monitoring alert: Triggered

AI assessment:
High-risk transaction requiring manual review

Human decision:
Payment placed on temporary hold

Approved by:
Compliance Officer

Decision timestamp:
2026-06-26 14:22 UTC

This is what enterprise AI should produce.

Not just an answer.

A decision record.

Human approval is part of the architecture

Human approval should not be added as an afterthought.

In banking, some actions should be automated.

For example:

Action	AI / system role	Human role
Summarize alert	Automatic	Review if needed
Identify unusual transaction patterns	Automatic	Review exceptions
Create investigation case	Automatic	Monitor
Place temporary low-risk review hold	Rule-based	Review later
Freeze account	Recommend only	Explicit approval required
File SAR or regulatory report	Draft supporting evidence	Compliance approval required
Close customer account	Never autonomous	Senior human decision

The system should know when to proceed, when to pause, and when to escalate.

That is not a limitation.

That is good enterprise design.

What this means for data engineering teams

This same pattern applies directly to data engineering.

A data-engineering copilot should not only generate SQL or YAML from a source-to-target mapping document.

It should operate as a governed workflow.

For example:

STTM / DDL / Source Metadata
          ↓
Metadata Extraction Agent
          ↓
Mapping Validation Agent
          ↓
Transformation Logic Agent
          ↓
SQL / YAML Generator
          ↓
Reviewer Agent
          ↓
Data Engineer Approval

The reviewer should validate things such as:

Does the source column exist?
Is the target data type compatible?
Is the join supported by the mapping?
Is the transformation rule documented?
Is a sign rule missing?
Is a derived metric using an unapproved assumption?
Are there duplicate or unused YAML objects?
Has an engineer approved the generated output?

Then every generated artifact should include traceability.

Target Column:
PROFIT_AMT

Source:
sales.PROFIT_AMT

Transformation:
CASE WHEN SALES_TYPE = 'CANCEL'
THEN PROFIT_AMT* -1
ELSE PROFIT_AMT
END

Business Rule:
Cancellation transactions must store Profit as negative.

Source Reference:
STTM row 42

Validation:
- Source column exists
- Transformation approved
- Target data type compatible
- Human review status: Approved

This is how generated code becomes a governed engineering artifact.

A practical checklist for enterprise AI

Before calling a multi-agent system enterprise-ready, ask:

Does each agent have a clear responsibility?
Are handoffs structured instead of free-text only?
Can one agent challenge another agent’s conclusion?
Are critical calculations and policy checks deterministic?
Can every recommendation be traced to source evidence?
Does the system show assumptions and confidence levels?
Is there a clear escalation path for uncertainty?
Can a human approve, reject, or override the decision?
Can the organization reconstruct the full decision later?

If the answer is no, the solution may still be a useful prototype.

But it is not ready for high-stakes enterprise use.

Final thought

The future of enterprise AI is not one intelligent assistant making every decision.

It is also not a collection of agents talking continuously.

The future is a governed decision system where AI helps teams investigate faster, compare perspectives, identify risk, and prepare recommendations.

But evidence remains visible.

Rules remain enforceable.

Disagreement remains allowed.

And people remain accountable.

That is how AI becomes useful in banking, finance, data engineering, and other enterprise workflows where trust matters as much as speed.

https://dataengineeringcopilot.com

https://github.com/amising6/data-engineering-copilot

https://www.linkedin.com/in/amit-singh-57980030

I wanted a Go networking engine that gets out of the way, so I built one (Breeze).

Farshad Khazaei Fard — Sat, 27 Jun 2026 00:49:42 +0000

Over the past few months, I've been working on Breeze, a networking engine built on top of gnet.

The goal wasn't to create "another web framework."

The goal was to explore how far an event-loop architecture can go for modern Go services.

Some design decisions I made:

⚡ Event-loop driven architecture
🌐 Native HTTP and WebSocket support
🚀 WebSocket fast-path (avoids the HTTP router after the upgrade)
🧵 Worker pool to keep the event loop responsive
📦 Low-allocation request handling
📚 Built-in Swagger support
🔌 Built-in WebSocket Hub for real-time applications

One design choice I'm particularly interested in discussing is that HTTP and WebSocket aren't treated the same.

Every incoming connection is classified inside the event loop. HTTP requests follow the router, while upgraded WebSocket connections are dispatched directly to the WebSocket engine. It keeps the hot path small and avoids unnecessary work once the protocol is established.

The project is still evolving, and I'm deliberately questioning every architectural decision before calling it "production ready."

I'd genuinely appreciate feedback from developers who have experience with:

High-concurrency Go servers
gnet or event-loop architectures
Large-scale WebSocket systems
Low-latency backend services

Repository:
https://github.com/nelthaarion/breeze
Documentation:
https://nelthaarion.github.io/breeze

I'm especially interested in hearing what you would change. Architecture discussions are often more valuable than benchmark numbers.

Happy to answer any questions or dive into implementation details. 🚀

Mastering the "Quantified Self": Building a Blazing-Fast Heart Rate Dashboard with DuckDB and Streamlit

Beck_Moulton — Sat, 27 Jun 2026 00:44:00 +0000

As programmers, we love data. We track our commits, our uptime, and our deployment frequencies. But what about our most important "server"—our heart? 💓

The "Quantified Self" movement has led to an explosion of wearable data. However, if you've ever tried to analyze raw heart rate CSVs (often sampled every few seconds), you'll quickly realize that standard relational databases or even pure Pandas can get sluggish once you hit that 100k+ row mark.

In this tutorial, we are going to build a high-performance Quantified Self Dashboard. We will leverage DuckDB—the "SQLite for Analytics"—to perform vectorized execution on heart rate data, paired with Streamlit and Plotly for a slick, interactive frontend. We’ll focus on Python data engineering, time-series analysis, and fast SQL processing.

Why DuckDB? 🦆

Traditional databases are row-based, which is great for transactions but terrible for analytical queries. DuckDB is a columnar-vectorized query engine. This means it processes data in chunks (vectors) and utilizes modern CPU instructions (SIMD) to crunch numbers at speeds that make standard Python loops look like they're standing still.

The Architecture

Here is how our data pipeline flows from raw pixels (well, raw CSV rows) to actionable insights:

graph TD
    A[Raw Heart Rate CSVs] -->|Direct Ingestion| B(DuckDB Engine)
    B -->|Vectorized SQL Execution| C{Data Aggregation}
    C -->|Moving Averages/Outliers| D[Streamlit App State]
    D -->|Plotly| E[Interactive Visualization]
    E -->|User Input| D

Prerequisites 🛠️

Ensure you have the following stack installed:

Python 3.9+
DuckDB: For the heavy lifting.
Streamlit: For the UI.
Plotly: For the beautiful charts.

pip install duckdb streamlit plotly pandas

Step 1: Ingesting 100,000+ Data Points in Milliseconds

One of the coolest features of DuckDB is its ability to query CSV files directly without a formal "import" step. This is a game-changer for developer productivity.

import duckdb
import pandas as pd

# Let's assume 'heart_rate.csv' has columns: timestamp, bpm
def load_data(file_path):
    # DuckDB can read CSVs directly and infer types!
    con = duckdb.connect(database=':memory:')

    # High-performance SQL query to aggregate data into 1-minute buckets
    query = f"""
    SELECT 
        time_bucket(INTERVAL '1 minutes', timestamp) AS time,
        AVG(bpm) AS avg_bpm,
        MAX(bpm) AS max_bpm
    FROM read_csv_auto('{file_path}')
    GROUP BY 1
    ORDER BY 1
    """
    return con.execute(query).df()

Step 2: Building the Interactive Dashboard

Now, let's wrap this in Streamlit. We want to calculate a Moving Average to smooth out the noise from the sensor.

import streamlit as st
import plotly.express as px

st.set_page_config(page_title="Heart Rate Analytics", layout="wide")

st.title("🏃‍♂️ Quantified Self: Heart Rate Insights")
st.markdown("Processing 100k+ data points in real-time using **DuckDB**.")

uploaded_file = st.file_uploader("Upload your heart rate CSV", type="csv")

if uploaded_file:
    # Save the uploaded file temporarily
    with open("temp_data.csv", "wb") as f:
        f.write(uploaded_file.getbuffer())

    # Query using DuckDB
    df = load_data("temp_data.csv")

    # Add a moving average using Pandas (or do it in SQL for more speed!)
    window_size = st.slider("Smoothing Window (minutes)", 1, 60, 5)
    df['smoothed_bpm'] = df['avg_bpm'].rolling(window=window_size).mean()

    # Create the Plotly Chart
    fig = px.line(df, x='time', y='smoothed_bpm', 
                  title="Heart Rate Trend (Smoothed)",
                  labels={'smoothed_bpm': 'BPM', 'time': 'Time'})

    fig.update_traces(line_color='#ef4444')
    st.plotly_chart(fig, use_container_width=True)

    # Key Metrics
    col1, col2, col3 = st.columns(3)
    col1.metric("Max HR", f"{int(df['max_bpm'].max())} BPM")
    col2.metric("Avg HR", f"{int(df['avg_bpm'].mean())} BPM")
    col3.metric("Data Points", f"{len(df)} rows")

The "Production" Way: Advanced Patterns 🥑

While this setup is perfect for local analysis, scaling "Quantified Self" apps for production requires more robust data architecture. If you're interested in how to deploy these types of analytical apps at scale or want to see more advanced SQL optimization patterns for time-series data, I highly recommend checking out the WellAlly Blog.

They provide excellent deep dives into production-ready data engineering and have some fantastic resources on building performant monitoring systems that go far beyond basic CSV parsing.

Step 3: Performance Comparison

Why did we use DuckDB instead of standard Pandas?

Operation	Pandas (Standard)	DuckDB (Vectorized)
CSV Ingestion	1.2s	0.15s
Group By Aggregation	0.8s	0.04s
Memory Footprint	Moderate	Low (Streaming)

As you can see, DuckDB is consistently 5-10x faster for these analytical workloads. For a developer dashboard where you want instant feedback when sliding a filter, these milliseconds matter!

Conclusion: Take Back Your Data! 📊

Building your own tools to visualize your health data is incredibly rewarding. By combining DuckDB's speed with Streamlit's ease of use, you've created a tool that can handle massive datasets on your laptop without breaking a sweat.

Your turn:

Try adding a SQL query to detect "Zone 5" training sessions.
Use DuckDB's JOIN capabilities to correlate your heart rate with your GitHub commit frequency!

If you enjoyed this tutorial, drop a comment below or share your own Quantified Self projects! And don't forget to visit wellally.tech/blog for more advanced engineering content. Happy coding! 💻🔥

Why your prototype works for you but not for anyone else

Saveyourproject — Sat, 27 Jun 2026 00:37:43 +0000

TL;DR — A prototype that works for you but breaks for everyone else usually isn't bad luck. It's four repeatable culprits: you designed for one assembly, your fasteners drift, the enclosure ignores real loads, and you never wrote down why it works. Fix those, and "works on my bench" becomes "works, period."

You built the thing, and it works. In your hands, on your bench, every single time.

Then a friend tries it, or it sits in the garage a week, or the temperature drops one night, and it just stops.

Frustrating doesn't really cover it.

Here's the reassuring part: that gap between "works for me" and "works for anyone" is almost always the same small handful of culprits. You're not missing some secret skill. Once you've met them a few times, you start designing around them without even thinking about it.

1. You built it for one. Now build it for two.

That first one fit because you were there — nudging, sanding, coaxing it together. The trouble is, all of that lived in your hands, not in the model.

So the second copy fights you.

If you can't make a second one without the fiddling, it isn't done yet. Bake the clearance into the CAD, then print one you promise not to touch up. That's the real test.

2. Your fasteners are quietly betraying you.

Press-fits creep. Hot glue lets go. Jumper wires back out. Double-sided tape taps out the first warm afternoon.

I know the boring fixes aren't the fun part — a screw boss, a captive nut, a bit of strain relief, a connector that actually clicks home. But boring is exactly what's still holding a year from now.

3. The enclosure is a load, not a lid.

It's easy to treat the box as an afterthought. But heat, dust, and vibration are real forces working on your build.

A board that runs cool in the open can slowly cook once it's sealed up. A connector that's happy on the bench can buzz itself loose in a drawer that gets opened every day.

Give the heat somewhere to go, mount the board instead of letting it dangle from its wires, and clamp down anything that moves.

4. Write down why it works.

Future-you is begging you.

Six months from now, v2 breaks and you won't remember which dimension was load-bearing, which resistor value you finally settled on, or why that one screw is longer than the rest.

One page of "why" notes — just the decisions that actually mattered — turns a miserable teardown into a five-minute fix.

You don't need a factory for any of this

It's really just one step past "it works," and it's a step anyone can learn.

(There's a whole other layer once you want to make ten of something, where sourcing and repeatability change the math. But that's a problem for after this one.)

So I'm curious: which of these four bites you the most? For me it's almost always #1 — designing for one, and forgetting the second one has to exist too.

I originally posted this on my own site, if you'd like it in one place: https://www.saveyourproject.com/blog/bench-to-real-use-reliability

Event-Driven Architecture: sistemi che reagiscono invece di chiedere

Dev-Iadicola — Sat, 27 Jun 2026 00:34:20 +0000

Il problema: catene di chiamate rigide

Quando un utente si registra, il sistema deve: inviare l'email di benvenuto, creare il profilo default, notificare il team di vendita, aggiornare le statistiche, attivare il periodo di prova. Il controller chiama cinque servizi in sequenza. Se aggiungi un sesto step (integrazione CRM), devi modificare il controller. Se il servizio email e lento, rallenta tutto. Se fallisce, blocca i passaggi successivi. Le dipendenze crescono linearmente con le funzionalita.

L'architettura event-driven inverte il flusso: il controller fa una sola cosa — registra l'utente e pubblica un evento UserRegistered. Ogni componente interessato reagisce all'evento indipendentemente. Il controller non sa quanti listener ci sono, ne cosa fanno. I listener non si conoscono tra loro.

Concetti fondamentali

Evento: un fatto accaduto

Un evento e un fatto immutabile che descrive qualcosa che e già successo: OrderPlaced, PaymentReceived, ArticlePublished. Non e una richiesta ("fai qualcosa") ma una notifica ("e successo qualcosa"). Questa distinzione e fondamentale: l'emittente non si aspetta una risposta e non sa chi sta ascoltando.

Producer e Consumer

Il producer emette l'evento. Il consumer (listener/subscriber) reagisce. Un evento può avere zero, uno o molti consumer. Aggiungere un consumer non richiede modifiche al producer. Rimuoverne uno non rompe nulla.

Esempio teorico: ciclo di vita di un ordine

OrderPlaced → SendOrderConfirmationListener invia l'email, ReserveInventoryListener blocca lo stock, NotifyWarehouseListener prepara la spedizione, UpdateDashboardListener aggiorna le metriche
PaymentReceived → ConfirmOrderListener aggiorna lo stato, GenerateInvoiceListener crea la fattura
OrderShipped → SendShippingNotificationListener notifica il cliente, StartDeliveryTrackingListener attiva il tracking

Ogni listener e una classe isolata, testabile, che fa una sola cosa. Aggiungere un'integrazione con un nuovo sistema di analytics? Crei un listener, lo registri sull'evento, e tutto funziona senza toccare il codice esistente.

Sincrono vs Asincrono

Sincrono: i listener vengono eseguiti nella stessa request. Semplice, ma se un listener e lento, rallenta la response. Adatto per operazioni veloci e critiche (validazione, aggiornamento stato).
Asincrono: i listener vengono accodati (Redis, RabbitMQ, database) e eseguiti da worker separati. Non rallenta la response. Adatto per operazioni lente (email, PDF, integrazioni esterne) o non critiche.

In pratica, la maggior parte dei sistemi usa un mix: alcuni listener sincroni per le operazioni critiche, altri asincroni per il resto.

Event-Driven in Soft PHP MVC

Il framework usa eventi tramite l'Observer Pattern nei Model: i lifecycle hooks (beforeSave, afterSave, beforeDelete) sono eventi sincroni che permettono di reagire ai cambiamenti delle entita. Il CacheObserver invalida la cache quando un articolo viene modificato — senza che il Model sappia che la cache esiste.

Quando usare Event-Driven Architecture

Usa eventi quando un'azione ha effetti collaterali che non sono responsabilità del componente che la esegue
Usa eventi quando vuoi che nuove funzionalita possano "agganciarsi" senza modificare il codice esistente
Usa eventi quando il disaccoppiamento tra componenti e una priorità
Non usare eventi per flussi lineari semplici dove una chiamata diretta e più chiara
Non usare eventi se il debugging e già difficile: gli eventi rendono il flusso meno tracciabile

L'architettura event-driven non e un'alternativa a MVC o Hexagonal: e un principio di comunicazione che si applica dentro qualsiasi architettura. Quando i componenti reagiscono a fatti invece di essere chiamati direttamente, il sistema diventa più flessibile, più estensibile e più resiliente.

👉 Leggi l'articolo completo su iadicola.it

I Tried Connecting an OpenAI-Compatible API to WordPress AI

Ailles — Sat, 27 Jun 2026 00:33:22 +0000

I was curious about something while testing WordPress AI features:

What happens if the AI provider I want to use is not one of the default providers?

WordPress AI can connect to supported providers directly, but many AI services today expose an OpenAI-compatible API instead of being officially supported one by one.

That includes things like:

OpenRouter
OpenWebUI
Ollama
LiteLLM
other custom OpenAI-compatible endpoints

At first, I assumed that if an API follows the OpenAI format, I could just paste the endpoint somewhere and call it a day.

Turns out, not always.

So I tested a bridge approach using a WordPress plugin called Koneek to connect an OpenAI-compatible API provider to WordPress AI.

Here is what I found.

The thing I wanted to test

The question was simple:

Can WordPress AI use an OpenAI-compatible provider even if WordPress does not support that provider directly?

In my test, I used OpenRouter as the provider because it exposes an OpenAI-compatible endpoint and also has free models available.

The goal was not to build a custom integration from scratch.

I wanted to see whether WordPress AI could use a third-party compatible API through configuration only.

Why OpenAI-compatible APIs are not always plug-and-play

This part surprised me a little.

Even when a provider says it is OpenAI-compatible, it does not mean every app can automatically use it.

Different providers may use:

a different base URL
custom authentication behavior
extra headers
different model naming
slightly different endpoint expectations
different timeout behavior

So even if the request and response format is mostly compatible, WordPress still needs a way to know:

where to send the request
which model to use
which API key to send
how long to wait for a response

That is where the plugin comes in.

What I used

For this test, I prepared:

WordPress 7.0
the official WordPress AI plugin
the Koneek – AI Provider for OpenAI-Compatible plugin
an OpenAI-compatible API key
a model name
the provider base URL

For OpenRouter, the base URL I used was:

https://openrouter.ai/api/v1

For the model, the original test used:

cohere/north-mini-code:free

The exact model can change depending on what is available from your provider, so I would not hardcode that forever. Always check the provider's model list before configuring it.

The setup flow I tested

After installing and activating the required plugins, I went into:

Settings > Koneek

From there, I added a new provider configuration.

The fields were pretty straightforward:

Name: Any label you want
Provider: OpenAI-Compatible
API Key: Your provider API key
Model: The model name from your provider
API URL: The OpenAI-compatible base URL
Timeout: Optional

For OpenRouter, the important values looked like this:

Provider: OpenAI-Compatible
Model: cohere/north-mini-code:free
API URL: https://openrouter.ai/api/v1

I left the timeout empty at first because I wanted to see whether the default behavior was good enough.

If the provider is slow or the model takes longer to respond, setting a custom timeout may be worth testing.

Testing before saving was the useful part

One thing I liked about this workflow is that I did not have to blindly save the config and hope it worked.

Before saving, I tested the connection.

This is important because there are several small things that can break the setup:

invalid API key
wrong base URL
typo in the model name
provider-side rate limit
free model no longer available
timeout too low

In my case, once the connection test returned successfully, I saved the configuration.

That small test step saves a lot of guessing.

Enabling WordPress AI features

After the provider was connected through Koneek, the next step was enabling the AI features from:

Settings > AI

This part depends on the WordPress AI plugin configuration.

Once enabled, WordPress AI could use the configured provider for features like:

rewriting text
rephrasing content
regenerating titles
other editor AI actions

That was the moment where the setup actually felt useful.

The provider was not just saved somewhere in settings. It became usable inside WordPress.

What worked well

The biggest win is flexibility.

Instead of waiting for WordPress AI to support every provider directly, this setup makes it possible to connect providers that expose an OpenAI-compatible API.

That opens the door to testing different backends, such as:

hosted AI providers
local AI gateways
self-hosted OpenWebUI
LiteLLM routers
OpenRouter model routing
internal proxy APIs

For developers, that is useful because AI providers change fast.

Today you may want OpenRouter. Tomorrow you may want a local Ollama setup behind an OpenAI-compatible proxy.

The WordPress side does not need to care too much as long as the bridge configuration works.

The trade-offs I noticed

This setup is convenient, but I would not treat it as magic.

There are still a few things to watch:

Model names matter

OpenAI-compatible providers often use their own model naming format.

For example, OpenRouter model names usually look different from OpenAI model names.

If the model name is wrong, the connection may fail even if the API key and base URL are correct.

Free models may change

If you use a free model, do not assume it will stay available forever.

This is fine for testing, but for production workflows, I would use a stable paid model or at least have a fallback plan.

Timeout may need tuning

Some models respond slower than others.

If WordPress keeps failing while the provider works elsewhere, timeout settings are one of the first things I would check.

API keys still need to be protected

According to the Koneek plugin information, API keys are stored using AES-256-CBC encryption with the encryption key derived from WordPress secret salts in wp-config.php.

That is better than storing plain text secrets, but I would still treat API keys seriously:

do not reuse important keys everywhere
rotate keys if you suspect exposure
avoid giving unnecessary permissions
keep wp-config.php secure

When I would use this approach

I would use this setup when I want WordPress AI to work with a provider that is not directly supported yet.

Good use cases:

testing multiple AI providers
using OpenRouter for model routing
connecting an internal OpenAI-compatible gateway
experimenting with local or self-hosted AI
avoiding vendor lock-in
giving WordPress AI access to cheaper or specialized models

I probably would not use a random free provider for serious production content automation unless I had already tested its reliability, latency, and privacy policy.

The main thing I learned

The interesting part is that "OpenAI-compatible" is more of a compatibility layer than a universal guarantee.

It gives you a common API shape, but the app still needs to know the provider details.

Once those details are configurable, WordPress AI becomes much more flexible.

The setup that worked for me was:

WordPress AI
    ↓
Koneek provider configuration
    ↓
OpenAI-compatible API endpoint
    ↓
AI provider/model

That is a pretty clean flow.

Final thoughts

I went into this assuming the hardest part would be the API itself.

It was not.

The real trick was making WordPress AI aware of the provider's base URL, model name, and authentication details.

Once that bridge existed, the rest felt surprisingly simple.

While researching this setup, I also came across an Indonesian article that walks through a similar configuration: Cara Menghubungkan OpenAI-Compatible API ke WordPress AI

I am curious how other WordPress developers are approaching this.

Are you using official AI providers only, or are you routing everything through OpenAI-compatible gateways like OpenRouter, LiteLLM, Ollama, or OpenWebUI?

Redis Isn't PostgreSQL: Building a Hybrid Change Data Capture Runtime in Ruby

Ken C. Demanawa — Sat, 27 Jun 2026 00:26:45 +0000

I Built Commercial Redis CDC Source Drivers for Ruby — Here's What I Learned

For the past couple of years I've been building a Change Data Capture (CDC) ecosystem for Ruby.

Like many CDC projects, it started with PostgreSQL. PostgreSQL's Write-Ahead Log (WAL) is an excellent source of truth: durable, ordered, replayable, and well understood. It provides exactly the properties you want when you're building reliable event pipelines.

But the deeper I went into distributed systems, the more I realized something important.

Many systems don't observe change from PostgreSQL first.

They observe it from Redis.

Redis often sits at the front of modern architectures:

Redis Streams carry application events.
Pub/Sub distributes transient state changes.
Keyspace notifications react to cache invalidation and key expiry.
Redis Cluster routes events across multiple primaries.

In many systems, Redis sees a change before PostgreSQL ever commits it.

That raised an interesting question:

Can Redis become a first-class Change Data Capture source?

The obvious answer is "yes."

The interesting answer is "yes—but not in the same way PostgreSQL does."

That distinction eventually became cdc-redis-pro, a commercial Redis source driver for the Ruby CDC ecosystem.

This article isn't a product announcement.

It's an engineering write-up about the architectural decisions behind the project, the tradeoffs Redis forces you to make, and the execution model that ultimately emerged.

Redis Doesn't Have One CDC Interface

One misconception I frequently encounter is the assumption that Redis has an equivalent of PostgreSQL's WAL.

It doesn't.

Instead, Redis exposes several completely different mechanisms for observing change.

Source	Delivery	Replay
Streams	At-least-once	Yes
Pub/Sub	At-most-once	No
Sharded Pub/Sub	At-most-once	No
Keyspace Notifications	At-most-once	No

At first glance they all look like "events."

Operationally they're completely different systems.

Streams are durable.

Pub/Sub isn't.

Keyspace notifications exist primarily as operational signals.

Sharded Pub/Sub introduces routing constraints that don't exist elsewhere.

Treating them all as the same abstraction inevitably hides important guarantees—and hidden guarantees eventually become production incidents.

Instead of pretending every Redis source behaves identically, I wanted the API to expose those differences explicitly.

If a source cannot replay missed messages, the API should say so.

If a reconnect creates a loss window, operators should know exactly when it happened.

Infrastructure software shouldn't hide reality.

It should make reality easier to reason about.

Redis and PostgreSQL Solve Different Problems

A common question is:

"If Redis can generate change events, why not replace PostgreSQL CDC entirely?"

Because they solve different problems.

PostgreSQL's WAL is the durable history of your system.

Redis is often the earliest signal that something is happening.

One tells you what committed.

The other tells you what is happening right now.

They're complementary.

Not competing.

Conceptually, I think about them like this:

                    PostgreSQL WAL
                          │
                          ▼
                 Durable Record of Truth

Redis Streams / PubSub / Keyspace
              │
              ▼
        Fast Operational Signal

The goal isn't choosing one over the other.

The goal is allowing both to participate in the same downstream processing pipeline.

That required another architectural boundary.

A Common Language for Change Events

One of the design goals of the broader CDC ecosystem is that downstream processors shouldn't care where an event originated.

Whether a change comes from PostgreSQL logical replication or Redis Streams, the downstream processing model should remain identical.

That boundary is CDC::Core::ChangeEvent.

Instead of exposing PostgreSQL-specific or Redis-specific payloads to processors, each source is normalized into a common event model.

Conceptually the pipeline looks like this:

                PostgreSQL WAL
                     │                     
                pgoutput-client
                     │
                     ▼
                 ChangeEvent
                       ▲
                       │
                 cdc-redis-pro
                       │
        Streams / PubSub / Keyspace

Everything downstream consumes the same normalized event.

A webhook processor doesn't need to know whether the event came from WAL or Redis.

A search indexing pipeline doesn't care.

An audit sink doesn't care.

Even the execution runtime doesn't care.

That separation between source acquisition and event processing became one of the defining architectural decisions of the ecosystem.

As the project grew, it became clear that acquiring events efficiently and processing them efficiently are two different problems—and they scale independently.

That realization eventually led to a separate execution engine: cdc-orchestrator-pro.

We'll come back to that shortly.

First, let's look at what makes each Redis source fundamentally different.

Redis Isn't One Event System. It's Four.

The first surprise when building a Redis CDC source is that there isn't a single Redis change stream.

There are four.

Each has different delivery guarantees.

Each behaves differently during failures.

Each recovers differently after reconnects.

And each answers a different operational question.

Treating them as interchangeable would have made the implementation simpler—but it also would have hidden the exact information operators need during production incidents.

Instead, cdc-redis-pro embraces those differences.

Redis Streams: The Durable Path

Redis Streams is the closest thing Redis has to a traditional CDC source.

Messages are persisted.

Consumers maintain checkpoints.

Consumer groups coordinate work.

Failed consumers leave pending entries behind for recovery.

In many ways, Streams feels familiar to anyone coming from Kafka or PostgreSQL logical replication.

That made it the natural foundation for the recoverable side of the driver.

The Streams implementation supports:

XREAD
XREADGROUP
Consumer Groups
Pending-entry inspection
XAUTOCLAIM
Duplicate suppression
Optional dead-letter streams

Operationally, Streams is the only Redis source that provides genuine replay.

If a downstream worker crashes halfway through a batch, processing resumes from the last committed checkpoint rather than silently dropping work.

Conceptually, it looks like this:


             Producer
                │
                ▼
          Redis Stream
                │
          Consumer Group
                │
                ▼
          cdc-redis-pro
                │
           ChangeEvent
                │
                ▼
         Downstream Runtime

This is the strongest consistency story Redis offers.

It isn't PostgreSQL's WAL—but it isn't trying to be.

It's a durable event log designed for application-level workflows.

Pub/Sub: Fast, But Ephemeral

Pub/Sub solves a completely different problem.

Messages exist only while subscribers are connected.

Disconnect for five seconds.

Those five seconds are gone forever.

That isn't a bug.

It's the contract.

Many libraries attempt to hide this by automatically reconnecting.

The problem is that reconnecting doesn't recover missed messages.

It only resumes receiving future ones.

Pretending otherwise creates false confidence.

Instead, cdc-redis-pro treats Pub/Sub as an explicitly at-most-once source.

Reconnects are measured.

Loss windows are reported.

Operators can immediately see:

when the disconnect occurred,
how long the subscriber was offline,
and exactly where message loss became possible.

That distinction matters.

Infrastructure software shouldn't promise guarantees the underlying system doesn't provide.

Sharded Pub/Sub Changes the Topology

Redis Cluster introduces another variation.

Sharded Pub/Sub distributes channels across multiple primaries.

That improves scalability, but it also means subscriptions become topology-aware.

A reconnect isn't always reconnecting to the same node.

During resharding, ownership of a channel may move entirely.

Handling that correctly requires continuously tracking cluster topology rather than assuming a fixed server layout.

The driver automatically discovers topology through CLUSTER SHARDS and transparently rebinds subscriptions as ownership changes.

To downstream processors, events continue arriving normally.

To operators, topology changes remain observable.

Keyspace Notifications Aren't Really CDC

Keyspace notifications are probably the easiest Redis feature to misunderstand.

They're incredibly useful.

They're also incredibly easy to misuse.

Keyspace notifications exist to announce that Redis itself performed an operation:

a key expired,
a value changed,
a key was deleted,
a hash was updated.

They're operational signals.

They're not durable history.

They're not replayable.

And by the time you receive an expiration notification, the value may already be gone.

That's simply how Redis works.

Rather than pretending every notification contains complete information, the driver offers optional best-effort value enrichment whenever the value still exists.

If it doesn't, the event still proceeds.

The guarantee remains explicit.

Delivery Guarantees Should Stay Visible

One design principle shaped almost every API in the project.

I didn't want to normalize away delivery semantics.

Instead, I wanted them to remain visible all the way to the operator.

Think of it like a database transaction.

You wouldn't want a library to silently convert an eventually-consistent operation into something that merely looks transactional.

The same idea applies here.

Different Redis sources have different operational characteristics.

The API should preserve them.

That philosophy can be summarized like this:

Source	Replay	Delivery	Typical Use
Streams	✓	At-least-once	Durable workflows
Pub/Sub	✗	At-most-once	Live events
Sharded Pub/Sub	✗	At-most-once	Cluster-scale broadcasts
Keyspace Notifications	✗	At-most-once	Operational signals

None of these are "better."

They're simply optimized for different workloads.

Topology Matters More Than Features

Supporting Redis isn't just about supporting commands.

It's about supporting deployments.

A surprising amount of complexity came not from Streams or Pub/Sub themselves, but from the environments they run in.

The driver currently supports:

Standalone Redis
Redis Sentinel
Redis Cluster
TLS
ACL authentication

Cluster support turned out to be particularly interesting.

Streams must remain within a single hash slot.

Cross-slot reads fail.

Pub/Sub subscriptions migrate during resharding.

Connections disappear during primary failover.

Those aren't edge cases.

They're normal operating conditions in production.

Every supported topology is continuously exercised using Docker-based integration tests covering failover, node restarts, resharding, authentication, and TLS.

I wanted the implementation to reflect how Redis is actually deployed—not just how it behaves on a laptop.

Acquiring Events Is Only Half the Problem

By this point, the source layer was capable of reliably acquiring events from every major Redis deployment model.

The next question became much harder.

How do you process them efficiently?

One worker?

Ten workers?

Hundreds?

How do you preserve ordering where it's required while still exploiting modern Ruby's parallelism?

It turned out that reading events from Redis wasn't the difficult part.

Scheduling what happened after they were read became the real engineering challenge.

That challenge eventually became HybridRuntime, the execution engine inside cdc-orchestrator-pro.

And surprisingly, the solution wasn't built around threads.

It was built around ownership.

The Architecture I'm Most Proud Of

Surprisingly, reading events from Redis wasn't the hardest part of the project.

Scheduling what happened after those events arrived was.

Modern Ruby gives us two powerful concurrency primitives:

Ractors for parallel CPU execution
Fibers for concurrent I/O

Most systems choose one.

I wanted both.

That eventually became HybridRuntime, the execution engine inside cdc-orchestrator-pro.

Its job isn't tied to Redis.

Redis simply happened to be the workload that exposed the problem first.

Event Acquisition and Event Processing Are Different Problems

One architectural realization changed the direction of the project.

Reading events from a source and processing those events are two completely different concerns.

They're limited by different bottlenecks.

They scale independently.

A PostgreSQL logical replication connection is fundamentally serial.

A Redis Stream consumer is similarly constrained.

But once an event has been acquired and normalized into a CDC::Core::ChangeEvent, downstream processing becomes embarrassingly parallel.

That naturally separates the pipeline into two halves.

                    Source Layer
                         │
         PostgreSQL WAL / Redis Streams
                         │
                         ▼
                CDC::Core::ChangeEvent
                         │
                         ▼
                  Execution Layer

Once an event reaches the execution layer, its origin no longer matters.

Redis.

PostgreSQL.

A future Kafka adapter.

A future S3 replay.

The runtime simply processes ChangeEvent.

That separation turned out to be one of the most valuable architectural decisions in the ecosystem.

HybridRuntime

HybridRuntime combines two existing execution engines from the CDC ecosystem.

cdc-parallel provides pools of prewarmed Ractors for true CPU parallelism.
cdc-concurrent provides asynchronous Fiber pools for overlapping I/O within each Ractor.

Together they form a nested execution model.

                 HybridRuntime
                        │
        ┌───────────────┴───────────────┐
        ▼                               ▼
  Ractor Pool                    Ractor Pool
        │                               │
        ▼                               ▼
   Fiber Pool                     Fiber Pool
        │                               │
        ▼                               ▼
Redis Connections              Redis Connections

The interesting observation is that parallelism and concurrency solve different problems.

Ractors increase throughput by executing work simultaneously.

Fibers increase throughput by avoiding idle time while waiting for I/O.

The runtime deliberately uses both.

The Inception Pool

As the architecture evolved, I noticed something amusing.

Every layer owned another pool.

The runtime owns a pool of Ractors.

Each Ractor owns a LocalPool.

Each LocalPool owns a pool of Fibers.

Each Fiber owns a live Redis connection.

It looked like this:

HybridRuntime
     │
     ▼
Prewarmed Ractor Pool
     │
     ▼
LocalPool
     │
     ▼
Fiber Pool
     │
     ▼
Redis Connections

Internally I started calling it the Inception Pool.

A pool containing pools containing pools.

The name stuck.

Ownership Instead of Synchronization

Most concurrent systems solve shared state by protecting it.

Threads
  │
  ▼
Mutex
  │
  ▼
Shared Connection Pool

The more workers you add, the more frequently they compete for the same resources.

Locks become unavoidable.

HybridRuntime takes a different approach.

Instead of synchronizing ownership...

...it avoids sharing ownership entirely.

Every Redis client is created inside the Ractor that will use it.

It never leaves that Ractor.

Nothing is borrowed.

Nothing is shared.

Nothing requires a mutex.

Conceptually it looks like this.

Ractor 1
   │
   ├── Redis Connection A
   ├── Redis Connection B
   └── Fiber Scheduler

Ractor 2
   │
   ├── Redis Connection A
   ├── Redis Connection B
   └── Fiber Scheduler

The only thing that crosses a Ractor boundary is an immutable ChangeEvent.

Everything else remains local.

This aligns naturally with Ruby's ownership model.

Mutable state belongs somewhere.

Rather than fighting that constraint, the runtime embraces it.

Why LocalPool Exists

That ownership model eventually led to another component: Ratomic::LocalPool.

Unlike traditional connection pools, a LocalPool isn't shared across Ractors.

The facade itself is shareable.

The resources are not.

Each Ractor lazily constructs its own pool the first time it needs one.

Shareable LocalPool
          │
          ├────────────┐
          ▼            ▼
     Ractor A     Ractor B
          │            │
     Local Pool   Local Pool
          │            │
     Redis Conn   Redis Conn

This turns out to be a remarkably natural fit for long-lived resources:

Redis clients
PostgreSQL connections
HTTP clients
Elasticsearch clients
Kafka producers

The resource stays exactly where it was created.

The work moves.

Not the connection.

Two Independent Scaling Axes

Another consequence of this architecture is that acquisition and processing no longer have to scale together.

Suppose a Redis deployment only needs three acquisition workers.

That says nothing about how many processing workers you need.

You might run:

Acquisition

3 Ractors
5 Fibers each

↓

Processing

7 Ractors
20 Fibers each

Each side can be tuned independently.

Adding more downstream workers doesn't require opening additional Redis Streams.

Adding more source readers doesn't require changing the execution topology.

The two halves of the pipeline evolve independently.

That separation proved invaluable during benchmarking because it exposed where the real bottlenecks actually lived.

Beyond Redis

One realization surprised me.

HybridRuntime wasn't solving a Redis problem.

It was solving an event-processing problem.

Redis happened to be the first source.

The same execution model works for:

PostgreSQL logical replication
Redis Streams
Webhook delivery
Search indexing
Object storage sinks
Future Kafka adapters
Future message brokers

Anything capable of producing a CDC::Core::ChangeEvent automatically inherits the same execution engine.

That ultimately justified extracting the runtime into its own commercial component: cdc-orchestrator-pro.

Originally it lived inside another project.

Eventually it became obvious that it wasn't a Redis runtime.

It wasn't a Sidekiq runtime.

It wasn't even a PostgreSQL runtime.

It was an execution fabric for normalized change events.

Redis simply happened to be the benchmark that inspired it.

Parallelism Isn't Free

One thing the benchmarks made very clear is that parallelism isn't magic.

Adding more Ractors doesn't produce linear speedups.

It introduces coordination costs.

Partition routing.

Mailbox communication.

Ordering constraints.

Preserving correctness means accepting those costs.

Understanding where those tradeoffs appear became just as interesting as the throughput numbers themselves.

Let's look at what those benchmarks actually measured.

Where This Actually Fits

After spending so much time discussing architecture, it's worth asking a simple question.

Who actually needs this?

The honest answer is:

Not every Rails application.

If Redis is simply a cache sitting beside your database, this project is probably unnecessary.

Likewise, if every important state transition already commits to PostgreSQL before anything else happens, PostgreSQL logical replication alone may be all the CDC infrastructure you need.

cdc-redis-pro exists for a much narrower class of systems.

Systems where Redis is part of the application's event architecture rather than merely its cache.

Redis Streams as an Event Bus

This is probably the most natural fit.

Many distributed systems already use Redis Streams as their internal event bus.

Order Service
    │
    ▼
Redis Stream
    │
    ▼
Consumers

Once Redis becomes the place where work is coordinated, durability suddenly matters.

Consumers crash.

Deployments restart.

Networks partition.

A consumer needs to know where to resume.

Redis Streams already provides those building blocks.

Consumer Groups.

Pending Entries.

Checkpoint IDs.

XAUTOCLAIM.

The job of cdc-redis-pro isn't replacing those mechanisms.

It's integrating them into a larger event-processing pipeline while preserving their semantics.

Fast Signals Before Durable State

Many systems generate transient events before anything reaches PostgreSQL.

Examples include:

inventory availability
market data
IoT telemetry
collaborative editing
multiplayer game state

These events often exist for milliseconds.

Some are never intended to become permanent records.

Waiting for a database commit before reacting introduces unnecessary latency.

Redis already has the signal.

The application simply needs a reliable way to observe it.

That's exactly where Redis becomes a valuable CDC source.

Not because it replaces the database.

Because it observes change sooner.

Redis and PostgreSQL Together

The architecture becomes much more interesting when both sources exist simultaneously.

Imagine an order-processing pipeline.

Customer clicks Buy
       │
       ▼
Redis Stream
       │
 Immediate downstream processing
        │
 PostgreSQL Transaction
        │
        ▼
 Logical Replication

Redis carries the operational signal.

PostgreSQL records the durable history.

Eventually both become the same normalized object.

Redis Streams
        │
        ▼
   ChangeEvent
        ▲
        │
PostgreSQL WAL

Once normalized, downstream processing becomes identical.

That separation allows each technology to do what it does best.

Redis optimizes for responsiveness.

PostgreSQL optimizes for durability.

Neither replaces the other.

Event Processing Shouldn't Care About the Source

One of the design goals of the CDC ecosystem is that processors shouldn't know—or care—where an event originated.

A webhook dispatcher shouldn't behave differently because the event came from Redis instead of PostgreSQL.

Neither should:

search indexing
audit sinks
analytics
cache invalidation
AI pipelines
object storage
future message brokers

Every processor consumes exactly the same event model.

Redis
   │
   ▼
 ChangeEvent
       ▲
       │
 PostgreSQL
      │
      ▼
 Processor
        │
 ┌──────┼────────┬────────┬────────┐
 ▼      ▼        ▼        ▼        ▼

Webhook Search  Audit   Redis   Future...

That separation is what allows the runtime to remain completely source-agnostic.

Ordered Workloads

Not every workload benefits equally from parallelism.

Suppose an application updates customer balances.

+100
-20
+15

Processing those out of order would produce incorrect state.

Ordering matters.

Other workloads don't have that constraint.

Search indexing.

Webhook fan-out.

Telemetry aggregation.

Independent cache updates.

Those can often execute concurrently.

One of the runtime's responsibilities is recognizing that not every processor requires the same ordering guarantees.

Correctness always comes first.

Throughput comes second.

Why Not Just Use Sidekiq?

This is probably the question Ruby developers ask most often.

After all, Sidekiq already provides a robust distributed job system.

The answer is that jobs and change streams solve different scheduling problems.

A job queue answers:

"What work should execute next?"

A CDC runtime answers:

"How should related events flow through the system while preserving their correctness?"

Those are similar questions.

They're not the same question.

Jobs are independent.

Change events frequently aren't.

Ordering.

Checkpoints.

Replay.

Transaction boundaries.

Partition routing.

Those become first-class concerns in CDC systems.

Rather than replacing Sidekiq, the runtime sits at a different layer.

Sidekiq remains an excellent execution engine for background jobs.

HybridRuntime focuses on ordered event pipelines.

The two complement one another rather than compete.

Lessons Learned

Building cdc-redis-pro changed how I think about event-driven systems.

A few observations kept appearing throughout development.

Redis isn't PostgreSQL.

Trying to force Redis into a WAL-shaped abstraction usually hides important operational behavior.

Delivery guarantees matter more than APIs.

Two systems exposing similar methods may have completely different recovery characteristics.

Ownership scales better than synchronization.

Keeping mutable resources inside a single Ractor proved simpler than sharing them across many workers.

Acquisition and processing are independent problems.

The bottleneck for reading events is rarely the same bottleneck for processing them.

Treating those concerns separately made both architectures significantly cleaner.

Most importantly...

Infrastructure shouldn't hide tradeoffs.

It should make them explicit.

That's the philosophy behind the entire project.

The benchmark results ended up reflecting exactly those design decisions.

What the Benchmarks Actually Mean

Benchmark numbers are easy to misunderstand.

They're also surprisingly easy to exaggerate.

I wanted to avoid both.

Rather than publishing a single headline number, I built a benchmark matrix that explored how the runtime behaves under different execution strategies.

The goal wasn't to find the biggest number.

The goal was to understand where the architecture stops scaling—and why.

Measuring Different Parts of the Pipeline

Not every benchmark measures the same thing.

Some benchmarks measure source acquisition.

Others measure downstream execution.

Others measure the orchestration layer itself.

Treating those numbers as interchangeable would be misleading.

I ended up thinking about the benchmarks as three different phases.

Redis Source
      │
      ▼
ChangeEvent Acquisition
      │
      ▼
HybridRuntime
      │
      ▼
Downstream Sink

Each phase has different bottlenecks.

Acquisition is constrained by Redis.

Processing is constrained by CPU, I/O latency, ordering requirements, and scheduling overhead.

Understanding which phase you're measuring is more important than the final throughput number.

The Synthetic Benchmark

The largest number observed was approximately 54,500 events per second.

That's intentionally not presented as an end-to-end Redis benchmark.

It measures the execution capacity of the orchestration layer after events have already been acquired.

In other words:

ChangeEvent
      │
      ▼
HybridRuntime
      │
      ▼
Processor

This benchmark answers a very specific question:

"How quickly can the runtime schedule and execute already-available work?"

That's useful.

It just isn't the same as measuring an entire Redis pipeline.

End-to-End Pipelines

Real systems spend time doing real work.

Reading from Redis.

Writing to PostgreSQL.

Calling HTTP services.

Updating search indexes.

Those operations introduce latency that no scheduler can eliminate.

When measured end-to-end, the results naturally become lower.

Current peak observations include:

Redis Streams → Runtime: approximately 17,600 events/sec
PostgreSQL WAL → Redis: approximately 20,000 events/sec

Those numbers include actual I/O rather than isolated scheduling.

Personally, I find them more interesting than the synthetic benchmark because they reflect complete pipelines.

Scaling Isn't Linear

One result immediately stood out.

Adding more Ractors did not produce proportional speedups.

That's exactly what I expected.

Parallelism always introduces coordination costs.

Events must be routed.

Partitions must remain consistent.

Workers communicate through Ractor mailboxes.

Ordering constraints occasionally delay otherwise-complete work.

The runtime spends part of its time doing useful work...

...and part of its time coordinating that work.

That coordination isn't overhead to eliminate.

It's the cost of preserving correctness.

The benchmark matrix made those tradeoffs visible.

Rather than chasing perfect scaling, the goal became identifying the point where additional parallelism stopped producing meaningful throughput gains.

For the current implementation, that sweet spot consistently appeared around:

3 prewarmed Ractors
5 Redis connections per Ractor
50 Fibers

That balance delivered high throughput without introducing excessive scheduling overhead.

Ordering Has a Cost

One benchmark compared ordered and unordered execution.

The difference wasn't dramatic.

Ordered execution consistently performed slightly slower.

That's expected.

Maintaining ordering means the runtime occasionally waits for earlier work to complete before later work can safely continue.

Event 1
Event 2
Event 3

cannot become:

Event 2
Event 3
Event 1

simply because Event 2 happened to finish first.

Preserving correctness sometimes requires sacrificing a little throughput.

That's a tradeoff I consider worthwhile.

Correctness scales better than debugging race conditions.

The Interesting Bottleneck

The benchmark wasn't really about Redis.

It was about coordination.

At low parallelism, workers spend most of their time processing events.

At high parallelism, workers spend increasingly more time coordinating with one another.

Eventually another Ractor contributes more scheduling overhead than useful work.

Finding that point was considerably more valuable than finding the largest throughput number.

It answered a much more practical question:

"How should I actually configure this in production?"

Chaos Matters More Than Throughput

Raw throughput is only one characteristic of an event pipeline.

Recovery behavior is arguably more important.

The benchmark suite includes failure scenarios covering:

Redis restarts
PostgreSQL restarts
connection interruption
checkpoint recovery
consumer recovery

Streams resumed processing from checkpoints.

Pub/Sub sources reported explicit loss windows.

Recovery behavior remained consistent with each source's documented guarantees.

That consistency mattered more to me than achieving another few thousand events per second.

Long-Running Stability

Short benchmarks rarely expose operational problems.

Memory leaks.

Connection exhaustion.

Scheduler starvation.

Queue growth.

Those usually appear over time.

The runtime was therefore exercised continuously using soak tests.

One representative run processed approximately 1.34 million events over five minutes.

No processing failures were observed.

Median throughput degraded by roughly 2% over the duration of the run.

That's encouraging, although much longer overnight and multi-day soak tests remain on my roadmap.

Operational confidence comes from sustained behavior—not just impressive graphs.

What I Learned

Perhaps the most surprising outcome of the benchmarking work was this:

The execution runtime wasn't the limiting factor.

The limiting factor was almost always the surrounding system.

Network latency.

Redis.

HTTP endpoints.

Disk.

Database writes.

The scheduler spent most of its time waiting for external systems.

That reinforced one of the central architectural decisions behind HybridRuntime.

Fibers overlap waiting.

Ractors overlap computation.

Neither attempts to eliminate latency.

They simply ensure latency in one part of the system doesn't unnecessarily stall everything else.

The result isn't infinite scalability.

It's predictable scalability.

And for infrastructure software, predictability is usually the more valuable property.

The complete benchmark reports—including raw CSV data, SVG charts, chaos-recovery artifacts, and soak-test results—are published alongside the documentation.

I'd much rather readers inspect the raw data than rely on a single headline number.

Benchmarks are most useful when they're reproducible.

What's Next

cdc-redis-pro is only one piece of a much larger ecosystem.

The long-term goal was never to build "yet another Redis client."

The goal was to build a source-agnostic Change Data Capture platform for Ruby.

Today, PostgreSQL logical replication and Redis happen to be the two primary sources.

Tomorrow, that could just as easily include:

Kafka
NATS
Amazon SQS
Webhooks
Object storage
Search indexes
Other databases

The important observation is that the runtime doesn't need to change.

As long as a source can be normalized into a CDC::Core::ChangeEvent, everything downstream already knows how to process it.

That was the motivation behind separating source acquisition from execution.

        Source
           │
           ▼
   CDC::Core::ChangeEvent
           │
           ▼
    cdc-orchestrator-pro
           │
           ▼
        Processors

Every new source becomes an adapter.

Not a new runtime.

Why Split the Runtime?

One architectural decision deserves a brief explanation.

Originally the execution engine lived inside another project.

As the ecosystem evolved, I realized something important.

The runtime wasn't solving a Redis problem.

It wasn't solving a PostgreSQL problem.

It wasn't even solving a Sidekiq problem.

It was solving an event-processing problem.

That realization led to extracting the execution engine into its own commercial component:

cdc-orchestrator-pro

Today it powers Redis CDC.

Tomorrow it can power any source capable of producing normalized change events.

Separating those concerns keeps both halves of the system simpler.

Source adapters acquire events.

HybridRuntime processes them.

Each evolves independently.

Open Source First

Although cdc-redis-pro and cdc-orchestrator-pro are commercial products, the ecosystem they're built upon remains open source.

That includes:

cdc-core
cdc-parallel
cdc-concurrent
pgoutput-client
pgoutput-parser
pgoutput-decoder
Mammoth

Those projects define the common event model, execution primitives, and PostgreSQL integration that everything else builds upon.

The commercial components focus on operational capabilities rather than replacing the open-source foundation.

That separation is intentional.

I believe infrastructure ecosystems become valuable through adoption and trust—not artificial feature restrictions.

Looking Ahead

Redis replication remains one of the larger pieces still on the roadmap.

Today, cdc-redis-pro consumes Redis event sources such as Streams, Pub/Sub, and Keyspace Notifications.

A future version will move further upstream by treating Redis itself as a replication source.

That's a significantly more ambitious problem.

I'd rather stabilize the current architecture before expanding its scope.

There are also areas where I think the execution engine itself can continue to improve.

Adaptive scheduling.

Smarter partition routing.

Better observability.

Long-running soak tests.

More topology-aware execution.

Those improvements belong to the runtime rather than any particular source adapter—which is exactly why separating acquisition from execution turned out to be such a useful architectural boundary.

Final Thoughts

I started this project thinking I was building Redis CDC.

Somewhere along the way I realized I was really building an execution model.

Redis happened to expose the problem first.

PostgreSQL reinforced it.

Future source adapters will probably validate it again.

The most interesting lesson wasn't about Redis at all.

It was this:

Acquiring events and processing events are different problems.

They have different bottlenecks.

They scale differently.

They deserve different architectures.

Once those responsibilities are separated, the rest of the system becomes remarkably composable.

Redis becomes another source.

PostgreSQL becomes another source.

Tomorrow's adapters become just that—adapters.

The runtime stays the same.

For me, that's the most exciting part of the entire project.

Not because it produced the largest benchmark numbers.

Not because it uses Ractors or Fibers.

But because it led to an architecture that's easier to reason about, easier to extend, and honest about the tradeoffs of the systems it builds upon.

The benchmark reports are public.

The documentation is public.

The implementation is commercial.

If you're building event-driven systems in Ruby—or you're wrestling with Redis and PostgreSQL in the same architecture—I'd genuinely love to hear how you're approaching those problems.

I'm convinced there's still a lot left to explore.

Security Profiles Operator hits v1 with stable APIs and a hardening pass

Leo — Sat, 27 Jun 2026 00:24:52 +0000

After several years carrying a beta tag, the Kubernetes Security Profiles Operator went 1.0.0 on June 26, freezing eight CRD APIs and clearing a third-party security audit with no criticals. For cluster admins, the practical effect is small but consequential: the syscall and LSM profile a workload runs under is now declared on APIs that will not move under your feet.

The release was announced by Sascha Grunert of Red Hat on the CNCF blog. SPO is the Kubernetes operator that manages seccomp, SELinux and AppArmor profiles as cluster-scoped objects, then attaches them to pods. Until now the value proposition was good and the API was provisional. v1.0.0 nails the second half down.

What's actually stable

All eight CRDs graduated to v1, including SeccompProfile, ProfileRecording, SelinuxProfile, RawSelinuxProfile, and the AppArmor profile type. Conversion webhooks ship with the release, so a cluster running earlier API versions can roll forward without scheduling downtime. The older versions remain available and are slated for removal in a future release. The migration is on the clock, not on fire.

The audit pass came with some shape changes that are worth reading before you upgrade. SelinuxProfile swapped its boolean permissive field for a mode enum with Enforcing and Permissive values, which means any GitOps templates that hard-coded permissive: true need a rewrite. RawSelinuxProfile is now gated by an enableRawSelinuxProfiles configuration flag and a validating admission webhook, so the most privileged path through the operator is off by default. AppArmor inputs run through strict regex validation, raw policy payloads are capped at 500 KB, and the eBPF profile recorder picked up explicit resource limits.

Why a cluster team should care

The point of an operator like this is to take the profile out of the host's filesystem and into the API. That changes the blast radius of "we shipped a container with no profile at all." With SPO and a workload-attached profile, the runtime gets the rules from the cluster, the cluster gets the rules from Git, and a rollback is a kubectl apply away. Without it, a profile change usually means a node image bake, which is slower and harder to undo at 3am.

The v1.0.0 line means platform teams can adopt SPO without writing "subject to API churn" in the runbook. In most platform-engineering shops that was the gating concern, not the technology.

The bet still on the table

KEP 6061, OCI Artifact-Based Security Profile Distribution, is proposed for an upcoming Kubernetes release as alpha. The idea is that kubelet itself learns to fetch profiles from an OCI registry, the same way it already pulls images. If that lands, the cluster no longer needs SPO to deliver the profile to the node; the operator stays useful for authoring and lifecycle, but the distribution path moves into core.

This is the right direction for reliability. It collapses two control planes into one. It also means SPO's own value proposition will sit closer to "compiler and CI for profiles" than "runtime mover of files." Worth tracking before you commit to a long-term operator deployment plan.

Watch list for the upgrade

A few rough edges. The permissive to mode rename is a silent break for any template that does not run through the conversion webhook before write. The default-off posture on RawSelinuxProfile is correct, but teams that relied on it must explicitly opt back in. The upstream KEP is alpha, not GA, so SPO's role in production stays where it is for at least two Kubernetes release cycles.

How other projects approach the same surface is fragmented. Falco watches behavior at runtime and alerts. SPO writes the rule that would block the behavior in the first place. Kyverno and Gatekeeper can enforce that a pod has a profile, but they neither author nor distribute one. The combination most platform teams end up running is SPO for the rules, an admission policy engine for the "must have a profile" guard, and a runtime detector for what slips through.

v1.0.0 makes the SPO half of that stack stable enough to depend on. The KEP decides whether its role grows or shrinks from there.

Why I Built a Tiny Repeated-Game Poker Analysis Tool

ty215 — Sat, 27 Jun 2026 00:23:00 +0000

Most poker solvers answer one question very well: given a single hand and a single decision tree, what is the equilibrium strategy? (Yes, there is subgame solving, node locking, and plenty more — but the default frame is still one hand, one equilibrium.)

I kept getting stuck on a different one. What if the same kind of spot shows up over and over, and a player can commit to a fixed strategy across those repetitions? In a few toy games I had a hunch, worked out by hand, that committing to a fixed strategy could change its value relative to the one-shot picture. I wanted a tool that could make that commitment value precise — to actually analyze it rather than just believe it. (Whether any of this rises to a repeated-game equilibrium is a much stronger claim, and one I am deliberately not making here.)

I'm still learning software engineering, so until recently I couldn't implement this — I was stuck reasoning about toy games on paper. AI tooling made the analysis feasible, so I finally started building it: repeated-poker-analysis.

It's a small research project: write one narrow model down, run small examples, and record what the model does and doesn't justify.

What `repeated-poker-analysis` is

It is an experimental Python toolkit for small abstract poker games. The current MVP covers:

fixed Hero commitment candidates,
exact Villain best-response diagnostics in small finite trees,
candidate generation and filtering,
T_deadline, an economic adaptation deadline,
local T_detect, an observable-distribution sensitivity estimate,
analysis reports and Markdown summaries.

It is small on purpose. It is not a full solver and it is not wired to real solver ranges. It starts from one toy game — a river spot — that is tiny enough to inspect and test by hand.

That toy spot is one where showdown always chops but rake still bites. In a single-hand view, putting more money into a raked pot can be locally unattractive. Across repeated occurrences the same spot raises a commitment question: if one player refuses to fold in a fixed pattern, how does the other respond, and how fast would that response have to come for the commitment to stop being worth it?

This is the question I wanted a tool to make precise — not a claim that any new equilibrium exists.

Why repeated poker is tempting — and where the trap is

Repeated games sound like a natural home for reputation, punishment, and adaptation, and poker has obvious repeated structure: similar river spots, similar blind-vs-blind situations, similar sizings, similar pools.

Here is the trap I had to respect. If the number of repetitions is known, the game is fully observed, each spot is independent, and both players are perfectly rational, then a finite repeated game often collapses back toward the one-shot equilibrium by backward induction. "This spot happens five times" is not by itself enough to claim a reputation equilibrium. That is the standard game-theory result, and it is the reason the project keeps the layers below separate.

So the project keeps several ideas apart that are easy to blur:

a one-hand baseline strategy,
a fixed Hero commitment candidate,
Villain's exact best response to that fixed Hero strategy,
an economic deadline for adaptation,
a local estimate of how visible the change is,
and the much stronger claim of a repeated-game equilibrium.

The MVP mostly lives in the commitment-analysis layer: if Hero is fixed to a candidate strategy in the supplied tree, what are Villain's exact best responses, and what happens to Hero EV under conservative tie handling?

What the MVP can do

(This describes the MVP on main at the time of writing. I'm still changing it, so details may move.)

It runs an end-to-end candidate-analysis pipeline on a small abstract game:

build tiny finite two-player game trees with rake,
evaluate fixed Hero and Villain mixed strategies,
enumerate exact Villain best responses for small trees,
report Hero EV under worst- and best-case Villain best-response tie rules,
generate simple Hero candidates from a baseline,
filter candidates before comparison,
compare candidate values against a baseline profile,
compute T_deadline and local T_detect,
render a Markdown summary.

In plain terms, the analysis loop is:

Start from a baseline profile: a fixed Hero strategy and a fixed Villain strategy on the supplied tree. These are action probabilities at each information set — not hand ranges; the tool does not model or import real solver ranges.
Generate Hero candidates by shifting probability between two actions at a single Hero information set (a blind, systematic enumeration of small shifts — not a search aimed at hurting Villain).
For each candidate, lock Hero to it and compute Villain's exact best response. That yields Hero's worst-case EV after Villain adapts.
Flag a candidate robustly_profitable only when that post-response worst-case Hero EV is strictly higher than Hero's EV in the baseline profile. The point is not "positive EV" — it is "still better than the one-shot baseline even after the opponent best-responds."
T_deadline / T_detect then add repeated-game timing on top of the candidates that survive.

The main entry point is run_candidate_analysis_pipeline.

python scripts/check_mvp.py

A simplified workflow:

from nuts_chop_river import build_nuts_chop_river, default_hero_strategy
from candidate_library import baseline_villain_strategy

from repeated_poker import (
    CandidateFilterConfig,
    CandidateGenerationConfig,
    run_candidate_analysis_pipeline,
)

tree = build_nuts_chop_river()
baseline_hero = default_hero_strategy()
baseline_villain = baseline_villain_strategy()

result = run_candidate_analysis_pipeline(
    tree,
    baseline_hero,
    baseline_villain,
    generation=CandidateGenerationConfig(shift_amounts=[0.1, 0.2]),
    horizon=5,
    profit_tolerance=-2.0,
    max_selection_l1_distance=0.3,
    detection_log_likelihood_threshold=3.0,
    detection_occurrence_probability_per_opportunity=0.5,
    filtering=CandidateFilterConfig(
        max_l1_distance=0.3,
        min_required_observations=5,
    ),
)

print(result.markdown_summary)

The output is a diagnostic report for the model you supplied, not a poker recommendation. Here is an excerpt from the actual output of examples/analysis_pipeline.py on the nuts-chop river toy game (I've trimmed the Configurations block and some columns; 8 candidates generated, 6 dropped by the filter, 2 compared):

generated=8 kept=2 excluded=6
compared=2

## Candidate Analysis Summary

### Summary Counts
- total: 2
- eligible: 2
- excluded: 0
- minimum_villain_ev: 1
- pareto_frontier: 2

### Candidate Rows

| candidate_id            | fixed_hero_ev | post_response_hero_ev_worst | robustly_profitable | t_detect | exclusion_reasons |
| ----------------------- | ------------- | --------------------------- | ------------------- | -------- | ----------------- |
| H1\|check->bet\|shift=0.1 | 0.625         | -0.850                      | no                  | 278      | -                 |
| H1\|bet->check\|shift=0.1 | 0.275         | -0.750                      | no                  | 294      | -                 |

The baseline Hero EV in this run is +0.45. The column that matters is robustly_profitable: it is yes only when post_response_hero_ev_worst exceeds that baseline. Here both candidates are no (-0.85 and -0.75 are below +0.45). A candidate that clears the baseline is rare and can exist in constructed cases — the tool's job is to search the candidates and find it when it does. The next section is a hand-built spot where one does.

A toy game where the commitment beats the baseline

I needed at least one example where the machinery clearly does what it is meant to: a known spot where committing to a fixed strategy leaves Hero better off than the one-shot baseline, even after the opponent best-responds. This nuts-chop steal is that example, and I wrote a dedicated test for it (tests/test_nuts_chop_steal_commitment.py). Treat it as a check that the tool can detect the effect at all — not as the end goal, and not as a claim about real games. Outside this constructed spot I do not know which situations, if any, are profitable to commit in.

The spot: a river where the board is already the nuts, so every showdown chops. There is no value betting — the only reason to bet (shove) is fold equity. Rake is below its cap, so a called pot just bleeds chips to the house. With a small starting pot and a big shove, a single hand looks like this:

initial commitment = 1, initial pot = 2, bet = 98, rake = 5%, cap = 4

| Line         | Hero/IP EV | Villain/OOP EV |
|--------------|-----------:|---------------:|
| check-check  |      -0.05 |          -0.05 |
| bet-fold     |      -1.00 |          +1.00 |
| bet-call     |      -2.00 |          -2.00 |

In one hand the caller folds: -1.00 (fold) beats -2.00 (call). So the one-shot subgame answer is OOP bets / IP folds — a pure steal, since the board is a chop and there is no value in betting.

Now lock IP to always call and ask the tool for OOP's exact best response. The steal's only profit source (fold equity) is gone, a called pot is -2.00 for OOP, so OOP's exact best response flips to check — and check-check is -0.05 for both. The test asserts exactly this: solve_exact_response returns {"OOP_river": "check"} once Hero is locked to call.

And crucially, this clears the baseline: Hero's EV goes from -1.00 (the one-shot steal baseline) to -0.05 after OOP adapts — still negative, but strictly better than the baseline, which is exactly the robustly_profitable condition. That is the whole point of the project stated in one example: the one-shot subgame answer (bet/fold) is not the answer under the fixed commitment I wanted to test (check/check). The commitment to call removes the opponent's only incentive to bet. (Whether this constitutes a repeated-game equilibrium is the stronger claim I am deliberately not making — this is a commitment-analysis result, not an equilibrium proof.)

The tool also puts a number on how long that commitment stays worth it. With baseline Hero EV = -1.00 (steal), pre-adaptation = -2.00 (locked call while OOP still bets), post-adaptation = -0.05 (OOP has switched to check), T_deadline comes out as floor(1 + 19N/39):

| N (horizon) | T_deadline |
|------------:|-----------:|
|          10 |          5 |
|          20 |         10 |
|          50 |         25 |
|         100 |         49 |

The honest caveat: this is a tiny, hand-built tree, and the EVs are ones I can check by hand — that is exactly why I trust this result more than anything else in the repo. It is not evidence about real games; it is evidence that the model and the code agree on one constructed example built to validate the effect.

Verification on my machine:

python -m pytest tests/test_nuts_chop_steal_commitment.py -v → 15 passed
python -m pytest -q → 500 passed
python scripts/check_mvp.py → passes
git diff --check → clean

How I worked with AI

I supplied the algorithm and the poker model. Codex wrote the implementation instructions and reviewed the results; Claude Code wrote the code. I checked Codex's prompts and corrected wrong premises, but I did not review the code line by line — I relied on the Codex/Claude review loop and the test suite (currently 500 passing tests).

Two things from that process are worth recording:

The assistant kept drifting toward the general case. For the commitment analysis I wanted Hero fully fixed and only Villain's exact best response computed, but it repeatedly tried to set up CFR — wasted machinery when Hero is fixed. Stopping it led to a side question I hadn't considered: CFR with one side frozen looks like a fixed-environment learning problem. I'm noting that as a question, not a result.
Explaining the toy game to the model was harder than explaining it to a person — it over-generalized and assumed things that don't apply (e.g. Villain value-bets on a board that is already the nuts). I ended up brushing each spec up in a chat first, then handing the cleaned version to the coding agent.

A note on terms the code keeps separate: T_deadline is economic (how late Villain can adapt while the locked policy still beats the baseline); T_detect is visibility (how many local observations before the candidate's action distribution looks distinguishable from baseline). They are different questions.

What I learned

Best-response ties matter. If Villain has several best responses with identical Villain EV, Hero's EV can still differ across them. Returning one arbitrary response would hide that risk, so the MVP reports both ev_h_worst and ev_h_best across the tie set. (Verified: BestResponseResult exposes both and the action variation across optimal pure strategies.)

Small examples are not a weakness. The nuts-chop river benchmark is tiny on purpose: easier to hand-check, harder to mistake for a real-money recommendation.

Current limitations

The main one: the code has not had an independent human code review. Tests pass, but I haven't read the implementation line by line and nobody else has either. Rather than rely on reading the code, I plan to validate it from the outside — design the verification to be as exhaustive as I can make it, run simulations across many configurations, and check that the results hold up. Whether static or property-based checking can give that coverage is something I'm still working out.

The narrower limits: it is not a full solver, does not import real solver ranges yet, does not solve large no-limit games, and does not do STT / ICM / preflop push-fold yet. The exact response engine enumerates Villain pure strategies, so it is meant for small abstract trees only — there is an explicit max_pure_strategies ceiling, default 100,000. Candidate generation is simple: finite shifts from a baseline, not a continuous strategy space.

Most importantly: positive EV inside this model does not guarantee profitable play. The model can be wrong if the abstraction, action tree, rake rule, ranges, or adaptation assumptions are wrong.

This is not gambling, bankroll, financial, or legal advice.

Next steps

The toy game confirmed the effect, so next I want to extend the tool: analyze with hand ranges rather than abstract action probabilities, and model the opponent adapting gradually (e.g. a Bayesian update of their response over repetitions) instead of switching to an exact best response in one step. Alongside that, I want to firm up the outside-in verification described above before trusting results on new spots.

Links

Repository: guriguri215-lang/repeated-poker-analysis
MVP walkthrough: docs/mvp_walkthrough.md
Assumptions and limitations: docs/assumptions_and_limitations.md
Publication policy: docs/publication_policy.md

Disclosure: I used AI assistance throughout this project and to draft this article. The division of labor was deliberate: I supplied the algorithm and the poker model, Codex handled instructions and review, and Claude Code wrote the code; I checked the prompts and relied on automated review and tests for the implementation. This article was also drafted with AI help and then rewritten to reflect my own decisions, mistakes, and open questions. Technical claims are marked where I have verified them against the code myself; where I say something is provisional or unreviewed, that is literally true.

Tests Pass, Design Breaks: Why TDD Can't Hold the Line on Design Intent

Sho Naka — Sat, 27 Jun 2026 00:21:34 +0000

There is a popular misconception that if you do TDD, your design also stays correct. That if the tests pass, quality is guaranteed. In AI-assisted development, this misconception is the kind that quietly accumulates — the more tests you have, the more invisible damage builds up underneath.

All tests passed. The design was still broken.

Here is what happened today.

A function called safe_post.py had its signature changed. Two arguments — notify_sh and doctor_sh — were removed. The test suite passed in full.

But the callers were still using the old signature. They were silently broken.

Why did the tests pass? Because the test code itself was using the old signature. The tests had been written (by AI) at a time when the design intent was already misunderstood. The misunderstanding was baked into the tests from the start.

Tests passing and the design being correct are two different things.

"All tests pass" tells you only one thing: the implementation matches what the tests expect. Whether the tests express the right design intent is a separate question.

TDD verifies "implementation against tests" — nothing more

Let me restate the TDD definition.

Red → Green → Refactor. Write a test. Write the implementation that passes the test. Refactor.

In this loop, what the test verifies is whether the implementation meets the test's expectation. That is one verification — and only one.

What TDD does not verify is whether the test itself correctly expresses the design intent.

The structure looks like this:

Design intent  →  Tests  (← this link is not verified)
                    ↓
                  Implementation  (← this link is verified by tests)

If the person writing the tests misunderstands the design intent, the tests will pass and the design will still be wrong. Machine learning engineer Hamel Husain calls this the "Gulf of Specification" — the gap between what you intended to measure and what your metric actually measures. Optimize hard against a flawed metric and you optimize hard in the wrong direction. The same dynamic plays out in TDD.

This is not a critique of TDD. It is a statement that TDD, by its structure, cannot solve this particular problem.

You can't escape the snowball

"Then review the tests," is the natural counter. Yes — but how do you review the review?

Design intent  →  Tests  →  Implementation
                    ↑
               Human reviews (does it express intent?)
                    ↑
               Who reviews the reviewer?
                    ↑
               ... (infinite regress)

The only way out of this snowball is to design a terminator for the review chain. And the terminator must, eventually, be a human.

The problem is that AI accelerates this loop. AI writes the implementation quickly, writes the tests quickly, makes them pass quickly. The faster the AI side moves, the more "is this test expressing intent?" work piles up on the human side. The paradox is sharp: the more you automate, the more confirmation work humans inherit.

Speed and quality, paradoxically

As a Forward Deployed Engineer working on AI adoption in the field, I run into this paradox often. The pattern goes: "AI made our development faster" — then a few weeks later — "but the design is getting more tangled."

When speed goes up, the share of time allocated to design review goes down in relative terms. When the number of tests goes up, the cost of asking "is this test correct?" goes up with it. Use AI without being aware of this, and the speed benefit converts itself into a quality cost.

What happens when you combine AI and TDD

AI is good at writing tests. "Write tests for this code" — a few seconds, and you have a plausible test file.

That is exactly where the problem is.

The tests AI writes tend to be "tests reverse-engineered from the implementation." They describe what the code currently does. This is excellent for verifying "implementation against tests." It is nearly useless for verifying "tests against design intent."

The reason is simple: AI does not know the design intent. Unless it is in the context, AI reads the implementation, observes the behavior, and turns that behavior into tests. It converts "this is how it currently behaves" into a test, not "this is how it was supposed to behave."

The safe_post.py story is exactly this. The tests had been written against the old signature. Nobody noticed. The tests faithfully verified that the implementation matched a now-outdated assumption. After the signature changed, the tests stayed where they were.

The "tests pass = OK" trap

What makes this nasty is that the discovery is delayed.

Normal bugs are caught the moment the implementation fails the test. But "tests don't express the design intent" bugs only surface when the actual runtime behavior diverges from what was intended. From the test output, everything looks fine.

In the safe_post.py case, the fact that callers were using the old signature didn't surface until the code path actually ran. From the test suite alone, the answer was "all green."

The one way out

The only way to stop the snowball is to separate what can be machine-verified from what cannot.

Machine-verifiable:

Whether the implementation passes the tests (TDD's job)
Whether signatures and types are consistent (the type checker's job)
Whether boundary conditions hold (automated tests)

Not machine-verifiable:

Whether the tests correctly express the design intent
Whether the implementation's "why" matches the design intent's "why"

Humans only confirm the second category. Everything in the first goes to machines.

If you skip this split and march forward under the belief that "more tests = more safety," every new test adds another item to the "do I trust this test?" pile. Confirmation cost grows linearly with test count.

What a type checker really buys you

In the safe_post.py case, the signature change was something a type checker could have caught. With Python type annotations, mypy could have pointed straight at the caller using the old signature.

A different layer from TDD. A different mechanism. Widening the machine-verifiable surface is the realistic way to keep design integrity intact. Be explicit about which range tests own, which range the type checker owns, and which range humans own.

Minimizing the human surface

To shrink the human surface, externalize design intent as context.

When asking AI to write tests, lead with the intent. Not "write tests for this function" but "this function's responsibility is X and Y; it does not handle Z; please write tests that verify those two." When you change a signature, write: "this function's responsibility now excludes the notification side; tests should reflect that exclusion."

Even then, misunderstandings happen. But the divergence between intent and generated test is smaller than when you hand AI nothing but implementation code.

Not a criticism of TDD

To be clear: I am not against TDD.

Tests are necessary. Automated tests are the only practical way to verify boundary conditions. They are the only mechanism that can flag "did this signature change break the callers?" — provided the prerequisite holds, that the tests themselves correctly express the design intent.

The problem is the belief that "if you do TDD, your design is also safe."

TDD is a tool that raises implementation quality. It is not a tool that verifies design intent. Use it with that distinction in mind, and TDD becomes a powerful weapon. Confuse the two and you get a state where "confidence rises but the actual coverage of quality assurance shrinks."

In AI-assisted development this distinction matters more, not less. The faster AI can generate tests, the more the gap between "tests written" and "intent verified" widens — unless you deliberately design the mechanism that closes it.

A three-layer model for test design

A practical organizing frame:

Layer 1: implementation correctness (TDD)
Tests carry expectations; the implementation must satisfy them. Red/Green/Refactor. The layer AI is best at.

Layer 2: design integrity (types / static analysis)
Signature consistency, type matching, contracts with callers. Type checkers and linters do this. Machine-owned.

Layer 3: alignment with design intent (humans)
Whether the test truly expresses "why this should behave this way." Whether the implementation's "why" matches the design intent. Humans only.

When AI accelerates test generation, Layers 1 and 2 stay machine-owned. Build the discipline of confirming only Layer 3 by human. That is the realistic design for keeping speed and quality together.

Why "verbalizing design intent" is the core skill of the AI era

The conversation broadens slightly from here.

As AI-assisted development accelerates, the value of being able to articulate design intent rises.

The cost of writing code has dropped. The cost of writing tests has dropped. Both can be generated in seconds. But "what should we build?" and "why does this design have to look like this?" — these AI does not figure out for you. More precisely: unless you put the intent into the context, AI defaults to "the design inferred from the current implementation."

A person who can verbalize design intent gives AI more concrete instructions. "This function's responsibility is X and Y. Z is out of scope. Tests should verify these two." Hand AI that, and the gap between intent and generated tests shrinks.

A person whose design intent lives only in their own head, hands AI nothing concrete. Every confirmation step boomerangs back to the human. When the design intent is not verbalized, the faster AI goes, the more confirmation cost the human inherits.

I see this pattern more often in the field now: "we introduced AI, development sped up, but quality confirmation has become exhausting." The "exhausting" part is mostly the design-intent verbalization gap. Speed exposes what was tacit.

TDD does not guarantee design intent for the same reason AI does not guarantee design intent. Both are tools that process what is written. Design intent, unless humans put it into writing, lives nowhere a machine can read it.

Where to write the design intent

A concrete question: where should you put it?

In code, via test names. Not in comments, in the test name itself. The test name is the place to say "what should this implementation be doing, and why." test_safe_post_handles_missing_file says less than test_safe_post_completes_without_notify_when_notify_sh_is_absent. The longer name carries the intent.

In documents, via ADRs (Architecture Decision Records). Why you chose this design, what alternatives existed, the assumptions behind the choice. You do not need perfection. A single paragraph — "the current signature is X and Y for these two reasons" — drastically lowers the cost of judging a future signature change.

In conversation, via PR comments and issue threads. A code review comment that carries design intent becomes a future tracer for "why is it like this?"

The common move across all three: externalize design intent. Do not keep it in your head. Put it where a machine can reference it.

Conclusion

There is no shortcut to verifying design intent. The region machines cannot handle stays with humans.

What you can do is shrink the human region. Automate the machine-verifiable side aggressively. Confirm only what is left.

Not "tests pass, so we are correct." But "did I confirm that the tests express the design intent correctly?"

TDD is a powerful tool. Use it with a clear sense of what it covers and what it does not. Without that distinction, the faster AI development gets, the more quietly things break underneath.

That is the lesson from today.

References

Hamel Husain, "Your AI Product Needs Evals" (2024) — origin of the "Gulf of Specification" concept
Jeannette Wing, "Computational Thinking" (2006, Communications of the ACM)

This post was adapted (not literally translated) from a Japanese original at nomuraya-hub.pages.dev. I am the same author writing under different pen names — "nomuraya / shimajima / 中翔" — depending on the medium.

I Built a Serverless VPN on Lambda MicroVMs — 12 Builds, 5 Dead Ends, 1 Working Architecture

Vivek V. — Sat, 27 Jun 2026 00:20:56 +0000

TL;DR

I built a personal VPN using AWS Lambda MicroVMs. Your traffic exits from AWS. When you disconnect, the MicroVM terminates — zero cost, nothing running. When you reconnect, a fresh MicroVM launches in about 20 seconds.

./vpn.sh start   # All Mac traffic now exits from AWS
./vpn.sh stop    # Back to your real IP

Here is what I learned across 12 image builds — dead ends, kernel limitations, and what finally worked.

The Idea

Lambda MicroVMs launched in June 2026 (4 days ago). They are Firecracker VMs with:

Full Linux OS — your own binaries, eBPF, iptables, network namespaces
Suspend/resume — state preserved on snapshot, resumes in ~1s per GB (or terminate for zero ongoing cost)
Hardware-level isolation — every session gets its own sandbox
Per-second billing — ~$0.13/hr for a 2GB ARM64 (Graviton) instance
8-hour max lifetime (active + suspended combined)

I wanted to run a VPN inside one. Connect when I need privacy. Disconnect and pay nothing for compute. Resume instantly when I reconnect.

Took 12 image builds to get there.

What I Tried (and Failed)

Attempt 1: NAT Gateway Replacement (The Original Idea)

This is actually where the project started. I was paying $32/mo for a NAT Gateway and thought: what if a MicroVM running nftables could replace it? Serverless NAT. Pay only when traffic flows.

Why it failed: Lambda MicroVMs cannot act as VPC route targets. Their networking is ingress-only (HTTPS + JWT). Other VPC resources cannot route through a MicroVM. The VPC egress connector gives the MicroVM its own internet access. It does not make it a transit device.

That killed the NAT idea. But it made me think — if I cannot route VPC traffic through it, what about routing my own laptop's traffic through it? That is how Serverless VPN was born.

Attempt 2: VPC Egress Connector

I created a VPC, subnets, security groups, and a network connector. One hour wasted.

MicroVMs have INTERNET_EGRESS by default. The connector is only needed for reaching private VPC resources (RDS, internal NLBs). For a VPN that exits to the public internet, default egress works fine.

Attempt 3: Kernel WireGuard

$ ip link add wg0 type wireguard
Error: Unknown device type.

The MicroVM kernel does not have wireguard.ko. Setting additionalOsCapabilities: ["ALL"] does not help. "ALL capabilities" means Linux capabilities (CAP_NET_ADMIN, etc.). Not kernel modules. The Firecracker kernel is compiled by AWS. You cannot load modules.

Attempt 4: Boringtun (Userspace WireGuard)

Failed to initialize tunnel
error: Socket(Os { code: 19, kind: Uncategorized, message: "No such device" })

Even after mknod /dev/net/tun c 10 200, the kernel has no TUN driver. CONFIG_TUN is not compiled in. The device node exists in the filesystem, but nothing in the kernel backs it.

eBPF works because CONFIG_BPF=y is in the kernel. TUN is not. This is a kernel config choice, not a permissions issue.

Attempt 5: Boringtun Daemonize

BoringTun failed to start

Boringtun forks by default. The Firecracker environment blocks fork() in daemon mode. Fix: --foreground flag. But TUN still does not work, so this was moot.

Always use --foreground for any daemon process in MicroVMs.

What Actually Works

veth + SOCKS5 Proxy

Credit to Aidan Steele (AWS Serverless Hero) who pointed me towards veth pairs. He had already built a Kubernetes cluster across MicroVMs — multiple pods per MicroVM, all using veth + network namespaces. No TUN needed.

The kernel supports veth pairs, network namespaces, iptables, and IP forwarding. Just not TUN.

Final architecture:

Mac → wstunnel (WSS) → MicroVM:8080 → microsocks (SOCKS5) → internet (AWS IP)

No TUN. No WireGuard kernel module. No VPC. Just:

wstunnel — WebSocket tunnel. Wraps TCP in WSS for MicroVM ingress.
microsocks — 20KB SOCKS5 proxy. Routes traffic to the internet.
iptables MASQUERADE — NATs traffic out the MicroVM's eth0.
macOS networksetup — Sets system-wide SOCKS proxy.

The 12-Build Journey

Build	What Changed	Result
v1	Kernel WireGuard	❌ Unknown device type
v2–v4	Fixed S3 access, IAM propagation	❌ Access denied → fixed
v5	Added `additionalOsCapabilities: ALL`	✅ ip_forward works
v6	Added boringtun (pre-built binary)	❌ Binary was HTML 404 page
v7	Multi-stage Docker build	❌ Multi-stage FROM not supported
v8	Single-stage Rust compile	✅ Built, but ENODEV on TUN
v9	ALL caps + microvmHooks enabled	✅ /run hook fires
v10	mknod /dev/net/tun + boringtun --foreground	❌ ENODEV (no CONFIG_TUN)
v11	Better error logging	Confirmed: no TUN driver
v12	Filed AWS support ticket	Waiting on CONFIG_TUN
v13	veth + microsocks + wstunnel	✅ Working

Gotchas

1. update-microvm-image Strips Settings

additionalOsCapabilities and hooks are lost when you call update-microvm-image. Always create a fresh image with a new name.

2. API Version Prefixes

Lambda MicroVMs: /2025-09-09/ (service: lambda)
Network Connectors: /2026-04-04/ (service: lambda-core)
Both use the same host: lambda.us-east-1.amazonaws.com

3. boto3 Does Not Have the Service Model

Lambda Python 3.12 runtime ships boto3 that does not know lambda-microvms. Use SigV4-signed raw HTTP or update the CLI.

4. Do Not Setup Networking in /ready Hook

/ready runs during image build. Networking capabilities may not be fully available. Only do filesystem ops. Real networking goes in /run hook.

5. ALL Capabilities ≠ Kernel Modules

additionalOsCapabilities: ["ALL"] grants Linux capabilities. It does NOT:

Load kernel modules
Enable CONFIG_TUN or CONFIG_WIREGUARD

It DOES enable: sysctl, iptables, ip link (veth/bridge), eBPF, network namespaces.

6. Hooks Must Be Set at Image Creation

--hooks '{"microvmHooks":{"run":"ENABLED","runTimeoutInSeconds":60}}'

If you forget this, /run never fires and your VPN never starts.

7. wstunnel Version Compatibility

Server v10.1.0 and client v10.5.5 may crash. Match versions exactly.

8. 8-Hour Max Lifetime

MicroVM terminates after 8 hours (active + suspended combined). Run ./vpn.sh start again when it expires.

9. Token Expiry (60 min max)

Auth tokens expire. The WebSocket connection persists after initial auth, but reconnection needs a fresh token.

The Final Stack

./vpn.sh start
    ↓
aws lambda-microvms run-microvm (image: serverless-vpn-v13)
    ↓
MicroVM launches from snapshot (~20s)
    ↓
/run hook fires:
  - Creates veth pair + network namespace
  - Enables ip_forward
  - iptables MASQUERADE
  - Starts microsocks (SOCKS5 on :1080)
  - Starts wstunnel server (WSS on :8080)
    ↓
wstunnel client on Mac connects via WSS
    ↓
networksetup -setsocksfirewallproxy Wi-Fi 127.0.0.1 1081
    ↓
ALL TCP TRAFFIC → SOCKS5 → WSS → MicroVM → AWS IP

Cost

State	Cost
Connected (browsing)	~$0.13/hr (2GB ARM64 MicroVM)
Disconnected (terminated)	$0 — nothing running, no storage
Each session start	~20s cold start from snapshot
Data transfer out	$0.09/GB (standard AWS egress)
Typical month (2hr/day)	~$8 AWS costs + $5 software
Light usage (1hr/day)	~$5 AWS costs + $5 software

The MicroVM can burst up to 4x baseline (8GB/4vCPU) during peak usage at peak rates.

How to Deploy

Option 1: Subscribe on AWS Marketplace (10 minutes)

The full stack is available on AWS Marketplace with a 7-day free trial. One-click CloudFormation deploy, no servers to manage.

→ Serverless VPN on AWS Marketplace

Option 2: Build It Yourself (under 1 hour)

This blog documents every gotcha I hit. Point Kiro or any AI coding agent at this post plus the Lambda MicroVM Skill and it will scaffold the stack. The gotchas above are what take weeks to discover on your own.

You need:

AWS CLI v2.27+ (brew upgrade awscli — required for lambda-microvms commands)
wstunnel (brew install wstunnel)
AWS account with Lambda MicroVM access (us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1)

Multi-Region

# Run setup once per region (builds the MicroVM image there)
REGION=ap-northeast-1 ./vpn.sh setup  # One-time per region
REGION=ap-northeast-1 ./vpn.sh start  # Tokyo

REGION=eu-west-1 ./vpn.sh setup
REGION=eu-west-1 ./vpn.sh start       # Ireland

Your IP appears from that country. Each region requires a one-time setup (image build is regional). Available in us-east-1, us-east-2, us-west-2, eu-west-1, and ap-northeast-1.

Good to Know

Session VPN, not 24/7 — designed for work sessions, not always-on. Use for privacy during active browsing.
SOCKS5 proxy — routes TCP traffic (web, APIs). Does not tunnel UDP or raw IP (gaming, VoIP) like a full WireGuard VPN.
Runs on Graviton (ARM64) — Amazon Linux 2023 base.
macOS and Linux — Windows supported via WSL2.
Your own instance — dedicated MicroVM in your AWS account. No shared servers. No traffic logging.
CloudTrail auditable — all operations logged in your account.

What's Next

CONFIG_TUN support — I have filed an AWS support ticket requesting TUN/TAP in the Firecracker guest kernel. If AWS adds it, microsocks gets swapped for WireGuard (proper full-tunnel VPN with UDP support).
UDP tunneling — currently SOCKS5 handles TCP only. WireGuard would fix this.
Auto-token refresh — re-auth before 60-min expiry for long sessions.