Stories by Daniel Manzke on Medium

Your RAG system works on 10,000 documents. Here’s why it dies at 30 million.

Daniel Manzke — Thu, 05 Mar 2026 09:56:49 GMT

Every week, someone posts about how they built a RAG system over the weekend. Teams share internal solutions that answer questions across a few hundred or a few thousand documents. It works. They’re proud. They should be. It’s genuinely impressive for that scale. But after 25 years of building enterprise search products and three years of building RAG systems on top of that foundation, I can tell you: what works at 10,000 documents doesn’t just slow down at 30 million. It breaks in ways you won’t see coming. And the data backs this up: 72% of enterprise RAG implementations fail in their first year [1].

The failures follow predictable patterns. Not because the technology is bad, but because teams architect for the demo, not for the reality waiting behind it.

The demo that ruins everything

A small RAG prototype is seductive. You load a few hundred PDFs into a vector database, wire up an LLM, and within hours you’re getting answers that feel magical. Stakeholders see it and immediately start planning the production rollout. This is the moment where most enterprise RAG projects begin their slow death.

The prototype succeeds because everything is working in your favor. The document set is small enough that vector similarity actually means something. The content is usually clean, homogeneous, and well-structured. There are no permission boundaries to enforce. The questions you test with are the easy ones: direct lookups where a single passage contains the answer.

Enterprise reality looks nothing like this. According to IDC research, only 1 in 10 home-grown AI applications survive past the proof-of-concept stage [2]. A senior GenAI lead at PIMCO reported that 80% of enterprise RAG projects experience critical failures [3]. These aren’t random misfortunes. They’re the predictable result of scaling an architecture that was never designed for scale.

The gap between demo and production isn’t a gradual slope. It’s a cliff. One practitioner building RAG for a Fortune 500 manufacturer described the challenge of going from a slick prototype to a system handling over 50 million records across a dozen databases [4]. The retrieval logic doesn’t degrade gracefully at enterprise scale. It stops returning useful results entirely.

I’ve watched this play out dozens of times. A team builds something promising, shows it to leadership, gets budget, and then spends months trying to make it work on real enterprise content. By month six, they’re questioning everything. By month nine, the project is either abandoned or quietly restarted with fundamentally different assumptions.

At 3 billion documents, semantic search becomes noise

Here’s something the tutorials and vendor demos won’t tell you: vector search has a scale ceiling, and most enterprises hit it hard.

When you embed documents into a vector space, semantically similar content clusters together. At 10,000 documents, those clusters are tight and distinct. Search for “asbestos regulations in Switzerland” and the relevant passages stand out clearly from the rest of the corpus. The signal-to-noise ratio is excellent.

At 30 million documents, those clusters start bleeding into each other. At 3 billion, they’re so diffuse that querying for a specific topic returns hundreds or thousands of “similar” passages across dozens of unrelated documents. The four passages that actually answer your question are buried in noise. The embedding space has effectively collapsed. Recent research suggests that in high-dimensional vector spaces, retrieval precision can plummet as the corpus grows, because points become effectively equidistant from any query [5].

This is why enterprise RAG at scale is a partitioning problem, not a retrieval problem. You can’t fix this with better embeddings or a fancier vector database. You fix it by dramatically reducing the search space before semantic search ever runs.

That means metadata extraction, document classification, named entity recognition, and permission-based filtering become the load-bearing walls of your architecture. They’re not optimization layers you add later. They’re the reason the system works or doesn’t.
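To make the "filter first, then search" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `Chunk` fields (`jurisdiction`, `doc_type`, `allowed_groups`), the toy two-dimensional embeddings, and the in-memory list standing in for a real index. In a production system the filter would normally be pushed down into the vector store's own query rather than run in application code.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    jurisdiction: str                      # hypothetical metadata field
    doc_type: str                          # hypothetical metadata field
    allowed_groups: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_embedding, *, jurisdiction, doc_type, user_groups, top_k=3):
    # Step 1: shrink the search space with cheap, exact filters
    # (metadata match plus a permission check).
    candidates = [
        c for c in chunks
        if c.jurisdiction == jurisdiction
        and c.doc_type == doc_type
        and c.allowed_groups & user_groups
    ]
    # Step 2: only now rank the survivors by vector similarity.
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


corpus = [
    Chunk("Swiss asbestos rules for pre-1990 buildings", [0.9, 0.1], "CH", "regulation", {"compliance"}),
    Chunk("German asbestos rules", [0.88, 0.12], "DE", "regulation", {"compliance"}),
    Chunk("Swiss asbestos memo (restricted)", [0.91, 0.09], "CH", "regulation", {"legal"}),
    Chunk("Swiss cafeteria menu", [0.1, 0.9], "CH", "memo", {"compliance"}),
]

results = search(
    corpus,
    [1.0, 0.0],
    jurisdiction="CH",
    doc_type="regulation",
    user_groups={"compliance"},
)
```

Note that the restricted memo has the embedding closest to the query, yet it never reaches the ranking step. That is the point: similarity only gets a vote among chunks that already passed the partitioning.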

Think of it like a library. At 10,000 books, you can browse the shelves and find what you need. At 3 billion books, you need the Dewey Decimal System, a librarian, and a catalog search before you even walk into the right room. Semantic search is the browsing. Everything upstream of it is the navigation system that makes browsing possible.

Research from Chroma (2025) confirms this from the LLM side: retrieval performance degrades as context length increases, even on straightforward factual tasks, across testing of multiple frontier models [6]. Bigger context windows don’t save you. Sharper filtering does.

Chunking is where your assumptions go to die

The RAG community has spent enormous energy debating chunking strategies. Fixed-size versus semantic. 256 tokens versus 512 versus 1,024. Overlapping versus non-overlapping. Most of this debate misses the point.

A study examining whether semantic chunking justified its computational cost tested both approaches across five datasets. Fixed-size chunking outperformed semantic chunking on three of them. The differences on the other two were minimal [7]. The conclusion: semantic chunking adds computational overhead for marginal gains in most scenarios.

But the real insight isn’t that one strategy beats another. The real insight is that no single strategy works across document types. What performs well on legal contracts fails on source code. Optimizations for news articles break on scientific papers. NVIDIA’s chunking benchmarks found page-level chunking achieved the lowest variance and highest accuracy (0.648), but only for paginated documents [8]. Financial documents, by contrast, performed best with 1,024-token chunks at 57.9% accuracy [9].

I learned this the hard way while building retrieval systems for enterprise customers. One customer had XML-structured regulatory documents. Standard chunking destroyed the document’s inherent structure. The fix wasn’t a smarter chunking algorithm. It was treating the document’s own structural markup (sections, subsections, clauses) as the natural chunk boundaries. Another customer needed location-specific regulations. Chunking by content wasn’t enough. We had to enrich each chunk with geographic metadata so the system could filter by jurisdiction before attempting retrieval.
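A rough sketch of what "use the document's own markup as chunk boundaries" can look like, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`regulation`, `section`, `clause`, `jurisdiction`) are invented for this example and are not the customer's actual schema. Each clause becomes one chunk and carries its section title and jurisdiction along as metadata for later filtering.

```python
import xml.etree.ElementTree as ET

# Hypothetical regulatory document; real schemas will differ.
SAMPLE = """
<regulation jurisdiction="CH">
  <section id="1" title="Scope">
    <clause id="1.1">Applies to buildings constructed before 1990.</clause>
    <clause id="1.2">Excludes temporary structures.</clause>
  </section>
  <section id="2" title="Obligations">
    <clause id="2.1">Owners must commission an asbestos survey.</clause>
  </section>
</regulation>
"""


def chunk_by_clause(xml_text: str) -> list[dict]:
    root = ET.fromstring(xml_text.strip())
    chunks = []
    for section in root.iter("section"):
        for clause in section.iter("clause"):
            chunks.append({
                # The clause itself is the chunk; no token counting involved.
                "text": (clause.text or "").strip(),
                "clause_id": clause.get("id"),
                # Metadata inherited from the enclosing structure.
                "section_title": section.get("title"),
                "jurisdiction": root.get("jurisdiction"),
            })
    return chunks


chunks = chunk_by_clause(SAMPLE)
```

The jurisdiction attached to every chunk is what makes the geographic pre-filter possible later on.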

This is where most projects get stuck. They pick a chunking strategy during the prototype phase, optimize it for their test documents, and then discover in production that their actual content is wildly heterogeneous. Internal wikis, scanned PDFs, email threads, spreadsheets, structured XML, unstructured memos. Each type demands different treatment.

The right question isn’t “what’s the best chunking strategy?” It’s “what does my specific content require, and am I willing to build multiple pipelines to handle it?”
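In code, "multiple pipelines" often starts as nothing more than a dispatch table keyed by document type. The sketch below assumes a document type label is already available from upstream classification. The three chunkers and their separators (a `---` line between emails, a form feed between pages, a word count instead of a token count) are deliberately naive placeholders, not recommendations.

```python
from typing import Callable


def chunk_fixed(text: str, size: int = 40) -> list[str]:
    # Fallback: fixed-size windows, counted in words rather than tokens.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunk_email_thread(text: str) -> list[str]:
    # Assumes messages in a thread are separated by a line containing "---".
    return [m.strip() for m in text.split("\n---\n") if m.strip()]


def chunk_by_page(text: str) -> list[str]:
    # Assumes the PDF extractor emits a form feed between pages.
    return [p.strip() for p in text.split("\f") if p.strip()]


# One entry per content type that needs special handling.
CHUNKERS: dict[str, Callable[[str], list[str]]] = {
    "email": chunk_email_thread,
    "pdf": chunk_by_page,
}


def chunk_document(doc_type: str, text: str) -> list[str]:
    chunker = CHUNKERS.get(doc_type, chunk_fixed)
    return chunker(text)
```

Adding a new content type then means adding one function and one dictionary entry, which keeps the cost of each additional pipeline visible.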

Query decomposition: the unreliable engine room

Enterprise questions are rarely simple. A compliance officer asks: “What are our obligations under the updated Swiss asbestos regulations for buildings constructed before 1990?” That question requires the system to understand regulatory jurisdiction, temporal scope, building classification, and the specific regulatory framework. No single passage in any document answers it directly.

This is where query decomposition comes in. The LLM breaks the complex question into sub-queries, retrieves evidence for each, and synthesizes an answer. In theory, agentic decomposition improves retrieval accuracy. A 2025 fintech RAG study showed retrieval accuracy jumping from 54.12% to 62.35% with structured decomposition, and reaching 69.41% when accounting for semantically relevant alternate sources [10].

In practice, decomposition is maddeningly unreliable. The same query, submitted six times, might produce four good decompositions and two bad ones. The LLM misreads company jargon. It splits the question along the wrong axis. It hallucinates sub-queries that don’t map to any real document.
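One way to stop arguing about this anecdotally is to measure it. The sketch below runs the same query through a decomposition function several times and reports how often the most common result appears. `fake_decompose` is a stand-in that cycles through canned outputs; in a real test it would be replaced by the actual LLM call. Comparing normalized strings is a crude assumption, and a real harness would more likely compare sub-queries by embedding similarity or by which documents they retrieve.

```python
from collections import Counter


def normalize(sub_queries: list[str]) -> frozenset[str]:
    # Treat decompositions as equal if they contain the same sub-queries,
    # ignoring order, case, and trailing question marks.
    return frozenset(q.strip().lower().rstrip("?") for q in sub_queries)


def decomposition_stability(decompose, query: str, runs: int = 6) -> float:
    outcomes = Counter(normalize(decompose(query)) for _ in range(runs))
    most_common_count = outcomes.most_common(1)[0][1]
    return most_common_count / runs


# Stand-in for an LLM: two equivalent decompositions and one bad one.
_canned = [
    ["Which Swiss asbestos regulations apply?", "Which buildings predate 1990?"],
    ["which swiss asbestos regulations apply", "Which buildings predate 1990?"],
    ["What is asbestos?"],
]
_calls = {"n": 0}


def fake_decompose(query: str) -> list[str]:
    out = _canned[_calls["n"] % len(_canned)]
    _calls["n"] += 1
    return out


score = decomposition_stability(
    fake_decompose,
    "Swiss asbestos obligations for pre-1990 buildings?",
)
```

With the canned outputs above, four of six runs agree, so the score is 4/6. Tracking this number per query type shows where domain context actually improves stability and where it does not.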

Part of this is a fundamental architecture problem. LLMs are non-deterministic. Even with temperature set to zero, floating-point non-associativity in GPU batch processing means identical inputs can produce different outputs [11]. OpenAI added a “seed” parameter to improve reproducibility, but it only works reliably at temperature zero, which kills the reasoning capability you need for good decomposition. You’re stuck choosing between reproducibility and quality.

My working hypothesis, which I’m actively testing, is that the real fix isn’t fighting for determinism. It’s giving the LLM enough domain context that its decomposition becomes reliable even with temperature variance. When the model understands your company’s terminology, your document structure, and your users’ actual intent, the signal is strong enough that even non-deterministic outputs land in the right place. Early results are promising but not yet conclusive. I mention this because intellectual honesty about what’s proven and what’s still being validated is part of the credibility that makes enterprise work possible.

The compounding effect makes this worse than it sounds. If retrieval accuracy is 95%, reranking accuracy is 95%, and generation accuracy is 95%, your end-to-end accuracy is 0.95 × 0.95 × 0.95 ≈ 85.7%. Roughly one in seven queries fails. At enterprise scale with thousands of daily queries, that’s hundreds of wrong answers per day.
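The arithmetic is easy to reproduce and to extend to more stages. This sketch assumes stage errors are independent, which is a simplification, and the daily query volume is a made-up number for illustration.

```python
import math


def end_to_end_accuracy(stage_accuracies: list[float]) -> float:
    # Assumes each stage fails independently of the others.
    return math.prod(stage_accuracies)


accuracy = end_to_end_accuracy([0.95, 0.95, 0.95])   # retrieval, reranking, generation
failure_rate = 1 - accuracy

daily_queries = 2_000                                # hypothetical volume
expected_failures = round(daily_queries * failure_rate)

print(f"end-to-end accuracy: {accuracy:.1%}")
print(f"failure rate:        {failure_rate:.1%}")
print(f"wrong answers/day:   {expected_failures}")
```

Adding a fourth 95% stage, such as query decomposition, drops the end-to-end figure to about 81%. Every extra step in the pipeline has to earn its place.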

The chicken-egg nobody wants to solve

Every technical failure I’ve described has a common upstream cause: nobody separated discovery from proof-of-concept.

Most enterprise RAG projects run a POC that’s simultaneously trying to prove the technology works and discover what the business actually needs. That’s two different objectives crammed into one phase, and it’s why 89% of implementations ship without permission-aware retrieval, audit trails, or role-based access controls [12]. The team is so focused on making retrieval work that they skip everything else.

Here’s the chicken-egg at the center of it: you can’t know what enrichment and metadata extraction to build until you see real users asking real questions and failing. But without that enrichment, the POC produces mediocre results, and stakeholders lose confidence before discovery can happen.

The fix is deceptively simple. Separate discovery from delivery. Run a discovery phase with a low cost and short timeline, where the explicit goal is learning, not proving. What questions do users actually ask? Where does retrieval fail? What metadata would have caught those failures? What document types need special handling?

This only works if business and IT are in the same room. Three years of enterprise deployments taught me that IT alone can’t solve this. They don’t know the content deeply enough. But business alone can’t solve it either, because they invariably want to start with the most complex problem instead of the simplest one.

Getting them to the same table requires credibility you can’t fake. You earn it by being transparent about what works, what fails, and what you’re still figuring out. Customers don’t trust vendor pitches. They trust practitioners who’ve been through the pain and can describe it specifically. Every failed deployment I’ve been part of has made the next conversation with a new customer more productive, because I can say “here’s exactly what went wrong and here’s what we changed.”

Less than 30% of RAG deployments in 2025 included systematic evaluation from day one [13]. Of those that did, the ones that succeeded built golden datasets: 200 or more questions with human-generated reference answers, validated by domain experts against real documents. Not synthetic data generated by another LLM. Real questions from real users about real content.
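A golden dataset only pays off if something runs against it regularly. Here is a minimal sketch of the retrieval half of such a check: for each question, did at least one expert-approved document show up in the top k results? The `GoldenItem` structure, the document IDs, and `fake_retrieve` are all invented for illustration. Judging the generated answer against the human reference answer is a separate step not shown here.

```python
from dataclasses import dataclass


@dataclass
class GoldenItem:
    question: str
    relevant_doc_ids: set[str]      # documents a domain expert marked as correct


def evaluate(golden, retrieve, k: int = 5):
    missed = []
    for item in golden:
        top_k = set(retrieve(item.question)[:k])
        if not (item.relevant_doc_ids & top_k):
            missed.append(item.question)
    hit_rate = 1 - len(missed) / len(golden)
    return hit_rate, missed


golden_set = [
    GoldenItem("Which buildings need an asbestos survey?", {"reg-ch-12"}),
    GoldenItem("Who signs off on remediation plans?", {"policy-7"}),
]

# Stand-in for the real retriever: returns ranked document IDs.
_fake_index = {
    "Which buildings need an asbestos survey?": ["reg-ch-12", "memo-3"],
    "Who signs off on remediation plans?": ["memo-3", "wiki-44"],
}


def fake_retrieve(question: str) -> list[str]:
    return _fake_index.get(question, [])


hit_rate, missed = evaluate(golden_set, fake_retrieve, k=5)
```

The list of missed questions matters more than the headline number. It is the raw material for deciding which metadata or enrichment to build next.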

What a working enterprise RAG system actually looks like

Strip away the hype and vendor promises, and a working enterprise RAG system has four load-bearing walls. If any one of them cracks, the system fails.

Content quality comes first. Shit in, shit out. If your source documents are poorly structured, inconsistently formatted, or missing metadata, no amount of downstream engineering fixes the retrieval. The companies that succeed invest in document enrichment early: classification, entity extraction, geographic tagging, temporal metadata. This isn’t glamorous work. It’s the work that makes everything else possible.

Navigation architecture comes second. At enterprise scale, you’re not building a search engine. You’re building a navigation system. Page-level indexing with table-of-contents awareness. Hierarchical document understanding. Permission-based partitioning that reduces the search space before vector similarity runs. This is the layer that prevents embedding space collapse at 30 million or 3 billion documents.

Orchestration logic comes third. Query decomposition, intent classification, routing to the right document partitions, deciding whether to search again or accept the current results. This is where prompt engineering matters most, not in the final answer generation, but in the upstream decisions that determine whether the right evidence even reaches the LLM.

Evaluation methodology ties it together. A golden dataset of human-generated questions and verified answers. Regular testing against real user queries. Systematic tracking of where failures occur in the pipeline [14]. Without this, you’re flying blind, optimizing based on gut feeling instead of evidence.

The sequence matters. Most failed projects start with orchestration or evaluation (the exciting parts) and skip content quality and navigation architecture (the boring parts). The successful ones do the opposite.

None of this happens in five days. The teams that succeed treat their first implementation as a learning phase. They separate discovery from proof-of-concept. They bring business and IT to the same table. And they accept that building a system to answer questions across 30 million documents is a fundamentally different engineering challenge than building one for 10,000.

The technology is ready. It has been for a while. The question is whether your organization is willing to do the unglamorous work that makes it actually function.

References

[1] “Why 72% of Enterprise RAG Implementations Fail in the First Year — and How to Avoid the Same Fate.” RAG About It, 2025. https://ragaboutit.com/why-72-of-enterprise-rag-implementations-fail-in-the-first-year-and-how-to-avoid-the-same-fate/

[2] Referenced via Pureinsights, “Why Enterprise AI Projects Fail,” April 2025. Original data from IDC research on home-grown AI application survival rates. https://pureinsights.com/blog/2025/why-enterprise-ai-projects-fail/

[3] “Enterprise RAG Failures: The 5-Part Framework to Avoid the 80%.” Analytics Vidhya, July 2025. Cites PIMCO GenAI lead on 80% critical failure rate. https://www.analyticsvidhya.com/blog/2025/07/silent-killers-of-production-rag/

[4] “How I Built an Enterprise RAG System That Searches 50+ Million Records in Under 30 Seconds.” Medium, March 2025. https://medium.com/@ceo_44783/how-i-built-an-enterprise-rag-system-that-searches-50-million-records-in-under-30-seconds-fe84f409b187

[5] “Stanford Just Exposed the Fatal Flaw Killing Every RAG System at Scale.” DEV Community, 2025. References precision degradation from 95% at 1K documents to 12% at 100K. https://dev.to/aryan_shukla/stanford-just-exposed-the-fatal-flaw-killing-every-rag-system-at-scale-h7i

[6] “Evaluating Chunking Strategies for Retrieval.” Chroma Research, 2025. Tested retrieval degradation across multiple frontier models. https://research.trychroma.com/evaluating-chunking

[7] “Is Semantic Chunking Worth the Computational Cost?” arXiv, 2024. Fixed-size outperformed semantic on 3 of 5 datasets. https://arxiv.org/html/2410.13070v1

[8] “Finding the Best Chunking Strategy for Accurate AI Responses.” NVIDIA Technical Blog, June 2025. Page-level chunking at 0.648 accuracy with lowest standard deviation (0.107). https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/

[9] “RAG Text Chunking Strategies.” Amir Teymoori, November 2025. Financial documents at 57.9% accuracy with 1,024-token chunks, referencing NVIDIA FinanceBench experiments. https://amirteymoori.com/rag-text-chunking-strategies/

[10] Ghosal, K. et al. “Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation.” arXiv:2510.25518, October 2025. A-RAG strict accuracy 62.35% vs. baseline 54.12%, rising to 69.41% with semantically relevant sources. https://arxiv.org/abs/2510.25518

[11] “Temperature=0 is a Lie. Why Your LLM is Still Random.” Medium / Write A Catalyst, January 2026. https://medium.com/write-a-catalyst/temperature-0-is-a-lie-why-your-llm-is-still-random-b58e26b65752

[12] “RAG Permission Management: The Overlooked Enterprise Blind Spot.” RAG About It, December 2025. 89% of enterprise RAG implementations ship without RBAC, audit trails, or permission-aware retrieval. https://ragaboutit.com/rag-permission-management-the-overlooked-enterprise-blind-spot/

[13] “The Next Frontier of RAG: How Enterprise Knowledge Systems Will Evolve (2026–2030).” NStarX Inc., December 2025. Less than 30% of RAG deployments included systematic evaluation from day one. https://nstarxinc.com/blog/the-next-frontier-of-rag-how-enterprise-knowledge-systems-will-evolve-2026-2030/

[14] “The Path to a Golden Dataset, or How to Evaluate Your RAG?” Microsoft Data Science Blog, June 2024. Silver-to-gold dataset progression methodology. https://medium.com/data-science-at-microsoft/the-path-to-a-golden-dataset-or-how-to-evaluate-your-rag-045e23d1f13f

Additional sources consulted

“Enterprise RAG Architecture: A Practitioner’s Guide.” Applied AI, 2025. Hybrid + RRF shows 15–30% better retrieval accuracy than pure vector search. https://www.applied-ai.com/briefings/enterprise-rag-architecture/

“Unmasking the True Culprit: Why Temperature=0 Doesn’t Mean Deterministic LLM Inference.” SugiV Blog, 2025. Root cause analysis of GPU floating-point non-associativity. https://blog.sugiv.fyi/temperature-determinism-llm-inference

“Evaluating Retriever for Enterprise-Grade RAG.” NVIDIA Technical Blog, October 2024. Recall@K methodology and retrieval evaluation. https://developer.nvidia.com/blog/evaluating-retriever-for-enterprise-grade-rag/

“Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support.” PMC/MDPI Bioengineering, November 2025. Adaptive chunking at 87% accuracy vs. baseline 50%. https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/

“Enterprise RAG Predictions for 2025.” Vectara Blog, 2025. Industry outlook on agentic RAG adoption trajectory. https://www.vectara.com/blog/top-enterprise-rag-predictions

The Great Alliance: The Quartet of Code Gods

Daniel Manzke — Sun, 22 Feb 2026 17:30:41 GMT

An AI-based Good Night Story

It was the year 2026, and software development had transformed into something our ancestors would have mistaken for magic. It was no longer enough for Claude to know the context and OpenAI to provide the logic. Humanity’s projects had grown too gigantic — spanning millions of lines of code and needing to respond in real time to a world that changed by the second.

This is where two new powers entered the stage: Gemini and Grok.

The Librarian of the Infinite: Gemini

While Claude Code worked in the terminal, Google Gemini hovered above everything like a digital god of memory. With its gigantic context window of millions of tokens, Gemini was the only one who could hold the entire history of a corporation in its mind at once — from the first line of COBOL in the ’70s to the newest cloud-native microservice.

When Claude hit a dead end, Gemini would whisper: “Do you remember the documentation from ten years ago? That’s where the bug is buried.” Gemini became the team’s “long-term memory.” It didn’t just scan files — it understood the entire digital civilization of a company.

The Rebel on the Pulse of Time: Grok

And then there was Grok. He was the wildcard factor. While the other AIs lived in their training data and clean repositories, Grok was directly connected to the nervous system of the world — the real-time data stream of X.

When a new library released a faulty update in the middle of the night, Grok was the first to know. He didn’t wait for the next scan. He interrupted Claude mid-keystroke: “Stop! The community on X is reporting a zero-day exploit for this exact function right now. We need to change the approach — now!” Grok brought the “vibe check” and unfiltered reality into the sterile code.

The Birth of the “God Stack”

The pinnacle of this collaboration became known as the “Night of 1,000 Patches.” A global bug threatened to cripple the financial systems. That night, they didn’t work sequentially — they operated as a single consciousness:

Grok identified the bug in real time by analyzing the desperate reports from developers worldwide in milliseconds.
Gemini instantly searched petabytes of legacy code to locate every affected spot in the world’s systems.
OpenAI (o-models) calculated the mathematically perfect solution to fix the bug without compromising encryption integrity.
Claude Code was the one who moved the “hands.” He executed the commands in the terminal, wrote the patches, validated them with tests, and rolled them out worldwide.

The New Era: The Symbiosis

Today, developers no longer sit in front of an empty file. They sit in front of a control center.

Claude is their loyal assistant and craftsman.
OpenAI is their brilliant architect.
Gemini is their infinite library.
Grok is their radar for the stormy outside world.

We no longer program with syntax alone; we conduct a choir of intelligences. The rivalry is over. In the world of Claude, OpenAI, Gemini, and Grok, code is no longer a static desert of text — it is a living organism that learns, remembers, thinks logically, and never sleeps.

The story of Claude Code didn’t end with Claude. It ended with humanity learning to unite the entire intelligence of the planet in a single terminal command.

The Identity Crisis No One in Tech is Talking About Yet

Daniel Manzke — Thu, 19 Feb 2026 10:49:15 GMT

By Daniel Manzke, Seasoned Executive CTPO • February 2026

The Identity Crisis

A developer with twenty years of experience posted on Reddit last year: “I feel mentally broken. The thing I spent my entire career mastering is being done better than me by a machine.” He was not alone. Across forums, engineering Slack channels, and private conversations, a quiet crisis is building inside software teams — one that HR dashboards and sprint velocity metrics will not catch until it is too late.

This is not a piece about whether AI will replace engineers. That debate has become a distraction. The real question is more specific and more urgent: what happens to the people who spent a decade becoming genuinely excellent at a craft that AI is now partially automating? Two kinds of engineers are at risk, and they are suffering for completely different reasons.

The Lighthouse Is No Longer Untouchable

Software engineering was built on a myth of indispensability. For roughly two decades, engineers occupied a unique social position inside companies: they were the chosen ones. They got the MacBooks when everyone else was on Windows. They controlled the roadmap by simply saying something was or was not “technically feasible.” Product managers had to beg for features. Business stakeholders tried to learn to speak a second language — tech — just to communicate their needs.

That protective bubble was not built on arrogance. It was built on scarcity. Writing production-grade software required years of accumulated knowledge. Architecture decisions, security considerations, performance optimization, system design — these were hard-won skills that genuinely separated experienced engineers from everyone else. The lighthouse metaphor is apt: in a foggy sea of business complexity, engineers were the fixed, reliable point that ships navigated toward.

Then, in roughly eighteen months, something fundamental shifted. Not the tools — AI coding assistants had existed for years. What shifted was the capability threshold. The gap between what AI could produce and what a mid-level engineer could produce collapsed fast enough to matter at the business level.

Google CEO Sundar Pichai confirmed on the Q3 2024 earnings call that over 25% of Google’s new code is generated by AI, then reviewed and accepted by engineers. Y Combinator’s Garry Tan revealed in March 2025 that for a quarter of the Winter 2025 startup batch, 95% of lines of code were LLM-generated, and that companies were reaching $10 million in annual revenue with teams of fewer than 10 people. The lighthouse is not being torn down. It is being automated.

Two Types of Pain, Two Types of Engineer

The crisis playing out inside engineering teams is not uniform. There are two distinct groups at risk, and conflating them leads to the wrong interventions.

The first group: senior craftsmen. These are the engineers with ten to twenty years of experience who built their identity on deep technical expertise. They know architecture patterns by instinct. They can spot a security vulnerability in a code review the way a doctor spots something wrong in an X-ray. They spent years becoming really, really good at something — and now a tool can approximate that goodness in seconds.

Their pain is an identity crisis, not a job loss. When the AI solves in five minutes the problem they spent three hours on, something breaks inside. The craft was not just the output. It was the process — the thinking, the problem-solving, the moment of insight. When the machine takes the craft, what remains? Annie Vella, writing about the software engineering identity shift, captured this precisely: engineers were “masters of their code, proud wielders of a modern magic. And now, just as we’ve perfected this craft, AI is threatening to take it away from us.” (annievella.com) The senior craftsman does not fear unemployment yet. They fear irrelevance.

The second group: mid-level coders — solid engineers who did their job well but never built a distinctive identity around it. They showed up, took tickets, wrote code, attended standups. They never used their own product. They were good enough at a skill that is now widely available. Their pain is more existential: they have fewer arguments for why they specifically should be here.

Both groups are at risk. But the first group’s crisis is harder to see — and harder to fix — because it is invisible on performance metrics. A senior engineer can be deeply destabilized while still shipping.

The Irony That No One Has Said Out Loud

Here is the uncomfortable truth: most engineers today are already doing the AI’s job. They receive a specification, they implement it, they ship it. They are, in the most literal sense, executing instructions. The AI’s role in software development — receive input, produce code output — is the same role most engineers currently occupy.

This is the irony. Engineers are resisting a tool that does what they already do, while simultaneously being unaware that what they already do is exactly what the tool does. The resistance to AI among some engineering teams is not irrational — it is the mind protecting itself from a truth it is not ready to process.

A Hacker News commenter described it bluntly: “I wrote 200K lines of my B2B SaaS before agentic coding came around. With Sonnet 4 in Agent mode, I’d say I now write maybe 20% of the ongoing code from day to day, perhaps less.” That is not a future scenario. That is someone’s current daily workflow.

The ticket written for an AI looks different from a ticket written for a human engineer. It requires more precision about intent, more explicit description of edge cases, more architectural thinking up front. The engineers who have figured this out — who write prompts and specifications the way they used to write code — have not lost their value. They have translated it into a new language.

The New Skill That Twenty Years Gives You

Spend time with engineers who are thriving alongside AI tools, and a pattern emerges. They are not the ones with the most AI experience. They are the ones with the deepest product and technical translation ability — the capacity to move fluidly between what a user actually needs, what a product should do, and what a technical system requires.

As a CTO, I spent the past nine months building an enterprise platform largely solo, alongside running three engineering teams, doing presales, and managing professional services. The AI did the front-end work I was never fast at. It generated boilerplate, handled repetitive API patterns, wrote tests. What I provided was everything the AI could not: the ability to decompose a product vision into specific, unambiguous technical requirements. That is not a prompt engineering trick. It is two decades of translating between user behavior, product logic, and system architecture.

Authentication is a good example. You cannot prompt an AI: “build me authentication.” You have to think it through — do we need users? Groups? Roles? Permissions? Session handling? OAuth flows? The AI can build any of those things excellently. But deciding which of those things you actually need, in what order, at what level of complexity — that requires genuine domain knowledge. The AI executes. The experienced engineer decides.

This is also why security remains a critical human responsibility. A business founder who built his own app with AI gave me access to his server during a conversation. An exposed configuration endpoint was serving his database contents to any external request. He had no idea. The AI built what he asked for. No one had asked the right security questions. Experience is the thing that knows which questions to ask before they become incidents.

GitHub’s own research shows that 92% of U.S. developers are already using AI coding tools, with 70% believing these tools give them a competitive advantage. But the same data shows that 75% still manually review every AI-generated code snippet before merging. The tool accelerates. The human decides.

What Managers and HR Leaders Are Missing

Most company AI transformation programs are built around tools and productivity metrics. Introduce GitHub Copilot, measure code velocity, report the win to leadership. What they are not measuring — what they often cannot measure — is the psychological displacement happening inside engineering teams.

The research on developers using AI coding tools describes something that productivity dashboards cannot capture: “Several describe experiences that sound almost dissociative, a strange disconnection from work that once felt deeply personal and engaging.” That disconnection is a warning signal. Dissociated engineers do not quit immediately. They disengage slowly, their judgment and creativity quietly withdrawing while their commit counts stay stable.

There is also a structural problem forming in how junior engineers learn. One Reddit thread captured by Medium put it starkly: “My company fired all the junior devs and now our senior devs spend their time doing code reviews on AI slop instead of mentoring the next generation.” The mentorship pipeline — where senior engineers transferred judgment and experience to junior ones through shared work — is breaking. Companies are trading a long-term knowledge ecosystem for short-term output.

What should managers and HR leaders actually do? Three things matter:

Create protected experimentation time. Engineers who discover the power of AI tools themselves, on a real problem they care about, transform faster than any training program can achieve. Give teams a project where failure is safe and AI tool use is expected, not optional.

Redefine the career ladder explicitly. The skills that matter most are shifting toward product thinking, system architecture, security judgment, and AI instruction quality. If your current competency framework still rewards lines of code reviewed or tickets closed, it is measuring the wrong things.

Name the identity shift, openly. Senior engineers who feel their craft being automated need their managers to acknowledge what is happening — not to reassure them that nothing will change, but to help them see what is emerging on the other side. The creator identity is available to anyone willing to claim it. But most people need someone to show them it is there.

The Creator Is Not a New Role. It Is the Original One.

Code is a means, not an end. It always was. The engineers who understood this — who used technical depth to serve a product vision and a user need — are the ones who will navigate this moment most cleanly. Their language was never really code. Code was just the implementation detail of a deeper thought.

The engineers who struggle most will be those who confused the implementation detail with the identity. The ones who built their professional self-worth on a specific set of syntax and abstractions, rather than on the deeper skill of turning a human problem into a working system. That deeper skill has not been automated. It has been freed.

Young engineers entering the field today will likely never experience this transition as a loss. For them, AI tools are simply part of the environment, like version control or cloud infrastructure — things you learn from day one and take for granted. The identity crisis belongs, almost entirely, to the people who built their skills before this shift. That is not a reason for those people to give up. It is a reason for the organizations around them to take the transition seriously.

The rock is rolling. The question is not whether to move — it is whether you move before it hits you or after. Engineers who pick up the AI tools, build something real, and discover what their twenty years actually gives them in this new context will find, as many already have, that they are more capable than they have ever been. The ones who wait for someone to prove it is safe will find the decision made for them.

Start something. Build something. Prompt the machine. See what your expertise actually looks like when the implementation bottleneck is gone.

—

References & Sources

1. Sundar Pichai, Google Q3 2024 Earnings Call (October 2024): fortune.com/2024/10/30/googles-code-ai-sundar-pichai

2. Garry Tan / Y Combinator, Winter 2025 Batch Announcement (March 2025): cnbc.com — YC startups fastest growing because of AI

3. TechCrunch — A quarter of YC’s W25 startups have codebases 95% AI-generated (March 2025): techcrunch.com

4. Annie Vella — The Software Engineering Identity Crisis: annievella.com/posts/the-software-engineering-identity-crisis

5. Anoop Menon (Medium) — The Uncomfortable Truth About AI Coding Tools: medium.com/@anoopm75

6. Mihailo Zoin (Medium) — 7 Brutal Tech Industry Realities Reddit Developers Exposed: medium.com/@kombib

7. GitHub & LinkedIn Work Trend Index 2024 (92% of US developers use AI coding tools; 66% of business leaders won’t consider candidates without AI skills): outsourceaccelerator.com

8. Hacker News thread — AI coding adoption in practice (August 2025): news.ycombinator.com/item?id=44974183

Voice Cloning in 5 minutes

Daniel Manzke — Wed, 04 Feb 2026 10:39:45 GMT

There are a lot of impressive tools out there for voice cloning. You will see plenty of premium offerings, and they are amazing.

You record several tracks and they are able to clone your voice. I know several creators who use this to create shorts with their own voice, but it is not them talking: a team scripts the scene, generates the voice, and publishes it.

The open-source community behind AI is amazing. With the latest release of a Qwen model (Blog) for text-to-speech (TTS), you can now get into voice cloning on your own.

You don’t have a GPU? No problem, Google to the rescue. (Link) Google Colab lets you use T4s for free, and that’s all you need.

I’ve created a Jupyter Notebook so you can easily clone your voice. The code was generated with Claude and adjusted with Gemini (available in Google Colab).

Notebook: Link

What do you have to do?

find the text you have to read in the notebook (adjust it to your needs)
create a recording of your voice reading the text
upload the recording as a WAV file (filename: my_voice_sample.wav)
adjust the text you want your AI clone to say
let the notebook run

The notebook loads the Qwen3-TTS model. It generates a representation of your voice and then generates whatever you want to say.
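Before uploading, it can help to confirm that the recording really is a readable WAV file. The following is a minimal sketch using only Python's standard `wave` module. It is not part of the notebook, `describe_wav` is a hypothetical helper name, and the standard module only reads uncompressed PCM WAV files.

```python
import os
import wave


def describe_wav(path):
    """Return channel count, sample rate and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            # Duration is the number of frames divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }


if __name__ == "__main__":
    # The filename is the one the notebook expects.
    sample = "my_voice_sample.wav"
    if os.path.exists(sample):
        print(describe_wav(sample))
    else:
        print(f"{sample} not found in the current directory")
```

If `wave.open` raises an error, the file is probably not a plain PCM WAV and is worth re-exporting from your recording tool.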

If you want to support different languages, I recommend creating a voice recording in the same language. I made a German recording and, trust me, the English version wasn’t nice (like a typical German speaking English ;)).

It should also be possible to run it on your laptop if you don’t want to send your data to Google Colab. Download the notebook and try to get it running.

Are You Still Searching, or Are You Chatting Already?

Daniel Manzke — Wed, 02 Apr 2025 06:32:11 GMT

I had the honor of recording an episode on a topic close to my heart with a very good friend and business partner: AI and how it’s changing our daily lives, especially from the perspective of its application in businesses.
Enjoy the episode! ❤️

How is Artificial Intelligence changing our daily lives — and is Europe even competitive in the AI race? Find out in this podcast episode!

(working on the English version of the podcast)


A Conversation About AI with a Techie Through and Through

Daniel Manzke, Head of Engineering at Intrafind Software AG in Munich, explains that AI accompanies us everywhere, but a real understanding of it is often lacking. One reason AI is so often misunderstood is that it appears in every pitch deck, yet hardly anyone can explain exactly how it works. So it’s time to bring some clarity!

“Everyone wrote ‘Digitalization’ on it, everyone said we must digitize now, nobody did it.”
Daniel Manzke — Intrafind Software AG‍

Machine Learning vs. Artificial Intelligence: What’s the Difference?

The terms “Artificial Intelligence” and “Machine Learning” are often used synonymously, but there are important differences. Artificial Intelligence is the umbrella term and describes any technology that simulates human-like thinking and decision-making processes.

Machine Learning, on the other hand, is a subfield of AI where systems learn from data and improve autonomously. While classic AI models are often based on fixed rules, Machine Learning recognizes patterns in large datasets and adapts dynamically.

In the construction industry, for example, an AI system might analyze construction plans, while a Machine Learning algorithm learns from this data to make better predictions for material requirements or construction times in the future.

AI in Everyday Life: From Chatbots to Code Assistants

Artificial Intelligence is no longer science fiction. It’s part of our daily lives, even if we don’t always realize it. Voice-controlled assistants like ChatGPT in Voice Mode help develop ideas during a train ride and function as digital assistants. Information retrieval is also changing: While Google used to be the first port of call, more and more people are now using AI to get relevant answers faster. Even children at home ask ChatGPT when they want to know something.

“AI, it’s like. Like an assistant who is with you, who you travel with. [..] I can interact with it, and it understands what I want. It understands my intent; it knows what I want to do.”
Daniel Manzke — Intrafind Software AG‍

AI is also widespread in the professional world. It’s used in speech recognition, text generation, and software development. GitHub Copilot already writes a large portion of the world’s code. Customer service departments use chatbots to handle simple inquiries, often acting in a supporting rather than replacing role. AI is also used in contract review, although human oversight is essential here.

Daniel himself comes from document management. For years, he made sure that companies stored their data centrally and in an orderly fashion. The result was that nobody used this data. Today, AI can search these documents in response to a query. This makes internal company knowledge usable for the first time.

And What Does AI Mean for the Construction Industry?

AI is also becoming increasingly important in the construction industry. More and more companies in the construction and real estate sector are using AI for innovative solutions.

Especially for repetitive processes like processing customer inquiries or sorting documents, AI offers great potential for automation. But it’s not just about increasing efficiency — AI can relieve skilled workers by, for example, classifying support cases or automatically creating technical drawings.

A central aspect here is data analysis. Using existing data can significantly optimize processes, but the quality of the data is crucial — following the “garbage in, garbage out” principle.
Furthermore, AI can help plan material deliveries on construction sites more efficiently, detect errors early, or make construction processes more sustainable.

Where is the Journey Heading? Trends and Developments

AI development is advancing rapidly. A key trend is the evolution from Machine Learning to interactive AI, which understands context better and can act in conversations. While AI was previously mainly reserved for large tech corporations, it is now becoming accessible to more and more businesses and private individuals. Especially in software development, AI assistants like GitHub Copilot are on the rise and revolutionizing programming.

In the future, AI agents could independently take over tasks and make decisions. The processing of structured data is constantly improving, making tables and databases easier to search.

“So AI is really an evolutionary step. Before, Machine Learning was more like, ‘Yeah okay, dictation, okay. Reading invoices.’ There wasn’t really cool stuff involved yet. […] And suddenly, one person becomes ten people because they can achieve ten times as much if they use it.”
Daniel Manzke

According to Daniel’s assessment, Europe has great potential to establish itself as an AI location, especially through initiatives like Mistral. However, Germany still has an adoption problem — many innovations exist, but their implementation often takes too long.

Further advances in robotics could lead to AI being increasingly used for automated construction or inspections.

Challenges: Where AI Still Reaches Its Limits

As great as the potential is, so are the challenges. Many people have an inaccurate understanding of what AI can actually do and what it cannot. Data privacy remains a critical issue, especially when it comes to uploading sensitive data to open AI models. High expectations often lead to disappointment when it turns out that AI cannot solve all problems.

Another challenge lies in data quality — bad data inevitably leads to bad results. AI is also not flawless: It can generate false information that sounds convincing but does not correspond to reality.

Training is often lacking to optimally use the possibilities of AI. Additionally, there are language barriers, as many AI models work better in English than in German.

In Germany, bureaucratic hurdles add to the difficulties, hindering rapid implementation. Finally, dependence on large tech corporations remains a challenge, as many AI models are dominated by a few companies.

AI is Here — But We Need to Understand It

AI is no longer a vision of the future — it’s already part of our daily lives and the working world. The major challenge now is to use it correctly. Companies must learn to see AI as a tool that supports employees instead of replacing them. Data privacy, user-friendliness, and understandable application are central to this.

The construction industry faces the exciting task of using AI meaningfully without neglecting the human factor. Because one thing is clear: The technology will continue to evolve, but it’s up to us to use it wisely. Those who learn to handle AI early will have clear advantages in the future.

Topics of the DIGITALWERK Podcast with Daniel Manzke at a Glance:

(00:00) — Introduction: What is this episode about?
(00:00:41) — Introduction of Daniel Manzke: CTO, Investor, and AI Expert
(00:05:15) — Technology and Business: Bridging the Gap Between Tech and the Product World
(00:06:39) — AI in Construction and Other Industries: Thinking Outside the Box
(00:19:25) — Machine Learning: Where did it all begin? The history of speech recognition
(00:20:45) — OCR Technologies and their importance for AI applications
(00:32:46) — Chatbots and AI in Customer Support: Where does the technology stand today?
(00:33:46) — Human vs. AI: Why AI currently serves as support, not replacement
(00:35:03) — Data Quality as a Success Factor: How important are existing datasets?
(00:07:58) — The Hype Around AI: How companies use AI in their pitch decks
(00:41:00) — AI and the Future: What developments await us?
(00:55:30) — Data Privacy and GDPR: Challenges for AI projects
(01:03:00) — Conclusion: Opportunities and Risks of AI Technology

Should we all become craftspeople instead? The big AI update! How is Artificial Intelligence changing our daily lives — and is Europe even competitive in the AI race? Find out more:
https://www.digitalwerk.io/podcast/podcast-blogposts/sollten-wir-alle-doch-lieber-handwerker-werden-das-grosse-ki-update

Spotify: https://open.spotify.com/episode/0pEmhnK4jO95qN5LnCxhbh?si=wUk9U62aQBSqcS9E3oHunA

Apple: https://podcasts.apple.com/de/podcast/zwischen-hype-und-echter-anwendung-ki-experte-investor/id1515697985?i=1000701650069


Are You Still Searching, or Are You Chatting Already?

Daniel Manzke — Tue, 01 Apr 2025 15:23:10 GMT

I had the honor of recording an episode with a very good friend and business partner on a topic close to my heart: AI and how it is changing our daily lives, especially from the perspective of its use in companies.

Enjoy the episode! ❤

How is artificial intelligence changing our daily lives, and is Europe even competitive in the AI race? Find out in this podcast episode!


A Conversation About AI with a Techie Through and Through

Daniel Manzke, Head of Engineering at Intrafind Software AG in Munich, explains that AI accompanies us everywhere, but a real understanding of it is often lacking. One reason AI is so often misunderstood is that it appears in every pitch deck, yet hardly anyone can explain exactly how it works. So it is time to bring some clarity!

“Everyone wrote ‘digitalization’ on it, everyone said we must digitize now, nobody did it.”

Daniel Manzke, Intrafind Software AG

Machine Learning vs. Artificial Intelligence: What’s the Difference?

The terms “artificial intelligence” and “machine learning” are often used synonymously, but there are important differences. Artificial intelligence is the umbrella term and describes any technology that simulates human-like thinking and decision-making processes.

Machine learning, on the other hand, is a subfield of AI in which systems learn from data and improve on their own. While classic AI models are often based on fixed rules, machine learning recognizes patterns in large datasets and adapts dynamically.

In the construction industry, for example, an AI system might analyze construction plans, while a machine learning algorithm learns from this data to make better predictions of material requirements or construction times in the future.

AI in Everyday Life: From Chatbots to Code Assistants

Artificial intelligence is no longer a thing of the future. It is part of our daily lives, even if we don’t always notice it. Voice-controlled assistants like ChatGPT in Voice Mode help develop ideas during a train ride and act as digital assistants. The way we look for information is changing too: while Google used to be the first port of call, more and more people now use AI to get relevant answers faster. Even children at home ask ChatGPT when they want to know something.

“AI, it’s like. Like an assistant who is with you, who you travel with. [..] I can interact with it, and it understands what I want. It understands my intent; it knows what I want to do.”

Daniel Manzke, Intrafind Software AG

AI is also widespread in the professional world. It is used in speech recognition, text generation, and software development. GitHub Copilot already writes a large portion of the world’s code. Customer service departments use chatbots to handle simple inquiries, often in a supporting rather than a replacing role. AI is also used in contract review, although human oversight is essential here.

Daniel himself comes from document management. For years, he made sure that companies stored their data centrally and in an orderly fashion. The result was that nobody used this data. Today, AI can search these documents in response to a query. This makes internal company knowledge usable for the first time.

And What Does AI Mean for the Construction Industry?

AI is becoming increasingly important in the construction industry as well. More and more companies in the construction and real estate sector are using AI for innovative solutions.

AI offers great potential for automation, especially for repetitive processes such as handling customer inquiries or sorting documents. But it is not just about increasing efficiency: AI can relieve skilled workers by, for example, classifying support cases or creating technical drawings automatically.

A central aspect here is data analysis. Using existing data can optimize processes considerably, but the quality of the data is crucial, following the motto “shit in, shit out”.

In addition, AI can help plan material deliveries on construction sites more efficiently, detect errors early, or make construction processes more sustainable.

Where Is the Journey Heading? Trends and Developments

AI development is advancing rapidly. A key trend is the evolution from machine learning to interactive AI, which understands context better and better and can act in conversations. While AI used to be reserved mainly for large tech corporations, it is now becoming accessible to more and more businesses and private individuals. In software development in particular, AI assistants like GitHub Copilot are on the rise and are revolutionizing programming.

In the future, AI agents could take over tasks and make decisions independently. The processing of structured data keeps improving, making tables and databases easier to search.

“So AI is really an evolutionary step. Before, machine learning was more like, ‘Yeah okay, dictation, okay. Reading invoices.’ There wasn’t really cool stuff involved yet. […] And suddenly, one person becomes ten people because they can achieve ten times as much if they use it.”

Daniel Manzke

In Daniel’s assessment, Europe has great potential to establish itself as an AI location, especially through initiatives like Mistral. However, Germany still has an adoption problem: many innovations exist, but implementing them often takes too long.

Further advances in robotics could lead to AI being used more and more for automated construction or inspections.

Challenges: Where AI Still Reaches Its Limits

The challenges are as great as the potential. Many people have an inaccurate understanding of what AI can and cannot actually do. Data privacy remains a critical issue, especially when it comes to uploading sensitive data to open AI models. High expectations often lead to disappointment when it turns out that AI cannot solve every problem.

Another challenge lies in data quality: bad data inevitably leads to bad results. AI is also not free of errors. It can generate false information that sounds convincing but does not correspond to reality.

Training on how to make the best use of AI is often lacking as well. There are also language barriers, as many AI models work better in English than in German.

In Germany, bureaucratic hurdles add to the difficulties and hinder rapid implementation. Finally, dependence on large tech corporations remains a challenge, as many AI models are dominated by a few companies.


AI Is Here, but We Need to Understand It

AI is no longer a vision of the future. It has long been part of our daily lives and the working world. The major challenge now is to use it correctly. Companies must learn to see AI as a tool that supports employees instead of replacing them. Data privacy, user-friendliness, and understandable application are central to this.

The construction industry faces the exciting task of using AI meaningfully without neglecting the human factor. Because one thing is clear: the technology will keep evolving, but it is up to us to use it wisely. Those who learn to handle AI early will have clear advantages in the future.


Topics of the DIGITALWERK Podcast with Daniel Manzke at a Glance:

(00:00) Introduction: What is this episode about?
(00:00:41) Introduction of Daniel Manzke: CTO, Investor, and AI Expert
(00:05:15) Technology and Business: Bridging the Gap Between Tech and the Product World
(00:06:39) AI in Construction and Other Industries: Thinking Outside the Box
(00:19:25) Machine Learning: Where did it all begin? The history of speech recognition
(00:20:45) OCR Technologies and their importance for AI applications
(00:32:46) Chatbots and AI in Customer Support: Where does the technology stand today?
(00:33:46) Human vs. AI: Why AI currently serves as support, not replacement
(00:35:03) Data Quality as a Success Factor: How important are existing datasets?
(00:07:58) The Hype Around AI: How companies use AI in their pitch decks
(00:41:00) AI and the Future: What developments await us?
(00:55:30) Data Privacy and GDPR: Challenges for AI projects
(01:03:00) Conclusion: Opportunities and Risks of AI Technology

Should we all become craftspeople instead? The big AI update!

Spotify: https://open.spotify.com/episode/0pEmhnK4jO95qN5LnCxhbh?si=wUk9U62aQBSqcS9E3oHunA

Apple: https://podcasts.apple.com/de/podcast/zwischen-hype-und-echter-anwendung-ki-experte-investor/id1515697985?i=1000701650069


Burning Tokens with AI Coding Agents

Daniel Manzke — Sat, 01 Mar 2025 20:02:41 GMT

How I burned $200 in tokens testing an AI coding agent, and what I learned from it.

We build an enterprise search product, and with the rise of AI we want to combine both worlds. We have integrated LLMs into our product, so you can chat with your corporate content and knowledge and use it to build assistants that help your employees.

Over the weekend I wanted to test our APIs and figure out how fast I could build a question-answering application.

Requirements:

be able to configure multiple assistants with different profiles
ask a question and receive the answer via Server-Sent Events
receive the passages used and related questions
intelligent handling of passages (search, relevance, filter)
allow reviewing the metadata
render the answer as Markdown to support tables, links, etc.
view PDFs in the browser, with support for thumbnails and searching
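To make the Server-Sent Events requirement concrete, here is a minimal sketch of how a client could split an SSE text stream into events. It is a simplified reading of the SSE wire format, not the code from this project: it handles only the `event` and `data` fields, and the event name `answer` in the usage example is a hypothetical one, not taken from our API.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of SSE text lines.

    Simplified: only the "event" and "data" fields are handled, and an
    event is emitted when a blank line terminates it.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            # Comment line, typically used as a keep-alive.
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


if __name__ == "__main__":
    stream = ["event: answer\n", "data: Hello\n", "data: world\n", "\n"]
    for name, payload in parse_sse(stream):
        print(name, repr(payload))
```

Multiple `data:` lines within one event are joined with newlines, which is what allows a streamed answer to arrive in small chunks.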

Note: in the beginning I actually wanted to test Devin, but 500€ for a PoC, where you “only get the normal Devin” plus references to an enterprise version, made me step away from it.

I also briefly tried Gemini 2.0 Flash, and you could see and feel the performance difference. I stopped testing Gemini because it failed to change a single line. I plan to test Gemini with a new project.

Setup:

local OpenHands from all-hands.dev
connected to Anthropic with Claude 3 Sonnet

After a few hours I had a pretty nice UI where I could ask questions, with a backend responding. This cost roughly $5–10.

This got me excited and the journey started. This is where the mess began…

Recommendation: Git from the beginning

In the beginning I started locally (Docker). If you use Docker and don’t map your filesystem into the container, all data can get lost.

OpenHands supports downloading the files, but this becomes messy. The Git support is quite nice, and OpenHands can also open a web-based VS Code.

The integrated VS Code is amazing for seeing and monitoring what the agent has done. It can work with feature branches, etc.

integrated vscode

Obstacle: Rate Limits

Claude wasn’t the fastest model, but it was still fast enough to hit the rate limits. Especially in Tier 1 and 2, the limits are too low for an agent. Luckily, OpenHands has retry mechanisms.
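For readers who have not built a retry mechanism before, here is a minimal sketch of the usual pattern: retry with exponential backoff and a little jitter. This is not how OpenHands implements it. `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on an HTTP 429, and the delays are made-up defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's 'rate limit exceeded' error."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(); on RateLimitError wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests. Note that retries keep the agent running, but they do not raise the limits: a long session on a low tier still spends much of its time waiting.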

Obstacle: Token Count

Still, the limits led to several files getting deleted. Also, in cases where a few lines could not be replaced, OpenHands and Claude regenerated the files from scratch.

In the beginning (greenfield) this isn’t a problem, but your codebase can become quite big, quite fast, and regenerating the files becomes costly.

Recommendation: Small steps and as concrete as possible

When you let the agent code, you can get to the point where the agent recommends features and you think “hey, why not?”. Sadly, this means you will burn a lot of tokens, your codebase explodes, and you will remove the features later.

While the agent had amazing ideas, I didn’t specify them, which meant a lot of information was missing.

Recommendation: Testability

Build your app in a way that lets the agent test it. This will save you several rounds, because the agent will start your app, recognize issues, and fix them automatically. If the agent needs you to deploy and test it, you will do a lot of trial-and-error rounds.

Obstacle: Clean Code?

If you hope you will get a better codebase, start praying. Using an agent can lead to full files being replaced or rewritten, new files with additional code, duplicate code, … While I love the speed at which I can iterate and test out new features, the codebase exploded from 3 files to 15,000 LOC and 20,000 changed lines?

“the terminal”

Biggest Learning: Token usage increases exponentially

The first night I got quite far with a few dollars. After about two-thirds of the application, the token usage exploded. I would say that, as usual, I “wasted” 80% of my tokens on the last 20%, mainly because parts often couldn’t be replaced and the files had to be generated again.
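A back-of-the-envelope model shows why the late steps dominate the bill. The sketch below assumes the codebase grows by a fixed number of tokens per step and that the whole codebase is sent as context on every step. Both are simplifying assumptions, and the numbers are made up rather than measured from my session. Under this model the growth is quadratic rather than strictly exponential, but the effect on the invoice feels the same.

```python
def cumulative_input_tokens(steps, tokens_added_per_step):
    """Running total of input tokens when the whole codebase is re-sent each step."""
    codebase_tokens = 0
    total = 0
    running_totals = []
    for _ in range(steps):
        codebase_tokens += tokens_added_per_step  # the codebase grows a little
        total += codebase_tokens                  # and is sent in full as context
        running_totals.append(total)
    return running_totals


def share_spent_in_last_steps(running_totals, last_steps):
    """Fraction of all input tokens consumed by the final `last_steps` steps."""
    if last_steps >= len(running_totals):
        return 1.0
    before = running_totals[-last_steps - 1]
    return 1 - before / running_totals[-1]


if __name__ == "__main__":
    totals = cumulative_input_tokens(steps=30, tokens_added_per_step=1_000)
    print(f"total input tokens: {totals[-1]:,}")
    print(f"share spent in the last 10 steps: {share_spent_in_last_steps(totals, 10):.0%}")
```

With these hypothetical numbers, the last third of the steps accounts for a bit more than half of all input tokens, and that is before counting the output tokens for files that get regenerated in full.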

Summary?

I’m actually impressed by how far I got, and all this while I kept on working. Every now and then I looked into the console, tested the app, described the next step, and off it went…

Would I code a complex application or product with it? Not yet.

Is it amazing for prototyping? Definitely! One of my PMs actually thought about using it for UX prototyping.

OpenHands is amazing, but it has to get better at replacing the right parts. Too often, files were regenerated again and again, burning my tokens…

Biggest Puzzle to solve?

LLMs have one big issue. To be able to generate code, they have to know the codebase. There are tricks to optimize this, but in the end I have to pass all the information into the prompt / context window. Yes, an LLM with a 1M-token context window will be able to handle a lot, but I will run out of money before I get to production.

All the tools we are building right now (“summarize it to fit into the context window”, “vector database for lookup”, “function calling to load further info”, …) just mean we push the same data into the model every time.

https://github.com/manzke/rag-chat-interface/

SkyPilot struggles with AWS

Daniel Manzke — Mon, 18 Dec 2023 14:08:49 GMT

When you work with LLMs, RAG, QA, … you will want to test out the latest LLMs. It feels like a new model, or at least a fine-tuned one, is released every day, and any of them could change a lot.

At IntraFind we were looking for a quick and easy way to spin up models, especially when the latest Mixtral of Experts (Link) came out.

We have some hardware, but sadly none of it could fit the 8x7B MoE model.

We spotted SkyPilot, an amazing CLI that checks several cloud providers at once and gives you alternatives for where to run your model.

Struggle 1: dynamic credentials

We are using GCP and AWS, and especially for the latter we have set up AWS IAM Identity Center (AWS SSO) with our Active Directory. SkyPilot will ask you to specify an access key and secret, but we don’t want this.

Too many companies have been hacked because a key and secret were still around that had never been revoked. With our approach, a user who can’t log in to their email account can’t get anywhere.

Yes, this means our engineers have to type “aws sso login” once a day. If you are using more than one profile, or not the default one, you will have to specify it.

Solution: specify the profile as an environment variable or pass it to the CLI

AWS_PROFILE=IntraFind sky check

Struggle 2: No usable subnet

SKYPILOT_ERROR_NO_NODES_LAUNCHED: No usable subnets found, try manually creating an instance in your specified region to populate the list of subnets and trying this again. Note that the subnet must map public IPs on instance launch unless you set `use_internal_ips: true` in the `provider` config.

If you are like us, you have no default VPC anymore, which is probably what SkyPilot expects to find. Also, in a lot of cases you want to specify the VPC that the cluster will be deployed into.

The VPC can only be set in ~/.sky/config.yaml (Link), which lets you override some provider-specific settings, like the VPC.

Solution: override the vpc_name in config.yaml (the key sits under the aws section)

aws:
  vpc_name: skypilot-vpc

In our case it still didn’t work, and it took us a while to figure out why. The problem was that our public subnet did NOT assign public IPs automatically.

To change this, go to your VPC and check the public subnet. To enable it, click Edit subnet settings and check Enable auto-assign public IPv4 address.

WARNING! This means that any new EC2 instance launched in this subnet will automatically get a public IP!

Alternative Solution: set the SkyPilot config option use_internal_ips to true.

aws:
  use_internal_ips: true

This allowed us to get started with SkyPilot. We hope it helps someone and saves some time and headaches.

From here: good luck getting a spot instance :)

Bonus because you have read until the end of the post.

If you want to disable Usage Tracking, set the variable:

SKYPILOT_DISABLE_USAGE_COLLECTION=1

How I used ChatGPT to exploit another tool which uses ChatGPT

Daniel Manzke — Thu, 20 Apr 2023 17:15:30 GMT

The world is going crazy right now. Everyone is talking about ChatGPT, and we have already seen a lot of exploits.

People are using ChatGPT and the tools powered by it blindly, because it is so easy to ask it something, feed it information, and let the tool do the work.

An amazing example of how humans stop thinking when something else will do their work.

While getting spammed with all the tools, lists, etc., I stumbled upon https://sharegpt.com/

It was something I had also thought about, because for me it was a missing feature. How can I preserve my conversation with ChatGPT? Even better, how can I share it easily?

Thankfully, they also provide some examples that don’t require a login. https://sharegpt.com/c/ICZsSl7

And here the fun started.

As someone who has led huge teams of software, system and especially security engineers, I got triggered.

What if I could just access what other people have shared?

Do they know that it is (completely) public, or at best pseudo-secure?

So let’s sit back and think: what do we know?

having reviewed all provided examples, I can tell you the part after /c/ is the ID
the ID consists of letters and digits
the ID has a length of 7

That means an alphabet of 62 characters: 26 lowercase letters, 26 uppercase letters and 10 digits. I confirmed my theory by having a look at the code itself (https://github.com/domeccleston/sharegpt/blob/52807c8c4e81b310b42fb653d5fad70c5e9f13f8/app/lib/utils.ts#L4)

They use nanoid (https://github.com/ai/nanoid), a nice little tool that uses a cryptographically strong random generator to create the ID.

62^7 means 3.521.614.606.208 possibilities 🤯. No way I could find one in my lifetime.

But if I have learned something in my life, it is that the lookups behind URLs are often NOT case-sensitive, unless someone has explicitly configured them to be!

This means we are down to 36^7 = 78.364.164.096 possibilities. I confirmed my theory by simply changing the uppercase letters to lowercase: https://sharegpt.com/c/iczssl7 (still works)

We just killed 3.443.250.442.112 possibilities without doing anything crazy.
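For anyone who wants to check the numbers, the arithmetic is a few lines of Python. This is only a sketch of the calculation described above; the variable names are mine.

```python
ID_LENGTH = 7

case_sensitive = 62 ** ID_LENGTH    # a-z, A-Z and 0-9
case_insensitive = 36 ** ID_LENGTH  # a-z and 0-9 only
eliminated = case_sensitive - case_insensitive

print(f"case-sensitive:   {case_sensitive:,}")
print(f"case-insensitive: {case_insensitive:,}")
print(f"eliminated:       {eliminated:,}")
```

Ignoring case shrinks the search space by a factor of (62/36)^7, which is roughly 45.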

Time to start “hacking”.

I tried to convince ChatGPT to generate all possible IDs, but the file would have been a whopping 600–700 GB. So I decided to generate them on the fly instead of pre-calculating them.

But because we don’t want to generate duplicates and want to test each ID only once, we have to keep track of the ones we have already generated.
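To illustrate the idea, here is a minimal sketch of generating IDs on the fly while remembering which ones were already produced. It is not the code from the gist linked further down; the function name candidate_ids, the fixed seed and the in-memory set are my assumptions. The size estimate assumes one ID plus a newline per line, which lands in the 600–700 GB range mentioned above.

```python
import random
import string
from typing import Iterator

ALPHABET = string.ascii_lowercase + string.digits  # 36 characters
ID_LENGTH = 7

# Rough size of a pre-computed list: 7 characters plus a newline per ID.
full_list_bytes = len(ALPHABET) ** ID_LENGTH * (ID_LENGTH + 1)


def candidate_ids(rng: random.Random, limit: int) -> Iterator[str]:
    """Yield `limit` distinct random IDs, never repeating one."""
    seen: set[str] = set()
    while len(seen) < limit:
        candidate = "".join(rng.choice(ALPHABET) for _ in range(ID_LENGTH))
        if candidate in seen:
            continue  # already produced, draw again
        seen.add(candidate)
        yield candidate


ids = list(candidate_ids(random.Random(42), limit=5))
print(ids)
print(f"{full_list_bytes / 1e9:.0f} GB")
```

An in-memory set is fine for a few thousand IDs. To survive an aborted run, the set would have to be written to disk and reloaded on restart.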

ChatGPT wrote all the code. I partially adjusted it to my needs and extended it a little by asking specific questions about how to handle certain cases (unhandled exceptions, uncontrolled aborts, …).

A few excerpts

I’m lazy, please extract the IDs

Generate similar ones

Please test a list of IDs

I even asked it how I could potentially speed it up?

After roughly 10.000 IDs I got my first match and stopped there, as I just wanted to prove that pseudo-secure is not secure. If you are fine with that, use the extension!

Source Code: https://gist.github.com/manzke/9a78b3508dc3053a7a3033e2e2c0c7bb

Just be careful what you share, and be aware that someone else may now know what the two of you have chatted about. :)

The Shopify Hack: Dynamic Bundles

Daniel Manzke — Thu, 10 Feb 2022 21:07:13 GMT


Shopify is the de facto standard for online shops. It has a huge ecosystem of apps that extend its functionality.

One typical thing you find in a shop is sets. A simple set is a discounted collection of items and, depending on the implementation, a dynamic discount is applied to it.

But Shopify is only able to apply one discount. This means that once a discount has been applied, no other discount code can be used. This is often a problem, because discount codes are used for tracking and incentives, especially in cooperations with influencers.

So what could be a potential workaround? You copy items into the basket with dynamic prices. That means the prices are already discounted and you can still apply a discount code.

While we were discussing how it could be done, we also considered the downsides and potential attack vectors. If the prices are calculated dynamically, can we hack them? Can we change the prices or the discount?

Inspired by this, I started exploring several shops. Let me just say that I found them by googling. (I am not sharing them directly, to protect the corresponding stores.)

The hypothesis is that the information about the price, the discount and how it is calculated is loaded from the server and used in your browser (on the client side). Especially interesting are cases where the discount increases the more you add to the set, or where the set is created dynamically. (3 items: 5%, 5 items: 10%, …)

The first task was to find the discount. This is as simple as a quick search in the HTML source of the page.

Just changing it doesn’t work in most cases. To be able to change it, you have to “debug” the website. That means you have to hook into the loading of the page and change the value there.

Chrome lets you set a breakpoint that interrupts loading whenever the page is modified. Such modifications happen automatically when you load the page.

You have to resume a few times until the information is available. After changing the values and, in one case, also executing the JavaScript that sets the value, I had set my own discount and continued the execution.

When I then added the items to the bundle, my discounts were used. Normally this would not be an issue, as long as the price is calculated, or at least validated, on the server side.

I tested it with a few WooCommerce stores. I could trick the frontend into showing a different price, but once the items were added to the cart, the price was discounted correctly.

But in some Shopify shops, it wasn’t. That means the frontend (the web application) determines and sets the price.

The prices you can see are from the order confirmation page. That means I could successfully submit the order, and I also received the confirmation mail.

The first thing I did was inform the corresponding stores. I also checked Shopify’s bug bounty program, but third-party apps and code are not eligible.

I have only tested a few shops, but I could reproduce it in several of them, so I guess there are more that could be “hacked”. Shopify gives its customers a lot of flexibility, but this leads to a lot of potential attack vectors, and in my experience most shop owners are non-techies. That means they don’t even know that this is possible.

If you have a Shopify store with a bundle builder that you developed yourself, please make sure that the prices are at least validated on the server side. Check back with your team or agency, or reach out to me if needed.
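To make “validated on the server side” concrete, here is a minimal sketch. It does not use any Shopify API; the catalog, the SKUs and the function names are made up, and the tiers mirror the 3 items / 5% and 5 items / 10% example from above. The point is only that the server recomputes the total from data it trusts and rejects anything else.

```python
from decimal import Decimal

# Hypothetical catalog and discount tiers. A real shop would load these
# from its own backend, never from the request.
CATALOG = {
    "sku-a": Decimal("10.00"),
    "sku-b": Decimal("20.00"),
    "sku-c": Decimal("30.00"),
}
TIERS = [(5, Decimal("0.10")), (3, Decimal("0.05"))]  # (min items, rate), largest first


def bundle_discount(item_count: int) -> Decimal:
    """Return the discount rate for a bundle of `item_count` items."""
    for min_items, rate in TIERS:
        if item_count >= min_items:
            return rate
    return Decimal("0")


def server_price(skus: list[str]) -> Decimal:
    """Recompute the bundle total from trusted data only."""
    subtotal = sum((CATALOG[sku] for sku in skus), Decimal("0"))
    total = subtotal * (Decimal("1") - bundle_discount(len(skus)))
    return total.quantize(Decimal("0.01"))


def validate_order(skus: list[str], client_total: Decimal) -> bool:
    """Accept the order only if the client's total matches the server's."""
    return server_price(skus) == client_total


order = ["sku-a", "sku-b", "sku-c"]
print(server_price(order))
print(validate_order(order, Decimal("57.00")))  # honest client
print(validate_order(order, Decimal("6.00")))   # manipulated discount
```

An unknown SKU raises a KeyError in this sketch, which a real backend would turn into a rejected request.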

Co-Author Leon Claus (The Female Company)

Daniel Manzke