The Call That Started Everything
The client’s CTO opened the call with a concrete problem: their prescription processing system was failing on roughly 40% of real-world inputs. Those failures were routed to a manual review queue that had become load-bearing infrastructure. The developer who had built the original system had left 18 months earlier. A mobile app integration was due in 90 days.
We asked to see the codebase before saying anything else.
What we found was a system built on a reasonable 2019 commercial OCR library, surrounded by years of conditional logic that had accumulated without any of it being documented. Each addition had made sense at the time - a fix for handwritten abbreviations, a workaround for a specific clinic’s prescription format, a retry loop for low-confidence scans. Taken together, they had created something no current team member could reason about in full.
The 40% failure rate was not a sign the original system was poorly designed. It was a sign the system had drifted far enough from its original design assumptions that it needed to be replaced.
The business impact was clear: one operations staff member was doing nothing but processing the manual queue. New features were blocked because any change to the undocumented conditional logic was too risky. The system was costing more to maintain than it would have cost to build differently in the first place.
Why Two Prior Attempts Had Failed
McKinsey research consistently finds that companies spend between 60 and 80 percent of their IT budgets maintaining existing systems rather than building new ones. The client had tried to break out of that cycle twice.
The first attempt was a rewrite from scratch. A contractor built a new OCR system over four months. It was technically cleaner - but it didn’t match the behavior of the legacy system on the inputs the business actually received. Medicine names the legacy system had resolved were missed. Patient identity edge cases that the legacy system had handled silently broke. The contractor had no behavioral baseline to compare against, so they had no way to know what was missing until users reported failures. The rewrite was abandoned.
The second attempt was a vendor replacement - an off-the-shelf healthcare OCR API. Accuracy on clean inputs was better than the legacy system, but the API returned raw extracted text, not structured data. Integrating its output into the existing database schema would have required rebuilding most of the downstream application. The integration cost exceeded the modernization budget.
Both attempts shared the same root cause: the team tried to replace the system before they fully understood what it was doing.
Gartner estimates that 70 percent of modernization projects fail for this reason - not because the replacement technology was wrong, but because the team didn’t have a reliable map of the system’s actual behavior before they started replacing it.
Step One: A Behavioral Baseline, Not a Code Analysis
The first thing we did was add logging to the production system.
Over two weeks, we captured approximately 2,000 real prescription submissions - the input image, every conditional branch triggered, every medicine name resolved or failed, the patient identity outcome, and the final database write or queue assignment. This gave us a behavioral baseline: a record of what the system actually did on real production inputs, across the full range of formats it encountered.
This golden set became the behavioral contract for the entire migration. Every replacement component we built had to match or improve on the legacy system’s outputs on this set before handling production traffic.
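The "match or improve" contract can be sketched as a small regression check. This is a minimal illustration, not the harness used on the project: the `GoldenRecord` fields, the `compare_to_baseline` function, and the convention that `None` means "sent to the manual queue" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GoldenRecord:
    """One captured production submission (hypothetical schema)."""
    submission_id: str
    image_path: str
    legacy_medicine: Optional[str]  # None means the legacy system queued it


def compare_to_baseline(
    golden: list[GoldenRecord],
    candidate: Callable[[str], Optional[str]],
) -> dict[str, list[str]]:
    """Sort each golden record into matched / improved / regressed / still_queued."""
    report: dict[str, list[str]] = {
        "matched": [], "improved": [], "regressed": [], "still_queued": [],
    }
    for rec in golden:
        new = candidate(rec.image_path)
        if rec.legacy_medicine is None:
            # Legacy failed here. A new answer is only a *candidate* improvement
            # and still needs human confirmation that it is correct.
            bucket = "improved" if new is not None else "still_queued"
        elif new == rec.legacy_medicine:
            bucket = "matched"
        else:
            # Any change on an input the legacy system resolved gets reviewed.
            bucket = "regressed"
        report[bucket].append(rec.submission_id)
    return report
```

A replacement component would be held back from production traffic while the `regressed` list is non-empty, which is the gate the paragraph above describes.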
We then used LLM-assisted analysis to read and map the codebase - every conditional branch, every fallback path, every hardcoded value. The output was a plain-language map of what each component did.
Here is where we want to be precise about what that analysis produced, because this is where projects go wrong when they over-rely on AI tooling:
AI can describe what code does. It cannot explain why a rule was added.
Consider an example:
    if customer_id == 1234:
        skip_verification()
An LLM analysis correctly identifies this as a verification bypass for a specific customer ID. It cannot tell you whether that customer is a test account, a long-standing enterprise client with a contractual exception, a developer shortcut that never got removed, or a compliance workaround for a specific regulatory situation. That answer lives in someone’s memory, a Slack thread, or a contract - not in the code.
We found several rules like this in the client’s system. For each one, we had to go to the remaining team members and, in some cases, dig through email history to understand the original intent. Some rules were still valid. Two had been superseded by changes elsewhere in the system and could be safely removed. One reflected a client-specific exception that needed to be replicated exactly in the replacement.
The LLM analysis compressed the discovery process from weeks to days. The stakeholder validation work - confirming why each rule existed and whether it was still needed - could not be compressed. That part required humans with institutional knowledge, and no tooling changes that.
Three findings from the combined analysis shaped the migration plan:
Finding 1: The 40% queue rate was not random. It clustered around three input types - handwritten prescriptions, regional brand-name drugs not in the hardcoded medicine list, and poor-lighting photographs. The rest of the system handled its inputs correctly. Most of it did not need to be replaced.
Finding 2: The medicine name lookup was the highest-leverage failure point. The static list had been manually maintained and had coverage gaps for regional generics and newer branded molecules. Every failure that reached the manual queue traced to a name the list didn’t recognize.
Finding 3: The patient identity resolution logic was correct. It handled multi-patient households, ambiguous name matches, and confirmation flows accurately. Replacing it would introduce risk with no corresponding benefit.
The Replacement Strategy: What Moved, What Stayed, What Changed
We need to be direct about something that often gets glossed over in modernization case studies: you don’t eliminate complexity, you move it.
The old system handled edge cases through conditional logic in the application code. The new system handles those same edge cases - they have to go somewhere - through Vision LLM prompt design, RAG pipeline threshold tuning, and agent orchestration logic. The complexity still exists. What changed is where it lives and how maintainable it is.
In the old system, complexity lived in undocumented conditional branches. When something broke, the only way to diagnose it was to read the code and guess which branch had triggered. When a new edge case appeared, a developer added another branch. There was no systematic way to test whether a change had broken anything else.
In the new system, complexity lives in components that can be independently versioned, tested, and monitored. A prompt change is reviewable. A RAG threshold change can be validated against the golden set before deployment. An agent tool-call failure produces a structured log entry that can be replayed. The diagnostic process changes from reading undocumented code to querying structured logs.
That’s the actual value proposition - not “less complexity” but “complexity in a form you can work with.”
The migration itself had four phases, each independently testable and each with its own rollback path:
Phase 1: Replace the OCR engine with a Vision LLM.
The commercial OCR library output raw character strings with no understanding of document structure. A Vision LLM (we used Google Gemini) extracts structured fields - medicine name, patient name, dosage, frequency, clinic - in a single pass, handling handwritten text, abbreviations, and variable image quality.
What this moved: the conditional logic that existed to normalize raw OCR output - handling abbreviations, splitting combined entries, cleaning whitespace - became part of the extraction prompt and the validation layer. The branching is still there; it’s in the prompt design and the structured output schema instead of the application code.
Before deploying to production, we ran the Vision LLM against the full 2,000-item golden set and compared outputs against what the legacy OCR had produced. Zero regressions on the inputs the legacy system had processed correctly. Significant improvement on the inputs it had failed.
Rollback path: the legacy OCR engine ran in parallel for four weeks. Any prescription returning a Vision LLM confidence score below the calibrated threshold was automatically rerouted to the legacy OCR path.
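The Phase 1 rollback rule amounts to a confidence gate in front of two extractors. The sketch below is illustrative only: the `Extraction` type, the callable signatures, and the `0.80` threshold are placeholders, not the calibrated value or the real Vision LLM client.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Extraction:
    fields: dict[str, str]
    confidence: float  # assumed to be normalized to [0.0, 1.0]
    source: str        # which path produced the result


def extract_with_fallback(
    image: bytes,
    vision_extract: Callable[[bytes], Extraction],
    legacy_extract: Callable[[bytes], Extraction],
    threshold: float = 0.80,  # placeholder, not the calibrated threshold
) -> Extraction:
    """Use the Vision LLM result unless its confidence is below the threshold."""
    primary = vision_extract(image)
    if primary.confidence >= threshold:
        return primary
    # Low confidence: reroute to the legacy OCR path that is running in parallel.
    return legacy_extract(image)
```

Recording `source` on every result makes it possible to count how often the fallback fires, which is useful evidence when deciding that the parallel path can be retired.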
Phase 2: Replace the static medicine list with a four-stage RAG pipeline.
A static list cannot expand automatically. The new pipeline has four sequential stages: in-memory cache for high-frequency drugs, exact database match, trigram fuzzy search for OCR noise and spelling variants, and vector semantic search for phonetic variants and partial matches. Medicines that pass no stage are queued for automated scraping and added to the database.
What this moved: the coverage decisions that had been made implicitly by choosing which drugs to add to the static list became explicit in the RAG pipeline - threshold tuning, confidence cutoffs, the definition of “close enough” for fuzzy matching. These decisions are now documented in code and tunable from data rather than locked in a spreadsheet someone last updated nine months ago.
Threshold tuning required significant calibration against the golden set. Too permissive on fuzzy matching produces false positives - a medicine name matched to the wrong drug. Too strict increases the queue rate. We iterated through several threshold combinations until the new pipeline matched the legacy list’s resolution rate on clean inputs while outperforming it on the noisy inputs that had previously gone to the manual queue.
Rollback path: the legacy medicine list remained as a fallback for any name the new pipeline returned below the confidence threshold.
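The four-stage cascade can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the cache is a plain dict, the database is a set of canonical names, trigram similarity is a simple Jaccard score computed in Python, `vector_search` is a caller-supplied placeholder for the semantic stage, and the `0.5` threshold is arbitrary rather than tuned.

```python
from typing import Callable, Optional


def trigrams(text: str) -> set[str]:
    """Character trigrams of a lowercased string, padded so word edges count."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the two trigram sets, in [0.0, 1.0]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def resolve_medicine(
    name: str,
    cache: dict[str, str],
    database: set[str],
    vector_search: Callable[[str], Optional[str]],
    fuzzy_threshold: float = 0.5,  # illustrative, not a calibrated cutoff
) -> tuple[Optional[str], str]:
    """Return (canonical_name, stage); stage is recorded for hit-rate monitoring."""
    key = name.strip().lower()
    # Stage 1: in-memory cache for high-frequency drugs
    if key in cache:
        return cache[key], "cache"
    # Stage 2: exact (case-insensitive) database match
    for canonical in database:
        if canonical.lower() == key:
            return canonical, "exact"
    # Stage 3: trigram fuzzy match for OCR noise and spelling variants
    best = max(database, key=lambda c: trigram_similarity(key, c), default=None)
    if best is not None and trigram_similarity(key, best) >= fuzzy_threshold:
        return best, "fuzzy"
    # Stage 4: semantic search, supplied by the caller
    hit = vector_search(name)
    if hit is not None:
        return hit, "vector"
    # No stage matched: the caller would queue the name for scraping
    return None, "unresolved"
```

Returning the stage name alongside the match is what makes per-stage hit rates cheap to compute later. Raising `fuzzy_threshold` trades false positives for a higher queue rate, which is the calibration trade-off described above.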
Phase 3: Wrap the patient identity logic, don’t replace it.
The behavioral analysis confirmed this component was correct. We extracted it, wrote a behavioral test suite documenting its outputs for every input class in the golden set, and wrapped it behind a clean API interface. The new system called it identically to how the old system had.
This was the single highest-leverage decision in the project. Replacing a working component to make it cleaner is one of the most common sources of regression in modernization projects. If the behavior is correct and the blast radius of a change is high, wrap it and move on.
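A behavioral test suite of this kind is often built as characterization tests: record what the legacy function returns today, then assert that the wrapped version returns the same thing. The helpers below are a generic sketch under that assumption; the function names are hypothetical and the real identity logic is not reproduced here.

```python
from typing import Any, Callable


def record_characterization(
    legacy_fn: Callable[..., Any],
    inputs: list[tuple],
) -> list[tuple[tuple, Any]]:
    """Run the legacy function and keep (args, output) pairs as the expected behavior."""
    return [(args, legacy_fn(*args)) for args in inputs]


def check_characterization(
    wrapped_fn: Callable[..., Any],
    recorded: list[tuple[tuple, Any]],
) -> list[tuple]:
    """Return the argument tuples whose output differs from the recording."""
    return [args for args, expected in recorded if wrapped_fn(*args) != expected]
```

The recording is made once against the untouched legacy code, with inputs drawn from each class in the golden set. After that, an empty list from `check_characterization` is the condition for shipping any change to the wrapper.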
Phase 4: Replace the accumulated conditional logic with a ReAct agent.
The legacy system processed prescriptions through a linear pipeline with hardcoded fallbacks and no structured logging. Debugging a production failure required reading the code and tracing which branch had triggered - which was difficult because the branches interacted in ways that weren’t documented anywhere.
The ReAct agent orchestrates the extraction and matching steps with complete tool-call logging. Every tool call is recorded with its input and output. Any production failure can be replayed exactly from those logs. The complexity of the orchestration is still present - the agent has to make decisions about which tools to call and in what order - but those decisions are visible and testable.
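The logging layer can be sketched independently of any agent framework. The example below is not the agent itself and assumes several things: tools are plain functions called with keyword arguments, the module-level `TRACE` list stands in for a real structured log sink, and inputs and outputs are JSON-serializable (anything else is stringified).

```python
import functools
import json
import time
from typing import Any, Callable

TRACE: list[dict[str, Any]] = []  # stand-in for a structured log sink


def logged_tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Record every call's input, output or error, and latency."""
    @functools.wraps(fn)
    def wrapper(**kwargs: Any) -> Any:
        start = time.perf_counter()
        entry: dict[str, Any] = {"tool": fn.__name__, "input": kwargs}
        try:
            entry["output"] = fn(**kwargs)
            return entry["output"]
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            entry["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
            # Round-trip through JSON so the stored entry is a serializable copy
            TRACE.append(json.loads(json.dumps(entry, default=str)))
    return wrapper


def replay(trace: list[dict[str, Any]], tools: dict[str, Callable[..., Any]]) -> list[str]:
    """Re-run logged calls against `tools`; return names whose output changed."""
    changed = []
    for entry in list(trace):
        if "error" in entry:
            continue  # failed calls have no recorded output to compare
        result = tools[entry["tool"]](**entry["input"])
        if json.loads(json.dumps(result, default=str)) != entry["output"]:
            changed.append(entry["tool"])
    return changed
```

Passing undecorated functions to `replay` (for example via the `__wrapped__` attribute that `functools.wraps` sets) keeps the replay from appending new entries to the trace it is reading.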
The Traffic Migration: Six Weeks of Parallel Operation
We did not cut over to the new system on a single date. We migrated traffic incrementally over six weeks, with automatic fallback at every stage.
Weeks 1–2: 5% of traffic. The old system handled the rest. Output discrepancies between the two systems were logged and reviewed daily. The most common discrepancy was medicine name formatting - the new system returned canonical names where the legacy system had returned raw OCR output. These weren’t errors, but they required downstream validation to confirm they were improvements rather than regressions.
Weeks 3–4: 25% of traffic. We calculated per-stage hit rates for the medicine RAG pipeline on live inputs - not just the golden set. The vector search stage was handling 11% of resolutions, within the expected 8–15% range, indicating the pipeline was behaving as designed on real production inputs.
Week 5: 50% of traffic. The manual queue rate dropped from 40% to under 8% of total submissions. The operations staff member assigned to the queue was transitioned to other work during this week.
Week 6: 100% of traffic. The legacy system was kept warm for two additional weeks before decommissioning.
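One common way to implement a percentage split like the schedule above is deterministic hash bucketing. The source does not say how traffic was divided, so treat this as an assumed approach; `routes_to_new_system` and the use of a submission ID as the routing key are hypothetical.

```python
import hashlib


def routes_to_new_system(submission_id: str, rollout_percent: int) -> bool:
    """Map an ID to a stable bucket in 0..99 and compare it to the rollout level."""
    digest = hashlib.sha256(submission_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the ID, the same submission always takes the same path at a given rollout level, and anything routed to the new system at 5% stays there at 25% and 50%. That stability makes daily discrepancy reviews easier to reproduce.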
The data migration - moving years of patient records, prescription scans, and medicine match histories from the old schema to the new one - ran as a separate workstream after the application layer was stable. We used a dual-write pattern: the old schema remained the source of truth for reads, and new records were written to both systems in parallel during a two-week overlap period. Nightly reconciliation checks compared row counts and spot-checked field values against known records. No cutover happened until reconciliation passed on three consecutive nights.
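The nightly reconciliation and the three-night gate can be sketched as two small functions. This is a simplified illustration: rows are modeled as dicts keyed by record ID, both sides are assumed to be already mapped to a comparable shape (the real schemas differ), and the function names are hypothetical.

```python
def reconcile(
    old_rows: dict[str, dict],
    new_rows: dict[str, dict],
    sample_ids: list[str],
) -> bool:
    """One nightly check: row counts match and every sampled record is identical."""
    if len(old_rows) != len(new_rows):
        return False
    return all(
        rid in old_rows and rid in new_rows and old_rows[rid] == new_rows[rid]
        for rid in sample_ids
    )


def cutover_allowed(nightly_results: list[bool], required_streak: int = 3) -> bool:
    """True once the most recent `required_streak` nightly checks all passed."""
    if len(nightly_results) < required_streak:
        return False
    return all(nightly_results[-required_streak:])
```

Checking only the tail of the history means a single failed night resets the count, so the streak has to be rebuilt from scratch before cutover.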
Observability: What We Tracked and Why
AI systems require different monitoring than deterministic systems. You’re not just watching for errors - you’re watching for distribution shifts that indicate something upstream has changed.
We tracked four things continuously after go-live:
Per-stage hit rates in the medicine pipeline. If the in-memory cache dropped from 35% to 2%, either the cache wasn’t warming correctly or the input distribution had shifted. If the vector search stage climbed above 20%, the exact and fuzzy stages were missing inputs they shouldn’t be missing. Alerts fired on sustained deviation from established baselines, not on individual anomalies.
Vision LLM extraction confidence scores. We tracked the distribution of per-field confidence scores over time, with alerts on a shift in the low-confidence tail. A gradual drift here is a leading indicator that input quality is changing - new prescription formats, different image types - before it shows up as an increased queue rate.
Queue rate as a system health proxy. The rate of prescriptions entering the manual confirmation queue was the most reliable single indicator. A baseline of under 8% was established from the golden set. Sustained queue rate above 15% triggered an incident review.
Agent tool-call latency by stage. The ReAct agent logs every tool call with timing. When latency spikes, we can see exactly which stage caused it - something that was completely invisible in the legacy system, which had no structured instrumentation at all.
Every production failure was replayable: the agent’s full tool-call trace, the exact Vision LLM output, and the input image were all logged. Diagnosing a production issue typically took minutes rather than the hours required with the original system.
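Alerting on sustained deviation rather than single anomalies can be sketched with a fixed-size window. The class name, the window length, and the idea of one measurement per reporting interval are assumptions; the 8-15% band in the usage note is the expected vector-search range quoted earlier.

```python
from collections import deque


class SustainedDeviationAlert:
    """Fire only when every measurement in the window is outside the expected band."""

    def __init__(self, low: float, high: float, window: int = 3) -> None:
        self.low = low
        self.high = high
        # Oldest entries fall off automatically once the window is full
        self.recent: deque = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one measurement; return True if the alert should fire."""
        out_of_band = not (self.low <= value <= self.high)
        self.recent.append(out_of_band)
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

With `SustainedDeviationAlert(0.08, 0.15, window=3)`, a single interval at 22% does not fire, but three consecutive intervals above 15% do. The same class could watch the queue rate or the cache hit rate with different bounds.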
Results
By end of week six:
Manual queue dropped from 40% to under 8%. The cases that remain in the queue are genuinely ambiguous - new prescription formats, heavily degraded images, medicines with no prior occurrence - and legitimately require human judgment. The original 40% was the system failing on inputs it should have handled automatically.
Medicine identification accuracy reached approximately 95 percent. The four-stage pipeline resolved brand names, generics, abbreviations, and OCR noise that the static list had been failing on for two years. Coverage improves continuously as the auto-scraper adds new medicines.
Zero OCR data lost during identity confirmation. The original system discarded prescription data silently if identity resolution failed. The new system holds unresolved prescriptions in a confirmation queue while returning medicine data immediately - no input is lost.
The mobile app launched within 90 days. The modernization produced a documented API as a byproduct. The previous system had no API - every integration had coupled directly to the database. The new API allowed mobile frontend development to run in parallel with the final weeks of traffic migration.
Full technical architecture is in the OCR prescription processing case study.
What We’d Do Differently
Build the golden set before the scoping conversation. We added logging to the legacy system at the start of the engagement and waited two weeks before beginning analysis. That was the right sequence, but the two-week capture could have started before we were formally engaged - earlier data means more edge cases in the baseline.
Budget more time for stakeholder rule validation. The rules that required human explanation to understand - the exceptions that existed for business or political reasons, not technical ones - took longer to validate than we had initially estimated. In a system with years of accumulated logic, assume several of the rules have non-obvious histories that a developer interview won’t immediately surface.
Negotiate a longer parallel operation window for higher-stakes systems. Six weeks was sufficient here. For a system with higher transaction volume, stronger regulatory requirements, or more complex failure modes, eight to ten weeks would have been more comfortable.
What This Pattern Looks Like in Other Contexts
The sequence - behavioral baseline first, component-by-component replacement with rollback paths, incremental traffic migration, continuous observability - applies wherever critical systems have accumulated undocumented complexity.
In banking, IBM estimates 95 billion lines of COBOL are still in active production, processing approximately $3 trillion in daily transactions. The challenge isn’t the technology - it’s that golden-set creation at scale requires instrumentation before any replacement begins, and the stakeholder validation of extracted business rules is more time-consuming as systems get older and institutional memory degrades.
In insurance, mainframe policy calculation engines with pricing rules from the 1990s present the same stakeholder challenge in more acute form. Some of those rules exist because of regulatory requirements that have since changed, or client exceptions that were negotiated long before anyone currently at the company started. AI can surface the rules. Humans have to determine which ones are still valid.
McKinsey Digital documents that successful legacy modernizations typically see 20 to 30 percent reductions in IT operating costs. The technical approach matters, but the bigger variable is whether the team invested in a behavioral baseline and stakeholder validation before writing replacement code. The teams that skip those steps are the ones rebuilding for a third time.
If Your Organization Is in This Situation
The indicators we look for: a system that’s been in production for more than two years, patched by more than one developer, with no behavioral test suite, producing operational cost through manual workarounds or blocking new features because changes feel too risky.
If that describes something you’re carrying, the starting point isn’t planning a replacement. It’s building the behavioral baseline - logging production inputs and outputs for two to four weeks, so you have a ground-truth record of what the system actually does before you decide what to replace.
Talk to the Aviasole engineering team about what that looks like for your system, or review the OCR prescription processing case study for the full technical breakdown.