The Assumption Everyone Gets Wrong About RAG
Open any RAG tutorial published in the last two years. There’s almost always an LLM in the retrieval loop - an LLM to rerank results, an LLM to judge relevance, an LLM to reformulate the query before the actual search happens.
The assumption baked into all of this: retrieval is too fuzzy to be trusted on its own. You need a model to make judgment calls.
For general knowledge bases, casual Q&A, or open-domain chat assistants - sure, that’s reasonable. But for financial documents? That assumption is backwards.
When an analyst asks “Why did Progressive’s combined ratio improve in Q3 2024?” - there is a right answer. It’s in a specific filing. From a specific quarter. From a specific company. The answer either matches the query metadata or it doesn’t. There’s no fuzzy middle ground where “close enough” is acceptable. Wrong company means wrong answer. Wrong quarter means outdated answer. Either way, the number you cite in a board meeting is wrong.
This is exactly why we built FinRAG - a deterministic, hybrid retrieval pipeline for US insurance and managed-care financial documents that retrieves accurate context with zero LLM calls in the retrieval path, no API keys, and no nondeterminism.
The source code is public on GitHub. This post explains the design decisions behind it, what makes it different, and when you should (and shouldn’t) use this approach.
Why Financial Documents Break Standard RAG
Before diving into the architecture, it’s worth being specific about what goes wrong when you apply standard RAG to financial filings.
The failure modes aren’t edge cases. They’re the norm in financial document retrieval: industry terminology is homogeneous across competitors, the time dimension is semantically invisible to embedding models, and multi-word domain concepts get fragmented by tokenizers built for general English.
Standard RAG wasn’t designed for this. FinRAG was.
The Architecture: Four Stages, Zero Hallucination Risk
FinRAG retrieves in four deterministic stages. There’s no trained model, no API call, no randomness. Run the same query twice and you get the same results.
What makes this architecture interesting isn’t any single stage - it’s the combination, and what each stage is not doing that everyone else does.
Stage 1: Hard Filtering (Not Soft Penalties)
This is the most important design decision in the whole pipeline. Most retrieval systems apply soft penalties for metadata mismatches - multiply the score by 0.7 if the company doesn’t match, by 0.8 if the year is off. The intent is to surface wrong-company results at lower ranks rather than discarding them entirely.
The problem is that soft penalties still allow wrong answers to appear. They just appear lower. And in financial analysis, rank 4 is still the wrong answer if it’s from the wrong company.
FinRAG hard-filters. A Progressive query excludes every non-Progressive chunk before any scoring begins. There is no Progressive Q2 result ranked fifth. There is no Allstate result anywhere in the list. The corpus is reduced to only matching candidates, then BM25 and vector search compete within that filtered set.
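A minimal sketch of the hard-filtering idea, assuming a simplified chunk record with `company`, `year`, and `quarter` fields. The `Chunk` class, the `hard_filter` function, and the placeholder texts are hypothetical illustrations, not FinRAG's actual types or data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Chunk:
    # Hypothetical chunk record; the real schema may differ.
    text: str
    company: str
    year: int
    quarter: str


def hard_filter(
    chunks: List[Chunk],
    company: Optional[str] = None,
    year: Optional[int] = None,
    quarter: Optional[str] = None,
) -> List[Chunk]:
    """Keep only chunks that match every filter that is set.

    A filter left as None imposes no constraint. Non-matching chunks are
    dropped entirely instead of being down-weighted.
    """
    return [
        c for c in chunks
        if (company is None or c.company == company)
        and (year is None or c.year == year)
        and (quarter is None or c.quarter == quarter)
    ]


# Placeholder corpus: the texts are stand-ins, not real filing content.
corpus = [
    Chunk("placeholder underwriting commentary", "Progressive Corporation", 2024, "Q3"),
    Chunk("placeholder underwriting commentary", "Progressive Corporation", 2024, "Q2"),
    Chunk("placeholder underwriting commentary", "Allstate", 2024, "Q3"),
]

candidates = hard_filter(corpus, company="Progressive Corporation", year=2024, quarter="Q3")
print(len(candidates))  # only the Progressive Q3 2024 chunk is scored downstream
```

Scoring then runs only over `candidates`, so a wrong-company or wrong-quarter chunk cannot appear at any rank.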
The technical challenge is doing this reliably. Alias resolution needs to be collision-safe: “american” could match American International Group or American Financial Group. FinRAG’s two-pass approach registers an alias only when a word-boundary regex match against every canonical name hits exactly one company. Short ambiguous aliases aren’t registered at all - a query containing only such an alias falls back to full-corpus search with no company filter rather than silently resolving to the wrong entity.
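One way to sketch collision-safe alias registration is shown below. It assumes candidate aliases are simply the individual words of each canonical name and that a minimum length of four characters stands in for the "short alias" rule. Both choices, and the `build_alias_table` name, are assumptions for illustration rather than the repository's actual logic.

```python
import re
from typing import Dict, List


def build_alias_table(canonicals: List[str], min_len: int = 4) -> Dict[str, str]:
    # Pass 1: propose every sufficiently long word of every canonical name.
    candidates = set()
    for name in canonicals:
        for word in name.lower().split():
            if len(word) >= min_len:
                candidates.add(word)

    # Pass 2: register a candidate only if a word-boundary match
    # against all canonical names hits exactly one company.
    aliases: Dict[str, str] = {}
    for alias in sorted(candidates):
        pattern = re.compile(rf"\b{re.escape(alias)}\b", re.IGNORECASE)
        hits = [name for name in canonicals if pattern.search(name)]
        if len(hits) == 1:
            aliases[alias] = hits[0]
    return aliases


canonicals = [
    "American International Group",
    "American Financial Group",
    "Progressive Corporation",
]
aliases = build_alias_table(canonicals)
print(aliases.get("american"))     # None: ambiguous, so never registered
print(aliases.get("progressive"))  # Progressive Corporation
```

Because "american" and "group" each match two canonical names, neither is registered, and a query containing only those words gets no company filter.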
Stage 2a: Dual-Field BM25 (Fixing Length Normalization)
Standard BM25 implementations break on domain-specific corpora. Here’s why.
When you store both unigrams and bigrams in a single BM25 field, a 10-word chunk appears to have 19 tokens (10 unigrams + 9 bigrams). The average document length inflates by roughly 2×, which breaks the length normalization term in the BM25 formula. Chunks with domain phrases like “net written premiums” score lower than they should because their effective length looks artificially long.
FinRAG maintains two completely separate fields with separate average document lengths and separate document frequency tables:
- Unigrams field: individual tokens, weight 1.0×
- Phrases field: bigrams + trigrams, weight 1.5×
The final BM25 score is the sum across both fields. Insurance phrases - combined ratio, loss ratio, net written premiums, medical loss ratio - score as phrases with proper length normalization. A chunk that contains the full three-word phrase scores measurably higher than one that contains the words scattered across different sentences.
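The dual-field idea can be sketched as two independent BM25 indexes whose scores are summed with field weights. This is a simplified illustration: whitespace tokenization, the non-negative IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))`, and the class names `BM25Field` and `DualFieldBM25` are all assumptions, not the project's code.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrases(tokens: List[str]) -> List[str]:
    # Phrase field content: bigrams plus trigrams.
    return ngrams(tokens, 2) + ngrams(tokens, 3)


class BM25Field:
    """One field with its own document lengths, average length, and df table."""

    def __init__(self, docs: List[List[str]], k1: float = 1.5, b: float = 0.75):
        self.tf = [Counter(d) for d in docs]
        self.lengths = [len(d) for d in docs]
        self.avgdl = sum(self.lengths) / len(docs) if docs else 0.0
        self.df = Counter(term for counts in self.tf for term in counts)
        self.k1, self.b = k1, b

    def score(self, query_terms: List[str], i: int) -> float:
        n, dl, total = len(self.tf), self.lengths[i], 0.0
        for term in query_terms:
            tf = self.tf[i].get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += idf * tf * (self.k1 + 1) / norm
        return total


class DualFieldBM25:
    def __init__(self, texts: List[str], phrase_weight: float = 1.5):
        tokens = [t.lower().split() for t in texts]
        self.unigrams = BM25Field(tokens)
        self.phrase_field = BM25Field([phrases(t) for t in tokens])
        self.phrase_weight = phrase_weight

    def score(self, query: str, i: int) -> float:
        q = query.lower().split()
        return (1.0 * self.unigrams.score(q, i)
                + self.phrase_weight * self.phrase_field.score(phrases(q), i))


texts = [
    "net written premiums grew during the quarter",   # contains the full phrase
    "premiums written were net of reinsurance costs", # same words, scattered
]
bm25 = DualFieldBM25(texts)
query = "net written premiums"
print(bm25.score(query, 0) > bm25.score(query, 1))
```

Both toy documents contain all three query words and have the same length, so the unigram field scores them identically. The difference comes entirely from the phrase field, which normalizes against phrase counts instead of an inflated mixed-token length.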
The k1=1.5, b=0.75 parameters are common defaults that fall within the ranges Robertson & Zaragoza report as working well. We didn’t tune them - values in these ranges hold up across many retrieval benchmarks and work well out of the box for this corpus size.
Stage 2b: Vector Search (Designed to Be Swapped)
The vector retriever ships with HashingSemanticEmbedder - a local, zero-dependency embedder built from token hashing + 9 financial concept dimensions. It’s demo-grade: fast, deterministic, no external dependencies, but not a production embedding model.
The critical design decision here is that the embedder is a drop-in interface: anything that implements embed(text: str) -> list[float] works. Replace it with bge-small-en-v1.5, text-embedding-3-small, FinBERT, or E5-small in a single line:
pipeline = HybridRetrievalPipeline(chunks, embedder=MyEmbedder())
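To make the interface concrete, here is a sketch of a class that satisfies `embed(text: str) -> list[float]`. The `Embedder` protocol and `ToyHashingEmbedder` are hypothetical names for illustration. This is not the shipped HashingSemanticEmbedder, only a token-hashing toy that shows the contract and why a stable hash matters for determinism.

```python
import hashlib
import math
from typing import List, Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> List[float]: ...


class ToyHashingEmbedder:
    """Bag-of-tokens hashing embedder with L2 normalization (illustrative only)."""

    def __init__(self, dims: int = 64):
        self.dims = dims

    def embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dims
        for token in text.lower().split():
            # hashlib is stable across runs; Python's built-in hash() for
            # strings is salted per process and would break reproducibility.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dims
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]


def cosine(a: List[float], b: List[float]) -> float:
    # Inputs are already unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


embedder: Embedder = ToyHashingEmbedder()
v1 = embedder.embed("combined ratio improved")
v2 = embedder.embed("combined ratio improved")
print(v1 == v2)  # True: same input, same vector
```

A wrapper around a hosted or local model only needs to expose the same `embed` method to be passed in the same way.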
For the financial concept signal, the built-in embedder has 9 dedicated dimensions: profitability, growth, decline, revenue, cash flow, debt, insurance ratios, claims, and capital. Matches are scored with sqrt(match_count) weighting so multi-mention doesn’t dominate. It’s a hand-crafted financial ontology, not learned - which is exactly why it’s deterministic and interpretable.
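The sqrt(match_count) damping can be illustrated with a tiny, made-up vocabulary. The three concept names come from the list above, but the term sets and the `concept_features` function are assumptions for this sketch, not the built-in ontology.

```python
import math
from typing import Dict, List, Set

# Hypothetical term lists for three of the nine concept dimensions.
CONCEPT_TERMS: Dict[str, Set[str]] = {
    "profitability": {"profit", "margin", "income"},
    "claims": {"claims", "losses", "catastrophe"},
    "capital": {"capital", "buyback", "dividend"},
}


def concept_features(text: str) -> List[float]:
    """Return one value per concept: the square root of its match count."""
    tokens = text.lower().split()
    return [
        math.sqrt(sum(1 for t in tokens if t in terms))
        for terms in CONCEPT_TERMS.values()
    ]


print(concept_features("claims claims claims claims"))  # [0.0, 2.0, 0.0]
```

Four mentions contribute 2.0 instead of 4.0, so a chunk that repeats a term many times gains less from each additional mention.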
Stage 3: Reciprocal Rank Fusion (No Training Required)
After both retrievers return their ranked lists, RRF merges them:
RRF_score(chunk) = Σ 1 / (k + rank_in_list)
With k=60 (the Cormack et al. 2009 default). Chunks appearing in both lists get contributions from both terms. A chunk that ranks 3rd in keyword search and 5th in vector search scores 1/(60+3) + 1/(60+5) = 0.0159 + 0.0154 = 0.0313. A chunk that only appears in one list at rank 3 scores 0.0159.
Chunks retrieved by both methods - matching exact terms and semantic meaning - naturally rank at the top. This is the right signal: double retrieval is the strongest evidence of relevance.
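The fusion step is small enough to sketch in full. The function below assumes each retriever returns an ordered list of chunk IDs, and it breaks score ties by chunk ID to stay deterministic. The `rrf_fuse` name and the tie-break rule are assumptions, not necessarily what the repository does.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def rrf_fuse(ranked_lists: Dict[str, List[str]], k: int = 60) -> List[Tuple[str, float, Set[str]]]:
    """Merge ranked lists with Reciprocal Rank Fusion.

    ranked_lists maps a source name (e.g. "keyword") to chunk IDs in rank order.
    Returns (chunk_id, score, contributing_sources) sorted best first.
    """
    scores: Dict[str, float] = defaultdict(float)
    sources: Dict[str, Set[str]] = defaultdict(set)
    for source, chunk_ids in ranked_lists.items():
        for rank, chunk_id in enumerate(chunk_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
            sources[chunk_id].add(source)
    # Sort by descending score, then by ID so equal scores order stably.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [(cid, scores[cid], sources[cid]) for cid in ordered]


fused = rrf_fuse({
    "keyword": ["a", "b", "c"],           # "c" is rank 3 here
    "vector": ["x", "y", "z", "w", "c"],  # and rank 5 here
})
top_id, top_score, top_sources = fused[0]
print(top_id, round(top_score, 4))  # c 0.0313
```

The chunk found by both retrievers outranks each list's first-place result, which scores only 1/(60+1) ≈ 0.0164 on its own.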
The alternative - a learned cross-encoder reranker - needs training data, adds 50–200ms latency, and introduces nondeterminism. For a corpus with stable schema and known domain vocabulary, that cost is hard to justify: in the original paper, RRF outperformed Condorcet fusion and the individual learning-to-rank methods it was compared against, and it runs in microseconds.
The Stopword Problem Nobody Talks About
There’s a subtle bug that almost every financial RAG system ships with, and it bit us during development.
Query: “all combined ratio Q3 2024”
The word “all” is a common English word, but it also appears constantly in Allstate filings: “across all segments”, “approved in all 47 states”, “all P&C lines”. Without stopword removal, a BM25 scorer will rank Allstate chunks above Progressive for a query that contains no company name at all.
FinRAG’s tokenizer removes ~35 stop words specifically selected to prevent this class of false positive - including natural language filler that happens to overlap with company names or financial entity keywords. The tokenizer is shared between BM25 and the metadata filter, so stripping happens consistently at both stages.
The metadata filter also strips its extracted terms from the ranking query before it reaches BM25 and vector search. “Progressive combined ratio Q3 2024” becomes just “combined ratio” after metadata extraction. This prevents the company name from artificially boosting chunks that mention the company frequently in boilerplate text rather than in relevant financial data.
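The two behaviors described here, stopword removal and stripping extracted metadata from the ranking query, can be sketched together. The stopword subset, the alias table, and the `parse_query` function are hypothetical stand-ins. The real tokenizer uses its own list of roughly 35 words and returns a `QueryFilters` object, not a dict.

```python
import re
from typing import Dict

# Hypothetical subset of the stopword list and alias table.
STOPWORDS = {"all", "the", "of", "in", "for", "and", "a", "what", "was"}
COMPANY_ALIASES = {
    "progressive": "Progressive Corporation",
    "allstate": "Allstate",
}


def parse_query(query: str) -> Dict[str, object]:
    """Split a query into metadata filters and the remaining ranking terms."""
    companies, years, quarters, remaining = set(), set(), set(), []
    for token in re.findall(r"[a-z0-9&]+", query.lower()):
        if token in STOPWORDS:
            continue  # dropped before it can act as a filter or a ranking term
        if token in COMPANY_ALIASES:
            companies.add(COMPANY_ALIASES[token])
        elif re.fullmatch(r"q[1-4]", token):
            quarters.add(token.upper())
        elif re.fullmatch(r"20\d{2}", token):
            years.add(int(token))
        else:
            remaining.append(token)  # only these reach BM25 and vector search
    return {
        "companies": companies,
        "years": years,
        "quarters": quarters,
        "ranking_query": " ".join(remaining),
    }


print(parse_query("Progressive combined ratio Q3 2024")["ranking_query"])  # combined ratio
print(parse_query("all combined ratio Q3 2024")["companies"])              # set()
```

In this sketch, “all” never reaches the scorer and never resolves to a company, so the second query filters on year and quarter only.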
Query Tracing: Full Transparency
One thing that matters enormously in financial applications is explainability. When a retrieval system surfaces a chunk that says Progressive’s combined ratio was 89.4%, the user needs to know: Was this the right filter? Did it search the right corpus?
Every search response in FinRAG includes a trace object:
response = pipeline.search("Progressive combined ratio Q3 2024")
print(response.trace.filters)
# QueryFilters(companies={'Progressive Corporation'}, years={2024}, quarters={'Q3'})
print(response.trace.candidate_count)
# 11 ← chunks that survived hard filtering
print(response.trace.ranking_query)
# "combined ratio" ← what BM25/vector actually searched
for result in response.results:
    print(f"Rank {result.rank} | score={result.score:.4f} | sources={result.contributing_sources}")
    # contributing_sources = {"keyword", "vector"} or {"keyword"} or {"vector"}
The contributing_sources field tells you whether a chunk was found by keyword search, vector search, or both. Both means high confidence. Single-source results are lower confidence, even if they rank well due to strong signal in one modality.
This traceability is what makes the system auditable - important for anything touching regulatory filings, investor communications, or compliance reporting.
What Lives in the Corpus
The built-in dataset covers 47 chunks across 7 major US insurers and managed-care companies: Progressive, Allstate, UnitedHealth, AIG, MetLife, Cigna, and Travelers - spanning 2023–2025 with Q1–Q4 coverage for most.
Metrics indexed: combined ratio, loss ratio, expense ratio, net written premiums, earned premiums, medical loss ratio (MLR for managed care), RBC capital ratio, investment income, catastrophe losses, reserve development, EPS, book value, and capital returns.
The Streamlit UI accepts PDF uploads directly - drag a 10-Q filing or earnings release, and the pdf_ingestor chunks it, auto-detects company from the filename, extracts year and quarter from either the filename or body text, and tags each chunk with a section label (underwriting, investment, capital, claims, revenue, profitability) based on vocabulary density in that window.
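As an illustration of the filename-then-body fallback for year and quarter, here is a regex-based sketch. The patterns, the `extract_period` name, and the example filenames are assumptions. The actual pdf_ingestor may use different rules.

```python
import re
from typing import Optional, Tuple

# "Q" followed by 1-4, not embedded in a longer token such as "10Q".
_QUARTER = re.compile(r"(?<![A-Za-z0-9])Q([1-4])(?![0-9])", re.IGNORECASE)
# A standalone four-digit year in the 2000s.
_YEAR = re.compile(r"(?<![0-9])(20[0-9]{2})(?![0-9])")


def extract_period(filename: str, body: str = "") -> Tuple[Optional[int], Optional[str]]:
    """Return (year, quarter), preferring the filename and falling back to body text."""
    year: Optional[int] = None
    quarter: Optional[str] = None
    for source in (filename, body):
        if year is None:
            match = _YEAR.search(source)
            if match:
                year = int(match.group(1))
        if quarter is None:
            match = _QUARTER.search(source)
            if match:
                quarter = "Q" + match.group(1)
    return year, quarter


print(extract_period("progressive_Q3_2024_10Q.pdf"))               # (2024, 'Q3')
print(extract_period("upload.pdf", body="Results for Q2 2023"))    # (2023, 'Q2')
```

The lookarounds replace `\b` because underscores count as word characters, so `\bQ3\b` would not match inside `_Q3_`.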
For live data, the SEC EDGAR ingestion script pulls directly from official filings:
python scripts\ingest_sec_filings.py `
--tickers PGR ALL UNH AIG MET CI TRV `
--forms 10-Q `
--limit 1 `
--output data\sec_live_chunks.jsonl `
--user-agent "YourOrg/0.1 contact@yourdomain.com"
When to Use This vs. LLM-Based Retrieval
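FinRAG is the right choice in specific circumstances. It’s not a replacement for LLM-augmented retrieval in general - it’s the right tool when your domain is structured, carries strong metadata signals (company, year, quarter), and has a stable schema with a known vocabulary.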
The hybrid approach - FinRAG-style hard filtering + semantic retrieval + no LLM in the retrieval path - is underused. Most teams default to LLM-augmented pipelines because the demos are impressive. But for structured domains with strong metadata signals, deterministic retrieval is faster, cheaper, more reproducible, and harder to game.
Production Path
The current pipeline is clean and ready for production workloads with a few targeted upgrades:
| Layer | Current (demo) | Production recommendation |
|---|---|---|
| Embedder | Local hashing | bge-small-en-v1.5 or text-embedding-3-small |
| Corpus storage | JSONL file scan | Qdrant, Weaviate, or Pinecone |
| Metadata filtering | In-memory | Push metadata (where-clause) filters into the vector DB |
| BM25 | In-process | Elasticsearch or OpenSearch for millions of chunks |
| LLM generation | Not included | Feed top-k chunks as context to GPT-4o, Claude, or Llama 3 |
The most important upgrade is the embedder swap. Everything else scales horizontally. The embedder is the only component that meaningfully affects retrieval quality at production corpus sizes.
For teams building on top of FinRAG for enterprise financial applications - or extending the deterministic retrieval pattern to other structured domains (legal contracts, clinical notes, product catalogs) - our data engineering and AI infrastructure team has built production versions of this pattern for financial services clients. The hard parts are always corpus-specific: alias resolution for your entity set, phrase vocabulary for your domain, and the metadata schema that makes filtering reliable.
The Bigger Pattern: Structured Data Needs Structured Retrieval
FinRAG isn’t just a financial search tool - it’s an argument for a broader principle that gets lost in the excitement around LLM-based retrieval.
Not every retrieval problem benefits from language model judgment. For structured corpora where ground truth is deterministic - financial filings, regulatory documents, legal contracts, product specifications, medical codes - adding an LLM to the retrieval path doesn’t improve accuracy. It introduces variability where none is needed.
The best generative AI systems we’ve built separate retrieval precision from generation fluency. Let the deterministic system handle what it’s good at: finding the right chunk, from the right source, with the right metadata. Let the LLM handle what it’s good at: synthesizing that context into a useful answer.
FinRAG handles the first part. Your LLM of choice handles the second. The interface between them is top-k chunks - clean, traceable, auditable, and reproducible.
This architecture is what makes AI systems trustworthy enough to use in financial decision-making, not just impressive enough to demo.
Try It Yourself
The full codebase is on GitHub: aviasoletechnologies/finrag
No API keys, no build step, no external dependencies for the core pipeline. Clone the repo and run:
git clone https://github.com/aviasoletechnologies/finrag.git
cd finrag
python demo.py "Progressive combined ratio Q3 2024"
Or spin up the Streamlit UI with your own PDF filings:
pip install streamlit pdfplumber
streamlit run app.py
The test suite covers 42 cases including alias collision, stopword guard, cross-year queries, and RRF tie-breaking. Read the tests to understand the edge cases the pipeline is designed to handle.
Further Reading
- Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods (Cormack et al., 2009)
- The Probabilistic Relevance Framework: BM25 and Beyond (Robertson & Zaragoza, 2009) - the standard reference on Okapi BM25 and its parameters
- BGE Embedding Models - the recommended embedder swap for production
Related Aviasole Reading
- Building Production-Ready RAG Pipelines - full guide on semantic chunking, hybrid search, and re-ranking
- Generative AI for Enterprises
- Data Engineering & AI Infrastructure
- Agentic AI Systems