Most tools add features.
This fixes failures.
Most RAG systems fail for the same reasons: weak retrieval, no evaluation, and no way to debug what went wrong. This fixes that.
Built after debugging real RAG failures in production pipelines — not invented from a spec.
Retrieval pipeline
- Context rewriting
- HyDE generation
- Multi-query variants
- Vector search
- BM25 keyword search
- RRF fusion
- Cross-encoder scoring
- Threshold filtering
- Top-K selection
- System prompt + context
- Stream to client
- Token tracking
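The stages above wire together roughly like this — a minimal sketch with every component stubbed out. All function names are illustrative, not the product's API; only the ordering of the stages is the point.

```python
# End-to-end sketch of the pipeline above. Every function is a stub with a
# hypothetical name standing in for the real component.

def rewrite_with_history(query, history):
    return query  # real system: LLM rewrites the query using chat history

def hyde(query):
    return f"Hypothetical answer to: {query}"  # real system: LLM-written

def multi_query(query, n=4):
    return [f"{query} (variant {i})" for i in range(1, n + 1)]

def vector_search(query):
    return [("chunk-a", 0.82), ("chunk-b", 0.74)]  # (id, similarity)

def bm25_search(query):
    return [("chunk-b", 7.1), ("chunk-c", 5.3)]    # (id, BM25 score)

def fuse(result_lists):
    # Stand-in for RRF: dedupe across lists, keep first-seen order.
    seen, merged = set(), []
    for results in result_lists:
        for doc, _score in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def pipeline(query, history=(), top_k=3):
    q = rewrite_with_history(query, history)
    expansions = [q, hyde(q)] + multi_query(q)
    fused = fuse([vector_search(e) for e in expansions] +
                 [bm25_search(e) for e in expansions])
    # Real system: cross-encoder rescoring and threshold filtering here,
    # then prompt assembly, streaming, and token tracking.
    return fused[:top_k]

print(pipeline("What is the refund policy for annual plans?"))
# → ['chunk-a', 'chunk-b', 'chunk-c']
```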
Relevant answers get buried. This makes sure they don't.
Pure vector search ranks by similarity — not by whether a chunk actually answers the question. BM25 catches exact terms that embeddings miss. RRF merges both into one ranked list so the right context reaches the model, not just the closest-sounding one.
- text-embedding-3-small via LlamaIndex
- rank_bm25 keyword retrieval
- Reciprocal Rank Fusion (RRF) merging
- Dual-threshold score filtering
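The fusion step itself is only a few lines. A minimal RRF sketch, assuming the common k = 60 constant; the dual score thresholds are applied afterward and omitted here:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one ranking by
    summing 1 / (k + rank) for every list a doc appears in."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical doc IDs: one list from vector search, one from BM25.
vector_hits = ["refund-policy", "billing-faq", "terms"]
bm25_hits = ["annual-plans", "refund-policy", "billing-faq"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused[0])  # refund-policy — high in both lists, so it rises to the top
```

Docs that rank well in both lists accumulate score from each, which is exactly why the chunk with the right exact terms and decent semantic similarity beats the closest-sounding one.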
Retrieval
Your queries are weak. This rewrites them so retrieval actually works.
A vague or ambiguous query returns empty results — or worse, confidently wrong ones. HyDE generates a hypothetical answer to improve semantic matching. Multi-query creates 4 paraphrase variants and deduplicates them. Both run in parallel so nothing gets missed.
- Hypothetical Document Embeddings (HyDE)
- 4-variant multi-query paraphrasing
- Automatic context-aware query rewriting
- Chat history contextualization
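The expand-and-deduplicate step can be sketched like this, with the LLM paraphrase call stubbed out (function names are illustrative, not the product's API):

```python
def expand_query(query, paraphrase, n=4):
    """Build n paraphrase variants plus the original, deduplicated
    case-insensitively while preserving order. `paraphrase` is a
    stand-in for the real LLM call."""
    variants = [query] + [paraphrase(query, i) for i in range(n)]
    seen, unique = set(), []
    for v in variants:
        key = v.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(v)
    return unique

# Stub paraphraser that sometimes echoes the query back (duplicates happen)
fake = lambda q, i: q if i % 2 else f"{q} for yearly subscriptions"
print(expand_query("refund policy", fake))
# → ['refund policy', 'refund policy for yearly subscriptions']
```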
Query expansion
Similar doesn't mean correct. This ranks what actually answers the question.
Vector similarity gets you close — but the most semantically similar chunk is often not the one that answers the query. A cross-encoder re-scores every retrieved chunk against your exact question, promoting what's actually relevant and cutting what's just nearby.
- BAAI/bge-reranker-base neural reranker
- Cross-encoder scoring architecture
- Configurable per project
- Works with or without hybrid search
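The rerank-and-cut step looks roughly like this. `score_fn` is a stand-in for a real cross-encoder such as BAAI/bge-reranker-base (e.g. sentence-transformers' `CrossEncoder.predict`); the toy overlap scorer and the threshold value are illustrative only:

```python
def rerank(query, chunks, score_fn, threshold=0.2, top_k=3):
    """Re-score each (query, chunk) pair, sort best-first, and keep only
    chunks above the relevance threshold."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return [(c, s) for c, s in scored if s >= threshold][:top_k]

# Toy scorer: fraction of query terms present in the chunk — a crude
# stand-in for neural cross-encoder relevance.
def overlap_score(query, chunk):
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

chunks = [
    "Monthly plans renew automatically.",
    "Annual plans are refundable within 30 days of purchase.",
    "Contact support for billing questions.",
]
print(rerank("annual plan refund window", chunks, overlap_score))
```

The filtering matters as much as the ordering: chunks that score below the threshold never reach the LLM, so "just nearby" text can't dilute the context.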
Example output
Query: "What is the refund policy for annual plans?"
Before — vector similarity only
After — cross-encoder reranking
Correct answer promoted from rank 2 → rank 1
See exactly why your AI gave a wrong answer.
Every response includes a debug panel exposing the full retrieval chain: the query variants generated, which chunks were pulled with their scores, how they were re-ranked, and what made it into the LLM context. Most tools don't expose any of this — you're expected to guess.
- Query rewrite + HyDE hypothesis visible
- Retrieved chunks with vector & BM25 scores
- Post-reranking order + cutoff reason
- Context quality signal (high / low / none)
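One plausible shape for such a per-response trace — hypothetical field names for illustration, not the product's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RetrievalTrace:
    """Illustrative debug payload: one record per answered query."""
    rewritten_query: str
    hyde_hypothesis: str
    query_variants: list
    retrieved: list        # (chunk_id, vector_score, bm25_score)
    reranked: list         # (chunk_id, cross_encoder_score)
    cutoff_reason: str     # e.g. "below score threshold" or "top-k reached"
    context_quality: str   # "high" | "low" | "none"

trace = RetrievalTrace(
    rewritten_query="refund policy for annual plans",
    hyde_hypothesis="Annual plans can typically be refunded within a window",
    query_variants=["annual plan refunds", "yearly subscription refund terms"],
    retrieved=[("doc-12", 0.81, 6.9), ("doc-07", 0.78, 2.1)],
    reranked=[("doc-12", 0.93)],
    cutoff_reason="below score threshold",
    context_quality="high",
)
print(trace.context_quality)  # high
```

With a record like this per response, "why did it answer that?" becomes a lookup instead of a guess.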
Observability
Stop guessing whether it got better. Now you'll know.
Every prompt tweak is a guess without ground-truth evaluation. Upload expected Q/A pairs, run automated eval sets, and get retrieval hit-rate and answer similarity scores you can show — not 'it feels better', but actual numbers across your real knowledge base.
- Ground-truth Q/A evaluation sets
- Retrieval hit-rate scoring
- Answer similarity via embedding cosine
- Per-case results with latency
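Both metrics are simple to state precisely. A minimal sketch with the retrieval, answering, and embedding steps stubbed out — all names here are hypothetical, and the real system uses embedding-model vectors (e.g. text-embedding-3-small), not the toy vectors below:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def evaluate(cases, retrieve, answer, embed):
    """cases: dicts with 'question', 'expected_answer', 'expected_chunk_id'.
    retrieve / answer / embed are stand-ins for the real pipeline."""
    hits, sims = 0, []
    for case in cases:
        hits += case["expected_chunk_id"] in retrieve(case["question"])
        sims.append(cosine(embed(answer(case["question"])),
                           embed(case["expected_answer"])))
    return {"hit_rate": hits / len(cases),
            "mean_answer_similarity": sum(sims) / len(sims)}

# Toy stand-ins so the sketch runs: retrieval always finds the right chunk,
# and "embeddings" are bag-of-letters vectors.
embed = lambda t: [t.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
cases = [{"question": "What is the refund policy?",
          "expected_answer": "Refunds within 30 days.",
          "expected_chunk_id": "doc-12"}]
report = evaluate(cases,
                  retrieve=lambda q: ["doc-12", "doc-07"],
                  answer=lambda q: "Refunds are honored within 30 days.",
                  embed=embed)
print(report["hit_rate"])  # 1.0
```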
Evaluation
Find out what's failing — before your users do.
Query failures, context quality breakdowns, token costs, top unanswered questions — all measured and surfaced. Not just logged. Know exactly where to fix first instead of waiting for complaints.
- 30-day query volume chart
- Context quality breakdown
- Top 10 most-asked questions
- Token usage (input + output)
Analytics
The pieces most teams forget —
but need in production.
Says 'I don't know' instead of hallucinating
When no relevant context is found, the system responds with a graceful fallback — not a confident wrong answer that damages user trust.
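The gating logic is only a few lines — a sketch under the assumption that a relevance threshold decides whether any retrieved context is usable (threshold value and names are illustrative):

```python
def build_context(reranked, min_score=0.35):
    """Gate context on relevance: if no chunk clears the bar, return None
    so the caller sends a graceful fallback instead of generating an
    answer from weak context."""
    usable = [chunk for chunk, score in reranked if score >= min_score]
    return usable or None

FALLBACK = "I couldn't find that in the uploaded documents."

reranked = [("doc-1", 0.12), ("doc-2", 0.08)]  # nothing relevant retrieved
context = build_context(reranked)
print(FALLBACK if context is None else context)
```

The design choice is to refuse *before* generation: once weak context reaches the LLM, it will happily write a fluent wrong answer around it.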
Turn bad answers into a training signal
Every thumbs-down is a retrieval failure worth diagnosing. User reactions are persisted and surfaced in analytics — not discarded.
Roll back when a new prompt breaks things
Save, label, and restore system prompt versions. Never ship a regression you can't undo in 30 seconds.
Works with any document you throw at it
PDF, DOCX, TXT, MD, and scanned images. GPT-4o Vision handles documents that text parsers can't — tested on mixed-format real-world datasets.
Go live on any site with one line of code
Single script tag. No backend plumbing, no auth required. Fully styled and ready for real usage out of the box.
Tune how documents are split per project
Sentence, recursive, sentence window, parent document — four chunking strategies, configurable per project. Not a fixed default.
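As one illustration, sentence-window chunking matches on a single sentence but hands the LLM the surrounding window — a minimal sketch of the idea, not the product's implementation:

```python
def sentence_window_chunks(sentences, window=1):
    """Sentence-window chunking: each chunk's match text is one sentence,
    but its context carries the neighboring sentences too, so retrieval
    stays precise while the LLM still sees enough surrounding text."""
    chunks = []
    for i, sent in enumerate(sentences):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        chunks.append({"match_text": sent,
                       "context": " ".join(sentences[lo:hi])})
    return chunks

doc = ["Annual plans bill yearly.",
       "Refunds are available within 30 days.",
       "Contact support to cancel."]
for c in sentence_window_chunks(doc):
    print(c["match_text"], "->", c["context"])
```

Which strategy wins depends on the corpus — dense legal prose and chatty FAQs split very differently, which is why it's per-project rather than a fixed default.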
Stop guessing.
Start debugging.
Free plan. No credit card. Full access to all retrieval features — hybrid search, reranking, evaluation, RAG Inspector.