Building the Core Engine Behind unrag

February 6, 2025

I've been building RAG systems for a while now. Every single time, I'd end up in the same place: staring at some framework's internals, wondering why it made a decision I couldn't override, wrapped in an abstraction I couldn't see through.

The core intuition behind unrag was simple. I was building complex RAG pipelines for very different applications, and each time I had to rebuild too much from scratch. I wanted speed, but I didn't want to give up composability and control.

So I built unrag. This is the story of how the core engine came together.

the problem with every rag framework

Here's the thing about most RAG frameworks: they're black boxes. You install them, call some magic function, and hope the retrieval quality is good enough. When it's not (and it always eventually isn't), you're stuck. You can't see the chunking logic. You can't swap the embedding strategy without rewriting half your pipeline. You definitely can't audit what's happening to your data.

I kept running into this with production systems. A PDF extractor would silently fail. Chunks would be split mid-sentence because someone decided 1000 characters was a good boundary (spoiler: it's not). And debugging? Good luck. The whole pipeline is a function that takes text in and spits embeddings out, with zero visibility in between.

I wanted something different. Something where every piece of the pipeline is code I can read, modify, and delete if I want to.

the shadcn moment

If you've used shadcn/ui, you know the feeling. Instead of importing components from a package you don't control, you vendor them into your project. They're your files now. You can change anything.

That's what unrag does for RAG. When you run npx unrag init, it doesn't install a dependency. It copies source files into your project. Your chunker, your embedding adapter, your vector store, all sitting in your codebase, fully readable, fully yours.

$ npx unrag init

The CLI asks you a few questions (which vector store, which embedding provider) and then copies the exact files you need. No more, no less. It even rewrites the internal imports to match your project's alias:

// What the registry source looks like internally:
import { Chunk } from '@registry/core/types'

// What lands in your project:
import { Chunk } from '@unrag/core/types'

Simple string replacement. No AST transforms, no babel plugins. Just replaceAll('@registry/', '@unrag/') and move on.

the context engine

At the heart of unrag sits the ContextEngine. It's embarrassingly simple, just a class that wires together your embedding provider, vector store, chunker, and optional extractors:

export class ContextEngine {
  private readonly config: ResolvedContextEngineConfig

  constructor(config: ContextEngineConfig) {
    this.config = resolveConfig(config)
  }

  async ingest(input: IngestInput): Promise<IngestResult>
  async retrieve(input: RetrieveInput): Promise<RetrieveResult>
  async rerank(input: RerankInput): Promise<RerankResult>
  async delete(input: DeleteInput): Promise<void>
}

Four methods. That's the entire public API. Ingest documents, retrieve chunks, optionally rerank them, delete when done. Everything else is configuration.

I deliberately kept this surface area tiny. The complexity lives in the modules you compose, not in the engine itself. Want a different chunking strategy? Swap the chunker. Want to add PDF extraction? Drop in an extractor. The engine doesn't care, it just orchestrates.
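To make the composability point concrete, here's a self-contained sketch using stub pieces (these are not unrag's real modules, just stand-ins): when a chunker is just a function and an embedder is just a function, swapping either one touches nothing else.

```typescript
// Stub types standing in for unrag's real modules, to show the
// composition idea: the engine only wires small functions together.
type Chunker = (text: string) => string[]
type Embedder = (text: string) => number[]

const paragraphChunker: Chunker = (text) => text.split('\n\n')
const fakeEmbedder: Embedder = (text) => [text.length] // stand-in vector

// swap either piece and nothing else has to change
const ingest = (text: string, chunk: Chunker, embed: Embedder) =>
  chunk(text).map((content) => ({ content, embedding: embed(content) }))

const records = ingest('first paragraph\n\nsecond one', paragraphChunker, fakeEmbedder)
```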

token-aware chunking (the part i'm most proud of)

This is where most RAG systems get it wrong. They chunk by character count. 1000 characters, 500 characters, whatever. The problem? Characters don't map to how language models see text. A 1000-character chunk might be 200 tokens or 400 tokens depending on the content. You're flying blind.

unrag chunks by tokens. Specifically, using js-tiktoken with the o200k_base encoding, the same tokenizer used by GPT-4o and newer models. Every chunk size, overlap, and minimum is specified in tokens:

export const defaultChunkingOptions: ChunkingOptions = {
  chunkSize: 512,       // tokens, not characters
  chunkOverlap: 50,     // token overlap for context continuity
  minChunkSize: 24,     // merge anything smaller
  separators: DEFAULT_SEPARATORS
}

The chunking algorithm is recursive. It tries to split on natural boundaries (paragraphs first, then sentences, then words) and only falls back to raw token splitting as a last resort:

const DEFAULT_SEPARATORS = [
  '\n\n',    // paragraphs
  '\n',      // lines
  '. ',      // sentences
  '? ',      // questions
  '! ',      // exclamations
  '; ',      // clauses
  ': ',      // colons
  ', ',      // commas
  ' ',       // words
  ''         // characters (last resort)
]

The recursion is the clever bit. If a paragraph is too large, it tries splitting by lines. If a line is too large, it tries sentences. All the way down to individual characters. And at each level, it merges small fragments back together so you don't end up with a chunk that's just the word "the".

const recursiveSplit = (
  text: string,
  separators: string[],
  chunkSize: number,
  chunkOverlap: number,
  minChunkSize: number
): string[] => {
  const textTokens = countTokens(text)

  // fits in one chunk? done.
  if (textTokens <= chunkSize) {
    return text.trim() ? [text.trim()] : []
  }

  // find the coarsest separator that exists in this text
  let separatorToUse = ''
  let remainingSeparators = separators

  for (let i = 0; i < separators.length; i++) {
    if (separators[i] === '' || text.includes(separators[i])) {
      separatorToUse = separators[i]
      remainingSeparators = separators.slice(i + 1)
      break
    }
  }

  // split, then recursively handle oversized pieces
  const splits = splitWithSeparator(text, separatorToUse)
  const goodSplits: string[] = []

  for (const split of splits) {
    if (countTokens(split) <= chunkSize) {
      goodSplits.push(split)
    } else if (remainingSeparators.length > 0) {
      goodSplits.push(
        ...recursiveSplit(split, remainingSeparators, chunkSize, chunkOverlap, minChunkSize)
      )
    } else {
      goodSplits.push(...forceSplitByTokens(split, chunkSize, chunkOverlap))
    }
  }

  return mergeSplits(goodSplits, chunkSize, chunkOverlap, minChunkSize)
}
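mergeSplits isn't shown above, so here's a minimal greedy sketch of that step: pack adjacent splits into chunks until the next one would overflow chunkSize. I'm ignoring overlap and minimum-size merging for brevity, and a whitespace split stands in for the real tiktoken counter so the sketch is self-contained.

```typescript
// Simplified stand-in for the real token counter (unrag uses js-tiktoken)
const countTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length

// Greedy merge: pack splits into chunks of at most chunkSize tokens.
// The real version also handles overlap and minChunkSize.
const mergeSplits = (splits: string[], chunkSize: number): string[] => {
  const chunks: string[] = []
  let current: string[] = []
  let currentTokens = 0

  for (const split of splits) {
    const tokens = countTokens(split)
    if (currentTokens + tokens > chunkSize && current.length > 0) {
      chunks.push(current.join(' '))
      current = []
      currentTokens = 0
    }
    current.push(split)
    currentTokens += tokens
  }
  if (current.length > 0) chunks.push(current.join(' '))
  return chunks
}
```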

I spent way too long on this. But the difference in retrieval quality between token-aware and character-based chunking is night and day. Your chunks actually align with how the model processes text.

And if you don't like my chunking? There's a plugin system. Run npx unrag add chunker:semantic (or chunker:markdown, or chunker:code); each one is a separate vendored module you can inspect and modify.

the ingestion pipeline

Ingestion is the most complex part of the engine. It handles text, PDFs, images, audio, video, anything you throw at it. The flow looks like this:

  1. Chunk the main document text using whatever chunker is configured
  2. Process assets. For each attached file:
    • Fetch the bytes (with size limits and timeout guards)
    • Find matching extractors
    • Run extractors with a fallback chain
    • Chunk the extracted text
  3. Embed everything with concurrent embedding and batching support
  4. Store and persist to your vector database
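The four steps above can be sketched as a single flow. Every name here (ingestSketch, Deps) is illustrative, not unrag's actual internals, and each dependency stands in for a whole subsystem of the real pipeline.

```typescript
type Asset = { name: string; text: string }
type Deps = {
  chunk: (text: string) => string[]
  extract: (asset: Asset) => string       // stands in for the extractor chain
  embed: (texts: string[]) => number[][]  // stands in for batched embedding
  store: (records: { content: string; embedding: number[] }[]) => void
}

const ingestSketch = async (text: string, assets: Asset[], deps: Deps) => {
  // 1. chunk the main document text
  const chunks = deps.chunk(text)
  // 2. extract text from each asset, then chunk it too
  for (const asset of assets) {
    chunks.push(...deps.chunk(deps.extract(asset)))
  }
  // 3. embed every chunk in one batched call
  const embeddings = deps.embed(chunks)
  // 4. persist chunk + vector pairs to the store
  const records = chunks.map((content, i) => ({ content, embedding: embeddings[i] }))
  deps.store(records)
  return records
}
```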

The extractor fallback chain for PDFs is my favorite piece of pragmatic engineering:

1. Try text layer extraction (fast, free)
2. If empty → try LLM extraction (expensive, accurate)
3. If LLM extraction is disabled or fails → try OCR (for scanned documents)
4. If all fail → emit a specific warning code

Each extractor is a simple interface: a supports function and an extract function. You can write your own in 20 lines:

export type AssetExtractor = {
  name: string
  supports: (args: { asset: AssetInput }) => boolean
  extract: (args: { asset: AssetInput }) => Promise<{
    texts?: Array<{
      content: string
      label?: string       // "Page 1", "Slide 2"
      confidence?: number  // 0-1
    }>
  }>
}

Nothing magic. No class hierarchies, no abstract factories. Just a function that says "I can handle this" and another that does the work.
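For example, a plain-text extractor against the shape above fits in a few lines. The AssetInput fields I'm using here (mimeType, data) are assumptions for illustration, not unrag's actual type.

```typescript
// Minimal custom extractor. AssetInput's fields (mimeType, data) are
// assumed for this sketch.
type AssetInput = { mimeType: string; data: Uint8Array }

const plainTextExtractor = {
  name: 'plain-text',
  supports: ({ asset }: { asset: AssetInput }) =>
    asset.mimeType.startsWith('text/'),
  extract: async ({ asset }: { asset: AssetInput }) => ({
    texts: [{ content: new TextDecoder().decode(asset.data), confidence: 1 }]
  })
}
```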

concurrency without the headache

One thing that bit me early was rate limits. Embedding providers have them. Vector databases have them. If you fire off 500 concurrent embedding requests, you're going to have a bad time.

So I wrote a tiny concurrency controller:

const mapWithConcurrency = async <T, R>(
  items: T[],
  concurrency: number,
  fn: (item: T, idx: number) => Promise<R>
): Promise<R[]> => {
  const results: R[] = new Array(items.length)
  let nextIdx = 0

  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    async () => {
      while (true) {
        const i = nextIdx++
        if (i >= items.length) break
        results[i] = await fn(items[i], i)
      }
    }
  )

  await Promise.all(workers)
  return results
}

It spins up at most N workers, each pulling the next index from a shared counter. No external dependencies, no semaphore libraries. Just a counter and a while loop. Sometimes the best code is the most boring code.

retrieval is the easy part

After all the complexity of ingestion, retrieval is refreshingly simple:

export const retrieve = async (config, input) => {
  // 1. embed the query
  const queryEmbedding = await config.embedding.embed({
    text: input.query
  })

  // 2. vector similarity search
  const chunks = await config.store.query({
    embedding: queryEmbedding,
    topK: input.topK ?? 8
  })

  return { chunks }
}

Embed the query, search the vector store, return the top K results. That's it. The quality of your retrieval is determined by everything that happened before this point: how you chunked, how you embedded, what you stored.

But sometimes vector similarity isn't enough. That's where reranking comes in, a second pass that uses a cross-encoder model to re-score candidates:

// retrieve broadly, then rerank precisely
const retrieved = await engine.retrieve({ query, topK: 30 })
const reranked = await engine.rerank({
  query,
  candidates: retrieved.chunks,
  topK: 8,
  onMissingReranker: 'skip'  // graceful degradation
})

The onMissingReranker: 'skip' bit is important. If you haven't configured a reranker, it just returns the original results instead of throwing. Graceful degradation over hard failures, always.

privacy as a first-class concern

One thing I felt strongly about: not every system should store the original text. Some data is sensitive. Some compliance frameworks require that you don't persist raw documents.

So unrag has explicit privacy controls:

storage: {
  storeChunkContent: false,     // don't persist chunk text
  storeDocumentContent: false   // don't persist original docs
}

With both set to false, only embeddings and metadata are stored. The vectors are still searchable, but the raw text is gone. You can always re-ingest if you need it back.
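Here's an illustrative sketch (field names are mine, not unrag's schema) of what the flag changes at persistence time: with content storage off, only the vector and metadata survive.

```typescript
// Hypothetical stored-record shape, for illustration only
type StoredChunk = {
  id: string
  embedding: number[]
  metadata: Record<string, unknown>
  content?: string // dropped entirely when storeChunkContent is false
}

const applyStoragePolicy = (chunk: StoredChunk, storeContent: boolean): StoredChunk => {
  if (storeContent) return chunk
  const { content, ...rest } = chunk
  void content // discard the raw text before persisting
  return rest
}
```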

This was a deliberate architectural decision from day one, not something bolted on after a compliance audit.

what i learned

Building unrag taught me a few things:

Token-aware chunking matters more than you think. The difference between splitting on characters vs. tokens is the difference between chunks that make semantic sense and chunks that cut off mid-thought. It's a small change in implementation but a massive change in retrieval quality.

Composability beats configuration. Instead of a config file with 200 options, give people modules they can swap. A chunker is just a function. An extractor is just an interface. A store is just three methods. When the pieces are small and independent, the system is easy to reason about.

Own your dependencies. The vendoring approach felt radical at first. But every time I debug a production RAG issue, I'm glad the code is right there in the project. No node_modules spelunking, no version mismatches, no "works on my machine."

If you're building RAG systems and you're tired of fighting your framework, give unrag a shot. Or just read the source, it's all there in your project after npx unrag init.

And one more thing: I built unrag with Subhajit. It wouldn't have been possible to ship unrag without his contributions. He's one of the best people I've worked with, a great person, and a great engineer.