LLM Orchestration and RAG Platforms

#4 of 20 Innovations

RAG – retrieval-augmented generation – is probably the most widely deployed AI pattern in enterprise software right now. The idea is straightforward: instead of hoping an LLM memorised your company’s documents during training (it didn’t), you retrieve the relevant pieces at query time and pass them as context. The model answers from your actual knowledge base, not from its training weights. That’s the whole thing. But building a RAG system that works reliably in production turns out to be surprisingly involved.
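The retrieve-then-prompt idea can be sketched in a few lines. This is a toy illustration only: it assumes a bag-of-words cosine similarity as a stand-in for a real embedding model, and the sample documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the question into one prompt string."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base for illustration.
DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]

if __name__ == "__main__":
    question = "How long do refunds take?"
    print(build_prompt(question, retrieve(question, DOCS, k=1)))
```

In a real system the similarity function would be replaced by an embedding model plus a vector index, and the prompt would be sent to an LLM rather than printed.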

Figure: The RAG Query Pipeline. User query → embedding model → vector DB (HNSW search) → retrieved chunks + metadata → LLM with context → answer. Hybrid search strategy: dense vector score + BM25 sparse score → RRF fusion → final ranked results.

Hybrid search combining vector similarity + BM25 keyword scoring consistently outperforms either approach alone.
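Reciprocal rank fusion (RRF), the fusion step named in the diagram, merges the dense and BM25 result lists using only rank positions. Here is a minimal sketch, assuming each retriever has already returned an ordered list of document IDs; the IDs are made up, and k=60 is the commonly used default constant rather than anything tuned for this article.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the two retrievers.
dense = ["doc_a", "doc_b", "doc_c"]   # vector similarity order
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 order

fused = rrf_fuse([dense, sparse])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF ignores the raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents that rank well in both lists (doc_b and doc_a here) rise to the top.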

The orchestration stack has three layers you’ll need to think about. The retrieval layer indexes your documents in a vector database – Pinecone (managed), Weaviate or Qdrant (open-source), or pgvector if you’re already running PostgreSQL and don’t want another database to operate. Your incoming queries get embedded and matched against the index, and the top-k chunks come back. The context layer assembles those chunks, conversation history, system instructions, and any tool schemas into the prompt for the LLM. The routing layer decides which model to call, handles fallbacks, and enforces cost controls. LangChain and LlamaIndex are the most popular open-source frameworks for wiring these layers together – they’re not perfect, but they handle the connector plumbing so you don’t have to write custom integrations for every vector store and LLM API. LangSmith and Langfuse add observability, which you’ll need once you’re debugging why certain queries return poor answers.
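To make the routing layer concrete, here is a rough sketch of ordered fallbacks under a per-request cost cap. Everything in it is an assumption for illustration: the model names, the flat per-call costs, and the callables standing in for real LLM clients. It does not reflect how LangChain or LlamaIndex implement routing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelRoute:
    name: str                    # hypothetical model identifier
    cost_per_call: float         # assumed flat cost, in dollars
    call: Callable[[str], str]   # stand-in for a real LLM client call


class AllRoutesFailed(RuntimeError):
    """Raised when every route errored or was skipped for budget reasons."""


def route(prompt: str, routes: list[ModelRoute], budget: float) -> tuple[str, str, float]:
    """Try routes in order; return (model name, answer, total cost spent)."""
    spent = 0.0
    errors: list[str] = []
    for r in routes:
        if spent + r.cost_per_call > budget:
            errors.append(f"{r.name}: skipped, over budget")
            continue
        spent += r.cost_per_call  # assumption: failed calls are still billed
        try:
            return r.name, r.call(prompt), spent
        except Exception as exc:  # any provider error triggers the fallback
            errors.append(f"{r.name}: {exc}")
    raise AllRoutesFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    raise TimeoutError("upstream timeout")


def cheap_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


routes = [
    ModelRoute("primary-large", 0.03, flaky_primary),
    ModelRoute("fallback-small", 0.002, cheap_fallback),
]
model, answer, spent = route("What is our refund policy?", routes, budget=0.05)
print(model, round(spent, 3))
```

A production router would typically add retries with backoff, per-tenant budgets, and token-based cost accounting, but the control flow stays much the same.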

Chunking Strategies for RAG Pipelines

Fixed-Size: split every N tokens. Pros: fast, simple. Cons: cuts sentences mid-thought. When: simple docs, fast prototypes. Token overlap: ~10-15%.

Sentence: split on punctuation. Pros: readable chunks. Cons: variable size, can be too small. When: conversational text, Q&A docs. Typical: 1-5 sentences.

Semantic: group by topic shift. Pros: coherent meaning. Cons: expensive to compute at index time. When: long reports, research papers. Uses embedding similarity.

Sliding Window: overlapping windows. Pros: no context loss. Cons: duplicate data, larger index size. When: dense technical docs, code files. Overlap: 10-20% stride.

Most production RAG systems combine strategies – use sentence chunking by default, add sliding windows for code and tables.
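Two of these strategies are simple enough to sketch directly. The version below assumes whitespace-separated words as a stand-in for model tokens and a naive punctuation regex for sentence boundaries; a real pipeline would more likely use the embedding model's tokenizer and a proper sentence splitter.

```python
import re


def fixed_size_chunks(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def sentence_chunks(text: str, max_sentences: int = 3) -> list[str]:
    """Group consecutive sentences, splitting on ., ! or ? followed by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


print(fixed_size_chunks("a b c d e f g h i j", size=4, overlap=1))
print(sentence_chunks("One. Two! Three? Four.", max_sentences=2))
```

With `overlap` greater than zero, the fixed-size function behaves like the sliding-window strategy, which is why the two are often implemented as one routine with different parameters.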

Here’s where it gets interesting – and where a lot of teams trip up. Chunking strategy matters more than most people expect. Naive fixed-size splitting loses context at chunk boundaries. Semantic chunking groups text by meaning but costs more compute at index time. And pure vector search (just embedding similarity) consistently underperforms hybrid search – which combines vector scores with BM25 keyword matching – on enterprise document retrieval benchmarks, typically by 10-20% on recall. Re-ranking the retrieved candidates with a cross-encoder model as a second pass is now standard practice, with hosted re-rankers available from vendors such as Cohere and Jina AI, and it lifts answer quality noticeably. The other thing that usually comes up late and causes rework is access-control filtering at retrieval. In multi-tenant systems, you need to ensure users can only retrieve documents they’re authorised to see. That’s not a nice-to-have – it’s a data breach waiting to happen if you skip it.
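Access-control filtering at retrieval can be sketched as a metadata check applied before ranking. The example below is illustrative only: the `Chunk` and `User` shapes, the tenant and group fields, and the precomputed scores are all hypothetical, and the filtering happens in application code for clarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    tenant_id: str                 # hypothetical metadata stored with each chunk
    allowed_groups: frozenset[str]
    score: float                   # similarity score, precomputed for this sketch


@dataclass(frozen=True)
class User:
    user_id: str
    tenant_id: str
    groups: frozenset[str]


def authorised(user: User, chunk: Chunk) -> bool:
    """Same tenant, and the user belongs to at least one allowed group."""
    return chunk.tenant_id == user.tenant_id and bool(user.groups & chunk.allowed_groups)


def retrieve_for_user(user: User, candidates: list[Chunk], k: int = 3) -> list[Chunk]:
    """Drop unauthorised chunks first, then take the top-k by score."""
    visible = [c for c in candidates if authorised(user, c)]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


candidates = [
    Chunk("c1", "Acme salary bands", "acme", frozenset({"hr"}), 0.92),
    Chunk("c2", "Acme VPN setup guide", "acme", frozenset({"hr", "engineering"}), 0.80),
    Chunk("c3", "Globex VPN setup guide", "globex", frozenset({"engineering"}), 0.95),
]
alice = User("alice", "acme", frozenset({"engineering"}))
print([c.chunk_id for c in retrieve_for_user(alice, candidates)])
```

The important design point is ordering: the filter runs before the top-k cut, so an unauthorised chunk can never crowd out an authorised one or leak into the prompt. In practice you would usually push the same condition into the vector database query as a metadata filter rather than filtering after the fact.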

Frequently Asked Questions

What is retrieval-augmented generation and why does it matter?

RAG connects an LLM to an external knowledge base so it can retrieve relevant documents before generating an answer. The model can then give grounded, up-to-date responses without retraining, and it can cite sources – which is critical for trust in enterprise and regulated-industry applications.

When should you use RAG instead of fine-tuning?

Use RAG when your knowledge changes frequently, when you need source citations, or when your documents are too large to fit in a prompt. Use fine-tuning when you need the model to adopt a specific style, follow a consistent output format, or master a reasoning pattern that’s hard to convey through retrieved context alone.

What vector databases are most commonly used in production RAG systems?

Pinecone is the most popular managed option. Weaviate and Qdrant are leading open-source choices. pgvector has gained significant traction because it adds vector search to an existing PostgreSQL database without introducing a new infrastructure dependency. Chroma is popular for local development and prototyping.

How do you evaluate whether a RAG pipeline is working well?

You measure retrieval quality (did you get the right documents?) and generation quality (did the model use them correctly?) separately. Frameworks like RAGAS automate this evaluation by scoring context precision, context recall, faithfulness, and answer relevance against ground-truth question-answer pairs.
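The retrieval half of that evaluation can be approximated by hand before reaching for a framework. The sketch below computes precision@k and recall@k for one query, assuming you have a labelled set of relevant document IDs; the IDs are made up, and this is plain Python rather than the RAGAS API.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical labelled example: what the retriever returned vs. ground truth.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d4"}

print(f"precision@3 = {precision_at_k(retrieved, relevant, 3):.2f}")
print(f"recall@3    = {recall_at_k(retrieved, relevant, 3):.2f}")
```

Averaging these over a set of test questions gives a baseline for comparing chunking or search changes. Generation-side metrics such as faithfulness need a model-based judge or human review and are not covered by this sketch.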