Multimodal RAG + Document Retrieval

Built a question-answering pipeline over PDFs that combines OCR, embeddings, and vector search, with tenant-aware filtering so that each tenant's queries only retrieve that tenant's documents.

RAG · Qdrant · OCR · Embeddings · LLMs

Role

ML / Software Engineer (Project)

Timeline

2025

Stack

Input

PDFs → page chunks

Retrieval

Vector search (Qdrant)

Safety

Tenant-aware filtering

What I did

  • Parsed PDFs into page-level text and generated embeddings for semantic retrieval.
  • Integrated OCR for scanned pages and routed extracted text into the same indexing pipeline.
  • Queried Qdrant for top-k pages and added tenant-aware filtering to prevent cross-tenant leakage.
  • Used an LLM to synthesize answers grounded in retrieved pages (RAG pattern).
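
The retrieval and grounding steps above can be illustrated with a minimal sketch. This is not the project's code: an in-memory list stands in for the Qdrant collection, the vectors are hand-written toy values rather than real embeddings, and every name (`PageChunk`, `tenant_id`, `search`, `build_prompt`) is hypothetical. In Qdrant the same idea would typically be expressed as a payload filter on a tenant field applied as part of the search, so that other tenants' pages are never candidates.

```python
import math
from dataclasses import dataclass


@dataclass
class PageChunk:
    # Hypothetical record layout: one indexed PDF page plus its tenant tag.
    tenant_id: str
    doc_id: str
    page: int
    text: str
    vector: list[float]  # toy embedding, not produced by a real model


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, tenant_id, k=3):
    """Return the top-k pages for one tenant.

    The tenant filter runs before ranking, so pages owned by other
    tenants can never appear in the results, whatever their score.
    """
    candidates = [c for c in index if c.tenant_id == tenant_id]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vector, c.vector),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, pages):
    """Assemble a prompt that restricts the LLM to the retrieved pages."""
    context = "\n\n".join(f"[{p.doc_id} p.{p.page}]\n{p.text}" for p in pages)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy index with two tenants; the "globex" page is the closest match to the
# query vector overall, but it must stay invisible to "acme".
INDEX = [
    PageChunk("acme", "handbook.pdf", 1, "Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    PageChunk("acme", "handbook.pdf", 2, "Office hours are 9 to 5.", [0.1, 0.9, 0.0]),
    PageChunk("globex", "policy.pdf", 1, "Refunds are not offered.", [0.95, 0.05, 0.0]),
]

if __name__ == "__main__":
    hits = search(INDEX, [1.0, 0.0, 0.0], tenant_id="acme", k=1)
    print(build_prompt("What is the refund policy?", hits))
```

Filtering before ranking, rather than discarding foreign results afterwards, is the design choice that matters here: a post-hoc filter can return fewer than k pages and still lets another tenant's content pass through the scoring path.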