Multimodal RAG + Document Retrieval

Built a document-QA pipeline for PDFs using OCR + embeddings + vector search, with tenant-aware filtering for isolation.

RAG · Qdrant · OCR · Embeddings · LLMs

Role

ML / Software Engineer (Project)

Timeline

2025

Stack

—

Input

PDFs → page chunks

Retrieval

Vector search (Qdrant)

Safety

Tenant-aware filtering

What I did

Parsed PDFs into page-level text and generated embeddings for semantic retrieval.
Integrated OCR for scanned pages and routed extracted text into the same indexing pipeline.
Queried Qdrant for top-k pages and added tenant-aware filtering to prevent cross-tenant leakage.
Used an LLM to synthesize answers grounded in retrieved pages (RAG pattern).
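
A minimal sketch of the retrieval step described above. It is not the project's actual code. It uses a plain in-memory list in place of Qdrant so that it runs without any services, and every name in it (`PageChunk`, `search_pages`, `build_prompt`, the `tenant_id` field) is hypothetical. The point it illustrates is the ordering: the tenant filter is applied before similarity ranking, so pages from other tenants are never scored, returned, or placed in the LLM prompt. In a Qdrant-backed version the same constraint would presumably be expressed as a payload filter on a tenant field passed along with the query vector.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    """One indexed PDF page (hypothetical schema, for illustration only)."""
    tenant_id: str
    doc_id: str
    page: int
    text: str                    # native PDF text or OCR output
    vector: tuple[float, ...]    # embedding of `text`


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_pages(index, query_vector, tenant_id, k=3):
    """Return the top-k pages for one tenant, most similar first."""
    # Filter first: other tenants' pages are never candidates.
    candidates = [c for c in index if c.tenant_id == tenant_id]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vector, c.vector),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, pages):
    """Assemble a grounded prompt from the retrieved pages only."""
    context = "\n\n".join(
        f"[{p.doc_id} p.{p.page}] {p.text}" for p in pages
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index with two tenants; the vectors stand in for real embeddings.
index = [
    PageChunk("acme", "a.pdf", 1, "Invoice total is due in 30 days.", (1.0, 0.0)),
    PageChunk("acme", "a.pdf", 2, "Shipping terms and carriers.", (0.0, 1.0)),
    PageChunk("globex", "g.pdf", 1, "Confidential invoice total.", (1.0, 0.0)),
]

hits = search_pages(index, query_vector=(1.0, 0.1), tenant_id="acme", k=2)
prompt = build_prompt("When is the invoice due?", hits)
```

Filtering inside the search call, rather than discarding foreign results after ranking, keeps the top-k budget for the requesting tenant and avoids any path by which another tenant's text could reach the prompt.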