https://twitter.com/doesdatmaksense

Oct 6, 2024

<aside> 🏹

We will be diving deep into the paper: [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449)

</aside>

Document retrieval has always been a key component of systems like search engines and information retrieval pipelines. Traditional methods rely heavily on text-based steps (like OCR and text segmentation) and, in doing so, often miss crucial visual cues like layouts, images, and tables.

ColPali addresses this by using Vision-Language Models (VLMs) to understand and retrieve visually rich documents, capturing both textual and visual information. ColPali's architecture allows direct encoding of document images into a common embedding space, eliminating the need for time-consuming text extraction and segmentation.
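To make the idea of a common embedding space concrete, here is a minimal sketch of the ColBERT-style late-interaction (MaxSim) scoring that ColPali uses to match query-token embeddings against page-patch embeddings. The function name, tensor shapes, and random inputs are illustrative assumptions, not the paper's actual code; the 128-dimensional embeddings match the paper's setup.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction score between one query and one page.

    query_emb: (num_query_tokens, dim) -- one embedding per query token
    page_emb:  (num_patches, dim)      -- one embedding per image patch
    """
    # Similarity between every query token and every image patch.
    sim = query_emb @ page_emb.T            # (num_query_tokens, num_patches)
    # For each query token, keep its best-matching patch ("MaxSim").
    best_per_token = sim.max(dim=1).values  # (num_query_tokens,)
    # The page's score is the sum of those best matches.
    return best_per_token.sum()

# Toy usage with random embeddings (shapes are placeholders).
query_emb = torch.randn(12, 128)    # e.g., 12 query tokens
page_emb = torch.randn(1024, 128)   # e.g., 1024 image patches
print(maxsim_score(query_emb, page_emb))
```

Because each page is stored as many patch vectors rather than one pooled vector, this scoring lets a single query word "land" on the exact region of the page (a table cell, a chart, a heading) that matches it.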

In this blog, we’ll explore the technicalities behind ColPali, from the intuition to the architecture and training.


Before we dive into the technical architecture and training of ColPali, let’s walk through the intuition behind how it works.

Intuition Behind ColPali: How It Simplifies Document Retrieval

Step 1: Treating the Document as an Image

Imagine we have a PDF document. Normally, we would extract text from the document using OCR (Optical Character Recognition), segment it into different sections, and then use these segments for searching. ColPali simplifies this process by treating the entire document page as an image, bypassing the need for complex text extraction.
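As a quick illustration of this step, the sketch below renders each PDF page directly to an image using the pdf2image library (a wrapper around poppler). The file path and DPI are placeholder choices, not values from the paper:

```python
from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

# Render every page of the PDF as a PIL image: no OCR, no text segmentation.
pages = convert_from_path("report.pdf", dpi=144)

for i, page in enumerate(pages):
    # Each page image is exactly what a ColPali-style retriever would embed.
    page.save(f"page_{i}.png")
```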

Step 2: Splitting the Image into Patches

Once ColPali has this "image" of the document, it divides the page into small, uniform pieces called patches (e.g., 16×16 pixels).
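To see what patching looks like in practice, here is a minimal sketch that cuts a page image into 16×16 patches with torch's `unfold`. The image size is a placeholder assumption; in the real model, the image processor handles resizing and normalization before patching:

```python
import torch

# A dummy page "image": 3 channels, 448x448 pixels (placeholder size).
page = torch.randn(3, 448, 448)
patch_size = 16

# Slide a 16x16 window over height and width with stride 16 (no overlap).
patches = page.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
# (3, 28, 28, 16, 16) -> flatten to a sequence of 784 patches of 3*16*16 values each.
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([784, 768])
```

Each of these patches becomes one token in the vision encoder's input sequence, which is what later allows every region of the page to get its own embedding.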