https://twitter.com/doesdatmaksense

Oct 31, 2024

Imagine trying to navigate through hundreds of pages in a dense document, filled with tables, charts, and paragraphs. For a human, finding a specific figure or analyzing a trend would be challenging enough; now imagine building a system to do it. Traditional document retrieval systems often rely heavily on text extraction, losing critical context provided by visuals like the layout of tables or balance sheets.

What if, instead of relying on the traditional pipeline of OCR + layout detection + chunking + text embedding, we directly embed each page of a document as an image, capturing its full visual structure: tables, images, headings, and all? ColQwen, an advanced multimodal retrieval model in the ColPali family, does just that.

<aside> 💫

If you haven’t read the ColPali blog yet, I would highly recommend reading it first, as that is where I went into the technical details of ColPali. I won’t go into much architectural detail here. ColQwen is similar to ColPali in terms of technical architecture; the difference is that the VLM in ColQwen is Qwen2-VL, while in ColPali it is PaliGemma. ColPali blog: ColPali: Document Retrieval with Vision Language Models ✨

</aside>

Figure: Standard retrieval method showing OCR + layout detection (process takes 7.22 sec per page)

Figure: The ColPali approach of embedding the whole page image directly (process takes 0.39 sec per page)

ColQwen

ColQwen is an advanced multimodal retrieval model built on a Vision Language Model (VLM); it processes entire document pages as images. Using multi-vector embeddings, it creates a richly layered representation of each page that preserves its structure and context. It is built specifically to simplify and enhance document retrieval, especially for visually dense documents.

Why is ColQwen Different?

Traditional systems break a document down to basics—OCR pulls out the text, which is then split, processed, and embedded. While this may be fine for text-only files, it limits how much detail can be retrieved from complex documents where layout matters. In financial reports or research papers, the information lies not only in the words but in how they are structured visually—where headings, numbers, and summaries are positioned in relation to each other.

With ColQwen, the process is simple. Instead of reducing pages to smaller text chunks, ColQwen’s multi-vector embeddings capture the whole page image, preserving both text and visual cues.
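
As a concrete illustration, here is a minimal sketch of embedding pages this way with the open-source colpali-engine package. The checkpoint name and file paths are illustrative, and the exact API may differ across library versions, so treat this as a sketch rather than a definitive recipe:

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

# Load a ColQwen2 checkpoint and its paired processor.
model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "cpu"
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Each page goes in as a plain image: no OCR, layout detection, or chunking.
pages = [Image.open("report_page_1.png"), Image.open("report_page_2.png")]  # illustrative file names
batch_images = processor.process_images(pages).to(model.device)

with torch.no_grad():
    # One multi-vector embedding per page, roughly of shape
    # (num_pages, num_image_patches + special tokens, embedding_dim).
    image_embeddings = model(**batch_images)
```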

Multi-vector embeddings

<aside> 💡

We have been talking about multi-vector embeddings, but what exactly are multi-vector embeddings, and how have they been used here?

</aside>

Unlike traditional single-vector embeddings, which compress an entire document text chunk into one dense representation, multi-vector embeddings create multiple, focused embeddings: one per token. Originally developed for text retrieval models like ColBERT (which introduced “late interaction”), this approach lets each query token interact with the most relevant portions of a document rather than with a single compressed summary.
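
To make late interaction concrete, here is a minimal sketch of the MaxSim scoring rule in PyTorch; it illustrates the idea rather than reproducing any particular library’s implementation:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring.

    query_emb: (num_query_tokens, dim)
    doc_emb:   (num_doc_vectors, dim)  # text tokens in ColBERT, image patches in ColQwen
    """
    # Similarity of every query token against every document vector.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_vectors)
    # Each query token keeps its best-matching document vector,
    # and the per-token maxima are summed into a single relevance score.
    return sim.max(dim=1).values.sum()
```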

In ColQwen (and ColPali), this technique is adapted for visually complex documents. Each page image is divided into patches, and each patch, whether it covers part of a table, a heading, or a figure, receives its own embedding. When a query is made, each query token is matched against these patch embeddings, surfacing the most relevant visual and textual parts of the page. This way, ColQwen retrieves exactly the right content while staying aware of the page’s structure and context.
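
Putting it together, and continuing the earlier sketch (same hedged assumptions about the colpali-engine API; the queries below are made-up examples), retrieval amounts to embedding the queries and scoring them against the page embeddings with the same late-interaction rule:

```python
queries = [
    "What was the operating margin in Q3?",
    "Show the revenue breakdown by region",
]
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    query_embeddings = model(**batch_queries)

# scores[i, j] = late-interaction (MaxSim) score of query i against page j.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
best_pages = scores.argmax(dim=1)  # index of the highest-scoring page for each query
```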